👋 I have a question about <http://tech.ml|tech.ml>.dataset
and hope this is the channel to use 😄 I can't figure out how to tell the ->dataset
to parse Excel date cells as dates (and not numerics). Which cells that are date cells is unknown when parsing, is there a way to default to dates? Currently I get a column with a numeric value (the Excel way of calculating dates: "days since 1900 1 jan").
In our scenario, we guess the contents of an excel file and present it in a UI as suggestion for later parsing.
I have solved the Excel date thing, by reading the file with ->dataset
as ususal. Then I take samples of columns from the dataset with type float64
to make an educated guess if it is a reasonable date, by using the Apache POI DateUtil/getLocalDateTime
function.
I collect the column names that are candidates for excel dates and build a parser-fn
map and run the ->dataset
again, with the parser functions.
There's an extra roundtrip, but could work 😄 Let's see what my colleagues think of it 😅
I like that approach honestly. It probably parses a file more or less instantly and you get the entire dataset to run your type heuristics on.
You can re-parse a column though: https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.column.html#var-parse-column Then you can just re-parse the columns using same syntax as your parser-fn.
I should have directed you towards https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-column-cast - that is more general and not specific towards string columns.
I read about the column-cast and will give that one a try too. I think I may have misunderstood the api of it yesterday, but it looks like a good approach. Thank you!
Example - a column is recognized as :datatype :float64
with values like 44258.0
@chris441 can you help here?
Could I get an example document? Do you have a safe, small one you can send? There is no 'date' type in POI and fastexcel's cell type. You can do it by supplying a custom parse function to convert from days-since-jan-1 to a local date type but I would want to see some data and I can get you a test parse line. Something like
(->dataset "data.xls" {:parser-fn {"date" [:local-date convert-to-local-date]}})
where convert-to-local-date
is a function from a number to your date type. Be wary of nils or strings; excel can really put anything in any cell.
Thanks! I think I succeeded with doing that, there was a similar example in the readme of the repo, I think.
Your function can return :http://tech.ml.dataset.parse/missing or :http://tech.ml.dataset.parse/parse-failure to denote a missing value or a value you expect to parse but it failed. In that case the value will be in the metadata of the column.
OK, great 🙂.
If the cell containing date is unknown, is there a way to figure out what column is a date, in the meta data or something similar?
Aha, do you mean that my parse function could do that - if parsing failed -> not a date?
I haven't found a way to do that yet. I think there may be but it is through the POI and fastexcel API's.
Also, if you want to build/edit spreadsheets there is a new library I think is really nice named https://github.com/zero-one-group/fxl.
Thanks! I'll have a look!