In our scenario, we guess the contents of an Excel file and present the guess in a UI as a suggestion for later parsing.
I have solved the Excel date thing by reading the file with ->dataset as usual. Then I take samples from the columns of type float64 and make an educated guess whether each one holds reasonable dates, using the Apache POI DateUtil/getLocalDateTime function.
I collect the names of the columns that are candidates for Excel dates, build a parser-fn map, and run ->dataset again with the parser functions.
There's an extra roundtrip, but could work 😄 Let's see what my colleagues think of it 😅
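For anyone following along, here is a rough sketch of the two-pass idea described above. It is not the exact code from our project. The 1950-2100 year window, the sample size of 20, and all function names are made up for illustration. It also assumes that an Excel reader namespace (tech.v3.libs.poi here) is loaded, and that the parse function in a [datatype parse-fn] tuple is handed the raw cell value, so check that against your reader before relying on it.

```clojure
(ns example.excel-date-guess
  (:require [tech.v3.dataset :as ds]
            ;; Assumption: loading this namespace enables xlsx/xls support in ->dataset.
            [tech.v3.libs.poi])
  (:import [org.apache.poi.ss.usermodel DateUtil]
           [java.time LocalDateTime]))

(defn plausible-excel-date?
  "True when x, read as an Excel serial number, lands in a believable year.
  The 1950-2100 window is an arbitrary assumption; tune it for your data."
  [x]
  (boolean
   ;; getLocalDateTime returns nil for serials that are not valid Excel dates.
   (when-let [^LocalDateTime ldt (DateUtil/getLocalDateTime (double x))]
     (<= 1950 (.getYear ldt) 2100))))

(defn present-number?
  "Skips missing values, whether they show up as nil or as NaN."
  [x]
  (and (number? x) (not (Double/isNaN (double x)))))

(defn date-candidate-columns
  "Names of :float64 columns whose first sample-size present values all look like dates."
  [dataset sample-size]
  (for [cname (ds/column-names dataset)
        :let [col    (ds/column dataset cname)
              sample (->> col (filter present-number?) (take sample-size))]
        :when (and (= :float64 (:datatype (meta col)))
                   (seq sample)
                   (every? plausible-excel-date? sample))]
    cname))

(defn serial->local-date-time
  "Converts an Excel serial to a LocalDateTime. Accepts a number or a numeric string,
  because what the parser hands over is an assumption here."
  [v]
  (DateUtil/getLocalDateTime
   (if (string? v) (Double/parseDouble v) (double v))))

(defn date-parser-fn
  "Builds the parser-fn map: column name -> [datatype parse-fn]."
  [colnames]
  (into {}
        (map (fn [cname] [cname [:local-date-time serial->local-date-time]]))
        colnames))

(defn load-with-guessed-dates
  "First pass reads the file as-is; second pass re-reads it with the guessed parsers."
  [path]
  (let [first-pass (ds/->dataset path)
        candidates (date-candidate-columns first-pass 20)]
    (ds/->dataset path {:parser-fn (date-parser-fn candidates)})))
```

In the UI we would only show the result of date-candidate-columns as a suggestion, and run the second pass once the user confirms it.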
I like that approach honestly. It probably parses a file more or less instantly and you get the entire dataset to run your type heuristics on.
You can re-parse a column, though: https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.column.html#var-parse-column Then you can re-parse the columns using the same syntax as your parser-fn.
I should have directed you towards https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-column-cast - that is more general and not specific to string columns.
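A minimal sketch of what the column-cast route could look like, which avoids reading the file a second time. The function name cast-serials-to-dates is hypothetical, and the list of column names is assumed to come from whatever type heuristic you run on the loaded dataset. It relies on column-cast accepting a [datatype cast-fn] tuple and on the cast function being allowed to return :tech.v3.dataset/missing, as the linked docs describe.

```clojure
(ns example.excel-date-cast
  (:require [tech.v3.dataset :as ds])
  (:import [org.apache.poi.ss.usermodel DateUtil]))

(defn cast-serials-to-dates
  "Casts each named float64 column of Excel serials to :local-date-time in place
  of a second file read. Serials that POI rejects become missing values."
  [dataset colnames]
  (reduce (fn [d cname]
            (ds/column-cast
             d cname
             [:local-date-time
              (fn [v]
                ;; getLocalDateTime returns nil for invalid serials.
                (or (DateUtil/getLocalDateTime (double v))
                    :tech.v3.dataset/missing))]))
          dataset
          colnames))

;; Usage on an in-memory dataset standing in for a parsed sheet:
(def casted
  (cast-serials-to-dates (ds/->dataset {"d" [44000.0 44001.5] "x" [1.0 2.5]})
                         ["d"]))
```

Only the columns you name are touched, so the rest of the dataset keeps the types from the first parse.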
I read about column-cast and will give that one a try too. I think I may have misunderstood its API yesterday, but it looks like a good approach. Thank you!