data-science

Data science, data analysis, and machine learning in Clojure https://scicloj.github.io/pages/chat_streams/ for additional discussions
2021-03-26T16:04:54.011600Z

👋 I have a question about <http://tech.ml|tech.ml>.dataset and hope this is the channel to use 😄 I can't figure out how to tell the -&gt;dataset to parse Excel date cells as dates (and not numerics). Which cells that are date cells is unknown when parsing, is there a way to default to dates? Currently I get a column with a numeric value (the Excel way of calculating dates: "days since 1900 1 jan").

2021-03-27T12:30:03.014600Z

In our scenario, we guess the contents of an excel file and present it in a UI as suggestion for later parsing. I have solved the Excel date thing, by reading the file with -&gt;dataset as ususal. Then I take samples of columns from the dataset with type float64 to make an educated guess if it is a reasonable date, by using the Apache POI DateUtil/getLocalDateTime function. I collect the column names that are candidates for excel dates and build a parser-fn map and run the -&gt;dataset again, with the parser functions. There's an extra roundtrip, but could work 😄 Let's see what my colleagues think of it 😅

chrisn 2021-03-27T12:47:45.015Z

I like that approach honestly. It probably parses a file more or less instantly and you get the entire dataset to run your type heuristics on.

chrisn 2021-03-27T13:00:22.015200Z

You can re-parse a column though: https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.column.html#var-parse-column Then you can just re-parse the columns using same syntax as your parser-fn.

chrisn 2021-03-27T13:07:06.015400Z

I should have directed you towards https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-column-cast - that is more general and not specific towards string columns.

2021-03-27T13:26:29.015600Z

I read about the column-cast and will give that one a try too. I think I may have misunderstood the api of it yesterday, but it looks like a good approach. Thank you!

👍 1
2021-03-26T16:21:41.011800Z

Example - a column is recognized as :datatype :float64 with values like 44258.0

genmeblog 2021-03-26T16:55:14.012Z

@chris441 can you help here?

chrisn 2021-03-26T16:58:15.012200Z

Could I get an example document? Do you have a safe, small one you can send? There is no 'date' type in POI and fastexcel's cell type. You can do it by supplying a custom parse function to convert from days-since-jan-1 to a local date type but I would want to see some data and I can get you a test parse line. Something like

chrisn 2021-03-26T16:59:20.012400Z

(-&gt;dataset "data.xls" {:parser-fn {"date" [:local-date convert-to-local-date]}})

chrisn 2021-03-26T17:00:22.012600Z

where convert-to-local-date is a function from a number to your date type. Be wary of nils or strings; excel can really put anything in any cell.

2021-03-26T17:01:06.012800Z

Thanks! I think I succeeded with doing that, there was a similar example in the readme of the repo, I think.

chrisn 2021-03-26T17:01:07.013Z

Your function can return :http://tech.ml.dataset.parse/missing or :http://tech.ml.dataset.parse/parse-failure to denote a missing value or a value you expect to parse but it failed. In that case the value will be in the metadata of the column.

chrisn 2021-03-26T17:01:47.013200Z

OK, great 🙂.

2021-03-26T17:02:15.013400Z

If the cell containing date is unknown, is there a way to figure out what column is a date, in the meta data or something similar?

2021-03-26T17:02:51.013600Z

Aha, do you mean that my parse function could do that - if parsing failed -> not a date?

chrisn 2021-03-26T17:11:34.013800Z

I haven't found a way to do that yet. I think there may be but it is through the POI and fastexcel API's.

chrisn 2021-03-26T17:14:50.014Z

Also, if you want to build/edit spreadsheets there is a new library I think is really nice named https://github.com/zero-one-group/fxl.

👍 1
2021-03-26T17:15:21.014300Z

Thanks! I'll have a look!