data-science

Data science, data analysis, and machine learning in Clojure. See https://scicloj.github.io/pages/chat_streams/ for additional discussions.
2020-03-17T15:32:27.037400Z

Hey, I need to process a 300M-line CSV-like file. At a wild guess that's like 80GB; does that sound reasonable? I am being told to use Spark, but my iPhone could process that, right?

2020-03-18T17:05:13.042900Z

Most of the bases are covered above. Depends on use case, in particular whether you need all the data in memory at once, or can process it in a single (or small number of) scan(s). If you don't need it in memory, you could use https://github.com/metasoarous/semantic-csv
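A minimal sketch of that single-pass style with semantic-csv on top of clojure.data.csv; `row-of-interest?` is a hypothetical predicate standing in for your own logic:

```clojure
(require '[clojure.java.io :as io]
         '[clojure.data.csv :as csv]
         '[semantic-csv.core :as sc])

;; Stream the file, parse rows into maps, and fold to a result without
;; ever holding more than a handful of rows in memory.
(with-open [in (io/reader "big-file.csv")]
  (->> (csv/read-csv in)
       sc/mappify                 ; rows become maps keyed by the header
       (filter row-of-interest?)  ; hypothetical predicate
       count))                    ; any fold works; count keeps memory flat
```

Because `count` consumes the sequence eagerly inside the `with-open`, nothing is realized after the reader closes.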

chrisn 2020-03-18T21:52:43.043700Z

Just saw this thread. I agree that if you don't want to have it in memory, @metasoarous's library is a good option. If you do use tech.ml.dataset, you can filter columns and take a max number of rows to avoid processing any data you don't have to. Aside from that, with tech.ml.dataset the data will be in memory, and 30GB+ CSV files in memory will not be ideal, I think, unless there are a lot of repeated categorical values.

1👍 1✔️ 1➕
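As a rough illustration of the column filtering and row cap chrisn mentions, something like the following; the option keys match tech.ml.dataset's CSV parsing docs, but treat the exact names (and the column names) as assumptions for your version of the library:

```clojure
(require '[tech.ml.dataset :as ds])

;; Load only the columns you need and cap the row count so you never
;; parse data you won't use.
(def sample
  (ds/->dataset "big-file.csv"
                {:column-whitelist ["id" "timestamp" "value"] ; assumed names
                 :num-rows 1000000}))
```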
2020-03-17T15:39:14.037500Z

Yes, in my experience Spark is frequently overkill, and it definitely sounds like that's the case here. It depends quite a bit on what kind of analysis you need, though.

2020-03-17T15:42:09.037800Z

Also, depending on your row size, 80GB sounds pretty generous.

2020-03-17T15:53:22.038100Z

Could try https://github.com/techascent/tech.ml.dataset

3👍
2020-03-17T15:54:11.038400Z

the backing is https://github.com/jtablesaw/tablesaw

Eddie 2020-03-17T15:59:49.038700Z

My suggestion will differ based on how many passes you need to do over the data, and whether you need to store intermediate results. The simplest use case would be mapping a function (and/or filtering) over each row and writing the result to a new file. You only need 1 row in memory at a time. Clojure lazy sequences over a stream of data from the file will be fine; Spark will probably be more setup than it's worth. If you are doing analytics on the file and thus require aggregate computation, multiple passes, enhancement with other datasets, etc., it can start to get difficult to manage your resources. It is unlikely a single machine (including your iPhone 🙂) has 80GB of RAM, so putting the entire dataset into memory for efficient reuse will not be an option. You could end up thrashing to and from virtual memory, or maybe you will OOM. Spark would be a wonderful solution to that problem.

1👍
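One way Eddie's simplest case might look with plain lazy sequences; `keep-row?` and `transform` are hypothetical, and note a naive comma split won't handle quoted fields (use a real CSV parser for those):

```clojure
(require '[clojure.java.io :as io]
         '[clojure.string :as str])

;; One row in memory at a time: read a line, maybe transform it,
;; write it out, move on.
(with-open [r (io/reader "big-file.csv")
            w (io/writer "out.csv")]
  (doseq [line (line-seq r)]
    (let [fields (str/split line #",")]
      (when (keep-row? fields)                              ; hypothetical
        (.write w (str (str/join "," (transform fields)) "\n"))))))
```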
kenny 2020-03-17T16:02:53.038900Z

We process large CSV files, though smaller than yours (up to ~30GB). We do just as @erp12 says: a simple map/filter over lazily read gzipped CSV data, outputting to a much smaller CSV file. Though @erp12, it would take far more than 80GB of memory to fit an 80GB CSV file in memory.

1👍
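A sketch of that gzip-in, smaller-csv-out pipeline, assuming clojure.data.csv; `interesting-row?` and `select-columns` are stand-ins for the real logic:

```clojure
(require '[clojure.data.csv :as csv]
         '[clojure.java.io :as io])
(import '(java.util.zip GZIPInputStream))

;; csv/read-csv is lazy and csv/write-csv consumes row by row, so the
;; whole pipeline streams in bounded memory.
(with-open [in  (io/reader (GZIPInputStream. (io/input-stream "big.csv.gz")))
            out (io/writer "smaller.csv")]
  (->> (csv/read-csv in)
       (filter interesting-row?)  ; hypothetical predicate
       (map select-columns)       ; hypothetical projection
       (csv/write-csv out)))
```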
kenny 2020-03-17T16:04:50.039100Z

The Java process to do this uses around 3GB of memory. A new iPhone has 4GB of memory, so it could probably do it haha.

Eddie 2020-03-17T16:05:01.039400Z

Good point.

kenny 2020-03-17T16:06:48.039800Z

BTW, if you're going to take this approach @dustingetz, ensure locals clearing is enabled. I typically run a Cursive REPL in debugger mode, and that disables locals clearing, which easily causes OOMs.

1👍
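To make the locals-clearing hazard concrete: in a shape like the sketch below (with a hypothetical `handle-line`), the `lines` local is normally nulled out by the compiler after its last use, so realized lines can be GC'd as the loop goes. With locals clearing disabled (e.g. a debug REPL), `lines` keeps holding the head of the lazy seq and every line stays reachable until the loop ends:

```clojure
(require '[clojure.java.io :as io])

(defn process-file [path]
  (with-open [r (io/reader path)]
    (let [lines (line-seq r)]   ; head retained here if clearing is off
      (doseq [line lines]
        (handle-line line)))))  ; hypothetical per-line handler
```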
kenny 2020-03-17T16:24:29.041Z

Depending on the analysis you need to do, you can somewhat easily parallelize this processing by reading the csv in chunks.
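A rough shape for that, using `partition-all` + `pmap`; `summarize-chunk`, `merge-summaries`, and the chunk size are all assumptions to tune:

```clojure
(require '[clojure.java.io :as io])

;; pmap keeps only a few chunks in flight at once, so memory stays
;; bounded as long as each chunk's summary is small.
(with-open [r (io/reader "big-file.csv")]
  (->> (line-seq r)
       (partition-all 100000)      ; chunk size: an assumption
       (pmap summarize-chunk)      ; hypothetical per-chunk reducer
       (reduce merge-summaries)))  ; hypothetical combiner
```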

2020-03-17T16:31:40.041200Z

Nice thread, I really should check out tech.ml.dataset, and the locals clearing is a good hint too. I also realized I should have put what Eddie said in my answer; it's more immediately helpful than what I said 😉

jsa-aerial 2020-03-17T18:09:03.041700Z

There has been a tremendous amount of great new work on the tech.ml.dataset stack recently (as in daily over the last week or so) on just this sort of large-scale load and processing. The discussion has been over on Zulip under #data-science > tech.ml.dataset. @chris441 can say more, or head over there for much more info!

2020-03-17T20:28:32.042200Z

We are organizing an online https://twitter.com/hashtag/Clojure?src=hashtag_click hackathon for studying COVID-19 data. Please mark your preferred dates: https://twitter.com/scicloj/status/1240010550555353088

3👍