Data exploration with large files
Hinted by Dragan's and practicalli's work:
https://dragan.rocks/articles/20/Corona-1-Baby-steps-with-Covid-19-for-programmers
https://github.com/practicalli/data-science-corvid-19
... I wanted to explore this large file: https://open-covid-19.github.io/data/v2/latest/master.csv (cf. https://github.com/open-covid-19/data/blob/master/README.md#master)
Alas, it's too big for my Java heap to slurp in. Is there a library that allows exploring the data with e.g. map, reduce and core.async without ever loading the full file? I know there are ways, cf.
https://amp.reddit.com/r/Clojure/comments/5x2n47/counting_lines_60_faster_than_wc_with_clojure/
https://stackoverflow.com/questions/38340868/consuming-file-contents-with-clojures-core-async
I just wanted to know if there is a set of ready-to-use functions for that, or at least a write-up.
hmmm, this is not a large file, it's only 300kb
When I download it, it's about 1.3 MB
I can (slurp "https://open-covid-19.github.io/data/v2/latest/master.csv")
without problems
@vlaaad imagine it was large. What would you do?
Depending on how large. If it's possible to fit into memory, maybe I'd have a look at tech.ml.dataset
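A minimal sketch of what that could look like with tech.ml.dataset, assuming the tech.v3.dataset namespace of recent versions (earlier releases used tech.ml.dataset) and assuming the :key-fn option to keywordize the CSV header:

(require '[tech.v3.dataset :as ds])

;; ->dataset can read straight from a URL or file path;
;; :key-fn keyword is assumed here to turn header strings into keywords.
(def covid
  (ds/->dataset "https://open-covid-19.github.io/data/v2/latest/master.csv"
                {:key-fn keyword}))

(ds/column-names covid)                          ; inspect the columns
(reduce + (remove nil? (covid :total_tested)))   ; sum a column, skipping missing values
                                                 ; (assuming missing values come back as nil)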
If not, I'd use stream processing (aka lazy seq + reduce)
So instead of slurp it would be something that reads csv line by line and processes individual items
@nick.romer for example:
(require '[clojure.java.io :as io]
         '[clojure.string :as str])

(with-open [reader (io/reader "https://open-covid-19.github.io/data/v2/latest/master.csv")]
  (let [[header & rows] (line-seq reader)
        keys (map keyword (str/split header #","))]
    (->> rows
         (map #(str/split % #","))
         (map #(zipmap keys %))
         (map :total_tested)
         (remove empty?)
         (map #(Integer/parseInt %))
         (reduce +))))
I just tested it and it works fine. Very instructive for a newbie like me, especially the use of zipmap. Is there a way to use reducers' fold to speed it up?
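One caveat: clojure.core.reducers/fold only runs in parallel over foldable collections such as vectors, so the lazy line-seq would have to be realized into a vector first. A rough sketch under that assumption, keeping the naive str/split parsing from above (whether it is actually faster for a ~1.3 MB file is another question):

(require '[clojure.core.reducers :as r]
         '[clojure.java.io :as io]
         '[clojure.string :as str])

(with-open [reader (io/reader "https://open-covid-19.github.io/data/v2/latest/master.csv")]
  (let [[header & rows] (line-seq reader)
        ks (map keyword (str/split header #","))
        row-vec (vec rows)]            ; realize the rows so fold can split the work
    (r/fold + (r/map (fn [row]
                       (let [v (:total_tested (zipmap ks (str/split row #",")))]
                         (if (seq v) (Integer/parseInt v) 0)))
                     row-vec))))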
(there is clojure.data.csv for csv parsing, I just was too lazy to restart a repl)
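For reference, a sketch of the same sum using clojure.data.csv, which handles quoted fields and embedded commas properly (same :total_tested column assumed):

(require '[clojure.data.csv :as csv]
         '[clojure.java.io :as io])

;; read-csv returns a lazy sequence of row vectors, so this still streams line by line
(with-open [reader (io/reader "https://open-covid-19.github.io/data/v2/latest/master.csv")]
  (let [[header & rows] (csv/read-csv reader)
        ks (map keyword header)]
    (->> rows
         (map #(zipmap ks %))
         (map :total_tested)
         (remove empty?)
         (map #(Integer/parseInt %))
         (reduce +))))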
@nick.romer Have you tried processing it lazily?