data-science

Data science, data analysis, and machine learning in Clojure. See https://scicloj.github.io/pages/chat_streams/ for additional discussions.
niveauverleih 2020-06-06T09:11:22.291500Z

Data exploration with large files. Prompted by Dragan's and practicalli's posts, https://dragan.rocks/articles/20/Corona-1-Baby-steps-with-Covid-19-for-programmers and https://github.com/practicalli/data-science-corvid-19, I wanted to explore this large file: https://open-covid-19.github.io/data/v2/latest/master.csv (cf. https://github.com/open-covid-19/data/blob/master/README.md#master). Alas, it's too big for my Java heap to slurp in. Is there a library that allows exploring the data using e.g. map, reduce, and core.async without ever loading the full file? I know there are ways, cf. https://amp.reddit.com/r/Clojure/comments/5x2n47/counting_lines_60_faster_than_wc_with_clojure/ and https://stackoverflow.com/questions/38340868/consuming-file-contents-with-clojures-core-async — I just wanted to know if there is a set of ready-to-use functions for that, or at least a write-up.

vlaaad 2020-06-06T10:29:43.292200Z

hmmm, this is not a large file, it's only 300kb

jumar 2020-06-09T06:49:01.307300Z

When I download it, it's about 1.3 MB

vlaaad 2020-06-06T10:30:30.292600Z

I can (slurp "https://open-covid-19.github.io/data/v2/latest/master.csv") without problems

niveauverleih 2020-06-06T11:52:17.294Z

@vlaaad imagine it was large. What would you do?

vlaaad 2020-06-06T11:55:32.294100Z

Depending on how large. If it's possible to fit into memory, maybe I'd have a look at tech.ml.dataset

2☝️
vlaaad 2020-06-06T11:56:24.294200Z

If not, I'd use streaming processing (aka lazy seq + reduce)

vlaaad 2020-06-06T11:57:20.294300Z

So instead of slurp it would be something that reads csv line by line and processes individual items

vlaaad 2020-06-06T12:06:36.294600Z

@nick.romer for example:

(require '[clojure.java.io :as io]
         '[clojure.string :as str])
(with-open [reader (io/reader "https://open-covid-19.github.io/data/v2/latest/master.csv")]
  (let [[header & rows] (line-seq reader)
        keys (map keyword (str/split header #","))]
    (->> rows
         (map #(str/split % #","))
         (map #(zipmap keys %))
         (map :total_tested)
         (remove empty?)
         (map #(Integer/parseInt %))
         (reduce +))))

niveauverleih 2020-06-10T09:14:34.311100Z

I just tested it and it works fine. Very instructive for the newbie that I am, especially the use of zipmap. Is there a way to use reducers fold to speed it up?
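[Editor's note: a minimal sketch of the `clojure.core.reducers/fold` idea asked about above. Note that `r/fold` only parallelizes over foldable collections such as vectors and hash maps; a lazy `line-seq` silently falls back to a plain sequential `reduce`, so this sketch realizes the rows into a vector first (which assumes they fit in memory). The URL and the `:total_tested` column are taken from the thread.]

```clojure
(require '[clojure.java.io :as io]
         '[clojure.string :as str]
         '[clojure.core.reducers :as r])

(with-open [reader (io/reader "https://open-covid-19.github.io/data/v2/latest/master.csv")]
  (let [[header & rows] (line-seq reader)
        ks   (map keyword (str/split header #","))
        rows (vec rows)]               ; realize into a vector so fold can split it
    (r/fold +                          ; combinef: merge partial sums; (+) => 0 is the init
            (fn [acc line]             ; reducef: fold one CSV line into the running sum
              (let [v (:total_tested (zipmap ks (str/split line #",")))]
                (if (str/blank? v)
                  acc
                  (+ acc (Long/parseLong v)))))
            rows)))
```

Whether this actually speeds things up depends on the file: for a ~1 MB CSV the parsing is cheap enough that the sequential version is likely just as fast, and the download dominates either way.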

vlaaad 2020-06-06T12:07:13.295200Z

(there is clojure.data.csv for csv parsing, I just was too lazy to restart a repl)
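[Editor's note: the same computation with `clojure.data.csv`, as suggested above. This is a sketch assuming the `org.clojure/data.csv` dependency is on the classpath; unlike the hand-rolled `str/split`, `csv/read-csv` correctly handles quoted fields containing commas.]

```clojure
(require '[clojure.java.io :as io]
         '[clojure.data.csv :as csv])

(with-open [reader (io/reader "https://open-covid-19.github.io/data/v2/latest/master.csv")]
  (let [[header & rows] (csv/read-csv reader)  ; lazy seq of row vectors
        ks (map keyword header)]
    (->> rows
         (map #(zipmap ks %))
         (map :total_tested)
         (remove empty?)
         (map #(Long/parseLong %))
         (reduce +))))
```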

okwori 2020-06-06T15:35:20.295300Z

Is there a fix yet for the arc pie chart not displaying / throwing an exception?

2020-06-06T19:52:02.295900Z

@nick.romer Have you tried processing it lazily?