tech.ml.dataset can load that file, I believe. It is far more memory-efficient in general.
user> (require '[tech.ml.dataset :as ds])
nil
user> (def ds (ds/->dataset "https://open-covid-19.github.io/data/v2/latest/master.csv"))
#'user/ds
user> (require '[clj-memory-meter.core :as mm])
nil
user> (mm/measure ds)
"5.7 MB"
For a one-stop data exploration pathway that should work well for you:
https://github.com/cnuernber/simpledata/
Dude, that memory-meter shit is dope! Going to stow that one away in my toolbox.
Haha, yeah totally. I really wish I had found that earlier, as tracking down which object graph in a program is hogging RAM is a serious problem sometimes 🙂.
@chris441 Do I need any special setup for tech.ml.dataset? The 2.0-beta... didn't work well, so I tried 1.73, but that gives another error:
1. Caused by java.lang.IllegalArgumentException
Missing config value: :tech-io-cache-local
core.clj: 198 tech.config.core/get-config
core.clj: 193 tech.config.core/get-config
providers.clj: 51 tech.io.providers/fn/fn
core.clj: 2753 clojure.core/map/fn
LazySeq.java: 42 clojure.lang.LazySeq/sval
LazySeq.java: 51 clojure.lang.LazySeq/seq
RT.java: 535 clojure.lang.RT/seq
core.clj: 137 clojure.core/seq
core.clj: 2809 clojure.core/filter/fn
LazySeq.java: 42 clojure.lang.LazySeq/sval
LazySeq.java: 51 clojure.lang.LazySeq/seq
RT.java: 535 clojure.lang.RT/seq
core.clj: 137 clojure.core/seq
core.clj: 930 clojure.core/reduce1
core.clj: 947 clojure.core/reverse
core.clj: 947 clojure.core/reverse
providers.clj: 42 tech.io.providers/provider-seq->wrapped-providers
providers.clj: 35 tech.io.providers/provider-seq->wrapped-providers
providers.clj: 48 tech.io.providers/fn
providers.clj: 47 tech.io.providers/fn
AFn.java: 152 clojure.lang.AFn/applyToHelper
AFn.java: 144 clojure.lang.AFn/applyTo
core.clj: 665 clojure.core/apply
core.clj: 6353 clojure.core/memoize/fn
RestFn.java: 397 clojure.lang.RestFn/invoke
io.clj: 35 tech.io/*provider-fn*
io.clj: 35 tech.io/*provider-fn*
io.clj: 80 tech.io/input-stream
io.clj: 76 tech.io/input-stream
RestFn.java: 410 clojure.lang.RestFn/invoke
base.clj: 572 tech.ml.dataset.base/->dataset
base.clj: 515 tech.ml.dataset.base/->dataset
base.clj: 580 tech.ml.dataset.base/->dataset
base.clj: 515 tech.ml.dataset.base/->dataset
With 2.0-beta-57 I get an error even earlier, when requiring the lib:
Syntax error (IllegalArgumentException) compiling . at (tech/ml/dataset/math.clj:136:11).
No matching method fit found taking 5 args for class smile.clustering.KMeans
I solved it: it seems that fastmath is using older versions of the smile-* dependencies. I had to manually specify the 2.4.0 versions in my project.clj.
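For anyone hitting the same KMeans error, the override might look roughly like the fragment below. This is only a sketch: the project name and the Clojure version are placeholders, and `smile-core` is my guess at the smile-* artifact that matters here. Check `lein deps :tree` to see which smile-* artifacts are actually being pulled in at older versions and pin those.

```clojure
;; project.clj (fragment; project name is hypothetical)
(defproject my-project "0.1.0-SNAPSHOT"
  :dependencies [[org.clojure/clojure "1.10.1"]
                 [techascent/tech.ml.dataset "2.0-beta-57"]
                 ;; Assumption: declaring the smile-* artifacts at the top level
                 ;; makes 2.4.0 win over the older versions that arrive
                 ;; transitively (e.g. via fastmath).
                 [com.github.haifengl/smile-core "2.4.0"]])
```

Top-level dependencies take precedence over transitive ones in Leiningen, which is why pinning the version directly is enough.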
The memory footprint indeed looks much lower compared to Clojure vectors/hash maps: https://github.com/jumarko/clojure-experiments/blob/master/src/clojure_experiments/csv.clj#L39-L49
;; assumes (require '[clojure.data.csv :as csv]); csv-data->maps is defined in the linked file
(def csv-ds (csv/read-csv (slurp "https://open-covid-19.github.io/data/v2/latest/master.csv")))
;; don't be fooled by lazy seqs when measuring memory -> use vector
(mm/measure (vec csv-ds))
;; => "23.1 MB"
(mm/measure (vec (csv-data->maps csv-ds)))
;; => "31.8 MB"
(require '[tech.ml.dataset :as ds])
(def ds (ds/->dataset "https://open-covid-19.github.io/data/v2/latest/master.csv"))
(mm/measure ds)
;; => "5.1 MB"
For more on that pathway:
https://gist.github.com/cnuernber/26b88ed259dd1d0dc6ac2aa138eecf37
If you get a dataset where the numeric data can be represented by short integers and the string columns have few unique values, then the dataset library really will shine.
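To give a feel for why low-cardinality string columns and short integers compress so well, here is a minimal sketch of dictionary encoding in plain Clojure. It is not tech.ml.dataset's actual implementation; `dictionary-encode`, `dictionary-decode` and the sample column are hypothetical names made up for illustration.

```clojure
;; Hypothetical illustration; not tech.ml.dataset's actual implementation.
(defn dictionary-encode
  "Stores each distinct string once and keeps one short code per row."
  [values]
  (let [dictionary  (vec (distinct values))
        value->code (zipmap dictionary (range))]
    {:dictionary dictionary
     ;; 2 bytes per row instead of an object reference per row
     :codes      (short-array (map value->code values))}))

(defn dictionary-decode
  "Rebuilds the original column from the dictionary and the codes."
  [{:keys [dictionary codes]}]
  (mapv #(nth dictionary (int %)) codes))

(def countries ["US" "DE" "US" "US" "FR" "DE"])
(def encoded (dictionary-encode countries))

(:dictionary encoded)                     ;; => ["US" "DE" "FR"]
(vec (:codes encoded))                    ;; => [0 1 0 0 2 1]
(= countries (dictionary-decode encoded)) ;; => true
```

The sketch only works while the number of distinct values fits in a short; a real library would have to pick a wider code type beyond that.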
Also, if you measure the memory used by ds/mapseq-reader, you will see that the maps are really referring back to the original table data; you only pay for what you read when converting a dataset back into a sequence of maps.
https://github.com/techascent/tech.ml.dataset/blob/master/java/tech/ml/dataset/FastStruct.java
I got that idea from @metasoarous’s semantic-csv library.
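The "maps refer back to the table" idea can be sketched in a few lines of plain Clojure. This is a hypothetical illustration and not the FastStruct code linked above; `columns`, `row-view` and `row-views` are made-up names, and the views here only support lookup rather than the full map interface.

```clojure
;; Hypothetical sketch of "rows as views"; not the FastStruct implementation.
(def columns
  {:cases  (short-array [10 20 30])
   :region ["north" "south" "north"]})

(defn row-view
  "Returns a read-only, map-like view of row `idx`. No values are copied;
  each lookup reads straight from the backing column."
  [columns idx]
  (reify clojure.lang.ILookup
    (valAt [_ k]
      (when-let [col (get columns k)] (nth col idx)))
    (valAt [_ k not-found]
      (if-let [col (get columns k)] (nth col idx) not-found))))

(defn row-views
  "Lazy sequence of row views over the first `n-rows` rows."
  [columns n-rows]
  (map #(row-view columns %) (range n-rows)))

(def rows (row-views columns 3))

(:region (second rows)) ;; => "south"
(:cases (last rows))    ;; => 30
```

Each view holds only a reference to the columns plus an index, so walking the sequence costs far less than materializing a full hash map per row.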