data-science

Data science, data analysis, and machine learning in Clojure. See https://scicloj.github.io/pages/chat_streams/ for additional discussions.
chrisn 2020-06-07T14:22:28.297600Z

tech.ml.dataset can load that file, I believe. It is far more efficient with memory in general.

user> (require '[tech.ml.dataset :as ds])
nil
user> (def ds (ds/->dataset "https://open-covid-19.github.io/data/v2/latest/master.csv"))
#'user/ds
user> (require '[clj-memory-meter.core :as mm])
nil
user> (mm/measure ds)
"5.7 MB"
For a one-stop data exploration pathway that should work well for you: https://github.com/cnuernber/simpledata/
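(As a quick illustration of that exploration pathway, here is a minimal sketch using a few tech.ml.dataset functions on the same CSV; the output depends entirely on the file you load.)

(require '[tech.ml.dataset :as ds])

;; load the CSV straight from the URL into a columnar dataset
(def ds (ds/->dataset "https://open-covid-19.github.io/data/v2/latest/master.csv"))

(ds/column-names ds)      ;; which columns were parsed
(ds/head ds)              ;; first few rows, printed as a table
(ds/descriptive-stats ds) ;; per-column summary statistics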

👍 1
2020-06-08T18:47:10.302100Z

Dude; That memory-meter shit is dope! Going to stow that one away in my toolbox.

chrisn 2020-06-08T19:27:49.302400Z

Haha, yeah totally. I really wish I had found that earlier, as tracking down which object graph in a program is hogging RAM is a serious problem sometimes 🙂.
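(For anyone wiring clj-memory-meter up themselves: a minimal Leiningen sketch, assuming the usual Clojars coordinates; the project name and versions here are illustrative. On JDK 9+ the measuring agent also needs self-attach enabled.)

;; project.clj (sketch; check Clojars for the current clj-memory-meter version)
(defproject memory-meter-demo "0.1.0-SNAPSHOT"
  :dependencies [[org.clojure/clojure "1.10.1"]
                 [com.clojure-goes-fast/clj-memory-meter "0.1.3"]]
  ;; required on JDK 9+ so the agent can attach to its own JVM process
  :jvm-opts ["-Djdk.attach.allowAttachSelf"])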

jumar 2020-06-09T06:51:19.307500Z

@chris441 Do I need any special setup for tech.ml.dataset? The 2.0-beta... didn't work well, so I tried 1.73, but that gives another error:

1. Caused by java.lang.IllegalArgumentException
   Missing config value: :tech-io-cache-local

                  core.clj:  198  tech.config.core/get-config
                  core.clj:  193  tech.config.core/get-config
             providers.clj:   51  tech.io.providers/fn/fn
                  core.clj: 2753  clojure.core/map/fn
              LazySeq.java:   42  clojure.lang.LazySeq/sval
              LazySeq.java:   51  clojure.lang.LazySeq/seq
                   RT.java:  535  clojure.lang.RT/seq
                  core.clj:  137  clojure.core/seq
                  core.clj: 2809  clojure.core/filter/fn
              LazySeq.java:   42  clojure.lang.LazySeq/sval
              LazySeq.java:   51  clojure.lang.LazySeq/seq
                   RT.java:  535  clojure.lang.RT/seq
                  core.clj:  137  clojure.core/seq
                  core.clj:  930  clojure.core/reduce1
                  core.clj:  947  clojure.core/reverse
                  core.clj:  947  clojure.core/reverse
             providers.clj:   42  tech.io.providers/provider-seq->wrapped-providers
             providers.clj:   35  tech.io.providers/provider-seq->wrapped-providers
             providers.clj:   48  tech.io.providers/fn
             providers.clj:   47  tech.io.providers/fn
                  AFn.java:  152  clojure.lang.AFn/applyToHelper
                  AFn.java:  144  clojure.lang.AFn/applyTo
                  core.clj:  665  clojure.core/apply
                  core.clj: 6353  clojure.core/memoize/fn
               RestFn.java:  397  clojure.lang.RestFn/invoke
                    io.clj:   35  tech.io/*provider-fn*
                    io.clj:   35  tech.io/*provider-fn*
                    io.clj:   80  tech.io/input-stream
                    io.clj:   76  tech.io/input-stream
               RestFn.java:  410  clojure.lang.RestFn/invoke
                  base.clj:  572  tech.ml.dataset.base/->dataset
                  base.clj:  515  tech.ml.dataset.base/->dataset
                  base.clj:  580  tech.ml.dataset.base/->dataset
                  base.clj:  515  tech.ml.dataset.base/->dataset

jumar 2020-06-09T06:53:38.307700Z

With 2.0-beta-57 I get this error even earlier, when requiring the lib:

Syntax error (IllegalArgumentException) compiling . at (tech/ml/dataset/math.clj:136:11).
No matching method fit found taking 5 args for class smile.clustering.KMeans

jumar 2020-06-09T07:17:04.308Z

I solved it - it seems that fastmath is using older versions of the smile-* dependencies. I had to manually specify the 2.4.0 versions in my project.clj (a sketch of that override is after the snippet below). The memory footprint indeed looks quite a bit lower compared to a Clojure vector/hashmap: https://github.com/jumarko/clojure-experiments/blob/master/src/clojure_experiments/csv.clj#L39-L49

(def csv-ds (csv/read-csv (slurp "https://open-covid-19.github.io/data/v2/latest/master.csv")))
  ;; don't be fooled by lazy seqs when measuring memory -> use vector
  (mm/measure (vec csv-ds))
  ;; => "23.1 MB"
  (mm/measure (vec (csv-data->maps csv-ds)))
  ;; => "31.8 MB"

  (require '[tech.ml.dataset :as ds])
  (def ds (ds/->dataset "https://open-covid-19.github.io/data/v2/latest/master.csv"))
  (mm/measure ds)
  ;; => "5.1 MB"

chrisn 2020-06-09T13:42:28.309400Z

For more on that pathway: https://gist.github.com/cnuernber/26b88ed259dd1d0dc6ac2aa138eecf37 If you get a dataset where the numeric data can be represented by short integers and the string columns have low numbers of unique items, then the dataset library really will shine. Also, if you measure the memory used by ds/mapseq-reader, you will see that the maps are really referring back to the original table data; you only pay for what you read when converting a dataset back into a sequence of maps. https://github.com/techascent/tech.ml.dataset/blob/master/java/tech/ml/dataset/FastStruct.java I got that idea from @metasoarous's semantic-csv library.
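(A small sketch of that last point, reusing the dataset from earlier; the comments describe the expectation from the explanation above rather than a measured number.)

(require '[tech.ml.dataset :as ds]
         '[clj-memory-meter.core :as mm])

(def ds (ds/->dataset "https://open-covid-19.github.io/data/v2/latest/master.csv"))

;; mapseq-reader returns map-like structs (FastStruct) that index back into the
;; dataset's column storage instead of copying every value into a fresh map
(def row-maps (ds/mapseq-reader ds))

(first row-maps)       ;; reads like an ordinary Clojure map
(mm/measure row-maps)  ;; should stay close to the dataset's own footprint,
                       ;; since the row maps share the underlying column data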