core-matrix

intended for specific discussion around core.matrix (usage and development). For general data work, check out #data-science
2016-03-09T13:14:02.000043Z

rickmoynihan and mikera: I've been thinking about plumatic schema and core.matrix (esp. datasets) lately. Any thoughts on how to combine the two?

2016-03-09T13:14:19.000044Z

it is soooo easy to do with a vector of maps - it would be nice to be able to do it on a dataset
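
(For reference, the vector-of-maps case being described - a minimal sketch, assuming plumatic schema is required as [schema.core :as s]:)

(require '[schema.core :as s])

(def Row {:a s/Num :b s/Num})

;; validating a whole vector of maps is a one-liner:
(s/validate [Row] [{:a 1 :b 2} {:a 3 :b 4}])
;; => [{:a 1, :b 2} {:a 3, :b 4}]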

2016-03-09T13:38:26.000045Z

@otfrom: funny you should say that... we've been having similar discussions at Swirrl

2016-03-09T13:41:51.000046Z

I've not made much use of core.matrix yet... but we use incanter datasets in grafter quite a lot, though our use case is a little different. Basically incanter/core.matrix like to load the whole dataset into memory etc... but because we want to use it for ETL, we've been trying to avoid that and instead keep a lazy-seq of :rows in the Dataset
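
(A rough sketch of that shape, with illustrative names - the :rows stay a lazy seq, so the source never has to be fully realized:)

(defrecord Dataset [column-names rows])

(defn lazy-dataset
  "Build a Dataset whose rows are lazily zipped up with the column names."
  [column-names raw-rows]
  (->Dataset column-names (map #(zipmap column-names %) raw-rows)))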

2016-03-09T13:43:52.000047Z

but this means that validation of the rows at least is somewhat delayed - because you don't want to have to consume everything up front just to validate the rows
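
(One hedged option: wrap each row in a schema validator as the seq is consumed - my-dataset here stands in for a Dataset like the one sketched above - so the validation cost is only paid for rows that actually get realized:)

(def validate-row! (s/validator {:a s/Num :b s/Num}))

;; lazily validated - throws on the first bad row that is realized
(def checked-rows (map validate-row! (:rows my-dataset)))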

2016-03-09T13:46:21.000050Z

but my problem with incanter Datasets is that they allow arbitrary types as keys. From my perspective this can cause a lot of problems, and it'd be much nicer if they were always keywords (though I'd accept always strings too) - allowing them to be either causes problems with equality
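
(A quick illustration of the equality problem - keyword and string keys never compare equal, so two otherwise identical rows don't match:)

(= {:year 2016} {"year" 2016})
;; => false
(get {"year" 2016} :year)
;; => nil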

2016-03-09T13:47:34.000051Z

I'd like to move away from incanter though - and perhaps define a Dataset type of our own that conforms to the core.matrix Dataset protocol

2016-03-09T13:53:48.000052Z

I'd also like to experiment with perhaps a reducer-based implementation... but I've not seen many examples of people using reducers for I/O

2016-03-09T14:29:45.000054Z

rickmoynihan: AFAIK incanter 1.9 (and later 2.x) uses core.matrix.dataset

2016-03-09T14:30:19.000055Z

rickmoynihan: iota is a good one to look at for reducers and IO https://github.com/thebusby/iota

2016-03-09T14:30:37.000057Z

I really should look again at what you have done in grafter

2016-03-09T15:11:10.000058Z

@otfrom: yeah I know incanter plans to use core.matrix... but incanter 1.9 is basically a snapshot release... and there's been almost no movement on incanter for a long time as far as I can see

2016-03-09T15:13:49.000059Z

I've actually been looking at iota - it's one of the few examples of reducers and I/O that I've found - from the little I've seen it seems to assume too much about the file parsing...

2016-03-09T15:14:07.000060Z

but I need to look at it in more depth

2016-03-09T15:19:57.000061Z

@rickmoynihan: got schema and core.matrix.datasets working together

2016-03-09T15:21:25.000063Z

matty.core> (def DataSet {:column-names [s/Keyword]
                          :columns [[s/Num]]
                          :shape [(s/one s/Num "x-shape")
                                  (s/one s/Num "y-shape")]})
;; => #'matty.core/DataSet
matty.core> (def foo (ds/dataset [:a :b :c] [[10 11 12] [20 21 22]]))
;; => #'matty.core/foo
matty.core> (s/validate DataSet foo)
;; => {:column-names [:a :b :c], :columns [[10 20] [11 21] [12 22]], :shape [2 3]}

2016-03-09T15:21:42.000065Z

not quite sure what my problem was before

2016-03-09T15:22:07.000066Z

ds is [clojure.core.matrix.dataset :as ds]

2016-03-09T15:22:10.000067Z

cool

2016-03-09T15:22:29.000068Z

so I just need to constrain the column-names to keywords as I want

2016-03-09T15:22:45.000069Z

not sure if I can do coercion yet, but at least I can do validation
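
(Coercion does work in plumatic schema via schema.coerce - a hedged sketch, reusing the DataSet schema from the REPL session above on a plain map with string column names:)

(require '[schema.coerce :as coerce])

(def coerce-dataset
  (coerce/coercer DataSet coerce/json-coercion-matcher))

;; json-coercion-matcher turns strings into keywords where s/Keyword is expected:
(coerce-dataset {:column-names ["a" "b" "c"]
                 :columns [[10 20] [11 21] [12 22]]
                 :shape [2 3]})
;; => {:column-names [:a :b :c], :columns [[10 20] [11 21] [12 22]], :shape [2 3]}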

2016-03-09T15:23:05.000070Z

I noticed the other day that core.matrix has a column-wise representation now - I'm guessing the protocol doesn't require that

2016-03-09T15:25:18.000071Z

Regarding Grafter - our use cases are probably a little different. Firstly we have to avoid using incanter quite a lot, because incanter is eager... so the API isn't as expressive as what incanter provides; again we've been preferring laziness to eagerness (though that brings its own problems for sure). Also the main idea with Grafter was to support an OpenRefine-like interface for building transformations - so the DSL functions are intentionally dumbed down for those reasons. Also, syntactically we thread the ds through the first argument of functions rather than the last - mainly because I wanted the option of optional arguments on dataset functions in the DSL. The basic idea is that each step in a -> is an undo point - allowing stepwise debugging at the granularity of the dataset functions via the UI
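
(A sketch of that thread-first style with illustrative stand-ins - these are not Grafter's real DSL functions - reusing the Dataset record sketched earlier, with my-dataset standing in for one such dataset:)

(defn drop-rows [ds n]
  (update ds :rows #(drop n %)))

(defn derive-column [ds new-col f]
  (update ds :rows (fn [rows] (map #(assoc % new-col (f %)) rows))))

;; each -> step is a whole-dataset transformation, i.e. an undo point,
;; and the trailing argument positions stay free for optional arguments:
(-> my-dataset
    (drop-rows 1)
    (derive-column :total #(+ (:price %) (:qty %))))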

2016-03-09T15:28:53.000073Z

rickmoynihan: https://github.com/Swirrl/grafter/pull/61

2016-03-09T15:28:56.000075Z

😉

2016-03-09T15:32:28.000076Z

thanks :simple_smile:

2016-03-09T15:34:14.000077Z

I'm still a wiki gnome at ❤️

2016-03-09T15:34:46.000078Z

hmm the API docs also need updating - we're on 0.7.0 now

2016-03-09T15:39:15.000079Z

@otfrom: You can see a prototype Grafter UI that was built by SINTEF (an FP7 project partner from the DaPaaS project) as the centrepiece of this: https://datagraft.net/ Basically we'd decided to build Grafter to support a UI that we were planning to build as part of our product - but we didn't have enough resources in the project, so we let SINTEF (a Norwegian research institution) build a prototype UI for us... They took a fair bit of direction from us about what to do and how to do it - and they did a pretty good job - but it's very much prototype quality

2016-03-09T15:39:44.000081Z

there's a youtube video on that page you can watch

2016-03-09T15:58:24.000083Z

cool. Will have a look

2016-03-09T16:06:12.000084Z

One thing I've been wondering is whether it'd also be possible to have reducer/channel/seq (and therefore transducer) backed datasets... from your core.matrix experience, would this be possible with the API?

2016-03-09T16:06:28.000085Z

I mean now that there's a Dataset protocol presumably you could do this

2016-03-09T16:16:07.000086Z

IIRC mikera was suggesting getting at each row and processing it that way in a transducer (and, I presume, reducer) style
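
(A hedged sketch of that row-wise style - it assumes clojure.core.matrix.dataset/row-maps, which returns the rows as a seq of maps; foo is the dataset from the REPL session above:)

(require '[clojure.core.matrix.dataset :as ds])

(into []
      (comp (filter #(> (:a %) 10))
            (map #(update % :b inc)))
      (ds/row-maps foo))
;; => [{:a 20, :b 22, :c 22}]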

2016-03-09T16:16:35.000087Z

I think it partly comes down to whether the backing matrix implementation is faster than the trans/reducer would be

2016-03-09T16:16:52.000088Z

as a lot of the performance stuff is baked into the matrix implementations themselves

2016-03-09T17:52:38.000089Z

that's my understanding too... As I said though - our use case is perhaps a little different - in that firstly there isn't really a suitable backing matrix implementation that I know of - and we want to avoid loading the file into RAM - so the idea in e.g. grafter 2 would be to just build up a reducer inside the Dataset somehow - I'm guessing we could perhaps use the reduce protocols (clojure.lang.IReduceInit / clojure.core.protocols/CollReduce) for this... as our representation is currently #Dataset { :rows (...) :column-names [:foo :bar :baz]} and we'd need operations to keep the column names in sync with the row data. I'm not quite sure how we could back it with a transducer yet though... as I'm not sure there are protocols for that
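
(A minimal sketch of that idea, with illustrative names - a Dataset whose rows reduce straight off a reader without ever being held in memory, via clojure.lang.IReduceInit:)

(require '[clojure.java.io :as io]
         '[clojure.string :as str])

(deftype ReducibleDataset [column-names make-reader parse-line]
  clojure.lang.IReduceInit
  (reduce [_ f init]
    (with-open [rdr (make-reader)]
      ;; delegate to core reduce, zipping each line up with the column names
      (reduce (fn [acc line] (f acc (zipmap column-names (parse-line line))))
              init
              (line-seq rdr)))))

;; e.g. counting the rows of a (hypothetical) huge.csv without realizing them:
(def big (->ReducibleDataset [:foo :bar :baz]
                             #(io/reader "huge.csv")
                             #(str/split % #",")))
(reduce (fn [n _row] (inc n)) 0 big)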

2016-03-09T17:53:32.000090Z

but it'd be very cool if you could switch a dataset between being pull-based, push-based, reducible/foldable, and sequence-able - but I think I need to learn a lot more about reducers and transducers
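
(For the pull vs reduce split at least, clojure.core/eduction already gives both from a single value - a quick sketch:)

(def rows (eduction (map inc) (range 5)))

(first rows)       ; pull: realized as a seq on demand => 1
(reduce + 0 rows)  ; reduce: runs the transducer directly, no intermediate seq => 15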

2016-03-09T17:53:38.000091Z

I've not made much use of either yet