rickmoynihan and mikera I've been thinking about plumatic schema and core.matrix (esp datasets) lately. Any thoughts on how to validate a dataset with a schema?
it is soooo easy to do with a vector of maps; it would be nice to do the same on a dataset
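e.g. with a vector of maps it's just something like:
(require '[schema.core :as s])
(def Row {:name s/Str :age s/Num})
(s/validate [Row] [{:name "Ada" :age 36}
                   {:name "Alan" :age 41}])
;; => returns the rows unchanged when they validate, throws otherwise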
@otfrom: funny you should say that... we've been having similar discussions at Swirrl
I've not made much use of core.matrix yet... but we use incanter datasets in grafter quite a lot, though our use case is a little different.
Basically incanter/core.matrix like to load the whole dataset into memory... but because we want to use it for ETL, we've been trying to avoid that and instead keep a lazy-seq of :rows in the Dataset
but this means that validation of the rows at least is somewhat delayed - because you don't want to have to consume everything every time just to validate the rows
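one option we've considered is validating each row lazily as it streams past, roughly:
;; assuming (require '[schema.core :as s])
;; wrap the row seq so each row is validated only when it's actually consumed
(defn validate-rows [row-schema rows]
  (map (partial s/validate row-schema) rows))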
but my problem with incanter Datasets is that they allow arbitrary types as keys. From my perspective this can cause a lot of problems, and it'd be much nicer if they were always keywords (though I'd accept always strings too) - allowing them to be either causes problems with equality
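e.g. the kind of thing that bites you:
(= {:a 1} {"a" 1})   ;; => false - same data, different key types
(get {"a" 1} :a)     ;; => nil - lookups silently miss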
I'd like to move away from incanter though - and perhaps define a Dataset type of our own that conforms to the core.matrix Dataset protocol
I'd also like to perhaps experiment with a reducer-based implementation... but I've not seen many examples of people using reducers for I/O
rickmoynihan: AFAIK incanter 1.9 (and later 2.x) uses core.matrix.dataset
rickmoynihan: iota is a good one to look at for reducers and IO https://github.com/thebusby/iota
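IIRC the usage is roughly this (a sketch from memory, not verbatim from the README):
(require '[iota]
         '[clojure.core.reducers :as r]
         '[clojure.string :as str])
;; mmap the file and fold over its lines in parallel, without loading it on-heap
(->> (iota/vec "big-file.tsv")
     (r/filter identity)   ;; drop any nil lines
     (r/map #(count (str/split % #"\t")))
     (r/fold +))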
I really should look again at what you have done in grafter
@otfrom: yeah I know incanter plans to use core.matrix... but incanter 1.9 is basically a snapshot release... and there's been almost no movement on incanter for a long time, as far as I can see
I've actually been looking at iota - it's one of the few examples of reducers and I/O that I've found - from the little I've seen it seems to assume too much about the file parsing...
but I need to look at it in more depth
@rickmoynihan: got schema and core.matrix.datasets working together
matty.core> (def DataSet {:column-names [s/Keyword]
                          :columns [[s/Num]]
                          :shape [(s/one s/Num "x-shape")
                                  (s/one s/Num "y-shape")]})
;; => #'matty.core/DataSet
matty.core> (def foo (ds/dataset [:a :b :c] [[10 11 12] [20 21 22]]))
;; => #'matty.core/foo
matty.core> (s/validate DataSet foo)
;; => {:column-names [:a :b :c], :columns [[10 20] [11 21] [12 22]], :shape [2 3]}
not quite sure what my problem was before
ds is [clojure.core.matrix.dataset :as ds]
cool
so I just need to constrain the column-names to the keywords I want
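e.g. something like this, building on the DataSet schema above:
;; pin the column names down exactly with s/eq
(def MyDataSet (assoc DataSet :column-names (s/eq [:a :b :c])))
(s/validate MyDataSet foo)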
not sure if I can do coercion yet, but at least I can do validation
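schema.coerce might get you there - an untested sketch, assuming string column names coming in (works on a plain map at least; not sure how it'd interact with an actual Dataset record):
(require '[schema.coerce :as coerce])
;; json-coercion-matcher knows how to turn strings into keywords for s/Keyword
(def coerce-dataset (coerce/coercer DataSet coerce/json-coercion-matcher))
(coerce-dataset {:column-names ["a" "b" "c"]
                 :columns [[10 20] [11 21] [12 22]]
                 :shape [2 3]})
;; => {:column-names [:a :b :c], ...} or a schema.utils.ErrorContainer on failure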
I noticed the other day that core.matrix has a column-wise representation now - I'm guessing the protocol doesn't require that
Regarding Grafter - our use cases are probably a little different.
Firstly we have to avoid using incanter quite a lot, because incanter is eager... so our API isn't as expressive as what incanter provides... again we've been preferring laziness to eagerness (though that brings its own problems for sure)
Also the main idea with Grafter was to support an OpenRefine-like interface for building transformations - so the DSL functions are intentionally dumbed down for that reason. Also, syntactically we thread the ds through the first argument of functions rather than the last - mainly because I wanted the option of optional arguments on dataset functions in the DSL.
The basic idea is that each step in a -> is an undo point - allowing stepwise debugging at the granularity of the dataset functions via the UI
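so a pipeline reads something like this (a sketch - function names are illustrative rather than the exact DSL):
;; ds flows through the first argument; each step below is an undo point in the UI
(-> (read-dataset "input.csv")
    (drop-rows 1)                                   ;; trailing optional args stay available
    (derive-column :full-name [:first :last] str))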
rickmoynihan: https://github.com/Swirrl/grafter/pull/61
😉
thanks :simple_smile:
I'm still a wiki gnome at heart ❤️
hmm the api docs also need updating - we're on 0.7.0 now
@otfrom: You can see a prototype Grafter UI that was built by SINTEF (an FP7 project partner from the DaPaaS project) as the centrepiece of this: https://datagraft.net/ Basically we'd decided to build Grafter to support a UI that we were planning to build as part of our product - but we didn't have enough resources in the project, so we let SINTEF (a Norwegian research institution) build a prototype UI for us... They took a fair bit of direction from us about what to do and how to do it - and they did a pretty good job - but it's very much prototype quality
there's a youtube video on that page you can watch
cool. Will have a look
One thing I've been wondering is whether it'd also be possible to have reducer/channel/seq (and therefore transducer) backed datasets... from your core.matrix experience, would this be possible with the API?
I mean now that there's a Dataset protocol presumably you could do this
IIRC mikera was suggesting getting at each row and processing it that way, in a transducer (and, I presume, reducer) style
I think it partly comes down to whether the backing matrix implementation is faster than the trans|reducer would be
as a lot of the performance stuff is baked into the matrix implementations themselves
that's my understanding too...
As I said though - our use case is perhaps a little different - in that firstly there isn't really a suitable backing matrix implementation that I know of - and we want to avoid loading the file into RAM - so the idea in e.g. grafter 2 would be to just build up a reducer inside the Dataset somehow - I'm guessing we could perhaps use the reduce protocols (CollReduce / IReduceInit) for this... as our representation is currently #Dataset {:rows (...) :column-names [:foo :bar :baz]}
and we'd need operations to keep the column names in sync with the row data.
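something like this maybe - a rough sketch (not what grafter currently does), where reduce just walks the lazy :rows:
(require '[clojure.core.protocols :as p])

;; a record-backed Dataset whose reduce walks the (possibly lazy) row seq
(defrecord Dataset [column-names rows]
  p/CollReduce
  (coll-reduce [_ f] (reduce f rows))
  (coll-reduce [_ f init] (reduce f init rows)))

(def ds (->Dataset [:foo :bar :baz]
                   (map (fn [i] {:foo i :bar (* 2 i) :baz (* 3 i)})
                        (range 1000000))))

;; rows are only realised as the reduce consumes them
(reduce (fn [acc row] (+ acc (:bar row))) 0 ds)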
I'm not quite sure how we could back it with a transducer yet though... as I'm not sure there are protocols for that
but it'd be very cool if you could switch a dataset between being pull-based, push-based, reducible/foldable, and sequence-able - but I think I need to learn a lot more about reducers and transducers first
I've not made much use of either yet
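though presumably the appeal would be something like this - the same transducer reused in pull, push, and reducible contexts (rough sketch):
(require '[clojure.core.async :as async])

(def rows [{:foo 1 :bar 2} {:foo 3 :bar 5} {:foo 4 :bar 6}])

(def xf (comp (filter (comp even? :bar))
              (map :foo)))

(sequence xf rows)                 ;; pull: lazy seq => (1 4)
(reduce + 0 (eduction xf rows))    ;; reducible: work happens inside the reduce => 5
(def c (async/chan 32 xf))         ;; push: the same xf applied on a channel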