Good morning!
Good morning!
Morning! This weekend I tried to put the finishing touches on https://github.com/babashka/babashka.process.
morning
morning
Nifty feature:
what do people reach for when they want to do lots of streaming calcs on the fly, but share the early stages of the pipeline (does this make sense?). I'm parsing quite large files, doing some basic cleaning of what comes out of them, and then doing different things based on that processed input. Usually I reach for core.async for this, with a mult on the core bit of processing and some kind of reduce on the last channels. I'd like, if possible, to use as many cores as possible given that the jobs are parallel, but I'd also like to keep the amount of memory down. Any thoughts?
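(For concreteness, a minimal sketch of that core.async shape, assuming hypothetical parse-line, scrub, chart-a? and chart-b? fns: one source channel carries the shared parse/scrub stage, a mult fans it out, and each tap gets its own filter plus reduce.)

(require '[clojure.core.async :as a])

;; shared stage as a transducer on the source channel
(def src (a/chan 32 (comp (mapcat parse-line) (map scrub))))
(def m   (a/mult src))

;; each downstream consumer taps the mult with its own filtering + reduce
(def chart-a (a/reduce conj [] (a/tap m (a/chan 32 (filter chart-a?)))))
(def chart-b (a/reduce conj [] (a/tap m (a/chan 32 (filter chart-b?)))))

;; feed the raw lines in once; both consumers see the scrubbed records
(a/onto-chan! src raw-lines) ; raw-lines is a hypothetical seq of input lines
[(a/<!! chart-a) (a/<!! chart-b)]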
I'd use transduce personally
It's a shame that parallel transducers never happened.
Hehe.
oh hang on, there's a bit of a blog post on that
There's the core.async parallel transducer context :)
essentially it funnels each value into its own future, then runs a few entries behind in order to give the futures a chance to execute before you deref them
there's a bug in that blogpost code, BTW, which is fixed in https://github.com/hammonba/clojure-attic/blob/master/src/clj/attic/parallelising_map.clj#L13
(you have to 'close off' the transducer by calling the arity-1 reducing function; if you forget to do that then you might lose some data)
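(A minimal sketch of that idea rather than the linked code itself: run f in a future per value, stay up to n futures behind before derefing, and drain what's left in the completing arity so nothing is lost.)

(defn parallelising-map
  "Like (map f), but runs f in a future per value, staying up to n behind."
  [n f]
  (fn [rf]
    (let [pending (java.util.ArrayDeque.)]
      (fn
        ([] (rf))
        ([result]
         ;; the 'close off' step: deref whatever is still in flight,
         ;; feed it downstream, then complete
         (let [result (reduce rf result (map deref pending))]
           (.clear pending)
           (rf result)))
        ([result x]
         (.add pending (future (f x)))
         (if (< (.size pending) n)
           result
           (rf result @(.poll pending))))))))

;; usage: (into [] (parallelising-map 8 expensive-fn) coll)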
so you have separate functions for:
• scrubbing data
• computing some kind of dispatching value
• multimethods to separately compute the different streams
glued together using a transducing map and parallelised using something like that blogpost I linked to
I think what I mean is that I like composing my code, but I don't have good ways of composing my results as they flow
so I can do
(comp
  (map this-fn)
  (filter other-fn))
@otfrom is sequence the right thing here? it caches results, so you will share the early results?
but I don't see a good way of using the results of that w/o doing transduce/into/reduce multiple times or going to core.async/mult
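(That's the trade-off in a nutshell: the composed xform is reusable across consumers, but each consumer is its own full pass over the input; this-fn/other-fn as above, input a hypothetical collection.)

(def xf (comp (map this-fn) (filter other-fn)))

;; each of these is a separate pass over input
(into [] xf input)           ; materialise every result
(transduce xf conj [] input) ; same thing via an explicit reducing fn
(sequence xf input)          ; cached and lazy, but holds the intermediates in memory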
@borkdude sequence would be right if I were just after caching, but the data is large enough that I don't want all of the intermediate values in memory if I can avoid it
and at the end of my pipes I do a lot of side effecting (writing to files for me atm, tho it has been a database or msg stream/q in the past)
maybe use a database ;)
it might come to that
atm I can just run it multiple times and wait
or go for core.async which is my usual fave way of doing this kind of thing
thx @ben.hammond
I don't see a problem with a (small) number of nested transduce/into/reduce as long as you don't let it get out of control
I don't either. I was just getting to the tipping point on that and was wondering how others were solving the problem
I find core.async to be awkward to debug from the repl
but perhaps that is because I haven't tried hard enough
@ben.hammond it can be
core.async is optimized for correct programs
(btw I find x/reduce and x/by-key to make my xf pipelines much more composable as it means I need the results of a transduce a lot less often)
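(e.g. with net.cgrand.xforms the aggregation lives inside the xform itself, so it composes like any other stage; :category and :amount are hypothetical keys.)

(require '[net.cgrand.xforms :as x])

;; sums :amount per :category in a single pass, as an ordinary transducer
(into {}
      (x/by-key :category :amount (x/reduce +))
      records)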
I've gotten reasonably good at debugging core.async issues, but I mostly do it by using the xfs in a non-async way to make sure they are right and then building them up piece by piece
all programs are wrong, some are useful
but I find the happy-hacking, carefree nature of the REPL to be Clojure's killer feature, certainly in terms of maintaining productivity and focus
@otfrom core.reducers is probably what you really want.
But no transducers there.
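(For reference, a minimal clojure.core.reducers sketch: fold does a parallel reduce/combine over a foldable collection such as a vector, chunking the work onto Fork/Join.)

(require '[clojure.core.reducers :as r])

;; r/map and r/filter compose like transducers; r/fold is what buys parallelism
(r/fold +
        (r/map inc
               (r/filter odd?
                         (vec (range 1000000)))))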
have you found https://clojure.org/reference/reducers to be useful, @dominicm? I've been underwhelmed
I thought the most useful things were fjfork and fjjoin, and they're both declared as private https://github.com/clojure/clojure/blob/master/src/clj/clojure/core/reducers.clj#L32-L34
er, so I mean I did not find clojure.core.reducers to add much value on top of the raw Java Fork/Join mechanism
@dominicm it still doesn't give me a way to share my intermediate results. I've had good fun using them with memory mapped files though
when you say 'share', how do you expect to retrieve the intermediate results?
I'm not entirely sure about the mechanism, but I'd like to avoid re-calculating everything each time and I'd like to avoid having everything in memory all at once
wouldn't you just have a big map that you assoc interesting things on to
oh you'd write to a temporary file then
and assoc its filename into your big map
> oh you'd write to a temporary file then
I said, use a database.
;P
I've never used the fork join mechanism. It gives me a pretty easy api to just merge together some bits.
Postgres is indeed cool.
Like "I can declare indexes on paths in json"-cool.
There's that cool library from Riverford that does that for maps
@otfrom maybe titanoboa?
hmm... my intermediate results don't have to go into a database tho. If I'm reading from a file a lot of times what I'll end up doing is reading line by line, then splitting the lines to multiple records using mapcat and turning things into maps with useful keys and tweaked types at that point
so it is really just a stream by that point and not something I want to write into a database as I'd just dump the intermediate result anyway
most of the work is pretty boring by the time it gets to something that fits into a database
@otfrom maybe redis? ;) - neh, temp file sounds fine then
I don't want to write out to files either. The computation usually fits into memory if I'm reasonably thoughtful about it
it is just coming up today b/c I have 2 charts I need to make that do different filtering on the same scrubbed source data, which is reasonably big (500MB of csv) so takes about 1 minute to run. I'd just like it to stay around that number rather than creep up as I add new things that hang off the back of the reading and scrubbing
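(One way to keep it at roughly one pass without core.async is to fuse both chart aggregations into a single reducing fn over the shared scrubbed stage; scrub-xf, chart-a?/chart-b? and summarise-a/summarise-b are hypothetical.)

;; one transduce over the scrubbed records, accumulating both charts at once
(defn both-charts [lines]
  (transduce
   scrub-xf
   (fn
     ([] {:a [] :b []})
     ([acc] {:chart-a (summarise-a (:a acc))
             :chart-b (summarise-b (:b acc))})
     ([acc record]
      (cond-> acc
        (chart-a? record) (update :a conj record)
        (chart-b? record) (update :b conj record))))
   lines))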
@dominicm I've been keeping an eye on titanoboa. That might be the way to go, but I've not looked into it enough
sqlite has also decent json support if you want something more lightweight
Me neither. I'd love to hear how you get on: whether it is or isn't a fit, and why.
@dominicm ah, looking at this I know why I'm not going to use it: https://github.com/mikub/titanoboa it is waaaay more than I need
yeah, nah
@otfrom: so you return a map that contains a bunch of atom values; as your code processes stuff, it swap!s intermediate values into that map. Isn't that the kind of thing that you are talking yourself into?
no, b/c most of what I want to do is before I'd turn anything into a map
hmm... I think an eduction with a reducible that handles the lines/records from the file sounds like the way to go atm. I get the code composition if not the result composition
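(A minimal sketch of that shape: a reducible over the lines of a file that only opens a reader when something actually reduces over it, wrapped in an eduction with the parsing/scrubbing xform; parse-line, scrub and chart-a? are hypothetical.)

(require '[clojure.java.io :as io])

(defn lines-reducible [f]
  ;; the file is opened and closed per reduction, so nothing is retained
  (reify clojure.lang.IReduceInit
    (reduce [_ rf init]
      (with-open [rdr (io/reader f)]
        (reduce rf init (line-seq rdr))))))

(def records
  (eduction (comp (mapcat parse-line) (map scrub))
            (lines-reducible "big-input.csv")))

;; every consumer re-runs the pipeline from the file; nothing is cached between passes
(transduce (comp (filter chart-a?) (map :count)) + 0 records)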
and then if I want to re-wire it later into core.async then I can as it doesn't sound like there is an alternative for this bit of it
once it is a map then going in and out of a database sounds reasonable
I suppose my issue is that I don't want to keep a database around as it is all batch, so I'd be creating a whole bunch of things just to throw them away when I produce my results
(I go months between getting new bits of input)
You're worried about the expense of a database that you only use rarely?
@otfrom use redis + memory limit, if you're going to throw away the results afterwards?
from a dependency (and remembering how to do it) pov
if the results are bigger than memory
the final results for the things I do are usually quite small. I'm almost always rolling up a bunch of simulations and pushing them through a t-digest to create histograms (IQRs, medians, that kind of thing). The problem is that to get a new histogram for a different cut you have to push all the data through the t-digest engine again
so I end up calculating lots of different counts per simulation and then pushing them through
so, results tiny, input large-ish, but not large enough to need something like spark/titanoboa/other
I'm wondering if my route is eventually going to be tech.ml.dataset and lots of things on top of that, as it seems to have lots of ways of doing fast memory-mapped access
idk
I wonder if onyx in single node mode would be good for this
I bet there's a java-y thing that's not too bad either.
kafka :P
No, not kafka :)
Kafka streams-alike though
But for a single instance, and presumably with some kind of web interface or something for intermediates.
ChronicleQueue is a decent embedded solution for queue persistence with kafka-ish semantics
I suppose they're more single threaded though, kafka streams is anyway iirc. But I'm no expert there.
I seem to be going down the core.async route