clojure-europe

For people in Europe... or elsewhere... UGT https://indieweb.org/Universal_Greeting_Time
slipset 2020-10-19T06:07:44.042800Z

God morgen!

ordnungswidrig 2020-10-19T07:07:10.043Z

Guten Morgen!

borkdude 2020-10-19T08:02:15.043600Z

Morning! This weekend I tried to make the finishing touches on https://github.com/babashka/babashka.process.

2πŸŽ‰
2020-10-19T08:26:58.043800Z

morning

thomas 2020-10-19T09:40:42.044300Z

mogge

borkdude 2020-10-19T09:57:02.044500Z

Nifty feature:

1πŸ‘
2020-10-19T10:33:09.047800Z

what do people reach for when they want to do lots of streaming calcs on the fly, but share the early stages of it (does this make sense?). I'm doing things where I'm parsing quite large files and then doing some basic cleaning of what comes out of the files and then I'm doing different things based on that processed input. Usually I reach for core.async for this and a mult on the core bit of processing with some kind of reduce on the last channels. I'd like if possible to use as many cores as possible given that the jobs are parallel, but I'd also like to keep the amount of memory down. Any thoughts?

2020-10-19T10:36:26.049Z

I'd use transduce personally

dominicm 2020-10-19T10:36:27.049100Z

It's a shame that parallel transducers never happened.

dominicm 2020-10-19T10:36:32.049400Z

Hehe.

2020-10-19T10:36:41.049600Z

oh hang on theres a bit of a blog post on that

dominicm 2020-10-19T10:37:17.050400Z

There's the core.async parallel transducer context :)

2020-10-19T10:38:19.050500Z

er this https://juxt.pro/blog/multithreading-transducers

2020-10-19T10:39:07.050800Z

essentially funnels each value into its own future then run a few entries behind in order to give the futures a chance to execute before you deref them

2020-10-19T10:42:23.051200Z

there's a bug in that blogpost code, BTW which is fixed in https://github.com/hammonba/clojure-attic/blob/master/src/clj/attic/parallelising_map.clj#L13

2020-10-19T10:43:37.051500Z

(you have to 'close off', the transducer by calling the arity-1 reducing function; if you forget to do that then you might lose some data)

2020-10-19T10:57:17.052Z

so you have seperate functions for β€’ scrubbing data β€’ computing some kind of dispatching value β€’ multitimethods to seperately compute the different streams Glued together using tranducing map and parallelised using something like that blogpost I linked to

2020-10-19T11:07:45.052800Z

I think what I mean is that I like composing my code, but I don't have good ways of composing my results as they flow

2020-10-19T11:08:16.053600Z

so I can do

(comp
  (map this-fn)
  (filter other-fn))

borkdude 2020-10-19T11:08:35.054400Z

@otfrom is sequence the right thing here? it caches results so then you will share the early results?

2020-10-19T11:09:03.054700Z

but I don't see a good way of using the results of that w/o doing transduce/into/reduce multiple times or going to core.async/mult

2020-10-19T11:09:52.055800Z

@borkdude sequence is right if I was just after caching, but the data is large enough that I don't want all of the intermediate values in memory if I can avoid it

2020-10-19T11:10:34.056400Z

and the end of my pipes I do a lot of side effecting (writing to files for me atm, tho it has been a data base or msg stream/q in the past)

borkdude 2020-10-19T11:11:05.056600Z

maybe use a database ;)

2020-10-19T11:11:23.057Z

it might come to that

2020-10-19T11:11:32.057300Z

atm I can just run it mulitiple times and wait

2020-10-19T11:11:49.057800Z

or go for core.async which is my usual fave way of doing this kind of thing

2020-10-19T11:13:00.058900Z

thx @ben.hammond πŸ™‚

2020-10-19T11:13:20.059400Z

I dont see a problem with (a small number) of nested transduce/into/reduce as long as you dont let it get out of control

2020-10-19T11:14:51.060600Z

I don't either. I was just getting to the tipping point on that and was wondering how others were solving the problem

2020-10-19T11:16:14.061Z

I find core.async to be awkward to debug from the repl

2020-10-19T11:16:31.061500Z

but perhaps that is because I haven't tried hard enough

2020-10-19T11:19:24.061900Z

@ben.hammond it can be

borkdude 2020-10-19T11:19:29.062200Z

core.async is optimized for correct programs

2πŸ˜†1πŸ˜…
2020-10-19T11:20:01.062800Z

(btw I find x/reduce and x/by-key to make my xf pipelines much more composable as it means I need the results of a transduce a lot less often)

2020-10-19T11:20:48.063700Z

I've gotten reasonably good at debugging core.async issues, but I mostly do it by using the xfs in a non-async way to make sure they are right and then building them up piece by piece

borkdude 2020-10-19T11:23:35.063900Z

(https://twitter.com/borkdude/status/1079319687501103105)

1
2020-10-19T11:30:31.064100Z

all programs are wrong some are useful

1πŸ‘
2020-10-19T11:31:56.065300Z

but I find the happy-hacking carefree nature of the REPL to be Clojure's killer feature certainly in terms of maintaining productivity and focus

1
dominicm 2020-10-19T11:40:46.066500Z

@otfrom core.reducers is probably what you really want.

dominicm 2020-10-19T11:40:55.066800Z

But no transducers there.

2020-10-19T11:42:28.068100Z

have you found https://clojure.org/reference/reducers to be useful, @dominicm I've been underwhelmed

2020-10-19T11:47:25.069Z

I thought the most useful things were fjfork and fjjoin, and they're both declared as private https://github.com/clojure/clojure/blob/master/src/clj/clojure/core/reducers.clj#L32-L34

2020-10-19T11:48:30.070Z

er, so I mean I did not find clojure.core.reducers to add much value on top of the raw Java Fork/Join mechanism

2020-10-19T11:51:35.071600Z

@dominicm it still doesn't give me a way to share my intermediate results. I've had good fun using them with memory mapped files though

2020-10-19T11:52:41.072100Z

when you say 'share', how do you expect to retrieve the intermediate results?

2020-10-19T11:57:30.072900Z

I'm not entirely sure about the mechanism, but I'd like to avoid re-calculating everything each time and I'd like to avoid having everything in memory all at once

2020-10-19T11:57:59.073500Z

wouldn't you just have a big map that you assoc interesting things on to

2020-10-19T11:58:30.074Z

oh you'd write to a temporary file then

2020-10-19T11:58:47.074600Z

and assoc its filename into your big map

borkdude 2020-10-19T11:58:49.074700Z

> oh you'd write to a temporary file then I said, use database.

borkdude 2020-10-19T11:59:02.075Z

;P

borkdude 2020-10-19T12:08:02.075200Z

(https://twitter.com/borkdude/status/1073212076515049472)

2πŸ˜‚
dominicm 2020-10-19T12:57:09.075700Z

I've never used the fork join mechanism. It gives me a pretty easy api to just merge together some bits.

ordnungswidrig 2020-10-19T12:57:13.075900Z

Postgres’ indeed cool.

ordnungswidrig 2020-10-19T12:57:29.076300Z

Like β€œI can declare indexes on paths in json”-cool.

dominicm 2020-10-19T12:57:50.076900Z

There's that cool library from Riverford that does that for maps

dominicm 2020-10-19T12:58:22.077400Z

@otfrom maybe titanoboa?

2020-10-19T13:06:08.078600Z

hmm... my intermediate results don't have to go into a database tho. If I'm reading from a file a lot of times what I'll end up doing is reading line by line, then splitting the lines to multiple records using mapcat and turning things into maps with useful keys and tweaked types at that point

2020-10-19T13:06:41.079300Z

so it is really just a stream by that point and not something I want to write into a database as I'd just dump the intermediate result anyway

2020-10-19T13:07:15.079700Z

most of the work is pretty boring by the time it gets to something that fits into a database

borkdude 2020-10-19T13:07:31.079900Z

@otfrom maybe redis? ;) - neh, temp file sounds fine then

2020-10-19T13:10:53.080800Z

I don't want to write out to files either. The computation usually fits into memory if I'm reasonably thoughtful about it

2020-10-19T13:12:01.082300Z

it is just coming up today b/c I have 2 charts I need to make that do different filtering on the same scrubbed source data, which is reasonably big (500MB of csv) so takes about 1 minute to run. I'd just like it to stay around that number rather than creep up as I add new things that hang off the back of the reading and scrubbing

2020-10-19T13:12:27.082800Z

@dominicm I've been keeping an eye on titanoboa. That might be the way to go, but I've not looked into it enough

mpenet 2020-10-19T13:15:57.082900Z

sqlite has also decent json support if you want something more lightweight

dominicm 2020-10-19T13:21:18.084100Z

Me neither. I'd love to hear how you get on, if it isn't/is a fit, why that is, etc. I'd love to know the fit.

2020-10-19T13:26:28.084600Z

@dominicm ah, looking at this I know why I'm not going to use it: https://github.com/mikub/titanoboa it is waaaay more than I need

2020-10-19T13:27:24.084900Z

yeah, nah

2020-10-19T13:58:24.086200Z

@otfrom; so you return a map that contains a bunch of atom values as your code processes stuff, it swap! s intermediate values into that map. isnt that the kind of thing that you are talking yourself into?

2020-10-19T13:59:51.087Z

no, b/c most of what I want to do is before I'd turn anything into a map

2020-10-19T14:00:27.087600Z

hmm... I think an eduction with a redcible that handles the lines/records from the file sounds like the way to go atm. I get the code composition if not the result composition

2020-10-19T14:01:15.088300Z

and then if I want to re-wire it later into core.async then I can as it doesn't sound like there is an alternative for this bit of it

2020-10-19T14:01:35.088800Z

once it is a map then going in and out of a database sounds reasonable

2020-10-19T14:02:06.089500Z

I suppose my issue is that I don't want to keep a database around as it is all batch, so I'd be creating a whole bunch of things just to throw them away when I produce my results

2020-10-19T14:02:17.089800Z

(I go months between getting new bits of input)

2020-10-19T14:03:00.090200Z

You're worried about the expense of a database that you only use rarely?

borkdude 2020-10-19T14:03:33.090800Z

@otfrom use redis + memory limit, if you're going to throw away the results afterwards?

2020-10-19T14:03:37.091Z

from a dependency and remembering how to do it pov

borkdude 2020-10-19T14:03:52.091200Z

if the results are bigger than memory

2020-10-19T14:05:07.092600Z

the final results for the things I do are usually quite small. I'm almost always rolling up a bunch of simulations and pushing them through a t-digest to create histograms (iqr, medians, that kind of thing). The problem is to get a new histogram for a different cut you have to push all the data through the t-digest engine again

2020-10-19T14:05:27.093100Z

so I end up calculating lots of different counts per simulation and then pushing them through

2020-10-19T14:06:12.093700Z

so, results tiny, input large-ish, but not large enough to need something like spark/titanoboa/other

2020-10-19T14:07:11.094700Z

I'm wondering if my route is eventually going to be http://tech.ml.dataset and lots of things on top of that as it seems to have lots of ways of doing fast memory mapped access

2020-10-19T14:07:13.094900Z

idk

dominicm 2020-10-19T14:56:32.095600Z

I wonder if onyx in single node mode would be good for this

dominicm 2020-10-19T14:57:27.096200Z

I bet there's a java-y thing that's not too bad either.

borkdude 2020-10-19T14:58:34.096400Z

kafka :P

dominicm 2020-10-19T14:59:02.096700Z

No, not kafka :)

dominicm 2020-10-19T14:59:12.097100Z

Kafka streams-alike though

dominicm 2020-10-19T15:00:04.098Z

But for a single instance, and presumably with some kind of web interface or something for intermediates.

mpenet 2020-10-19T15:08:46.099200Z

ChronicleQueue is a decent embedded solution for queue persistence with kafaka'ish semantics

borkdude 2020-10-19T15:09:55.099700Z

@otfrom https://github.com/mpenet/tape

2020-10-19T15:29:01.101Z

@mpenet and @borkdude ChronicleQueue and tape are on my "will use at some point" list

dominicm 2020-10-19T16:26:24.101800Z

I suppose they're more single threaded though, kafka streams is anyway iirc. But I'm no expert there.

2020-10-19T23:23:42.102400Z

I seem to be going doing the core.async route

1