clojure-europe

For people in Europe... or elsewhere... UGT https://indieweb.org/Universal_Greeting_Time
dominicm 2020-11-13T04:23:37.374500Z

Morning

2020-11-13T06:39:42.374700Z

Morning

dharrigan 2020-11-13T06:40:20.374900Z

Good Morning!

slipset 2020-11-13T06:48:29.375300Z

God morgen!

jasonbell 2020-11-13T08:13:56.375500Z

Morning

thomas 2020-11-13T08:16:59.375800Z

morning

2020-11-13T08:29:13.376Z

Morning

plexus 2020-11-13T08:38:08.376500Z

morning!

plexus 2020-11-13T08:40:27.376700Z

Friday, yo!

2020-11-13T08:59:55.377600Z

re: https://clojurians.slack.com/archives/CBJ5CGE0G/p1605207260374000 I had a play and you are of course correct

2020-11-13T09:00:16.378100Z

for the stuff I'm doing, passing around eductions works, as they all end up in transduce or into in the end

2020-11-13T09:00:51.378600Z

tho having them as a sequence would be handy as they just fit into memory

2020-11-13T09:08:15.379300Z

I can use it in run! tho, which is handy. I wonder about adding an ISeq interface to the things in reducibles

2020-11-13T10:15:57.380500Z

I once had a fun time discovering this exact problem in code from a highly-paid consultant, which left me a little sensitive to it

2020-11-13T10:18:51.380900Z

That is a fair enough reason to be touchy about it

borkdude 2020-11-13T10:27:51.381100Z

moaning!

2020-11-13T10:32:05.381500Z

just thinking about the trade offs between eduction/sequence/into

2020-11-13T10:32:46.382400Z

eduction is going to recalculate things each time you run through it, so it is cheap in memory, but expensive in CPU

2020-11-13T10:33:46.383700Z

sequence realises things one at a time like eduction, but keeps the results in memory, so if you pass it around to other things they get to use the cached values. It will only realise as much of the underlying thing as you ask for tho, so if you don't need all the data then it won't get it all

2020-11-13T10:34:05.384100Z

AFAIU it also doesn't do chunks the way seq does

2020-11-13T10:35:06.385100Z

into will greedily realise everything at the beginning, so if you are always going to want all of it then it is a good replacement for sequence

2020-11-13T10:35:11.385300Z

are those good rules of thumb?

borkdude 2020-11-13T10:40:58.385700Z

sounds good to me
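Those rules of thumb can be made visible with a throwaway side-effect counter (the atom and xform here are illustrative, not from the discussion — the counter only exists to show how often the transformation actually runs):

```clojure
;; Count how many times the xform runs to see the trade-offs.
(def calls (atom 0))
(def xf (map (fn [x] (swap! calls inc) (* x x))))

;; eduction: recomputes on every traversal — cheap in memory, costs CPU
(def e (eduction xf (range 5)))
(reduce + 0 e)
(reduce + 0 e)
@calls ;; => 10 — the xform ran over all five elements twice

;; sequence: realises lazily but caches, so a second pass reuses results
(reset! calls 0)
(def s (sequence xf (range 5)))
(doall s)
(doall s)
@calls ;; => 5

;; into: eagerly realises everything up front
(reset! calls 0)
(def v (into [] xf (range 5)))
@calls ;; => 5
```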

2020-11-13T11:06:35.387100Z

@ben.hammond so calling (sequence (take 10) (seq eduction-thingie)) works. I'm trying to figure out the downside (other than seq realising things in 32-element chunks, I think)

2020-11-13T11:07:37.387500Z

I would question what the eduction is actually buying you

2020-11-13T11:08:29.388500Z

atm, eduction is wrapping up some IO on a csv

2020-11-13T11:08:46.389200Z

(sequence (comp (take 10) xform-previously-hidden-inside-eduction) coll) may work just as well
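For comparison, a tiny sketch of the two shapes being discussed, with a stand-in xform and input collection (both hypothetical, just to make the expressions runnable):

```clojure
;; Illustrative stand-ins for the xform hidden inside the eduction
;; and whatever collection it wraps.
(def xform (map inc))
(def coll (range 100))
(def eduction-thingie (eduction xform coll))

;; seq over the eduction, with a second transducer on top
;; (works because Eduction is Iterable, so seq can walk it)
(sequence (take 10) (seq eduction-thingie))
;; => (1 2 3 4 5 6 7 8 9 10)

;; composing the xforms directly over the raw input
(sequence (comp (take 10) xform) coll)
;; => (1 2 3 4 5 6 7 8 9 10)
```

Both produce the same elements; the second form just isn't tied to a pre-built eduction.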

2020-11-13T11:09:21.389600Z

the InputStream is pointing at something largish

2020-11-13T11:10:50.390400Z

and it will be massively larger if the reducible transit on top of Fressian works and is performant enough

2020-11-13T11:12:20.391Z

so I want a sequence that manages the file handle using reducible (I think)

2020-11-13T11:13:16.392200Z

so those two things are fundamentally in tension, because you don't know when a sequence's resources may be disposed of

2020-11-13T11:13:18.392400Z

as eduction returns an iterable thing that can be handed to seq, I thought this might be my escape hatch

2020-11-13T11:13:54.392900Z

yeah, I agree that they are in tension

2020-11-13T11:14:02.393200Z

reducibles know exactly when they are no longer required

2020-11-13T11:14:07.393500Z

sequences do not

2020-11-13T11:14:21.393900Z

it might just be a dumb idea and the reason to stick to eduction and reducible is to close things down ASAP

2020-11-13T11:14:35.394200Z

(and I'm happy for that to be the answer)

2020-11-13T11:15:59.395Z

so you could end up contorting yourself into things like

(defn lazywalk-reducible
  "walks the reducible in chunks of size n,
  returns an iterable that permits access"
  [n reducible]
  (reify java.lang.Iterable
    (iterator [_]
      (let [bq (java.util.concurrent.ArrayBlockingQueue. n)
            finished? (volatile! false)
            traverser (future (reduce (fn [_ v] (.put bq v)) nil reducible)
                              (vreset! finished? true))]
        (reify java.util.Iterator
          (hasNext [_] (or (false? @finished?) (false? (.isEmpty bq))))
          (next [_] (.take bq)))))))

2020-11-13T11:17:34.396500Z

I appreciate that code can be made more concise... finished? is superfluous. But if you want to end up with a sequence, and you only have a reducible to plug into it, I don't see how you can avoid this

2020-11-13T11:20:35.397200Z

and it has the downside that if you don't walk the entire Iterator, then it leaks resources

2020-11-13T11:22:42.397700Z

I suppose the eduction makes it pretty obvious that the thing on disk is fundamentally mutable too

2020-11-13T11:23:12.398Z

er, does it? I don't follow

2020-11-13T11:24:33.398500Z

every time you query the csv file you get a different sequence of lines?

2020-11-13T11:25:39.399300Z

not usually in practice, but fundamentally. Another process could be writing things into that file (which would probably mess things up, but is what the OS allows)

2020-11-13T11:27:56.401200Z

I've always ended up with a single

(transduce
  do-my-inputdata-transformations-on-lines-of-csv
  write-my-outputs-insomeway-that-can-handle-it
  initialise-my-outputs-somehow
  dredge-my-enormous-csvfile-for-its-current-status)
kind of thing when I've had to do this kind of processing

2020-11-13T11:29:58.402200Z

that's what my stuff looks like, but I've got lots of different cuts I need to do-my-inputdata-transformations-on-lines-of-csv and write-my-outputs-insomeway-that-can-handle-it

2020-11-13T11:30:19.402500Z

up to a few hundred atm of the same files

2020-11-13T11:30:41.402800Z

which is why I keep coming back to core.async to do it

2020-11-13T11:31:14.403700Z

is the file always available in its entirety?

2020-11-13T11:31:15.403800Z

but being able to reuse reducing step functions and transducer pipelines for one off things using eduction is handy while developing

2020-11-13T11:31:26.404200Z

do you have to wait for it to arrive in dribs and drabs?

2020-11-13T11:31:34.404400Z

yeah, it is always available. It is all pretty batchy

2020-11-13T11:32:01.405300Z

and it is usually files rather than file

2020-11-13T11:32:11.405700Z

at least 2, often 30ish

2020-11-13T11:32:38.406400Z

so if you compose the xform functions doesn't that give you the same thing as the eduction? just not tied to a specific input

2020-11-13T11:32:57.406800Z

(which is probably a good thing)

2020-11-13T11:33:08.407100Z

it does, I just usually need to compose some on top of others

2020-11-13T11:33:35.408100Z

you can compose infinitely deeply

2020-11-13T11:33:40.408400Z

so I'll do basic "read it and clean it" and then I'll have others that add particular derived fields or do some filtering or reduce things and then spit out a large reduced thing

2020-11-13T11:33:43.408700Z

composure all the way down

2020-11-13T11:33:47.408900Z

yeah

2020-11-13T11:34:12.409800Z

but being able to cache the intermediate results w/o having to go back to the original files usually gives me a good speed boost

2020-11-13T11:34:33.410400Z

if I have enough RAM I can do that with into

2020-11-13T11:34:35.410500Z

but an eduction does not do caching?

2020-11-13T11:34:53.411100Z

no, eduction doesn't which is why I was looking at putting it into a sequence

2020-11-13T11:35:07.411700Z

but I should probably just do (into [] ...) in that instance

2020-11-13T11:35:07.411800Z

so you are back to writing intermediate results into postgres? 8P

2020-11-13T11:36:03.412700Z

I've avoided that so far. Creating postgres tables on the fly and introducing a big external dependency for something batch based like this feels like a big pain.

2020-11-13T11:37:08.413300Z

are these intermediate results infinitely long?

2020-11-13T11:37:11.413500Z

often the intermediate results are bigger than is easily handled in RAM

2020-11-13T11:37:13.413700Z

are they manageable?

2020-11-13T11:37:19.413800Z

not infinite

2020-11-13T11:37:32.414300Z

but just big enough that I worry about -Xmx and the OOM Killer

2020-11-13T11:37:46.414800Z

ah right, so you want to have a lot of simultaneous calculations so that you can leverage

2020-11-13T11:37:48.415Z

(at least if I'm doing it on my laptop)

2020-11-13T11:38:08.415600Z

and then a moving window of intermediate calcs

2020-11-13T11:38:12.415800Z

yeah, I've got cores sitting idle (which is why I'd like to use core.async)

2020-11-13T11:38:13.415900Z

hence the core.async

2020-11-13T11:38:31.416300Z

and the size of the data pushes at the edges of 8/12/16GB

2020-11-13T11:38:51.416800Z

on a 16GB machine a -Xmx of 12GB is about the most I'm happy with to consistently avoid the OOM Killer

2020-11-13T11:39:03.417100Z

and then I need to shut down all browsers/slack/etc

2020-11-13T11:39:26.417500Z

that is doing it single threaded and in memory

2020-11-13T11:39:46.417900Z

it got better when I moved from ->> to using transducers

2020-11-13T11:40:01.418400Z

could be a job for https://clojure.github.io/clojure/clojure.core-api.html#clojure.core/agent ?

2020-11-13T11:40:08.418800Z

lots of speed up and fewer OutOfMemoryErrors

2020-11-13T11:40:08.418900Z

there's a thing I've never said before...

2020-11-13T11:40:34.419300Z

each agent running a transducer to process its own bit

2020-11-13T11:40:56.419800Z

and punting its results out to other agents in the reducing function

2020-11-13T11:40:58.420Z

maybe?

2020-11-13T11:41:12.420300Z

agent and send-off?

2020-11-13T11:41:23.420700Z

kinda thing

2020-11-13T11:41:23.420800Z

that's not a terrible idea

2020-11-13T11:41:28.421Z

I'll think about that
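One rough way the agent idea might look — each agent runs its own transduce over a chunk and punts its result to a collecting agent. Everything here (the xform, the chunking, the input data) is a made-up placeholder, not something from the discussion:

```clojure
;; Hypothetical sketch: one agent per chunk, each running transduce,
;; results gathered by a collecting agent via send.
(def xform (comp (map inc) (filter even?)))

(def results (agent []))

(defn process-chunk [chunk]
  ;; each chunk is reduced independently on a send-off thread...
  (let [out (transduce xform + 0 chunk)]
    ;; ...and its result is punted to the collecting agent
    (send results conj out)
    out))

(def workers
  (doall
    (for [chunk (partition-all 100 (range 1000))]
      (let [a (agent chunk)]
        (send-off a process-chunk)
        a))))

(run! await workers)
(await results)
(count @results) ;; => 10
```

send-off (rather than send) keeps the potentially IO-heavy chunk processing off the fixed-size calculation thread pool.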

2020-11-13T11:43:38.421600Z

feels a bit like I'm reimplementing a small buggy subset of core.async 😉

2020-11-13T11:44:07.422Z

what you gain on the swings...

2020-11-13T11:50:47.422900Z

at least w/core.async other people are looking at the guts of the pipeline too and I can always ask Alex for design advice 😄

2020-11-13T11:51:32.423300Z

core.async channel + mult = the shareable eduction ?

2020-11-13T11:52:03.423600Z

async/reduce is just transduce

2020-11-13T11:52:20.423900Z

just have to set up the mechanism before pushing the batch data through

2020-11-13T11:52:39.424300Z

and having easy parallelism in pipeline-blocking is nice
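A rough sketch of the channel + mult idea — one producer feeds a mult, and several consumers each get their own tap with their own transducer. The channels and transducers are illustrative, and `onto-chan!` assumes a reasonably recent core.async:

```clojure
(require '[clojure.core.async :as a])

;; one source channel shared through a mult
(def src (a/chan 32))
(def m (a/mult src))

;; each consumer taps the mult with its own transducing channel
(def evens   (a/tap m (a/chan 32 (filter even?))))
(def squares (a/tap m (a/chan 32 (map #(* % %)))))

;; push the batch data through, then close the source
(a/onto-chan! src (range 10))

;; a/reduce plays the role of transduce's reducing step
(def even-sum   (a/<!! (a/reduce + 0 evens)))   ;; => 20
(def square-sum (a/<!! (a/reduce + 0 squares))) ;; => 285
```

Both taps must be attached before the data goes through, since a mult drops values for taps that aren't there yet.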

2020-11-13T13:12:25.424700Z

interesting that there is another data engineering/data science DAG out there: https://clojurians.slack.com/archives/C0BQDEJ8M/p1605250177093100

2020-11-13T13:13:01.425600Z

feels like everyone is doing one. I quite like being able to do it in just a library and have things like transducers and reducing step functions work in lots of different ways

dominicm 2020-11-13T13:17:12.426200Z

I wish transducers had figured out shared state

2020-11-13T13:17:40.426400Z

in what way?

dominicm 2020-11-13T13:20:52.427200Z

Well, pipeline can't use any stateful transducers, like distinct.

2020-11-13T14:36:47.427400Z

got it
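The distinct/pipeline problem can be seen without core.async at all: `distinct` keeps a per-instance seen-set, so giving each parallel worker its own instance (roughly what pipeline's parallelism amounts to) lets duplicates through. A small sketch with made-up data:

```clojure
(def data [1 1 2 2 3 3 1])

;; one transducer instance over the whole input: correct
(into [] (distinct) data) ;; => [1 2 3]

;; a fresh instance per chunk, as independent workers would have: wrong
(into []
      (mapcat #(into [] (distinct) %))
      (partition-all 3 data)) ;; => [1 2 2 3 1]
```

Each chunk is internally deduplicated, but the seen-sets aren't shared, so duplicates survive across chunk boundaries.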

2020-11-13T14:37:23.427800Z

I'm a bit in two minds about stateful transducers.

2020-11-13T14:38:10.428900Z

They are very useful, but it feels a bit like systems that abstract out network calls: very easy, but hiding a lot of things underneath that you might want control over or visibility into, or that will fail in ways you wouldn't expect

2020-11-13T14:39:00.429700Z

I've had some problems with things like x/by-key in transducers as there is a bug in core.async(?) that means the completing arity of the reducing bit is getting called multiple times

2020-11-13T14:39:45.430500Z

and obviously it changes how I need to reason about things moving through a core.async system as some channels will be storing up big memory stores of data rather than passing it downstream

dominicm 2020-11-13T14:41:22.432Z

It might just be as simple as there's no clear visual indicator when you're dealing with a stateful rather than stateless transducer, and limited guidance on when to use what

2020-11-13T14:42:29.432300Z

distinct! ?

2020-11-13T14:43:03.432700Z

it is definitely becoming more embodied knowledge and lore