Morning
Morning
Good Morning!
Good morning!
Morning
morning
Morning
morning!
Friday, yo!
re: https://clojurians.slack.com/archives/CBJ5CGE0G/p1605207260374000 I had a play and you are of course correct
for the stuff I'm doing, passing around eductions works, as they all end up in transduce or into in the end
tho having them as a sequence would be handy, as they just fit into memory
I can use it in run!
tho, which is handy. I wonder about adding an ISeq interface to the things in reducibles
I once had a fun time discovering this exact problem in the code from a highly-paid consultant, which left me a little sensitive to it
That is a fair enough reason to be touchy about it
moaning!
just thinking about the trade-offs between eduction/sequence/into
eduction is going to recalculate things each time you run through it, so it is cheap in memory, but expensive in CPU
sequence realises things one at a time like eduction, but keeps the results in memory, so if you pass it around to other things they get to use the cached values. It will only realise as much of the underlying thing as you ask for tho, so if you don't need all the data then it won't get it all
AFAIU it also doesn't do chunks the way seq does
into will greedily realise everything at the beginning, so if you are always going to want all of it then it is a good replacement for sequence
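roughly, in code (a sketch; expensive-parse and lines are made-up names):
(def xf (map expensive-parse))

(def e (eduction xf lines)) ;; no caching: re-runs expensive-parse on every reduce/into
(def s (sequence xf lines)) ;; lazy and cached: realises items on demand, then keeps them
(def v (into [] xf lines))  ;; eager: realises everything up front, all in memory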
are those good rules of thumb?
sounds good to me
@ben.hammond so calling
(sequence (take 10) (seq eduction-thingie))
works. I'm trying to figure out the downside (other than seq realising things in 32-element chunks, I think)
I would question what the eduction
is actually buying you
atm, eduction is wrapping up some IO on a csv
(sequence (comp (take 10) xform-previously-hidden-inside-eduction) coll)
may work just as well
the InputStream is pointing at something largish
and it will be massively larger if the reducible transit on top of Fressian works and is performant enough
so I want a sequence that manages the file handle using reducible (I think)
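the reducible half of that is something like this sketch (csv-reducible is a made-up name):
(require '[clojure.java.io :as io])

(defn csv-reducible [path]
  (reify clojure.lang.IReduceInit
    (reduce [_ f init]
      ;; open/close the reader inside reduce, so the file handle
      ;; lives exactly as long as a single traversal
      (with-open [rdr (io/reader path)]
        (reduce f init (line-seq rdr))))))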
so those two things are fundamentally in tension because you don't know when a sequence's resources may be disposed of
as eduction returns an iterable thing that can be handed to seq, I thought this might be my escape hatch
yeah, I agree that they are in tension
reducibles know exactly when they are no longer required
sequences do not
it might just be a dumb idea and the reason to stick to eduction and reducible is to close things down ASAP
(and I'm happy for that to be the answer)
so you could end up contorting yourself into things like
(defn lazywalk-reducible
  "Walks the reducible in chunks of size n;
  returns an Iterable that permits access."
  [n reducible]
  (reify java.lang.Iterable
    (iterator [_]
      (let [bq (java.util.concurrent.ArrayBlockingQueue. n)
            finished? (volatile! false)
            ;; the future traverses the reducible off-thread, blocking on .put
            ;; whenever the queue already holds n un-consumed items
            traverser (future
                        (reduce (fn [_ v] (.put bq v)) nil reducible)
                        (vreset! finished? true))]
        (reify java.util.Iterator
          (hasNext [_] (or (false? @finished?) (false? (.isEmpty bq))))
          ;; NB racy: hasNext can report true just as the producer finishes
          ;; with the queue empty, leaving this .take blocked forever
          (next [_] (.take bq)))))))
I appreciate that code can be made more concise... finished?
is superfluous
but if you want to end up with a sequence, and you only have a reducible to plug into it
I don't see how you can avoid this
and it has the downside that if you don't walk the entire Iterator, then it leaks resources
I suppose the eduction makes it pretty obvious that the thing on disk is fundamentally mutable too
er, does it? I don't follow
every time you query the csv file you get a different sequence of lines?
not usually in practice, but fundamentally. Another process could be writing things into that file (which would probably mess things up, but is what the OS allows)
I've always ended up with a single
(transduce
 do-my-inputdata-transformations-on-lines-of-csv
 write-my-outputs-insomeway-that-can-handle-it
 initialise-my-outputs-somehow
 dredge-my-enormous-csvfile-for-its-current-status)
kind of thing
when I've had to do this kind of processing that's what my stuff looks like, but I've got lots of different cuts I need to do-my-inputdata-transformations-on-lines-of-csv
and write-my-outputs-insomeway-that-can-handle-it
up to a few hundred atm, of the same files
which is why I keep coming back to core.async to do it
is the file always available in its entirety?
but being able to reuse reducing step functions and transducer pipelines for one-off things using eduction is handy while developing
do you have to wait for it to arrive in dribs and drabs?
yeah, it is always available. It is all pretty batchy
and it is usually files rather than file
at least 2, often 30ish
so if you compose the xform functions doesn't that give you the same thing as the eduction? just not tied to a specific input
(which is probably a good thing)
it does, I just usually need to compose some on top of others
you can compose infinitely deeply
so I'll do a basic "read it and clean it" and then I'll have others that add particular derived fields, or do some filtering, or reduce things and then spit out a large reduced thing
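e.g. (a sketch, all names made up):
(def read-and-clean (comp (map parse-line) (remove blank-record?)))
(def with-derived   (comp read-and-clean (map add-derived-fields)))
(def only-active    (comp with-derived (filter active-record?)))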
composure all the way down
yeah
but being able to cache the intermediate results w/o having to go back to the original files usually gives me a good speed boost
if I have enough RAM I can do that with into
but an eduction does not do caching?
no, eduction doesn't which is why I was looking at putting it into a sequence
but I should probably just do (into [] ...)
in that instance
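sketch, reusing the made-up names from above: pay the IO and parsing once, then take the different cuts off the cached vector
(def cleaned (into [] read-and-clean (csv-reducible "data.csv")))

(into [] (map add-derived-fields) cleaned)
(into [] (filter active-record?) cleaned)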
so you are back to writing intermediate results into postgres? 8P
I've avoided that so far. Creating postgres tables on the fly and introducing a big external dependency for something batch based like this feels like a big pain.
are these intermediate results infinitely long?
often the intermediate results are bigger than is easily handled in RAM
are they manageable
not infinite
but just big enough that I worry about -Xmx and the OOM Killer
ah right so you want to have a lot of simultaneous calculations so that you can leverage
(at least if I'm doing it on my laptop)
and then a moving window of intermediate calcs
yeah, I've got cores sitting idle (which is why I'd like to use core.async)
hence the core.async
and the size of the data pushes at the edges of 8/12/16GB
on a 16GB machine a -Xmx of 12GB is about the most I'm happy with to consistently avoid the OOM Killer
and then I need to shut down all browsers/slack/etc
that is doing it single threaded and in memory
it got better when I moved from ->>
to using transducers
could be a job for https://clojure.github.io/clojure/clojure.core-api.html#clojure.core/agent ?
lots of speed up and fewer OutOfMemoryErrors
there's a thing I've never said before...
each agent running a transducer to process its own bit
and punting its results out to other agents in the reducing function
maybe?
agent and send-off?
kinda thing
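something in this direction, maybe (very rough sketch, same made-up names; one agent per file rather than a real pipeline)
(def workers
  (mapv (fn [path]
          (doto (agent nil)
            ;; send-off uses a thread pool suited to blocking IO
            (send-off (fn [_] (into [] read-and-clean (csv-reducible path))))))
        csv-paths))

(apply await workers)                 ;; block until every file is processed
(def all-rows (mapcat deref workers))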
that's not a terrible idea
I'll think about that
feels a bit like I'm reimplementing a small buggy subset of core.async 😉
what you gain on the swings...
at least w/core.async other people are looking at the guts of the pipeline too and I can always ask Alex for design advice 😄
core.async channel + mult = the shareable eduction ?
async/reduce is just transduce
just have to set up the mechanism before pushing the batch data through
and having easy parallelism in pipeline-blocking is nice
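e.g. (sketch, same made-up names; the 8s are plucked from the air)
(require '[clojure.core.async :as a])

(let [in  (a/to-chan! csv-paths)
      out (a/chan 8)]
  (a/pipeline-blocking 8   ;; number of worker threads
                       out
                       (map #(into [] read-and-clean (csv-reducible %)))
                       in)
  ;; a/reduce drains the output channel into a single collection
  (a/<!! (a/reduce into [] out)))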
interesting that there is another data engineering/data science DAG out there: https://clojurians.slack.com/archives/C0BQDEJ8M/p1605250177093100
feels like everyone is doing one. I quite like being able to do it in just a library and have things like transducers and reducing step functions work in lots of different ways
I wish transducers had figured out shared state
in what way?
Well, pipeline can't use any stateful transducers, like distinct.
got it
I'm a bit in two minds about stateful transducers.
They are very useful, but it feels a bit like systems that abstract out network calls. They are very easy, but they are hiding a lot of things underneath that you might want control over or a view of, or that will fail in ways you wouldn't expect
I've had some problems with things like x/by-key in transducers as there is a bug in core.async(?) that means the completing arity of the reducing bit is getting called multiple times
and obviously it changes how I need to reason about things moving through a core.async system as some channels will be storing up big memory stores of data rather than passing it downstream
It might just be as simple as there's no clear visual indicator when you're dealing with a stateful rather than a stateless transducer, and limited guidance on when to use what
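e.g. nothing at the call site tells you these two behave differently (xs made up):
(into [] (map inc) xs)  ;; stateless transducer
(into [] (distinct) xs) ;; stateful: hangs on to a set of everything seen so far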
distinct!
?
it is definitely becoming more embodied knowledge and lore