Is there any possibility to use protocol in macro during macroexpansion time?
No implementation of method: :-patterns of protocol: #'ribelo.munich/IMulti found for class: clojure.lang.Symbol
Sure. Just remember when you have Clojure forms versus runtime data. Emit code that does the right thing at runtime. You are most likely just dealing with sequences of symbols, which is what you are seeing in your error message.
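A minimal sketch of that distinction (the protocol, method, and macro names below are made up, not from this thread): calling the protocol method inside the macro body operates on the unevaluated forms, i.e. symbols, which is exactly the "found for class: clojure.lang.Symbol" error; emitting the call instead defers the dispatch to runtime.

(defprotocol IThing
  (render [this]))

(extend-protocol IThing
  clojure.lang.Keyword
  (render [k] (name k)))

;; Broken: at macroexpansion time x is whatever form was written at the call
;; site, so (render-now some-var) hands the symbol some-var to the protocol
;; and fails with an error like the one above.
(defmacro render-now [x]
  (render x))

;; Works: emit a form that calls the protocol method at runtime, once x has
;; been evaluated to an actual value.
(defmacro render-later [x]
  `(render ~x))

(comment
  (def some-var :foo)
  (render-later some-var) ;; => "foo"
  )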
Up to a minute and a half? On what hardware?
@jakub.stastny.pt_serv 90 seconds to start up a Clojure app from source is nothing... for a large app... because all the .clj files have to be compiled into memory.
When that overhead is removed -- by AOT'ing code going into the uberjar -- then the services restart in seconds, not minutes.
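For reference, a minimal Leiningen sketch of that setup (project and namespace names are placeholders): AOT-compile the code going into the uberjar so the classes ship precompiled.

;; project.clj (illustrative only)
(defproject example-app "0.1.0"
  :dependencies [[org.clojure/clojure "1.10.1"]]
  :main example-app.core                ; namespace with (:gen-class) and -main
  :profiles {:uberjar {:aot :all}})     ; compile everything for `lein uberjar`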
I have a (very large) number of gzipped files that contain one EDN form per line, so I made these two functions to process them:
;; assumes (:require [clojure.java.io :as io] [clojure.string :as str])
(defn read-gzipped
  [fname]
  (with-open [in (java.util.zip.GZIPInputStream.
                  (io/input-stream fname))]
    (slurp in)))

(defn read-edn-per-line
  [in f]
  (->> in
       str/split-lines
       (map (comp f read-string))))
I would expect these functions to parallelize well and to be able to do pmap (or upmap when using claypoole) over the list of files, however there is no difference in time whether I use pmap or map, not sure why, am I missing something?
@archibald.pontier_clo could it be that your producer is slower than your reader? i.e. doing the gunzip and slurp is slower than read-string?
I am not sure, but how would that stop calling read-gzipped + read-edn-per-line from having the same speed in a map and in a pmap? Because even if read-gzipped is slow, I should be able to read multiple files at once, no?
@archibald.pontier_clo Perhaps I misunderstood where your pmap was. I thought it was in place of the map at the end of read-edn-per-line, not that you were pmap'ing your list of files. The other option is that it's so fast the overhead of the parallelism makes it the same speed. pmap does have some footguns due to laziness, and I'm not sure if those might apply here; it depends on how you consume the sequence afterwards.
@archibald.pontier_clo Are you sure you are measuring the complete result? pmap is semi-lazy, so unless you are forcing the whole result you may not be getting accurate times.
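To illustrate the semi-laziness point (the sleep times are arbitrary): timing the pmap call alone only measures building the lazy seq, not the work.

;; returns almost immediately -- nothing has been realized yet
(time (pmap (fn [_] (Thread/sleep 1000)) (range 8)))

;; forcing the whole result with doall measures the actual parallel work
(time (doall (pmap (fn [_] (Thread/sleep 1000)) (range 8))))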
@seancorfield I do a pmap followed by a mapcat and doall (last call), that should be fine, no?
@dominicm reading one file takes 20s+ so I don't think the overhead plays a role there
@archibald.pontier_clo to confirm, your full code is (pmap #(read-edn-per-line (read-gzipped %) %) ["file-1" "file-2"]) ?
(->> selected-days
     (pmap
      (fn [f]
        (-> (str f "/" ticker ".txt.gz")
            utils/read-gzipped
            ;;(utils/read-edn-per-line parse-line)
            )))
     (mapcat identity)
     doall)
I removed read-edn-per-line for the moment
How many selected-days are we talking here?
10 for now (I'm testing on a small subset of files for the moment)
Slurp+read-string is generally horrendous, use read
@hiredman I think it's newline-separated files, so it would be a map over .readLine (which isn't there on InputStreams).
Yes, one edn per line
Read will handle that fine
You're right, just needs repeated calls to read.
That is more or less what a clojure source file is
My bad 🙂
Pmap entangles a lot of things so it is tricky to understand. Pmap limits its parallelism to the number of cores the java runtime reports
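If you want to see the number pmap keys off of, this standard JDK call reports it (pmap's look-ahead is roughly this value plus two):

(.availableProcessors (Runtime/getRuntime))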
I need to convert the gzip stream to a stream that read understands though
Yes, java.io.PushbackReader
You may need to wrap in a reader first via http://clojure.java.io/reader
That's where the core limiter is, I knew there must be one around there somewhere. pmap is an interesting beast 😛
There's also a lot of environment involved here: If you've only got a couple of cores, (I only have 4 for example) then you're not going to get loads of parallelism here. Although I am surprised you're seeing absolutely no speedup. I'd expect it to be less than 200s.
class java.io.BufferedReader cannot be cast to class java.io.PushbackReader (java.io.BufferedReader and java.io.PushbackReader are in module java.base of loader 'bootstrap')
I found a previous slack thread on that topic, doesn't seem straightforward but I'll figure it out
some, but nothing
https://clojurians-log.clojureverse.org/clojure/2018-04-16 here
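Putting the pieces from this thread together, a minimal sketch of reading EDN forms straight off a gzipped file with repeated reads instead of slurp + split-lines (the function name and the eager realization are just one way to do it):

(require '[clojure.java.io :as io]
         '[clojure.edn :as edn])

(defn read-gzipped-edn
  "Eagerly read every EDN form from a gzipped file, applying f to each."
  [fname f]
  (let [eof (Object.)]                                   ; unique end-of-stream sentinel
    (with-open [rdr (java.io.PushbackReader.
                     (io/reader (java.util.zip.GZIPInputStream.
                                 (io/input-stream fname))))]
      (loop [acc []]
        (let [form (edn/read {:eof eof} rdr)]
          (if (identical? form eof)
            acc
            (recur (conj acc (f form)))))))))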
@archibald.pontier_clo For comparison, how long does this take? (time (doall (pmap #(Thread/sleep (+ 5000 %)) (range 20))))
That should give some idea of parallelism available to you.
5s
If you use an ExecutorService and an ExecutorCompletionService instead of pmap, you have a lot more visibility and control.
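A minimal sketch of that approach (the pool size, function names, and eager collection are illustrative choices, not from this thread):

(import '(java.util.concurrent Executors ExecutorCompletionService))

(defn process-files
  "Run (process-file f) for every file on a fixed-size pool and return the
  results in completion order."
  [process-file files n-threads]
  (let [pool (Executors/newFixedThreadPool n-threads)
        ecs  (ExecutorCompletionService. pool)]
    (try
      (doseq [f files]
        ;; Clojure fns implement Callable, so they can be submitted directly
        (.submit ecs (fn [] (process-file f))))
      ;; take/get one result per submitted task, forcing before shutdown
      (doall (repeatedly (count files) #(.get (.take ecs))))
      (finally
        (.shutdown pool)))))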
Isn't this what is under the hood in the claypoole library?
Again, pmap is tricky; I forget if the specialized range type implements chunking, but chunking does weird things to pmap's attempts to limit parallelism
Ok thank you I'll look into it
That's true again. But it at least indicates there are enough cores around to make use of this parallelism.
Your process may just be io bound, such that your io requests are queuing sequentially somewhere else (os kernel, disk driver, etc.), so any parallelism in dispatching the requests doesn't result in faster processing
I was trying to figure out which profiler or debugging tool would give insight into this, and I wasn't sure.
that might be it
Out of spite I ran that code on my prod server (significantly more powerful / better ssd than my laptop), and there pmap gives a very nice speed boost
And thanks for the help! :hugging_face:
Is lein still the best build system?
@honza "best" is subjective, but it's the most complete solution. there is also now deps.edn
which is more "decomplected": it does less and you can build tooling around this (which people have done and more to come)
Lein has a lot of power (from its existing ecosystem to its unquoting and middleware systems), but deps.edn has shown the right way for a number of things (single JVM per task, first-class git dependencies, composable aliases). All those are technically possible in Lein but not the default... In an ideal world Lein would pick up some insights or even implementation details from deps.edn. In practice it would be quite a lot of work, like so many things in OSS.
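For a concrete picture of the deps.edn side mentioned above (all coordinates below are placeholders, not real recommendations):

;; deps.edn -- illustrative only
{:deps {org.clojure/clojure {:mvn/version "1.10.1"}
        ;; first-class git dependency (placeholder url/sha)
        my.org/my-lib {:git/url "https://github.com/my-org/my-lib"
                       :sha     "0000000000000000000000000000000000000000"}}
 :aliases
 {;; composable aliases: `clj -A:dev:test` merges both
  :dev  {:extra-paths ["dev"]}
  :test {:extra-paths ["test"]
         :extra-deps  {org.clojure/test.check {:mvn/version "1.0.0"}}}}}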