Does anyone print logging data in edn directly? The goal would be to copy paste directly to the repl. Edit: of course only in Dev mode 😃
Yes I was thinking about that but it would require more work
Pedestal at one time logged in edn to some degree
https://gist.github.com/hiredman/64bc7ee3e89dbdb3bb2d92c6bddf1ff6 is a little library for using java util logging to log in edn
this looks super cool. do you have any examples of usage?
https://gist.github.com/hiredman/3443693c5994a8b0bb0a41f068107abd
awesome, thank you!
i almost linked to that. i use it constantly now
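For anyone wondering what this looks like in the small, a dev-only sketch (a hypothetical helper, not the gist's API):
(defn log-edn
  "Print a log event as a single EDN map, ready to paste into a REPL."
  [level msg & {:as data}]
  (prn (merge {:level level :msg msg} data)))

(log-edn :info "cache miss" :key "user-42" :elapsed-ms 3)
;; => {:level :info, :msg "cache miss", :key "user-42", :elapsed-ms 3}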
People get excited about macros writing macros, but what about non-macros writing macros
> non-macros writing macros Oh, I am one! :)
What impresses me the most are those non-macros that write macros that write macros.
You mean functions that emit code as a string / .clj file? Legit.
I didn't mean for this to be enigmatic; if you look back in the main chat there is a gist I posted of some code, and it generates macros by doseq'ing over a list, interning some functions, then calling the setMacro method on the var
Thank you! that's what this code base is actually using, will check the link
A bit magical indeed 😃
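For reference, the trick looks roughly like this (my sketch with made-up macro names, not the gist's actual code):
;; a plain runtime doseq that interns fns and then flips them into macros
(doseq [[nm op] [['my-when 'if] ['my-unless 'if-not]]]
  (let [v (intern *ns* nm
                  ;; a macro fn receives &form and &env before its own args
                  (fn [_form _env test & body]
                    (list op test (cons 'do body))))]
    (.setMacro ^clojure.lang.Var v)))

(my-when true :yes)    ;; => :yes
(my-unless false :yes) ;; => :yes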
I am unsure that's possible.
Is there some kind of pfilter around? Like pmap, with its nice interface similarity with map. The lack of pfilter in the core library makes me think I might not be reasoning correctly about the problem… (Which is to filter a sequence of integers as fast as I possibly can. 😄 )
If I use criterium/quick-bench instead, the transduce and reducers wins are a bit more apparent:
filter
Evaluation count : 30 in 6 samples of 5 calls.
Execution time mean : 23,654346 ms
Execution time std-deviation : 301,776435 µs
Execution time lower quantile : 23,352585 ms ( 2,5%)
Execution time upper quantile : 24,050820 ms (97,5%)
Overhead used : 14,507923 ns
transduce
Evaluation count : 36 in 6 samples of 6 calls.
Execution time mean : 20,129352 ms
Execution time std-deviation : 595,084459 µs
Execution time lower quantile : 19,646010 ms ( 2,5%)
Execution time upper quantile : 21,079716 ms (97,5%)
Overhead used : 14,507923 ns
core.reducers/filter
Evaluation count : 36 in 6 samples of 6 calls.
Execution time mean : 17,903643 ms
Execution time std-deviation : 186,971423 µs
Execution time lower quantile : 17,675913 ms ( 2,5%)
Execution time upper quantile : 18,138291 ms (97,5%)
Overhead used : 14,507923 ns
core.async/pipeline
Evaluation count : 22230 in 6 samples of 3705 calls.
Execution time mean : 27,089463 µs
Execution time std-deviation : 136,898899 ns
Execution time lower quantile : 26,919838 µs ( 2,5%)
Execution time upper quantile : 27,291471 µs (97,5%)
Overhead used : 14,507923 ns
(For some reason, it fails to measure the pipeline code. It doesn’t in my real code.)
Interestingly (to me, at least 😄 ), for performs on par with transduce with this task:
(println "for")
(quick-bench #_time
(count
(for [i every-other
:when (aget ba i)]
i)))
for
Evaluation count : 36 in 6 samples of 6 calls.
Execution time mean : 19,456574 ms
Execution time std-deviation : 100,364503 µs
Execution time lower quantile : 19,370228 ms ( 2,5%)
Execution time upper quantile : 19,618312 ms (97,5%)
Overhead used : 14,507923 ns
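(For context, the non-for variants were shaped roughly like this; my reconstruction, not the exact benchmarked code:)
(require '[clojure.core.reducers :as r])

;; filter
(count (filter #(aget ba %) every-other))
;; transduce
(count (transduce (filter #(aget ba %)) conj [] every-other))
;; core.reducers/filter (fold wants a vector source to actually parallelize)
(count (r/foldcat (r/filter #(aget ba %) every-other)))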
Out of curiosity - what about a plain loop?
@pez I see… Ok, I think the biggest gains from pipeline are to be had when the pipeline transducer is CPU intensive (think parsing HTML into data, file compression, etc); here you have a pretty straightforward xf: (filter #(aget ba %)). Also, I think 1,000,000 samples is not that much really, so (pipeline …) would be suffering from all the channel etc. overhead of passing the data around;
Also, a side note: (time …) is almost never a good benchmark strategy (but quick-bench is); I’ve seen cases where a simple (time …) benchmark would be “slow” but quick-bench would actually show a huge improvement, since the JVM does its JIT magic and code really speeds up after a few iterations in some cases;
I think that’s a good idea @p-himik (loop []…)
That’s probably the fastest thing you can get in terms of raw single thread perf… pretty much Java speed;
BOOM
(println "loop")
(quick-bench #_time
(count
(loop [res []
i 1]
(if (<= i n)
(recur (if (aget ba i)
(conj res i)
res)
(+ i 2))
res))))
loop
Evaluation count : 84 in 6 samples of 14 calls.
Execution time mean : 7,518441 ms
filter
Evaluation count : 30 in 6 samples of 5 calls.
Execution time mean : 23,020098 ms
transduce
Evaluation count : 36 in 6 samples of 6 calls.
Execution time mean : 19,090405 ms
core.reducers/filter
Evaluation count : 42 in 6 samples of 7 calls.
Execution time mean : 16,328693 ms
for
Evaluation count : 36 in 6 samples of 6 calls.
Execution time mean : 19,678977 ms
Yup, loop is the king 🙂
If you really care about perf, I highly recommend YourKit
I bet it will help you gain 50% in no time
I’ve used it, it’s like magic; the gains will come from a place you least expect… some reflection call that’s using 50% of your CPU time
@pez Now try making res a transient. :)
transient, huh? Doin’ it!
Try also unchecked-math 🙂
In my previous adventure with single-threaded high perf, I ended up writing a Java class. :D All my data consisted of integers and Clojure doesn't really like them.
Also, http://clojure-goes-fast.com (various ideas how to go fast)
I’ll be trying YourKit too. Though only out of curiosity really. I don’t have performance tasks often. This is a little toy challenge I have, mainly to learn more about Clojure. I profile it with tufte right now, which is pretty nice.
Seems like I should be able to parallelize the loop, no?
Absolutely, your problem is a textbook map(filter)/reduce problem.
transient shaves off some more of the time, as hinted at 😃
loop
Execution time mean : 7,704050 ms
loop-transient
Execution time mean : 5,017702 ms
filter
Execution time mean : 24,047486 ms
transduce
Execution time mean : 19,687393 ms
core.reducers/filter
Execution time mean : 17,303117 ms
for
Execution time mean : 21,142251 ms
Unchecked math doesn’t seem to make much of a difference for the particular problem.
I think that's because there's only a single math operation there, and its arguments' types are well known by the compiler. If you really want to pursue it further, I would try to get the bytecode for that code and see if there's something fishy going on. I've had some success with https://github.com/gtrak/no.disassemble/ and https://github.com/clojure-goes-fast/clj-java-decompiler before.
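For example, with clj-java-decompiler (decompile is its documented entry point):
(require '[clj-java-decompiler.core :refer [decompile]])

;; prints the Java source that the Clojure compiler effectively generates
(decompile (loop [i 0, s 0]
             (if (< i 100)
               (recur (inc i) (+ s i))
               s)))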
Unchecked doesn’t attract me so much. I would rather figure out how to parallelize it. I can’t immediately see how:
(quick-bench #_time
(count
(loop [res []
i 1]
(if (<= i n)
(recur (if (aget ba i)
(conj res i)
res)
(+ i 2))
res))))
- Split ba into N chunks
- For each chunk, run a thread that creates its own res
- Combine the resulting collection of res vectors in a single vector, preserving the order
Just out of interest - why (+ i 2)? Does ba store something unrelated at even indices?
Yes, I am only interested in the odd indices. ba contains the results of an Eratosthenes sieve, where I have skipped sieving even numbers, b/c we all know there’s only one even prime number. 😃
I was hoping there was some reducer or something that would do all those steps for me.
Oh, is that code just to find prime numbers up to n?
If so, then even constructing the sieve could be made parallel. And I'm 95% certain there's already a Java library that does it. :)
Haha, I’m in this to learn about Clojure. 😃
That code is only to pick out the prime numbers I have found up to n.
Here’s the full thing, using loop and transient:
(defn pez-ba-loop-transient-sieve [^long n]
(let [primes (boolean-array (inc n) true)
sqrt-n (int (Math/ceil (Math/sqrt n)))]
(if (< n 2)
'()
(loop [p 3]
(if (< sqrt-n p)
(loop [res (transient [])
i 3]
(if (<= i n)
(recur (if (aget primes i)
(conj! res i)
res)
(+ i 2))
(concat [2] (persistent! res))))
(do
(when (aget primes p)
(loop [i (* p p)]
(when (<= i n)
(aset primes i false)
(recur (+ i p p)))))
(recur (+ p 2))))))))
I haven’t ventured into how to speed up the sieving (beyond the obvious optimizations) b/c most of the time has been spent in picking out the indices from the sieve.
I’m trying to figure out how to parallelize the work with converting my byte-array to indexes. Parallelizing with filter was so easy that I was surprised by how much things grow when I try to do it with the loop. I have this so far.
(comment
  (import '[java.util.concurrent ExecutorService Executors])
(let [n 1000000
ba (boolean-array n)
prob 0.15
sample-size (long (* n prob))]
(doseq [i (take sample-size
(random-sample prob (range n)))]
(aset ba i true))
(let [^ExecutorService
service (Executors/newFixedThreadPool 6)
^Callable
mk-collector (fn [^long start ^long end]
(fn []
(loop [res (transient [])
i start]
(if (<= i end)
(recur (if (aget ba i)
(conj! res i)
res)
(+ i 2))
(persistent! res)))))
num-slices 10
slice-size (/ n num-slices)]
(doseq [[start end] (partition
2
(interleave
(range 1 (inc n) slice-size)
(range slice-size (inc n) slice-size)))
:let [f (.submit service (mk-collector start end))]]
@f))))
There are two unsolved things here:
1. My future f contains nil even though I know that the collector I create with mk-collector produces the collection I want.
2. I don’t know how to combine my slices in the order I start the threads.
And also: this is slower than my single-thread solution. Not by very much, but anyway. Am I even on the right track?
Things that I notice immediately:
- ^Callable there marks mk-collector and not the result of calling (mk-collector ...). And that, I think, is useless because the compiler already knows that mk-collector is callable. If you want to say "mk-collector returns a callable", you have to tag its arguments list.
- Don't deref within doseq - this way, you start a thread and immediately wait for its completion, then start the second one, and so on. Instead, create a vector of futures and only then deref all of them in order. And that will be the exact order in which you have created them. In fact, you can deref them in order in reduce - it even optimizes it further, albeit not substantially, since all threads have roughly the same amount of work in your case.
- You start 6 threads but create 10 slices - why? Choose the number of threads you want to have and create the same amount of slices, one per thread.
- That (partition ...) form makes my head spin. I have a strong feeling that whatever it does could be rewritten in a much simpler way in the overall context. I might be wrong though.
Thanks. About the partition… I’m sure you are right. I had it hard coded at first and then just translated the way I hard coded it. 😃
Threads vs slices. I tried using 10 for both, but it didn’t make a difference. I have six cores on my machine so went for that, but my partition blows up with 6 slices. Haha.
> partition blows up What exactly does that mean? It was working just fine with 1 huge slice after all.
It makes a difference in the overall code - you won't need any explicit executor; you would be able to just use future. It should also make some performance difference. It might not be noticeable in this context, but in general it should exist.
Blows up means that my start and end indices get out of whack and I get index out of bounds errors. I didn’t want to focus on this before I have the basic infrastructure right.
Ah, it just means that your partition incantation is incorrect. :) It has nothing to do with threads.
Yeah, nothing to do with threads. I just didn’t succeed with this naive partition to create 6 slices for my 6 threads. But it won’t matter if I don’t need the executor service, anyway.
You should end up with something like this:
(let [n-partitions 6
;; Notice how it says `mapv` and not `map` - this is important.
;; You want to be eager to start all the futures right away.
futures (mapv (fn [partition]
(future
%magic%))
(range n-partitions))]
(into []
(mapcat deref)
futures))
Yes! Now it runs about 2X faster than the non-future version. And it even produces the right result.
(let [mk-collector (fn [^long start ^long end]
(fn []
(loop [res (transient [])
i start]
(if (<= i end)
(recur (if (aget ba i)
(conj! res i)
res)
(+ i 2))
(persistent! res)))))
num-slices 10
slice-size (/ n num-slices)
slices (partition
2
(interleave
(range 1 (inc n) slice-size)
(range slice-size (inc n) slice-size)))
futures (mapv (fn [[start end]]
(future
((mk-collector start end))))
slices)]
(into []
(mapcat deref)
futures))
Great! Although I would personally inline mk-collector. Using (( is a hint to that.
And you can do all the partition work inside future, thus making it parallel as well.
The partition work takes zero time though?
Interesting that you suggest inlining mk-collector. I thought the same, but it then started to take 3X more time…
Depends on its inputs. But moving it inside futures will make the code much simpler.
With the above I have Execution time mean : 3,113634 ms
Inlining:
(let [num-slices 10
slice-size (/ n num-slices)
slices (partition
2
(interleave
(range 1 (inc n) slice-size)
(range slice-size (inc n) slice-size)))
futures (mapv (fn [[start end]]
(future
(loop [res (transient [])
i start]
(if (<= i end)
(recur (if (aget ba i)
(conj! res i)
res)
(+ i 2))
(persistent! res)))))
slices)]
(into []
(mapcat deref)
  futures))
Execution time mean : 10,080578 ms
How about this? I haven't tested it, might not even work:
(let [num-slices 10
slice-size (int (/ n num-slices))
offset 2
_ (assert (zero? (mod slice-size offset))
"Dealing with slices that have fractional chunks would be too complicated.")
futures (mapv (fn [slice-idx]
(future
(let [start (* slice-idx slice-size)
end (if (= slice-idx (dec num-slices))
n
(+ start slice-size))]
(loop [res (transient [])
i start]
(if (< i end)
(recur (cond-> res
(aget ba i) (conj! i))
(+ i offset))
(persistent! res))))))
(range num-slices))]
(into []
(mapcat deref)
futures))
The difference in your code is not only inlining but also the lack of type hints. Try adding ^long wherever necessary.
To help further analyze such issues, always do this:
(set! *unchecked-math* :warn-on-boxed)
(set! *warn-on-reflection* true) ;; Doubt it will be useful here, but it's useful in general.
I simplified the ranges, similar to what you suggest here. I did try throwing in type hints, but it didn’t seem to bite. Will look closer at where you suggest they should be…
Unfortunately it doesn’t gain me the slightest with my prime number sieve. 😃 But this was very, very good for me to investigate and get to know a bit about, so I am good and happy. Many thanks for the guidance!
Sure thing. I'm actually quite curious why extracting that fn makes the code faster.
Oh, it didn't in the end. Setting those warning levels helped me find where I lost the time. Using quot instead of / fixed it.
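(In case it helps anyone reading along: / can return a Ratio, so the result is boxed and everything downstream falls out of primitive math, while quot stays in longs.)
(/ 1000001 10)     ;; => 1000001/10, a clojure.lang.Ratio
(quot 1000001 10)  ;; => 100000, a plain long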
I might add that the filter predicate is fast, afaik. So this note on pmap seems to tell me I should be looking for other ways to speed the process up:
> Only useful for computationally intensive functions where the time of f dominates the coordination overhead.
Might this be of use? https://github.com/reborg/parallel
Although it doesn't have a pfilter, it works with transducers so you can supply a filter: https://github.com/reborg/parallel#pfold-pxrf-and-pfolder
If your predicate is fast, why do you need pmap at all?
because the collection is huge?
Thanks @dharrigan! I’ll have a look!
in this case you might be better off with reducers perhaps
Yes, the collection can potentially be huge, and then I want it to go much quicker than it does today.
So, I filter 500K in 20ms and imagine that if all 6 cores of my machine took a slice each it would be done in less than 4ms. 😃
@pez I don't think pmap will buy you anything here. Take a look at clojure.core.reducers
reducers will slice the collection in multiple parts and then do the work on each slice in separate threads and then concat the result
this is not how pmap works
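e.g., a minimal sketch (note that fold needs a vector or a map source to actually run in parallel):
(require '[clojure.core.reducers :as r])

(def v (vec (range 1000000)))

;; fold splits v into chunks, filters each chunk on the fork/join pool,
;; and concatenates the partial results
(count (r/foldcat (r/filter odd? v)))
;; => 500000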
I will. Interestingly, @dharrigan linked to pfold from that parallel lib. 😃
I tend to want pfilter from time to time, but always procrastinate implementing one (that also suits my sensibilities)
my usual workaround is to run the predicate through pmap and then use a vanilla filter identity as the next step (which won't be parallel, but can be assumed to be fast since identity is a simple pred)
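i.e. something like this (a sketch; it assumes the values themselves are never nil/false, since those would be dropped too):
(let [pred odd?]                    ;; stand-in for a slow predicate
  (->> (range 100)
       (pmap #(when (pred %) %))    ;; predicate runs in parallel (chunked)
       (filter identity)))          ;; cheap sequential pass drops the nils
;; => (1 3 5 … 99)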
Also interesting that in that beginner’s guide to Clojure I am writing, yesterday I wrote “I won’t be going into reducers here”. 😃
That approach only helps if the predicate itself is slow
Have you considered the core.async pipeline utils?
My predicate is an index lookup in a boolean array.
ah whoops, didn't read "I might add that the filter predicate is fast"
They are quite powerful and nice to use in my experience https://clojuredocs.org/clojure.core.async/pipeline
If you do a lot of number crunching, perhaps using Neanderthal would be worth it. Map/reduce tutorial section: https://neanderthal.uncomplicate.org/articles/tutorial_native.html#fast-mapping-and-reducing
I hadn’t considered core.async, @raspasov. I started to think about the option to parallelize this some minutes before I asked the question and hadn’t found pfilter in the core library.
I’ll have a look at that. Even if the number crunching is done for the particular task: it takes 0.3 ms, and then filtering out the results takes 20 ms. Very frustrating!
@pez pipeline would allow you to write a transducer like (filter my-fn) and then just give it “n”:
(defonce p1 (pipeline 21 to-ch (filter my-fn) from-ch))
....`does the database dance`...
🙂 From neanderthal 🙂
Do note that Neanderthal is IIRC hundreds of MBs because it requires BLAS and/or MKL.
Then simply start put!-ing elements onto to-ch
Sounds nice!
(pipeline, not hundreds of MBs 😄 )
… and receive the filtered result onto ‘from-ch’
actually…. reverse
start with ‘from-ch’
receive in ‘to-ch’
Hm, if these results are coming from a database, you might be able to do this work inside the database instead (dharrigan's database word triggered that thought)
Hopefully that was clear 🙂
clojure docs has some nice examples of pipeline
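Roughly this shape, if anyone wants a runnable toy (my sketch, using a recent core.async for to-chan!):
(require '[clojure.core.async :as a])

(let [from-ch (a/to-chan! (range 100))  ;; source channel, closed when drained
      to-ch   (a/chan 64)]
  ;; 4 workers apply the transducer in parallel; output order is preserved
  (a/pipeline 4 to-ch (filter odd?) from-ch)
  (a/<!! (a/into [] to-ch)))
;; => [1 3 5 … 99]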
Wondering if using transducers and no parallelization would result in a noticeable speedup (at the very least it tends to be more memory-efficient)
@vemv It depends on what you’re doing… it can be significant but rarely an order of magnitude improvement (just switching from collections to transducers without something like pipeline)
(pipeline …) really shines if you have a big server with many real cores and a bunch of tasks that you need to get done in parallel and they require minimal coordination (for example, web scraping)
I’ve launched a server on AWS with 32+ cores and used pipeline… it’s pretty neat
The easiest one to test was transduce. It gained me 10%. Next thing to try is reducers, I think. But later, my lunch break is over. 😃
so I have this chain of async events where I need to wait on a status condition for each step in order to proceed to the next. struggling a bit with how to structure this code. currently looks like this:
looking for input on how to handle this sort of thing. gets pretty ugly when we're talking about a chain of 8-10 steps.
This is called callback hell. You might be able to structure this better using core async or some monadic library like promesa maybe (never tried it)
The way you write your code right now it seems you're not doing it async btw, it seems like a series of sync operations
Recently someone showed me how he used https://github.com/adambard/failjure to solve this kind of problem
yeah it is. each wait-for hides a loop polling some HTTP API for a specific status
You might also be able to use an async http lib like httpkit or a java 11 based one
the whole thing is essentially one big sync operation in that each step absolutely needs to wait for the next before proceeding
but consisting of async HTTP calls underneath
then handle the error/success in the http callback
you either wait, or you async, there's no waiting + async
unless you are waiting for a promise that gets delivered by an async request for example
hm. the whole point of this was to wait different amounts of time for each step before issuing a timeout to the client, but iirc you can perhaps do something like
the http client (clj-http) returns futures if you tell it to (which I hadn't)
it might be better to set the timeout on the request though, if possible
nah the request doesn't time out. you get a response with current status from the API. so this won't really work either
so it's essentially a daisy chain of polling loops
I feel like I’m having a core.async day 🙂 (already talked about pipeline elsewhere) @restenb https://clojuredocs.org/clojure.core.async/pipeline-async might be helpful for your case
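fwiw, one way to flatten such a chain with core.async: each step becomes a channel-returning poll, and a single go block walks the steps (check-status! and the step names here are stand-ins, a sketch rather than a drop-in):
(require '[clojure.core.async :as a])

(defn check-status!
  "Stand-in for the real blocking HTTP status poll."
  [step]
  :done)

(defn wait-for
  "Polls (check-status! step) every interval-ms until it returns :done,
  giving up after timeout-ms. Returns a channel yielding :ok or :timeout."
  [step timeout-ms interval-ms]
  (a/go-loop [waited 0]
    (cond
      ;; run the blocking poll on a real thread, not the go dispatch pool
      (= :done (a/<! (a/thread (check-status! step)))) :ok
      (>= waited timeout-ms) :timeout
      :else (do (a/<! (a/timeout interval-ms))
                (recur (+ waited interval-ms))))))

;; the daisy chain, flattened: stop at the first step that times out
(a/go
  (loop [steps [:create :provision :deploy]]
    (when-let [step (first steps)]
      (if (= :ok (a/<! (wait-for step 30000 1000)))
        (recur (rest steps))
        (println "timed out waiting for" step)))))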
@raspasov i'll take a look, thanks
lately I got massive hangups when retrieving libs from central, using leiningen 2.9.5 - is that a lein problem or maven?
always hangs on different libs and restarting deps a few times eventually succeeds
The gain from reducers is a tad better, but still nothing major. I don’t quite understand why. The next experiment will be pipeline, but I expect it not to help too much either, because I suspect I have not analyzed the problem correctly.
i'm surprised i've never made this silent but terrible error before: [{:keys [a :as thing]}]. thing here is not the whole object being destructured. my eyes glanced right over it for a while
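for anyone skimming, the difference:
;; what was written: :as inside :keys binds thing to (get m :thing), nil here
(let [{:keys [a :as thing]} {:a 1 :b 2}]
  [a thing])
;; => [1 nil]

;; what was meant: :as as a sibling of :keys binds the whole map
(let [{:keys [a] :as thing} {:a 1 :b 2}]
  [a thing])
;; => [1 {:a 1, :b 2}]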
syntactically, that's all legal :)
but :as in a :keys list seems like something you could lint
calling all @borkdude s
@dpsutton @alexmiller clj-kondo will already kind of make you notice by saying that :as is an unused binding
yeah. i guess i missed it in the font-locking for :as as a keyword
Speaking about keywords, I'd like some input on this proposal for an :invalid-ident linter which will warn about things like :1.10.2: https://github.com/clj-kondo/clj-kondo/issues/1179
The reason :1.10.2 is problematic is that if you take its name, convert it to a symbol, and try to read that back as EDN, it will fail. We ran into this issue when outputting keywords in the analysis.
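e.g.:
(require '[clojure.edn :as edn])

(name :1.10.2)               ;; => "1.10.2"
(edn/read-string "1.10.2")   ;; throws NumberFormatException: Invalid number: 1.10.2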
With pipeline things go about 200 times slower :thinking_face:
I think pipeline might not be suited for parallelizing things that go fast.