Given this:
user=> (time (dotimes [_ 10000000] (assoc {} :k1 :v1 :k2 :v2 :k3 :v3)))
"Elapsed time: 2046.959477 msecs"
nil
user=> (time (dotimes [_ 10000000] (-> {} (assoc :k1 :v1) (assoc :k2 :v2) (assoc :k3 :v3))))
"Elapsed time: 666.028164 msecs"
nil
would it make sense to just add more arities to assoc
in core to speed up things a little when adding more than 1 kv-pair?E.g. this one seems to help:
(def
^{:arglists '([map key val] [map key val & kvs])
:doc "assoc[iate]. When applied to a map, returns a new map of the
same (hashed/sorted) type, that contains the mapping of key(s) to
val(s). When applied to a vector, returns a new vector that
contains val at index. Note - index must be <= (count vector)."
:added "1.0"
:static true}
assoc2
(fn ^:static assoc2
([map key val] (clojure.lang.RT/assoc map key val))
([map k1 v1 k2 v2]
(-> map (assoc k1 v1) (assoc k2 v2)))
([map k1 v1 k2 v2 k3 v3]
(-> map (assoc k1 v1) (assoc k2 v2) (assoc k3 v3)))
([map k1 v1 k2 v2 k3 v3 & kvs]
(let [ret (assoc map k1 v1)]
(if kvs
(if (next kvs)
(recur ret k2 v2 k3 v3 (first kvs) (second kvs) (nnext kvs))
(throw (IllegalArgumentException.
"assoc expects even number of arguments after map/vector, found odd number")))
(-> ret (assoc k2 v2 k3 v3)))))))
There exists a ticket for this somewhere
same get be done with dissoc
You can also use the inliner to opportunistically unroll get-in
when the vector is written literally at the call site, which is very often the case
the speedups for get-in and for select-keys are very significant https://github.com/bsless/clj-fast/blob/master/doc/results.md#get-in https://github.com/bsless/clj-fast/blob/master/doc/results.md#select-keys
@ben.sless I think a lot of these things could be linting rules in clj-kondo as well
Optional of course
The problem here is that optimizing is usually only relevant in certain contexts and not globally
for the linter I mean
Yep. Also, I can already guess what Rich will say when he looks at such a patch: it increases the amount of generated bytecode, this will mess with JITing, hard pass.
I can see some linting rules around this being useful when applied namespace-locally
and this is easily analyzed statically:
(get-in foo [:foo :bar])
;; => Replace get-in by -> for better performance
or something like that(clj-kondo already has the ability to turn on rules namespace-locally)
Consider here that's also an issue with emitted bytecode size. (:foo m)
generates more code than (get m :foo)
, and after the JIT really kicks in the difference is quite small
Also, funny business:
(def m {:a 1})
(with-out-str (time (dotimes [_ 100000000] (get m :a))))
;; => "\"Elapsed time: 1032.999081 msecs\"\n"
(with-out-str (time (dotimes [_ 100000000] (:a m))))
;; => "\"Elapsed time: 1793.317199 msecs\"\n"
would it be good to say that like with assoc, it's more optimal to replace get-in
with nested get
when the length of the path is statically known?
absolutely
but again, it may have unexpected effects on performance, so I'd want to hear some of the core devs' opinion on this
user=> (time (dotimes [i 10000000] (get-in a [:a :b])))
"Elapsed time: 384.070726 msecs"
nil
user=> (time (dotimes [i 10000000] (get (get a :a) :b)))
"Elapsed time: 153.975087 msecs"
nil
I think my last comment there is still valid - this needs assessment of real use in the wild. Most people don't call assoc with multiple kvs.
the problem with linter advice like "Replace get-in by -> for better performance" is that this advice may be true for some clojure/java versions but gets cargo culted even when it stops being true.
and usually it doesn't matter anyways
so people focus on the wrong things
@alexmiller I can do some analysis. Only in clj-kondo I find 41 cases:
$ grasp . "(g/seq 'assoc any? any? any? (s/+ any?))" | grep file | wc -l
41
Results from our 200k LOC codebase:
Total: 15776
1: ******************************************************************************* (12330, 79%)
2: **************** (2425, 16%)
3: ***** (642, 5%)
4: ** (223, 2%)
5: * (114, 1%)
6: * (22, 1%)
9: * (6, 1%)
7: * (3, 1%)
11: * (3, 1%)
10: * (3, 1%)
12: * (2, 1%)
20: * (1, 1%)
14: * (1, 1%)
8: * (1, 1%)
Fixed the percentages now:
$ ~/Dropbox/temp/assoc_pairs.clj
Total: 1002
1: ********************************************************************************* (814, 81,24%)
2: ************ (120, 11,98%)
3: ***** (55, 5,49%)
5: * (6, 0,60%)
4: * (3, 0,30%)
9: * (2, 0,20%)
6: * (2, 0,20%)
vs 85 with only one k/v pair
right, I expect it falls off quickly after htat
I can make a frequency table
I'm interested what that looks like in clojure code in general
it's quite common, I have seen that often in code reviews.
I can run the analysis on "clojure code in general" if you will supply me with a body of code
and the counter-intuitive nit: "you can reduce nesting by doing a single assoc"
:)
I can't share production code but it's ubiquitous in our code base as well
seriously though, I use https://grep.app all the time to look at these kinds of things on github
http://grep.app is nice but it won't give you any statistical numbers, just an intuition while scrolling through examples
I don't disagree with that :)
that's sufficient
it's not very scientific though
I usually use it to confirm the intuitions I already have
but sometimes I'm surprised :)
Ran the grasp borkdude provided on all the work repos I have cloned on this machine
37 2 2 1 4 5 91 1 4 1 7 9 1 196 42 13 1 21 29 4 6 43 4 37 43 6 4 4 5
about a third of them had 0
But if you do (assoc x :foo :bar :baz :qix :lol :whatever)
, why don’t you just (merge x {…})
If you thought the restFn arity of assoc was slow, wait until you see merge =\
(everything in relation to assoc of one kv pair)
i would be surprised if someone used merge
when semantically they just needed assoc
and would probably point it out for a change.
True, but it happens more often than you'd think
I even wrote some code that unrolls a static map merge to a series of assoc-s
@ben.sless what do those numbers you posted signify?
for dir in work-projects/* ; do ./grasp $dir "(g/seq 'assoc any? any? any? (s/+ any?))" | grep file | wc -l | grep -v 0 ; done
This is the number of times usage of the rest arity of assoc was found in projects
For frequency analysis, run this: https://gist.github.com/borkdude/e6f0b12f9352f3375e5f3277d2aba6c9 It will analyze the current directory for assoc usage. For clj-kondo I get:
([1 99] [2 20] [3 8] [4 3] [10 1])
So yes, lots of single k/v pairs, about a 1/5th of that amount for 2 k/v pairs, etc.I believe there is a ticket for speeding up merge as well.
So 99 single vs 27 multi
I could argue that if your associng a bunch of things, then maybe you’re missing a semantic thingy in your program.
i don't follow. assoc'ing a few things seems like its using one of the basic building blocks of clojure
the question is how much unrolling is worth doing, also needs perf data per case (or maybe that's in the ticket already, don't have it open). would be great to review the patches too, I haven't looked at those in a long while. prob worth it to unroll to 3, maybe 4.
I can report I tested unrolled versions of assoc for up to 4 keys for different map sizes and saw speedups for all of them
Sorry, I was thinking higher up in the stack, like in the business domain. If you’re assoc’ing a bunch of things on to another thing, then maybe those other things together are a thing in your domain.
on my "big" work project (also includes cljs sources): ([1 377] [2 46] [3 12] [4 4] [6 4] [7 2] [5 2])
so unrolling to 4 would cover 98% usage
that seems likely
would be good to include these examples in a comment
I'll post a link to the script and the stats from my own project(s)
also, vote here: https://ask.clojure.org/index.php/3330/unroll-assoc-and-assoc-for-small-numbers-of-arguments
Ran @borkdude's script on our codebase.
([1 1213] [2 186] [3 41] [4 7] [5 4] [6 2] [7 1] [8 1])
80% of 1 kv pair again, 10% of 2 and the rest is the long tail basically
I get roughly the same numbers so far
Happy to run this against our codebase. How do I install grasp
on macOS? I looked in the repo and didn't see a binary to download?
@seancorfield Just run this on the JVM: https://gist.github.com/borkdude/e6f0b12f9352f3375e5f3277d2aba6c9
invoke as bash script
Yup, just spotted that and ran it. Thanks!
and from ours:
([1 1708] [2 287] [3 67] [4 25] [5 5] [7 3] [8 2] [13 1] [9 1])
29?!
haha
This is probably where a transient may start to pay off?
This is us:
([1 596] [2 69] [3 28] [4 9] [5 3] [6 1] [10 1] [11 1])
I really can’t believe we have an assoc
with 11 things in it.
@slipset You can adapt the script to find out which form that was. The location info (file, line, column) is on the form
I am somewhat surprised some of those are so large
can grasp work on a single file at a time? could we throw all clj files in m2 through it?
@dpsutton you can throw any path at it and it will "grasp" .clj/.cljc/.cljs files, .jar files and directories recursively. It will be slow as hell though for an entire m2 dir. I'm thinking of some heuristics to speed this up here: https://github.com/borkdude/grasp/issues/13
it also accepts a classpath (as produced by the clojure CLI for example)
I ran it on a subset of some nubank code and looked at the longest cases, most were in tests which makes sense
Mine in 100kLOC codebase
([1 923] [2 123] [4 38] [5 35] [3 34] [6 32] [7 25] [8 6] [11 2] [12 1] [14 1])
These 6+ cases I'm pretty sure that are my routes (code that will run only once, at startup time), so I don't think that this kind of change worthI kicked it off a couple of hours ago on my entire "workspace" folder which has both our work codebase (that I reported on earlier) and every single OSS project I work on. It's still running 🙂
I think I want to include a couple of hooks so you can say: if this file doesn't contain "assoc" at all, just skip it. And, if this form doesn't contain "assoc" at all, just skip it as well. I'm trying this now locally on my .m2 folder...
the resulting forms from grasp are returned as a lazy-seq so you can already build some rolling average reporting on top of it
There may also be a bug in grasp currently, I'm looking into it
Why it's not progressing after a certain amount of time
I think I found something using visualVM: there might be a bug in the parser 🤦
OK, I'll kill my process (that was still running 3 1/2 hours later).
Yeah, that's probably best. I had a bug in the parser before, but that was already fixed. It seems clojure was using and older version of the parsing lib although the deps.edn of the gitlib was already updated. It seemed to work again with -Sforce but now running into something else. I'll report back when I can "grasp" my entire m2.
(found it, it's a bug with parsing .cljc from data.xml - node.cljc to be specific)
I'll also post these results to the jira.
Here's the results for all the non-contrib source code I have on my machine:
([1 32258] [2 4848] [3 1286] [4 535] [5 354] [6 118] [7 100] [9 32] [11 22] [12 14] [18 10] [8 8] [29 6] [16 6] [19 2] [10 2])
and here's the results for Clojure and all the Contrib libraries together:
([1 2786] [2 423] [3 99] [4 10] [5 6] [6 4] [7 1])
(I'm a bit surprised I have so much source code checked out locally to be honest)
(in case you wanted to know what the bug was: the parser didn't expect there could be whitespace between a reader conditional splice and the next expression: #?@\n(:clj ..)
)
I just grep'd all the code I have here and there are two occurrences in that one file and none in any of the other source code I have checked out.
ok, well, glad it's fixed now.
Maybe I should grab a clojars dump and hook up the analysis I've been threatening for years.