clojure-dev

Issues: https://clojure.atlassian.net/browse/CLJ | Guide: https://insideclojure.org/2015/05/01/contributing-clojure/
borkdude 2020-12-21T13:21:55.393400Z

Given this:

user=> (time (dotimes [_ 10000000] (assoc {} :k1 :v1 :k2 :v2 :k3 :v3)))
"Elapsed time: 2046.959477 msecs"
nil
user=> (time (dotimes [_ 10000000] (-> {} (assoc :k1 :v1) (assoc :k2 :v2) (assoc :k3 :v3))))
"Elapsed time: 666.028164 msecs"
nil
would it make sense to just add more arities to assoc in core to speed up things a little when adding more than 1 kv-pair?

borkdude 2020-12-21T13:37:40.394100Z

E.g. this one seems to help:

(def
  ^{:arglists '([map key val] [map key val & kvs])
    :doc "assoc[iate]. When applied to a map, returns a new map of the
    same (hashed/sorted) type, that contains the mapping of key(s) to
    val(s). When applied to a vector, returns a new vector that
    contains val at index. Note - index must be <= (count vector)."
    :added "1.0"
    :static true}
  assoc2
  (fn ^:static assoc2
    ([map key val] (clojure.lang.RT/assoc map key val))
    ([map k1 v1 k2 v2]
     (-> map (assoc k1 v1) (assoc k2 v2)))
    ([map k1 v1 k2 v2 k3 v3]
     (-> map (assoc k1 v1) (assoc k2 v2) (assoc k3 v3)))
    ([map k1 v1 k2 v2 k3 v3 & kvs]
     (let [ret (assoc map k1 v1)]
       (if kvs
         (if (next kvs)
           (recur ret k2 v2 k3 v3 (first kvs) (second kvs) (nnext kvs))
           (throw (IllegalArgumentException.
                   "assoc expects even number of arguments after map/vector, found odd number")))
         (-> ret (assoc k2 v2 k3 v3)))))))

ghadi 2020-12-21T13:40:35.395300Z

There exists a ticket for this somewhere

Ben Sless 2020-12-21T13:40:47.395500Z

same get be done with dissoc

Ben Sless 2020-12-21T13:42:06.395700Z

You can also use the inliner to opportunistically unroll get-in when the vector is written literally at the call site, which is very often the case

Ben Sless 2020-12-21T13:44:34.396Z

the speedups for get-in and for select-keys are very significant https://github.com/bsless/clj-fast/blob/master/doc/results.md#get-in https://github.com/bsless/clj-fast/blob/master/doc/results.md#select-keys

borkdude 2020-12-21T13:46:58.396400Z

@ben.sless I think a lot of these things could be linting rules in clj-kondo as well

borkdude 2020-12-21T13:47:01.396600Z

Optional of course

borkdude 2020-12-21T13:47:30.396800Z

The problem here is that optimizing is usually only relevant in certain contexts and not globally

borkdude 2020-12-21T13:48:31.397Z

for the linter I mean

Ben Sless 2020-12-21T13:48:39.397200Z

Yep. Also, I can already guess what Rich will say when he looks at such a patch: it increases the amount of generated bytecode, this will mess with JITing, hard pass.

borkdude 2020-12-21T13:56:21.397400Z

I can see some linting rules around this being useful when applied namespace-locally

borkdude 2020-12-21T13:57:14.397600Z

and this is easily analyzed statically:

(get-in foo [:foo :bar])
;; => Replace get-in by -> for better performance or something like that

borkdude 2020-12-21T13:58:22.397800Z

(clj-kondo already has the ability to turn on rules namespace-locally)

Ben Sless 2020-12-21T14:03:47.398Z

Consider here that's also an issue with emitted bytecode size. (:foo m) generates more code than (get m :foo), and after the JIT really kicks in the difference is quite small

Ben Sless 2020-12-21T14:06:42.398200Z

Also, funny business:

(def m {:a 1})
(with-out-str (time (dotimes [_ 100000000] (get m :a))))
;; => "\"Elapsed time: 1032.999081 msecs\"\n"
(with-out-str (time (dotimes [_ 100000000] (:a m))))
;; => "\"Elapsed time: 1793.317199 msecs\"\n"

borkdude 2020-12-21T14:07:24.398400Z

would it be good to say that like with assoc, it's more optimal to replace get-in with nested get when the length of the path is statically known?

Ben Sless 2020-12-21T14:07:59.398700Z

absolutely

Ben Sless 2020-12-21T14:08:41.398900Z

but again, it may have unexpected effects on performance, so I'd want to hear some of the core devs' opinion on this

borkdude 2020-12-21T14:08:53.399100Z

user=> (time (dotimes [i 10000000] (get-in a [:a :b])))
"Elapsed time: 384.070726 msecs"
nil
user=> (time (dotimes [i 10000000] (get (get a :a) :b)))
"Elapsed time: 153.975087 msecs"
nil

alexmiller 2020-12-21T14:50:01.399400Z

https://clojure.atlassian.net/browse/CLJ-1656

alexmiller 2020-12-21T14:50:49.400Z

I think my last comment there is still valid - this needs assessment of real use in the wild. Most people don't call assoc with multiple kvs.

alexmiller 2020-12-21T14:57:17.400100Z

the problem with linter advice like "Replace get-in by -> for better performance" is that this advice may be true for some clojure/java versions but gets cargo culted even when it stops being true.

alexmiller 2020-12-21T14:58:00.400400Z

and usually it doesn't matter anyways

alexmiller 2020-12-21T14:58:11.400600Z

so people focus on the wrong things

borkdude 2020-12-21T15:16:17.401300Z

@alexmiller I can do some analysis. Only in clj-kondo I find 41 cases:

$ grasp . "(g/seq 'assoc any? any? any? (s/+ any?))" | grep file | wc -l
      41

2020-12-22T13:58:06.442200Z

Results from our 200k LOC codebase:

Total: 15776
 1: ******************************************************************************* (12330, 79%)
 2: **************** (2425, 16%)
 3: ***** (642, 5%)
 4: ** (223, 2%)
 5: * (114, 1%)
 6: * (22, 1%)
 9: * (6, 1%)
 7: * (3, 1%)
11: * (3, 1%)
10: * (3, 1%)
12: * (2, 1%)
20: * (1, 1%)
14: * (1, 1%)
 8: * (1, 1%)

borkdude 2020-12-22T14:07:53.442400Z

Fixed the percentages now:

$ ~/Dropbox/temp/assoc_pairs.clj
Total: 1002
 1: ********************************************************************************* (814, 81,24%)
 2: ************ (120, 11,98%)
 3: ***** (55, 5,49%)
 5: * (6, 0,60%)
 4: * (3, 0,30%)
 9: * (2, 0,20%)
 6: * (2, 0,20%)

borkdude 2020-12-21T15:16:56.401600Z

vs 85 with only one k/v pair

alexmiller 2020-12-21T15:17:43.401900Z

right, I expect it falls off quickly after htat

borkdude 2020-12-21T15:18:01.402100Z

I can make a frequency table

alexmiller 2020-12-21T15:18:37.402900Z

I'm interested what that looks like in clojure code in general

mpenet 2020-12-21T15:18:47.403200Z

it's quite common, I have seen that often in code reviews.

borkdude 2020-12-21T15:19:07.403900Z

I can run the analysis on "clojure code in general" if you will supply me with a body of code

mpenet 2020-12-21T15:19:13.404200Z

and the counter-intuitive nit: "you can reduce nesting by doing a single assoc"

alexmiller 2020-12-21T15:19:16.404300Z

https://github.com

alexmiller 2020-12-21T15:19:29.404500Z

:)

Ben Sless 2020-12-21T15:20:18.405600Z

I can't share production code but it's ubiquitous in our code base as well

alexmiller 2020-12-21T15:20:26.405800Z

seriously though, I use https://grep.app all the time to look at these kinds of things on github

borkdude 2020-12-21T15:21:02.406500Z

http://grep.app is nice but it won't give you any statistical numbers, just an intuition while scrolling through examples

borkdude 2020-12-21T15:21:49.406600Z

I don't disagree with that :)

alexmiller 2020-12-21T15:22:36.407600Z

that's sufficient

borkdude 2020-12-21T15:23:11.407900Z

it's not very scientific though

alexmiller 2020-12-21T15:24:56.408600Z

I usually use it to confirm the intuitions I already have

alexmiller 2020-12-21T15:25:04.408800Z

but sometimes I'm surprised :)

Ben Sless 2020-12-21T15:30:35.410800Z

Ran the grasp borkdude provided on all the work repos I have cloned on this machine

37 2 2 1 4 5 91 1 4 1 7 9 1 196 42 13 1 21 29 4 6 43 4 37 43 6 4 4 5

Ben Sless 2020-12-21T15:32:06.411300Z

about a third of them had 0

slipset 2020-12-21T15:34:23.412200Z

But if you do (assoc x :foo :bar :baz :qix :lol :whatever), why don’t you just (merge x {…})

Ben Sless 2020-12-21T15:37:19.412600Z

If you thought the restFn arity of assoc was slow, wait until you see merge =\

Ben Sless 2020-12-21T15:37:32.412900Z

(everything in relation to assoc of one kv pair)

dpsutton 2020-12-21T15:38:23.413700Z

i would be surprised if someone used merge when semantically they just needed assoc and would probably point it out for a change.

Ben Sless 2020-12-21T15:38:53.413800Z

True, but it happens more often than you'd think

Ben Sless 2020-12-21T15:39:18.414Z

I even wrote some code that unrolls a static map merge to a series of assoc-s

ghadi 2020-12-21T15:39:56.414600Z

@ben.sless what do those numbers you posted signify?

Ben Sless 2020-12-21T15:41:04.415200Z

for dir in work-projects/* ; do ./grasp $dir "(g/seq 'assoc any? any? any? (s/+ any?))" | grep file | wc -l | grep -v 0 ; done

Ben Sless 2020-12-21T15:42:11.416200Z

This is the number of times usage of the rest arity of assoc was found in projects

borkdude 2020-12-21T15:43:54.417600Z

For frequency analysis, run this: https://gist.github.com/borkdude/e6f0b12f9352f3375e5f3277d2aba6c9 It will analyze the current directory for assoc usage. For clj-kondo I get:

([1 99] [2 20] [3 8] [4 3] [10 1])
So yes, lots of single k/v pairs, about a 1/5th of that amount for 2 k/v pairs, etc.

slipset 2020-12-21T15:44:00.417900Z

I believe there is a ticket for speeding up merge as well.

borkdude 2020-12-21T15:44:54.418400Z

So 99 single vs 27 multi

slipset 2020-12-21T15:45:52.419600Z

I could argue that if your associng a bunch of things, then maybe you’re missing a semantic thingy in your program.

dpsutton 2020-12-21T15:50:36.420900Z

i don't follow. assoc'ing a few things seems like its using one of the basic building blocks of clojure

alexmiller 2020-12-21T15:52:20.422Z

the question is how much unrolling is worth doing, also needs perf data per case (or maybe that's in the ticket already, don't have it open). would be great to review the patches too, I haven't looked at those in a long while. prob worth it to unroll to 3, maybe 4.

Ben Sless 2020-12-21T15:55:32.422600Z

I can report I tested unrolled versions of assoc for up to 4 keys for different map sizes and saw speedups for all of them

slipset 2020-12-21T15:55:51.422800Z

Sorry, I was thinking higher up in the stack, like in the business domain. If you’re assoc’ing a bunch of things on to another thing, then maybe those other things together are a thing in your domain.

borkdude 2020-12-21T15:56:07.423100Z

on my "big" work project (also includes cljs sources): ([1 377] [2 46] [3 12] [4 4] [6 4] [7 2] [5 2])

alexmiller 2020-12-21T16:07:48.423400Z

so unrolling to 4 would cover 98% usage

borkdude 2020-12-21T16:09:22.423600Z

that seems likely

alexmiller 2020-12-21T16:13:41.423900Z

would be good to include these examples in a comment

borkdude 2020-12-21T16:13:57.424200Z

I'll post a link to the script and the stats from my own project(s)

kenny 2020-12-21T16:23:04.425400Z

Ran @borkdude's script on our codebase.

([1 1213] [2 186] [3 41] [4 7] [5 4] [6 2] [7 1] [8 1])

borkdude 2020-12-21T16:24:18.425900Z

80% of 1 kv pair again, 10% of 2 and the rest is the long tail basically

borkdude 2020-12-21T16:24:39.426200Z

I get roughly the same numbers so far

seancorfield 2020-12-21T17:51:29.426500Z

Happy to run this against our codebase. How do I install grasp on macOS? I looked in the repo and didn't see a binary to download?

borkdude 2020-12-21T17:53:52.426700Z

@seancorfield Just run this on the JVM: https://gist.github.com/borkdude/e6f0b12f9352f3375e5f3277d2aba6c9

borkdude 2020-12-21T17:53:56.426900Z

invoke as bash script

seancorfield 2020-12-21T17:54:56.427100Z

Yup, just spotted that and ran it. Thanks!

mkvlr 2020-12-21T18:24:42.427700Z

and from ours:

([1 1708] [2 287] [3 67] [4 25] [5 5] [7 3] [8 2] [13 1] [9 1])

borkdude 2020-12-21T19:02:09.427900Z

alexmiller 2020-12-21T19:15:17.428Z

29?!

borkdude 2020-12-21T19:17:37.428200Z

haha

borkdude 2020-12-21T19:18:36.428400Z

This is probably where a transient may start to pay off?

slipset 2020-12-21T19:21:42.428800Z

This is us:

([1 596] [2 69] [3 28] [4 9] [5 3] [6 1] [10 1] [11 1])

slipset 2020-12-21T19:26:06.429300Z

I really can’t believe we have an assoc with 11 things in it.

borkdude 2020-12-21T19:27:33.430Z

@slipset You can adapt the script to find out which form that was. The location info (file, line, column) is on the form

alexmiller 2020-12-21T19:37:31.430800Z

I am somewhat surprised some of those are so large

dpsutton 2020-12-21T19:39:17.431400Z

can grasp work on a single file at a time? could we throw all clj files in m2 through it?

borkdude 2020-12-21T19:40:27.432400Z

@dpsutton you can throw any path at it and it will "grasp" .clj/.cljc/.cljs files, .jar files and directories recursively. It will be slow as hell though for an entire m2 dir. I'm thinking of some heuristics to speed this up here: https://github.com/borkdude/grasp/issues/13

borkdude 2020-12-21T19:42:19.433300Z

it also accepts a classpath (as produced by the clojure CLI for example)

alexmiller 2020-12-21T20:03:28.433900Z

I ran it on a subset of some nubank code and looked at the longest cases, most were in tests which makes sense

souenzzo 2020-12-21T20:07:52.434100Z

Mine in 100kLOC codebase

([1 923] [2 123] [4 38] [5 35] [3 34] [6 32] [7 25] [8 6] [11 2] [12 1] [14 1])
These 6+ cases I'm pretty sure that are my routes (code that will run only once, at startup time), so I don't think that this kind of change worth

seancorfield 2020-12-21T20:32:37.434300Z

I kicked it off a couple of hours ago on my entire "workspace" folder which has both our work codebase (that I reported on earlier) and every single OSS project I work on. It's still running 🙂

borkdude 2020-12-21T20:34:36.434500Z

I think I want to include a couple of hooks so you can say: if this file doesn't contain "assoc" at all, just skip it. And, if this form doesn't contain "assoc" at all, just skip it as well. I'm trying this now locally on my .m2 folder...

borkdude 2020-12-21T20:37:23.434700Z

the resulting forms from grasp are returned as a lazy-seq so you can already build some rolling average reporting on top of it

borkdude 2020-12-21T21:08:44.435Z

There may also be a bug in grasp currently, I'm looking into it

borkdude 2020-12-21T21:08:55.435200Z

Why it's not progressing after a certain amount of time

borkdude 2020-12-21T21:13:43.435400Z

I think I found something using visualVM: there might be a bug in the parser 🤦

seancorfield 2020-12-21T21:34:42.435600Z

OK, I'll kill my process (that was still running 3 1/2 hours later).

borkdude 2020-12-21T21:45:08.435800Z

Yeah, that's probably best. I had a bug in the parser before, but that was already fixed. It seems clojure was using and older version of the parsing lib although the deps.edn of the gitlib was already updated. It seemed to work again with -Sforce but now running into something else. I'll report back when I can "grasp" my entire m2.

1
borkdude 2020-12-21T22:00:20.436300Z

(found it, it's a bug with parsing .cljc from data.xml - node.cljc to be specific)

borkdude 2020-12-21T22:35:22.437Z

I'll also post these results to the jira.

seancorfield 2020-12-21T22:39:29.437900Z

Here's the results for all the non-contrib source code I have on my machine:

([1 32258] [2 4848] [3 1286] [4 535] [5 354] [6 118] [7 100] [9 32] [11 22] [12 14] [18 10] [8 8] [29 6] [16 6] [19 2] [10 2])
and here's the results for Clojure and all the Contrib libraries together:
([1 2786] [2 423] [3 99] [4 10] [5 6] [6 4] [7 1])

seancorfield 2020-12-21T22:40:56.438600Z

(I'm a bit surprised I have so much source code checked out locally to be honest)

borkdude 2020-12-21T22:43:05.438700Z

(in case you wanted to know what the bug was: the parser didn't expect there could be whitespace between a reader conditional splice and the next expression: #?@\n(:clj ..))

seancorfield 2020-12-21T22:55:18.439600Z

I just grep'd all the code I have here and there are two occurrences in that one file and none in any of the other source code I have checked out.

borkdude 2020-12-21T22:57:13.439800Z

ok, well, glad it's fixed now.

dominicm 2020-12-21T23:07:14.440Z

Maybe I should grab a clojars dump and hook up the analysis I've been threatening for years.