What's the reason Keyword doesn't implement IObj and can't carry metadata?
at the very least, you can’t intern them directly
keywords are interned
they're cached and reused so metadata in one place would affect other uses
When saying they are interned, or they are cached, is that the same thing as saying: "we want equality to be fast, determined by whether references/pointers to them are equal", or is there any more nuance to it than that?
fast equality (identity) checks and memory impact
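As an analogy only (this is Java's `String` pool, not Clojure's `Keyword` code), interning canonicalizes values so that equality collapses to a pointer comparison:

```java
// Analogy: Java's String interning. Once two values are interned,
// equality is a reference (identity) check, not a structural compare.
public class InternDemo {
    public static void main(String[] args) {
        String a = "foo".intern();
        String b = new String("foo").intern(); // distinct object, but intern() returns the canonical one
        System.out.println(a == b);            // true: identity equality after interning
    }
}
```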
it also means they won't ever be garbage collected, so creating a gazillion different keywords in a long running process isn't a good idea maybe?
that's not the typical clojure program of course
The clj implementation uses weakrefs to intern, which can be GCed
ah, very good
but the map holding them I guess could still get very big
under gc pressure, they are gc'ed
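A minimal demonstration of the mechanism being described (this is plain `java.lang.ref`, not Clojure source): a `WeakReference` does not keep its referent alive, so once the last strong reference is gone, GC may clear it.

```java
import java.lang.ref.WeakReference;

// Stand-in for an interned Keyword held only via a weak reference.
public class WeakDemo {
    public static void main(String[] args) throws InterruptedException {
        Object kw = new Object();                  // pretend this is a Keyword
        WeakReference<Object> ref = new WeakReference<>(kw);
        System.out.println(ref.get() != null);     // true: still strongly reachable via kw
        kw = null;                                 // drop the last strong reference
        System.gc();                               // request a collection (best effort only)
        Thread.sleep(100);
        System.out.println(ref.get());             // likely null once GC has cleared the weak ref
    }
}
```

Whether the second line prints `null` depends on the JVM actually running a collection; `System.gc()` is only a hint.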
just as a btw, cljs will “intern” (make a fixed map of) keywords discovered statically during compilation, but keywords created dynamically are not because JS has no facility to intern them without leaking. So cljs doesn’t guarantee equal keywords are identical.
that's an interesting detail, thank you
it has an extra predicate keyword-identical?
to get some of the performance back
ah that's where that comes from. yeah. in .cljc I usually write (defn kw-identical? [k v] #?(:clj (identical? k v) :cljs (keyword-identical? k v)))
I guess one could make the Symbol -> Reference to Keyword map smaller under GC pressure, too, but that would require some kind of sweep over that map to remove entries from it, and some kind of trigger to call that sweep?
hmm, so the corresponding symbols still won't get GC-ed... what's the point then of these weak refs?
Hmmm, perhaps Clojure already does make that map smaller, e.g. in its method clearCache ...
aha!
with the trigger to call it for the Symbol->Reference to Keyword map being an intern call on a keyword that finds a null reference
so for each new (non-identical) keyword that trigger is called?
The trigger appears to be calling intern on a Symbol that already has an entry in the map, but its weak reference has been made null, which is evidence that an earlier GC run freed the Keyword object. If you never do an intern call on such a Symbol, there will be no call to clearCache that I can see: https://github.com/clojure/clojure/blob/master/src/jvm/clojure/lang/Keyword.java#L34-L37
right, that makes a lot of sense
One could imagine trying to call clearCache in other situations, but it appears to be a computation time vs. space tradeoff; e.g. a time-based periodic trigger to call clearCache would be a waste of CPU time in most situations.
I suppose the "perfect" trigger would be a thread whose only job was to sit in an infinite loop doing a blocking remove() call on the reference queue of GC'ed WeakReferences, calling clearCache every time that remove() returned something. Then you'd need synchronization on that map.
errr, or maybe the ConcurrentHashMap is safe for that use already
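A sketch of that "perfect trigger" idea (this is NOT what clojure.lang.Keyword actually does; it's the alternative being discussed, and all names here are illustrative). A daemon thread blocks on the `ReferenceQueue` and prunes dead entries the moment GC enqueues them; `ConcurrentHashMap` makes the concurrent removal safe without extra locking:

```java
import java.lang.ref.Reference;
import java.lang.ref.ReferenceQueue;
import java.lang.ref.WeakReference;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class QueueSweeper {
    static final ReferenceQueue<Object> rq = new ReferenceQueue<>();
    static final Map<String, WeakReference<Object>> table = new ConcurrentHashMap<>();

    static void startSweeper() {
        Thread t = new Thread(() -> {
            try {
                while (true) {
                    Reference<?> dead = rq.remove(); // blocks until GC enqueues a cleared reference
                    table.values().remove(dead);     // drop the stale map entry (identity equals)
                }
            } catch (InterruptedException e) { /* shutting down */ }
        });
        t.setDaemon(true); // don't keep the JVM alive just for the sweeper
        t.start();
    }

    public static void main(String[] args) {
        startSweeper();
        // Register a weak ref with the queue; its referent is immediately collectible.
        table.put("foo", new WeakReference<>(new Object(), rq));
        System.gc(); // best effort: may or may not enqueue right away
        System.out.println("sweeper running");
    }
}
```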
doesn’t this actually only call clearCache when there’s a keyword miss?
That is how it appears to me, yes.
hmm
well, nm, it runs intern again after removing the entry, and that ends up clearing out the table and reference queue
What line(s) of code are you referring to when you say "it runs intern again after removing the entry"?
```java
if(existingk != null)
    return existingk;
//entry died in the interim, do over
table.remove(sym, existingRef);
return intern(sym);
```
existingk=null means a cached item was GCed (the table has an empty weakref in it), so it removes the entry and runs intern again; this time table.get(sym) will be null, so it runs clearCache
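A hedged, simplified model of that flow (the real logic lives in clojure.lang.Keyword and Util on the JVM; names here are illustrative, and unlike the real code, this version sweeps unconditionally on every miss rather than first checking the reference queue):

```java
import java.lang.ref.WeakReference;
import java.util.Iterator;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class ClearCacheModel {
    static final Map<String, WeakReference<Object>> table = new ConcurrentHashMap<>();
    static int clearCacheCalls = 0;

    // Sweep the whole table, dropping entries whose referent was GC'ed.
    static void clearCache() {
        clearCacheCalls++;
        for (Iterator<WeakReference<Object>> it = table.values().iterator(); it.hasNext(); )
            if (it.next().get() == null)
                it.remove();
    }

    static Object intern(String sym) {
        WeakReference<Object> ref = table.get(sym);
        if (ref == null) {
            clearCache();                   // miss on the table triggers the sweep
            Object k = new Object();        // stand-in for a new Keyword
            table.put(sym, new WeakReference<>(k));
            return k;
        }
        Object existing = ref.get();
        if (existing != null)
            return existing;                // cache hit: identical object every time
        table.remove(sym, ref);             // entry died in the interim, do over
        return intern(sym);                 // now table.get(sym) == null, so clearCache runs
    }

    public static void main(String[] args) {
        Object a1 = intern("a");            // miss -> clearCache (1st call)
        Object a2 = intern("a");            // hit (a1 keeps it alive) -> no sweep
        Object b  = intern("b");            // miss -> clearCache (2nd call)
        System.out.println(clearCacheCalls); // 2
    }
}
```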
Suppose some Keywords were GC'ed, and every call you made to intern after that was for Symbols whose Keywords were never GC'ed. It appears to me that clearCache would never be called. Do you see a way that clearCache could be called in that scenario?
no, but that would also not be a case with memory pressure
It could be. When I said "some Keywords were GC'ed", it could be 99% of a billion of them that were GC'ed.
eventually the GC will collect a keyword that was only not in use for a brief time and gets used again; at that point it will be detected
if the GC never bothers to do that, then it has spare memory
the worst case I can think of would be 1) create a large number of unique keywords 2) then stop and never use a new keyword again
The scenario I am describing is one where the keywords that are GC'ed are never used again by the application. I know there are applications that will not behave that way, but that is the scenario I was trying to describe above.
correct, but if there’s memory pressure from those empty entries, eventually it seems that some keyword somewhere that the application still uses will eventually be GCed, if the application is still using keywords at all that don’t have permanent lifetimes
Ah, I see what you mean. clearCache is called either if you re-intern an old GC'ed Keyword, OR if you intern a brand new keyword. The only situation where you avoid it indefinitely is by sticking with the Keywords you have now, forever.
not only sticking with them, but keeping them “alive”
right
so, it’s possible but seems very unlikely
doesn’t mean you won’t have a bad time when clearCache gets called though
It is a kind of stop-the-world GC on that map, yes.
or at least stop-the-thread-calling-intern, not the world
I think the lesson here is don’t dynamically create large numbers of unique short-lived keywords
So don’t create keywords on user input (form, json, edn, xml… parsing).
edn billion laughs attack
It would be fun to get some statistics on the number of interned keywords in various Clojure apps
I have done that in the past
I also have a benchmark (based on some real world cases) that is importing json with a lot of unique keywords to push this case
the last set of changes done were specifically to make that case better (used to be kind of slow to hit the gc conditions)
this is the (in)famous aphyr ticket (https://clojure.atlassian.net/browse/CLJ-1439)