dunno, but this may be useful https://cloud.google.com/functions/docs/concepts/java-deploy#deploy_from_a_jar
Observability for Clojure applications, state of the art: What is it?
https://github.com/metrics-clojure/metrics-clojure for the Dropwizard library
#observability
is there a similar facade for micrometer?
I love Honeycomb for tracing
Micrometer is ok for metrics
OpenTelemetry metrics are too new, but the tracing side is ok
Dropwizard/CodaHale are showing age
honeycomb is quite nice, but it can get expensive at scale. Though at that scale the cost might not matter.
JMX ?
micrometer is a good solution, it's quite easy to work with via interop
if you want to go the extra mile you can also use something like opentelemetry
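e.g. a minimal sketch of Micrometer from Clojure via plain interop (the registry choice and metric names here are made up):
(import '(io.micrometer.core.instrument.simple SimpleMeterRegistry))

(def registry (SimpleMeterRegistry.))

;; counter with one tag key/value pair (illustrative names)
(def requests (.counter registry "http.requests" (into-array String ["route" "/api"])))
(.increment requests)

;; time a piece of work; Clojure fns implement Runnable
(def latency (.timer registry "handler.latency" (into-array String [])))
(.record latency ^Runnable (fn [] (Thread/sleep 10)))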
Hey, what is the correct way to add a timeout to channels? I tried a naive implementation using async/timeout but got huge memory spikes. Then I found out that I'm not the only one struggling with this (see https://clojure.atlassian.net/browse/ASYNC-225)
can you post your naive implementation?
(defmacro <?t
  "Read a value from a channel (using <!) and if the value is Throwable, throw it."
  ;; default-take-timeout, timeout-msg, and <?* are defined elsewhere in our ns
  ([channel]
   `(<?t ~channel default-take-timeout))
  ([channel timeout-msecs]
   `(let [in-chan#      ~channel
          timeout-chan# (async/timeout ~timeout-msecs)
          [v# c#]       (async/alts! [in-chan# timeout-chan#])]
      (if (= c# in-chan#)
        (<?* v#)  ; throws if the taken value is a Throwable
        (throw (ex-info timeout-msg {:cause   timeout-msg
                                     :timeout ~timeout-msecs}))))))
There is a #core-async channel, but timeout channels and alts! are the way to do it
async-225 is fairly trivial to work around: (async/go (async/<! (async/timeout whatever)))
is it tho?
does timeout chan sharing only happen in a single go block?
the underlying issue is that prolifically using timeouts increases CPU overhead
and you cannot close timeouts because they are shared
you are misunderstanding
the channel returned from (async/go (async/<! (async/timeout whatever)))
is a fresh channel that will close whenever the timeout does
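i.e. something like this sketch (the name fresh-timeout is made up):
(require '[clojure.core.async :as async])

(defn fresh-timeout
  "Like async/timeout, but returns a fresh, unshared channel that closes
  when the timeout fires, so closing it yourself can't affect other takers."
  [msecs]
  (async/go (async/<! (async/timeout msecs))))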
I doubt prolific timeout usage is the root cause of your increased cpu usage
or I should say, I doubt it is the sharing of timeouts that is the root cause
my guess is it is the fundamental global nature of timeouts keeping objects alive, filling your heap, and causing the gc to run all the time, spiking your cpu usage
Yeah, I’ve not looked into it, but the ticket says that timeout checks caused CPU usage.
(I hope that’s not the case, but that’s what it says.)
the thing about timeout channels is there is a global singleton scheduler that is doing the timing out, so it keeps alive a reference to each timeout channel until it closes
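you can even see the sharing at the repl:
;; timeouts requested within the same small window can be the very same channel
(let [a (async/timeout 1000)
      b (async/timeout 1000)]
  (identical? a b))
;; usually => true (timeouts are bucketed to roughly 10 ms resolution)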
and the nature of go blocks is that the code is transformed into callbacks that are registered on channels
so if you wait on a timeout you attach a closure to the timeout that closes over all the data your code references
and that callback may still be live (reachable) because of the timeout even when you stop waiting on the timeout, because of how channels do bookkeeping
so my guess is the issue both you and the guy in the ticket are having has nothing to do with the shared nature of timeouts, and everything to do with the gc implications
at work we have a custom sort of weak-reference channel implementation we use to help combat this kind of thing
and in fact towards the end of the ticket they even talk about it alleviating gc pressure
yeah, there's a seeming "inversion", where the channel becomes the gc root holding onto all the objects in the code waiting on the channel. it is clearly counterintuitive, and I could see how that could easily lead to misunderstanding the source of a problem like this
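a toy illustration (the channel and sizes here are made up):
(def ch (async/chan))

(async/go
  (let [big (vec (range 1000000))] ; closed over by the parked callback
    (async/<! ch)                  ; parks here: ch now references the closure
    (count big)))
;; until ch delivers a value or is closed, `big` stays reachable from ch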
So the sharing comes into play if you try to manually clean up for yourself.
It’s not clear to me what the actual underlying problem is. As you say, it could be GC. However, if you bork the timeout check, you can introduce redundant checks in the DelayQueue.
(I know because I’ve done it.)
They say they confirmed that timeout checks were a problem. I have no idea if that’s true.
it is absolutely the gc, because the cpu usage from increased gc is what caused them to start closing shared timeout channels in the first place, which introduced further breakage
they probably assumed closing the chan would stop the checking?
but why were they looking for a problem to solve in the first place?
because they had metrics somewhere that showed increased cpu usage
so they started doing things like closing channels, and looking at the delayqueue, etc
> Using timeout channels (with long timeouts of 2 minutes) without closing them solved our core problem (too many TCP sockets in CLOSE_WAIT state)
> but immediately caused the application to use approximately 10%+ CPU and way more memory than before.
which is in line with gc probs
> After analyzing the problem, we concluded that leaving the timeout channels as is resulted in them continuously executing the code that checks if their time had passed, even if the “safe take” go block had returned
which is a dubious claim, but not impossible
yeah, the whole root cause analysis in that ticket doesn't go deep enough and is bound up way too much in what they already decided is a solution, based on how they got there
Agreed. I would be interested to know how they arrived at that claim.
the issue is alts! attaches a closure to all the channels in the alts, and doesn't get rid of the reference once one of the alternatives is committed to
yeah, agreed
I just wrote a little patch that I believe fixes that and left it as a comment on the ticket, but I'll need to remember how the contribution process works to attach a real patch and then lean on alex for a few months to a year to get the patch in to a new core.async release
woah:
(mm/measure (map inc (take 100 (range)))) => "232 B"
(mm/measure (walk/postwalk identity (map inc (take 100 (range))))) => "8.6 KiB"
This was an interesting discovery. Not sure what to make of it. mm is clj-memory-meter.
map returns a lazy seq, and range has a specifically optimized lazy seq; walking a lazy seq realizes it
Oh. The map hasn't been realized at that point. I see.
Duh, needs more doall.
Feels less mind blowing now. Lazy sequences always ruin profiling :)
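e.g. with realization forced up front, the measurement covers the whole seq:
(mm/measure (doall (map inc (take 100 (range)))))
;; doall realizes every cell before measuring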
Hadn’t heard of clj-mm, this is cool
It's been helping me out with a production issue :). We handle analytical-style data and clojure.walk is somehow "leaking" memory. I thought I had a repro, but evidently not.
Okay, this is actually it:
(let [x [[]]]
  (prn (mm/measure x))                           ;; => "304 B"
  (prn (mm/measure (walk/postwalk identity x)))) ;; => "488 B"
It has to be a nested vector; [1] won't do it, nor will [()]. It's something to do with into, but I haven't figured out what.
(let [x [[:a]]]
  (prn (mm/measure x))
  (prn (mm/measure (into [] x))))
Also has the memory leak.
into does transient->persistent, might affect what's held in fields
I added a patch to the CLJ-2594 ticket, tested with collection-check and a few small manual tests, plus a link to a small git repo I made with code for drawing pictures of various persistent and transient vectors that helps to see the behavior before and after.
@andy.fingerhut thank you for digging into this.
No worries. It was something I may have even noticed before and wondered about during my digging into the implementation of persistent and transient vectors a while ago, so it's right in my wheelhouse
So I have a mixed project consisting of Java/Clojure code. It works wonderfully on Linux, but on Windows I run into trouble with the Java bit.
(defproject guidescan-web "2.0"
  :description "Version 2.0 of the Guidescan website."
  :url "http://guidescan.com/"
  :author "Henri Schmidt"
  :dependencies [[org.clojure/clojure "1.10.1"]
                 [org.clojure/data.csv "1.0.0"]
                 [com.github.samtools/htsjdk "2.23.0"]
                 [com.stuartsierra/component "1.0.0"]
                 [http-kit "2.4.0"]
                 [compojure "1.6.2"]
                 [ring/ring-defaults "0.3.2"]
                 [cheshire "5.10.0"]
                 [failjure "2.0.0"]
                 [selmer "1.12.28"]
                 [com.taoensso/timbre "4.10.0"]
                 [org.clojure/tools.cli "1.0.194"]
                 [seancorfield/next.jdbc "1.1.610"]
                 [honeysql "1.0.444"]
                 [com.h2database/h2 "1.4.197"]
                 [org.postgresql/postgresql "42.2.5"]
                 [com.mchange/c3p0 "0.9.5.5"]
                 [ring "1.8.1"]]
  :profiles {:uberjar {:aot :all}
             :dev {:resource-paths ["test/resources"]
                   :source-paths ["src" "dev" "test"]
                   :java-source-paths ["src/java"]
                   :javac-options ["-target" "1.8" "-source" "1.8"]
                   :repl-options {:init-ns reload}
                   :dependencies [[org.clojure/tools.namespace "1.0.0"]]}
             :prod {:main guidescan-web.core
                    :jar-name "guidescan.jar-THIN"
                    :uberjar-name "guidescan.jar"
                    :java-source-paths ["src/java"]
                    :javac-options ["-target" "1.8" "-source" "1.8"]}
             ;; Scripts for various maintenance tasks
             :create-gene-db {:source-paths ^:replace ["scripts"]
                              :uberjar-name "create-gene-db.jar"
                              :jar-name "create-gene-db.jar-THIN"
                              :main create-gene-db}
             :add-organism {:source-paths ^:replace ["scripts"]
                            :uberjar-name "add-organism.jar"
                            :jar-name "add-organism.jar-THIN"
                            :main add-organism}})
Anyone have any ideas?
you haven't said what the trouble is
Sorry, let me attach that.
Caused by: java.lang.ClassNotFoundException: guidescan.algo.RabinKarp
at java.base/java.net.URLClassLoader.findClass(URLClassLoader.java:471)
at clojure.lang.DynamicClassLoader.findClass(DynamicClassLoader.java:69)
at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:588)
at clojure.lang.DynamicClassLoader.loadClass(DynamicClassLoader.java:77)
at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:521)
at java.base/java.lang.Class.forName0(Native Method)
at java.base/java.lang.Class.forName(Class.java:398)
at clojure.lang.RT.classForName(RT.java:2211)
at clojure.lang.RT.classForNameNonLoading(RT.java:2224)
at guidescan_web.genomics.resolver$eval5366$loading__6721__auto____5367.invoke(resolver.clj:1)
at guidescan_web.genomics.resolver$eval5366.invokeStatic(resolver.clj:1)
at guidescan_web.genomics.resolver$eval5366.invoke(resolver.clj:1)
at clojure.lang.Compiler.eval(Compiler.java:7177)
at clojure.lang.Compiler.eval(Compiler.java:7166)
at clojure.lang.Compiler.load(Compiler.java:7636)
... 146 more
Suppressed exit
Essentially, I get a stacktrace telling me that it can't find my Java code.
But I have explicitly specified the java source paths, and everything works on linux.
are there any classfiles from your java sources?
No. They do not appear to be getting compiled.
so on windows it doesn't seem to be honoring your java source paths. if you delete those class files on the linux machine does it still work?
@dpsutton let me check, but I imagine it would simply recompile them.
@dpsutton the issue I think is that it's not honoring my java source paths.
so it failed to compile them on linux? if so we've eliminated the os from the problem (or at least the problem as far as you've gotten so far)
But I make sure to run lein with-profile dev repl explicitly.
No it did not fail to compile them on linux.
Okay so I think I got it working... Still it was rather strange behaviour.
See: Having one source root contain another (e.g. src and src/java) can cause obscure problems.
from https://github.com/technomancy/leiningen/blob/stable/doc/MIXED_PROJECTS.md#polyglot-clojure-java-projects-with-leiningen
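the doc's suggested fix is to keep the roots disjoint, e.g. (illustrative paths):
;; in project.clj: "src/clojure" instead of "src", so no root contains another
:source-paths      ["src/clojure"]
:java-source-paths ["src/java"]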
That's exactly the problem. A new root is being established, which is only for that vector. So it's wasting loads of memory (we have ~500,000 vectors in this data structure). Obviously we shouldn't be postwalking it, but now I'm determined to dig into why there's a leak.
(persistent! (transient [[]])) is bigger than [[]] by 184 bytes.
@alexmiller does this warrant a jira?
https://clojure.atlassian.net/browse/CLJ-2541 maybe there's already one, but it was concerned with CPU and not Mem.
why is it bigger?
tracking owner thread?
Because it clones the empty root array, whereas before it pointed at the common one in clojure.lang.PersistentVector/EMPTY.
So, essentially /EMPTY is being cloned, creating arbitrary overhead.
https://github.com/clojure/clojure/blob/f47c139e20970ee0852166f48ee2a4626632b86e/src/jvm/clojure/lang/PersistentVector.java#L534 calls https://github.com/clojure/clojure/blob/f47c139e20970ee0852166f48ee2a4626632b86e/src/jvm/clojure/lang/PersistentVector.java#L557 which clones the old root.
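you can see the cloned root by poking at the internals (field access here is an implementation detail, just for illustration):
(identical? (.root ^clojure.lang.PersistentVector [[]])
            (.root ^clojure.lang.PersistentVector [[]]))
;; => true, fresh small vectors share EMPTY_NODE
(identical? (.root ^clojure.lang.PersistentVector [[]])
            (.root ^clojure.lang.PersistentVector (persistent! (transient [[]]))))
;; => false, the round-tripped vector got a cloned root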
so this is specific only to the case of an empty vector?
seems like a non-empty vector wouldn't have the benefit of reusing the special EMPTY vector - do you see that in comparing (persistent! (transient [[1]])) and [[1]] ?
user=> (mm/measure (persistent! (transient [[1]])))
"520 B"
user=> (mm/measure [[1]])
"336 B"
Yeah. It's more general than empty vectors. We picked up on this because we have keys like [:a [:b :c]] in maps.
I guess I don't understand your explanation above then
Maybe I'm not understanding well. For [1] there's a root set of EMPTY. I assume vectors are red/black tries or something?
pretty sure the cloning is important to avoid having the new persistent thing refer into the old transient thing
Yeah, I think so.
not red/black, this is the same HAMT thing used in maps
My understanding of that part is vague hand wavey stuff :)
EMPTY was wrong. It's actually EMPTY_NODE.
So, [[]] has a root set to clojure.lang.PersistentVector/EMPTY_NODE. But the one that's been transient'd has a "brand new root".
FYI, if you ever want to dig into the details of these calculations, with pictures of the JVM objects involved and pointers between them, there is this tool: https://github.com/jafingerhut/cljol
@dominicm ^^^
If I do (def a [[]]) and (def b [[]]), both of those vectors will share a root, EMPTY_NODE. Once it goes through a transient/persistent cycle, there's a new root.
Which has been cloned.
Oh, that would have been super useful.
Although, this is a 350MB data structure when it hasn't been exploded by postwalk, so it may still have needed some help :p
You can also give it a sequence of "root objects", and it will visually show any same-object-in-memory sub-structures shared between them as the same node in the drawing, versus different objects in memory are always different nodes in the drawing.
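e.g. (from memory, check the cljol README for the exact API):
(require '[cljol.dig9 :as d])
;; draw two values' object graphs; any structure shared between them in memory
;; shows up as a single node in the drawing
(d/view [[[]] (persistent! (transient [[]]))])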
Seems like there's overlap with visualvm, but REPL friendly?
I've been wanting a way to "send" something to visualvm while working on this.
I've resorted to creating a custom type to hold what I'm trying to measure.
Yeah, cljol is best for small structures that you want to see all of. It is possible to tweak the graph it constructs in memory to prune it to depth 2, depth 3, etc., or whatever kinds of pruning you want, but you typically don't ever want to draw a graph with hundreds of JVM objects.
yeah, I get it. it's cloned at that point so it can be potentially edited in place. the constraints around time to do the p->t or t->p changes are important too. seems like you could potentially detect this case on t->p and use EMPTY_NODE
but it's a judgement call whether the perf of that op or the loss of the memory optimization are more important
There might be overlap with visualvm -- cljol was something I put some hobby time into hacking on, for fun, without attempting to compare it to other tools.
would be ok by me to file a ticket. I guess for it to rise up the awareness list I'd want to know how it's actually affecting you. presumably you have these in volume for it to be mattering
Yeah. I get that. I can raise a jira. We can fix this short term by migrating our keys to lists rather than vectors, and probably drop postwalk long-term.
This causes our data structure to grow from 373.1MiB to 842.3MiB. It's probably got around 640,000 vectors in it.
We've been having OOMs in prod and tracked it back to this.
well that's the kind of info that will catch Rich's eye, so definitely include that
Please send me a link to the JIRA with any steps you have to reproduce. I have looked at Clojure vector and transient implementations quite a bit, and may have seen this before, or at least be in a position to quickly analyze it.
I guess I'm still a little puzzled as to what's different in the non-empty case
you don't need to figure that out to log the jira, but include that info above on [[1]]
I'm puzzled about that too. It's a bit beyond me :p
https://clojure.atlassian.net/browse/CLJ-2594 @andy.fingerhut :)
The repro is very small :)
there are only diffs up to [[ .. 32 elem ]]; >32 is the same
Probably changes the root at that point.
I'm still surprised nobody has started a community fork of core.async (or a whole other alternative to it), considering everyone always complains about its slow pace of change and lack of higher-level error handling and all.
on the other hand maybe going slow is good. I am still convinced there is a leak there in alts!, but I've been trying to show a difference in memory usage before and after a patch, and I am having trouble getting anything repeatable. which could be any number of things: 1. no leak 2. poor data collection 3. poor test case 4. the jvm is optimizing the differences away somehow 5. the overhead of the timeout structures is dwarfing the leak of data via alts