clojure

New to Clojure? Try the #beginners channel. Official docs: https://clojure.org/ Searchable message archives: https://clojurians-log.clojureverse.org/
paulocuneo 2020-12-03T01:13:34.489600Z

dunno, but this may be useful https://cloud.google.com/functions/docs/concepts/java-deploy#deploy_from_a_jar

devn 2020-12-03T07:18:40.496500Z

Observability for Clojure applications, state of the art: What is it?

Ben Sless 2020-12-03T09:02:18.498300Z

https://github.com/metrics-clojure/metrics-clojure for the Dropwizard library

simongray 2020-12-03T09:12:15.498600Z

#observability

2020-12-03T10:58:09.498800Z

is there a similar facade for micrometer?

ghadi 2020-12-03T14:46:46Z

I love Honeycomb for tracing

ghadi 2020-12-03T15:13:52.000900Z

Micrometer is ok for metrics

ghadi 2020-12-03T15:14:00.001400Z

OpenTelemetry metrics are too new, but the tracing side is ok

ghadi 2020-12-03T15:15:43.002Z

Dropwizard/CodaHale are showing age

mpenet 2020-12-03T16:18:04.002200Z

honeycomb is quite nice, but it can get expensive at scale. But it might not matter in that case.

2020-12-03T23:24:31.040900Z

JMX ?

mpenet 2020-12-03T08:33:36.497Z

micrometer is a good solution, it's quite easy to work with via interop

mpenet 2020-12-03T08:38:01.497800Z

if you want to go the extra mile you can also use something like opentelemetry

Elad Gur 2020-12-03T16:24:54.005300Z

Hey, What is the correct way to add a timeout to channels? I've tried to add a naive implementation using an async/timeout but got huge spikes with memory increase Then found out that I'm not the only one to struggle this (see here https://clojure.atlassian.net/browse/ASYNC-225)

ghadi 2020-12-03T16:42:40.005600Z

can you post your naive implementation?

Elad Gur 2020-12-04T17:40:03.051500Z

(defmacro <?t
  "Read a value from a channel (using <!) and if the value is Throwable, throw it."
  ([channel]
   `(<?t ~channel default-take-timeout))
  ([channel timeout-msecs]
   `(let [in-chan# ~channel
          timeout-chan# (async/timeout ~timeout-msecs)
          [v# c#] (async/alts! [in-chan# timeout-chan#])]
      (if (= c# in-chan#)
        (<?* v#)
        (throw (ex-info timeout-msg {:cause timeout-msg
                                     :timeout ~timeout-msecs}))))))

2020-12-03T16:58:09.007400Z

There is a #core-async channel, but timeout channels and alt is the way to do it

2020-12-03T17:01:00.008500Z

async-225 is fairly trivial to work around (async/go (async/<! (async/timeout whatever)))

2020-12-03T17:05:59.008900Z

is it tho?

2020-12-03T17:06:21.009100Z

does timeout chan sharing only happen in a single go block?

2020-12-03T17:07:20.009300Z

the underlying issue is that prolifically using timeouts increase CPU overhead

2020-12-03T17:07:32.009500Z

and you cannot close timeouts because they are shared

2020-12-03T17:07:54.009700Z

you are misunderstanding

2020-12-03T17:08:21.009900Z

the channel returned from (async/go (async/<! (async/timeout whatever))) is a fresh channel that while close whenever the timeout does

2020-12-03T17:09:41.010100Z

I doubt prolific timeout usage is the root cause of your increased cpu usage

2020-12-03T17:10:31.010300Z

or I should say, I doubt it is the sharing of timeouts that is the root cause

2020-12-03T17:11:22.010500Z

my guess is it is the fundamental global nature of timeouts keeping objects alive, filling your heap, causing the gc run all the time spiking your cpu usage

2020-12-03T17:11:57.010700Z

Yeah, I’ve not looked into it, but the ticket says that timeout checks caused CPU usage.

2020-12-03T17:12:07.010900Z

(I hope that’s not the case, but that’s what it says.)

2020-12-03T17:14:12.011100Z

the thing about timeout channels is there is a global singleton scheduler that is doing the timing out, so it keeps alive a reference to each timeout channel until it closes

2020-12-03T17:14:47.011300Z

and the nature of go blocks is the code is transformed into callbacks are registered on channels

2020-12-03T17:15:40.011500Z

so if you wait on a timeout you attach a closure to the timeout that closes over all the data your code references

2020-12-03T17:17:44.011700Z

and that callback may still be live (reachable) because of the timeout even when you stop waiting on the timeout, because of how channels do bookkeeping

2020-12-03T17:18:32.012Z

so my guess is the issue both you, and the guy in the ticket are having, has nothing to do with the shared nature of timeouts, and everything with the gc implications

2020-12-03T17:20:12.012300Z

at work are have custom sort of weak reference channel implementation we use to help combat this kind of thing

2020-12-03T17:21:30.012500Z

and in fact towards the end of the ticket they even talk about it alleviating gc pressure

2020-12-03T17:21:59.012700Z

yeah, theres a seeming "inversion", where the channel becomes the gc root holding onto all the objects in the code waiting on the channel. it is clearly counterintuitive, and I could see how that could easily lead to misunderstanding the source of a problem like this

2020-12-03T17:45:16.012900Z

So the sharing comes into play if you try to manually clean up for yourself.

2020-12-03T17:46:37.013100Z

It’s not clear to me what the actual underlying problem is. As you say, it could be GC. However, if you bork the timeout check, you can introduce redundant checks in the DelayQueue.

2020-12-03T17:46:46.013300Z

(I know because I’ve done it.)

2020-12-03T17:47:14.013500Z

They say they confirmed that timeout checks were a problem. I have no idea if that’s true.

2020-12-03T17:49:11.013700Z

it is absolutely the gc, because the cpu usage from increased gc is what caused them to start closing shared timeout channels in the first place, which introduced further breakage

2020-12-03T17:49:52.013900Z

they probably assumed closing the chan would stop the checking?

2020-12-03T17:52:18.014100Z

but why did they looking for a problem to solve in the first place?

2020-12-03T17:52:40.014300Z

because they had metrics somewhere that showed increased cpu usage

2020-12-03T17:53:02.014600Z

so they started doing things like closing channels, and looking at the delayqueue, etc

2020-12-03T17:53:04.014800Z

> Using timeout channels (with long timeouts of 2 minutes) without closing them solved our core problem (too many TCP sockets in CLOSE_WAIT state) > but immediately caused the application to use approximately 10%+ CPU and way more memory than before.

2020-12-03T17:53:16.015Z

which is in line with gc probs

2020-12-03T17:53:24.015200Z

> After analyzing the problem, we concluded that leaving the timeout channels as is resulted in them continuously executing the code that checks if their time had passed, even if the “safe take” go block had returned

2020-12-03T17:53:36.015400Z

which is a dubious claim, but not impossible

2020-12-03T17:55:10.015700Z

yeah, the whole root cause analysis in that ticket doesn't go deep enough and is bound up way to much in what they already decided is a solution based on how they go there

2020-12-03T17:55:30.015900Z

Agreed. I would be interested to know how they arrived at that claim.

2020-12-03T17:56:03.016100Z

the issue is alts attach a closure to all the channels in the alts, and don't get rid of the reference once one of the alternatives is committed to

2020-12-03T17:56:53.016300Z

yeah, agreed

2020-12-03T17:58:43.016500Z

I just wrote a little patch that I believe fixes that and left it as a comment on the ticket, but I'll need to remember how the contribution process works to attach a real patch and then lean on alex for a few months to a year to get the patch in to a new core.async release

2
1
😆 1
dominicm 2020-12-03T20:11:14.017400Z

woah:

(mm/measure (map inc (take 100 (range)))) => 232B
(mm/measure (walk/postwalk identity (map inc (take 100 (range))))) => "8.6 KiB"
This was an interesting discovery. Not sure what to make of it. mm is clj-memory-meter.

2020-12-03T20:13:22.018400Z

map returns a lazy seq, and range has a specifically optimized lazy seq, walking a lazy-seq realizes it

dominicm 2020-12-03T20:15:07.018900Z

Oh. The map hasn't been realized at that point. I see.

dominicm 2020-12-03T20:15:14.019100Z

Duh, needs more doall.

dominicm 2020-12-03T20:15:59.019600Z

Feels less mind blowing now. Lazy sequences always ruin profiling :)

2020-12-03T20:18:33.020500Z

Hadn’t heard of clj-mm, this is cool

dominicm 2020-12-03T20:21:42.020700Z

It's been helping me out with a production issue :). We handle analytical-style data and clojure.walk is somehow "leaking" memory. I thought I had a repro, but evidently not.

dominicm 2020-12-03T20:42:48.021200Z

Okay, this is actually it:

(let [x [[]]]
    (prn (mm/measure x)) => "304 B"
    (prn (mm/measure (walk/postwalk identity x)))) => "488 B"
It has to be a nested vector, [1] won't do it, nor will [()]. It's something to do with into, but I haven't figured out what.

dominicm 2020-12-03T20:45:33.021300Z

(let [x [[:a]]]
    (prn (mm/measure x))
    (prn (mm/measure (into [] x))))
Also has the memory leak.

alexmiller 2020-12-03T20:56:06.023300Z

into does transient->persistent, might affect what's held in fields

2020-12-04T15:24:30.049700Z

I added a patch that I've used collection-check and a few small manual tests to test, to the CLJ-2594 ticket, plus a link to a small git repo I made with code for drawing pictures of various persistent and transient vectors that help to see the behavior before and after.

dominicm 2020-12-04T21:22:47.055100Z

@andy.fingerhut thank you for digging into this.

2020-12-04T21:23:43.055300Z

No worries. It was something that I may have even noticed before and wondered about, in my digging into the implementation of persistent and transient vectors a while ago, so right up there in my wheel house

schmidt73 2020-12-03T20:56:44.023700Z

So I have a mixed project consisting of Java/Clojure code. It works wonderfully on Linux, but on Windows I run into trouble with the Java bit.

schmidt73 2020-12-03T20:57:15.024100Z

(defproject guidescan-web "2.0"
  :description "Version 2.0 of the Guidescan website."
  :url "<http://guidescan.com/>"
  :author "Henri Schmidt"
  :dependencies [[org.clojure/clojure "1.10.1"]
                 [org.clojure/data.csv "1.0.0"]
                 [com.github.samtools/htsjdk "2.23.0"]
                 [com.stuartsierra/component "1.0.0"]
                 [http-kit "2.4.0"]
                 [compojure "1.6.2"]
                 [ring/ring-defaults "0.3.2"]
                 [cheshire "5.10.0"]
                 [failjure "2.0.0"]
                 [selmer "1.12.28"]
                 [com.taoensso/timbre "4.10.0"]
                 [org.clojure/tools.cli "1.0.194"]
                 [seancorfield/next.jdbc "1.1.610"]
                 [honeysql "1.0.444"]
                 [com.h2database/h2 "1.4.197"]
                 [org.postgresql/postgresql "42.2.5"]
                 [com.mchange/c3p0 "0.9.5.5"]
                 [ring "1.8.1"]]
  :profiles {:uberjar {:aot :all}
             :dev {:resource-paths ["test/resources"]
                   :source-paths ["src" "dev" "test"]
                   :java-source-paths ["src/java"]
                   :javac-options ["-target" "1.8" "-source" "1.8"]
                   :repl-options {:init-ns reload}
                   :dependencies [[org.clojure/tools.namespace "1.0.0"]]}
             :prod {:main guidescan-web.core
                    :jar-name "guidescan.jar-THIN"
                    :uberjar-name "guidescan.jar"
                    :java-source-paths ["src/java"]
                    :javac-options ["-target" "1.8" "-source" "1.8"]}
             ;;;; Scripts for various maintenance tasks
             :create-gene-db {:source-paths ^:replace ["scripts"]
                              :uberjar-name "create-gene-db.jar"
                              :jar-name "create-gene-db.jar-THIN"
                              :main create-gene-db}
             :add-organism   {:source-paths ^:replace ["scripts"]
                              :uberjar-name "add-organism.jar"
                              :jar-name "add-organism.jar-THIN"
                              :main add-organism}})

schmidt73 2020-12-03T20:58:03.024300Z

Anyone have any ideas?

alexmiller 2020-12-03T20:59:26.024500Z

you haven't said what the trouble is

schmidt73 2020-12-03T20:59:39.024700Z

Sorry, let me attach that.

schmidt73 2020-12-03T21:00:04.025200Z

Caused by: java.lang.ClassNotFoundException: guidescan.algo.RabinKarp
        at java.base/java.net.URLClassLoader.findClass(URLClassLoader.java:471)
        at clojure.lang.DynamicClassLoader.findClass(DynamicClassLoader.java:69)
        at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:588)
        at clojure.lang.DynamicClassLoader.loadClass(DynamicClassLoader.java:77)
        at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:521)
        at java.base/java.lang.Class.forName0(Native Method)
        at java.base/java.lang.Class.forName(Class.java:398)
        at clojure.lang.RT.classForName(RT.java:2211)
        at clojure.lang.RT.classForNameNonLoading(RT.java:2224)
        at guidescan_web.genomics.resolver$eval5366$loading__6721__auto____5367.invoke(resolver.clj:1)
        at guidescan_web.genomics.resolver$eval5366.invokeStatic(resolver.clj:1)
        at guidescan_web.genomics.resolver$eval5366.invoke(resolver.clj:1)
        at clojure.lang.Compiler.eval(Compiler.java:7177)
        at clojure.lang.Compiler.eval(Compiler.java:7166)
        at clojure.lang.Compiler.load(Compiler.java:7636)
        ... 146 more
Suppressed exit

schmidt73 2020-12-03T21:00:20.025600Z

Essentially, I get a stacktrace telling me that it can't find my Java code.

schmidt73 2020-12-03T21:00:44.026Z

But I have explicitly specified it and everything is working on linux.

dpsutton 2020-12-03T21:03:02.026300Z

are there any classfiles from your java sources?

schmidt73 2020-12-03T21:03:27.026600Z

No. They do not appear to be getting compiled.

dpsutton 2020-12-03T21:04:15.027200Z

so on windows it doesn't seem to be honoring your java source paths. if you delete those class files on the linux machine does it still work?

schmidt73 2020-12-03T21:04:58.027500Z

@dpsutton let me check, but I imagine it would simply recompile them.

schmidt73 2020-12-03T21:05:22.028300Z

@dpsutton the issue I think is that it's not honoring my java source paths.

dpsutton 2020-12-03T21:05:51.029300Z

so it failed to compile them on linux? if so we've eliminated the os from the problem (or at least the problem as far as you've gotten so far)

schmidt73 2020-12-03T21:05:59.029400Z

But I make sure to run lein with-profile dev repl explicitly.

schmidt73 2020-12-03T21:06:04.029600Z

No it did not fail to compile them on linux.

schmidt73 2020-12-03T21:14:26.030100Z

Okay so I think I got it working... Still it was rather strange behaviour.

schmidt73 2020-12-03T21:14:38.030400Z

See: Having one source root contain another (e.g. src and src/java) can cause obscure problems. from https://github.com/technomancy/leiningen/blob/stable/doc/MIXED_PROJECTS.md#polyglot-clojure-java-projects-with-leiningen

dominicm 2020-12-03T21:15:53.030600Z

That's exactly the problem. A new root is being established, which is only for that vector. So it's wasting loads of memory (we have ~500,000 vectors in this data structure). Obviously we shouldn't be postwalking it, but now I'm determined to dig into why there's a leak.

dominicm 2020-12-03T21:21:11.030800Z

(persistent! (transient [[]])) is bigger than [[]] by 184 Bytes.

dominicm 2020-12-03T21:21:35.031Z

@alexmiller does this warrant a jira?

dominicm 2020-12-03T21:24:06.031200Z

https://clojure.atlassian.net/browse/CLJ-2541 maybe there's already one, but it was concerned with CPU and not Mem.

alexmiller 2020-12-03T21:25:32.031400Z

why is it bigger?

alexmiller 2020-12-03T21:25:55.031600Z

tracking owner thread?

dominicm 2020-12-03T21:27:07.031800Z

Because it clones the empty root array, whereas before it pointed at the common one in clojure.lang.PersistentVector/EMPTY.

dominicm 2020-12-03T21:27:29.032Z

So, essentially /EMPTY is being cloned creating arbitrary overhead.

alexmiller 2020-12-03T21:39:04.032600Z

so this is specific only to the case of an empty vector?

alexmiller 2020-12-03T21:41:23.032800Z

seems like a non-empty vector wouldn't have the benefit of reusing the special EMPTY vector - do you see that in comparing (persistent! (transient [[1]])) and [[1]] ?

dominicm 2020-12-03T21:45:11.033Z

user=&gt; (mm/measure (persistent! (transient [[1]])))
"520 B"
user=&gt; (mm/measure [[1]])
"336 B"
Yeah. It's more general than empty vectors. We picked up on this because we have keys like [:a [:b :c]] in maps.

alexmiller 2020-12-03T21:46:05.033200Z

I guess I don't understand your explanation above then

dominicm 2020-12-03T21:47:00.033400Z

Maybe I'm not understanding well. For [1] there's a root set of EMPTY. I assume vectors are red/black tries or something?

alexmiller 2020-12-03T21:47:09.033600Z

pretty sure the cloning is important to avoid having the new persistent thing refer into the old transient thing

dominicm 2020-12-03T21:47:36.033900Z

Yeah, I think so.

alexmiller 2020-12-03T21:48:01.034100Z

not red/black, this is the same HAMT thing used in maps

dominicm 2020-12-03T21:48:29.034300Z

My understanding of that part is vague hand wavey stuff :)

dominicm 2020-12-03T21:50:49.034500Z

EMPTY was wrong. It's actually EMPTY_NODE.

dominicm 2020-12-03T21:51:59.034700Z

So, [[]] has a root set to clojure.lang.PersistentVector/EMPTY_NODE. But the one that's been transient'd has a "brand new root".

2020-12-03T21:52:44.034900Z

FYI, if you ever want to dig into the details of these calculations, with pictures of the JVM objects involved and pointers between them, there is this tool: https://github.com/jafingerhut/cljol

2020-12-03T21:53:00.035300Z

@dominicm ^^^

dominicm 2020-12-03T21:53:33.035500Z

If I do (def a [[]]) and (def b [[]]) both of those vectors will share a root, EMPTY_NODE. Once it goes through a transient/persistent cycle, there's a new root.

dominicm 2020-12-03T21:53:41.035700Z

Which has been cloned.

dominicm 2020-12-03T21:54:35.035900Z

Oh, that would have been super useful.

dominicm 2020-12-03T21:55:14.036100Z

Although, this is a 350MB data structure when it hasn't been exploded by postwalk, so it may still have needed some help :p

2020-12-03T21:55:27.036300Z

You can also give it a sequence of "root objects", and it will visually show any same-object-in-memory sub-structures shared between them as the same node in the drawing, versus different objects in memory are always different nodes in the drawing.

dominicm 2020-12-03T21:55:47.036500Z

Seems like there's overlap with visualvm, but REPL friendly?

dominicm 2020-12-03T21:55:56.036700Z

I've been wanting a way to "send" something to visualvm while working on this.

dominicm 2020-12-03T21:56:09.036900Z

I've resorted to creating a custom type to hold what I'm trying to measure.

2020-12-03T21:56:18.037100Z

Yeah, cljol is best for small structures that you want to see all of. It is possible to tweak the graph it constructs in memory to prune it to depth 2, depth 3, etc., or whatever kinds of pruning you want, but you typically don't ever want to draw a graph with hundreds of JVM objects.

alexmiller 2020-12-03T21:56:24.037300Z

yeah, I get it. it's cloned at that point so it can be potentially edited in place. the constraints around time to do the p->t or t->p changes are important too. seems like you could potentially detect this case on t->p and use EMPTY_NODE

alexmiller 2020-12-03T21:57:03.037500Z

but it's a judgement call whether the perf of that op or the loss of the memory optimization are more important

2020-12-03T21:57:07.037700Z

There might be overlap with visualvm -- cljol was something I put some hobby time into hacking on, for fun, without attempting to compare it to other tools.

alexmiller 2020-12-03T21:58:19.037900Z

would be ok by me to file a ticket. I guess for it to rise up the awareness list I'd want to know how it's actually affecting you. presumably you have these in volume for it to be mattering

dominicm 2020-12-03T21:58:27.038100Z

Yeah. I get that. I can raise a jira. We can fix this short term by migrating our keys to lists rather than vectors, and probably drop postwalk long-term.

dominicm 2020-12-03T21:59:33.038300Z

This causes our data structure to grow from 373.1MiB to 842.3MiB. It's probably got around 640,000 vectors in it.

dominicm 2020-12-03T21:59:59.038500Z

We've been having OOMs in prod and tracked it back to this.

alexmiller 2020-12-03T22:00:25.038700Z

well that's the kind of info that will catch Rich's eye, so definitely include that

2020-12-03T22:00:47.038900Z

Please send me a link to the JIRA with any steps you have to reproduce. I have looked at Clojure vector and transient implementations quite a bit, and may have seen this before, or at least be in a position to quickly analyze it.

alexmiller 2020-12-03T22:01:23.039100Z

I guess I'm still a little puzzled as to what's different in the non-empty case

alexmiller 2020-12-03T22:02:58.039300Z

you don't need to figure that out to log the jira, but include that info above on [[1]]

dominicm 2020-12-03T22:03:25.039500Z

I'm puzzled about that too. It's a bit beyond me :p

dominicm 2020-12-03T22:12:19.039800Z

https://clojure.atlassian.net/browse/CLJ-2594 @andy.fingerhut :)

dominicm 2020-12-03T22:12:24.040Z

The repro is very small :)

alexmiller 2020-12-03T22:18:22.040200Z

there are only diffs up to [[ .. 32 elem ]]

alexmiller 2020-12-03T22:18:29.040400Z

>32 is same

dominicm 2020-12-03T22:18:48.040600Z

Probably changes the root at that point.

2020-12-03T23:34:04.041200Z

I'm still surprised nobody started a community fork of core.async (or a whole other alternative to it). Considering I feel everyone always complain about its slow pace of change and lack of higher level error handling and all.

2020-12-03T23:43:38.041500Z

on the other hand maybe going slow is good, I am still convinced there is a leak there in alts!, but I've been trying to show a difference in memory usage before and after a patch, and I am having trouble getting anything repeatable. which could be any number of things 1. no leak 2. poor data collection 3. poor test case 4. the jvm is optimization the differences away somehow 5. the overhead of the timeout structures is dwarfing the leak of data via alts