core-async

jeaye 2020-09-27T18:48:09.004300Z

Are there any core.async-oriented Clojure profiling tools for ongoing profiling?

jeaye 2020-09-27T18:48:33.004900Z

Seems like tufte may be the best I've found, but it doesn't have first class support for core.async.

Ben Sless 2020-09-27T19:05:53.005700Z

What capabilities are you missing? I'm usually satisfied with clj-async-profiler and visualVM @jeaye

jeaye 2020-09-27T19:18:48.006400Z

@ben.sless I haven't used the former, but its documentation seems to be focused on benchmarking, not profiling.

jeaye 2020-09-27T19:19:01.006700Z

i.e. more criterium than tufte.

Ben Sless 2020-09-27T19:20:36.008800Z

The profiler's output is a flame graph of %CPU time on X axis and stack on Y axis. I find it very useful, usually

jeaye 2020-09-27T19:20:36.008900Z

For VisualVM, I think it could work, but things are complicated by the fact that I want to profile a staging environment within AWS.

Ben Sless 2020-09-27T19:21:19.009200Z

Ah, been there πŸ™‚

Ben Sless 2020-09-27T19:21:33.009700Z

You need to inject a bunch of flags into your run command

jeaye 2020-09-27T19:21:51.010300Z

Actually, might be the same issue with clj-async-profiler. For tufte, everything is done in-process and can then be exposed through an HTTP endpoint.
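
For reference, a minimal sketch of that in-process tufte pattern, following tufte's accumulating-handler example; handle-message!, parse, and process below are made-up names:

(require '[taoensso.tufte :as tufte])

;; Accumulate pstats from every `profile` call so an HTTP handler can report them later.
(defonce stats-accumulator (tufte/add-accumulating-handler! {:ns-pattern "*"}))

(defn handle-message! [msg]
  (tufte/profile {:dynamic? true} ; dynamic binding so `p` works across go-block threads
    (tufte/p :parse (parse msg))
    (tufte/p :process (process msg))))

;; Ring-style handler exposing the accumulated stats as text.
(defn stats-handler [_request]
  {:status 200
   :headers {"Content-Type" "text/plain"}
   :body (tufte/format-grouped-pstats @stats-accumulator)})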

jeaye 2020-09-27T19:22:11.010900Z

That was some of the appeal, I think.

Ben Sless 2020-09-27T19:22:13.011100Z

It's a bother, but I have done both. Async profiler can also be exposed as an endpoint

Ben Sless 2020-09-27T19:23:42.011900Z

If you have a flag or environment variable which lets you know you're in a staging environment,

Ben Sless 2020-09-27T19:24:11.012500Z

I recently wrapped one service with a run script:

Ben Sless 2020-09-27T19:24:24.013Z

#!/usr/bin/env bash

opts="-server -Xmx5g -Xms5g"

if [ "$ENVTYPE" = "prod" ]; then
    echo "running in prod"
else
    # outside prod, add the flags async-profiler needs for accurate stacks
    opts="$opts -XX:+PreserveFramePointer -XX:+UnlockDiagnosticVMOptions -XX:+DebugNonSafepoints"
fi

jar=path/to/your.jar
cmd="java $opts -jar $jar $*"
echo running "$cmd"
exec $cmd

Ben Sless 2020-09-27T19:24:38.013400Z

You can inject whatever arguments you'd like, that way

Ben Sless 2020-09-27T19:24:57.013800Z

We're using it to run async-profiler; we expose it as an endpoint

jeaye 2020-09-27T19:25:47.014500Z

Thank you. πŸ™‚ We have a whole init setup with docker/ECR, so I can plug stuff into there.

jeaye 2020-09-27T19:26:15.015200Z

Out of curiosity and thoroughness, you've used VisualVM and clj-async-profiler. Have you used tufte?

Ben Sless 2020-09-27T19:26:44.015700Z

Not yet, I have found them adequate for all my use cases until now

jeaye 2020-09-27T19:28:26.016700Z

Seems like ECS doesn't set perf_event_paranoid, so that may rule async-profiler out.

Ben Sless 2020-09-27T19:28:27.016800Z

From what I see tufte is used to profile a specific piece of code, not the entire JVM process?

jeaye 2020-09-27T19:29:05.017Z

Right.

Ben Sless 2020-09-27T19:29:43.017300Z

For that you can consider using jmh

Ben Sless 2020-09-27T19:30:22.017700Z

https://github.com/jgpc42/jmh-clojure

Ben Sless 2020-09-27T19:30:59.018300Z

It's more oriented towards benchmarking, but the results are more detailed, like tufte's

jeaye 2020-09-27T19:31:14.018800Z

Yeah, benchmarking isn't what I'm after.

jeaye 2020-09-27T19:31:26.019300Z

I did see jmh-clojure in my research.

Ben Sless 2020-09-27T19:31:37.019800Z

Trying to understand which features exactly you're looking for, what you want to check, etc.

jeaye 2020-09-27T19:32:34.021Z

I think either whole-JVM-process profiling for exploratory flame graphs, or tufte's approach, where I add profiling to specific parts of the code to see how long things take. The former would be better for finding surprises; the latter would be better for timing things I already know I want timed.

jeaye 2020-09-27T19:33:03.021400Z

I wouldn't be lying if I said both sounded good. πŸ™‚

Ben Sless 2020-09-27T19:33:05.021500Z

Ah! Why not just use metrics?

jeaye 2020-09-27T19:33:22.022Z

I am using prometheus + grafana for metrics.

Ben Sless 2020-09-27T19:33:46.022800Z

So, wrapping the piece of code under test in a timer wouldn't give you the same amount of information as tufte?

jeaye 2020-09-27T19:35:35.023800Z

For tufte, they'd be quite similar, but rendered differently and tufte would be more convenient, rather than timing myself and pushing it into prometheus.
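
As a rough illustration of the manual-timer alternative being compared here; record-timing! below is a made-up hook for whatever metrics registry is in use:

;; Times a body and records the elapsed milliseconds under the given key.
(defmacro timed [k & body]
  `(let [start#  (System/nanoTime)
         result# (do ~@body)]
     (record-timing! ~k (/ (- (System/nanoTime) start#) 1e6)) ; record-timing! is hypothetical
     result#))

;; e.g. (timed :handler/parse (parse msg))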

jeaye 2020-09-27T19:36:04.024700Z

I suppose the biggest gain, now that we talk it all out, would be from whole-JVM process flame graph exploration.

Ben Sless 2020-09-27T19:36:25.025100Z

I see. It's a bit of a hack but you don't have to push it to prometheus

Ben Sless 2020-09-27T19:36:45.025600Z

You can get all the data from the timer object as an endpoint πŸ™ƒ

jeaye 2020-09-27T19:37:26.026300Z

Ah. We already have prometheus and grafana up, so any time series data might as well get rendered nicely through grafana.

jeaye 2020-09-27T19:38:04.027100Z

Ok, refocusing. If I want a whole-JVM process flame graph for exploring the unknown, what's recommended?

Ben Sless 2020-09-27T19:38:19.027300Z

clj-async-profiler

Ben Sless 2020-09-27T19:38:38.027600Z

use the (prof/profile-for duration options) function

Ben Sless 2020-09-27T19:39:24.028Z

Trigger it with an HTTP request that sends it the options.

Ben Sless 2020-09-27T19:40:23.028800Z

Pretty simple solution. If your flame graph comes out weird, it means you need to add JVM flags. If you get an exception, you need to adjust perf_event_paranoid
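
A minimal sketch of that setup, built on the (prof/profile-for duration options) call mentioned above; the route shape, parameter name, and :event option are illustrative:

(require '[clj-async-profiler.core :as prof])

;; Ring-style handler: start a fixed-duration CPU profile in the background and
;; return immediately; the flame graph lands in clj-async-profiler's output
;; directory for later retrieval.
(defn profile-handler [request]
  (let [duration (Long/parseLong (get-in request [:params "duration"] "60"))]
    (future (prof/profile-for duration {:event :cpu}))
    {:status 202
     :headers {"Content-Type" "text/plain"}
     :body (str "Profiling for " duration " seconds")}))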

Ben Sless 2020-09-27T19:42:57.030600Z

You can then use it to find the function call you're interested in, zoom in on it, and analyze the CPU usage. If you're using go blocks pervasively, the graph will be slightly harder to analyze, because everything will happen on the same thread pool

jeaye 2020-09-27T19:43:10.030700Z

I'll look into it. I'm currently concerned about perf_event_paranoid within ECS, but we'll see where I can get.

jeaye 2020-09-27T19:43:27.031400Z

Yeah, everything we do is within go blocks.

Ben Sless 2020-09-27T19:43:33.031500Z

I can tell you from experience I had no issues with it on AWS

jeaye 2020-09-27T19:43:44.031700Z

ECS as well?

jeaye 2020-09-27T19:43:50.032Z

i.e. docker

Ben Sless 2020-09-27T19:43:53.032300Z

yes

Ben Sless 2020-09-27T19:43:59.032500Z

Docker on EC2

jeaye 2020-09-27T19:44:12.032800Z

Cool. That's good to know. Thanks a lot, Ben.

Ben Sless 2020-09-27T19:44:25.033Z

πŸ™‚

Ben Sless 2020-09-27T19:44:48.033500Z

Good luck, let me know if you found something interesting

Ben Sless 2020-09-27T19:46:00.034700Z

To satisfy my curiosity, what are you concerned about which led you to need to profile? Think you might be spending too much time with locks or something else?

jeaye 2020-09-27T20:16:41.037200Z

@ben.sless We just moved our whole back-end to core.async to try to resolve some scalability issues. We're now fully async, with no blocking anywhere, but our scalability is 100x lower than where we need it and where we need it is still a very reasonable number. So I need to figure out what's taking so much time; we may be fully async, but we have the latency of molasses with even 1k clients.

Ben Sless 2020-09-28T07:19:46.048300Z

And you suspect you might be blocking on your middleware?

jeaye 2020-09-28T20:12:04.048900Z

@ben.sless Nope, I don't think there's any blocking going on. We checked pretty thoroughly.

jeaye 2020-09-28T20:12:30.049100Z

I do think we're being quite slow with some things, though; just trying to figure out what those things are.

Ben Sless 2020-09-28T21:04:42.049300Z

Then async profiler will serve you well. Swapping from compojure to reitit+muuntaja will speed you up. ring-json uses slurp and regular expressions, which is a performance killer. I PRed it a few days ago to remove the slurp, but still, switch to #reitit if you want speed. (Also jsonista > cheshire.)
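
For context, the kind of reitit + muuntaja wiring being suggested, roughly reitit's standard setup; the route here is made up, and muuntaja's default instance already uses jsonista for JSON:

(require '[reitit.ring :as ring]
         '[reitit.ring.middleware.muuntaja :as muuntaja]
         '[muuntaja.core :as m])

(def app
  (ring/ring-handler
    (ring/router
      [["/api/ping" {:get (fn [_] {:status 200, :body {:pong true}})}]]
      {:data {:muuntaja   m/instance
              :middleware [muuntaja/format-middleware]}})))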

jeaye 2020-09-28T21:39:06.049500Z

@ben.sless We have reitit on the list to bring in! Based on some testing we did a few months ago, it gave us a 15% speed up in running all of our tests, which is a surprising amount of time we were spending doing routing.

jeaye 2020-09-28T21:39:26.049700Z

We're using jsonista already.

Ben Sless 2020-09-28T21:40:10.049900Z

yeah, and when the server is under stress it compounds exponentially

jeaye 2020-09-28T21:40:29.050100Z

Thanks for all the tips. πŸ™‚

jeaye 2020-09-28T21:40:30.050300Z

I have clj-async-profiler working from an HTTP endpoint now. Just digging into what's going on.

Ben Sless 2020-09-28T21:40:59.050500Z

Best of luck! hope you find some performance lying on the floor for quick wins

Ben Sless 2020-09-28T21:41:11.050700Z

Then write a blog post πŸ™ƒ

jeaye 2020-09-28T21:45:25.050900Z

hehe, maybe. It's been years since my last blog post. This type of stuff is typically quite fun to write about, though.

jeaye 2020-09-28T21:46:49.051100Z

What I find very weird is that a quarter of the samples seem to include the JVM's compiler. This isn't anything I've seen in other JVM flame graph articles, aside from one, which verified it's the JIT.

jeaye 2020-09-28T21:47:05.051300Z

Do you typically see that?

Ben Sless 2020-09-29T04:37:26.051500Z

did you run with these flags? -XX:+PreserveFramePointer -XX:+UnlockDiagnosticVMOptions -XX:+DebugNonSafepoints

jeaye 2020-09-29T04:38:19.051700Z

Not the first one, but the other two.

jeaye 2020-09-29T04:38:33.051900Z

I'm using -Djdk.attach.allowAttachSelf -XX:+UnlockDiagnosticVMOptions -XX:+DebugNonSafepoints

Ben Sless 2020-09-29T04:39:41.052100Z

I think PreserveFramePointer is required for analyzing JITed code. You should also wait a few minutes before starting the profiler while the JIT does its job

Ben Sless 2020-09-29T04:39:57.052300Z

(a few minutes, ~3, should be enough)

jeaye 2020-09-29T04:41:49.052500Z

Yep, I'll add that flag.

alexmiller 2020-09-27T20:43:49.038700Z

It’s probably worth running a load test and then just taking periodic thread dumps. Generally if you take 10 thread dumps and 8 show the same thing, that’s the bottleneck
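
One way to do that from inside the process, as an illustrative sketch rather than anything from this thread:

;; Crude in-process thread dump: run it every few seconds during a load test
;; and look for the stack frames that keep showing up.
(defn dump-threads! []
  (doseq [[t frames] (into {} (Thread/getAllStackTraces))]
    (println "---" (.getName t) (.getState t))
    (doseq [frame frames]
      (println "   " (str frame)))))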

alexmiller 2020-09-27T20:45:06.039500Z

This is dumb, but shockingly effective

πŸ’― 1

jeaye 2020-09-27T20:49:06.039700Z

heh

vlaaad 2020-09-27T21:14:09.040300Z

are there plans to add some sort of metrics/profiling help to core-async?

alexmiller 2020-09-27T21:54:37.041200Z

No plans, not sure what would be generically useful

2020-09-27T22:23:08.044Z

Something we watch in production is the latency of the go block thread pool: we periodically throw a task on there that just reports the elapsed time between when it was queued and when it was executed.
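
A sketch of that kind of probe, assuming core.async; report-latency! stands in for whatever metrics call is actually used:

(require '[clojure.core.async :as async])

(defn probe-go-pool-latency!
  "Queues a trivial go block and reports how long it sat on the dispatch
  pool's queue before its body ran."
  []
  (let [queued (System/nanoTime)]
    (async/go
      (report-latency! (/ (- (System/nanoTime) queued) 1e6))))) ; report-latency! is hypothetical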

2020-09-27T22:24:17.045500Z

It would be interesting to get the queue size out; I forget if I tried to get that

2020-09-27T22:25:32.047400Z

In theory the queue should pretty much always be empty, because tasks run fast: they run a little logic, then wait on a channel

2020-09-27T22:25:44.047800Z

So a constant cycle through the queue