I'm doing some stress testing of clara and i'm seeing some weird behaviour
the background is that I want to build a sort of access control engine
there are 3 fact types: a consent, a processor and a processingallowed fact
there's just one rule which inserts a processingallowed fact if there is a sufficient amount of consent for a given processor
when I insert 1 million random consents, firing the rules takes just a few seconds
if I insert 1 million more into the previous session, firing the rules takes over an hour (I never waited long enough for it to complete)
if I insert 2 million random consents in an empty session, firing the rules still only takes a few seconds
while this isn't a truly realistic scenario (updates to consent would arrive in batches of a few hundred once initialised), I'm still kind of surprised by this behaviour
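For reference, a minimal sketch of the setup being described. The record names, fields, and the consent threshold are guesses for illustration, not the actual code from the thread:

```clojure
(ns access-control.rules
  (:require [clara.rules :refer [defrule insert!]]
            [clara.rules.accumulators :as acc]))

;; Hypothetical fact shapes -- the real ones aren't shown in the thread.
(defrecord Consent [processor purpose attribute])
(defrecord Processor [id])
(defrecord ProcessingAllowed [processor])

;; One rule: collect all consents for a processor with acc/all and insert
;; ProcessingAllowed when there are "enough" (the threshold here is made up).
(defrule processing-allowed
  [Processor (= ?id id)]
  [?consents <- (acc/all) :from [Consent (= ?id processor)]]
  [:test (>= (count ?consents) 2)]
  =>
  (insert! (->ProcessingAllowed ?id)))
```

Because the rule uses `insert!` (a logical insert) downstream of an accumulator, new consents can force the accumulator result to be retracted and recomputed, which is the "implicit retraction" discussed below.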
@wdullaer that is interesting. it would seem that the insert 1 mil then 1 mil later is triggering some heavy “retraction” work
However, with that sort of time being taken up, it’d be good to have some profiling
there is no rule which retracts facts
Unless you can reproduce it in a small case, which may not be easy. I wonder if you could do some CPU sampling snapshot while it is running excessively via a tool like “visual vm” or similar
I haven't found a good way to do detailed profiling / tracing yet, so all suggestions are welcome 🙂
Well, retracts can happen due to logical inserts (normal insert! in Clara). Also, accumulators may be updated and have to retract previous results. There are quite a few reasons to have “implicit” retracts happening in working memory
makes sense
I am using an all accumulator in the rule
VisualVM is free. I’ve used it mostly just since it was good enough for seeing many things like this.
but the funny thing is that it works just fine when doing 2 million in bulk on an empty session
I'll give that a spin
That isn’t entirely surprising to me - if you’ve found a bad retraction sort of scenario
In bulk, Clara does a lot of work in batches
Which has one benefit of avoiding “rework” - aka retractions/insertions to update “incorrect” earlier working memory states
I'm comparing 2 million in 1 go, vs 2 batches of 1 million
(I was expecting memory usage to be a bottleneck, but it's doing pretty fine on that front)
anyway, visualvm it is 🙂
So running with the VisualVM idea: 1) download it, 2) open it, 3) find your JVM process (there is a list in the UI) running the bad performance case,
(4) click the “Sampler” tab, (5) click “CPU”, (6) start running whatever is taking a long time,
(7) let it run for a bit, perhaps a minute or so, then click “Snapshot”
👍
Once you have a snapshot, you can dig around the callstack to see where the time is. If you want help with that, you can perhaps explain what is taking such a long time in the Clara functions of the callstack and perhaps I (or someone) will know enough from that. Or you can “export to” via an icon in the UI and it’ll give you an nps file to share. If you have proprietary stuff you don’t want to show (I don’t think much proprietary info shows up in these, but still), you could try to just find good screenshots that cut the callstack down to just the Clara parts.
I have to run now, but I'll do the test later today and post back with the results
thanks a lot already!
It’s probably not a bad idea to get some familiarity with digging through some cpu sampling snapshot results though. Can definitely be useful for finding bottlenecks quickly
No problem
@wdullaer are you inserting the facts like this:
(-> session (insert first-fact) .... (insert millionth-fact))
or like this: (insert session [first-fact .. millionth-fact])
It seems odd to me that simply breaking a single batch into two would have such a huge performance impact - OTOH smaller batching could do it. Arguably Clara should try to batch the first case together anyway, but it doesn’t. See how facts are added into the “pending-operations” on insertion at https://github.com/cerner/clara-rules/blob/0.17.0/src/main/clojure/clara/rules/engine.cljc#L1818 and removed from the queue in fire-rules at https://github.com/cerner/clara-rules/blob/0.17.0/src/main/clojure/clara/rules/engine.cljc#L1870
If a simple division into two batches really does increase runtime like that I’d be curious to see a reproducing case and whether there is some performance optimization we could do - at first glance it doesn’t seem like it should happen, but as always it is hard to say without a concrete case.
On profiling, seconding mikerod’s suggestion of VisualVM. Another possible route would be to look at the trace of the rule execution for a smaller number of facts - manual examination wouldn’t be practical for millions of facts, but a batch of 2 facts vs 2 batches of 1 fact each might show insights. I’d probably start with VisualVM though, especially since traces generally require some understanding of Clara’s internals to yield much insight. https://github.com/cerner/clara-rules/blob/0.17.0/src/main/clojure/clara/tools/tracing.cljc
@wparker sounded like he had batches of 1 million. so not really “small” hah
however, I was thinking that the scenario was “insert 1 million, then fire”, later on, “insert 1 million, then fire”
Trying to reuse an old session state basically
It certainly sounds like the batching behavior was critical to the performance of the scenario they were having.
It seems this has uncovered some potentially badly performing case in terms of updating working memory state. Also, I think I assumed this was on the JVM. Not sure if that was actually stated before. I’m guessing so with the millions of facts and the discussion around using visualvm
My point was that I think it would be possible to think the batch was a million, but actually end up being one at a time when rules fire, so just wanted to make sure that it was batched
yeah, true
Perhaps Clara should probably take (-> session (insert A) (insert B) fire-rules) and turn it into a single batch - but I think it is 2 at present
Yeah
If they were inserting a million things though, they probably did it in a single insert
otherwise they’d have to do something like (reduce insert session million-things)
I don’t think they’d write out a million (insert thing1)
by hand 😛
anyway a VisualVM snapshot would be informative, spending time in memory updates would be a possible culprit
yeah, I thought it would be good to see
hopefully some can be obtained
I agree in general though that Clara should just batch the work from multiple insert calls
Should batch as much as possible
Would insert-all be preferable here?
If you have a big collection of things to insert, yes insert-all
or you call insert via varargs if you have compile-time known ones
You could also do apply insert, but no reason to since insert-all exists
Batched propagation through the rete network can be a big performance boost
So it’s best to be as batch’ful as you can be
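Putting the batching advice together, a sketch of the three insertion styles discussed (assuming `session`, `facts`, and the individual `fact-*` vars are in scope):

```clojure
;; 1. One pending operation per fact -- the poorly batched variant:
(reduce insert session facts)

;; 2. Varargs insert -- a single batch, fine when the facts are known
;;    at compile time:
(insert session fact-a fact-b fact-c)

;; 3. insert-all -- a single batch from a collection; preferred for
;;    large inserts like the million-fact case in this thread:
(-> session
    (insert-all facts)
    (fire-rules))
```

The point being made is that 2 and 3 propagate through the rete network as one batch, while 1 queues a million separate operations.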
A while ago I think I noticed retracts were slow. Not the result of insert logical though
to clarify, external retract or RHS retract! ?
For no good reason (that I’m aware of) there isn’t a retract-all for external retraction right now.
So you should use apply retract if you have a batch there.
that's exactly it
here's roughly what's happening:
(-> session
(insert-all (generate-consents 1000000 test-purposes test-attributes))
(fire-rules)
(insert-all (generate-consents 1000000 test-purposes test-attributes))
(fire-rules))
with some log statements to see where I am
this is on the jvm
I still have to take the snapshot
maybe I'll try to reduce the code to the size of a gist so I can share it
> maybe I’ll try to reduce the code to the size of a gist so I can share it
Yeah, that is the best-case scenario
I'll report back tomorrow
External
So yep, apply retract
(we should get a retract-all added for convenience and symmetry with insert-all)
Faster than doseq?
I don’t know how you would doseq external retract - it returns a new version of the session, not a mutation
If you have a collection of facts to externally retract
Instead of
(reduce retract session facts)
do
(apply retract session facts)
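Since retract-all doesn’t exist in Clara as of 0.17.0, a trivial convenience wrapper along the lines being suggested might look like this (hypothetical helper, not part of clara.rules):

```clojure
;; Hypothetical convenience fn mirroring insert-all -- just varargs
;; retract under the hood, so the retraction happens as one batch.
(defn retract-all
  "Externally retract a collection of facts from session in one batch."
  [session facts]
  (apply retract session facts))
```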