Any tips on analyzing where I'm spending time in rules? (for example, which rules).
@dave.tenny should be possible to use Clara tracing to take a look at something like that. Although it’s exploratory to find what you want given the data structures it returns. You could get something like count of times things were evaluated that way.
It’d be nice if Clara had more supporting functions for this purpose though.
I'm testing with 100 users, 15 job types, 100k job requests, and 45 workers. That means there are 1500 ActiveUserJobCount facts, 45 WorkerResource facts, and the rest is probably obvious. The first (fire-rules) that digests all the initial fact entry takes 27 seconds, then my mileage varies wildly, from 57 seconds to 10 seconds when I generate 15 or more SelectedPair entries for dispatch. (Initially there are a LOT of selected pair entries until the workers are saturated). Anyway, just looking for stupid newbie tips. One thing I might consider is changing the entire ActiveUserJobCount check to predicates instead of facts, but I have no idea how that would affect performance.
There are a certain minimum number of queries in addition to the above rules.
Between each fire-rules
the dispatcher will select one job per job type (so 15 in this case), dispatch it, update the ActiveUserJobCount and WorkerResource facts, and retract the dispatched JobRequest facts.
So I'm averaging about 1-3 seconds per job dispatch, and whacking the hell out of my CPU. (Memory footprint is good however... surprise!)
I had originally hoped to do the fact maintenance for active user job counts (used for round robin eligibility consideration), and worker resource stats (to track remaining worker capacity), in the RHS of rules. But I gave that up early in the process because it was a side-effect oriented process with a bunch of [:not A] => A
scenarios, which might have been doable, but it was really the timing of the rule LHS evaluations that killed it, since LHS evaluations are perceptually "in parallel" and not sequential with respect to cause and effect for a given rule.
So I compute some eligible things, then do the dispatching and accounting between calls to fire-rules
, then do it all over again (always saving and continuing from the updated session).
> but it was really the timing of the rule LHS evaluations that killed it, since LHS evaluations are perceptually “in parallel” and not sequential with respect to cause and effect for a given rule.
Only the case for insert-unconditional!
(and possibly even a defect), but yeah. if you have to extract these facts after they are “done” it may make most sense to be an external thing anyways.
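The [:not A] => A scenario being discussed can be sketched like this (a minimal hypothetical illustration, not from the actual ruleset):

```clojure
(ns example
  (:require [clara.rules :refer [defrule insert!]]))

(defrecord WorkerBusy [worker-id])

;; The LHS negates the very fact the RHS logically inserts.
;; With truth maintenance (insert!), firing the rule invalidates
;; its own premise, so the engine retracts the fact again;
;; insert-unconditional! sidesteps truth maintenance, which is
;; why it behaves differently, as noted above.
(defrule mark-worker-busy
  [:not [WorkerBusy (= worker-id 42)]]
  =>
  (insert! (->WorkerBusy 42)))
```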
@dave.tenny I can look at your rules from above some, sounds too slow
Happy to share the whole module, nothing but some pretty rules and ugly code to do the bookkeeping and setup mock data and such
when something takes long in that range of seconds (like 10+) I tend to do a cpu sampling
Job dispatch and completion is all about accounting then trying again with updated rules.
With a profiler, I tend to use visualVM because it is good enough for this situation
sometimes it gives quicker leads to what is causing issues. Sometimes it is too opaque unless you know the rule engine internals, but not necessarily always
Yeah, I'm having problems with VisualVM on my Linux system because of some JNI/Jar problem I can't figure out, and HPROF sampling is usually useless
but also, doing some just blunt counting of times that certain conditions in rules and/or rule RHS were fired sometimes can give you the outliers as well
Hmm
Yeah, VisualVM with “cpu sampler” is what I have used. if things are whacky there, not sure hah
Yeah, I'm looking at instrumenting, is it possible to capture the time spent, via instrumentation, in each LHS condition?
well, time will obviously go to the instrumentation/tracing you do
but if you do something like just gather counts
that is not time-sensitive anyways
Re the rules, any obvious stupid hash-join failures or things that might be better done in predicates or with accumulators?
when I’ve tried “counting condition/rhs” firings before in bad performance cases, there are often outliers
Like most things evaluate perhaps no more than 1k times for example
but then you find something that evaluates like 1 mil times
typically it's something like that
In my case I'm worried it's the tests in the conditions that are firing a lot, but I'll get more data.
So even your rule like worker-viable-jobs
may be pretty heavy
I’m just looking over what you had and your sort of fact counts
For example, if I'm doing N^2 firings on the worker-viable-jobs conditions for the 100k job requests and 45 worker-resource facts.
you have 100K job requests facts and 45 workers, and you also have a :not
in that rule
so you are in something like 100K x 100K x 45 territory
Right, is there a way to achieve the semantics there without the combinatorics?
probably just doing millions of comparisons with clj =
and <
etc and it adds up
yes, I think there is a way out of it
This is where, if I could update the worker resources in the RHS and immediately prune the possibilities for the next firing of worker-viable-job it would have been a win, but that doesn't work because all the worker-viable-jobs are going to fire regardless of whether I update the worker resources because of that seemingly parallel LHS evaluation protocol.
Extract a rule, find the oldest jobs first, then only bring those into the join with WorkerResource
rule here
in general, you have a lot of JobRequest
facts to deal with. You want to avoid any rule that may do a join across that set of facts with itself
I’m not sure how memory
and thread
factor into an “oldest job” situation
so hard to give you an example
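As a very rough sketch of the "extract a rule" idea (hypothetical fact and field names; assumes JobRequest carries :job-id, :job-type, and :user-id, and ignores the memory/thread viability question for the moment):

```clojure
(ns example
  (:require [clara.rules :refer [defrule insert!]]
            [clara.rules.accumulators :as acc]))

(defrecord JobRequest [job-id job-type user-id threads memory])
(defrecord OldestJobRequest [job-request])

;; Collapse the large JobRequest set to one oldest candidate per
;; (job-type, user-id) group *before* joining against WorkerResource,
;; instead of the O(n^2) [:not ...] self-join over JobRequest.
(defrule oldest-job-per-group
  [?oldest <- (acc/min :job-id :returns-fact true)
   :from [JobRequest (= ?job-type job-type) (= ?user-id user-id)]]
  =>
  (insert! (->OldestJobRequest ?oldest)))
```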
Except we'll still need to potentially consider the next oldest, and so on, until we find one that fits the available worker resources, so does an extra rule really help?
```clojure
[:not [JobRequest (< job-id ?job-id)
       (= ?job-type job-type)
       (= ?user-id user-id)
       (<= threads ?threads)
       (<= memory ?memory)]]
```
The memory and thread tests are to test only for jobs that can actually execute on the worker resources, i.e. for which enough resources exist.
That's the "viable" part of the semantics
the rule can help, just have to figure out what the correct rule is
thinking about it
Any advice for a beginner on how to effectively use the tracing API here? The one time I tried it there was too much data to process, and that was on the simplest most minimal amount of facts.
re: the "oldest job" stuff, I was wondering if accumulators would in any way help, I have no idea how they're implemented w.r.t. incremental fact maintenance.
> Any advice for a beginner on how to effectively use the tracing API here? The one time I tried it there was too much data to process, and that was on the simplest most minimal amount of facts.
I haven't used it as much as I'd expect. I was used to rolling my own stuff prior to when the tracing support was introduced. However, for counting, I believe you can do something like:
```clojure
(let [traced (-> (clara.rules/mk-session <your rules>)
                 (clara.tools.tracing/with-tracing)
                 (insert <your facts>)
                 fire-rules
                 clara.tools.tracing/get-trace)]
  (frequencies (map :node-id traced)))
```
or perhaps better sorted:
```clojure
(let [tr (-> (clara.rules/mk-session [temp-rule])
             (t/with-tracing)
             (insert (->Temperature 10 "MCI")
                     (->Temperature 20 "MCI"))
             (fire-rules)
             (t/get-trace))]
  (->> (map :node-id tr)
       frequencies
       (sort-by val)
       reverse))
```
once you know the highest-count :node-ids you can look them up in the rulebase associated with the session
excellent, thanks
```clojure
(let [session (-> (clara.rules/mk-session <your rules>)
                  (t/with-tracing)
                  (insert <your facts>)
                  (fire-rules))
      trace (t/get-trace session)
      node-id <whatever node id in question from `trace`>
      {:keys [rulebase]} (clara.rules.engine/components session)
      {:keys [id-to-node]} rulebase]
  (get id-to-node node-id))
```
This is how you could look up the node-id to try to find what node in the engine it is. A node will be a defrecord of stuff, not all that readable to you, but you should be able to recognize aspects of it and align it back to your rules typically
An example of finding the node-id
@dave.tenny here is an example of your worker-viable-jobs
rule refactored from earlier. I believe it has the same semantics. I also think it cuts down on the number of comparisons done.
Thanks Mike, I'll have a look
@mikerod as I am still new at accumulators, I'm trying to discern the purpose of the ?user-id binding on line 30 in the above snippet. Is it used so that the rule will fire once for each distinct user id? Does it work given that ?user-id is not otherwise bound?
@dave.tenny > Is it used so that the rule will fire once for each distinct user id? yes, in that’s its purpose in that example
which makes me realize I had a typo there, it should have been (= ?user-id (-> this :worker-viable-job :user-id))
since it was nested one level lower, will update. Sorry if that caused confusion.
Accumulator behavior with field-level bindings like that is explained more in http://www.clara-rules.org/docs/accumulators/
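A minimal illustration of that field-level binding grouping (hypothetical facts; a sketch of the behavior, not the actual rule):

```clojure
(ns example
  (:require [clara.rules :refer [defrule insert!]]
            [clara.rules.accumulators :as acc]))

(defrecord ActiveJob [user-id job-id])
(defrecord UserJobCount [user-id n])

;; Because ?user-id is bound inside the accumulator's :from condition,
;; the accumulator is evaluated once per distinct user-id, so the rule
;; fires once per group rather than once over all ActiveJob facts.
(defrule count-jobs-per-user
  [?n <- (acc/count) :from [ActiveJob (= ?user-id user-id)]]
  =>
  (insert! (->UserJobCount ?user-id ?n)))
```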
I'm getting null pointer exceptions on calls to <
that I think are the accumulator, but are maybe something else, unfortunately the stack trace doesn't clue me in other than that it's in (fire-rules)
, pretty much. However the acc/min
isn't documented to accept an :initial-value
argument, so it's probably something else. Just in case something obvious is missing above.
@dave.tenny uploaded a file: https://clojurians.slack.com/files/U7SGKB4LF/FB1C5P7A4/perhaps_more_obvious_to_you.clj
Ah wait, maybe it's because I removed an explicit binding of :user-id in the candidate rule.
@dave.tenny more typos because I forgot the thing was nested
```clojure
(acc/min (comp :job-id :worker-viable-job) :returns-fact true)
```
, I updated the snippet above to reflect that
if :job-id
may be nil
though, would have to defend against it
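One possible way to defend against a nil :job-id (a sketch only; maps nil to a sentinel that sorts last so the comparison never sees nil):

```clojure
;; Assumes the accumulated fact nests the id as
;; {:worker-viable-job {:job-id ...}}, as in the snippet above.
(acc/min (fn [fact]
           (or (-> fact :worker-viable-job :job-id)
               Long/MAX_VALUE))
         :returns-fact true)
```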