clara

http://www.clara-rules.org/
jdt 2018-06-05T12:57:03.000462Z

Any tips on analyzing where I'm spending time in rules? (for example, which rules).

2018-06-05T13:10:08.000063Z

@dave.tenny should be possible to use Clara tracing to take a look at something like that. Although it’s exploratory to find what you want given the data structures it returns. You could get something like count of times things were evaluated that way.

2018-06-05T13:10:47.000618Z

It’d be nice if Clara had more supporting functions for this purpose though.

jdt 2018-06-05T13:13:30.000676Z

I'm testing with 100 users, 15 job types, 100k job requests, and 45 workers. That means there are 1500 ActiveUserJobCount facts, 45 WorkerResource facts, and the rest is probably obvious. The first (fire-rules) that digests all the initial fact entry takes 27 seconds, then my mileage varies wildly, from 57 seconds to 10 seconds when I generate 15 or more SelectedPair entries for dispatch. (Initially there are a LOT of selected pair entries until the workers are saturated). Anyway, just looking for stupid newbie tips. One thing I might consider is changing the entire ActiveUserJobCount check to predicates instead of facts, but I have no idea how that would affect performance.

jdt 2018-06-05T13:13:57.000417Z

There is also some minimum number of queries in addition to the above rules.

jdt 2018-06-05T13:16:36.000344Z

Between each fire-rules the dispatcher will select one job per job type (so 15 in this numbers case), dispatch it, update the ActiveUserJobCount and WorkerResource facts, and retract the dispatched JobRequest facts.

jdt 2018-06-05T13:17:10.000607Z

So I'm averaging about 1-3 seconds per job dispatch, and whacking the hell out of my CPU. (Memory footprint is good however... surprise!)

jdt 2018-06-05T13:23:35.000768Z

I had originally hoped to do the fact maintenance for active user job counts (used for round robin eligibility consideration), and worker resource stats (to track remaining worker capacity), in the RHS of rules. But I gave that up early in the process because it was a side-effect oriented process with a bunch of [:not A] => A scenarios, which might have been doable, but it was really the timing of the rule LHS evaluations that killed it, since LHS evaluations are perceptually "in parallel" and not sequential with respect to cause and effect for a given rule.

jdt 2018-06-05T13:24:40.000557Z

So I compute some eligible things, then do the dispatching and accounting between calls to fire-rules, then do it all over again (always saving and continuing from the updated session).

2018-06-05T13:44:01.000390Z

> but it was really the timing of the rule LHS evaluations that killed it, since LHS evaluations are perceptually “in parallel” and not sequential with respect to cause and effect for a given rule.

Only the case for insert-unconditional! (and possibly even a defect), but yeah. If you have to extract these facts after they are “done” it may make most sense to be an external thing anyways.

2018-06-05T13:44:26.000802Z

@dave.tenny I can look at your rules from above some, sounds too slow

jdt 2018-06-05T13:45:15.000190Z

Happy to share the whole module, nothing but some pretty rules and ugly code to do the bookkeeping and setup mock data and such

2018-06-05T13:46:05.000802Z

when something takes long in that range of seconds (like 10+) I tend to do a cpu sampling

jdt 2018-06-05T13:46:09.000395Z

Job dispatch and completion is all about accounting then trying again with updated rules.

2018-06-05T13:46:20.000497Z

With a profiler, I tend to use visualVM because it is good enough for this situation

2018-06-05T13:46:47.000436Z

sometimes it gives quicker leads to what is causing issues. Sometimes it is too opaque unless you know the rule engine internals, but not necessarily always

jdt 2018-06-05T13:46:57.000131Z

Yeah, I'm having problems with VisualVM on my Linux system because of some JNI/Jar problem I can't figure out, and HPROF sampling is usually useless

2018-06-05T13:47:11.000071Z

but also, doing some just blunt counting of times that certain conditions in rules and/or rule RHS were fired sometimes can give you the outliers as well

2018-06-05T13:47:24.000872Z

Hmm

2018-06-05T13:47:38.000245Z

Yeah, VisualVM with “cpu sampler” is what I have used. if things are whacky there, not sure hah

jdt 2018-06-05T13:47:40.000066Z

Yeah, I'm looking at instrumenting, is it possible to capture the time spent, via instrumentation, in each LHS condition?

2018-06-05T13:47:58.000663Z

well, time will obviously go to the instrumentation/tracing you do

2018-06-05T13:48:05.000082Z

but if you do something like just gather counts

2018-06-05T13:48:11.000591Z

that is not time-sensitive anyways
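One blunt way to gather such counts (a sketch, not a Clara API — the counter helper and the rule keyword here are hypothetical) is to bump an atom from each rule RHS and look for outliers afterwards:

```clojure
;; Sketch: a global counter keyed by rule name. A swap! per firing is
;; cheap, and since we only want counts (not timings) the overhead of
;; the instrumentation doesn't distort the result.
(def firing-counts (atom {}))

(defn count-firing! [rule-name]
  (swap! firing-counts update rule-name (fnil inc 0)))

;; In a rule RHS you would call, e.g.:
;;   (count-firing! :worker-viable-jobs)

;; After fire-rules, sort descending to surface the outliers:
(defn outliers []
  (reverse (sort-by val @firing-counts)))
```

The rules that show up with counts orders of magnitude above the rest are usually where the combinatorial joins are hiding.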

jdt 2018-06-05T13:48:35.000387Z

Re the rules, any obvious stupid hash-join failures or things that might be better done in predicates or with accumulators?

2018-06-05T13:48:36.000248Z

when I’ve tried “counting condition/rhs” firings before in bad performance cases, there are often outliers

2018-06-05T13:48:48.000825Z

Like most things evaluate perhaps no more than 1k times, for example

2018-06-05T13:48:56.000575Z

but then you find something that evaluates like 1 mil times

2018-06-05T13:49:06.000265Z

typically it's something like that

jdt 2018-06-05T13:49:16.000898Z

In my case I'm worried it's the tests in the conditions that are firing a lot, but I'll get more data.

2018-06-05T13:49:55.000400Z

So even your rule like worker-viable-jobs may be pretty heavy

2018-06-05T13:50:06.000242Z

I’m just looking over what you had and your sort of fact counts

jdt 2018-06-05T13:50:17.000934Z

For example, whether I'm doing N^2 firings on the worker-viable-jobs conditions for the 100k job requests and 45 worker-resource facts.

2018-06-05T13:50:26.000083Z

you have 100K job requests facts and 45 workers, and you also have a :not in that rule

2018-06-05T13:50:35.000202Z

so you are something like 100K x 100K x 45 territory

jdt 2018-06-05T13:51:24.000080Z

Right, is there a way to achieve the semantics there without the combinatorics?

2018-06-05T13:51:34.000842Z

probably just doing millions of comparisons with clj = and < etc and it adds up

2018-06-05T13:51:43.000324Z

yes, I think there is a way out of it

jdt 2018-06-05T13:52:18.000375Z

This is where, if I could update the worker resources in the RHS and immediately prune the possibilities for the next firing of worker-viable-job it would have been a win, but that doesn't work because all the worker-viable-jobs are going to fire regardless of whether I update the worker resources because of that seemingly parallel LHS evaluation protocol.

2018-06-05T13:52:45.000452Z

Extract a rule that finds the oldest jobs first, then only bring those into the join with the WorkerResource rule here

2018-06-05T13:53:21.000179Z

in general, you have a lot of JobRequest facts to deal with. You want to avoid any rule that may do a join across that set of facts with itself

2018-06-05T13:53:41.000423Z

I’m not sure how memory and thread factor into an “oldest job” situation

2018-06-05T13:53:47.000122Z

so hard to give you an example

jdt 2018-06-05T13:53:47.000920Z

Except we'll still need to potentially consider the next oldest, and so on, until we find one that fits the available worker resources, so does an extra rule really help?

2018-06-05T13:54:03.000259Z

[:not [JobRequest (< job-id ?job-id)
                  (= ?job-type job-type)
                  (= ?user-id user-id)
                  (<= threads ?threads)
                  (<= memory ?memory)]]
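One way to sidestep that self-join (a sketch under assumptions about the fact fields; `acc/min` with `:returns-fact true` is real Clara API, but the rule name and the `OldestJobRequest` fact type are made up) is to let an accumulator pick the oldest JobRequest per group instead of proving "no older one exists" with a :not:

```clojure
;; Sketch: replace the :not self-join with an accumulator that keeps
;; only the JobRequest with the smallest job-id per (job-type, user-id)
;; group. The unbound ?job-type/?user-id bindings group the accumulation,
;; so this fires once per distinct combination rather than comparing
;; every JobRequest against every other JobRequest.
(defrule oldest-job-per-group
  [?oldest <- (acc/min :job-id :returns-fact true)
   :from [JobRequest (= ?job-type job-type)
                     (= ?user-id user-id)]]
  =>
  (insert! (->OldestJobRequest ?oldest)))
```

A downstream rule can then join only these far fewer OldestJobRequest facts against WorkerResource, instead of the full 100K JobRequest set.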

jdt 2018-06-05T13:54:21.000277Z

The memory and thread tests are to test only for jobs that can actually execute on the worker resources, i.e. for which enough resources exist.

jdt 2018-06-05T13:54:54.000286Z

That's the "viable" part of the semantics

2018-06-05T13:55:01.000341Z

the rule can help, just have to figure out what the correct rule is

2018-06-05T13:55:30.000246Z

thinking about it

jdt 2018-06-05T13:55:57.000461Z

Any advice for a beginner on how to effectively use the tracing API here? The one time I tried it there was too much data to process, and that was on the simplest most minimal amount of facts.

jdt 2018-06-05T13:57:13.000912Z

re: the "oldest job" stuff, I was wondering if accumulators would in any way help; I have no idea how they're implemented w.r.t. incremental fact maintenance.

2018-06-05T14:11:28.000318Z

> Any advice for a beginner on how to effectively use the tracing API here? The one time I tried it there was too much data to process, and that was on the simplest most minimal amount of facts.

I haven’t used it as much as I’d expect. I was used to rolling my own stuff prior to when tracing stuff was introduced. However, for counting, I believe you can do something like:

(let [traced (-> (clara.rules/mk-session <your rules>)
                 (clara.tools.tracing/with-tracing)
                 (insert <your facts>)
                 fire-rules
                 clara.tools.tracing/get-trace)]
  (frequencies (map :node-id traced)))

2018-06-05T14:12:50.000328Z

or perhaps better sorted:

(let [tr (-> (clara.rules/mk-session [temp-rule])
             (t/with-tracing)
             (insert (->Temperature 10 "MCI")
                     (->Temperature 20 "MCI"))
             (fire-rules)
             (t/get-trace))]
  (->> (map :node-id tr)
       frequencies
       (sort-by val)
       reverse))

2018-06-05T14:13:10.000169Z

once you know the highest count :node-ids you can look those up in the rulebase associated with the session

jdt 2018-06-05T14:13:35.000507Z

excellent, thanks

2018-06-05T14:16:32.000604Z

(let [session (-> (clara.rules/mk-session <your rules>)
                  (t/with-tracing)
                  (insert <your facts>)
                  (fire-rules))

      trace (t/get-trace session)
      node-id <whatever node id in question from `trace`>

      {:keys [rulebase]} (clara.rules.engine/components session)
      {:keys [id-to-node]} rulebase]
  (get id-to-node node-id))

This is how you could look up the node-id to try to find what node in the engine it is

2018-06-05T14:16:52.000977Z

a node will be a defrecord of stuff, not all that readable to you, but you should be able to recognize aspects of it and align it back to your rules typically

2018-06-05T14:19:12.000436Z

An example of finding the node-id

2018-06-05T15:00:28.000509Z

@dave.tenny here is an example of your worker-viable-jobs rule refactored from earlier. I believe it has the same semantics. I also think it cuts down on the number of comparisons done.

jdt 2018-06-05T15:09:13.000724Z

Thanks Mike, I'll have a look

jdt 2018-06-05T16:38:05.000178Z

@mikerod as I am still new at accumulators, I'm trying to discern the purpose of the ?user-id binding on line 30 in the above snippet. Is it used so that the rule will fire once for each distinct user id? Does it work given that ?user-id is not otherwise bound?

2018-06-05T16:43:58.000679Z

@dave.tenny
> Is it used so that the rule will fire once for each distinct user id?

Yes, that's its purpose in that example

2018-06-05T16:44:54.000120Z

which makes me realize I had a typo there, it should have been (= ?user-id (-> this :worker-viable-job :user-id)) since it was nested one level lower, will update. Sorry if that caused confusion.

2018-06-05T16:46:43.000084Z

Accumulator behavior with field-level bindings like that is explained more in http://www.clara-rules.org/docs/accumulators/
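As a minimal sketch of what that doc describes (the fact types here are illustrative, not from the module above): a field binding that is unbound elsewhere in the rule groups the accumulation, so the rule fires once per distinct value.

```clojure
;; Sketch: ?user-id is not bound by any other condition, so Clara
;; partitions the JobRequest facts by user-id and runs the accumulator
;; once per partition — the rule fires once per distinct user.
(defrule jobs-per-user
  [?n <- (acc/count) :from [JobRequest (= ?user-id user-id)]]
  =>
  (insert! (->UserJobCount ?user-id ?n)))
```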

jdt 2018-06-05T16:53:05.000209Z

I'm getting null pointer exceptions on calls to < that I think are from the accumulator, but are maybe something else; unfortunately the stack trace doesn't clue me in other than it's in (fire-rules), pretty much. However acc/min isn't documented to accept an :initial-value argument, so it's probably something else. Just in case something obvious is missing above.

jdt 2018-06-05T16:58:04.000116Z

Ah wait, maybe it's because I removed an explicit binding of :user-id in the candidate rule.

2018-06-05T16:59:44.000614Z

@dave.tenny more typos because I forgot the thing was nested

2018-06-05T17:00:00.000378Z

(acc/min (comp :job-id :worker-viable-job) :returns-fact true), I updated the snippet above to reflect that

2018-06-05T17:00:18.000060Z

if :job-id may be nil though, would have to defend against it
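One way to defend against that (a sketch; the `WorkerViableJob` fact type and nested field shape are assumptions matching the earlier snippet) is to guard the :from condition so nil job-ids never reach the accumulator's comparisons:

```clojure
;; Sketch: filter out nil job-ids in the condition itself, so acc/min's
;; internal < never sees a nil and NPEs.
[?oldest <- (acc/min (comp :job-id :worker-viable-job) :returns-fact true)
 :from [WorkerViableJob (some? (:job-id worker-viable-job))]]
```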