let's say i have one million entities in my database that represent a simple checkin log of the guests at my hotel (business is booming), and i want to find the latest checkin date of all guests. is there a more performant method than using the max
aggregate? in my test case it takes ~4 seconds.
(d/q '{:find [(max ?d)]
:in [$]
:where [[?g :guest/checkin ?d]]}
db)
"Elapsed time: 4411.448713 msecs"
i thought the indexes might have something to do with it. this seems like a better option
(first
(d/index-pull
(d/db sort-conn) {:index :avet
:reverse true
:selector [:guest/checkin]
:start [:guest/checkin]}))
"Elapsed time: 524.218481 msecs"
Are you timing this operation from your laptop through the client api + socks proxy to a cloud system?
yup, very crude measurement. though i thought max
runs on the query group, so i'm just gaining* time on the i/o of the query result, right?
500ms is at least an order of magnitude more time than you would experience from within an ion for your d/index-pull
operation.
ah yes, i suspect index-pull will be much faster when deployed
i was still a little surprised to see the max
version take so long though
Is this "latest checkin for each guest" or "most recent checkin of any guest"?
the most recent checkin of any guest
Try https://docs.datomic.com/cloud/query/raw-index-access.html#index-range if you only need the value.
TBH, in your measurements the network will dominate the time, assuming your data is already cached.
that is true, i'll run some more accurate tests in the vpc. still from my local repl i return "simple" queries in under 100ms so i don't think the network is blame for the four second return from the max aggregate. i understand that sticking with the index-pull/index-range is the best scenario anyhow, but i'm still curious why the max aggregate is so much slower than the index pull. do you know?
You're asking it to do significantly more work.
(dotimes [_ 10]
(time (d/q '{:find [(max ?d)]
:in [$]
:where [[_ :guest/checkin ?d]]}
db)))
Note the _
. What is the output of that?
Actually, I don't think it matters. The query has to consider all ?d
values and thus load them all from storage, put them in memory, etc.
Asking the index only considers the first (or last) element and it's done.
So, yeah, query is doing more work here.
that's what i thought. in one case i'm pull and sorting one million values, and in the other case i'm just stepping along the index and stopping at the first/highest* value in my case (when :reverse true anyway)
"Elapsed time: 3668.642766 msecs"
"Elapsed time: 4459.585873 msecs"
"Elapsed time: 4711.916313 msecs"
"Elapsed time: 4607.739472 msecs"
"Elapsed time: 5062.974172 msecs"
"Elapsed time: 4710.631811 msecs"
"Elapsed time: 4298.420077 msecs"
"Elapsed time: 4505.261057 msecs"
"Elapsed time: 5294.66838 msecs"
"Elapsed time: 4841.231095 msecs"
And now try the same dotimes
but with
(first
(d/index-pull
(d/db sort-conn) {:index :avet
:reverse true
:selector [:guest/checkin]
:start [:guest/checkin]}))
yup, of course it's a lot faster
"Elapsed time: 488.688867 msecs"
"Elapsed time: 383.445716 msecs"
"Elapsed time: 395.543983 msecs"
"Elapsed time: 407.429619 msecs"
"Elapsed time: 394.223538 msecs"
"Elapsed time: 425.687384 msecs"
"Elapsed time: 382.726732 msecs"
"Elapsed time: 397.136071 msecs"
"Elapsed time: 392.209467 msecs"
"Elapsed time: 414.867937 msecs"
An important thing to look at is the variance between min and max here vs when running inside the vpc vs inside an ion. It gets pretty fast.
(might not matter for you in this case though)
in my opinion it's an interesting bit of "learned knowledge" about datomic. 🙂 i've seen my fair share of queries written to find the latest this-or-that using min and max aggregates. maybe it's just muscle memory from the SQL days:
(time (j/query pg-db ["select MAX(checkin) from guests"]))
"Elapsed time: 187.651582 msecs"
dubious test because that postgres connection is local. anyway i'll follow your advice and try out min/max from inside the vpc and from an ion. you would expect the ion version to run fastest, right?Yes, but the point of the ion test is to discern "What am I really measuring?"
because the ion is running in the same memory space
Yes, but more importantly, You're NOT measuring network overhead or auth fns or lambda coldstart, etc You're focusing on JUST the time it takes to do the query/db operation, many times, after a warmup period of issuing the query a few times to warm all the caches (if the data fits in cache).
where can i find docs on the map query syntax ?
Thanks @lanejo01, those are some very useful details
On-Prem or Cloud?
On-prem
There is an example of passing a map to d/q
, but were you talking about the map form of the query itself?
I'm trying to figure out how to programatically generate a datomic query
Do you have a specific scenario in mind?
Imagine i had the schema like the above and I had optional fields :firstname and :lastname, and i would reduce over the inputs and programatically add
{:where [ ] }
clauses depending on whether the key exists or not
(cond-> '{:find [[?e ...]]
:in [$]
:where []}
lastName (update :in conj '?lname)
lastName (update :where conj '[?e :user/lastName ?lname])
firstName (update :in conj '?fname)
firstName (update :where conj '[?e :user/firstName ?fname]))
Just use clojure ¯\(ツ)/¯
Curious, why are you saving the data to both a rdbms and datomic?
I don't think he is, I think he was using that as an example of what he was used to doing in other systems to achieve the same thing.
awesome 🙂 thanks