datomic

Ask questions on the official Q&A site at https://ask.datomic.com!
joshkh 2021-05-05T16:17:48.459700Z

let's say i have one million entities in my database that represent a simple checkin log of the guests at my hotel (business is booming), and i want to find the latest checkin date of all guests. is there a more performant method than using the max aggregate? in my test case it takes ~4 seconds.

(d/q '{:find  [(max ?d)]
       :in    [$]
       :where [[?g :guest/checkin ?d]]}
     db)
"Elapsed time: 4411.448713 msecs"

joshkh 2021-05-05T16:27:32.459900Z

i thought the indexes might have something to do with it. this seems like a better option

(first
 (d/index-pull
  (d/db sort-conn) {:index    :avet
                    :reverse  true
                    :selector [:guest/checkin]
                    :start    [:guest/checkin]}))
"Elapsed time: 524.218481 msecs"

Joe Lane 2021-05-05T16:29:50.460100Z

Are you timing this operation from your laptop through the client api + socks proxy to a cloud system?

joshkh 2021-05-05T16:31:06.460300Z

yup, very crude measurement. though i thought max runs on the query group, so i'm just gaining time on the i/o of the query result, right?

Joe Lane 2021-05-05T16:32:38.460600Z

500ms is at least an order of magnitude more time than you would experience from within an ion for your d/index-pull operation.

joshkh 2021-05-05T16:33:23.460800Z

ah yes, i suspect index-pull will be much faster when deployed

joshkh 2021-05-05T16:34:01.461Z

i was still a little surprised to see the max version take so long though

Joe Lane 2021-05-05T16:36:46.461200Z

Is this "latest checkin for each guest" or "most recent checkin of any guest"?

joshkh 2021-05-05T16:38:27.461400Z

the most recent checkin of any guest

Joe Lane 2021-05-05T16:41:26.461600Z

Try https://docs.datomic.com/cloud/query/raw-index-access.html#index-range if you only need the value.
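[editor's sketch] Joe's index-range suggestion would look roughly like this against the Cloud client API. As far as I can tell index-range has no :reverse option, so taking the max this way means walking to the end of the range, unlike the reversed index-pull shown earlier:

(require '[datomic.client.api :as d])

;; index-range returns :guest/checkin datoms in ascending :v order,
;; so the last datom carries the maximum checkin value. `db` is
;; assumed to be a current database value.
(-> (d/index-range db {:attrid :guest/checkin})
    last
    :v)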

Joe Lane 2021-05-05T16:44:41.461800Z

TBH, in your measurements the network will dominate the time, assuming your data is already cached.

joshkh 2021-05-05T16:50:13.462Z

that is true, i'll run some more accurate tests in the vpc. still, from my local repl i return "simple" queries in under 100ms, so i don't think the network is to blame for the four second return from the max aggregate. i understand that sticking with index-pull/index-range is the best scenario anyhow, but i'm still curious why the max aggregate is so much slower than the index pull. do you know?

Joe Lane 2021-05-05T16:51:10.462200Z

You're asking it to do significantly more work.

Joe Lane 2021-05-05T16:52:03.462400Z

(dotimes [_ 10] 
 (time (d/q '{:find  [(max ?d)]
              :in    [$]
              :where [[_ :guest/checkin ?d]]}
         db)))

Joe Lane 2021-05-05T16:52:31.462600Z

Note the _ . What is the output of that?

Joe Lane 2021-05-05T16:54:17.462900Z

Actually, I don't think it matters. The query has to consider all ?d values and thus load them all from storage, put them in memory, etc. Asking the index only considers the first (or last) element and it's done.

Joe Lane 2021-05-05T16:54:29.463100Z

So, yeah, query is doing more work here.

joshkh 2021-05-05T16:54:48.463300Z

that's what i thought. in one case i'm pulling and sorting one million values, and in the other case i'm just stepping along the index and stopping at the first value, which is the highest in my case (when :reverse is true anyway)

joshkh 2021-05-05T16:54:54.463500Z

"Elapsed time: 3668.642766 msecs"
"Elapsed time: 4459.585873 msecs"
"Elapsed time: 4711.916313 msecs"
"Elapsed time: 4607.739472 msecs"
"Elapsed time: 5062.974172 msecs"
"Elapsed time: 4710.631811 msecs"
"Elapsed time: 4298.420077 msecs"
"Elapsed time: 4505.261057 msecs"
"Elapsed time: 5294.66838 msecs"
"Elapsed time: 4841.231095 msecs"

Joe Lane 2021-05-05T16:56:43.463800Z

And now try the same dotimes but with

(first
 (d/index-pull
  (d/db sort-conn) {:index    :avet
                    :reverse  true
                    :selector [:guest/checkin]
                    :start    [:guest/checkin]}))

joshkh 2021-05-05T16:57:37.464Z

yup, of course it's a lot faster

"Elapsed time: 488.688867 msecs"
"Elapsed time: 383.445716 msecs"
"Elapsed time: 395.543983 msecs"
"Elapsed time: 407.429619 msecs"
"Elapsed time: 394.223538 msecs"
"Elapsed time: 425.687384 msecs"
"Elapsed time: 382.726732 msecs"
"Elapsed time: 397.136071 msecs"
"Elapsed time: 392.209467 msecs"
"Elapsed time: 414.867937 msecs"

Joe Lane 2021-05-05T16:59:18.464200Z

An important thing to look at is the variance between min and max here vs when running inside the vpc vs inside an ion. It gets pretty fast.

Joe Lane 2021-05-05T17:00:12.464400Z

(might not matter for you in this case though)

joshkh 2021-05-05T17:01:57.464600Z

in my opinion it's an interesting bit of "learned knowledge" about datomic. 🙂 i've seen my fair share of queries written to find the latest this-or-that using min and max aggregates. maybe it's just muscle memory from the SQL days:

(time (j/query pg-db ["select MAX(checkin) from guests"]))
"Elapsed time: 187.651582 msecs"
dubious test because that postgres connection is local. anyway i'll follow your advice and try out min/max from inside the vpc and from an ion. you would expect the ion version to run fastest, right?

Joe Lane 2021-05-05T17:02:50.464800Z

Yes, but the point of the ion test is to discern "What am I really measuring?"

joshkh 2021-05-05T17:03:26.465Z

because the ion is running in the same memory space

Joe Lane 2021-05-05T17:05:33.465200Z

Yes, but more importantly, you're NOT measuring network overhead, auth fns, lambda coldstart, etc. You're focusing on JUST the time it takes to do the query/db operation, many times, after a warmup period of issuing the query a few times to warm all the caches (if the data fits in cache).

jdkealy 2021-05-05T17:59:44.465800Z

where can i find docs on the map query syntax?

joshkh 2021-05-05T18:01:40.465900Z

Thanks @lanejo01, those are some very useful details

Joe Lane 2021-05-05T18:04:58.466100Z

On-Prem or Cloud?

jdkealy 2021-05-05T18:06:03.466300Z

On-prem

Joe Lane 2021-05-05T18:09:55.466700Z

There is an example of passing a map to d/q, but were you talking about the map form of the query itself?
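[editor's sketch] For anyone following along, the map form of a query is structurally equivalent to the more familiar list form; sketched here with the checkin query from earlier in the thread:

;; list (vector) form
'[:find (max ?d)
  :in $
  :where [?g :guest/checkin ?d]]

;; equivalent map form -- easier to build programmatically,
;; since each clause is just a value under a keyword key
'{:find  [(max ?d)]
  :in    [$]
  :where [[?g :guest/checkin ?d]]}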

jdkealy 2021-05-05T18:11:30.466900Z

I'm trying to figure out how to programmatically generate a datomic query

Joe Lane 2021-05-05T18:14:02.467400Z

Do you have a specific scenario in mind?

jdkealy 2021-05-05T18:35:23.467600Z

Imagine I had a schema like the above with optional fields :firstname and :lastname, and I wanted to reduce over the inputs and programmatically add :where clauses depending on whether each key exists or not

Joe Lane 2021-05-05T18:40:39.467800Z

(cond-> '{:find  [[?e ...]]
          :in    [$]
          :where []}
  lastName  (update :in conj '?lname)
  lastName  (update :where conj '[?e :user/lastName ?lname])
  firstName (update :in conj '?fname)
  firstName (update :where conj '[?e :user/firstName ?fname]))

Joe Lane 2021-05-05T18:41:11.468100Z

Just use clojure ¯\_(ツ)_/¯
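[editor's sketch] Joe's cond-> snippet builds only the query map; here is one way to also wire up the matching inputs. The find-users name and :user/* attributes are illustrative, not from a real schema, and it assumes at least one field is present (Datomic rejects an empty :where):

(defn find-users
  "Query for user entity ids, filtering on whichever
   optional fields are present."
  [db {:keys [firstName lastName]}]
  (let [query (cond-> '{:find [[?e ...]] :in [$] :where []}
                lastName  (-> (update :in conj '?lname)
                              (update :where conj '[?e :user/lastName ?lname]))
                firstName (-> (update :in conj '?fname)
                              (update :where conj '[?e :user/firstName ?fname])))
        ;; inputs must line up positionally with the :in clause
        args  (cond-> [db]
                lastName  (conj lastName)
                firstName (conj firstName))]
    (apply d/q query args)))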

JohnJ 2021-05-05T18:43:11.468300Z

Curious, why are you saving the data to both a rdbms and datomic?

Joe Lane 2021-05-05T18:44:13.468500Z

I don't think he is, I think he was using that as an example of what he was used to doing in other systems to achieve the same thing.

jdkealy 2021-05-05T18:44:59.468700Z

awesome 🙂 thanks