datalog

plexus 2020-08-03T04:40:18.010300Z

Asami is also still a work in progress; Paula is still planning to add a storage layer, for instance.

plexus 2020-08-03T08:10:36.011200Z

Am I doing something wrong here or are Asami queries really 100x faster than with Datomic/DataScript?

plexus 2020-08-03T08:24:38.012900Z

I know one has to be careful with microbenchmarks, I'd absolutely love for people to challenge these results. PRs to that repo are also welcome, would be nice if it could grow into a comprehensive comparison of the different implementations.

jumar 2020-08-03T08:42:40.014600Z

I don't know much about either of these implementations but you could check where the time is spent e.g. via clj-async-profiler

quoll 2020-08-03T11:51:51.027Z

It would be unfair to compare Asami to Datomic, because that system is designed for its durable provisioning. There’s a lot going on. Datascript is more comparable. Asami indexing is simpler, and it doesn’t really keep things like deletion datoms. That may help Asami to be faster here? Also, a micro benchmark will be hitting the cache on Asami’s query planner. I don’t believe that the others have a query optimizer, while Asami does. There is a small cost to using it, but if you keep running the same query without updating the database then it will just return the same query plan without getting re-executed.

plexus 2020-08-03T11:55:16.027800Z

that explains a lot, because this benchmark does a warmup first

Huahai 2020-08-04T20:41:33.049900Z

what does the index of asami look like?

Huahai 2020-08-04T20:42:19.050100Z

because I think the performance difference mostly comes from the differences in index data structures

quoll 2020-08-04T20:42:19.050300Z

3 maps

Huahai 2020-08-04T20:43:15.050500Z

of?

quoll 2020-08-04T20:44:55.050700Z

The indexes are: {entity {attribute #{value}}} {attribute {value #{entity}}} {value {entity #{attribute}}}
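As a concrete illustration of that shape (the data here is made up, and this is just a sketch, not Asami's actual code), the three nested-map indexes for a couple of example triples would look like:

```clojure
;; Illustrative sketch of the three nested-map indexes described above.
;; Example triples: [:a :name "x"] and [:a :age 42]
(def spo {:a {:name #{"x"} :age #{42}}})        ; {entity {attribute #{value}}}
(def pos {:name {"x" #{:a}} :age {42 #{:a}}})   ; {attribute {value #{entity}}}
(def osp {"x" {:a #{:name}} 42 {:a #{:age}}})   ; {value {entity #{attribute}}}

;; A fully-ground lookup is just a get-in:
(get-in spo [:a :name])  ;; => #{"x"}
```

Each map serves the lookup patterns where its outermost key is bound.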

Huahai 2020-08-04T20:45:11.051Z

so it’s all nested maps?

quoll 2020-08-04T20:45:18.051200Z

yes

Huahai 2020-08-04T20:45:39.051400Z

whereas datascript has flat maps

Huahai 2020-08-04T20:46:06.051600Z

so it has to join much more data

quoll 2020-08-04T20:46:51.051800Z

The indexes are pluggable. I’m building a new one at the moment, which is quite different (very similar to the Mulgara indices)

Huahai 2020-08-04T20:47:15.052Z

what’s the new one look like?

quoll 2020-08-04T20:48:28.052200Z

It’s a tree backed by a buffer. In ClojureScript I want that buffer to be stored in the browser’s persistent storage. In Clojure it’s backed by a MappedByteBuffer on a file

quoll 2020-08-04T20:48:54.052400Z

same indexing as above though

Huahai 2020-08-04T20:49:19.052600Z

i see

quoll 2020-08-04T20:51:03.052800Z

As I said, it’s based on Mulgara. I have not maintained it for some time, but the last I heard it was still being used commercially: http://mulgara.org/

quoll 2020-08-04T20:51:35.053Z

I stopped working on Mulgara because I wanted to redo a lot of it in Clojure (it’s a Java project)

Huahai 2020-08-04T20:51:46.053200Z

cool

Huahai 2020-08-04T20:52:53.053400Z

wouldn’t it be easier to just use it as backend rather than re-implement in clojure?

Huahai 2020-08-04T20:53:37.053600Z

performance-wise, it would be better too; I guess you want it to be usable in the browser?

quoll 2020-08-04T20:54:06.053800Z

I want to revisit the architecture. We made decisions about it in 2001 that involved tradeoffs that we don’t need to make anymore.

quoll 2020-08-04T20:54:22.054Z

And yes… I want it to run on both platforms

quoll 2020-08-04T20:54:50.054200Z

The Clojure code is dramatically simpler

quoll 2020-08-04T20:55:43.054400Z

and the main overhead for the trees is I/O, so the speed improvement gained from Java isn’t a big deal

Huahai 2020-08-04T20:56:37.054600Z

right

Huahai 2020-08-04T20:59:24.054800Z

very cool, looking forward to it

quoll 2020-08-04T20:59:39.055Z

Still working on it 🙂

quoll 2020-08-04T20:59:59.055200Z

I have the memory mapping working (it’s tricky to manage files over 2GB)

quoll 2020-08-04T21:00:10.055400Z

Still building the trees

Huahai 2020-08-04T21:04:58.055600Z

what’s your plan for asami? compete with datomic?

quoll 2020-08-04T21:05:26.055800Z

No

quoll 2020-08-04T21:07:09.056Z

It’s a hybrid plan…
• We use it at Cisco. I’ve been expanding on its capabilities as needed.
• It has supplanted Mulgara as my personal project. I’m trying to get back to what Mulgara can do

quoll 2020-08-04T21:08:29.056200Z

But, to a certain extent, the fact that Datomic was not open source has been frustrating. So I thought it would be nice to have an option. It will fill a different niche though, since it isn’t peer-based, and it’s based on an OWA like RDF/SPARQL

quoll 2020-08-04T21:10:05.056500Z

There are new architectures that I want to try with triple indices for different use-cases. Mulgara’s complexity has made this change too hard, and I’m hoping that Asami gives me the flexibility to do what I’ve wanted to do there

quoll 2020-08-04T21:15:07.056700Z

It’s been slow going, because until yesterday no one else had ever submitted a PR 🙂

Huahai 2020-08-04T21:32:20.056900Z

I think people are just now noticing asami

Huahai 2020-08-04T21:39:42.057100Z

it’s good to have more options, so far i counted 7 options that are doing datomic-like queries in the clojure world

Huahai 2020-08-04T21:41:01.057300Z

Asami is the fastest in terms of query speed, so people are starting to notice

Huahai 2020-08-04T21:44:18.057500Z

Datomic, Datascript, Asami, Crux, Datahike, Eva and mine Datalevin

Huahai 2020-08-04T21:44:24.057700Z

7 in total

quoll 2020-08-04T21:44:57.057900Z

Well, I’m aiming for Datomic-like, rather than exactly copying Datomic 🙂

Huahai 2020-08-04T21:45:14.058100Z

i think these 7 are all different

Huahai 2020-08-04T21:46:01.058300Z

Asami is the most different, actually, for it’s not

Huahai 2020-08-04T21:46:38.058500Z

Datomic-inspired, whereas all the others are

quoll 2020-08-04T21:47:29.058700Z

Well, it was partly inspired by Datomic. But also RDF

Huahai 2020-08-04T21:49:17.058900Z

datomic is inspired by RDF too, but yours is earlier than that

Huahai 2020-08-04T21:50:33.059100Z

the biggest difference is the indexing structure, and I think you got it right

Huahai 2020-08-04T21:51:32.059300Z

you seem to have more experience in RDF-like databases than the Datomic people

Huahai 2020-08-04T21:56:42.059500Z

are you open to alternative durable storage other than what you are working on?

quoll 2020-08-04T21:57:12.059700Z

Well one thing I haven’t done is to keep the full history of assertions and retractions. There’s a lot of complexity if I want to fully index that

quoll 2020-08-04T21:57:18.059900Z

definitely

quoll 2020-08-04T21:57:22.060100Z

This is why it’s pluggable

quoll 2020-08-04T21:58:35.060300Z

It just needs to follow a simple protocol

quoll 2020-08-04T21:59:05.060800Z

And the graph-diff function in there is optional

Huahai 2020-08-04T21:59:48.061Z

graph here means what?

Huahai 2020-08-04T21:59:53.061200Z

a set of triples?

quoll 2020-08-04T21:59:59.061400Z

yes

quoll 2020-08-04T22:01:11.061800Z

That protocol may get tweaked (around transactions), but that’s basically the level of abstraction I need

Huahai 2020-08-04T22:04:13.062Z

what does resolve-triple do?

quoll 2020-08-04T22:06:49.062200Z

It takes a graph and 3 elements (that come from a constraint pattern). Each element can either be a scalar value (e.g. a keyword, string, long, etc), or else a symbol that starts with ? . These symbols are treated as variables. It then returns bindings for all of those variables, based on the triples that match the provided scalar values.
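A minimal, illustrative sketch of that contract (not Asami's actual implementation, which dispatches to the indexes rather than scanning a flat collection):

```clojure
;; Sketch of resolve-triple semantics: elements whose symbol name starts
;; with ? are variables; everything else must match exactly. Returns the
;; bound values for the variable positions only.
(defn variable? [x]
  (and (symbol? x) (= \? (first (name x)))))

(defn resolve-triple [triples s p o]
  (let [pattern       [s p o]
        var-positions (keep-indexed (fn [i e] (when (variable? e) i)) pattern)]
    (for [triple triples
          :when (every? (fn [[pe te]] (or (variable? pe) (= pe te)))
                        (map vector pattern triple))]
      ;; project out just the columns bound to variables
      (mapv triple var-positions))))

(resolve-triple [[:a :m 1] [:a :m 2] [:b :m 2]] :a '?attr '?val)
;; => ([:m 1] [:m 2])
```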

quoll 2020-08-04T22:08:22.062400Z

The simple index (there’s another index as well) does this using asami.index/get-from-index https://github.com/threatgrid/asami/blob/main/src/asami/index.cljc#L33

quoll 2020-08-04T22:09:18.062700Z

There’s a simplify function that converts values into :v and variables into ?. That’s used to determine the dispatch
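The dispatch idea can be sketched like this (an illustrative stand-in for the real function in asami.index, which may differ in detail):

```clojure
;; Map each pattern element to :v (ground value) or ? (variable),
;; yielding one of the 8 possible dispatch keys, e.g. [:v ? ?].
(defn simplify [[s p o]]
  (mapv #(if (and (symbol? %) (= \? (first (name %)))) '? :v) [s p o]))

(simplify [:a '?attr '?val])  ;; => [:v ? ?]
```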

quoll 2020-08-04T22:10:07.062900Z

You can see there are 8 variations, each with its own code to return the appropriate result

quoll 2020-08-04T22:10:45.063100Z

The lettering of SPO is based on RDF: Subject-Predicate-Object. This is equivalent to EAV

Huahai 2020-08-04T22:10:46.063300Z

so instead of returning all the matched full datoms like in datascript, this returns the bound values only?

quoll 2020-08-04T22:10:56.063500Z

yes

quoll 2020-08-04T22:11:21.063700Z

The other columns are fixed by the request, so they’re redundant

Huahai 2020-08-04T22:11:57.063900Z

right

Huahai 2020-08-04T22:12:17.064100Z

that’s the biggest difference in terms of performance though

quoll 2020-08-04T22:13:32.064300Z

Well, I thought about returning all 3 columns, and at the time I thought that I would just need to project them away anyway, so why would I?

quoll 2020-08-04T22:13:40.064500Z

It would be creating work for myself

Huahai 2020-08-04T22:14:21.064700Z

so the indexing has to be more complex than a simple set of datoms, like in datomic or datascript

quoll 2020-08-04T22:15:18.064900Z

Well, have a look at that file

quoll 2020-08-04T22:15:23.065100Z

how complex does it look?

Huahai 2020-08-04T22:15:52.065300Z

100 lines :0

Huahai 2020-08-04T22:15:54.065500Z

:0

Huahai 2020-08-04T22:15:56.065700Z

🙂

Huahai 2020-08-04T22:16:08.065900Z

I am thinking of storage on disk

quoll 2020-08-04T22:16:21.066100Z

there’s an index-add function, an index-delete function and then just a bit of code in the graph to call them

quoll 2020-08-04T22:16:37.066300Z

storage on disk is trickier. I won’t be using maps

Huahai 2020-08-04T22:17:03.066500Z

so what kind of trees are you implementing?

quoll 2020-08-04T22:17:04.066700Z

instead, trees. They’ll be sorting triples based on the 3 orderings

quoll 2020-08-04T22:17:07.066900Z

AVL

quoll 2020-08-04T22:17:45.067100Z

each node will refer to a block of triples. From memory, 512 of them

quoll 2020-08-04T22:18:26.067300Z

B-trees are nice too, but I’ve had great success with AVL for both upload speed and read performance

quoll 2020-08-04T22:18:54.067500Z

Because the nodes are small, the whole top of the tree (down many levels) gets cached

quoll 2020-08-04T22:22:11.067700Z

So even though it’s a deeper tree, it’s not as slow to search as it seems at first glance. Also, each node handles a range, which means it’s not purely AVL

Huahai 2020-08-04T22:23:11.067900Z

right, but how do you maintain the nested nature of the maps in that tree?

quoll 2020-08-04T22:25:05.068100Z

by the sorting

Huahai 2020-08-04T22:26:20.068300Z

i see, so it sounds to me like the storage could be exactly the same as datascript; it only needs an added filtering step to return just the bound values

Huahai 2020-08-04T22:27:25.068500Z

then your query engine can be brought to bear

quoll 2020-08-04T22:27:28.068700Z

consider these triples:

a m 1
a m 2
a n 1
a o 3
b m 2
b m 3
c m 2
This is equivalent to:
a m 1
    2
  n 1
  o 3
b m 2
    3
c m 2
The second is the nested-map view
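The equivalence between the sorted triples and the nested-map view can be sketched in a few lines (illustrative code, not from Asami):

```clojure
;; Derive the nested-map {entity {attribute #{value}}} view
;; from the flat triples above.
(defn nest [triples]
  (reduce (fn [m [e a v]] (update-in m [e a] (fnil conj #{}) v))
          {}
          triples))

(nest [[:a :m 1] [:a :m 2] [:a :n 1] [:a :o 3]
       [:b :m 2] [:b :m 3] [:c :m 2]])
;; => {:a {:m #{1 2}, :n #{1}, :o #{3}}, :b {:m #{2 3}}, :c {:m #{2}}}
```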

Huahai 2020-08-04T22:27:39.068900Z

right

quoll 2020-08-04T22:29:02.069100Z

on disk you find the boundaries of what you’re looking for by tree searches O(log(n)). When you have maps, then it’s O(1), which is obviously nicer.

Huahai 2020-08-04T22:30:07.069300Z

so you are only storing [abc] [momm] …, not [[am1]…]?

Huahai 2020-08-04T22:31:09.069500Z

or are you still storing the latter, but using the bounds to avoid scanning the whole range? instead, scan a small range, then move to the next?

quoll 2020-08-04T22:32:26.069700Z

I’m sorry. I didn’t quite follow that

Huahai 2020-08-04T22:33:23.069900Z

i am trying to figure out what the on-disk storage looks like in your design

quoll 2020-08-04T22:34:05.070100Z

ah

quoll 2020-08-04T22:34:29.070300Z

definitely storing the full triples. So there is redundancy

Huahai 2020-08-04T22:34:38.070500Z

my question is do you store [a m 2] in your example?

quoll 2020-08-04T22:34:58.070700Z

There are ways to reduce the amount being stored, but given the tradeoffs, I’m going with the simplest approach first

Huahai 2020-08-04T22:35:02.070900Z

oh i see, so you still store the full triples, but also store the bounds

quoll 2020-08-04T22:35:10.071100Z

yes

quoll 2020-08-04T22:35:22.071300Z

No. Store just the triples

Huahai 2020-08-04T22:35:24.071500Z

that’s what’s missing from datascript, it doesn’t store bounds

quoll 2020-08-04T22:36:41.071700Z

If I’m looking for [a ?x ?y] then I need to search twice. I’ll find the start, and then find the point between [a o 3] and [b m 2]
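That two-search idea can be sketched over a sorted vector of triples (illustrative code; Asami's durable index uses AVL-style trees over blocks rather than an in-memory vector):

```clojure
;; Two O(log n) binary searches find the boundaries of the contiguous
;; run of triples matching a bound prefix such as [a ?x ?y].
(defn prefix-compare [triple prefix]
  (compare (subvec triple 0 (count prefix)) prefix))

(defn lower-bound [v prefix]
  ;; index of the first triple whose prefix is >= prefix
  (loop [lo 0 hi (count v)]
    (if (= lo hi)
      lo
      (let [mid (quot (+ lo hi) 2)]
        (if (neg? (prefix-compare (v mid) prefix))
          (recur (inc mid) hi)
          (recur lo mid))))))

(defn upper-bound [v prefix]
  ;; index of the first triple whose prefix is > prefix
  ;; (here: the point between [:a :o 3] and [:b :m 2])
  (loop [lo 0 hi (count v)]
    (if (= lo hi)
      lo
      (let [mid (quot (+ lo hi) 2)]
        (if (pos? (prefix-compare (v mid) prefix))
          (recur lo mid)
          (recur (inc mid) hi))))))

(def triples [[:a :m 1] [:a :m 2] [:a :n 1] [:a :o 3]
              [:b :m 2] [:b :m 3] [:c :m 2]])

;; [a ?x ?y] resolves to the slice between the two bounds:
(subvec triples (lower-bound triples [:a]) (upper-bound triples [:a]))
;; => [[:a :m 1] [:a :m 2] [:a :n 1] [:a :o 3]]
```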

Huahai 2020-08-04T22:37:30.071900Z

i see, bounds are not stored, but found by seeking

quoll 2020-08-04T22:38:03.072100Z

searching… to me, “seeking” implies a linear search forward

Huahai 2020-08-04T22:38:26.072300Z

sure, you have a binary tree

Huahai 2020-08-04T22:39:36.072500Z

so my understanding is that the exact same storage that datascript uses can be used in your query engine

quoll 2020-08-04T22:39:51.072700Z

yes

quoll 2020-08-04T22:40:37.072900Z

You can even just store flat triples and do everything via filter. The performance would be terrible, but it would work fine

Huahai 2020-08-04T22:40:54.073100Z

right

Huahai 2020-08-04T22:41:39.073300Z

i think it is possible for these projects to be merged

Huahai 2020-08-04T22:42:53.073500Z

what I did in Datalevin is port datascript to LMDB, and it becomes faster than Datascript, even though datalevin is on disk and datascript is in memory

Huahai 2020-08-04T22:43:32.073700Z

the same thing can be done with Asami, the only thing missing is the filtering step we discussed above

Huahai 2020-08-04T22:46:01.073900Z

once resolve-triple is implemented as your api requires, your query engine can be used, i think this combination will yield the best possible implementation

quoll 2020-08-04T22:47:32.074100Z

It’s open source 🙂

quoll 2020-08-04T22:48:01.074300Z

sorry, I’ve been distracted. It seems that ClojureScript does not support find on transient maps

quoll 2020-08-04T22:48:04.074500Z

sigh

Huahai 2020-08-04T22:49:41.074700Z

i don’t care about clojurescript, i only care about clojure, so it’s good to have choices

Huahai 2020-08-04T22:50:15.074900Z

thank you for talking with me, I admire what you did

quoll 2020-08-04T22:53:26.075100Z

Thank you! 🙂

quoll 2020-08-04T22:54:12.075300Z

@plexus created the channel #asami so feel free to ask things in there

Huahai 2020-08-04T22:55:55.075500Z

Yes, I am in that channel 🙂

Huahai 2020-08-04T22:57:14.075700Z

Back to work. Talk to you later!

quoll 2020-08-03T11:59:27.028500Z

It’s also a very simple conjunction, so it will be naturally fast

quoll 2020-08-03T12:00:13.029Z

There’s not a lot for the optimizer to do 🙂

quoll 2020-08-03T12:25:36.029500Z

The first query is literally just a get-in operation followed by a projection. The second query does the planning (at a guess, it will likely pick the provided ordering of the query) and it will then turn into a set of get operations in a loop for however many people there are named “Ivan”. I would expect it to be trivially fast as well

plexus 2020-08-03T14:31:48.032300Z

I'll try to find some more interesting queries :)

quoll 2020-08-03T16:15:56.034800Z

I just realized… the query planner doesn’t execute on the first query, because there’s nothing to plan. I’m honestly confused why Datascript would be so slow for that query

lilactown 2020-08-03T16:20:56.035200Z

asami looks really cool

❤️ 1
lilactown 2020-08-03T16:26:53.035900Z

how does transacting references to other entities work?

quoll 2020-08-03T16:27:50.037100Z

May be better to ask in #asami? But it also depends on what you’re referring to specifically. In general, it does a lookup to find the entity and get its internal node value

quoll 2020-08-03T22:18:19.039500Z

I just looked at the code that would be executed for the first query. It’s several `if`/`cond` statements that figure out the structure of the query, followed by a get-in from the index. Then comes the expensive bit… For every result, it does a reduce over the 1 column defined in the :find clause, doing a map get to find the offset of the bound variable ?e (which is 0) and doing: (assoc '[?e] 0 result) to create each result row. Literally, that’s it. I can’t see why it would take multiple milliseconds?