Asami is also still a work in progress; Paula is still planning to add a storage layer, for instance.
Am I doing something wrong here or are Asami queries really 100x faster than with Datomic/DataScript?
I know one has to be careful with microbenchmarks; I'd absolutely love for people to challenge these results. PRs to that repo are also welcome. It would be nice if it could grow into a comprehensive comparison of the different implementations.
I don't know much about either of these implementations, but you could check where the time is spent, e.g. via clj-async-profiler
It would be unfair to compare Asami to Datomic, because that system is designed around durable provisioning; there’s a lot going on. Datascript is more comparable. Asami’s indexing is simpler, and it doesn’t really keep things like deletion datoms. That may help Asami to be faster here? Also, a microbenchmark will be hitting the cache on Asami’s query planner. I don’t believe that the others have a query optimizer, while Asami does. There is a small cost to using it, but if you keep running the same query without updating the database then it will just return the same query plan without re-planning.
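The caching is essentially this, as a rough sketch (invented names, not the actual Asami code):
```clojure
;; Sketch only: cache plans keyed on the query and a database revision,
;; so re-running the same query on an unchanged database skips planning.
(def plan-cache (atom {}))

(defn cached-plan [query db-rev compute-plan]
  (or (get @plan-cache [query db-rev])
      (let [plan (compute-plan query)]
        (swap! plan-cache assoc [query db-rev] plan)
        plan)))
```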
that explains a lot, because this benchmark does a warmup first
what does the index of asami look like?
because I think the performance difference mostly comes from the differences in index data structures
3 maps
of?
The indexes are: {entity {attribute #{value}}} {attribute {value #{entity}}} {value {entity #{attribute}}}
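For illustration, adding a triple touches all three. A rough sketch with invented key names (not the actual Asami code):
```clojure
;; Sketch: one triple is written into all three nested-map indexes.
(defn add-triple [{:keys [eav ave vea] :as graph} e a v]
  (assoc graph
         :eav (update-in eav [e a] (fnil conj #{}) v)
         :ave (update-in ave [a v] (fnil conj #{}) e)
         :vea (update-in vea [v e] (fnil conj #{}) a)))

(add-triple {:eav {} :ave {} :vea {}} :ivan :name "Ivan")
;; => {:eav {:ivan {:name #{"Ivan"}}}
;;     :ave {:name {"Ivan" #{:ivan}}}
;;     :vea {"Ivan" {:ivan #{:name}}}}
```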
so it’s all nested maps?
yes
whereas datascript has flat maps
so it has to join much more data
The indexes are pluggable. I’m building a new one at the moment, which is quite different (very similar to the Mulgara indices)
what’s the new one look like?
It’s a tree backed by a buffer. In ClojureScript I want that buffer to be stored in the browser’s persistent storage. In Clojure it’s backed by a MappedByteBuffer on a file
same indexing as above though
i see
As I said, it’s based on Mulgara. I have not maintained it for some time, but the last I heard it was still being used commercially: http://mulgara.org/
I stopped working on Mulgara because I wanted to redo a lot of it in Clojure (it’s a Java project)
cool
wouldn’t it be easier to just use it as a backend rather than re-implementing it in clojure?
performance-wise, it would be better too. guess you want it to be usable in the browser?
I want to revisit the architecture. We made decisions about it in 2001 that involved tradeoffs that we don’t need to make anymore.
And yes… I want it to run on both platforms
The Clojure code is dramatically simpler
and the main overhead for the trees is I/O, so the speed improvement gained from Java isn’t a big deal
right
very cool, looking forward to it
Still working on it 🙂
I have the memory mapping working (it’s tricky to manage files over 2GB)
Still building the trees
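The 2GB trickiness is that a single MappedByteBuffer is capped at Integer/MAX_VALUE bytes, so a large file has to be covered by several mapped regions. Something like this sketch (not the actual implementation):
```clojure
(import '[java.io RandomAccessFile]
        '[java.nio.channels FileChannel$MapMode])

(def region-size (* 1024 1024 1024)) ; 1GB per mapped region

(defn map-file
  "Sketch: cover a (possibly >2GB) file with a vector of MappedByteBuffers."
  [^String path]
  (let [chan (.getChannel (RandomAccessFile. path "rw"))
        len  (.size chan)]
    (mapv (fn [offset]
            (.map chan FileChannel$MapMode/READ_WRITE
                  offset (min region-size (- len offset))))
          (range 0 len region-size))))
```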
what’s your plan for asami? compete with datomic?
No
It’s a hybrid plan…
• We use it at Cisco. I’ve been expanding on its capabilities as needed.
• It has supplanted Mulgara as my personal project. I’m trying to get back to what Mulgara can do
But, to a certain extent, the fact that Datomic was not open source has been frustrating. So I thought it would be nice to have an option. It will fill a different niche though, since it isn’t peer-based, and it’s based on an OWA (open-world assumption) like RDF/SPARQL
There are new architectures that I want to try with triple indices for different use-cases. Mulgara’s complexity has been too hard to make this change, and I’m hoping that Asami gives me the flexibility to do what I’ve wanted to there
It’s been slow going, because until yesterday no one else had ever submitted a PR 🙂
I think people are just now noticing asami
it’s good to have more options. so far i counted 7 options that are doing datomic-like queries in the clojure world
Asami is the fastest in terms of query speed, so people are starting to notice
Datomic, Datascript, Asami, Crux, Datahike, Eva, and my own Datalevin
7 in total
Well, I’m aiming for Datomic-like, rather than exactly copying Datomic 🙂
i think these 7 are all different
Asami is the most different, actually, for it’s not Datomic-inspired, whereas all the others are
Well, it was partly inspired by Datomic. But also RDF
datomic is inspired by RDF too, but yours is earlier than that
the biggest difference is the indexing structure, and I think you got it right
you seem to have more experience with RDF-like databases than the Datomic people
are you open to alternative durable storage other than what you are working on?
Well one thing I haven’t done is to keep the full history of assertions and retractions. There’s a lot of complexity if I want to fully index that
definitely
This is why it’s pluggable
It just needs to follow a simple protocol
https://github.com/threatgrid/asami/blob/main/src/asami/graph.cljc#L9
And the graph-diff function in there is optional
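From memory, the shape of that protocol is roughly this (the linked file is the authoritative version):
```clojure
(defprotocol Graph
  (new-graph      [this] "Returns a new, empty graph of this type")
  (graph-add      [this subj pred obj] "Adds a single triple")
  (graph-delete   [this subj pred obj] "Removes a single triple")
  (graph-diff     [this other] "Entities whose triples differ (optional)")
  (resolve-triple [this subj pred obj] "Returns bindings for the variables in a pattern")
  (count-triple   [this subj pred obj] "Counts the matching triples"))
```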
graph here means what?
a set of triples?
yes
That protocol may get tweaked (around transactions), but that’s basically the level of abstraction I need
what does resolve-triple do?
It takes a graph and 3 elements (that come from a constraint pattern). Each element can either be a scalar value (e.g. a keyword, string, long, etc.), or else a symbol that starts with ?. These symbols are treated as variables.
It then returns bindings for all of those variables, based on the triples that match the provided scalar values.
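For example (hypothetical node names and data):
```clojure
;; One variable: ?e gets bound; :name and "Ivan" are fixed scalars.
(resolve-triple graph '?e :name "Ivan")
;; => ([:node-1] [:node-7])        ; one binding per matching triple

;; Two variables: both are bound, in pattern order.
(resolve-triple graph :node-1 '?a '?v)
;; => ([:name "Ivan"] [:age 15] ...)
```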
The simple index (there’s another index as well) does this using asami.index/get-from-index
https://github.com/threatgrid/asami/blob/main/src/asami/index.cljc#L33
There’s a simplify function that converts values into :v and variables into ?. That’s used to determine the dispatch
You can see there are 8 variations, each with its own code to return the appropriate result
The lettering of SPO is based on RDF: Subject-Predicate-Object. This is equivalent to EAV
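The dispatch idea, sketched with simplified names (not the real asami.index code):
```clojure
(require '[clojure.string :as str])

;; Sketch: classify each pattern element as :v (scalar) or '? (variable).
(defn simplify* [& pattern]
  (mapv #(if (and (symbol? %) (str/starts-with? (name %) "?")) '? :v)
        pattern))

(simplify* '?e :name "Ivan") ;=> [? :v :v]

;; The shape (one of the 8 variations) picks the lookup strategy.
(defmulti get-from-index* (fn [_graph s p o] (simplify* s p o)))

(defmethod get-from-index* '[:v :v ?]
  [{:keys [spo]} s p _]
  ;; subject and predicate fixed: one nested lookup, bind each object
  (map vector (get-in spo [s p])))
```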
so instead of returning all the matched full datoms like in datascript, this returns the bound values only?
yes
The other columns are fixed by the request, so they’re redundant
right
that’s the biggest difference in terms of performance though
Well, I thought about returning all 3 columns, and at the time I thought that I would just need to project them away anyway, so why would I?
It would be creating work for myself
so the indexing has to be more complex than a simple set of datoms, like in datomic or datascript
Well, have a look at that file
how complex does it look?
100 lines :0
:0
🙂
I am thinking of storage on disk
there’s an index-add function, an index-delete function, and then just a bit of code in the graph to call them
storage on disk is trickier. I won’t be using maps
so what kind of trees are you implementing?
instead, trees. They’ll be sorting triples based on the 3 orderings
AVL
each node will refer to a block of triples. From memory, 512 of them
B-trees are nice too, but I’ve had great success with AVL for both upload speed and read performance
Because the nodes are small, the whole top of the tree (down many levels) gets cached
So even though it’s a deeper tree, it’s not as slow to search as it seems at first glance. Also, each node handles a range, which means it’s not purely AVL
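Roughly this shape, with invented field names:
```clojure
;; Sketch of the node shape: AVL-style structure, but the payload is a
;; sorted block of triples, so each node covers a whole range.
(defrecord TreeNode
  [left right balance   ; child pointers + AVL balance factor
   block])              ; a sorted block of (up to ~512) triples
```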
right, but how do you maintain the nested nature of the maps in that tree?
by the sorting
i see, so it sounds to me like the storage could be exactly the same as datascript; it only needs an added filtering step to return the bound values only
then your query engine can be brought to bear
consider these triples:
a m 1
a m 2
a n 1
a o 3
b m 2
b m 3
c m 2
This is equivalent to:
a m 1
    2
  n 1
  o 3
b m 2
    3
c m 2
The second is the nested-map view
right
on disk you find the boundaries of what you’re looking for by tree searches O(log(n)). When you have maps, then it’s O(1), which is obviously nicer.
so you are only storing [abc] [momm] …, not [[am1]…]?
or are you still storing the latter, but using the bounds to avoid scanning the whole range? instead, scan a small range, then move to the next?
I’m sorry. I didn’t quite follow that
i am trying to figure out what the on-disk storage looks like in your design
ah
definitely storing the full triples. So there is redundancy
my question is do you store [a m 2] in your example?
There are ways to reduce the amount being stored, but given the tradeoffs, I’m going with the simplest approach first
oh i see, so you still store the full triples, but also store the bounds
yes
No. Store just the triples
that’s what’s missing from datascript, it doesn’t store bounds
If I’m looking for [a ?x ?y] then I need to search twice. I’ll find the start, and then find the point between [a o 3] and [b m 2]
i see, bounds are not stored, but found by seeking
searching… to me, “seeking” implies a linear search forward
sure, you have a binary tree
so my understanding is the exact same storage that datascript uses can be used in your query engine
yes
You can even just store flat triples and do everything via filter. The performance would be terrible, but it would work fine
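A sketch of that naive baseline (correct, just slow):
```clojure
(require '[clojure.string :as str])

;; Scan every triple, keep the ones that match the pattern, and
;; return bindings for the variable positions only.
(defn resolve-by-filter [triples [s p o]]
  (let [var?   #(and (symbol? %) (str/starts-with? (name %) "?"))
        match? (fn [pat x] (or (var? pat) (= pat x)))]
    (for [[ts tp to] triples
          :when (and (match? s ts) (match? p tp) (match? o to))]
      (vec (keep (fn [[pat x]] (when (var? pat) x))
                 [[s ts] [p tp] [o to]])))))

(resolve-by-filter '[[a m 1] [a m 2] [b m 2]] '[a ?x ?y])
;; => ([m 1] [m 2])
```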
right
i think it is possible for these projects to be merged
what I did in Datalevin is to port Datascript to LMDB, and it turns out to be faster than Datascript, even though Datalevin is on disk and Datascript is in memory
the same thing can be done with Asami, the only thing missing is the filtering step we discussed above
once resolve-triple is implemented as your api requires, your query engine can be used. i think this combination would yield the best possible implementation
It’s open source 🙂
sorry, I’ve been distracted. It seems that ClojureScript does not support find on transient maps
sigh
i don’t care about clojurescript, i only care about clojure, so it’s good to have choices
thank you for talking with me, I admire what you did
Thank you! 🙂
@plexus created the channel #asami so feel free to ask things in there
Yes, I am in that channel 🙂
Back to work. Talk to you later!
It’s also a very simple conjunction, so it will be naturally fast
There’s not a lot for the optimizer to do 🙂
The first query is literally just a get-in operation followed by a projection.
The second query does the planning (at a guess, it will likely pick the provided ordering of the query) and it will then turn into a set of get operations in a loop for however many people there are named “Ivan”. I would expect it to be trivially fast as well
I'll try to find some more interesting queries :)
I just realized… the query planner doesn’t execute on the first query, because there’s nothing to plan. I’m honestly confused why Datascript would be so slow for that query
asami looks really cool
how does transacting references to other entities work?
May be better to ask in #asami? But it also depends on what you’re referring to specifically. In general, it does a lookup to find the entity and get its internal node value
I just looked at the code that would be executed for the first query. It’s several if/cond statements that figure out the structure of the query, followed by a get-in from the index.
Then comes the expensive bit…
For every result, it does a reduce over the 1 column defined in the :find clause, doing a map get to find the offset of the bound variable ?e (which is 0), and doing (assoc '[?e] 0 result) to create each result row.
Literally, that’s it. I can’t see why it would take multiple milliseconds?
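In sketch form, with invented values, the per-result work is just:
```clojure
;; One bound variable ?e at offset 0; each result becomes a row by
;; assoc-ing the value into the row template at the variable's offset.
(let [offsets {'?e 0}
      columns '[?e]]                  ; the row template from :find
  (for [result [:node-1 :node-2]]     ; values bound to ?e
    (reduce (fn [row sym] (assoc row (get offsets sym) result))
            columns
            columns)))
;; => ([:node-1] [:node-2])
```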