datalog

Huahai 2020-08-04T19:12:26.046500Z

i think asami implemented the proper query algorithm

Huahai 2020-08-04T19:14:08.048200Z

looking at the code, there’s not a lot of low-level optimization, but it is still much faster than datascript, meaning asami’s algorithms are correct, whereas datascript’s may be too naive

Huahai 2020-08-04T19:16:15.049200Z

e.g. i don’t remember seeing a projection operation in datascript

Huahai 2020-08-04T20:41:33.049900Z

what does the index of asami look like?

Huahai 2020-08-04T20:42:19.050100Z

because I think the performance difference mostly comes from the differences in index data structures

quoll 2020-08-04T20:42:19.050300Z

3 maps

Huahai 2020-08-04T20:43:15.050500Z

of?

quoll 2020-08-04T20:44:55.050700Z

The indexes are: {entity {attribute #{value}}} {attribute {value #{entity}}} {value {entity #{attribute}}}
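A minimal sketch of those three nested-map indexes, with Python dicts standing in for the Clojure maps (the helper names here are illustrative, not Asami’s): every triple goes into all three orderings, so a lookup can start from whichever term is bound.

```python
from collections import defaultdict

def make_index():
    # two levels of map, innermost level is a set
    return defaultdict(lambda: defaultdict(set))

eav = make_index()  # {entity {attribute #{value}}}
ave = make_index()  # {attribute {value #{entity}}}
vea = make_index()  # {value {entity #{attribute}}}

def add_triple(e, a, v):
    # each triple is stored redundantly in all three orderings
    eav[e][a].add(v)
    ave[a][v].add(e)
    vea[v][e].add(a)

add_triple("a", "m", 1)
add_triple("a", "n", 1)
add_triple("b", "m", 2)

# all attribute/value pairs for entity "a" are one lookup away:
assert dict(eav["a"]) == {"m": {1}, "n": {1}}
# all entities with attribute "m" and value 2:
assert ave["m"][2] == {"b"}
```

Whichever position of the pattern is ground picks the index whose outermost key matches it.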

Huahai 2020-08-04T20:45:11.051Z

so it’s all nested maps?

quoll 2020-08-04T20:45:18.051200Z

yes

Huahai 2020-08-04T20:45:39.051400Z

whereas datascript has flat maps

Huahai 2020-08-04T20:46:06.051600Z

so it has to join much more data

quoll 2020-08-04T20:46:51.051800Z

The indexes are pluggable. I’m building a new one at the moment, which is quite different (very similar to the Mulgara indices)

Huahai 2020-08-04T20:47:15.052Z

what’s the new one look like?

quoll 2020-08-04T20:48:28.052200Z

It’s a tree backed by a buffer. In ClojureScript I want that buffer to be stored in the browser’s persistent storage. In Clojure it’s backed by a MappedByteBuffer on a file

quoll 2020-08-04T20:48:54.052400Z

same indexing as above though

Huahai 2020-08-04T20:49:19.052600Z

i see

quoll 2020-08-04T20:51:03.052800Z

As I said, it’s based on Mulgara. I have not maintained it for some time, but the last I heard it was still being used commercially: http://mulgara.org/

quoll 2020-08-04T20:51:35.053Z

I stopped working on Mulgara because I wanted to redo a lot of it in Clojure (it’s a Java project)

Huahai 2020-08-04T20:51:46.053200Z

cool

Huahai 2020-08-04T20:52:53.053400Z

wouldn’t it be easier to just use it as backend rather than re-implement in clojure?

Huahai 2020-08-04T20:53:37.053600Z

performance wise, it would be better too. guess you want it to be usable in the browser?

quoll 2020-08-04T20:54:06.053800Z

I want to revisit the architecture. We made decisions about it in 2001 that involved tradeoffs that we don’t need to make anymore.

quoll 2020-08-04T20:54:22.054Z

And yes… I want it to run on both platforms

quoll 2020-08-04T20:54:50.054200Z

The Clojure code is dramatically simpler

quoll 2020-08-04T20:55:43.054400Z

and the main overhead for the trees is I/O, so the speed improvement gained from Java isn’t a big deal

Huahai 2020-08-04T20:56:37.054600Z

right

Huahai 2020-08-04T20:59:24.054800Z

very cool, looking forward to it

quoll 2020-08-04T20:59:39.055Z

Still working on it 🙂

quoll 2020-08-04T20:59:59.055200Z

I have the memory mapping working (it’s tricky to manage files over 2GB)

quoll 2020-08-04T21:00:10.055400Z

Still building the trees

Huahai 2020-08-04T21:04:58.055600Z

what’s your plan for asami? compete with datomic?

quoll 2020-08-04T21:05:26.055800Z

No

quoll 2020-08-04T21:07:09.056Z

It’s a hybrid plan… • We use it at Cisco. I’ve been expanding on its capabilities as needed. • It has supplanted Mulgara as my personal project. I’m trying to get back to what Mulgara can do

quoll 2020-08-04T21:08:29.056200Z

But, to a certain extent, the fact that Datomic was not open source has been frustrating. So I thought it would be nice to have an option. It will fill a different niche though, since it isn’t peer-based, and it’s based on an OWA like RDF/SPARQL

quoll 2020-08-04T21:10:05.056500Z

There are new architectures that I want to try with triple indices for different use-cases. Mulgara’s complexity has been too hard to make this change, and I’m hoping that Asami gives me the flexibility to do what I’ve wanted to there

quoll 2020-08-04T21:15:07.056700Z

It’s been slow going, because until yesterday no one else had ever submitted a PR 🙂

Huahai 2020-08-04T21:32:20.056900Z

I think people are just now noticing asami

Huahai 2020-08-04T21:39:42.057100Z

it’s good to have more options. so far i counted 7 options that are doing datomic-like queries in the clojure world

Huahai 2020-08-04T21:41:01.057300Z

Asami is the fastest in terms of query speed, so people are starting to notice

Huahai 2020-08-04T21:44:18.057500Z

Datomic, Datascript, Asami, Crux, Datahike, Eva and my own Datalevin

Huahai 2020-08-04T21:44:24.057700Z

7 in total

quoll 2020-08-04T21:44:57.057900Z

Well, I’m aiming for Datomic-like, rather than exactly copying Datomic 🙂

Huahai 2020-08-04T21:45:14.058100Z

i think these 7 are all different

Huahai 2020-08-04T21:46:01.058300Z

Asami is the most different, actually, for it’s not

Huahai 2020-08-04T21:46:38.058500Z

Datomic-inspired, whereas all others are

quoll 2020-08-04T21:47:29.058700Z

Well, it was partly inspired by Datomic. But also RDF

Huahai 2020-08-04T21:49:17.058900Z

datomic is inspired by RDF too, but yours is earlier than that

Huahai 2020-08-04T21:50:33.059100Z

the biggest difference is the indexing structure, and I think you got it right

Huahai 2020-08-04T21:51:32.059300Z

you seem to have more experience in RDF-like databases than the Datomic people

Huahai 2020-08-04T21:56:42.059500Z

are you open to alternative durable storage beyond what you are working on?

quoll 2020-08-04T21:57:12.059700Z

Well one thing I haven’t done is to keep the full history of assertions and retractions. There’s a lot of complexity if I want to fully index that

quoll 2020-08-04T21:57:18.059900Z

definitely

quoll 2020-08-04T21:57:22.060100Z

This is why it’s pluggable

quoll 2020-08-04T21:58:35.060300Z

It just needs to follow a simple protocol

quoll 2020-08-04T21:59:05.060800Z

And the graph-diff function in there is optional

Huahai 2020-08-04T21:59:48.061Z

graph here means what?

Huahai 2020-08-04T21:59:53.061200Z

a set of triples?

quoll 2020-08-04T21:59:59.061400Z

yes

quoll 2020-08-04T22:01:11.061800Z

That protocol may get tweaked (around transactions), but that’s basically the level of abstraction I need

Huahai 2020-08-04T22:04:13.062Z

what does resolve-triple do?

quoll 2020-08-04T22:06:49.062200Z

It takes a graph and 3 elements (that come from a constraint pattern). Each element can either be a scalar value (e.g. a keyword, string, long, etc), or else a symbol that starts with ? . These symbols are treated as variables. It then returns bindings for all of those variables, based on the triples that match the provided scalar values.
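That contract can be sketched as follows (a Python stand-in for the Clojure function, with an illustrative brute-force scan rather than Asami’s index lookups): `?`-prefixed symbols are variables, everything else must match exactly, and only the variable columns come back.

```python
def is_var(x):
    # variables are symbols starting with ?, e.g. ?e, ?attr
    return isinstance(x, str) and x.startswith("?")

def resolve_triple(triples, s, p, o):
    """Return rows of bound values, one column per variable in (s, p, o)."""
    pattern = (s, p, o)
    results = []
    for triple in triples:
        # a triple matches if every ground element of the pattern equals it
        if all(is_var(el) or el == t for el, t in zip(pattern, triple)):
            # keep only the columns that were variables
            results.append(tuple(t for el, t in zip(pattern, triple) if is_var(el)))
    return results

graph = [("a", "m", 1), ("a", "m", 2), ("b", "m", 2), ("a", "n", 1)]

# pattern [a ?attr ?val]: the fixed subject "a" is not repeated in the output
assert resolve_triple(graph, "a", "?attr", "?val") == [("m", 1), ("m", 2), ("n", 1)]
```

The real implementation dispatches to an index instead of scanning, but the input/output shape is the same.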

quoll 2020-08-04T22:08:22.062400Z

The simple index (there’s another index as well) does this using asami.index/get-from-index https://github.com/threatgrid/asami/blob/main/src/asami/index.cljc#L33

quoll 2020-08-04T22:09:18.062700Z

There’s a simplify function that converts values into :v and variables into ?. That’s used to determine the dispatch

quoll 2020-08-04T22:10:07.062900Z

You can see there are 8 variations, each with its own code to return the appropriate result
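The dispatch idea can be sketched like this (illustrative Python, not Asami’s code): classify each pattern position as a ground value `:v` or a variable `?`, giving 2×2×2 = 8 shapes, each of which gets its own lookup strategy.

```python
from itertools import product

def simplify(s, p, o):
    # map each element to ":v" (ground value) or "?" (variable)
    return tuple("?" if isinstance(x, str) and x.startswith("?") else ":v"
                 for x in (s, p, o))

assert simplify("a", "?attr", "?val") == (":v", "?", "?")
assert simplify("?e", "m", 2) == ("?", ":v", ":v")

# 8 possible dispatch shapes in total, one per combination:
assert len(list(product([":v", "?"], repeat=3))) == 8
```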

quoll 2020-08-04T22:10:45.063100Z

The lettering of SPO is based on RDF: Subject-Predicate-Object. This is equivalent to EAV

Huahai 2020-08-04T22:10:46.063300Z

so instead of returning all the matched full datoms like in datascript, this returns the bound values only?

quoll 2020-08-04T22:10:56.063500Z

yes

quoll 2020-08-04T22:11:21.063700Z

The other columns are fixed by the request, so they’re redundant

Huahai 2020-08-04T22:11:57.063900Z

right

Huahai 2020-08-04T22:12:17.064100Z

that’s the biggest difference in terms of performance though

quoll 2020-08-04T22:13:32.064300Z

Well, I thought about returning all 3 columns, and at the time I thought that I would just need to project them away anyway, so why would I?

quoll 2020-08-04T22:13:40.064500Z

It would be creating work for myself

Huahai 2020-08-04T22:14:21.064700Z

so the indexing has to be more complex than a simple set of datoms, like in datomic or datascript

quoll 2020-08-04T22:15:18.064900Z

Well, have a look at that file

quoll 2020-08-04T22:15:23.065100Z

how complex does it look?

Huahai 2020-08-04T22:15:52.065300Z

100 lines :0

Huahai 2020-08-04T22:15:54.065500Z

:0

Huahai 2020-08-04T22:15:56.065700Z

🙂

Huahai 2020-08-04T22:16:08.065900Z

I am thinking of storage on disk

quoll 2020-08-04T22:16:21.066100Z

there’s an index-add function, an index-delete function and then just a bit of code in the graph to call them

quoll 2020-08-04T22:16:37.066300Z

storage on disk is trickier. I won’t be using maps

Huahai 2020-08-04T22:17:03.066500Z

so what kind of trees are you implementing?

quoll 2020-08-04T22:17:04.066700Z

instead, trees. They’ll be sorting triples based on the 3 orderings

quoll 2020-08-04T22:17:07.066900Z

AVL

quoll 2020-08-04T22:17:45.067100Z

each node will refer to a block of triples. From memory, 512 of them

quoll 2020-08-04T22:18:26.067300Z

B-trees are nice too, but I’ve had great success with AVL for both upload speed and read performance

quoll 2020-08-04T22:18:54.067500Z

Because the nodes are small, the whole top of the tree (down many levels) gets cached

quoll 2020-08-04T22:22:11.067700Z

So even though it’s a deeper tree, it’s not as slow to search as it seems at first glance. Also, each node handles a range, which means it’s not purely AVL

Huahai 2020-08-04T22:23:11.067900Z

right, but how do you maintain the nested nature of the maps in that tree?

quoll 2020-08-04T22:25:05.068100Z

by the sorting

Huahai 2020-08-04T22:26:20.068300Z

i see, so it sounds to me the storage could be exactly the same as datascript’s, you only need to add a filtering step to return the bound values only

Huahai 2020-08-04T22:27:25.068500Z

then your query engine can be brought to bear

quoll 2020-08-04T22:27:28.068700Z

consider these triples:

a m 1
a m 2
a n 1
a o 3
b m 2
b m 3
c m 2
This is equivalent to:
a m 1
    2
  n 1
  o 3
b m 2
    3
c m 2
The second is the nested-map view
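The equivalence can be checked mechanically: folding the sorted triples into nested maps (Python dicts here, standing in for the Clojure structure) reproduces the nested-map view exactly.

```python
triples = [
    ("a", "m", 1), ("a", "m", 2), ("a", "n", 1), ("a", "o", 3),
    ("b", "m", 2), ("b", "m", 3), ("c", "m", 2),
]

# rebuild {s {p #{o}}} from the flat sorted list
nested = {}
for s, p, o in triples:
    nested.setdefault(s, {}).setdefault(p, set()).add(o)

assert nested == {
    "a": {"m": {1, 2}, "n": {1}, "o": {3}},
    "b": {"m": {2, 3}},
    "c": {"m": {2}},
}
```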

Huahai 2020-08-04T22:27:39.068900Z

right

quoll 2020-08-04T22:29:02.069100Z

on disk you find the boundaries of what you’re looking for by tree searches O(log(n)). When you have maps, then it’s O(1), which is obviously nicer.

Huahai 2020-08-04T22:30:07.069300Z

so you are only storing [abc] [momm] …, not [[am1]…]?

Huahai 2020-08-04T22:31:09.069500Z

or are you still storing the latter, but using the bounds to avoid scanning the whole range? instead, scan a small range, then move to the next?

quoll 2020-08-04T22:32:26.069700Z

I’m sorry. I didn’t quite follow that

Huahai 2020-08-04T22:33:23.069900Z

i am trying to figure what the on-disk storage looks like in your design

quoll 2020-08-04T22:34:05.070100Z

ah

quoll 2020-08-04T22:34:29.070300Z

definitely storing the full triples. So there is redundancy

Huahai 2020-08-04T22:34:38.070500Z

my question is do you store [a m 2] in your example?

quoll 2020-08-04T22:34:58.070700Z

There are ways to reduce the amount being stored, but given the tradeoffs, I’m going with the simplest approach first

Huahai 2020-08-04T22:35:02.070900Z

oh i see, so you still store the full triples, but also store the bounds

quoll 2020-08-04T22:35:10.071100Z

yes

quoll 2020-08-04T22:35:22.071300Z

No. Store just the triples

Huahai 2020-08-04T22:35:24.071500Z

that’s what’s missing from datascript, it doesn’t store bounds

quoll 2020-08-04T22:36:41.071700Z

If I’m looking for [a ?x ?y] then I need to search twice. I’ll find the start, and then find the point between [a o 3] and [b m 2]
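Those two searches can be sketched with Python’s `bisect` on a sorted list of tuples (a stand-in for the on-disk tree; the sentinel trick below is illustrative): one search finds the start of the `a` run, the other finds the point just past `[a o 3]`, and the matches are the contiguous slice between them.

```python
import bisect

triples = sorted([
    ("a", "m", 1), ("a", "m", 2), ("a", "n", 1), ("a", "o", 3),
    ("b", "m", 2), ("b", "m", 3), ("c", "m", 2),
])

# first search: ("a",) sorts before every 3-tuple starting with "a",
# so this lands at the start of the run
lo = bisect.bisect_left(triples, ("a",))

# second search: a sentinel that sorts after every ("a", p, o) but
# before any ("b", ...), landing just past [a o 3]
hi = bisect.bisect_right(triples, ("a", chr(0x10FFFF)))

assert triples[lo:hi] == [
    ("a", "m", 1), ("a", "m", 2), ("a", "n", 1), ("a", "o", 3),
]
```

Each boundary costs O(log n) instead of the O(1) of an in-memory map lookup, which is the tradeoff mentioned above.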

Huahai 2020-08-04T22:37:30.071900Z

i see, bounds are not stored, but found by seeking

quoll 2020-08-04T22:38:03.072100Z

searching… to me, “seeking” implies a linear search forward

Huahai 2020-08-04T22:38:26.072300Z

sure, you have a binary tree

Huahai 2020-08-04T22:39:36.072500Z

so my understanding is the exact same storage that datascript uses can be used in your query engine

quoll 2020-08-04T22:39:51.072700Z

yes

quoll 2020-08-04T22:40:37.072900Z

You can even just store flat triples and do everything via filter. The performance would be terrible, but it would work fine

Huahai 2020-08-04T22:40:54.073100Z

right

Huahai 2020-08-04T22:41:39.073300Z

i think it is possible for these projects to be merged

Huahai 2020-08-04T22:42:53.073500Z

what I did in Datalevin is to port datascript to LMDB, and it turns out faster than Datascript, even though datalevin is on disk and datascript is in memory

Huahai 2020-08-04T22:43:32.073700Z

the same thing can be done with Asami, the only thing missing is the filtering step we discussed above

Huahai 2020-08-04T22:46:01.073900Z

once resolve-triple is implemented as your api requires, your query engine can be used, i think this combination will yield the best possible implementation

quoll 2020-08-04T22:47:32.074100Z

It’s open source 🙂

quoll 2020-08-04T22:48:01.074300Z

sorry, I’ve been distracted. It seems that ClojureScript does not support find on transient maps

quoll 2020-08-04T22:48:04.074500Z

sigh

Huahai 2020-08-04T22:49:41.074700Z

i don’t care about clojurescript, i only care about clojure, so it’s good to have choices

Huahai 2020-08-04T22:50:15.074900Z

thank you for talking with me, I admire what you did

quoll 2020-08-04T22:53:26.075100Z

Thank you! 🙂

quoll 2020-08-04T22:54:12.075300Z

@plexus created the channel #asami so feel free to ask things in there

Huahai 2020-08-04T22:55:55.075500Z

Yes, I am in that channel 🙂

Huahai 2020-08-04T22:57:14.075700Z

Back to work. Talk to you later!