asami

Asami, the graph database https://github.com/threatgrid/asami
whilo 2020-12-03T08:19:34.199300Z

@quoll I have done some quick experiments, but I probably do something wrong. My results are here: https://github.com/whilo/asami/blob/benchmark-comparison-hitchhiker-tree/test/asami/durable/pool_test.cljc#L230

whilo 2020-12-03T08:23:13.200700Z

Maybe you can have a look before our meeting later. I realized that the book contains a lot of double words and decided to only insert a set of those, but performance seems similar otherwise as well.

quoll 2020-12-03T12:59:02.201Z

Even if performance seems similar, only inserting a set of the words is not going to be realistic usage. The purpose of this is to get a single, consistent ID value for every string that is passed in. Which means that it needs to search for them as well as store the ones it doesn’t find. Of course, caching will matter a lot for that, and 13k words will easily fit in memory, which will be why the performance looks similar at this size.

quoll 2020-12-03T13:04:37.201200Z

I haven’t looked at the loading yet, though reading is being done completely differently. In the HH case you’re iterating over the data, while for Asami you’re doing an individual lookup (i.e. search the tree) for each word. A more direct comparison there will be to use a tree iterator node-seq

quoll 2020-12-03T13:28:26.201400Z

That would be:

(time (doall (node-seq (:index bpool) (first-node bpool))))

quoll 2020-12-03T13:29:03.201600Z

However, that will only count the words that were stored. Short words aren’t stored!

quoll 2020-12-03T13:30:33.201800Z

Strings of 7 characters or fewer are turned into a number. It’s only strings of 8 characters or more that are saved in the tree

quoll 2020-12-03T13:31:44.202Z

(I say “characters”. It’s really “bytes”. But I use UTF-8, and that text is all in the ASCII range, so it’s the same)

quoll 2020-12-03T13:35:08.202200Z

One difference I see in loading is in what is being done for each insert. Without knowing the hh code, it appears that you’re loading everything in a batch, then flushing. I presume that the load operation is ordering the data, right?

quoll 2020-12-03T13:38:03.202400Z

The Asami code is doing a 2 stage operation for each string: • Does the string exist in the pool? If so, return its ID. • Insert the string, returning the new ID. The result is a seq of IDs that represent the original text. (Called coded in my example).

quoll 2020-12-03T13:38:55.202600Z

The second step of my test was to use the pool to map the sequence of numbers in coded back into their original strings, using the index.

quoll 2020-12-03T13:40:33.202800Z

That said, I fully expect the HH code to be faster at loading, though not by as much as your earlier test suggests

quoll 2020-12-03T14:24:10.203900Z

You had originally stated 11am last week. When I said I couldn’t do that time you suggested today and that you were going to ask around for who wants to join… but then nothing more was said!

quoll 2020-12-03T14:24:20.204200Z

I can manage 11am though. That’s 96 minutes from now

quoll 2020-12-03T14:34:59.204500Z

This machine (apparently fast than yours) encodes the strings into numbers in about 7 seconds (7436ms on my last run. I’m seeing faster and slower) However, that’s only the strings that are longer than 7 bytes, which 7051. You’ll likely be faster again when you restrict yourself to just those. Iterating over that takes 220ms

whilo 2020-12-03T15:37:52.205100Z

I am here, yes, sorry for not confirming.

whilo 2020-12-03T15:41:42.205200Z

The data is not presorted before it is inserted into the hitchhiker-tree. Every string is inserted in sequence and the sorting happens internally, partially lazily, because the hitchhiker-tree is fractal.

whilo 2020-12-03T15:49:12.205400Z

Ok, with (time (doall (node-seq (:index bpool) (first-node bpool)))) bulk reading takes half the time for me than before. I have not benchmarked random accesses yet. For this it might be sensible to run a more careful benchmark. The hitchhiker-tree has a fairly nice benchmarking suite with reporting to an Excel sheet, albeit only for random insertions and deletes so far.

whilo 2020-12-03T15:50:20.205600Z

I am not sure abou the pool internals with the string ids that you are mentioning. Are the ids important for the user of the pool or are they just internal?

whilo 2020-12-03T15:58:58.206Z

Let's meet here https://meet.jit.si/WeakParticipationsRotateSharply.

whilo 2020-12-03T15:59:17.206500Z

(if this works, otherwise we can also go somewhere else)

quoll 2020-12-03T16:00:32.207Z

Hi! I’m really sorry… I was out on an errand and couldn’t respond

quoll 2020-12-03T16:01:13.207700Z

I have to apologize. I was in a rush this morning, and didn’t think it through. I have a department meeting right now!

quoll 2020-12-03T16:01:18.207900Z

It’s every Thursday at this time

whilo 2020-12-03T16:01:48.208500Z

Ok, no problem.

quoll 2020-12-03T16:01:50.208700Z

Oh, wait.

quoll 2020-12-03T16:01:53.208900Z

We have a new boss

quoll 2020-12-03T16:02:01.209200Z

and over the break she changed the time

quoll 2020-12-03T16:02:06.209400Z

So I can meet now!

quoll 2020-12-03T16:03:36.209500Z

The IDs are the point of the pool

quoll 2020-12-03T16:03:56.209700Z

The purpose of the pool is to map data to IDs, and to map IDs back to data

whilo 2020-12-03T16:04:11.210100Z

Cool, let me know whether jitsi works for you.

quoll 2020-12-03T16:51:35.210900Z

As being discussed offline, this is the program that Naga was running:

sibling(fred, barney).
parent(fred, mary).
sibling(mary, george).
gender(george, male).

parent(B, C) :- sibling(A, B), parent(A, C).
brother(A, B) :- sibling(A, B), gender(B, male).
uncle(A, C) :- parent(A, B), brother(B, C).
sibling(A, B) :- parent(A, P), parent(B, P), A != B.
gender(F, male) :- father(A, F).
parent(A, F) :- father(A, F).