@quoll I have done some quick experiments, but I'm probably doing something wrong. My results are here: https://github.com/whilo/asami/blob/benchmark-comparison-hitchhiker-tree/test/asami/durable/pool_test.cljc#L230
Maybe you can have a look before our meeting later. I realized that the book contains a lot of duplicate words, so I decided to insert only the set of distinct words, but performance seems similar otherwise as well.
Even if performance seems similar, only inserting a set of the words is not going to be realistic usage. The purpose of this is to get a single, consistent ID value for every string that is passed in. Which means that it needs to search for them as well as store the ones it doesn’t find. Of course, caching will matter a lot for that, and 13k words will easily fit in memory, which will be why the performance looks similar at this size.
I haven’t looked at the loading yet, though reading is being done completely differently. In the HH case you’re iterating over the data, while for Asami you’re doing an individual lookup (i.e. searching the tree) for each word. A more direct comparison there would be to use the tree iterator node-seq
That would be:
(time (doall (node-seq (:index bpool) (first-node bpool))))
However, that will only count the words that were stored. Short words aren’t stored!
Strings of 7 characters or fewer are turned into a number. It’s only strings of 8 characters or more that are saved in the tree
(I say “characters”. It’s really “bytes”. But I use UTF-8, and that text is all in the ASCII range, so it’s the same)
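The real encoding has more to it, but the basic trick looks something like this (a sketch with a made-up bit layout, not Asami's actual format):
(defn encode-short
  "Pack a string of up to 7 UTF-8 bytes into a long: length in the
   top byte, data bytes filled in from the next-highest byte down."
  [^String s]
  (let [bs (.getBytes s "UTF-8")
        n  (alength bs)]
    (assert (<= n 7) "only strings of 7 bytes or fewer fit")
    (reduce (fn [acc i]
              (bit-or acc (bit-shift-left (bit-and (aget bs i) 0xFF)
                                          (* 8 (- 6 i)))))
            (bit-shift-left (long n) 56)
            (range n))))
(defn decode-short
  "Reverse of encode-short: read the length from the top byte,
   then pull each data byte back out."
  [^long v]
  (let [n (bit-and (unsigned-bit-shift-right v 56) 0xFF)]
    (String. (byte-array
              (for [i (range n)]
                (unchecked-byte
                 (unsigned-bit-shift-right v (* 8 (- 6 i))))))
             "UTF-8")))
(decode-short (encode-short "barney")) ;; => "barney", no tree access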
One difference I see in loading is in what is being done for each insert. Without knowing the hh code, it appears that you’re loading everything in a batch, then flushing. I presume that the load operation is ordering the data, right?
The Asami code is doing a 2 stage operation for each string:
• Does the string exist in the pool? If so, return its ID.
• Insert the string, returning the new ID.
The result is a seq of IDs that represent the original text (called coded in my example).
The second step of my test was to use the pool to map the sequence of numbers in coded back into their original strings, using the index.
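In toy form, the whole round trip looks like this (in-memory and with hypothetical names; the durable pool does the same thing against the tree):
(defn write! [pool s]
  (if-let [id (get-in pool [:str->id s])]
    [pool id]                               ;; stage 1: found, return its ID
    (let [id (:next-id pool)]               ;; stage 2: insert, fresh ID
      [(-> pool
           (assoc-in [:str->id s] id)
           (assoc-in [:id->str id] s)
           (update :next-id inc))
       id])))
(defn encode-all [pool words]
  (reduce (fn [[pool ids] w]
            (let [[pool' id] (write! pool w)]
              [pool' (conj ids id)]))
          [pool []]
          words))
(let [[pool coded] (encode-all {:next-id 0} ["fred" "barney" "fred"])]
  [coded (mapv (:id->str pool) coded)])
;; => [[0 1 0] ["fred" "barney" "fred"]]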
That said, I fully expect the HH code to be faster at loading, though not by as much as your earlier test suggests
You had originally stated 11am last week. When I said I couldn’t do that time, you suggested today and said you were going to ask around for who wants to join… but then nothing more was said!
I can manage 11am though. That’s 96 minutes from now
This machine (apparently faster than yours) encodes the strings into numbers in about 7 seconds (7436ms on my last run; I’m seeing faster and slower). However, that’s only the strings that are longer than 7 bytes, which is 7051 of them. You’ll likely be faster again when you restrict yourself to just those. Iterating over that takes 220ms
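That iteration works out to roughly 31µs per stored word (220ms / 7051 words), if I’m dividing right.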
I am here, yes, sorry for not confirming.
The data is not presorted before it is inserted into the hitchhiker-tree. Every string is inserted in sequence and the sorting happens internally, partially lazily, because the hitchhiker-tree is fractal.
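A toy version of that buffering idea (my simplification, not the actual hitchhiker-tree API): inserts land in an unsorted buffer at the root, and only when the buffer overflows do the entries get merged down into the sorted leaves, so most inserts are cheap and the sorting happens lazily in batches.
(def B 4) ;; buffer capacity before a flush
(defn flush-buffer
  "Push everything buffered at the root down into the sorted leaves."
  [{:keys [buffer pivot left right] :as tree}]
  (let [{lo false, hi true} (group-by #(>= % pivot) buffer)]
    (assoc tree
           :buffer []
           :left  (into left lo)
           :right (into right hi))))
(defn insert [tree k]
  (let [tree (update tree :buffer conj k)]
    (if (> (count (:buffer tree)) B)
      (flush-buffer tree)
      tree)))
(def empty-tree {:pivot 50 :buffer [] :left (sorted-set) :right (sorted-set)})
(reduce insert empty-tree [90 12 77 3 51 8])
;; => {:pivot 50, :buffer [8], :left #{3 12}, :right #{51 77 90}}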
Ok, with (time (doall (node-seq (:index bpool) (first-node bpool)))) bulk reading takes half as long for me as before. I have not benchmarked random accesses yet. For this it might be sensible to run a more careful benchmark. The hitchhiker-tree has a fairly nice benchmarking suite with reporting to an Excel sheet, albeit only for random insertions and deletes so far.
I am not sure about the pool internals with the string IDs that you are mentioning. Are the IDs important to the user of the pool, or are they just internal?
Let's meet here https://meet.jit.si/WeakParticipationsRotateSharply.
(if this works; otherwise we can go somewhere else)
Hi! I’m really sorry… I was out on an errand and couldn’t respond
I have to apologize. I was in a rush this morning, and didn’t think it through. I have a department meeting right now!
It’s every Thursday at this time
Ok, no problem.
Oh, wait.
We have a new boss
and over the break she changed the time
So I can meet now!
The IDs are the point of the pool
The purpose of the pool is to map data to IDs, and to map IDs back to data
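So the contract is roughly this (using my names here, not necessarily the exact protocol):
(defprotocol Pool
  (find-id     [pool data] "data -> its ID, or nil if absent")
  (find-object [pool id]   "ID -> the original data")
  (write!      [pool data] "find-or-insert; returns [pool' id]"))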
Cool, let me know whether jitsi works for you.
As discussed offline, this is the program that Naga was running:
sibling(fred, barney).
parent(fred, mary).
sibling(mary, george).
gender(george, male).
parent(B, C) :- sibling(A, B), parent(A, C).
brother(A, B) :- sibling(A, B), gender(B, male).
uncle(A, C) :- parent(A, B), brother(B, C).
sibling(A, B) :- parent(A, P), parent(B, P), A != B.
gender(F, male) :- father(A, F).
parent(A, F) :- father(A, F).
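If I trace those rules to fixpoint over the four facts, the derivations of interest are:
parent(barney, mary).   % sibling(fred, barney) + parent(fred, mary)
sibling(barney, fred).  % parent(barney, mary) + parent(fred, mary)
brother(mary, george).  % sibling(mary, george) + gender(george, male)
uncle(fred, george).    % parent(fred, mary) + brother(mary, george)
uncle(barney, george).  % parent(barney, mary) + brother(mary, george)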
Link to async-difficult code: https://github.com/threatgrid/naga/blob/d36cb2a44d38c27669a52b6b3f421d9db3bc6a96/src/naga/storage/memory/core.clj#L155