asami

Asami, the graph database https://github.com/threatgrid/asami
whilo 2020-11-24T05:02:05.183400Z

@quoll This is an interesting benchmark. Do you have a reproducible version somewhere? We can compare the numbers with the hitchhiker-tree then. Was this with or without persisting to a durable medium?

quoll 2020-11-24T05:03:46.184400Z

I’ve been messing with it since 🙂 And there’s a bug that I’m tracking down, though I don’t think it shows up in the version that led to that benchmark.

quoll 2020-11-24T05:04:28.185200Z

The string pool does not store short strings. Instead, it encodes them in a negative number. Longer strings are stored
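
For readers following along, here is a minimal sketch of the general idea of packing a short string into a negative number instead of storing it. It is not Asami's actual encoding: the bit layout, the 7-byte limit, and the names `encode-short` and `decode-short` are all assumptions made up for illustration.

```clojure
;; Hypothetical layout (an assumption, not Asami's real format):
;;   bit 63      = 1, so the id is negative and marks an inlined string
;;   bits 56-58  = length of the string in UTF-8 bytes (0 to 7)
;;   bits 0-55   = the bytes themselves, first byte in the lowest 8 bits
(def max-inline-bytes 7)

(defn encode-short
  "Returns a negative long holding the UTF-8 bytes of s,
  or nil when s needs more than 7 bytes and must be stored instead."
  [^String s]
  (let [^bytes bs (.getBytes s "UTF-8")
        n (alength bs)]
    (when (<= n max-inline-bytes)
      (reduce (fn [acc i]
                (bit-or acc
                        (bit-shift-left (bit-and (long (aget bs i)) 0xFF)
                                        (* 8 i))))
              (bit-or Long/MIN_VALUE (bit-shift-left (long n) 56))
              (range n)))))

(defn decode-short
  "Recovers the string from an id built by encode-short.
  Returns nil for non-negative ids, which would refer to stored data."
  [^long id]
  (when (neg? id)
    (let [n (bit-and (unsigned-bit-shift-right id 56) 0x7)
          ^bytes bs (byte-array
                     (map (fn [i]
                            (unchecked-byte
                             (bit-and (unsigned-bit-shift-right id (* 8 i)) 0xFF)))
                          (range n)))]
      (String. bs "UTF-8"))))
```

With this layout, `(decode-short (encode-short "pride"))` round-trips to the original word, while `(encode-short "prejudice")` returns nil because the word needs 9 bytes and would have to go into the stored pool.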

quoll 2020-11-24T05:06:04.187Z

A lot of the document was short strings, so it’s hard to say what the ratio of encoding into numbers vs storing was. To figure that out, I told it to store everything. And it turns out that there’s a bug when the stored string is shorter than the size of the data in a tree node. I’m debugging this now. 🙂

quoll 2020-11-24T05:06:35.187600Z

So it was a bit premature to be talking about that above, but I was really excited that it worked!

quoll 2020-11-24T05:23:08.189700Z

It’s possible to reproduce what I was talking about, though it’s tricky…
• check out the “storage” branch
• check out the working commit: 0b6031f
• execute the code in the attached thread…

quoll 2020-11-24T05:23:24.189800Z

(require '[clojure.string :as s])
(require '[asami.durable.common :refer :all])
(require '[asami.durable.pool :refer [create-pool]])

;; read a book
(def book (slurp "resources/pride_and_prejudice.txt"))
(def words (s/split book #" "))

;; create a data pool
(def pool (create-pool "book"))
;; load words into the pool, accumulating the numbers the pool assigns to the words
(def coded-pool (time (reduce (fn [[ids p] w]
                          (let [[id p'] (write! p w)]
                            [(conj ids id) p']))
                        [[] pool]
                        words)))
(def coded (first coded-pool))
(def bpool (second coded-pool))

;; transactions are handled outside of the pool, so get transaction point for later
(def root (:root-id bpool))

;; coded now contains numbers for every word in the document
(count coded)
(take 10 coded)

;; ask the data pool for the data associated with each number
(def output-words (map #(find-object bpool %) coded))

(time (count output-words))
;; 3.1s 3.4s
(= words output-words)

;; truncate the files to just what is in use
(close bpool)


;; come back and reopen the file. Use the transaction point from above
(def pool2 (create-pool "book" root))

;; ask the data pool again for the data associated with each number in coded
(def output-words2 (map #(find-object pool2 %) coded))

;; is it all still available?
(count output-words2)
(= words output-words2)

quoll 2020-11-24T05:25:25.190Z

The file pride_and_prejudice.txt is at: <https://www.gutenberg.org/files/1342/1342-0.txt>

quoll 2020-11-24T05:26:02.190700Z

Either that, or give me a day or 2 to get it back into shape.

quoll 2020-11-24T05:26:43.191500Z

The reason for this work is that the tree nodes are bigger than they seem to need to be. It’s not slowing anything down, but it’s using more space.