asami

Asami, the graph database https://github.com/threatgrid/asami
quoll 2021-03-20T00:31:51.053600Z

Asami alpha 6 is out now. Changes are:
• Inserting entities that have no :db/id will no longer report their own ID in the :tempids from the transaction
• No longer dependent on Cheshire, which means no more dependency on Jackson XML
• Fixed reflections in the durable code, with about a 30% speedup

Ⓜ️ 1
💪 4
❌ 1
quoll 2021-03-20T00:33:26.054300Z

I’m doing training next week, so there won’t be any Asami development until the week after

Craig Brozefsky 2021-03-20T13:27:35.056700Z

Just did the same load test with Alpha 6 and can confirm that estimate of the speedup

👍 1
quoll 2021-03-20T15:12:46.073500Z

I know I have work to do to improve load speed, but at the same time there’s a tradeoff between use cases:
• Regular updates and modifications (current design)
• Load once and analyze

quoll 2021-03-20T15:13:34.074200Z

The current design is specifically to allow regular updates without great expense, while also trying to keep querying fast

quoll 2021-03-20T15:17:19.077500Z

I have another design which is optimized for fast loading. But it can’t update anything in place. Instead, it would manage updates as another pair of graphs (additions and retractions), which get dynamically merged in with the main graph during queries. Then in the background, those changes would be merged into a single index again. This makes updates possible, but it’s not going to do lots of modifications quickly.
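The additions/retractions scheme described above can be sketched in a few lines. This is a conceptual illustration in Python with hypothetical names, not Asami's implementation:

```python
class OverlayGraph:
    """A read-only base graph plus in-memory additions and retractions.

    Matches resolve as (base + additions) - retractions. A background
    step can later fold the overlays into a single index again.
    """

    def __init__(self, base):
        self.base = set(base)      # immutable, load-optimized index
        self.additions = set()     # triples asserted since the last merge
        self.retractions = set()   # triples retracted since the last merge

    def assert_triple(self, triple):
        self.retractions.discard(triple)
        self.additions.add(triple)

    def retract_triple(self, triple):
        self.additions.discard(triple)
        self.retractions.add(triple)

    def match(self, pattern):
        """Yield triples matching an (s, p, o) pattern; None is a wildcard."""
        def hits(t):
            return all(p is None or p == v for p, v in zip(pattern, t))
        for t in (self.base | self.additions) - self.retractions:
            if hits(t):
                yield t

    def merge(self):
        """Background step: fold the overlays back into one index."""
        self.base = (self.base | self.additions) - self.retractions
        self.additions.clear()
        self.retractions.clear()


g = OverlayGraph({("a", "type", "node"), ("b", "type", "node")})
g.retract_triple(("b", "type", "node"))
g.assert_triple(("c", "type", "node"))
assert set(g.match((None, "type", "node"))) == {("a", "type", "node"),
                                                ("c", "type", "node")}
```

The tradeoff discussed above is visible here: asserts and retracts are cheap appends to the overlays, but heavy modification churn just grows the overlays until the background merge runs.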

quoll 2021-03-20T15:18:22.078500Z

The thing I just implemented is actually a hybrid between the old Mulgara and this write-once design. It mostly follows what Mulgara did, but with a few exceptions.

quoll 2021-03-20T15:19:52.079300Z

I knew that Clojure was going to bring some overhead, but I’m still a bit disappointed in the speed. Hoping I can improve it more

Craig Brozefsky 2021-03-20T20:21:37.080Z

Well, from what I know of the problem space Cisco is dealing with..

Craig Brozefsky 2021-03-20T20:21:53.080500Z

My hunch would be fast load, minimal modification relative to loaded data set...

Craig Brozefsky 2021-03-20T20:22:19.081400Z

aka, investigation/analysis vs progressive development of a stable and persistent knowledge set

quoll 2021-03-20T20:22:24.081600Z

It depends on who’s using it. Right now my team is modifying things a LOT

Craig Brozefsky 2021-03-20T20:23:02.082400Z

enrichment loads?

👍 1
quoll 2021-03-20T20:23:07.082600Z

But I agree. I think the real value of these systems is in providing a new view for analysis

Craig Brozefsky 2021-03-20T20:23:28.083Z

or the synthesis of higher order models like "target" etc...

quoll 2021-03-20T20:23:37.083200Z

both

Craig Brozefsky 2021-03-20T20:23:49.083600Z

Ok, cause to me, neither of those is modification

Craig Brozefsky 2021-03-20T20:23:52.083800Z

that's addition

Craig Brozefsky 2021-03-20T20:24:08.084600Z

but apparently those are modifications in terms of what you were talking about above

Craig Brozefsky 2021-03-20T20:24:26.085200Z

I think "update" when I hear modification, not adding some triples

quoll 2021-03-20T20:24:44.085600Z

They’ve recently started understanding what they get out of it, and they’re starting to use it for everything. Apparently it’s both easier for them, and the resulting code is faster than what they were doing before (which I’m surprised at, amused by, and grateful to learn)

Craig Brozefsky 2021-03-20T20:25:14.086500Z

well, at some point, munging and mapping/zipping around in javascript data objects is just tedious 8^)

quoll 2021-03-20T20:25:19.086700Z

They have lots of entities where they want to change state of values, not simply add new ones

Craig Brozefsky 2021-03-20T20:25:41.087Z

yah, so those "synthesized" entities, like Target

Craig Brozefsky 2021-03-20T20:25:54.087200Z

or node for the graph...

Craig Brozefsky 2021-03-20T20:26:57.087800Z

well, to put this in perspective. I wager my test was way more data than they are expecting to deal with

Craig Brozefsky 2021-03-20T20:27:12.088Z

and they are primarily using the in-mem store

Craig Brozefsky 2021-03-20T20:28:12.088700Z

My intuition is that it's really the update speed on the durable storage that is the bottleneck

Craig Brozefsky 2021-03-20T20:28:22.088900Z

err, I mean, "adding triples"

Craig Brozefsky 2021-03-20T20:28:58.089400Z

but I wonder how much durable storage use cases and in-mem use cases overlap

Craig Brozefsky 2021-03-20T20:29:21.089700Z

I mean, I think Mario's work is doing snapshots...

Craig Brozefsky 2021-03-20T20:29:41.090200Z

durable storage is more about having a working set that is significantly larger than expected available RAM

Craig Brozefsky 2021-03-20T20:30:11.090800Z

it's not so much about persistence -- as the current snapshot dump/load mechanism for an in-mem db is sufficient for those use cases

Craig Brozefsky 2021-03-20T20:31:01.091500Z

if I can shrink my working set down to something that can fit in like 32g of ram...

Craig Brozefsky 2021-03-20T20:31:08.091900Z

current asami is gold

Craig Brozefsky 2021-03-20T20:32:00.093700Z

it's when I need a working data set where the indexes might fit in 32g of ram, but my actual data is MUCH larger... that's where durable stores win.. But maybe I'm not grokking the resource constraints of durable asami...

quoll 2021-03-20T20:33:22.095Z

Well, for now they’re only just starting to hit the durable store, so I think they’re OK. My next step is indexing entities, so they don’t get rebuilt when they’re retrieved. That’s a little tricky, because updates to triples can be updating entities, and I need to find those. But I think I can get most of the low-hanging fruit reasonably well. After that, I can start looking at the triples indexes again.

quoll 2021-03-20T20:33:47.095500Z

Fortunately, a lot of it can be built on the same infrastructure that the current indexes use

quoll 2021-03-20T20:34:02.095800Z

Also, I’d like to do another tree type, and not just AVL

quoll 2021-03-20T20:34:24.096300Z

AVL is perfect for the triples, but it’s not ideal for the data pool (i.e. strings, URIs, etc)

quoll 2021-03-20T20:35:57.097700Z

e.g. if you load the data twice, the second time is several times faster. That’s because the data pool already has all the strings and keywords, and they’re cached without having to traverse the tree
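That caching behaviour can be sketched as follows. This is a Python toy with hypothetical names, only meant to show the cache fast-path in front of the slow tree traversal, not Asami's actual data pool:

```python
class DataPool:
    """Sketch of a data pool: values (strings, keywords, URIs) are
    interned to numeric IDs. A hash-map cache sits in front of the
    slow durable tree, so a second load of the same data never
    traverses the tree at all."""

    def __init__(self):
        self.cache = {}        # value -> id: the fast path
        self.tree_lookups = 0  # counts slow-path hits, for illustration

    def _tree_find_or_insert(self, value):
        # stand-in for traversing the durable AVL tree
        self.tree_lookups += 1
        return len(self.cache)

    def intern(self, value):
        if value in self.cache:
            return self.cache[value]  # cached: no tree traversal
        vid = self._tree_find_or_insert(value)
        self.cache[value] = vid
        return vid


pool = DataPool()
for _ in range(2):  # "load the data twice"
    for v in ["ip.src", "ip.dst", "tcp.port"]:
        pool.intern(v)
assert pool.tree_lookups == 3  # the second pass was all cache hits
```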

quoll 2021-03-20T20:36:46.098100Z

Thinking about Hitchhiker trees or skiplists here

quoll 2021-03-20T20:38:33.098800Z

Meanwhile, people would like pull syntax, and I need to figure out storage in IndexedDB… it feels like a lot for one person

quoll 2021-03-20T20:38:55.099100Z

Oh… and Naga needs some love too 🙂

Craig Brozefsky 2021-03-20T20:40:33.099700Z

Seems like pretty classic problem of having a single abstraction interface for vastly different resources

Craig Brozefsky 2021-03-20T20:41:10.100400Z

mem vs. write to file...

quoll 2021-03-20T20:41:25.100600Z

That’s true

Craig Brozefsky 2021-03-20T20:41:46.101200Z

interesting to see the file on disk growing iteratively within a single transaction boundary too

quoll 2021-03-20T20:42:19.102100Z

But Naga was always supposed to talk to external graph DBs. I never thought of it being in-memory until you asked me to make it

Craig Brozefsky 2021-03-20T20:42:22.102200Z

not what I would have intuited for a mem-mapped file block

Craig Brozefsky 2021-03-20T20:42:42.102600Z

it's a marvelous little problem domain

quoll 2021-03-20T20:44:59.105600Z

It depends on the index. If it’s the data pool (data.bin and idx.bin) then the data.bin file is just appended to with standard .write operations. When you read from it, it checks if the offset is within a currently mapped region of the file. If so, then you get it. If not, then it will extend the mapping, then you get it
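A rough sketch of that read path, using Python's `mmap` with hypothetical names (Asami's durable code is Clojure/JVM; this only illustrates the append-then-extend-mapping pattern described):

```python
import mmap
import os
import tempfile

class AppendOnlyFile:
    """Appends use ordinary writes; reads go through a memory mapping
    that is extended on demand when an offset falls outside the
    currently mapped region."""

    def __init__(self, path):
        self.f = open(path, "a+b")
        self.mapped = 0   # bytes covered by the current mapping
        self.mm = None

    def append(self, data: bytes) -> int:
        offset = self.f.seek(0, os.SEEK_END)  # standard .write path
        self.f.write(data)
        self.f.flush()
        return offset

    def read(self, offset: int, length: int) -> bytes:
        end = offset + length
        if end > self.mapped:          # outside the mapped region:
            if self.mm:
                self.mm.close()
            size = os.path.getsize(self.f.name)
            self.mm = mmap.mmap(self.f.fileno(), size,
                                access=mmap.ACCESS_READ)
            self.mapped = size         # ...so extend the mapping
        return self.mm[offset:end]


path = os.path.join(tempfile.mkdtemp(), "data.bin")
af = AppendOnlyFile(path)
off = af.append(b"hello")
assert af.read(off, 5) == b"hello"
```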

Craig Brozefsky 2021-03-20T20:45:00.105700Z

I would guess that in-mem DB with fast snapshot dump/load is still what will support CTR/SX the best

Craig Brozefsky 2021-03-20T20:45:40.106400Z

outsider guess of course

quoll 2021-03-20T20:45:44.106500Z

I’m thinking so for now

Craig Brozefsky 2021-03-20T20:46:22.107300Z

already durable speed is sufficient for those data sets, and for even persisting the entirety of public CTIA I would guess

Craig Brozefsky 2021-03-20T20:46:35.108Z

so it's pretty solidly hitting all your paying use cases 8^)

Craig Brozefsky 2021-03-20T20:46:39.108300Z

in alpha6....

quoll 2021-03-20T20:46:41.108400Z

Right now, the only pain point is the entity rebuilding. So next on the list is indexing those directly

Craig Brozefsky 2021-03-20T20:47:53.110100Z

yah, well at some point, you have to recognize entity rebuilding is providing an entity abstraction on top of a more efficient query engine -- aka, maybe users should be making queries they need directly, instead of snarfling whole entities

Craig Brozefsky 2021-03-20T20:48:03.110400Z

entity abstraction on top of...

quoll 2021-03-20T20:48:25.111Z

in memory is easy. The durable version needs me to serialize. I’ve ummed and ahh-ed over using transit, but because I already have most of the serialization I need, I’m going to extend my code a little more. The space overhead is similar, and mine is better in some cases

Craig Brozefsky 2021-03-20T20:48:30.111100Z

Maybe consider something like the ES approach

Craig Brozefsky 2021-03-20T20:48:48.111500Z

that entity cache... you already are considering it, nvm

quoll 2021-03-20T20:48:58.111700Z

heh. Yes. That

quoll 2021-03-20T20:49:39.112500Z

It was built inside the engine, as a layer over the DB. I’m talking about shifting it into the DB. It also makes other features possible then

Craig Brozefsky 2021-03-20T20:50:17.113Z

yah, it's where you pay the cost of triple store

Craig Brozefsky 2021-03-20T20:50:20.113200Z

vs ES document model

Craig Brozefsky 2021-03-20T20:50:33.113500Z

in that I can add/modify/delete from an entity

Craig Brozefsky 2021-03-20T20:50:38.113700Z

via manipulating triples

Craig Brozefsky 2021-03-20T20:50:59.114400Z

can't do that in ES -- only document level operations, so "entity level"

Craig Brozefsky 2021-03-20T20:51:10.114800Z

but gotta have that for naga...

quoll 2021-03-20T20:51:10.114900Z

yes. But if I can see the entity those triples are connected to, then I can update the entity as well

Craig Brozefsky 2021-03-20T20:51:14.115100Z

I think it's worth the cost

Craig Brozefsky 2021-03-20T20:51:20.115300Z

yah

Craig Brozefsky 2021-03-20T20:51:25.115500Z

or just invalidate...

Craig Brozefsky 2021-03-20T20:51:34.115800Z

don't pay the cost until read

Craig Brozefsky 2021-03-20T20:52:23.117Z

hmmm, subentity ids containing some pointer to the parent entity?

Craig Brozefsky 2021-03-20T20:52:36.117500Z

so :tg-12314-121 ...

quoll 2021-03-20T20:53:22.118Z

I’m better with the upfront cost of writing. That’s because writing always happens in another thread anyway, and it’s usually queries that we want to make fast

Craig Brozefsky 2021-03-20T20:53:32.118200Z

hehehe

quoll 2021-03-20T20:53:34.118300Z

No, but can

Craig Brozefsky 2021-03-20T20:53:50.118800Z

so yes, in your current use case, or any use case where your data set is... constrained to fit in mem

Craig Brozefsky 2021-03-20T20:54:34.119300Z

So, you mention merged graphs.. tell me more about multigraph

quoll 2021-03-20T20:54:35.119400Z

Also, it’s only in-memory graphs that use keywords. On-disk graphs use Internal Nodes (serialized as: #a/n "1234")

quoll 2021-03-20T20:55:18.120200Z

Multigraph isn’t merged. That’s where you have multiple edges between nodes. It gives you a weight
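As a toy illustration (Python, not Asami's API): a multigraph allows parallel edges between the same pair of nodes, and collapsing those parallel edges is what yields a weight per node pair:

```python
from collections import Counter

# Parallel edges between the same pair of nodes...
edges = [("a", "b"), ("a", "b"), ("a", "c")]

# ...collapse into a weight per node pair.
weights = Counter(edges)
assert weights[("a", "b")] == 2  # two parallel edges -> weight 2
```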

Craig Brozefsky 2021-03-20T20:55:29.120600Z

ok, I was thinking of merged DBs...

Craig Brozefsky 2021-03-20T20:55:38.121Z

give me two graphs DBs, make them behave as one...

quoll 2021-03-20T20:55:44.121200Z

OK… now THAT is coming soon

quoll 2021-03-20T20:55:56.121700Z

I need it so I can do as

Craig Brozefsky 2021-03-20T20:56:07.122200Z

that opens up a LOT of use cases to manage write constraints ...

Craig Brozefsky 2021-03-20T20:56:31.122900Z

especially if it makes "n" graphs behave as one on query/update ...

quoll 2021-03-20T20:57:02.123600Z

Have an initial graph (either in memory or on disk), and then you do speculative transactions against it. This becomes a pair of graphs. The fixed one, and an in-memory one. Queries against this will be returning concatenations of the results of both
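That pair-of-graphs query can be sketched in a few lines of Python (hypothetical names, not Asami code): querying the pair is just the concatenation of each member graph's results.

```python
def matches(pattern, triple):
    """Match a triple against an (s, p, o) pattern; None is a wildcard."""
    return all(p is None or p == v for p, v in zip(pattern, triple))

def q(pattern, *graphs):
    # Querying n graphs as one: concatenate the per-graph results.
    return [t for g in graphs for t in g if matches(pattern, t)]

fixed = [("a", "name", "alice")]       # the fixed graph (memory or disk)
speculative = [("b", "name", "bob")]   # in-memory speculative transaction

assert q((None, "name", None), fixed, speculative) == \
       [("a", "name", "alice"), ("b", "name", "bob")]
```

(A full implementation would also deduplicate and handle retractions; this only shows the concatenation step.)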

Craig Brozefsky 2021-03-20T20:57:14.123800Z

yah

Craig Brozefsky 2021-03-20T20:57:24.124300Z

a graphdb variant of WALs

quoll 2021-03-20T20:57:31.124800Z

Datomic has this, and people have asked for it.

Craig Brozefsky 2021-03-20T20:57:36.124900Z

which is where you end up going to get write speeds approaching RDBMSs

Craig Brozefsky 2021-03-20T20:58:05.125500Z

yah, and the power of having a single query doing joins across those graphs..

Craig Brozefsky 2021-03-20T20:58:47.126900Z

means easier to build inference/queries/logic/processes across a larger set of knowledge (global threat context vs. my org scope)

quoll 2021-03-20T20:58:48.127Z

Yes, you can do everything in memory, and then when you’re done you send a replay off to the index. Queries get done against the pair, until the index has finished its transaction, and then you swap over to the index again.

quoll 2021-03-20T20:59:14.127300Z

This is actually how Naga operates against Datomic

quoll 2021-03-20T21:00:27.128700Z

Datomic’s only notion of “transaction” is grouping write operations. But Naga needs a transaction to be writes and reads interspersed, with the reads returning data that includes the results of the writes.

Craig Brozefsky 2021-03-20T21:01:11.129800Z

ah

quoll 2021-03-20T21:01:34.130300Z

I managed it by using an as database for executing against, and accumulating write operations until the end of the rules. At that point, I replay all of the writes into the main DB.

Craig Brozefsky 2021-03-20T21:01:44.130500Z

as ?

quoll 2021-03-20T21:02:37.131Z

Sorry… that was a mistake. I meant with

quoll 2021-03-20T21:02:49.131300Z

as is the operation of going back in time
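The accumulate-and-replay pattern described a few messages up can be sketched like this (Python, hypothetical names; Naga itself does this with Datomic's speculative `with` database):

```python
class SpeculativeSession:
    """Reads during rule execution see a speculative view that includes
    pending writes; the accumulated writes are replayed into the main
    store in one transaction at the end of the rules."""

    def __init__(self, main_db):
        self.main_db = main_db  # committed triples
        self.pending = []       # writes accumulated during the rules

    def write(self, triple):
        self.pending.append(triple)  # accumulated, not committed yet

    def read(self, pattern):
        # A 'with'-style view: committed data plus pending writes.
        view = list(self.main_db) + self.pending
        return [t for t in view
                if all(p is None or p == v for p, v in zip(pattern, t))]

    def commit(self):
        # Replay all accumulated writes into the main DB at once.
        self.main_db.extend(self.pending)
        self.pending.clear()


db = [("a", "type", "ip")]
s = SpeculativeSession(db)
s.write(("b", "type", "ip"))
assert len(s.read((None, "type", "ip"))) == 2  # the read sees the write
assert len(db) == 1                            # main DB untouched so far
s.commit()
assert len(db) == 2
```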

Craig Brozefsky 2021-03-20T21:02:50.131400Z

Stepping back for a moment

Craig Brozefsky 2021-03-20T21:03:34.131900Z

I feel like the need to scale and get "coverage" in data sets

Craig Brozefsky 2021-03-20T21:03:42.132100Z

has been over-prioritized

Craig Brozefsky 2021-03-20T21:04:05.132700Z

at the expense of making the tools that experts and analysts can use to make richer explorations and inferences about data

Craig Brozefsky 2021-03-20T21:04:37.133400Z

aka, security is too focused on completeness, breadth, or scaling (mongodb is cloud scale... splunk is cloud scale...)

Craig Brozefsky 2021-03-20T21:05:57.134500Z

I feel like there is a need for a way to get a subset of your data, a working set you build from querying all those sources (the mythology of your SIEM being that one source of data is a myth...)

Craig Brozefsky 2021-03-20T21:06:20.134900Z

and layer increasingly more sophisticated abstraction on top of that

Craig Brozefsky 2021-03-20T21:07:20.135700Z

that you can't do with sql, or splunks query language, or ES, or these pipe based "event" query systems

quoll 2021-03-20T21:09:10.137300Z

One thing I’d like to do is provide some graph algorithms to all of this. That’s one reason I integrated with Loom. It’s also why I support transitive attributes, and subgraph identification. I feel like we can use some graph algorithms to get some data that we’re not looking for yet

Craig Brozefsky 2021-03-20T21:09:22.137500Z

YES

quoll 2021-03-20T21:10:01.138100Z

So… I’ve made a start on this. But you’ve already heard lots of other “priorities” that I’m working on at the same time 🤣

quoll 2021-03-20T21:10:29.138800Z

However, the other day, Jesse made the suggestion of looking for someone to help me. That would help a lot

Craig Brozefsky 2021-03-20T21:10:43.139100Z

kewl!

Craig Brozefsky 2021-03-20T21:11:07.139700Z

So that whole problem of breadth vs richness of data representation

Craig Brozefsky 2021-03-20T21:11:17.140200Z

is what I'm thinking about -- it's what drove the SX/CTR architecture too

quoll 2021-03-20T21:11:39.140900Z

Also, I got my first significant external PR the other day. I hope I get to see more 🙂

Craig Brozefsky 2021-03-20T21:11:44.141100Z

we're never going to get all the data in one place, so let's make a protocol for asking all the different data sources to give us relevant subsets of their data to put into a richer representation/tool

Craig Brozefsky 2021-03-20T21:11:56.141400Z

you should consider how to handle copyright assignment BTW

quoll 2021-03-20T21:12:05.141500Z

This requires that the project keep moving, meeting people’s needs, and appearing responsive

quoll 2021-03-20T21:12:52.142700Z

Oh yeah… that’s a very good point. We ran into that with Mulgara when the previous system was purchased

Craig Brozefsky 2021-03-20T21:15:50.143300Z

n-graph combination is gold

Craig Brozefsky 2021-03-20T21:16:36.143800Z

even if it's something that requires all mods to go to a single new graph...

Craig Brozefsky 2021-03-20T21:17:01.144100Z

hmmm

Craig Brozefsky 2021-03-20T21:17:35.144400Z

I imagine there must be some research on this already

quoll 2021-03-20T21:18:25.144700Z

I imagine you’re right 🙂

Craig Brozefsky 2021-03-20T21:19:53.145600Z

just thinking that my intuitions could certainly use some correction via a proper survey, vocabulary and theory of distributed graph stores and their interaction with rule engines and the assumptions naga makes around open world/negation/aggregation etc

Craig Brozefsky 2021-03-20T21:19:57.145800Z

8^)

Craig Brozefsky 2021-03-20T21:21:24.146200Z

not even distributed.. layered? composed?

Craig Brozefsky 2021-03-20T21:22:13.146900Z

scoping entity ids, namespacing relations, and handling negation/modification in appropriate store...

quoll 2021-03-20T21:36:46.148Z

Question for you… Do you see value in allowing entities with string keys? (i.e. like the ones you tried to load yesterday)

quoll 2021-03-20T21:40:00.151400Z

I’ve been looking at it, and I recall now. It’s because Chris had graphs where his displayable edges were strings. I allowed these in as attributes, but filtered them out of entities, since entities were supposed to be constructed with keywords. It let him keep his string edges, and gave us an easy way to tell the difference between the UI things that Chris was doing, and properties for the entities that he was connecting. … but … that’s not happening anymore (I believe). So I can let entities use non-keywords as keys now if I want. That would let your data from yesterday show up as entities

quoll 2021-03-20T21:40:25.151800Z

It already loads just fine. It’s just using the entity function to retrieve the nested object

Craig Brozefsky 2021-03-20T21:41:35.152Z

well, strings are better IMO

Craig Brozefsky 2021-03-20T21:41:45.152300Z

keywords means we are restricted to EDN really

Craig Brozefsky 2021-03-20T21:42:04.152800Z

strings means, it's more generic, makes no assumption about EDN/clojure etc...

Craig Brozefsky 2021-03-20T21:42:25.153200Z

so yah, I think it SHOULD support strings as keys

Craig Brozefsky 2021-03-20T21:44:28.153500Z

for entities..

quoll 2021-03-20T21:44:37.153800Z

OK. It does

Craig Brozefsky 2021-03-20T21:44:54.154200Z

in alpha6?

quoll 2021-03-20T21:45:04.154400Z

This is my repl right now:

user=> (d/entity d n)
{"layers" {"ip" {"ip.checksum.status" "2", "ip.dst_host" "152.195.33.40", "ip.host" "152.195.33.40", "ip.dsfield" "0x00000000", "ip.version" "4", "ip.len" "40", "ip.src" "192.168.1.122", "ip.addr" "152.195.33.40", "ip.frag_offset" "0", "ip.dsfield_tree" {"ip.dsfield.dscp" "0", "ip.dsfield.ecn" "0"}, "ip.ttl" "64", "ip.checksum" "0x0000bec2", "ip.id" "0x00000000", "ip.proto" "6", "ip.flags_tree" {"ip.flags.rb" "0", "ip.flags.df" "1", "ip.flags.mf" "0"}, "ip.hdr_len" "20", "ip.dst" "152.195.33.40", "ip.src_host" "192.168.1.122", "ip.flags" "0x00000040"}, "eth" {"eth.dst" "22:4e:7f:74:55:8d", "eth.src" "46:eb:d7:d5:2b:c8", "eth.dst_tree" {"eth.dst.oui" "2248319", "eth.addr" "22:4e:7f:74:55:8d", "eth.dst_resolved" "22:4e:7f:74:55:8d", "eth.dst.ig" "0", "eth.ig" "0", "eth.lg" "1", "eth.addr_resolved" "22:4e:7f:74:55:8d", "eth.dst.lg" "1", "eth.addr.oui" "2248319"}, "eth.src_tree" {"eth.addr" "46:eb:d7:d5:2b:c8", "eth.ig" "0", "eth.lg" "1", "eth.src.oui" "4647895", "eth.addr_resolved" "46:eb:d7:d5:2b:c8", "eth.src.lg" "1", "eth.src.ig" "0", "eth.addr.oui" "4647895", "eth.src_resolved" "46:eb:d7:d5:2b:c8"}, "eth.type" "0x00000800"}, "tcp" {"tcp.srcport" "57836", "tcp.seq" "314", "tcp.window_size" "65535", "tcp.dstport" "443", "tcp.urgent_pointer" "0", "tcp.nxtseq" "314", "tcp.ack_raw" "2807365467", "tcp.stream" "42", "tcp.hdr_len" "20", "tcp.seq_raw" "3486482781", "tcp.checksum" "0x00005841", "tcp.port" "443", "tcp.ack" "24941", "Timestamps" {"tcp.time_relative" "0.112280000", "tcp.time_delta" "0.000135000"}, "tcp.window_size_scalefactor" "-1", "tcp.checksum.status" "2", "tcp.flags" "0x00000010", "tcp.window_size_value" "65535", "tcp.len" "0", "tcp.flags_tree" {"tcp.flags.ecn" "0", "tcp.flags.res" "0", "tcp.flags.cwr" "0", "tcp.flags.syn" "0", "tcp.flags.urg" "0", "tcp.flags.fin" "0", "tcp.flags.push" "0", "tcp.flags.str" "·······A····", "tcp.flags.reset" "0", "tcp.flags.ns" "0", "tcp.flags.ack" "1"}, "tcp.analysis" {"tcp.analysis.acks_frame" "2181", 
"tcp.analysis.ack_rtt" "0.023370000"}}, "frame" {"frame.protocols" "eth:ethertype:ip:tcp", "frame.cap_len" "54", "frame.marked" "0", "frame.offset_shift" "0.000000000", "frame.time_delta_displayed" "0.000068000", "frame.time_relative" "9.977223000", "frame.time_delta" "0.000068000", "frame.time_epoch" "1612222675.179957000", "frame.time" "Feb  1, 2021 18:37:55.179957000 EST", "frame.encap_type" "1", "frame.len" "54", "frame.number" "2273", "frame.ignored" "0"}}}

quoll 2021-03-20T21:45:12.154700Z

no. On my localhost

quoll 2021-03-20T21:45:19.155200Z

I can make it alpha7 if you want 🙂

quoll 2021-03-20T21:45:37.155700Z

Check the thread above

Craig Brozefsky 2021-03-20T21:45:45.155900Z

I would say we should not assume keywords anywhere eh?

Craig Brozefsky 2021-03-20T21:45:51.156100Z

maybe I'm wrong

Craig Brozefsky 2021-03-20T21:45:59.156500Z

I need to think more 8^)

quoll 2021-03-20T21:46:19.156900Z

I’ve been removing that assumption in general. You’ll note that keywords are no longer required in the “entity” position of a triple

quoll 2021-03-20T21:46:41.157300Z

and strings were ALWAYS allowed as attributes

Craig Brozefsky 2021-03-20T21:46:43.157500Z

excellent

quoll 2021-03-20T21:46:49.157700Z

they were just filtered out of entities

Craig Brozefsky 2021-03-20T21:59:29.158Z

alpha7 it is!

quoll 2021-03-20T22:01:21.158200Z

Try it

quoll 2021-03-20T22:02:08.158500Z

It should work on an existing store, so just connect to it

quoll 2021-03-20T22:04:04.159100Z

I need to update my project files to do a full release for me. I’m doing a lot manually right now