Asami alpha 6 is out now.
Changes are:
• Inserting entities that have no :db/id will no longer report their own ID in the :tempids from the transaction
• No longer dependent on Cheshire, which means no more transitive dependency on Jackson
• Fixed reflection warnings in the durable code, for about a 30% speedup (sketch below)
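For anyone curious, “fixing reflections” mostly means turning on the compiler’s reflection warning and adding type hints where Clojure can’t infer the Java type. A tiny illustrative sketch, not the actual Asami code:

(set! *warn-on-reflection* true)

;; without a hint, this method call is resolved by reflection at runtime
(defn buffer-size-slow [b]
  (.capacity b))

;; hinting the argument lets the compiler emit a direct method call
(defn buffer-size [^java.nio.ByteBuffer b]
  (.capacity b))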
I’m doing training next week, so there won’t be any Asami development until the week after
Just did the same load test with alpha 6 and can confirm that estimate on the speedup
I know I have work to do to improve load speed, but at the same time there’s a tradeoff between use cases:
• Regular updates and modifications to data (current design)
• Load once and analyze
The current design is specifically to allow regular updates without great expense, while also trying to keep querying fast
I have another design which is optimized for fast loading. But it can’t update anything in place. Instead, it would manage updates as another pair of graphs (additions and retractions), which get dynamically merged in with the main graph during queries. Then in the background, those changes would be merged into a single index again. This makes updates possible, but it’s not going to do lots of modifications quickly.
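Something like this, purely as a sketch of the idea (none of these names are real Asami internals; each “graph” here is just a set of [e a v] triples, and :? stands for a wildcard):

(defn resolve-triples
  "Naive pattern matching over a set of [e a v] triples."
  [graph [pe pa pv]]
  (filter (fn [[e a v]]
            (and (or (= pe :?) (= pe e))
                 (or (= pa :?) (= pa a))
                 (or (= pv :?) (= pv v))))
          graph))

(defn resolve-pattern
  "Combine results from the main graph and the additions graph,
   then drop anything that appears in the retractions graph."
  [{:keys [main additions retractions]} pattern]
  (let [retracted (set (resolve-triples retractions pattern))]
    (->> (concat (resolve-triples main pattern)
                 (resolve-triples additions pattern))
         (remove retracted)
         distinct)))

(resolve-pattern {:main        #{[:a :name "x"]}
                  :additions   #{[:b :name "y"]}
                  :retractions #{[:a :name "x"]}}
                 [:? :name :?])
;; => ([:b :name "y"])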
The thing I just implemented is actually a hybrid between the old Mulgara and this write-once design. It mostly follows what Mulgara did, but with a few exceptions.
I knew that Clojure was going to bring some overhead, but I’m still a bit disappointed in the speed. Hoping I can improve it more
Well, from what I know of the problem space Cisco is dealing with..
My hunch would be fast load, minimal modification relative to loaded data set...
aka, investigation/analysis vs progressive development of a stable and persistent knowledge set
It depends on who’s using it. Right now my team is modifying things a LOT
enrichment loads?
But I agree. I think the real value of these systems is in providing a new view for analysis
or the synthesis of higher order models like "target" etc...
both
Ok, cause to me, neither of those is modification
that's addition
but apparently those are modifications in terms of what you were talking about above
I think "update" when I here modification, not addiing some triples
They’ve recently started understanding what they get out of it, and they’re starting to use it for everything. Apparently it’s both easier for them, and the resulting code is faster than what they were doing before (which I’m surprised at, amused by, and grateful to learn)
well, at some point, munging and mapping/zipping around in javascript data objects is just tedious 8^)
They have lots of entities where they want to change state of values, not simply add new ones
yah, so those "synthesized" entities, like Target
or node for the graph...
well, to put this in perspective. I wager my test was way more data than they are expecting to deal with
and they are primarily using the in-mem store
My intuition is that it's really the update speed on the durable storage that is the bottleneck
err, I mean, "adding triples"
but I wonder how much durable storage use cases and in-mem use cases overlap
I mean, I think Mario's work is doing snapshots...
durable storage is more about having a working set that is significantly larger than expected available RAM
it's not so much about persistence -- as the current snapshot dump/load mechanism for an in-mem db is sufficient for those use cases
if I can shrink my working set down to something that can fit in like 32g of ram...
current asami is gold
it's when I need a working data set where the indexes might fit in 32g of ram, but my actual data is MUCH larger... that durable stores win.. But maybe I'm not grokking the resource constraints of durable asami...
Well, for now they’re only just starting to hit the durable store, so I think they’re OK. My next step is indexing entities, so they don’t get rebuilt when they’re retrieved. That’s a little tricky, because updates to triples can be updating entities, and I need to find those. But I think I can get most of the low-hanging fruit reasonably well. After that, I can start looking at the triples indexes again.
Fortunately, a lot of it can be built on the same infrastructure that the current indexes use
Also, I’d like to do another tree type, and not just AVL
AVL is perfect for the triples, but it’s not ideal for the data pool (i.e. strings, URIs, etc)
e.g. if you load the data twice, the second time is several times faster. That’s because the data pool already has all the strings and keywords, and they’re cached without having to traverse the tree
Thinking about Hitchhiker trees or skiplists here
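To illustrate why that second load is so much faster, here’s a toy sketch of a pool with a cache in front of the tree (not the real data pool code; the “tree” is just an in-memory map standing in for the durable AVL index):

(defn new-pool []
  {:cache (atom {})   ;; value -> id, in-memory only
   :tree  (atom {})   ;; stand-in for the durable AVL index
   :next  (atom 0)})

(defn pool-id!
  "Return the id for a value, consulting the cache before the tree."
  [{:keys [cache tree next]} value]
  (or (get @cache value)                       ;; fast path: seen before
      (let [id (or (get @tree value)           ;; slow path: search the tree
                   (let [id (swap! next inc)]  ;; missing: allocate and insert
                     (swap! tree assoc value id)
                     id))]
        (swap! cache assoc value id)           ;; remember for next time
        id)))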
Meanwhile, people would like pull syntax, and I need to figure out storage in IndexedDB… it feels like a lot for one person
Oh… and Naga needs some love too 🙂
Seems like pretty classic problem of having a single abstraction interface for vastly different resources
mem vs. write to file...
That’s true
interesting to see the file on disk growing iteratively within a single transaction boundary too
But Naga was always supposed to talk to external graph DBs. I never thought of it being in-memory until you asked me to make it
not what I would have intuited for a mem-mapped file block
it's a marvelous little problem domain
It depends on the index. If it’s the data pool (data.bin and idx.bin) then the data.bin file is just appended to with standard .write operations. When you read from it, it checks if the offset is within a currently mapped region of the file. If so, then you get it. If not, then it will extend the mapping, then you get it
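Roughly this shape, as a sketch of that read path (not the actual code; the remapping policy here is simplified to “map the whole file again”):

(import '(java.io RandomAccessFile)
        '(java.nio.channels FileChannel FileChannel$MapMode))

;; hypothetical setup for a read-only view over data.bin:
;; (def ch    (.getChannel (RandomAccessFile. "data.bin" "r")))
;; (def state (atom {:channel ch
;;                   :mapped  (.map ch FileChannel$MapMode/READ_ONLY 0 (.size ch))}))

(defn read-at
  "Read len bytes at offset, extending the mapping first if the read
   falls outside the currently mapped region."
  [state offset len]
  (let [{:keys [channel mapped]} @state
        mapped (if (<= (+ offset len) (.capacity mapped))
                 mapped
                 (let [m (.map channel FileChannel$MapMode/READ_ONLY 0 (.size channel))]
                   (swap! state assoc :mapped m)
                   m))
        result (byte-array len)]
    (doto (.duplicate mapped)
      (.position (int offset))
      (.get result))
    result))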
I would guess that in-mem DB with fast snapshot dump/load is still what will support CTR/SX the best
outsider guess of course
I’m thinking so for now
already durable speed is sufficient for those data sets, and even for persisting the entirety of public CTIA I would guess
so it's pretty solidly hitting all your paying use cases 8^)
in alpha6....
Right now, the only pain point is the entity rebuilding. So next on the list is indexing those directly
yah, well at some point, you have to recognize entity rebuilding is providing an entity abstraction on top of a more efficient query engine -- aka, maybe users should be making queries they need directly, instead of snarfling whole entities
in memory is easy. The durable version needs me to serialize. I’ve ummed and ahh-ed over using transit, but because I already have most of the serialization I need, I’m going to extend my code a little more. The space overhead is similar, and mine is better in some cases
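For reference, the transit route I decided against would look roughly like this (cognitect.transit with the :json encoding; just a sketch of the alternative, not what Asami actually does):

(require '[cognitect.transit :as transit])
(import '(java.io ByteArrayInputStream ByteArrayOutputStream))

(defn entity->bytes
  "Serialize an entity map to transit-encoded JSON bytes."
  [entity]
  (let [out (ByteArrayOutputStream.)]
    (transit/write (transit/writer out :json) entity)
    (.toByteArray out)))

(defn bytes->entity
  "Read an entity map back from transit-encoded JSON bytes."
  [^bytes bs]
  (transit/read (transit/reader (ByteArrayInputStream. bs) :json)))

;; (bytes->entity (entity->bytes {:id 1 "name" "example"}))
;; => {:id 1, "name" "example"}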
Maybe consider something like the ES approach
that entity cache... you already are considering it, nvm
heh. Yes. That
It was built inside the engine, as a layer over the DB. I’m talking about shifting it into the DB. It also makes other features possible then
yah, it's where you pay the cost of triple store
vs ES document model
in that I can add/modify/delete from an entity
via manipulating triples
can't do that in ES -- only document level operations, so "entity level"
but gotta have that for naga...
yes. But if I can see the entity those triples are connected to, then I can update the entity as well
I think it's worth the cost
yah
or just invalidate...
don't pay the cost until read
hmmm, subentity ids containing some pointer to the parent entity?
so :tg-12314-121 ...
I’m better with the upfront cost of writing. That’s because writing always happens in another thread anyway, and it’s usually queries that we want to make fast
hehehe
No, but can
so yes, in your current use case, or any use case where your data set is... constrained to fit in mem
So, you mention merged graphs.. tell me more about multigraph
Also, it’s only in-memory graphs that use keywords. On-disk graphs use Internal Nodes (serialized as: #a/n "1234")
Multigraph isn’t merged. That’s where you have multiple edges between nodes. It gives you a weight
ok, I was thinking of merged DBs...
give me two graphs DBs, make them behave as one...
OK… now THAT is coming soon
I need it so I can do as-of
that opens up A LOT of use cases to manage write constraints ...
especially if it's making "n" graphs behave as one on query/update ...
Have an initial graph (either in memory or on disk), and then you do speculative transactions against it. This becomes a pair of graphs. The fixed one, and an in-memory one. Queries against this will be returning concatenations of the results of both
yah
a graphdb variant of WALs
Datomic has this, and people have asked for it.
which is where you end up going to get write speeds approaching RDBMSs
yah, and the power of having a single query doing joins across those graphs..
means it's easier to build inference/queries/logic/processes across a larger set of knowledge (global threat context vs. my org scope)
Yes, you can do everything in memory, and then when you’re done you send a replay off to the index. Queries get done against the pair, until the index has finished its transaction, and then you swap over to the index again.
This is actually how Naga operates against Datomic
Datomic’s only notion of “transaction” is grouping write operations. But Naga needs a transaction to be writes and reads interspersed, with the reads returning data that includes the results of the writes.
ah
I managed it by using an as-of database for executing against, and accumulating write operations until the end of the rules. At that point, I replay all of the writes into the main DB.
as-of ?
Sorry… that was a mistake. I meant with
as-of is the operation of going back in time
Stepping back for a moment
https://docs.datomic.com/on-prem/clojure/index.html#datomic.api/with
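The shape of it against Datomic is roughly this (a sketch rather than the actual Naga code; each “step” here is a function from a db value to tx-data, and tempid handling is glossed over):

(require '[datomic.api :as d])

(defn run-rules
  "Run each step against a speculative db built with d/with, so later
   reads see earlier writes, then transact everything at the end."
  [conn steps]
  (let [{:keys [acc]}
        (reduce (fn [{:keys [db acc]} step]
                  (let [tx-data     (step db)
                        speculative (d/with db tx-data)]
                    {:db  (:db-after speculative)
                     :acc (into acc tx-data)}))
                {:db (d/db conn) :acc []}
                steps)]
    ;; replay all of the accumulated writes into the main DB
    @(d/transact conn acc)))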
I feel like the need to scale and get "coverage" in data sets has been over-prioritized
at the expense of making the tools that experts and analysts can use to make richer explorations and inferences about data
aka, security is too focused on completeness, breadth, or scaling (mongodb is cloud scale... splunk is cloud scale...)
I feel like there is a need for a way to get a subset of your data, a working set you build from querying all those sources (the mythology of your SIEM being that one source of data is just that, a myth...)
and layer increasingly sophisticated abstractions on top of that
that you can't do with sql, or splunk's query language, or ES, or these pipe-based "event" query systems
One thing I’d like to do is provide some graph algorithms to all of this. That’s one reason I integrated with Loom. It’s also why I support transitive attributes, and subgraph identification. I feel like we can use some graph algorithms to get some data that we’re not looking for yet
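e.g. transitive attributes already let you ask reachability-style questions directly in a query. Off the top of my head, something like this (made-up data, and results come back unordered):

(require '[asami.core :as d])

;; a tiny hypothetical network: a -> b -> c
(def conn (d/connect "asami:mem://example"))
@(d/transact conn {:tx-data [[:db/add :node/a :connects-to :node/b]
                             [:db/add :node/b :connects-to :node/c]]})

;; the + suffix makes the attribute transitive: everything reachable from :node/a
(d/q '[:find ?reachable
       :where [:node/a :connects-to+ ?reachable]]
     (d/db conn))
;; => ([:node/b] [:node/c])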
YES
So… I’ve made a start on this. But you’ve already heard lots of other “priorities” that I’m working on at the same time :rolling_on_the_floor_laughing:
However, the other day, Jesse made the suggestion of looking for someone to help me. That would help a lot
kewl!
So that whole problem of breadth vs richness of data representation
is what I'm thinking about -- it's what drove the SX/CTR architecture too
Also, I got my first significant external PR the other day. I hope I get to see more 🙂
we're never going to get all the data in one place, so let's make a protocol for asking all the different data sources to give us relevant subsets of their data to put into a richer representation/tool
you should consider how to handle copyright assignment BTW
This requires that the project keep moving, meeting people’s needs, and appearing responsive
Oh yeah… that’s a very good point. We ran into that with Mulgara when the previous system was purchased
n-graph combination is gold
even if it's something that requires all mods to go to a single new graph...
hmmm
I imagine there must be some research on this already
I imagine you’re right 🙂
just thinking that my intuitions could certainly use some correction via a proper survey, vocabulary and theory of distributed graph stores and their interaction with rule engines and the assumptions naga makes around open world/negation/aggregation etc
8^)
not even distributed.. layered? composed?
scoping entity ids, namespacing relations, and handling negation/modification in the appropriate store...
Question for you… Do you see value in allowing entities with string keys? (i.e. like the ones you tried to load yesterday)
I’ve been looking at it, and I recall now. It’s because Chris had graphs where his displayable edges were strings. I allowed these in as attributes, but filtered them out of entities, since entities were supposed to be constructed with keywords. It let him keep his string edges, and gave us an easy way to tell the difference between the UI things that Chris was doing, and properties for the entities that he was connecting. … but … that’s not happening anymore (I believe). So I can let entities use non-keywords as keys now if I want. That would let your data from yesterday show up as entities
It already loads just fine. It’s just using the entity function to retrieve the nested object
well, strings are better IMO
keywords means we are restricted to EDN really
strings means, it's more generic, makes no assumption about EDN/clojure etc...
so yah, I think it SHOULD support strings as keys
for entities..
OK. It does
in alpha6?
This is my repl right now:
user=> (d/entity d n)
{"layers" {"ip" {"ip.checksum.status" "2", "ip.dst_host" "152.195.33.40", "ip.host" "152.195.33.40", "ip.dsfield" "0x00000000", "ip.version" "4", "ip.len" "40", "ip.src" "192.168.1.122", "ip.addr" "152.195.33.40", "ip.frag_offset" "0", "ip.dsfield_tree" {"ip.dsfield.dscp" "0", "ip.dsfield.ecn" "0"}, "ip.ttl" "64", "ip.checksum" "0x0000bec2", "ip.id" "0x00000000", "ip.proto" "6", "ip.flags_tree" {"ip.flags.rb" "0", "ip.flags.df" "1", "ip.flags.mf" "0"}, "ip.hdr_len" "20", "ip.dst" "152.195.33.40", "ip.src_host" "192.168.1.122", "ip.flags" "0x00000040"}, "eth" {"eth.dst" "22:4e:7f:74:55:8d", "eth.src" "46:eb:d7:d5:2b:c8", "eth.dst_tree" {"eth.dst.oui" "2248319", "eth.addr" "22:4e:7f:74:55:8d", "eth.dst_resolved" "22:4e:7f:74:55:8d", "eth.dst.ig" "0", "eth.ig" "0", "eth.lg" "1", "eth.addr_resolved" "22:4e:7f:74:55:8d", "eth.dst.lg" "1", "eth.addr.oui" "2248319"}, "eth.src_tree" {"eth.addr" "46:eb:d7:d5:2b:c8", "eth.ig" "0", "eth.lg" "1", "eth.src.oui" "4647895", "eth.addr_resolved" "46:eb:d7:d5:2b:c8", "eth.src.lg" "1", "eth.src.ig" "0", "eth.addr.oui" "4647895", "eth.src_resolved" "46:eb:d7:d5:2b:c8"}, "eth.type" "0x00000800"}, "tcp" {"tcp.srcport" "57836", "tcp.seq" "314", "tcp.window_size" "65535", "tcp.dstport" "443", "tcp.urgent_pointer" "0", "tcp.nxtseq" "314", "tcp.ack_raw" "2807365467", "tcp.stream" "42", "tcp.hdr_len" "20", "tcp.seq_raw" "3486482781", "tcp.checksum" "0x00005841", "tcp.port" "443", "tcp.ack" "24941", "Timestamps" {"tcp.time_relative" "0.112280000", "tcp.time_delta" "0.000135000"}, "tcp.window_size_scalefactor" "-1", "tcp.checksum.status" "2", "tcp.flags" "0x00000010", "tcp.window_size_value" "65535", "tcp.len" "0", "tcp.flags_tree" {"tcp.flags.ecn" "0", "tcp.flags.res" "0", "tcp.flags.cwr" "0", "tcp.flags.syn" "0", "tcp.flags.urg" "0", "tcp.flags.fin" "0", "tcp.flags.push" "0", "tcp.flags.str" "·······A····", "tcp.flags.reset" "0", "tcp.flags.ns" "0", "tcp.flags.ack" "1"}, "tcp.analysis" {"tcp.analysis.acks_frame" "2181", "tcp.analysis.ack_rtt" "0.023370000"}}, "frame" {"frame.protocols" "eth:ethertype:ip:tcp", "frame.cap_len" "54", "frame.marked" "0", "frame.offset_shift" "0.000000000", "frame.time_delta_displayed" "0.000068000", "frame.time_relative" "9.977223000", "frame.time_delta" "0.000068000", "frame.time_epoch" "1612222675.179957000", "frame.time" "Feb 1, 2021 18:37:55.179957000 EST", "frame.encap_type" "1", "frame.len" "54", "frame.number" "2273", "frame.ignored" "0"}}}
no. On my localhost
I can make it alpha7 if you want 🙂
Check the thread above
I would say we should not assume keywords anywhere eh?
maybe I'm wrong
I need to think more 8^)
I’ve been removing that assumption in general. You’ll note that keywords are no longer required in the “entity” position of a triple
and strings were ALWAYS allowed as attributes
excellent
they were just filtered out of entities
alpha7 it is!
Try it
It should work on an existing store, so just connect to it
I need to update my project files to do a full release for me. I’m doing a lot manually right now