@joelkuiper: this is actually going to change - currently Grafter parses a lang string into a reified object - you can `str` it to get the string, but if you want the tag out you have to call `(.getLanguage (->sesame-rdf-type my-lang-string-obj))`
... This part is a bit broken - and we've had a ticket to fix it for a while... it's a pretty simple fix though...
The plan is to implement a `Literal` record type -- so basically a map like @jamesaoverton says -- but with some polymorphic benefits that ensure it can coerce to the Sesame (and maybe one day Jena) types properly... it'll have the string itself, the URI type, and if it's a lang string, a language keyword set e.g. `:en`/`:fr` (we use keywords for language tags already and it works well).
Right now you can build lang strings with `(s "hola" :es)`
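Something like this minimal sketch, assuming fields for the lexical form, datatype URI and language keyword (the field names and constructor shape are illustrative, not Grafter's actual API):

```clojure
;; Sketch: a Literal record carrying the lexical form, datatype URI,
;; and an optional language keyword (nil for plain/typed literals).
(defrecord Literal [string datatype-uri lang])

(defn s
  "Build a string literal; with a lang keyword, a language-tagged one,
  e.g. (s \"hola\" :es)."
  ([string]
   (->Literal string "http://www.w3.org/2001/XMLSchema#string" nil))
  ([string lang]
   (->Literal string "http://www.w3.org/1999/02/22-rdf-syntax-ns#langString" lang)))
```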
@joelkuiper: @jamesaoverton: just reading your discussion -- remember SPARQL 1.1 doesn't really have FULL support for quads... i.e. you can't CONSTRUCT a quad... the CONSTRUCT template only takes triple patterns, not GRAPH clauses... I personally think this is a real shame, as there are quad serialisation formats (e.g. TriG/TriX etc...). This might be why you can't just get quads from a model
@joelkuiper: just looking at your type coercion code -- Grafter also has both a Triple record and a Quad record... And I consider this a mistake... one we're going to undo
I think you really only want `(defrecord Quad [s p o g])` - then create a constructor function for triples which returns a Quad with a nil `:g`. Otherwise you'll get into a load of bother where `#Quad {:s 1 :p 1 :o 1 :g nil}` is `not=` to `#Triple {:s 1 :p 1 :o 1}` simply because they're different types.
Obviously it's easy enough to resolve but it's a small pain... Yes, there are still problems with this model where RDF semantics don't map directly onto Clojure value semantics. However we recently had a discussion on the Sesame developers mailing list where we convinced the core committer to change Sesame's policy to use value semantics when testing equality - rather than RDF-style equality, where a quad of `:s1 :p1 :o1 :g1` `.equals` a triple of `:s1 :p1 :o1`. This should be coming in a future release... Not sure what Jena's policy is here
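A minimal sketch of that single-record approach (constructor names are illustrative):

```clojure
;; One record for both shapes: a triple is just a Quad whose graph is nil.
(defrecord Quad [s p o g])

(defn quad [s p o g]
  (->Quad s p o g))

(defn triple
  "Construct a 'triple' as a Quad with a nil graph slot."
  [s p o]
  (->Quad s p o nil))

;; Value semantics now hold across both constructors:
(= (triple :s1 :p1 :o1) (quad :s1 :p1 :o1 nil)) ;=> true
```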
@jamesaoverton: early on in the first version of Grafter - because I initially wanted a terse syntax for expressing triple patterns - I also chose to represent URIs as strings. As URIs are the primary data type, in Grafter string literals have to be built with the `s` function. Again, this is something I'm going to change -- raw Java strings should probably not automatically coerce into RDF - or if they do, they should do so to RDF strings in the default language... any Java URI type you might reasonably use should probably be made to work.
Yeah, well… It’s something I’ve thought a lot about, and in the end I really like working with plain, literal EDN data everywhere I can. For what I do, IRIs are opaque, and I don’t need to get their protocol or query params. So I end up using strings for IRIs, keywords for CURIEs/QNames, and maps for Literals.
I work with EDN as long as possible, and only convert to other formats at the very end.
yes - records can add noise - but I think you can actually override `print-method` to print them shorter, e.g. you might be able to do `#URI "http://foo.com"`
Then you need to provide a reader function to cast the string to that type. I’m glad EDN has typed literals, but I haven’t found that they’re worth the hassle.
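A sketch of that pairing, with hypothetical names:

```clojure
;; A record wrapping a URI string.
(defrecord URI [uri])

;; Print the record as a short tagged literal, e.g. #URI "http://foo.com"
(defmethod print-method URI [^URI u ^java.io.Writer w]
  (.write w (str "#URI \"" (:uri u) "\"")))

;; A matching reader function (wired up via *data-readers* or a
;; data_readers.clj entry) so the tagged form reads back to the record.
(defn read-uri [s] (->URI s))
```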
yes I know - it definitely adds some friction
I think that Transit has a native URI type, which would be more convenient.
ooo interesting
what exactly do you use edn-ld for, @jamesaoverton?
The library itself is a recent refactoring of some patterns I’ve developed over the last three years. So I’ve only used that particular code in a few projects so far, but I’ve used its predecessors in a larger number of projects.
And although I’m allowed to share those other projects, I’ve never had the time to clean them up and put them on GitHub...
But this is an example of some of the stuff that I do: https://github.com/jamesaoverton/MRO
cool
The Clojure code takes a table from an SQL database that contains a very dense representation of MHC class restrictions, AKA some biology stuff.
The goal is to convert that table into an OWL ontology. The ontology has several branches, with specific relationships.
There’s an Excel spreadsheet that specifies templates for different branches at different levels.
Then I read the source table and the template table, and zip them together into a sequence of maps defining OWL classes.
Finally, I convert that EDN data into an RDF/XML file.
what makes it EDN, rather than just CLJ? :simple_smile:
There are really two parts. The first is ripping the source table into a number of branch-specific tables. Then I use ROBOT, a Java tool I wrote, to convert those tables to OWL.
It’s not the best example, but it’s on GitHub.
To answer your question: I’m pretty convinced by this “Data are better than functions are better than macros” thing that Clojure people talk about.
The MRO project doesn’t use the EDN-LD library because it’s for OWL and not just RDF. I haven’t figured out a general way to describe OWL in EDN, but I’ve been talking to Phil Lord about it.
yeah Phil and I have spoken in the past too
what in the MRO example is data?
that's not functions/macros, just general Clojure
The source table from SQL, and the Excel spreadsheet under src/mro
. Those are converted to all the branch-specific CSV files at the top level.
sorry, I meant where is the data - in the Data > Functions > Macros sense - presumably by that you meant that EDN-LD represents transformations as Clojure data? Not symbols/functions/macros
The previous version of the MRO code had a separate function for each level of each branch.
EDN-LD is mostly just conventions for representing RDF in EDN, and then some functions for working with those representations.
and now you have a map - essentially in place of a `cond`?
In the MRO example, there’s a sequence of maps representing templates, and a sequence of maps from the source table (SQL). Then the smarts are in the `apply-template` function, which applies each template to each row of the source table.
So there’s a smaller number of higher-level functions, in the end, and I find it easier to reason about.
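The real `apply-template` lives in the MRO repo; a hypothetical sketch of the shape being described, with invented sample data:

```clojure
;; Hypothetical template and source-row data.
(def templates
  [{:label :name, :subclass-of "MHC protein complex"}])

(def source-rows
  [{:name "HLA-A*02:01 protein complex"}])

(defn apply-template
  "Fill a template's placeholder keywords with values from a row map,
  yielding one map describing an OWL class."
  [template row]
  (into {} (map (fn [[k v]] [k (get row v v)]) template)))

(for [t templates, r source-rows]
  (apply-template t r))
;=> ({:label "HLA-A*02:01 protein complex", :subclass-of "MHC protein complex"})
```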
for what it's worth - your MRO code seems broadly similar to Grafter pipelines... in that you have a sequence of rows which you effectively process in row form... and then templatize. Is that fair?
oh sorry you just said that
I agree with that.
Grafter basically works the same
In the MRO case, the Clojure code is table-to-table, then ROBOT (my Java tool) is used for the table-to-OWL part.
At the end of the day, pretty much all the code I write is a pipeline. :^)
cool
same for a lot of the stuff we do
Some day I’ll publish a cleaner example :^)
that and tools around them
lol - same
You made a good point about Quad equality above. I’ll think more about that.
It was good talking, but I’ve got to go now.
Later!
cool
night
so as far as I’m aware there’s no real way to use SPARQL 1.1 to get Quads, but there might be in the future, so I’ll just leave it nil I guess.
As far as type/data coercion goes… well, I don’t really want to invent another class/type model for RDF. So I’ve chosen to represent results/triples as simple maps and records, with strings for URIs and JSON-LD-ish maps as best I can for the rest. If that’s not your cup of tea you can always just use the Jena objects 😉 and forget about the lazy-seq stuff 😛
If Commons RDF solves this problem I might consider implementing that, but for now it’s just too much of a mess to match the RDF semantics to Clojure, and the simplest thing I could think of was {:type "typeURI" :lang "@lang" :value "Jena coerced POJO"}
or a plain string for URIs. I may consider wrapping that in a `java.net.URI` though, bit unsure still
@joelkuiper: I'd be tempted to go with a record for Quads and Literals... it makes writing and extending coercions easier (admittedly you can use a multimethod for this too -- but you'll probably just end up dispatching on type anyway, and you can always use a multimethod on a record too if you want)... Also multimethod dispatch is quite a bit slower than record dispatch... and you'll probably end up dispatching on millions of quads when users come to process results
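A sketch of why records help, with a hypothetical coercion protocol (Grafter's actual protocol may differ):

```clojure
;; Protocol dispatch resolves via the record's class, which is
;; noticeably faster than multimethod dispatch when you're coercing
;; millions of quads.
(defprotocol ToBackendRDF
  (->backend-rdf [this] "Coerce this value to the backend's type."))

(defrecord Quad [s p o g]
  ToBackendRDF
  (->backend-rdf [q]
    ;; build the corresponding Sesame/Jena statement here (elided)
    q))

(defrecord Literal [string datatype-uri lang]
  ToBackendRDF
  (->backend-rdf [l]
    ;; build the corresponding backend literal here (elided)
    l))
```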
well, Triples 😛 since there’s no real way of getting Quads 😉
So a Literal of [type, value, lang] -> [String, Object, Keyword] or something?
type => String, value => String, lang => Keyword
why value as a string?
there might not be a way to query for a Quad -- but I think on the processing side it makes sense to have a quad -- because you can set the graph to non-nil yourself and serialise N-Quads etc. more easily
Jena has excellent support for coercing a lot of the XSD types into Java objects
ahh ok sorry - by Object you mean Integer/Float/Double/Date etc...
that’s a fair point
then yes I agree
yep :simple_smile:
definitely coerce the types out where you can
but where you can't you'll need to fall back to string
right, that’s what I do now
that's what we're doing with Grafter
yeah I saw that :simple_smile:
did you read the stuff I wrote here last about Triple/Quad equality etc?
yup, interesting stuff; I’ll probably change it to Quad for those reasons. Makes sense
It's definitely a trade-off -- but I think it's the better one
could also just use a map I guess
yes but it'll have the same issues -- i.e. (= {:s :s1 :p :p1 :o :o1 :g nil} {:s :s1 :p :p1 :o :o1}) => false
yeah, that’s true. it’s a silly problem 😛
it's not a big deal - it's just annoying -- and can cause hard-to-find bugs
it’s one of those things that would be easy enough to solve with a custom `equals` method though
yes but I think it's more pragmatic to retain value semantics
even in Java
I’ve gone back and forth on that topic in Java projects; either can create hard-to-find bugs, especially if done inconsistently across developers 😛
yeah it definitely depends on what you're doing
but I think programming with values is generally better
@joelkuiper: any reason to use `"@en"` strings rather than `:en` keywords for language tags? (I know obviously that SPARQL and various serialisations represent them that way...)
keywords share memory when you have lots of them
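For example, keywords are interned:

```clojure
;; Every :en in a million quads is the same interned object:
(identical? :en (keyword "en"))    ;=> true
;; Equal strings aren't guaranteed to share storage:
(identical? "@en" (String. "@en")) ;=> false
```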
no strong opinion, it’s closer to JSON-LD
which is nice
switched it to keywords 😉, probably the last I’ll work on it for the week at least!
cool
https://github.com/joelkuiper/yesparql/tree/feature/type-conversion#a-note-on-lazyness :simple_smile:
I want to think on it some more, but I agree that we should have:
(not= {:s :s1 :p :p1 :o :o1 :g nil} {:s :s1 :p :p1 :o :o1})
Rather than a custom `=` function, I’d like to see another function that explicitly calls out that it’s handling some kind of equivalence instead,
such as: (equiv {:s :s1 :p :p1 :o :o1 :g nil} {:s :s1 :p :p1 :o :o1})
I personally think it's better to have one type - even if it has a nil field a lot of the time - instead of two types for essentially the same thing
I think it's a good idea to have a custom equivalence function that implements RDF semantics, so the `not=` case won't arise in normal usage
on the second point, yes. Clojure needs to have `=` semantics that are separate from what is needed for RDF
for instance, I want to be able to say things like: (matches {:s s1 :p p1 :o o1} {:s s1 :p p1 :o o1 :g g1})
because the triple in the first arg does match the triple-in-a-graph found in the second arg
@quoll - I think the best thing is to have a Quad record -- with a `triple` constructor that essentially gives you a nil in the `:g`
so (matches (triple :s1 :p1 :o1) (quad :s1 :p1 :o1 :g1)) => true
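A minimal sketch of such a `matches` function, under the single-Quad-record model:

```clojure
;; A triple matches a quad when s/p/o agree and at least one side
;; leaves the graph unspecified (nil), or the graphs are equal.
(defn matches [a b]
  (and (= (:s a) (:s b))
       (= (:p a) (:p b))
       (= (:o a) (:o b))
       (or (nil? (:g a))
           (nil? (:g b))
           (= (:g a) (:g b)))))

(matches {:s :s1 :p :p1 :o :o1}
         {:s :s1 :p :p1 :o :o1 :g :g1})
;=> true
```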
it’ll depend on usage. I’ve never needed quads, except when storing multiple graphs in a single file
I’m a “triples” person myself :simple_smile:
we use both 50/50 - one representation simplifies things for everyone... if you don't care about the nil `:g` - you don't need to...
when I say “storing”, I also mean “loading”, since you get quads back when you read, and they need to go to various graphs
the `Quad` record will seamlessly coerce into a Sesame/Jena triple/quad respectively
yes -- we use quads a lot -- because most of our work is writing pipelines that generate RDF... and we usually want to derive the graph from the data we're loading in
and you’re working with multiple graphs at once?
yes
the fact you can't in other tools is one reason we created Grafter
we have tens of thousands of graphs
ah. You’re one of those :simple_smile:
we manage lots of data for many customers
so a lot of the time its out of our hands
graphs are also very useful for managing data
most RDF stores are optimized around triples, and then group statements into graphs. Those that treat graphs as an equal part of the quad take a small performance hit, and it often seems unjustified given that SPARQL treats graphs so differently
yes, I completely agree that graphs are great that way
@quoll: having used Fuseki, Sesame, Stardog, BigData and GraphDB/OWLIM, I can say that statement's not true in my experience
on many stores you have to use graphs to get acceptable performance
I may not have been clear in what I was trying to say
I agree that it's unfortunate SPARQL only half implements graphs though
when RDF stores are storing data on disk, many of them will use a scheme that is based around subject/predicate/object. Graphs then get implemented as a separate structure (e.g. separate index files, or an index that refers to statements as a group, but not allowing arbitrary selection of subject/predicate/object/graph as single step index lookups).
Some stores do allow arbitrary lookup for quads
but then SPARQL hamstrings it
I mean, you can still work with it, but SPARQL presumes that you’ll be selecting only a couple of graphs, and working with triples from them. The syntax gets messier if you treat graphs as just another element of the quad
ironically, the stores that index symmetrically on the quad can handle the operations just fine. It’s SPARQL syntax that gets in the way
but because of this bias, many stores don’t index symmetrically around the quad
that’s usually OK, because many applications don’t ask for lots of graphs like that
but some do…. hence my statement that you’re “one of those” :simple_smile:
@quoll: yes you're right -- sorry, I was misunderstanding what you were saying... Yes, that's definitely true... Graph performance can be spotty on some stores... I know - because we have some automatically generated queries with well over 1000 graph clauses
but we actually sell a linked data management platform -- so it's unavoidable -- we frequently push the limits and assumptions of every triple store
I can’t recall now which stores index symmetrically around quads. I know ours does, but it’s in dire need of some love, and doesn’t even handle SPARQL 1.1 (i.e. indexing is great, but query/update functionality is not)
I think that the default indexing in Jena is symmetric
I should ask Mike about Stardog though
I’ve never contributed to the internals of Stardog (for obvious reasons). And the Clojure adapter was just a client
I'm guessing stardog does
I thought it did
I can ask… hang on
what store do you work on?
Mulgara
or rather… I did
I’ve been busy 😕
ahh yes I've been to this site before! :simple_smile:
Well… busy life, plus the fact that I’d been on it for over a decade. I’ve been trying new things lately
ahh you're the guy that implemented an RDF store on Datomic... I had that same thought the moment Rich released it... How did it go?
it’s been good, though I put it aside for other stuff. I’m trying to pick it back up again actually
Datomic is implemented in a very similar way to Mulgara’s indexes (persistent trees), so it seemed natural to me
OK, Al doesn’t know. He said I should ask Mike directly :simple_smile:
Mike is fun to talk to about this stuff, but I only have him on email, not IM :simple_smile:
Yes Mike and I have exchanged emails...they have a gitter channel now
what Datomic schema does Kiara use?
does it implement a schema for triples/literals - or does it somehow use vocabularies for a Datomic schema?
literals are handled in two ways
if they’re simple text or using one of a few xsd datatypes then they’re stored as native values (strings, longs, doubles, floats, dates, URIs)
anything else, and they become a structure with properties for value (a string) and datatype (a URI, since there aren’t any IRIs in xsd datatypes)
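A hypothetical sketch of what those two cases might look like as Datomic schema (the attribute names are invented for illustration, not Kiara's actual ones):

```clojure
;; Hypothetical: a property whose values are all xsd:long gets a
;; native long-typed attribute...
[{:db/ident       :my/value
  :db/valueType   :db.type/long
  :db/cardinality :db.cardinality/one}

 ;; ...while an unrecognised literal becomes a structure carrying its
 ;; lexical value and datatype URI.
 {:db/ident       :literal/value
  :db/valueType   :db.type/string
  :db/cardinality :db.cardinality/one}
 {:db/ident       :literal/datatype
  :db/valueType   :db.type/uri
  :db/cardinality :db.cardinality/one}]
```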
RDF properties get scanned for the values that they refer to, and the most general type required is found
this is because if you have a property of my:value and it refers to an xsd:long, then it's a very rare schema that requires that property to also refer to a string, or something else
yes I'd say thats a fair assumption
but if that DOES happen, then the type for the property in the Datomic schema is set to refer to a structure, and that structure then refers to the final value, using different property names for each type
that’s a corner case, but it makes querying more complex 😕
no shit :simple_smile:
😄
I think I need to change how subjects work though
whats the performance on datomic like?
is there any hope of it being competitive?
for now, if they’re IRIs then I convert to QNames (ruthlessly, if necessary) :simple_smile: then convert the QNames to keywords and use those as the entity IDs. This works, but it uses RAM.
I have not pushed it to big datasets yet
Most of the big sets are in RDF/XML (which I despise), and I really want to avoid Jena (I love those guys, but Jena is bloated), so I’ve started on an RDF/XML parser in Clojure
I have a decent Turtle parser though, and that seems OK
cool
but I haven’t loaded anything really big through it
does it work with large files?
that’s another thing. Datomic recommends that you don’t try to do really big loads. They recommend chunking it up. That’s easy in Turtle, but not so much with RDF/XML
Jena does a good job - if you want a standards-compliant, free store... but yeah, the codebase is a mess... Sesame's code is so much better to work with
besides that, I hate the idea of multiple transaction points at arbitrary locations in a load. But it’s pragmatic, so I guess I need to
yes, I’ve contributed to Jena
@quoll: yeah chunking sucks
Mulgara is actually faster if you don't
annoyingly people would chunk their data, and then get annoyed at Mulgara for performing badly
but every chunk becomes a new transaction, which means that it requires a new root to the persistent tree
if you load 1M triples, then you just have a simple tree
so I'm guessing you need to reindex if that happens
if you load 100K triples 10 times, then you end up with most of the nodes in the first tree being duplicated while inserting the second 100K, and so on for each chunk
actually, Mulgara does not do background indexing (which is something I started work on, but never finished)
so when it’s finished loading, it’s fully available
but that makes loading slower
Stardog, for instance, loads immediately into a linear file, and then moves those triples (or quads) into the indexes in the background. Querying looks in both the indexes (fast) and the linear file (slow).
So loads are lightning fast, but querying sucks for a while
the longer you wait, the faster the querying gets
anyway, Mulgara isn’t as complex, but it does not need reindexing
Just got a response on twitter: yes, Stardog is symmetrically indexed (I thought it was)
that's interesting
hello.
Welcome @ricroberts
Hey :simple_smile:
Hi
Hey! Thought you might also be interested in this channel, we’ve also been discussing some YeSPARQL related things :simple_smile:
Definitely! Thanks for inviting me.