rdf

simongray 2020-10-12T08:43:07.010100Z

Can someone with more knowledge of RDF explain to me the rationale for langStrings like “word”@en? They seem like completely the wrong abstraction. Strings of letters are not encoded in a language; it’s the other way around: languages use strings of letters to represent words. These string literals can of course be claimed by multiple languages, and their interpretation may differ, but they are still the same string of letters. With the way langStrings are implemented in SPARQL queries (an enforced filter), basic string queries for words in multiple languages suffer either a combinatorial explosion of language-encoded strings or an unintentionally smaller result set (if the set of languages isn’t exhaustive). It seems to me that languages would be better implemented as an entity that can be linked to any number of strings, or simply as a separate property, which could then be used to selectively filter by language. I cannot see any benefit to hardcoding languages into string literals, and I wonder why the current implementation is even a thing.
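For concreteness, here is the kind of thing I mean, sketched in standard SPARQL over made-up data in which the same letters carry three different tags:

    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

    # Made-up data: :w1 rdfs:label "hund"@da .  :w2 rdfs:label "hund"@de .  :w3 rdfs:label "hund" .
    SELECT ?w WHERE {
      ?w rdfs:label "hund" .   # matches only :w3; "hund"@da and "hund"@de are distinct RDF terms
    }

To catch all three you must either enumerate every possible tag or compare on STR(?label).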

2020-10-12T09:20:35.021100Z

@simongray I think it’s partly pragmatic, but it runs deeper than lang strings. For better or worse, no RDF primitives can be used in ?s position. Whatever design had been taken in this area would have made similar trade-offs elsewhere, so the implemented solution is, I think, a fair compromise. Philosophically, in RDF you have a kind of platonic concept which is assumed to exist outside of human languages; any human language can in principle know it, but it’s the same universal concept regardless of human language. Obviously there are competing schools in philosophy of language that say meaning is relative to other terms, and consequently relative to the language they’re expressed in (e.g. the Sapir–Whorf hypothesis), but that’s not how things are modelled in RDF. In RDF, labels etc. are just annotation properties; i.e. they don’t carry any formal semantics. They’re just an aid to humans; the real meaning is intended to be in the identifiers and their relationships. Regarding SPARQL’s deficiencies (real or perceived): SPARQL was developed independently of RDF, and RDF came first. Of course you could choose to implement language-specific predicates like en:name and fr:nom if you had a good reason to, but it’s not how modelling is conventionally done.
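A sketch of the contrast, with made-up en:/fr: namespaces for the unconventional variant:

    # Conventional modelling: one predicate, language-tagged literals.
    :paris rdfs:label "Paris"@en, "Paris"@fr .

    # Hypothetical language-specific predicates (possible, but unconventional):
    :paris en:name "Paris" .
    :paris fr:nom  "Paris" .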

simongray 2020-10-12T09:24:25.023700Z

Well, looking past the fact that basic multi-language string queries cannot be guaranteed to be correct on unknown datasets (since the full range of possible language encodings must be specified in the query), it just seems to defy logic to me as well. Are names language-encoded? Do I then have 6000+ different names, one for every possible official language?

simongray 2020-10-12T09:26:18.025500Z

It also seems completely idiosyncratic to RDF. What other languages have implemented language as an equality-distorting aspect of string literals?

2020-10-12T09:26:44.025800Z

Not sure how to interpret this:
> Are names language-encoded?

simongray 2020-10-12T09:27:35.027500Z

If I am to query for my own name in an RDF resource, how should I refer to it? “Simon”@en, “Simon”@da, and 6000 other entries?

simongray 2020-10-12T09:28:06.028200Z

My name has nothing to do with my mother tongue

2020-10-12T09:28:55.029Z

In that case just use "simon" as a plain xsd:string … the language tag is optional.
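i.e. something like this in Turtle (foaf:name purely illustrative):

    :simon foaf:name "simon" .   # a plain literal; in RDF 1.1 it implicitly has datatype xsd:string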

simongray 2020-10-12T09:30:35.030500Z

That has not been my experience.

samir 2020-10-12T09:31:56.032700Z

@simongray It is useful to see the lang string as belonging to the presentation layer. Whenever you deal with data that will be queried, with most SPARQL implementations basic strings are the main choice. (opinionated)

simongray 2020-10-12T09:33:12.034300Z

If I search for a basic string, won’t this simply not include any language-encoded strings?

simongray 2020-10-12T09:34:09.035600Z

so in order to actually return a full result set I will have to construct a query for the string in every possible human language in addition to the basic string

simongray 2020-10-12T09:35:22.036900Z

I ran into this issue already using Apache Jena (aristotle)

2020-10-12T09:36:14.038Z

String searching isn’t really what RDF is optimised for. I’d say the use case for lang strings is mainly to provide labels for display. And yes, this can be awkward; multilingual stuff is always awkward in any system. I’m not defending RDF here btw, I’m just explaining how to think of it.

simongray 2020-10-12T09:37:25.040800Z

When the dream is interconnectivity of resources, the fact that strings are represented in this way with no reliable enforcement mechanism seems to completely destroy any hope of integrating multilingual datasets

samir 2020-10-12T09:37:40.041500Z

This is what I meant by “presentation layer”. You fetch the lang strings with the proper language before displaying the entity in some viewer. Or some of them, if you have precedence rules between languages.

☝️ 1
2020-10-12T09:38:17.042400Z

Typically most real-world systems won’t work in 6000+ languages. You’ll probably have just a handful… so filtering to the user’s locale is tractable. Normally I just slightly overselect, e.g. :s rdfs:label ?label, and then pick the most appropriate based on a precedence order etc.
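You can even push the precedence into the query itself; a sketch, assuming lowercase tags in the data (overselecting and picking in application code is often simpler):

    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

    SELECT ?label WHERE {
      :s rdfs:label ?label .
      VALUES (?lang ?rank) { ("da" 1) ("en" 2) ("" 3) }   # precedence: Danish, then English, then untagged
      FILTER(lang(?label) = ?lang)
    }
    ORDER BY ?rank
    LIMIT 1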

2020-10-12T09:38:31.042700Z

Jinx!

simongray 2020-10-12T09:40:30.046Z

Ok, so is it the case that different RDF implementations treat strings with no language encoding in different ways? Cause when I have tried searching for a basic string, it will not return any results unless I specify the language (the enforced filter). This was in Apache Jena.

samir 2020-10-12T09:41:07.047200Z

The main point of RDF is actually to take resource identity very seriously. The labels are seen as helper data. I agree that this is suboptimal. On the other hand, systems integrating over multiple languages can analyse all labels to infer the likelihood that two entities are identical, and then add an appropriate statement articulating this fact.
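For instance, with made-up identifiers for two resources judged to denote the same concept:

    :da_hund owl:sameAs      :en_dog .   # strong OWL identity
    :da_hund skos:exactMatch :en_dog .   # weaker SKOS mapping, often the safer choice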

👍 1
simongray 2020-10-12T09:41:56.047800Z

But langStrings are not just used for labels

simongray 2020-10-12T09:42:18.048600Z

The object of a triple can either be a resource or a literal, right?

samir 2020-10-12T09:43:21.050200Z

That is right; I used the term “labels” to simplify the discussion

👍 1
2020-10-12T09:43:45.050700Z

I’m pretty sure most systems treat (no-)language encoding the same. :foo rdfs:label "foo" and :foo rdfs:label "foo"@en are different triples; I’m pretty sure this is part of the standard. Obviously you can choose to handle this stuff at an ETL layer if you need to… e.g. by normalising labels into xsd strings or whatever to suit your application.
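A sketch of that normalisation as a SPARQL 1.1 Update (destructive, and assumes you genuinely don’t need the tags):

    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

    DELETE { ?s rdfs:label ?l }
    INSERT { ?s rdfs:label ?plain }
    WHERE {
      ?s rdfs:label ?l .
      FILTER(lang(?l) != "")    # only rewrite language-tagged labels
      BIND(STR(?l) AS ?plain)   # keep the lexical form, drop the tag
    }

A read-only alternative is to match regardless of tag at query time, e.g. FILTER(STR(?l) = "foo").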

samir 2020-10-12T09:46:01.054Z

Actually text searching is not really part of SPARQL; often you have a parallel text indexing service (and the clauses for text search can be integrated into the SPARQL request). From the point of view of RDF, “foo” and “foo”@en are just different literals.

2020-10-12T09:46:41.054900Z

yeah, though some triple stores have non-standard extensions to handle it better

👍 1
simongray 2020-10-12T09:46:46.055100Z

ok… thanks for responding, both of you. I still think this aspect of RDF is completely idiosyncratic and to me simply introduces complexity that will need to be handled elsewhere.

simongray 2020-10-12T09:48:08.056500Z

I can’t imagine dealing with a programming language where every string is potentially language-encoded… shudder

2020-10-12T09:49:54.058400Z

You’re right it is idiosyncratic, and it does pass the complexity buck, and I have experienced this frustration myself. So I’m not disagreeing with you. Though I think any solution that wasn’t tailor made for your application would be idiosyncratic here too. You really just need to learn to work with it rather than against it. If you do that your life will be easier.

simongray 2020-10-12T09:54:00.061800Z

Yeah. I guess I am somewhat getting around it by representing the data as a labeled property graph instead, both in Neo4j and using ubergraph in Clojure. I just can’t believe that this passed through multiple committee (re)designs and that the (to me) obvious and much more flexible way of representing languages already used in HTML/XML was not simply reused here.

simongray 2020-10-12T09:55:06.062200Z

In general, RDF seems a bit over-engineered

2020-10-12T10:00:56.062400Z

how do xml/html solve this? It sounds to me like your issue is more that you’re doing an exact string match rather than using FTS.

simongray 2020-10-12T10:01:39.062600Z

lang=“en” and xml:lang=“en”

simongray 2020-10-12T10:01:46.062800Z

what’s FTS?

2020-10-12T10:01:54.063Z

full text search

simongray 2020-10-12T10:03:23.063200Z

well, I am a SPARQL n00b. Is full-text search built-in and if that is the case, how is it accessed?

2020-10-12T10:04:01.063400Z

It’s not part of the standard, but lots of backends support it
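e.g. with Apache Jena’s jena-text extension (assuming you’ve first configured a Lucene text index over rdfs:label):

    PREFIX text: <http://jena.apache.org/text#>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

    SELECT ?s ?label WHERE {
      ?s text:query (rdfs:label "hund") .   # Lucene lookup on the indexed lexical forms
      ?s rdfs:label ?label .
    }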

simongray 2020-10-12T10:06:36.063800Z

I see. Thanks for pointing that out. I tried getting around the language encodings by using regex, but that was just unbearably slow. I guess FTS is close to the performance of matching a string directly?

2020-10-12T10:10:54.064200Z

Yes, it should be. It’ll use Lucene indexing etc. underneath, and you can probably even tweak the indexing there too should you need to. Enabling FTS will usually make indexing slower, of course.

simongray 2020-10-12T10:12:44.066Z

Right. Thank you very much for educating me. I just noticed that I had already starred one of your libraries on GitHub as part of my initial research into dealing with RDF in Clojure. 🙂

👍 1
2020-10-12T10:17:22.069600Z

I can see why you might think that, but I think RDF itself is actually very well engineered, and it’s pretty minimal. The complication comes from the fact that there are lots of interwoven standards, so the ecosystem is complicated; and so are some of the other standards, e.g. OWL. You should only use what you really need, though.

Steven Deobald 2020-10-12T11:33:40.078500Z

Out of curiosity, what are the domains/projects you folks are working on with RDF and Clojure? I'm currently working on an implementation of a digital library for http://pariyatti.org, which requires quite a bit of relationship management between entities: ancient Pali literature with many variations and translations over the past ~2000 years, authors, topics, etc. I started with Neo4j but I'm currently spiking a move to Crux, for a variety of reasons. Because librarians at http://pariyatti.org will forever consist of volunteers with limited time, I've leaned away from semantic web tech in favour of writing something (potentially?) simpler by hand... but if the project ever begins to concern itself with the contents of the documents within the library, it might be foolish to continue avoiding things like RDF.

My go-to example at the current granularity is Ledi Sayadaw, a monk who authored a long list of books in contemporary Pali ~100 years ago. He's now a topic for other, modern literature in other languages. Those sorts of relationships would be a nightmare in Postgres but they've been manageable in Neo4j and Crux so far.

"Contents" might be something as fine-grained as the knowledge that kukkara in Pali means dog in English (and a dozen other translations)... obviously I have no intention of encoding that knowledge at that granularity in a database layer I've hand-rolled. 😉 Have other folks in here walked a similar road?

2020-10-13T07:57:10.082800Z

RDF4j has several triple store backends, just like Jena: in particular a native store (which is persisted to disk) and a memory store, plus a few more… It also comes with a workbench (database server) that you can run, like Jena (Jena’s is called Fuseki). RDF4j has a much cleaner API to my mind, but Jena has more features in some areas, in particular WRT inferencing. (Disclosure: I’m actually supposed to be a core contributor to RDF4j, but that’s only because I submitted a bunch of extensive bug reports a few years back, with a few small patches.) I actually use Jena in a few places too.

simongray 2020-10-12T12:10:24.078600Z

I think having to figure everything out yourself is freeing, but it also requires more extensive research. Sounds like you don't need to integrate with any other sources or distribute your data? In that case, I don't think RDF is a requirement.

simongray 2020-10-12T12:16:47.078800Z

I've inherited the official Danish wordnet which was created as part of a big research project more than a decade ago. The primary data lives in a SQL db and only exists as RDF in a limited exported version using the original draft version of RDF/XML. I need to support linking with the Princeton WordNet while supporting a bunch of future functionality, so my mission has been normalising the usage of RDF and graphs for data modelling, including at the db level.

2020-10-12T13:22:31.079600Z

publishing government data (mostly statistical data)

2020-10-12T13:33:17.079700Z

Actually the cultural/arts/museum space has historically been a large adopter of RDF and linked data. Lots of big museums, art collections and libraries etc. use RDF for their metadata catalogs. There is definitely a tonne of vocabularies and work using RDF in this space… in particular https://iiif.io/, which is adopted by dozens of national museums/galleries etc. worldwide, but also CIDOC: http://www.cidoc-crm.org/ and probably a bunch more. Not sure what the latest stuff is, but I could probably find out. SKOS was designed for representing thesauri etc.: https://www.w3.org/2004/02/skos/ @steven427 I’d say there’s a strong argument to use RDF here, given its wide adoption. Also there’s a good chance RDF will be around long after trendier stuff like crux.

Steven Deobald 2020-10-12T14:04:53.080Z

@rickmoynihan Interesting! If you had a line on more recent developments, I'd be very curious to know what they are. My entire career was spent in finance / e-commerce type things so I'm really a fish out of water in what seems to be an almost entirely government / academic dominated space.

Steven Deobald 2020-10-12T14:07:15.080200Z

> Also there’s a good chance RDF will be around long after trendier stuff like crux.
@rickmoynihan I suppose I hadn't considered these two things at odds with each other. Is there a particular backing store(s) people tend to rely on in the world of RDF?

Steven Deobald 2020-10-12T14:10:38.080400Z

@simongray You're right, for the foreseeable future this system won't integrate with any other or require any sort of data distribution. Pariyatti will be internally curated and won't resemble anything like Wikimedia's work. That said, it's a fine line between a curated library and a system for researching ancient linguistics. The latter no doubt has a lot to learn from the work already done on the semantic web, whether the system is open or not.

Steven Deobald 2020-10-12T14:12:12.081400Z

Very cool. My partner has been working with http://CivicDataLab.in for the past half year or so, in a similar space. I'm not sure they've ever even contemplated RDF for their statistical data, though.

Steven Deobald 2020-10-12T14:18:23.081500Z

@rickmoynihan Do you know of any specific organizations or projects using CIDOC? I'm surprised it didn't come up when I was researching off-the-shelf tools.

2020-10-12T15:25:07.081700Z

http://www.cidoc-crm.org/stackeholders http://www.cidoc-crm.org/sig-members-list
I guess the above lists would be a good place to look. Also the Smithsonian: https://americanart.si.edu/about/lod
There have been lots of other linked data projects in this area, but I can’t recall many off the top of my head… I can ask some colleagues.

2020-10-12T15:28:24.082Z

> Is there a particular backing store(s) people tend to rely on in the world of RDF?
There are many… probably half a dozen serious commercial options, plus the two big open-source ones, Jena and RDF4j; and then maybe twenty or more open-source ones targeting various niches or in various stages of development.