In my previous excursion into the intersection of knowledge graphs and how they fit into the wild world of artificial intelligence, I brought up a point that I think needs to be explored in more depth. Not all knowledge graphs are SPARQL-based, but the language is very useful for a certain class of knowledge graphs, specifically RDF graphs, that have become the default for managing inferential logic.
This article is not an introduction to SPARQL, but rather looks at how information is stored within the graph in such a way that SPARQL can be used to retrieve information without necessarily needing much in the way of prior knowledge about the ontologies.
SPARQL is a powerful language because it is a language for the manipulation of sets of assertions called triples (or, more properly, tuples), with sets of such tuples being then known as graphs. This approach lets you identify relevant subgraphs within larger graphs, and from them you can retrieve deeply related information. So long as you remember that nearly all operations that are done with SPARQL are set operations (in essence, manipulating Venn Diagrams) you can do some very powerful things with the language.
Lately, knowledge graphs have been emerging as good candidates for retrieval augmentation with LLMs. While some knowledge graphs aren’t RDF/SPARQL-based, enough are that understanding the benefits of working with SPARQL is definitely worthwhile in the community language model space. In particular, SPARQL makes it possible to create very generalized query patterns that are nonetheless quite useful, especially when combined with a bit of creative JavaScript.
The Label Conundrum
RDF revolves around the use of URIs. However, it’s usually the case that users who query a knowledge graph should not need to know what those URIs are. I’d even go so far as to argue that the need to know the URI for a given resource has been a major roadblock in the adoption of knowledge graphs.
For instance, suppose that you are looking for information about Wednesday from the knowledge graph. In this case, this could be one of at least three different concepts:
The Netflix show Wednesday, which focuses on the Charles Addams character as played by Jenna Ortega.
The character of that name, as created by Charles Addams.
The day of the week.
There may be others, but these three are sufficient to illustrate the process. The examples covered here are defined for the Apache Jena-Fuseki environment, although most triple stores contain some variation of them.
The RDF Schema standard defined the first “standard” property for identifying labels: the rdfs:label predicate, introduced way back in 2004. Many older ontologies support rdfs:label, and even where it is not the star of the show for labels, other label and title properties are often declared as subproperties of it.
The problem with this is that not all labels are equal. There are labels for display in interfaces, labels intended to identify concepts internally to the knowledge graph, labels for identifying secondary synonyms, labels specific to acronyms, and, of course, labels in different languages. SKOS, the Simple Knowledge Organization System, introduced a few more predicates, including skos:prefLabel and skos:altLabel, which define primary and secondary labels.
SKOS labels, however, are a bit problematic in that they are structured somewhat differently than rdfs:label. This means that many implementations either ignore these structural differences or choose SKOS without bringing over anything else from the ontology.
Dublin Core, which emerged from library work, incorporates the dcmi:title predicate for resource labels, but as the intent of Dublin Core is the description of books and other intellectual property, such a title breaks down for other entities such as people.
More recently, much of the role of Dublin Core was subsumed by the schema.org super-scheme, which introduced both the schema:name and schema:alternateName properties. The use of these label identifiers is not as pervasive as the older rdfs:label, but it should be taken into account as schema.org-based ontologies become more common.
What this all means in practice is that you cannot simply create a SPARQL query of the form
select ?label where {
?s rdfs:label ?label
}
and expect to actually match anything relevant if you do not know the ontology being used. Instead, you have to get a bit creative. One common approach is to make use of RDF lists:
PREFIX list: <http://jena.apache.org/ARQ/list#>
select ?label where {
(rdfs:label skos:prefLabel dcmi:title schema:name) list:member ?labelP.
?s ?labelP ?label.
}
In this case, list:member is an example of a magic property (Jena's documentation calls these property functions), in that it takes an RDF list and returns its items, one after another. If one of the predicates matches a triple, ?label will be bound and can be passed to the select list. Otherwise, ?label remains unbound and nothing gets passed back. An alternative approach would be to make all of the other predicates subproperties of rdfs:label in the RDF, but this places more of the onus of ensuring fidelity to the model on the data providers. The list approach has the advantage that you can add additional properties to the SPARQL without a significant performance penalty, and thus only need to worry about matching once.
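As a portable variation on the same idea, the following sketch builds the query text in JavaScript from an array of predicates. Note that it swaps the Jena-specific list:member for the standard SPARQL 1.1 VALUES clause, which should be accepted by any conformant triple store. The function name buildLabelQuery, the LABEL_PREFIXES table, and the namespace IRIs chosen for dcmi: and schema: are assumptions made for illustration.

```javascript
// Hypothetical prefix table. The dcmi: and schema: IRIs are assumptions;
// substitute whatever namespaces your own graph uses.
const LABEL_PREFIXES = {
  rdfs: "http://www.w3.org/2000/01/rdf-schema#",
  skos: "http://www.w3.org/2004/02/skos/core#",
  dcmi: "http://purl.org/dc/terms/",
  schema: "https://schema.org/",
};

// Build a SELECT query that tries each candidate label predicate.
// Uses the standard VALUES clause rather than Jena's list:member.
function buildLabelQuery(predicates) {
  if (!Array.isArray(predicates) || predicates.length === 0) {
    throw new Error("At least one label predicate is required");
  }
  // Declare only the prefixes that the predicates actually use.
  const used = new Set(predicates.map((p) => p.split(":")[0]));
  const prefixLines = [...used].map((name) => {
    if (!Object.prototype.hasOwnProperty.call(LABEL_PREFIXES, name)) {
      throw new Error(`Unknown prefix: ${name}`);
    }
    return `PREFIX ${name}: <${LABEL_PREFIXES[name]}>`;
  });
  return [
    ...prefixLines,
    "select ?s ?label where {",
    `    values ?labelP { ${predicates.join(" ")} }`,
    "    ?s ?labelP ?label.",
    "}",
  ].join("\n");
}

console.log(
  buildLabelQuery(["rdfs:label", "skos:prefLabel", "dcmi:title", "schema:name"])
);
```

Adding another label predicate then becomes a one-line change to the array (plus a prefix entry, if the namespace is new) rather than an edit to every query.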
Retrieving Resources by Label
Once the range of potential predicates is defined, getting the resource via a label is still somewhat problematic. For starters, the strings the labels hold may be partially or wholly upper-case, have unexpected accents (such as umlauts), or even have punctuation or unexpected spacing. Moreover, you may be dealing with plurals or word stems (bear vs. bears vs. bearing).
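To make the problem concrete, here is a minimal sketch of the kind of client-side normalization that case, accents, punctuation, and spacing would call for. The helper name normalizeLabel is hypothetical, the sketch only handles Latin-script labels, and it does nothing about plurals or word stems, which is where a text index earns its keep.

```javascript
// Hypothetical helper: reduce a label to a comparable key.
// Latin-script only; stemming (bear / bears / bearing) is not handled.
function normalizeLabel(label) {
  return label
    .normalize("NFD") // split accented letters into base letter + combining mark
    .replace(/[\u0300-\u036f]/g, "") // drop the combining marks (umlauts, accents)
    .toLowerCase()
    .replace(/[^a-z0-9\s]/g, " ") // treat punctuation as spacing
    .trim()
    .replace(/\s+/g, " "); // collapse runs of whitespace
}

console.log(normalizeLabel("  WEDNESDAY (Television  Show) "));
console.log(normalizeLabel("Müller"));
```

Normalizing both the stored labels and the incoming match string this way allows an exact comparison, but it has to be applied to every label up front, which is essentially a hand-rolled index.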
The SPARQL contains() and regex() functions would seem to be one approach that you could take to ameliorate at least part of these problems. Still, these functions (used in filters) are typically very slow in comparison to triple matching.
PREFIX list: <http://jena.apache.org/ARQ/list#>
# Workable but VERY slow
select ?s where {
(rdfs:label skos:prefLabel dcmi:title schema:name) list:member ?labelP.
?s ?labelP ?label.
filter(contains(lcase(str(?label)), lcase(?matchStr)))
}
With Jena (and many other triple stores), the solution is to take advantage of indexing, especially through the use of Lucene or Elasticsearch. In this case, variations of words and phrases are indexed to base terms so that even if the match string isn’t an exact match for a defined label in the knowledge graph, the search will still return the resources whose labels correspond to it.
The text:query predicate is another magic property, and is used to query against the currently defined Lucene index.
PREFIX list: <http://jena.apache.org/ARQ/list#>
PREFIX text: <http://jena.apache.org/text#>
# If index exists, this will be much faster and more flexible
select ?s where {
(rdfs:label skos:prefLabel dcmi:title schema:name) list:member ?labelP.
?s text:query (?labelP ?matchStr).
}
In this case, ?matchStr is passed externally. In the magic property text:query, the variable ?s is populated (bound) only if the match string has a corresponding base string associated with ?s. For example, “Wednesday (Television Show)” and “Wednesday (Character)” are valid responses in the index for the match string “Wednesday” but “Thursday” would not be.
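Because the match string comes from outside the query, it typically needs two layers of escaping before it is substituted in: one for Lucene's query syntax and one for the SPARQL string literal that carries it. The sketch below shows one way this might be done. Both function names are hypothetical, and the set of escaped characters follows the special characters of Lucene's classic query parser.

```javascript
// Hypothetical helper: backslash-escape characters that Lucene's classic
// query parser treats as syntax, so user input is matched literally.
function escapeLuceneTerm(text) {
  return text.replace(/[\\+\-!():^\[\]"{}~*?|&\/]/g, "\\$&");
}

// Hypothetical helper: wrap a value as a double-quoted SPARQL string literal.
function toSparqlString(value) {
  const escaped = value
    .replace(/\\/g, "\\\\") // backslashes first, so later escapes are not doubled
    .replace(/"/g, '\\"')
    .replace(/\n/g, "\\n")
    .replace(/\r/g, "\\r");
  return `"${escaped}"`;
}

// The text that would replace ?matchStr in the query.
const userInput = "Wednesday (Character)";
console.log(toSparqlString(escapeLuceneTerm(userInput)));
```

If the user is meant to use Lucene operators such as wildcards, the first step would be skipped or relaxed. The second step is needed whenever the value is spliced into the query text rather than passed as a bound parameter.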
Note that once you have the pointer to the resource in question, you can use the same technique for getting the name of that node if you don’t know the predicate. In order to reduce both code duplication and potential sources of error, it makes sense to reuse the same list of label predicates:
PREFIX list: <http://jena.apache.org/ARQ/list#>
# ?s is passed as an external variable.
select ?label where {
(rdfs:label skos:prefLabel dcmi:title schema:name) list:member ?labelP.
?s ?labelP ?label.
}
In all of these cases, the goal is to minimize or eliminate the dependency on a single ontology.
Retrieving Resources by Class
Retrieving items by class name is trickier, though the technique follows the same principle. The difficulty is that classes usually are not named or labeled in RDF ontologies.
This is one of those cases where it makes sense to use SHACL, even if you are not using SHACL to validate. The sh:name property can be used with either a shape node or a property node. Typically, the semantics around sh:name are somewhat ambiguous, but it usually does have an association with a label.
This is a typical SHACL definition as rendered in RDF Turtle:
@prefix sh: <http://www.w3.org/ns/shacl#>.
@prefix owl: <http://www.w3.org/2002/07/owl#>.
@prefix individual: <http://www.example.org/ns/individual#>.
individual: a sh:NodeShape, owl:Class;
sh:name "Individual";
sh:property individual:name;
# more property definitions ...
.
individual:name a sh:PropertyShape;
# definition for individual "name".
.
What’s worth noting here is the relationship between the node shape, a class declaration (shapes aren’t always classes, but classes are usually shapes) and the existence of the sh:name predicate.
PREFIX list: <http://jena.apache.org/ARQ/list#>
PREFIX text: <http://jena.apache.org/text#>
PREFIX sh: <http://www.w3.org/ns/shacl#>
# ?classNameCandidate is passed as an external variable.
select ?instance ?label where {
# Find the shape (class) whose name matches the candidate.
(rdfs:label skos:prefLabel dcmi:title schema:name sh:name) list:member ?classLabelP.
?class a sh:NodeShape.
?class text:query (?classLabelP ?classNameCandidate).
# Retrieve the instances of that class along with their labels.
?instance a ?class.
(rdfs:label skos:prefLabel dcmi:title schema:name) list:member ?labelP.
?instance ?labelP ?label.
} order by ?label
Here, the list used to find the class has been extended to include the sh:name predicate, and the class and its instances are matched through separate predicate variables so that each can use a different label property. The script generates a list of instances and their corresponding labels. Thus, if I’m looking for people and the class name is individual, I would have added an index entry that identifies “people” with “individual”. (I’m deliberately handwaving how one goes about doing this, but more information can be found at https://jena.apache.org/documentation/query/text-query.html).
Note that a similar approach can be taken to get a list of available classes. This might, in fact, be part of a RAG discovery process to reduce the overall complexity of building RAGs. Again, this assumes SHACL, with the sh:name property on the SHACL shape:
PREFIX list: <http://jena.apache.org/ARQ/list#>
PREFIX text: <http://jena.apache.org/text#>
PREFIX sh: <http://www.w3.org/ns/shacl#>
# ?classNameCandidate is passed as an external variable.
select ?class ?label where {
(rdfs:label skos:prefLabel dcmi:title schema:name sh:name) list:member ?labelP.
?class a sh:NodeShape.
?class text:query (?labelP ?classNameCandidate).
?class ?labelP ?label.
} order by ?label
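For a discovery step like this, the client usually has to turn the endpoint's response into something simpler. The sketch below assumes the endpoint returns the standard SPARQL 1.1 Query Results JSON format and groups the labels by class. The function name summarizeClasses and the sample response are made up for illustration.

```javascript
// Hypothetical helper: group ?label values by ?class from a response in
// the SPARQL 1.1 Query Results JSON format.
function summarizeClasses(sparqlJson) {
  const byClass = new Map();
  for (const binding of sparqlJson.results.bindings) {
    // Skip rows where either variable is unbound.
    if (!binding.class || !binding.label) continue;
    const iri = binding.class.value;
    if (!byClass.has(iri)) byClass.set(iri, []);
    const labels = byClass.get(iri);
    if (!labels.includes(binding.label.value)) labels.push(binding.label.value);
  }
  return [...byClass].map(([iri, labels]) => ({ iri, labels }));
}

// Made-up sample response, for illustration only.
const sampleResponse = {
  head: { vars: ["class", "label"] },
  results: {
    bindings: [
      {
        class: { type: "uri", value: "http://www.example.org/ns/individual#" },
        label: { type: "literal", value: "Individual" },
      },
      {
        class: { type: "uri", value: "http://www.example.org/ns/individual#" },
        label: { type: "literal", value: "Person" },
      },
    ],
  },
};

console.log(JSON.stringify(summarizeClasses(sampleResponse), null, 2));
```

The grouped form is small enough to hand to a prompt or a plugin as a list of the classes the graph knows about.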
Standardizing on Services
RAGs are evolving at an extraordinary pace. The idea of extending out from LLMs only started surfacing a couple of months ago, and there is not a lot of consistency yet from the machine-learning side. In contrast, knowledge graphs are quite stable (and arguably even quiescent).
One potential approach for working with RAGs is to set up a server (such as an Express server in Node.js) that can be used to wrap the various generalized queries. Keep in mind that the response from a typical LLM is a JSON structure of the form:
{
"question": "Who are the people who work at Big Co?",
"answer": "The list includes Jane Doe, Samuel Jackson, Sherlock Holmes, Tony Stark, Jean Gray and others.",
"source": "https://myKnowledgeGraph.com/path/to/endpoint/service"
}
where source contains the endpoint for retrieving the content.
While we’re not quite there yet, this use of context-free SPARQL lends itself well to a service model. For instance, the following can be used to retrieve all of the relevant classes in a given knowledge graph:
source: "https://myKnowledgeGraph.com/llm/getClassesByName"
In this case, the service would be passed a tokenized prompt string as a JSON document, and could use contextual clues to determine that classes by name (or partial name) were what was requested. This will be covered in more detail in a subsequent post.
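As a rough sketch of the service model, the following maps a service name to a stored query and substitutes the externally supplied value. Everything here is hypothetical: the SERVICES registry, the resolveService and quoteLiteral names, and the simplified query, which looks only at sh:name. In an Express server, a route handler would call something like resolveService and send the resulting query on to the triple store.

```javascript
// Hypothetical registry: each service names the external variable its
// query expects and holds the (simplified) query text.
const SERVICES = {
  getClassesByName: {
    param: "classNameCandidate",
    query: [
      "PREFIX text: <http://jena.apache.org/text#>",
      "PREFIX sh: <http://www.w3.org/ns/shacl#>",
      "select ?class ?label where {",
      "    ?class text:query (sh:name ?classNameCandidate).",
      "    ?class a sh:NodeShape; sh:name ?label.",
      "} order by ?label",
    ].join("\n"),
  },
};

// Wrap a value as a double-quoted SPARQL string literal.
function quoteLiteral(value) {
  return `"${value.replace(/\\/g, "\\\\").replace(/"/g, '\\"')}"`;
}

// Resolve a service name to its endpoint URL and a ready-to-run query.
function resolveService(baseUrl, serviceName, value) {
  if (!Object.prototype.hasOwnProperty.call(SERVICES, serviceName)) {
    throw new Error(`Unknown service: ${serviceName}`);
  }
  const service = SERVICES[serviceName];
  const literal = quoteLiteral(value);
  const pattern = new RegExp(`\\?${service.param}\\b`, "g");
  return {
    source: `${baseUrl}/${serviceName}`,
    // A replacer function keeps "$" in the value from being interpreted.
    query: service.query.replace(pattern, () => literal),
  };
}

const resolved = resolveService(
  "https://myKnowledgeGraph.com/llm",
  "getClassesByName",
  "people"
);
console.log(resolved.source);
console.log(resolved.query);
```

Keeping the queries in a registry means the client only ever needs a service name and a value; the SPARQL, and the ontology knowledge it encodes, stays on the server.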
The following illustrates how RAGs are set up. It’s worth noting that the LLM Processor makes reference to a Lang-Chain to retrieve initial content before passing this information to a finalizer. A RAG is simply another form of Lang-Chain.
The processing is done not in the LLM but rather in the LLM Processor on the client (typically the browser). This is a key point - the information that comes back to the user client represents the LLM content, but it is, in fact, quite possible to make use of this architecture without ever actually touching the LLM.
Summary
This should make sense, and it points to a somewhat different architecture, where the client is LLM-agnostic: the same query can retrieve content from multiple LLMs, knowledge graphs, or data sources with the appropriate plugin.
The specific mechanism for how Lang-Chains (and plugins) work is out of the scope of this article, though I will be addressing it in a subsequent post. However, hopefully this revelation can serve to get your wheels spinning about how to use Lang-Chains as integration tools.
In the next post, I intend to explore a more generalized framework, using Shapes, to make other aspects of SPARQL context neutral, and by extension, make it play well with AI tools.
Kurt Cagle is the Editor of The Ontologist. He lives in Bellevue, Washington.