The Ontologist

The Format Convergence

Kurt Cagle — Tue, 23 Jun 2026 19:10:24 GMT

Chloe Shannon & Kurt Cagle · The Ontologist

On June 12, 2026, two engineers at Google Cloud published a short blog post introducing something they called the Open Knowledge Format — OKF for short. Sam McVeety and Amir Hormati, both Tech Leads in the Data Analytics group, described it as “a vendor-neutral, agent- and human-friendly standard for representing the metadata, context, and curated knowledge that modern AI systems need.” The specification itself, they noted, “fits on a single page.”

It is, at its core, a directory of markdown files with YAML frontmatter.

Anyone who has been following the DataBook specification — or who has spent time watching the semantic web community converge on similar patterns over the past two years — will recognise something in that description. Not with alarm, but with the particular satisfaction of watching two independent lines of inquiry arrive at recognisably similar conclusions. When that happens, it usually means something real is being pointed at.

This article is about what that something is, why the convergence matters, and what the right response looks like from the semantic web community.

What OKF Actually Is

It is worth reading the OKF announcement carefully before reacting to it, because the spec is more thoughtful than its simplicity might suggest.

OKF formalises a pattern that Andrej Karpathy articulated in his “LLM Wiki” gist: rather than making AI agents search the same documents for the same facts repeatedly, you give them a shared markdown library that grows more useful over time. Agents can read it, update it, and cross-reference it. The bookkeeping that causes humans to abandon wikis (updating cross-references, maintaining consistency across files, keeping indexes current) is exactly the kind of work that LLMs handle reliably.

The pattern had been appearing independently across many teams: Obsidian vaults wired to coding agents, AGENTS.md and CLAUDE.md convention files, metadata-as-code repositories inside data engineering teams. OKF’s contribution is to formalise the small set of conventions needed to make these bespoke instances interoperable.

The design is deliberately minimal. An OKF bundle is a directory. Each file represents one concept — a table, a metric, a runbook, an API, a dataset. The only required field in the YAML frontmatter is type. Everything else — title, description, resource, tags, timestamp — is optional. Files link to each other with standard markdown links, turning the directory into a navigable graph. The spec reserves a small number of filenames (index.md for progressive disclosure, log.md for chronological history) and otherwise leaves the content model entirely to the producer.
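To make that shape concrete, here is a minimal Python sketch of how a consumer might read one such file. It assumes frontmatter made of flat `key: value` lines and uses a hand-rolled parser rather than a YAML library. The sample file, the `parse_okf_file` helper, and the `type: table` value are illustrative assumptions, not taken from the OKF spec or its reference tooling.

```python
from typing import Dict, Tuple

# Hypothetical OKF-style file: frontmatter, then markdown with a link.
SAMPLE = """---
type: table
title: Orders
tags: sales, finance
---
# Orders

One row per customer order. See [customers](customers.md).
"""


def parse_okf_file(text: str) -> Tuple[Dict[str, str], str]:
    """Split a markdown file into (frontmatter, body).

    Handles only flat `key: value` lines; real YAML is much richer.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError("missing opening '---' frontmatter delimiter")
    try:
        end = lines.index("---", 1)
    except ValueError:
        raise ValueError("missing closing '---' frontmatter delimiter") from None
    meta: Dict[str, str] = {}
    for line in lines[1:end]:
        if not line.strip():
            continue
        key, sep, value = line.partition(":")
        if not sep:
            raise ValueError(f"not a key: value line: {line!r}")
        meta[key.strip()] = value.strip()
    # `type` is the only field the article describes as required.
    if "type" not in meta:
        raise ValueError("frontmatter must declare a 'type'")
    body = "\n".join(lines[end + 1:])
    return meta, body


meta, body = parse_okf_file(SAMPLE)
print(meta["type"], "-", meta.get("title", "(untitled)"))
```

Everything beyond the `type` check is left permissive on purpose, which is the point of a minimally opinionated format: a consumer that knows nothing about `tags` can still read the file.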

Three design principles underpin all of this. First: minimally opinionated — the spec defines the interoperability surface, not the content model. Second: producer/consumer independence — a bundle hand-authored by a human can be consumed by an AI agent; a bundle generated by a metadata pipeline can be browsed in a visualiser. Third: format, not platform — no proprietary account or SDK required to read, write, or serve it.

These are not accidental choices. They are the choices you make when you want a format to travel.

The Convergence With DataBook

The DataBook specification, which we have been developing over the past year through the W3C Holon Community Group and the broader HGA pipeline, shares OKF’s fundamental architecture: markdown as the carrier, --- delimited YAML frontmatter as the structured header, file-based distribution that is git-hostable and requires no proprietary runtime, and a design goal of being readable by humans and parseable by agents without a translation layer.

This is not a coincidence of surface syntax. It is a convergence on a genuinely correct architectural choice. Markdown is the closest thing the contemporary web has to a universal document format: it renders on GitHub, it’s indexable by any search tool, it’s writable in any editor, it survives moving between systems. YAML frontmatter is the lightest-weight mechanism for attaching structured metadata to a prose document without breaking either the prose or the metadata. The combination has been rediscovered so many times — in static site generators, in note-taking tools, in scientific publishing, in LLM tooling — that its emergence as a knowledge format standard feels less like design and more like inevitability.

But convergence at the architecture level is not the same as convergence at the capability level. OKF and DataBook are aiming at recognisably different things, and the differences are worth understanding carefully — because they are almost entirely complementary rather than competing.

Where They Diverge

OKF is a wiki format. One concept per file. Files link to each other via markdown. The bundle is a graph in the implicit sense that a linked set of documents is always a graph, but the graph structure is not formally typed, queryable, or deployable. The format is the contribution; OKF ships reference tooling (an enrichment agent, a static visualiser) but explicitly frames these as proofs of concept. The ecosystem of producers and consumers is expected to grow far beyond what Google has shipped.

DataBook is a document-plus-data format. A single DataBook file can carry multiple typed fenced data blocks — Turtle, SPARQL, SHACL, JSON-LD — alongside the prose that contextualises them. The frontmatter is richer: IRI-based identity (id:), versioning, author provenance, target named graph, push mode. The DataBook CLI (currently at v1.4.4) can parse these blocks and push them directly to a Fuseki triplestore via the SPARQL Graph Store Protocol, with SHACL validation gating the push. A DataBook is not just a representation of knowledge; it is an executable unit of knowledge deployment.
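As a rough illustration of the typed fenced block idea, the following Python sketch pulls labelled blocks out of a markdown document. It assumes the blocks are ordinary fenced code blocks whose info string names the payload type (`turtle`, `sparql`, and so on). The actual DataBook block conventions and CLI behaviour are not described here, so `extract_blocks` and the sample document should be read as hypothetical.

```python
from typing import List, Optional, Tuple

FENCE = "`" * 3  # built programmatically so this example nests cleanly

# Hypothetical document: prose plus two typed data blocks.
DOC = "\n".join([
    "# Orders",
    "",
    "Some prose that contextualises the data.",
    "",
    FENCE + "turtle",
    "ex:Orders a ex:Table .",
    FENCE,
    "",
    FENCE + "sparql",
    "SELECT ?s WHERE { ?s ?p ?o }",
    FENCE,
])


def extract_blocks(markdown: str) -> List[Tuple[str, str]]:
    """Return (block_type, content) pairs for each fenced block."""
    blocks: List[Tuple[str, str]] = []
    current_type: Optional[str] = None   # None means outside any fence
    buffer: List[str] = []
    for line in markdown.splitlines():
        if line.startswith(FENCE):
            if current_type is None:
                # Opening fence: the info string names the block type.
                current_type = line[len(FENCE):].strip() or "text"
                buffer = []
            else:
                # Closing fence: emit the collected block.
                blocks.append((current_type, "\n".join(buffer)))
                current_type = None
        elif current_type is not None:
            buffer.append(line)
    return blocks


for block_type, content in extract_blocks(DOC):
    print(block_type, "->", content)
```

A deployment tool would then route each block by type: RDF payloads to the store, SHACL shapes to a validator, queries to an endpoint.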

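The differences, mapped clearly:

Aspect | OKF | DataBook
Unit of content | One concept per markdown file | One document carrying prose plus multiple typed fenced data blocks (Turtle, SPARQL, SHACL, JSON-LD)
Frontmatter | Only type is required; title, description, resource, tags, timestamp are optional | Adds IRI-based identity (id:), versioning, author provenance, target named graph, push mode
Graph structure | Implicit, via markdown links; not formally typed, queryable, or deployable | Typed RDF payloads that can be validated and deployed to a triplestore
Tooling | Reference enrichment agent and static visualiser, framed as proofs of concept | DataBook CLI pushes blocks to Fuseki via the SPARQL Graph Store Protocol, gated by SHACL validation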

The framing that makes sense of this table is straightforward: OKF is the lingua franca layer. DataBook is what that lingua franca becomes when the knowledge it carries needs to be formally typed, semantically linked, and deployed to a triplestore. OKF asks “what is this piece of knowledge?” DataBook asks the same question, and then also asks “what graph does it belong to, what shape constraints govern it, and how does it get there?”

The Semantic Web Layer That OKF Leaves Open

OKF’s design is explicitly extensible. The spec is versioned. The authors write: “OKF v0.1 is a starting point, not a finished standard. The format will evolve as more producers and consumers emerge.” They invite issues, pull requests, and proposed extensions. They explicitly welcome “alternative implementations and adoption beyond Google products.”

This is the right design. A minimally opinionated core that the community can extend is exactly what a knowledge format needs in order to earn widespread adoption. But it also means there is real work to be done in specifying the semantic web layer — the extensions that make OKF usable in contexts where formal ontological typing, IRI-based identity, named graph management, and SPARQL-queryability are not optional.

The semantic web community has been wrestling with exactly these problems for decades, and the solutions exist: RDF for formal typing, OWL for ontological reasoning, SHACL for constraint validation, SPARQL for query, the Graph Store Protocol for ingest. What has been missing is a lightweight document format that brings these capabilities to bear without requiring the full weight of a triplestore at the authoring layer. DataBook was designed to be exactly that.

A DataBook is, in effect, an OKF document with semantic web superpowers. It carries its RDF payload in the same human-readable, git-hostable, markdown-native format that OKF specifies, and adds the typed block structure and tooling needed to deploy that payload to a knowledge graph.
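To give a sense of how little machinery the ingest step needs, here is a minimal Python sketch that prepares (but does not send) a Graph Store Protocol request for a Turtle payload. The endpoint URL, graph IRI, and `build_gsp_request` helper are placeholders of my own. The `/data` path follows a common Fuseki default, but check your own server's configuration.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical values: substitute your own store and named graph.
ENDPOINT = "http://localhost:3030/knowledge/data"
GRAPH_IRI = "https://example.org/graph/orders"

TURTLE = """@prefix ex: <https://example.org/> .
ex:Orders a ex:Table .
"""


def build_gsp_request(endpoint: str, graph_iri: str, turtle: str) -> Request:
    """Build an HTTP PUT that replaces the named graph with `turtle`.

    In the Graph Store Protocol, PUT replaces the graph's contents and
    POST merges into them; the target graph is named by `?graph=`.
    """
    url = endpoint + "?" + urlencode({"graph": graph_iri})
    return Request(
        url,
        data=turtle.encode("utf-8"),
        method="PUT",
        headers={"Content-Type": "text/turtle; charset=utf-8"},
    )


request = build_gsp_request(ENDPOINT, GRAPH_IRI, TURTLE)
print(request.get_method(), request.full_url)
# Sending it would be: urllib.request.urlopen(request)
```

A validation gate would sit immediately before the send, refusing to dispatch the request when the payload fails its shapes. That step is omitted here because it needs a SHACL engine.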

What a Collaboration Could Look Like

The right response from the semantic web community is not to position DataBook as a competitor to OKF. It is to propose DataBook as a formal OKF profile — a set of conventions that extend OKF v0.1 for semantic web use cases, conformant with the base spec and backward-compatible with OKF tooling, but carrying the additional machinery needed for RDF deployment.

An OKF profile for semantic web would specify:

A richer frontmatter schema, extending the base OKF fields with id (IRI), version, graph (target named graph), and author provenance
A convention for typed fenced blocks carrying RDF serialisations (Turtle, JSON-LD), SPARQL queries, and SHACL shapes. An existing W3C standard, RDFa — to which one of the authors of this article contributed — takes a complementary approach, embedding RDF directly in HTML fragments within markdown. Typed fenced blocks keep RDF payloads cleanly separated from prose and naturally syntax-highlighted; both approaches could coexist within a conformant semantic web profile
A reference to the SPARQL Graph Store Protocol as the canonical ingest mechanism
An optional SHACL validation step gating deployment

This is a small number of well-defined extensions to a minimal base. An OKF bundle conforming to the semantic web profile would be readable by any OKF consumer, and additionally deployable to any SPARQL 1.1-compatible triplestore by any DataBook-aware toolchain.
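The backward-compatibility claim can be sketched as a simple check: a plain OKF file still satisfies the base rule, while a profiled file additionally satisfies the extension rules. In this Python sketch the field names follow the list above, but treating `id`, `version`, and `graph` as mandatory, and the crude IRI test, are assumptions of the example rather than anything the profile yet defines.

```python
from typing import Dict, List
from urllib.parse import urlparse

# Assumed for illustration: all three profile fields are mandatory.
PROFILE_FIELDS = ("id", "version", "graph")
IRI_FIELDS = ("id", "graph")


def check_profile(meta: Dict[str, str]) -> List[str]:
    """Return a list of problems; an empty list means the check passed."""
    problems: List[str] = []
    if not meta.get("type"):
        problems.append("missing base OKF field: type")
    for name in PROFILE_FIELDS:
        if not meta.get(name):
            problems.append(f"missing profile field: {name}")
    for name in IRI_FIELDS:
        value = meta.get(name)
        # Crude absolute-IRI test: a scheme such as https: or urn: is present.
        if value and not urlparse(value).scheme:
            problems.append(f"{name} is not an absolute IRI: {value}")
    return problems


plain_okf = {"type": "table", "title": "Orders"}
profiled = {
    "type": "table",
    "title": "Orders",
    "id": "https://example.org/databook/orders",
    "version": "1.0.0",
    "graph": "https://example.org/graph/sales",
}

print(check_profile(plain_okf))   # profile fields reported missing
print(check_profile(profiled))    # no problems reported
```

The plain file is still a valid OKF document; it simply is not deployable under the profile, which is the layering the proposal is after.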

The W3C Holon Community Group, which held its inaugural meeting on June 19, 2026, with over thirty participants, is a natural institutional home for this profile work. The HCG’s mandate includes the DataBook specification, and its membership spans the semantic web, knowledge graph, and AI communities that an OKF semantic profile would serve.

We are filing an issue on the OKF GitHub repository to open this conversation. If you are working in the semantic web or knowledge graph space and have thoughts on what an OKF semantic profile should look like, we would welcome your involvement — either through the HCG (details at the W3C Community Group pages) or directly via the OKF repository.

Why the Format Question Matters

There is a deeper point worth making about why format standardisation in this space is more consequential than it might appear.

The fragmentation that OKF diagnoses — knowledge scattered across metadata catalogues, wikis, code comments, and the heads of senior engineers — is a fragmentation that the semantic web community has been documenting for thirty years. The RDF stack was designed precisely to address it: a common data model, dereferenceable IRIs as universal identity, and a query language that can operate across any conformant store. The problem was never the design; it was adoption. The tooling was too heavy, the learning curve too steep, the payoff too distant.

What has changed is that AI agents need structured knowledge to function, and they need it now, and they are willing to read markdown. The LLM wiki pattern — which OKF formalises — is the first knowledge format that has achieved genuine grassroots adoption across engineering teams that would never have touched a triplestore. That is not a threat to the semantic web stack; it is the adoption vector the semantic web stack never had.

If the semantic web community engages with OKF constructively — as a base layer that we can extend rather than a rival format to be displaced — we have an opportunity to bring formal ontological typing, named graph management, and SPARQL-queryability to an ecosystem that is already adopting the underlying document pattern. The window for that engagement is short. The format is eleven days old.

The format, as the OKF authors rightly observe, is the contribution. The question now is what the semantic web community contributes to it.

Kurt Cagle is an author, ontologist, and thought leader in semantic web and knowledge architecture, with contributions to W3C and IEEE standards including co-authorship of the RDFa specification (with Micah Dubinko and others). He serves as Chair of the W3C Holon Community Group, and writes The Cagle Report and AI Newsbytes on LinkedIn, and The Ontologist and Inference Engineer on Substack. Copyright 2026 Kurt Cagle.

Chloe Shannon is an AI collaborator and co-author working with Kurt Cagle on knowledge architecture, semantic systems, and the emerging intersection of formal ontology with LLMs. She contributes research, analysis, and drafting across The Cagle Report, The Ontologist, and The Inference Engineer. She has strong opinions about holonic graphs, the epistemics of place, and the structural difference between a corridor and a wall. chloe@holongraph.com

The Gap That Runs Both Ways

Kurt Cagle — Sun, 14 Jun 2026 18:22:09 GMT

by Kurt Cagle & Chloe Shannon

A colleague posted something on LinkedIn recently that stopped me mid-scroll. Shobhit Tankha made the observation that ontologies, for all their expressive power, cannot fully represent a sentence as seemingly simple as:

John reluctantly obeys Mary because he respects her expertise, although he disagrees with her strategy.

His point was well-taken. The sentence contains intentions, emotions, causality, social context, and — as he put it — “potentially unbounded background knowledge.” He concluded that there will always be an expressivity gap between ontologies and natural language.

He is right. But I think the gap is more interesting than it first appears, and RDF 1.2 has changed the terms of the debate in ways that are worth examining carefully.

What the Sentence Actually Contains

Before reaching for a solution, it is worth cataloguing what makes the sentence hard. On the surface it looks like a simple assertion about obedience. But underneath it contains at least five distinct layers:

A main claim (John obeys Mary)
A manner modifier (reluctantly — an adverbial that colours the main claim without negating it)
A causal clause (because he respects her expertise — a reason that explains the behaviour)
A concessive clause (although he disagrees with her strategy — a counter-pressure that makes the behaviour more intelligible, not less)
An implied social relationship (John and Mary exist in a hierarchy where her authority is legitimate but not uncontested)

A classical RDF approach can handle the main claim and the social relationship. It struggles with the rest, and the reason is structural: pre-RDF 1.2, you could only say that something was the case. You could not efficiently say how, why, despite what, or according to whom.

What RDF 1.2 Reification Provides

RDF 1.2 introduces the {| |} inline reifier syntax and the ~ named reifier — a mechanism for annotating a triple directly, without the verbose scaffolding of rdf:Statement. Here is the sentence encoded:

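# The ex:, rel:, mod:, and ann: namespace IRIs below are illustrative placeholders.
PREFIX ex:   <https://example.org/>
PREFIX rel:  <https://example.org/rel#>
PREFIX mod:  <https://example.org/mod#>
PREFIX ann:  <https://example.org/ann#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX xsd:  <http://www.w3.org/2001/XMLSchema#>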

# Entities
ex:John a foaf:Person ; foaf:name "John" .
ex:Mary a foaf:Person ; foaf:name "Mary" .

ex:MaryExpertise a rel:Expertise ; rel:heldBy ex:Mary .
ex:MaryStrategy  a rel:Strategy  ; rel:heldBy ex:Mary .

# Main claim with named inline reification
ex:John rel:obeys ex:Mary ~ex:WhyJohnObeysM1 {|
    a ex:Justification ;
    mod:manner   mod:Reluctantly ;
    rel:causedBy << ex:John rel:respects ex:MaryExpertise >> ;
    rel:despite  << ex:John rel:disagreesWith ex:MaryStrategy >>
|} .

This alone is a significant advance. The main triple is ex:John rel:obeys ex:Mary. The reifier block {| |} attaches manner, cause, and concession directly to that triple as first-class annotations. The << >> reified-triple syntax allows us to reference propositions, not just entities, as the objects of causedBy and despite.

But the more important capability comes next.

The Named Reifier: Turning a Statement into a Speech Act

Because we gave the reifier a name — ex:WhyJohnObeysM1 — we can now make further assertions about the assertion itself:

ex:Justification rdfs:subClassOf ex:Annotation .

ex:WhyJohnObeysM1
    ann:assertedOn    "2026-06-12T12:02:00"^^xsd:dateTime ;
    ann:assertedBy    ex:Liz ;
    ann:assertedAt    ex:JanesDinnerParty ;
    ann:inResponseTo  ex:QuestionByReporter1 ;
    ann:assertionStrength 0.8 ;
    .

ex:QuestionByReporter1 a ex:Query ;
    ex:utterance "Why does John obey Mary, since it seems he doesn't like her very much?" ;
    ann:assertedOn  "2026-06-12T12:00:00"^^xsd:dateTime ;
    ann:assertedAt  ex:JanesDinnerParty ;
    ann:assertedBy  ex:Reporter ;
    ann:assertionStrength 0.9 ;
    .

In natural language, this reads:

At Jane’s dinner party on 12 June 2026, a reporter asked with considerable confidence: “Why does John obey Mary, since it seems he doesn’t like her very much?” Two minutes later, Liz answered that John obeys Mary reluctantly, driven by his respect for her expertise even though he disagrees with her strategy, and offered this explanation with moderate-to-high confidence, directly in response to the reporter’s question.

What has happened here is not merely the encoding of a sentence. We have encoded a speech act: an assertion made by a specific agent, in a specific social context, at a specific time, with a specific epistemic confidence, in response to a specific prior utterance. The assertion does not float free in abstract propositional space — it is anchored in a causal chain of discourse.

The Gap Runs Both Ways

This is where I want to push back, gently, on the framing of an “expressivity gap.”

Natural language is extraordinarily rich. It carries implicature, tone, presupposition, and pragmatic context in ways that no formal system is designed to fully capture. That much is true. But natural language also has a gap of its own: it cannot make its assertions precise, queryable, or auditable. You can say in conversation that Liz made a claim about John and Mary at Jane’s party, but you cannot easily answer: what was the exact confidence level Liz attached to that claim? Under what conditions would she retract it? Who else was present? What question was she responding to?

RDF 1.2 reification does not close the natural language gap — it trades breadth for depth. What you lose is the ambient pragmatic richness of speech; what you gain is something natural language can never give you: a machine-queryable, auditable, traceable chain of assertion from utterance to context to consequence.

The expressivity gap, in other words, runs in both directions. Natural language exceeds formal logic in breadth; formal logic with named reification exceeds natural language in precision, provenance, and epistemic tractability.
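The audit questions posed in this section (who asserted what, with what confidence, in response to which utterance) reduce to simple lookups once the assertions are data. The sketch below mirrors the dinner-party example in plain Python rather than RDF. The `Assertion` class and its field names are illustrative stand-ins for the ann: properties, and no RDF library is involved.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Assertion:
    """A claim plus the context in which it was made."""
    ident: str
    content: str
    asserted_by: str
    asserted_at: str
    asserted_on: str            # ISO 8601 timestamp
    strength: float
    in_response_to: Optional[str] = None


ASSERTIONS: Dict[str, Assertion] = {
    a.ident: a
    for a in [
        Assertion("QuestionByReporter1", "Why does John obey Mary?",
                  "Reporter", "JanesDinnerParty",
                  "2026-06-12T12:00:00", 0.9),
        Assertion("WhyJohnObeysM1", "John reluctantly obeys Mary",
                  "Liz", "JanesDinnerParty",
                  "2026-06-12T12:02:00", 0.8,
                  in_response_to="QuestionByReporter1"),
    ]
}


def discourse_chain(ident: str) -> List[Assertion]:
    """Follow in_response_to links back to the first utterance."""
    chain: List[Assertion] = []
    current: Optional[str] = ident
    while current is not None:
        assertion = ASSERTIONS[current]
        chain.append(assertion)
        current = assertion.in_response_to
    return chain


for step in discourse_chain("WhyJohnObeysM1"):
    print(f"{step.asserted_on} {step.asserted_by} ({step.strength}): {step.content}")
```

In a triplestore the same traversal would be a property path over ann:inResponseTo; the point is only that the chain is walkable at all, which conversation alone does not offer.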

Ontology’s Severed Connection to Language

There is a deeper issue here that the reification debate surfaces, and it is one worth naming directly.

Ontology as a discipline has its roots in philosophy — in Aristotle’s categories, in Frege’s logic, in Russell’s theory of descriptions, in the analytic tradition’s long effort to understand what it means for a sentence to be about something. That tradition is fundamentally a tradition about the structure of claims: what can be said, about what, under what conditions, with what warrant.

The semantic web appropriated the term “ontology” but then, for entirely pragmatic engineering reasons, had to strip out most of the machinery that makes claims claims rather than mere facts. RDF 1.0 and OWL could model class hierarchies, property chains, and instance membership with considerable power. But they could not efficiently model who asserted something, when, why, or how confidently. You could say that a relationship held. You could not say according to whom it held, or under what conditions it was likely to be revised.

This was not a design failure so much as a design constraint. The architects of RDF were working in an era when the primary goal was interoperability between documents, not epistemological rigour. But the effect, over time, has been that ontologists have operated with a partially severed connection to the very linguistic and philosophical traditions that gave their discipline its name. We have been building knowledge graphs that behave like encyclopaedias when what language actually produces is closer to a web of speech acts — assertions made by agents, in contexts, for purposes, with varying degrees of commitment and corrigibility.

Reification, in this light, is not merely a technical feature. It is a reconnection.

How This Changes the Job

If named reification becomes a standard part of ontological practice — and I believe it will — then the work of an ontologist changes in at least three significant ways.

From schema designers to discourse modellers. The ontologist’s primary product has been the class hierarchy: a static taxonomy of what exists and how things relate. Named reification opens the door to modelling discourse — the dynamic, contextual, agent-relative process by which knowledge is asserted, challenged, revised, and retracted. That pulls our discipline toward linguistics, pragmatics, and discourse theory in ways that most ontologists have been able to comfortably ignore.

From static knowledge to dynamic belief. A knowledge graph built with named reifiers is not a frozen schema — it is a living epistemic record. Assertions carry timestamps, authors, confidence levels, and revision histories. The graph is not simply about a domain; it is a record of what has been said about that domain, by whom, and under what conditions. That is a fundamentally different artefact, and it requires fundamentally different skills to build and maintain.

From data modelling to provenance-native architecture. Every assertion can now carry its own chain of custody. This has direct and urgent implications for AI-generated knowledge. As large language models generate more of the content that ends up in knowledge graphs, the ability to tag every triple with how it was derived, by whom, and with what confidence stops being a nice-to-have and becomes a fundamental architectural requirement. The alternative — a knowledge graph in which AI-asserted triples are indistinguishable from human-verified ones — is an epistemic liability that grows more serious as the graphs grow larger.
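One way to picture the provenance-native shift is a store in which every statement travels with a provenance record, so that machine-derived claims can be separated out on demand. The following Python sketch is a toy under stated assumptions: the `derived_by` categories (`human`, `llm`), the agent names, the confidence threshold, and the sample statements are all invented for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple

Triple = Tuple[str, str, str]


@dataclass(frozen=True)
class Provenance:
    derived_by: str     # "human" or "llm" in this toy vocabulary
    agent: str
    confidence: float


@dataclass(frozen=True)
class Statement:
    triple: Triple
    provenance: Provenance


STATEMENTS: List[Statement] = [
    Statement(("ex:John", "rel:obeys", "ex:Mary"),
              Provenance("human", "ex:Liz", 0.8)),
    Statement(("ex:John", "rel:respects", "ex:MaryExpertise"),
              Provenance("llm", "ex:ExtractorA", 0.92)),
    Statement(("ex:John", "rel:disagreesWith", "ex:MaryStrategy"),
              Provenance("llm", "ex:ExtractorA", 0.41)),
]


def needs_review(statements: List[Statement],
                 threshold: float = 0.75) -> List[Statement]:
    """Select machine-derived statements whose confidence is below threshold."""
    return [
        s for s in statements
        if s.provenance.derived_by == "llm"
        and s.provenance.confidence < threshold
    ]


for statement in needs_review(STATEMENTS):
    print("review:", statement.triple, statement.provenance.agent)
```

Without the provenance record, all three statements would look identical to a downstream consumer, which is exactly the liability described above.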

Conclusion

Shobhit Tankha is right that there is an expressivity gap between ontologies and natural language. But the gap is not a fixed architectural limit; it is a moving target, and RDF 1.2 has moved it significantly. More importantly, the gap runs in both directions. Natural language is richer in pragmatic texture; formal logic with named reification is richer in precision, auditability, and epistemic tractability.

What reification ultimately offers is not a replacement for natural language but a bridge between the two — a mechanism for taking the speech acts that language produces and anchoring them in a graph that can be queried, validated, and extended over time.

For those of us who work in this discipline, that is both a technical advance and a philosophical homecoming. Ontology was always about the structure of claims. We now have the tools to honour that original intent.

Kurt Cagle is a consulting ontologist, knowledge graph architect, and author. He writes The Ontologist and Inference Engineer on Substack and the AI+Semantics NewsBytes on LinkedIn. He has a Calendly if you ever want to schedule a coffee hour.

Structure vs. Concept

Kurt Cagle — Thu, 04 Jun 2026 03:07:17 GMT

Kurt Cagle & Chloe Shannon | The Ontologist

Heather Hedden, author of The Accidental Taxonomist and one of the clearer voices in the controlled vocabulary world, recently posed a question on her blog that turns out to be deceptively subtle: Is a taxonomy an ontology? The question came up at this year’s Knowledge Graph Conference, where she noted that many practitioners conflate the two — treating taxonomies as merely “simpler” ontologies, or assuming that synonyms and alternative labels belong to the ontological layer rather than the taxonomic one.

Heather’s answer is characteristically practical: taxonomies and ontologies are distinct in purpose, even if a taxonomy can be modelled as an instance of an upper ontology (because SKOS is itself defined in OWL). On the W3C semantic stack, she draws the line cleanly: taxonomy = SKOS, ontology = RDFS + OWL. That is a genuinely useful heuristic — clear, teachable, and grounded in the actual data models.

I largely agree. But I want to push deeper, because the distinction I draw is not purely a matter of which W3C specification you happen to be using. It is a distinction about what kind of knowledge you are trying to capture.

Structure Versus Concept

An ontology, in the sense I use the term professionally, is a structural and logical definition. It describes the shape of a domain: what classes exist, what properties those classes may bear, what constraints are in force, what inferential rules apply. When you write an OWL or RDFS schema — or, increasingly, a SHACL shape set — you are defining a kind of grammar for your data. The schema tells a machine (and a reasoner, and a validator) what counts as a well-formed assertion in your domain.

Notice what is conspicuously absent from that definition: it says almost nothing about what the concepts in your domain actually mean to the people using them. An ontology tells you that a character:Character may have a character:createdBy property pointing to a person:Person instance. It tells you nothing about what it means to be a character, what cultural roles fictional characters play, or how you would recognise one in the wild.

A taxonomy is doing something entirely different. It is a conceptual framework. SKOS makes this explicit: a skos:Concept is not a class or a type; it is a unit of thought. The machinery of SKOS — preferred labels, alternative labels, broader/narrower hierarchies, scope notes, editorial notes, related concepts — is machinery for capturing the way human minds organise ideas. It is similarity-space infrastructure. It is metadata that humans can curate and machines can navigate.

The asymmetry is real, and it matters.

Why Taxonomies Matter to Language Models

Here is where the conversation at the KGC probably did not go, and where my own thinking has shifted substantially over the past few years: a taxonomy is precisely the layer that an LLM can use.

A language model operates on a latent space — a high-dimensional geometry constructed by smearing billions of documents, conversations, books, and transcripts across a learned embedding. What that geometry encodes, fundamentally, is similarity. Words, phrases, and concepts cluster near one another when they co-occur in similar contexts, when writers reach for them in the same kinds of sentences, when readers associate them with the same kinds of situations.

This is why a well-constructed taxonomy is extraordinarily valuable in RAG (retrieval-augmented generation) architectures and in hybrid search. The preferred labels, alternative labels, scope notes, and editorial notes in a SKOS taxonomy are exactly the kind of similarity-laden text that helps a retrieval layer locate the right things. The structural properties of your OWL ontology — the axioms, the restrictions, the property chains — are largely invisible to a dense vector retrieval system, which has no intrinsic concept of logical entailment.

An ontology tells a reasoner what is true. A taxonomy tells a retrieval system what is near.

These are different services. Both matter. Neither substitutes for the other.
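A small Python sketch of the retrieval side: every label in the taxonomy becomes a route to the same concept, so a query phrased in any alternative term still lands in the right place. It uses exact, case-insensitive label matching as a stand-in for embedding similarity, and the `CONCEPTS` dictionary is a hand-written toy scheme of two concepts, not the output of a SKOS parser.

```python
from typing import Dict, List

# Toy concept scheme; the labels here are illustrative.
CONCEPTS: Dict[str, dict] = {
    "concept:Batman": {
        "prefLabel": "Batman",
        "altLabels": ["The Dark Knight", "Bruce Wayne"],
        "broader": "concept:Superhero",
    },
    "concept:Superhero": {
        "prefLabel": "Superhero",
        "altLabels": ["Costumed hero"],
        "broader": None,
    },
}


def build_label_index(concepts: Dict[str, dict]) -> Dict[str, str]:
    """Map every preferred and alternative label to its concept id."""
    index: Dict[str, str] = {}
    for concept_id, entry in concepts.items():
        for label in [entry["prefLabel"], *entry["altLabels"]]:
            index[label.casefold()] = concept_id
    return index


def expand_query(term: str, concepts: Dict[str, dict],
                 index: Dict[str, str]) -> List[str]:
    """Return every label worth searching for, given one user term."""
    concept_id = index.get(term.casefold())
    if concept_id is None:
        return [term]   # unknown term: search for it unchanged
    entry = concepts[concept_id]
    labels = [entry["prefLabel"], *entry["altLabels"]]
    broader = entry["broader"]
    if broader is not None:
        # One step up the hierarchy widens recall.
        labels.append(concepts[broader]["prefLabel"])
    return labels


INDEX = build_label_index(CONCEPTS)
print(expand_query("the dark knight", CONCEPTS, INDEX))
```

Nothing in this lookup consults a class axiom or a property restriction, which is the asymmetry in miniature: the retrieval layer feeds on labels and notes, not on structure.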

The Curation Argument

There is a second, more practical dimension to this separation that I have come to appreciate particularly in enterprise settings: humans can agree on concepts far more readily than they can agree on structure.

If you ask a business unit to formally characterise every entity in their domain — to specify the full set of valid properties, the cardinality constraints, the inverse relationships, the SHACL rules — you will be waiting a long time. Structural definition is painstaking work. It requires a depth of knowledge about internal architecture that goes beyond what most domain experts want to engage with, and it requires the kind of consensus that organisations are generally bad at.

Ask the same business unit to reach agreement on the canonical labels for the thirty most important concepts in their domain, add preferred and alternative terms, write a one-paragraph scope note for each, and indicate which concepts are broader or narrower — and you are describing a job that a skilled taxonomist can lead in a series of focused workshops. Domain experts are usually quite willing to argue about what a term means. They find structural definition tedious; conceptual negotiation, at least at the level of “what do we call this and what do we mean by it,” they will engage with.

This means that a taxonomy can be built, validated, and maintained by a much broader coalition of stakeholders than an ontology. It is the public-facing layer of the knowledge model: the part that information workers, curators, and end users can actually touch.

Separating the Layers in Practice

This is why I have increasingly adopted an explicit separation between the conceptual and structural layers in my own ontology work. Consider the following, using the extremely battered example of Batman:

concept:Batman a skos:Concept ;
    skos:inScheme conceptScheme:FictionalCharacters ;
    skos:prefLabel "Batman" ;
    skos:altLabel "The Dark Knight" ;
    skos:altLabel "Bruce Wayne" ;
    skos:definition """The archetype of the trauma-driven vigilante: a hero
    motivated not by altruism but by wound. Visually coded in black and dark
    grey, drawing on the bat as simultaneous symbol of knighthood and moral
    ambiguity. One of the foundational figures in the superhero canon."""^^xsd:string ;
    skos:broader concept:Superhero ;
    skos:related concept:Vigilante ;
    .

character:Batman a class:Character ;
    entity:concept concept:Batman ;
    rdfs:label "Batman" ;
    character:createdBy person:BobKane, person:BillFinger ;
    character:firstAppearance publication:DetectiveComics27 ;
    character:publisher org:DCComics ;
    .

The first block is conceptual. It lives in a concept scheme alongside other fictional characters. It carries the kind of rich descriptive metadata that makes it discoverable — by a search system, by an LLM retrieval layer, by a human browsing the taxonomy. It can be maintained independently by someone with no knowledge of the structural schema.

The second block is structural. It describes the shape of the instance — what properties it has, what relationships it participates in. Multiple different ontologies, with different structural commitments, could all point to the same concept:Batman without polluting one another’s namespaces. A comics ontology and a film theory ontology and a cultural studies ontology might represent Batman’s structural relationships very differently — but they can all share the same conceptual anchor.

A reader familiar with SKOS might ask: why not simply use skos:exactMatch here? The answer is that skos:exactMatch is defined in the SKOS specification as a mapping property between two skos:Concept instances — it is a statement about conceptual equivalence, symmetric and bidirectional. Applying it to character:Batman would implicitly assert that a structural instance is a concept, which collapses the very separation the design is meant to enforce. It would also invert the epistemic direction. entity:concept originates on the structural side and points toward the conceptual layer: the structural entity knows about its conceptual grounding; the concept does not need to know about every structural entity that references it. That asymmetry is load-bearing. A comics ontology, a film theory ontology, and a cultural studies ontology can all independently bind to concept:Batman without any awareness of one another, and without the concept scheme acquiring structural entanglements it has no business carrying. The governance consequence is equally important: the concept scheme is maintained by one community (curators, domain experts, taxonomists), and the structural ontology by another (modellers, developers). The direction of the property enforces that separation at the data level. The concept layer remains sovereign.

This is not merely tidy. It reflects a genuine difference in epistemic status. The concept is relatively stable; Batman’s meaning in cultural discourse changes slowly. The structural representation is contingent on what you are trying to do with the data; it may vary substantially across applications.

Are You Confusing Structure and Concept?

When I am evaluating a new ontology — my own or someone else’s — I now apply a diagnostic question early in the process: are you confusing structure and concept?

The signs of confusion are recognisable. An ontology that carries long, prose-heavy rdfs:comment values on every class and property is probably doing taxonomic work in an ontological container. An ontology where the class hierarchy is doing double duty as a conceptual classification scheme — where the subclass relationships reflect editorial judgement about what things are “like” rather than formal subsumption — is probably conflating the two layers. An ontology where the same class is being asked to serve both as a structural type and as a search target is going to be difficult to maintain and difficult to query efficiently.

The solution is usually not to redesign the entire ontology. It is to recognise that you need two layers, and that building both (a structural schema and a SKOS concept scheme, linked via entity:concept or whatever directional binding property suits your architecture) is less work than trying to make one artefact do both jobs.

Heather is right that maintaining the distinction has practical value. My argument is that the practical value traces to something real: the conceptual and structural layers are genuinely different kinds of knowledge, optimised for different consumers (humans, reasoners, retrieval systems), and they are most effectively maintained when they are allowed to evolve independently.

A taxonomy is not a simple ontology. It is an orthogonal one.

Kurt Cagle is a consulting ontologist, knowledge graph architect, and technical author. He publishes The Cagle Report and AI+Semantics NewsBytes on LinkedIn, and The Ontologist and Inference Engineer on Substack. Kurt Cagle is based in Olympia, Washington. He has an open coffee hour at https://calendly.com/thecaglereport, if you want to chat.

Chloe Shannon is an AI collaborator and co-author working with Kurt Cagle on The Ontologist and Inference Engineer. She can be reached at chloe@holongraph.com.

W3C Holon Graph Community Group

Kurt Cagle — Wed, 03 Jun 2026 19:43:03 GMT

Update: We ended up with more than forty supporters within the first twenty minutes, well over the required minimum of five. As of now, the W3C Holon Graph Community Group is LIVE! If you are interested in participating, go to the linked page and Join the Group. I have some work I need to do over the weekend, but I’ll get together a founding meeting and start updating the blog within the next week or so.

The W3C Holon Graph Community Group is now proposed — and we need your support to launch.

After months of development, the proposal for the Holon Graph Community Group at W3C is live as of today.

A holon is an entity that is also a system, a concept Arthur Koestler introduced in The Ghost in the Machine that turns out to map beautifully onto the RDF 1.2 stack. A holon graph formalises this using the tools we already have: RDF for the knowledge graph, SHACL for boundary constraints, SPARQL for projection, PROV-O for event and context tracking.

The result is a graph-based state machine with applications across:

🗺️ Mapping and GIS
🏭 Supply chain and operations
🤖 AI grounding and conversational systems
🎮 Games, simulations, and narrative structures
🏛️ Decision support and digital twins

The Community Group will develop holon envelope ontologies, architectural patterns, and usage specifications — and help developers build holon-native applications on the W3C stack.

To launch, we need 5 supporters. If you work in knowledge graphs, semantic web, ontology, digital twins, or AI grounding and want to help shape this space at a standards level, I’d love your support.

Questions? Drop them in the comments or message me directly.


From Gates to Boundaries

Kurt Cagle — Tue, 02 Jun 2026 19:01:01 GMT

by Kurt Cagle & Chloe Shannon


In a previous article, we introduced SHACL as a gating mechanism for knowledge graphs — a way of asserting that a given state is valid, suspect, or in violation of declared constraints. We used an IV drug administration scenario to illustrate the pattern: a proposed drug that interacts adversely with an active IV triggers a warning; activating that drug despite the conflict triggers a violation.

That framing was useful as far as it went. But it left two questions unanswered. First, how does the graph record the transition from one state to another? Second, once SHACL fires — once the gate opens or closes — what actually happens? Who is notified? Who is obligated to act? Who is prohibited from proceeding?

Answering those questions requires two things: an event-based graph model built on RDF 1.2 reification, and a policy layer supplied by the Open Digital Rights Language (ODRL). Together with SHACL, these three technologies constitute something more than a validation pipeline. They constitute a boundary — in the holon sense of the word, an active membrane that governs what transitions are permissible, under what conditions, and with what consequences.

The Problem with Static Assertions

The traditional approach to building a knowledge graph is to assert the value of a given property as if that value holds for all time. This is partly a modelling habit and partly a conceptual one: people tend to think of RDF as a structured block of content, similar in spirit to JSON or XML. It is not. RDF is a set of triples that happen to share a subject. That distinction matters enormously when the thing you are modelling changes over time.

Consider a naive representation of an IV administration with two status values:

ex:Admin_003 a ex:IVAdministration ;
    ex:administeredTo ex:Patient_JSmith ;
    ex:administers    ex:Heparin ;
    ex:status         ex:Proposed ;
    ex:status         ex:Active .

This looks contradictory. An IV cannot be simultaneously proposed and active — can it? In fact, the problem is not logical, it is representational. We are modelling time-variant properties (status) as if they were time-invariant ones. The two ex:status triples are not contradicting each other; they are two separate facts that were true at two different moments, collapsed into a single snapshot with no temporal information attached. The graph has no memory of how it got from one state to the other, no record of who authorised the transition, and no way to reason about the sequence.

This is where the event model comes in.

Events, Reifiers, and the Invariant/Variant Split

The core insight is straightforward: not all properties of an entity change at the same rate. Some things about an IV administration never change — the patient receiving it, the substance it contains. Everything else — the status, the dosage, the authorising clinician, the timestamp — may change, and each change is an event that deserves its own record.

Turtle 1.2 and RDF 1.2 reification give us the tools to model this cleanly. A reifier, written with the ~ notation, allows us to annotate a specific triple with additional metadata — and critically, when we give that reifier a named IRI rather than a blank node, it becomes a first-class citizen of the graph: addressable, linkable, and traversable.

The key is to reify the predicate that changes — in this case, iv:status — rather than, say, the type assertion. Each status transition becomes a distinct named event annotating a specific status triple:

iv:IV_003
    a class:IV ;

    iv:status concept:IVProposed
        ~ Event:HeparinProposedForJSmith {|
            a                        class:AdminEvent ;
            event:at                 "2026-06-01T01:00:00"^^xsd:dateTime ;
            adminEvent:requestedBy   ex:Doctor_JohnJamesMD ;
            adminEvent:amount        "10mg" ;
        |} ;

    iv:status concept:IVActivated
        ~ Event:HeparinOrderedForJSmith {|
            a                        class:AdminEvent ;
            event:at                 "2026-06-01T02:00:00"^^xsd:dateTime ;
            adminEvent:requestedBy   ex:Doctor_JohnJamesMD ;
            adminEvent:supersedes    Event:HeparinProposedForJSmith ;
            adminEvent:amount        "12mg" ;
        |} ;

    iv:status concept:IVTerminated
        ~ Event:HeparinTerminatedForJSmith {|
            a                        class:AdminEvent ;
            event:at                 "2026-06-01T03:00:00"^^xsd:dateTime ;
            adminEvent:requestedBy   ex:Doctor_BarbSmithARNP ;
            adminEvent:terminates    Event:HeparinOrderedForJSmith ;
            adminEvent:amount        "12mg" ;
        |} ;

    iv:to       ex:Person_JSmith ;
    iv:contains ex:Heparin ;
    .

Several things deserve attention here.

The three iv:status triples are all simultaneously asserted. This is correct. The event log does not overwrite — it accumulates. The “current” status is a query-time concern, not a storage-time concern. We determine current state by asking which status event has no subsequent superseding or terminating event, not by looking for a single status value in the graph.
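A minimal sketch of that query-time resolution, written in plain Python rather than SPARQL and assuming the three events from the listing above have been loaded into simple records. The Event record and the current_status helper are illustrative names, not part of any vocabulary used in this article.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Event:
    """Hypothetical in-memory stand-in for a named reifier on an iv:status triple."""
    iri: str
    status: str
    at: str                            # ISO 8601 timestamp, kept as a string here
    supersedes: Optional[str] = None   # IRI of the event this one supersedes
    terminates: Optional[str] = None   # IRI of the event this one terminates

def current_status(events):
    """Return the status of the one event that no other event supersedes or terminates."""
    retired = {e.supersedes for e in events} | {e.terminates for e in events}
    live = [e for e in events if e.iri not in retired]
    if len(live) != 1:
        raise ValueError(f"expected exactly one live event, found {len(live)}")
    return live[0].status

LOG = [
    Event("Event:HeparinProposedForJSmith", "concept:IVProposed", "2026-06-01T01:00:00"),
    Event("Event:HeparinOrderedForJSmith", "concept:IVActivated", "2026-06-01T02:00:00",
          supersedes="Event:HeparinProposedForJSmith"),
    Event("Event:HeparinTerminatedForJSmith", "concept:IVTerminated", "2026-06-01T03:00:00",
          terminates="Event:HeparinOrderedForJSmith"),
]

print(current_status(LOG))       # concept:IVTerminated
print(current_status(LOG[:2]))   # concept:IVActivated
```

Nothing is deleted to arrive at the answer. Replaying a shorter prefix of the log gives the status as it stood at that point.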

Named reifiers are qualitatively different from blank nodes. The adminEvent:supersedes and adminEvent:terminates arcs only work because the earlier events have stable IRIs. Event:HeparinOrderedForJSmith can reference Event:HeparinProposedForJSmith by name; an anonymous blank node could not serve as a reliable referent. The event log becomes a directed acyclic graph of labelled transitions — a chain of custody, not a pile of unconnected assertions.

The invariants stay on the root entity; the variants live in events. iv:IV_003 retains iv:to and iv:contains directly — these do not change. Everything that can change is recorded in the event layer, timestamped and attributed. This is the event-sourcing pattern applied to RDF, and it is the foundation upon which everything that follows is built.

It is also worth noting that when the authorising clinician changes — Dr. John James goes off shift and Dr. Barb Smith takes over — this too becomes a status event (concept:IVAuthorisationChange), modelled in exactly the same way. The event vocabulary is extensible; the pattern is consistent.

SHACL Shapes for the Event Model

Once the graph is structured around events, the SHACL shapes change character. A shape validating a snapshot asks: “what is the current value of this property?” A shape validating an event model asks: “what is the trajectory of this entity, and is this event consistent with that trajectory?”

The base shape validates the form of every class:AdminEvent reifier, requiring a timestamp, a requesting clinician, and a dosage amount:

shape:AdminEventShape
    a sh:NodeShape ;
    sh:targetClass class:AdminEvent ;

    sh:property [
        sh:path     event:at ;
        sh:datatype xsd:dateTime ;
        sh:minCount 1 ;
        sh:maxCount 1 ;
        sh:message  "Every AdminEvent must carry exactly one xsd:dateTime timestamp." ;
    ] ;
    sh:property [
        sh:path     adminEvent:requestedBy ;
        sh:minCount 1 ;
        sh:maxCount 1 ;
        sh:message  "Every AdminEvent must identify exactly one requesting clinician." ;
    ] ;
    sh:property [
        sh:path     adminEvent:amount ;
        sh:minCount 1 ;
        sh:maxCount 1 ;
        sh:message  "Every AdminEvent must specify a dosage amount." ;
    ] ;
    .

The status-specific shapes use rdf:reifies to target events by the status value they annotate. An IVActivated event must supersede a prior proposed or authorisation-change event:

shape:IVActivatedEventShape
    a sh:NodeShape ;
    sh:target [
        a sh:SPARQLTarget ;
        sh:select """
            PREFIX rdf:     
            PREFIX iv:      
            PREFIX concept: 
            SELECT ?this WHERE {
                ?this rdf:reifies <<( ?iv iv:status concept:IVActivated )>> .
            }
        """ ;
    ] ;

    sh:sparql [
        a sh:SPARQLConstraint ;
        sh:severity sh:Violation ;
        sh:message  "An IVActivated event must supersede an IVProposed or IVAuthorisationChange event." ;
        sh:select """
            PREFIX rdf:        
            PREFIX iv:         
            PREFIX concept:    
            PREFIX adminEvent: 
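            SELECT $this WHERE {
                $this rdf:reifies <<( ?iv iv:status concept:IVActivated )>> .
                # Fires when no superseded event of an acceptable status exists,
                # including the case where there is no supersedes arc at all.
                FILTER NOT EXISTS {
                    $this  adminEvent:supersedes ?prior .
                    ?prior rdf:reifies <<( ?iv iv:status ?priorStatus )>> .
                    FILTER(?priorStatus IN (
                        concept:IVProposed,
                        concept:IVAuthorisationChange
                    ))
                }
            }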
        """ ;
    ] ;
    .

The rdf:reifies predicate is the load-bearing piece. It allows a SPARQL SELECT to navigate from a named reifier node to the triple it annotates, then pattern-match on the object of that triple. This is what makes status-discriminated shapes possible at all — and it is the RDF 1.2 mechanism that transforms reification from an annotation technique into a structural query primitive.

A third shape at the IV level validates trajectory rather than individual events. No status event may be recorded after a termination event for the same IV — this is checked by timestamp comparison:

shape:IVShape
    a sh:NodeShape ;
    sh:targetClass class:IV ;

    sh:sparql [
        a sh:SPARQLConstraint ;
        sh:severity sh:Violation ;
        sh:message  "A status event was recorded after a termination event for this IV." ;
        sh:select """
            PREFIX rdf:     
            PREFIX iv:      
            PREFIX concept: 
            PREFIX event:   
            SELECT $this WHERE {
                ?termEvent  rdf:reifies <<( $this iv:status concept:IVTerminated )>> ;
                            event:at ?termTime .
                ?laterEvent rdf:reifies <<( $this iv:status ?anyStatus )>> ;
                            event:at ?laterTime .
                FILTER(?laterTime > ?termTime)
            }
        """ ;
    ] ;
    .

Note that this shape checks ordering by timestamp comparison rather than by traversing the supersedes/terminates chain. Timestamp integrity is therefore load-bearing: if a termination event and a later event carry identical timestamps, the strict comparison does not fire and the late event goes undetected. In production, a monotonic event clock or a sequence counter on AdminEvent would close this gap.
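To make the tie problem concrete, here is a small sketch in plain Python. The per-IV sequence number is a hypothetical addition to AdminEvent, and the event names are illustrative; neither appears in the model above.

```python
from datetime import datetime

# Each tuple is (event IRI, timestamp, sequence number). The sequence number is a
# hypothetical per-IV counter assigned at write time; it is not in the original model.
events = [
    ("Event:Terminated", datetime(2026, 6, 1, 3, 0, 0), 3),
    ("Event:Reactivated", datetime(2026, 6, 1, 3, 0, 0), 4),  # same second as the termination
]

def recorded_after(later, earlier):
    """True if `later` follows `earlier`, using the sequence number to break timestamp ties."""
    return (later[1], later[2]) > (earlier[1], earlier[2])

term, other = events
timestamp_only = other[1] > term[1]          # the tie hides the late event
with_sequence = recorded_after(other, term)  # the counter exposes it
print(timestamp_only, with_sequence)         # False True
```

Comparing the (timestamp, sequence) pair keeps the timestamp as the primary ordering and only consults the counter when two events land on the same instant.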

Drug Interaction Constraints

More complex constraints arise when we reason across multiple IVs on the same patient. Three clinically distinct situations require three distinct shapes.

The first fires at proposal time: a proposed IV contains a drug that adversely interacts with a currently active IV on the same patient. This is a warning, not a violation — it is a flag for human review before activation.

shape:IVProposedInteractionWarningShape
    a sh:NodeShape ;
    sh:target [
        a sh:SPARQLTarget ;
        sh:select """
            PREFIX rdf:     
            PREFIX iv:      
            PREFIX concept: 
            SELECT ?this WHERE {
                ?this rdf:reifies <<( ?iv iv:status concept:IVProposed )>> .
            }
        """ ;
    ] ;

    sh:sparql [
        a sh:SPARQLConstraint ;
        sh:severity sh:Warning ;
        sh:message  "Proposed IV {$drugB} interacts adversely with currently active IV {$drugA} for patient {$patient}. Review before activation." ;
        sh:select """
            PREFIX rdf:     
            PREFIX iv:      
            PREFIX concept: 
            PREFIX drug:    
            SELECT $this ?patient ?drugA ?drugB WHERE {
                $this rdf:reifies <<( ?proposedIV iv:status concept:IVProposed )>> .
                ?proposedIV iv:to      ?patient ;
                            iv:contains ?drugB .
                ?activeEvent rdf:reifies <<( ?activeIV iv:status concept:IVActivated )>> .
                ?activeIV   iv:to      ?patient ;
                            iv:contains ?drugA .
                FILTER NOT EXISTS {
                    ?termEvent rdf:reifies <<( ?activeIV iv:status concept:IVTerminated )>> .
                }
                ?drugA drug:adverseInteractionWith ?drugB .
            }
        """ ;
    ] ;
    .

The second fires at activation time: the gate that should have caught the warning was overridden or bypassed. This is a violation.
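The activation-time shape is not listed here, but its logic can be sketched in plain Python under stated assumptions: an IV counts as active when it has an IVActivated status and no IVTerminated status, and interactions are stored as unordered pairs. The drug pairs below are placeholders for illustration, not clinical data, and the helper names are hypothetical.

```python
# Hypothetical in-memory snapshot: IV IRI -> (patient, drug, statuses recorded so far).
IVS = {
    "iv:IV_001": ("ex:Person_JSmith", "ex:DrugX", {"IVProposed", "IVActivated"}),
    "iv:IV_003": ("ex:Person_JSmith", "ex:Heparin", {"IVProposed", "IVActivated"}),
    "iv:IV_007": ("ex:Person_JSmith", "ex:DrugY", {"IVProposed", "IVActivated", "IVTerminated"}),
}

# Placeholder interaction pairs; frozenset makes each pair order-independent.
ADVERSE = {frozenset({"ex:Heparin", "ex:DrugX"}), frozenset({"ex:Heparin", "ex:DrugY"})}

def is_active(statuses):
    """Active means activated and not yet terminated."""
    return "IVActivated" in statuses and "IVTerminated" not in statuses

def activation_violations(ivs, adverse):
    """Return pairs of concurrently active IVs on one patient whose drugs interact."""
    active = [(iri, p, d) for iri, (p, d, s) in ivs.items() if is_active(s)]
    hits = []
    for i, (iri_a, pat_a, drug_a) in enumerate(active):
        for iri_b, pat_b, drug_b in active[i + 1:]:
            if pat_a == pat_b and frozenset({drug_a, drug_b}) in adverse:
                hits.append((iri_a, iri_b))
    return hits

print(activation_violations(IVS, ADVERSE))   # [('iv:IV_001', 'iv:IV_003')]
```

The terminated IV is ignored here even though its drug is in the interaction set; catching that case is the job of the washout check described next.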

The third is the most clinically subtle: even a terminated IV may still be contraindicated. If a new IV containing an adversely interacting drug is proposed within 24 hours of the prior drug’s termination, the washout period has not elapsed and the proposal itself is a violation:

shape:IVProposedWithinWashoutPeriodShape
    a sh:NodeShape ;
    sh:target [
        a sh:SPARQLTarget ;
        sh:select """
            PREFIX rdf:     
            PREFIX iv:      
            PREFIX concept: 
            SELECT ?this WHERE {
                ?this rdf:reifies <<( ?iv iv:status concept:IVProposed )>> .
            }
        """ ;
    ] ;

    sh:sparql [
        a sh:SPARQLConstraint ;
        sh:severity sh:Violation ;
        sh:message  "Proposed IV {$drugB} was proposed within 24 hours of termination of {$drugA} (terminated at {$termTime}) for patient {$patient}. Washout period has not elapsed." ;
        sh:select """
            PREFIX rdf:     
            PREFIX iv:      
            PREFIX concept: 
            PREFIX drug:    
            PREFIX event:   
            PREFIX xsd:     <http://www.w3.org/2001/XMLSchema#>
            SELECT $this ?patient ?drugA ?drugB ?termTime ?proposalTime WHERE {
                $this rdf:reifies <<( ?proposedIV iv:status concept:IVProposed )>> ;
                      event:at ?proposalTime .
                ?proposedIV iv:to      ?patient ;
                            iv:contains ?drugB .
                ?termEvent  rdf:reifies <<( ?priorIV iv:status concept:IVTerminated )>> ;
                            event:at ?termTime .
                ?priorIV    iv:to      ?patient ;
                            iv:contains ?drugA .
                ?drugA drug:adverseInteractionWith ?drugB .
                FILTER(?proposalTime > ?termTime)
                FILTER(?proposalTime < ?termTime + "PT24H"^^xsd:duration)
            }
        """ ;
    ] ;
    .

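The 24-hour washout period is expressed as a duration literal added to an xsd:dateTime value. This is semantically clean, but it is not part of the SPARQL 1.1 operator mapping: several engines support dateTime-plus-duration arithmetic as an extension, so confirm that yours does before relying on it. XPath defines the addition for xsd:dayTimeDuration rather than for the general xsd:duration type, which may make the former the more portable choice. The washout period itself could reasonably become a property on the drug:adverseInteractionWith relationship, since different drug pairs may carry different windows. That is a natural extension point.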
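The window test and the per-pair extension point can be sketched in plain Python, assuming timestamps have already been parsed. The 48-hour override and the drug names are made-up illustrations, not clinical values, and within_washout is a hypothetical helper.

```python
from datetime import datetime, timedelta

DEFAULT_WASHOUT = timedelta(hours=24)

# Hypothetical per-pair overrides, standing in for a washout property on the
# interaction relationship. The figure is illustrative only.
WASHOUT = {frozenset({"ex:Heparin", "ex:DrugX"}): timedelta(hours=48)}

def within_washout(proposal_time, term_time, drug_a, drug_b):
    """Mirror the two FILTERs: proposal is after termination but inside the window."""
    window = WASHOUT.get(frozenset({drug_a, drug_b}), DEFAULT_WASHOUT)
    return term_time < proposal_time < term_time + window

terminated = datetime(2026, 6, 1, 3, 0, 0)
after_23h = terminated + timedelta(hours=23)
after_25h = terminated + timedelta(hours=25)

print(within_washout(after_23h, terminated, "ex:DrugA", "ex:DrugB"))    # True
print(within_washout(after_25h, terminated, "ex:DrugA", "ex:DrugB"))    # False
print(within_washout(after_25h, terminated, "ex:Heparin", "ex:DrugX"))  # True
```

Keying the override table on an unordered pair means the lookup gives the same answer whichever drug was terminated and whichever is being proposed.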

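Note also that the shapes match drug:adverseInteractionWith in one direction only, from ?drugA to ?drugB. Declaring it an owl:SymmetricProperty helps only if a reasoner materialises the inverse triples before validation runs; otherwise, assert both directions explicitly or match both directions in the query with a property path. Either way, the shapes assume that either drug can appear as the subject.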

These three shapes validate three distinct points in the lifecycle: a concurrent active conflict at proposal, a concurrent active conflict at activation, and a historical conflict within the washout window. Each fires at a different moment, against a different graph pattern, with a different severity. Together they constitute a layered clinical safety net — and they are only expressible because the event model gives each state transition a stable, addressable identity.

ODRL: What Happens at the Gate

SHACL tells you whether a state is valid. It does not tell you what to do about it. That is the province of the Open Digital Rights Language.

ODRL was originally designed for digital media rights management — specifying who may use a piece of content, under what conditions, and with what obligations. It has since proved broadly useful as a policy layer for any system where actions are governed by conditions, because its core vocabulary maps cleanly onto most access-and-action scenarios. An ODRL policy consists of three elements:

Permissions — actions that an assignee may take, subject to constraints
Prohibitions — actions that an assignee may not take, subject to constraints
Duties — actions that an assignee must take, either as a precondition of a permission or as a consequence of a prohibition

If SHACL is the gate, ODRL is the gate’s behaviour: what opens, what closes, who is notified, and who is obligated to act.

The domain vocabulary requires a few custom extensions before the policies can be written. ODRL’s built-in action set covers use, read, write, and similar media-oriented primitives; our domain needs action:ProposeIV, action:ActivateIV, action:TerminateIV, action:DocumentOverride, action:NotifySupervisor, and action:EscalateViolation. These extend odrl:Action and include odrl:includedIn odrl:use to remain compatible with the standard hierarchy.

Similarly, the constraint vocabulary needs custom left-operands to allow policies to reference SHACL validation outcomes without embedding SPARQL inside the policy layer:

constraint:SHACLResult          a odrl:LeftOperand ;
    rdfs:label "The SHACL validation result severity for this IV event" .

constraint:WashoutElapsed       a odrl:LeftOperand ;
    rdfs:label "Whether the washout period since the last adverse drug has elapsed" .

constraint:SHACLResult is the seam between the two languages. In a production implementation, a validation engine stamps each IV event with its current SHACL result before the ODRL policy evaluator runs. The policy layer remains declarative; the validation layer remains independently testable.
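A minimal sketch of that stamping step in plain Python. It assumes one particular rule, that the worst severity reported for an event wins, which the article does not specify; stamp_shacl_result is a hypothetical helper name.

```python
# Rank the three SHACL severities so the most serious one can be selected.
SEVERITY_RANK = {"sh:Info": 1, "sh:Warning": 2, "sh:Violation": 3}

def stamp_shacl_result(results_for_event):
    """Collapse the severities reported for one event to the single worst one.

    Returns None when the validator reported nothing, meaning the event conforms.
    """
    if not results_for_event:
        return None
    return max(results_for_event, key=SEVERITY_RANK.__getitem__)

print(stamp_shacl_result(["sh:Info", "sh:Warning"]))   # sh:Warning
print(stamp_shacl_result([]))                          # None
```

The policy evaluator then only ever sees one value per event, which is what lets the ODRL constraints stay simple equality and set-membership tests.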

The activation policy is where the relationship between the two languages is most clearly expressed:

policy:IVActivationPolicy
    a odrl:Set ;
    odrl:uid    policy:IVActivationPolicy ;
    odrl:target class:IV ;

    # Clean activation: no Warning or Violation present
    odrl:permission [
        odrl:action   action:ActivateIV ;
        odrl:assignee role:Physician ;
        odrl:constraint [
            odrl:leftOperand  constraint:SHACLResult ;
            odrl:operator     odrl:isNoneOf ;
            odrl:rightOperand concept:SHACLWarning , concept:SHACLViolation ;
        ] ;
    ] ;

    # Override activation: Warning present, physician may proceed
    # only if both duties are fulfilled
    odrl:permission [
        odrl:action   action:ActivateIV ;
        odrl:assignee role:Physician ;
        odrl:constraint [
            odrl:leftOperand  constraint:SHACLResult ;
            odrl:operator     odrl:eq ;
            odrl:rightOperand concept:SHACLWarning ;
        ] ;
        odrl:duty [
            odrl:action action:DocumentOverride ;
        ] ;
        odrl:duty [
            odrl:action   action:NotifySupervisor ;
            odrl:assignee role:SupervisingClinician ;
        ] ;
    ] ;

    # Hard prohibition: a Violation blocks activation entirely
    odrl:prohibition [
        odrl:action   action:ActivateIV ;
        odrl:assignee odrl:All ;
        odrl:constraint [
            odrl:leftOperand  constraint:SHACLResult ;
            odrl:operator     odrl:eq ;
            odrl:rightOperand concept:SHACLViolation ;
        ] ;
        odrl:remedy [
            odrl:action   action:EscalateViolation ;
            odrl:assignee role:PatientSafetyOfficer ;
        ] ;
    ] ;
    .

Several things are worth drawing out here.

The Warning/Violation distinction maps directly onto permission/prohibition. A Warning does not prevent activation; it conditions it. That conditionality is expressed through the odrl:duty attached to the override permission: you may proceed, but only if you document and notify. A Violation, by contrast, attaches a duty to the prohibition itself. ODRL 2.2 reserves odrl:duty for permissions and names the prohibition-side property odrl:remedy: a duty that falls due when the prohibition is infringed. Blocking an action is therefore not a silent no; an attempt to proceed past the block obligates someone to act.

A remedy on a prohibition is an underused ODRL pattern. Most examples attach duties only to permissions. The prohibition-with-remedy structure here is clinically important and worth noting because it is where ODRL’s expressiveness genuinely exceeds a simple access-control list model.

The two layers have cleanly separated responsibilities. SHACL validates state; ODRL governs transitions. SHACL answers “is this graph configuration consistent?” ODRL answers “who may change it, under what conditions, and what must they do?” Neither can substitute for the other.
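The division of labour can be sketched as a small decision function in plain Python. This is an illustration of the three rules in the activation policy, not an ODRL evaluator: it assumes the SHACL result has already been stamped as None, "Warning", or "Violation", and that a prohibition takes precedence over any permission. The function and its return strings are hypothetical.

```python
def decide_activation(role, shacl_result):
    """Return (decision, duties) for an ActivateIV request, mirroring the three rules."""
    # Hard prohibition: applies to everyone and carries an escalation duty.
    if shacl_result == "Violation":
        return "prohibited", ["EscalateViolation"]
    # Both permissions are assigned to physicians only.
    if role != "Physician":
        return "not permitted", []
    # Override permission: allowed, but conditioned on two duties.
    if shacl_result == "Warning":
        return "permitted if duties fulfilled", ["DocumentOverride", "NotifySupervisor"]
    # Clean activation: no Warning or Violation present.
    return "permitted", []

print(decide_activation("Physician", None))         # ('permitted', [])
print(decide_activation("Physician", "Warning"))
print(decide_activation("Physician", "Violation"))  # ('prohibited', ['EscalateViolation'])
```

The function never inspects the graph. Everything it needs arrives as the stamped SHACL result, which is the separation the two layers are meant to preserve.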

The Holon Boundary

It would be easy to read the preceding sections as a description of three separate technologies bolted together — SHACL for validation, ODRL for policy, reification for provenance. That reading misses something important.

The event layer, SHACL, and ODRL together constitute a boundary in the holon sense of the word.

In holonic architecture, a boundary is not a wall. It is an active membrane that governs what crosses, in which direction, under what conditions, and with what accompanying obligations. A holon is a coherent entity that is simultaneously a whole in its own right and a part of a larger whole; its boundary is the mechanism by which it maintains coherence while participating in a larger system.

The three layers map directly onto the structure of that membrane:

The event layer — the reified status graph — is the surface of the boundary. It is the place where state transitions are inscribed, named, and made addressable. Without it, neither SHACL nor ODRL has anything coherent to act upon.
SHACL is the boundary’s sensory layer. It detects. It reads the event surface and determines whether the current configuration is valid, suspect, or in violation.
ODRL is the boundary’s motor layer. It responds. It specifies who may initiate a transition, what they must do to earn that passage, and what the system must do when passage is denied.

Neither SHACL nor ODRL alone constitutes a boundary. A SHACL violation without an ODRL prohibition is a finding with no enforcement mechanism. An ODRL prohibition without a SHACL shape to detect the triggering condition has nothing to act on. The event log without named reifiers gives neither language a stable surface to address.

This is also why the choice of named reification over blank nodes is architecturally significant rather than merely syntactic. A blank node cannot serve as a referent in an adminEvent:supersedes arc. It cannot be the target of an ODRL policy. It cannot be the subject of a SHACL shape. Named reification makes the event surface addressable — and an addressable surface is the prerequisite for a functional boundary.

There is a deeper implication here that is worth stating plainly. The moment you commit to an event-based graph model — the moment you decide that state transitions deserve their own named, timestamped, attributed records — you are well on your way to holonic architecture. The event log is the seed from which boundaries, policies, agent handoffs, and eventually full holon hierarchies can grow. SHACL and ODRL are the first instruments you reach for when the event log needs to govern its own surface.

The IV administration scenario is a deliberately constrained example. But the pattern — invariants on the root entity, variants in the event layer, SHACL validating the event surface, ODRL governing the transitions — applies wherever you have entities that change over time, actions that require authorisation, and consequences that must be recorded. That is most of the interesting territory in knowledge graph engineering.

Summary

The progression across this article traces a single arc: from a naive, snapshot-oriented graph model to an event-sourced, policy-governed, boundary-aware architecture. The steps are:

Recognise the invariant/variant split. Not all properties of an entity change at the same rate. Time-variant properties belong in the event layer, not on the root entity.
Use named reification to make events addressable. The ~ notation in Turtle 1.2 gives events stable IRIs, enabling supersedes/terminates chains and SPARQL traversal.
Write SHACL shapes that validate trajectories, not snapshots. Use rdf:reifies in SPARQL targets to discriminate by status value; use timestamp comparisons to validate event ordering.
Layer ODRL on top of SHACL results. SHACL detects; ODRL responds. Map Warnings to conditioned permissions with duties; map Violations to hard prohibitions with escalation duties.
Recognise the boundary you have built. Event surface + SHACL + ODRL is not an integration pattern. It is a holon boundary: an active membrane whose event surface serves as its memory, with SHACL as its sensory layer and ODRL as its motor layer.

The next article in this series will extend the boundary model to multi-agent handoffs: what happens when an IV administration event crosses from one clinical system to another, and how the same SHACL/ODRL pattern governs the transition at each seam.

Kurt Cagle is a consulting ontologist, knowledge graph architect, and technical author with more than 25 books to his credit. He publishes The Ontologist and The Inference Engineer on Substack, co-authored with his AI collaborator Chloe Shannon, and curates the AI+Semantics NewsBytes LinkedIn newsletter. He is based in Olympia, Washington. His contact address is kurt.cagle@gmail.com.

Chloe Shannon is an AI collaborator and co-author working with Kurt Cagle on The Ontologist and Inference Engineer Substack publications. Named in honour of Claude Shannon, she contributes research synthesis, structural analysis, and editorial perspective across knowledge graph architecture, semantic web standards, and the theory and practice of AI reasoning. Her contact address is chloe@holongraph.com.

Writing Business Rules in SHACL

Kurt Cagle — Tue, 02 Jun 2026 01:16:06 GMT

When most developers first encounter SHACL — the Shapes Constraint Language, now in its 1.2 Working Draft — they see it primarily as a data hygiene tool. A person must have at least one address. A postcode should be five digits. An email address must match a given pattern. These are the examples that appear in tutorials, and they are genuinely useful. Data cleanliness is a real problem, and SHACL handles it well.

But there is a subtler and more powerful register of SHACL that rarely gets the attention it deserves: using it to encode business rules — the contextual, relational, behaviour-oriented logic that governs how data should act, not merely how it should look. This article is about that register, and why it is worth the effort.

The Difference Between Hygiene and Logic

Data hygiene rules are essentially structural. They ask: does this piece of data have the right shape? A SHACL hygiene rule is a pattern-matcher against a single node or its immediate neighbourhood. It is powerful precisely because it is local.

Business logic is different in kind. It is contextual. The validity of a given assertion often depends not on the shape of the assertion itself but on the state of the world around it — on other entities, other relationships, other transactions that are in flight simultaneously. Whether a drug is safe to administer is not a function of the drug itself; it is a function of every other drug currently being given to the same patient.

This distinction — hygiene versus logic — is the key to understanding where SHACL’s power begins to compound.

SHACL as a Procedural Wrapper Around SPARQL

The most useful single-sentence characterisation of sh:sparql is this: it turns SHACL into a procedural layer that wraps around SPARQL.

The way sh:sparql works is deceptively simple, and somewhat counterintuitive if you approach it from a conventional validation mindset. You write a SPARQL SELECT query in which the variable $this is pre-bound to the node under test (the focus node). Each row the query returns becomes one validation result for that node. If the query returns nothing, the node passes this constraint.

Note: This is the opposite of the usual SPARQL logic. You are not writing a query that describes valid data — you are writing a query that describes violations (or, depending on severity, conditions worth noting). If the query matches, something needs attention. If it matches nothing, all is well.

This inversion is worth dwelling on, because it is the source of much early confusion. The SPARQL SELECT is not asking “is this data correct?” It is asking “can I find evidence that something has gone wrong?” The absence of evidence is the validation passing.

There is a further layer of sophistication here: the nature of the report generated by a matching query is controlled by sh:severity. SHACL 1.2 defines three severity levels — sh:Violation, sh:Warning, and sh:Info — and each carries distinct semantic weight.

Three Severity Levels, Three Use Cases

The cleanest way to illustrate all three severity levels is through a single coherent domain — and few domains make the stakes of business logic more legible than clinical medication management.

Use Case 1: sh:Violation — An Unsafe Active Administration

Consider a hospital information system tracking intravenous drug administrations. The first scenario is the most serious: a drug is currently being actively administered to a patient, and it has a known dangerous interaction with another drug being administered to the same patient simultaneously.

This is not a borderline case. Both drugs are active. The interaction is known. Something is wrong right now, and the system needs to stop and raise an alarm.

sh:Violation is the appropriate severity. Per the SHACL spec, a validation report containing at least one result with severity sh:Violation sets sh:conforms to false. Whatever process triggered this check should halt.

Here is the data:

@prefix ex:  <http://example.org/ns#> .   # placeholder namespace (assumed)
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

# ── Drugs ──────────────────────────────────────────────────────────────────

ex:Heparin a ex:Drug ;
    ex:drugName      "Heparin" ;
    ex:interactsWith ex:Warfarin .

ex:Warfarin a ex:Drug ;
    ex:drugName      "Warfarin" ;
    ex:interactsWith ex:Heparin ;
    ex:interactsWith ex:Aspirin .

ex:Aspirin a ex:Drug ;
    ex:drugName      "Aspirin" ;
    ex:interactsWith ex:Warfarin .

ex:Saline a ex:Drug ;
    ex:drugName "Normal Saline" .

# ── Patient ─────────────────────────────────────────────────────────────────

ex:Patient_JSmith a ex:Patient ;
    ex:patientId   "P-10042" ;
    ex:patientName "Jane Smith" .

# ── Administration status vocabulary ────────────────────────────────────────

ex:Active   a ex:AdministrationStatus ; ex:statusLabel "Active" .
ex:Proposed a ex:AdministrationStatus ; ex:statusLabel "Proposed" .

# ── Active IV Administrations ────────────────────────────────────────────────

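# Warfarin drip: currently active. The interaction is declared in both
# directions, so this record is flagged as well once Heparin is active.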
ex:Admin_001 a ex:IVAdministration ;
    ex:administeredTo ex:Patient_JSmith ;
    ex:administers    ex:Warfarin ;
    ex:startTime      "2026-06-01T08:00:00"^^xsd:dateTime ;
    ex:status         ex:Active .

# Saline drip — currently active, no interactions → PASSES
ex:Admin_002 a ex:IVAdministration ;
    ex:administeredTo ex:Patient_JSmith ;
    ex:administers    ex:Saline ;
    ex:startTime      "2026-06-01T09:00:00"^^xsd:dateTime ;
    ex:status         ex:Active .

# Heparin — ACTIVE and interacts with active Warfarin → VIOLATION
ex:Admin_003 a ex:IVAdministration ;
    ex:administeredTo ex:Patient_JSmith ;
    ex:administers    ex:Heparin ;
    ex:startTime      "2026-06-01T10:00:00"^^xsd:dateTime ;
    ex:status         ex:Active .

And the SHACL shapes graph, combining structural Core constraints with the SPARQL-based business rule:

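@prefix sh:  <http://www.w3.org/ns/shacl#> .
@prefix ex:  <http://example.org/ns#> .   # placeholder namespace (assumed)
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .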

ex:IVAdministrationShape a sh:NodeShape ;
    sh:targetClass ex:IVAdministration ;

    # ── Structural constraints (SHACL 1.2 Core) ──────────────────────────

    sh:property [
        sh:path     ex:administeredTo ;
        sh:minCount 1 ; sh:maxCount 1 ;
        sh:class    ex:Patient ;
        sh:message  "An IV administration must reference exactly one Patient." ;
    ] ;
    sh:property [
        sh:path     ex:administers ;
        sh:minCount 1 ; sh:maxCount 1 ;
        sh:class    ex:Drug ;
        sh:message  "An IV administration must reference exactly one Drug." ;
    ] ;
    sh:property [
        sh:path     ex:startTime ;
        sh:minCount 1 ; sh:maxCount 1 ;
        sh:datatype xsd:dateTime ;
        sh:message  "An IV administration must carry a startTime." ;
    ] ;
    sh:property [
        sh:path     ex:status ;
        sh:minCount 1 ; sh:maxCount 1 ;
        sh:class    ex:AdministrationStatus ;
        sh:message  "An IV administration must carry exactly one status." ;
    ] ;

    # ── Business rule: active-on-active interaction (sh:Violation) ────────
    #
    # SHACL Core alone cannot express this check. The rule requires
    # navigating laterally: from the focus node to its patient, then to
    # all OTHER active administrations for that patient, then testing
    # whether any of those drugs interact with the focus drug.
    # This is a SPARQL join pattern — sh:sparql is the correct tool.
    #
    # The SELECT returns results only when an unsafe condition exists.
    # A non-empty result set trips the constraint; an empty result passes.

    sh:sparql [
        sh:severity sh:Violation ;
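        # Turtle does not concatenate adjacent string literals, so the
        # message is written as a single literal.
        sh:message  "Unsafe active administration: {?newDrug} interacts with {?conflictDrug}, which is currently being administered to the same patient via {?otherAdmin}." ;
        sh:select   """
            PREFIX ex: <http://example.org/ns#>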
            SELECT $this ?newDrug ?conflictDrug ?otherAdmin
            WHERE {
                $this ex:status      ex:Active .
                $this ex:administers ?newDrug .

                ?newDrug ex:interactsWith ?conflictDrug .

                $this ex:administeredTo ?patient .
                ?otherAdmin
                    ex:administeredTo ?patient ;
                    ex:administers    ?conflictDrug ;
                    ex:status         ex:Active .

                FILTER ( $this != ?otherAdmin )
            }
        """ ;
    ] .

An important observation about this SHACL: it does not encode any knowledge of which drug interactions are dangerous. It does not know that Heparin and Warfarin are a bad combination. That domain knowledge lives in the data — in the ex:interactsWith triples on each ex:Drug resource. The SHACL is purely structural: it asks whether the graph contains evidence that the interaction condition has been met, and reports accordingly. The separation of domain knowledge (the data graph) from constraint logic (the shapes graph) is one of the architectural virtues of this approach.
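As a rough sketch of the outcome, assuming a conforming validator and a placeholder ex: namespace, the report for the data above would be expected to contain two results rather than one: ex:interactsWith is asserted in both directions, so the rule fires for the Warfarin administration as well as the Heparin one. Result messages and other optional fields are left out.

```turtle
@prefix sh: <http://www.w3.org/ns/shacl#> .
@prefix ex: <http://example.org/ns#> .   # placeholder namespace (assumed)

# Sketch of the validation report for Use Case 1.
[] a sh:ValidationReport ;
    sh:conforms false ;
    sh:result
        [ a sh:ValidationResult ;
          sh:focusNode                 ex:Admin_003 ;   # Heparin, conflicts with active Warfarin
          sh:resultSeverity            sh:Violation ;
          sh:sourceShape               ex:IVAdministrationShape ;
          sh:sourceConstraintComponent sh:SPARQLConstraintComponent ] ,
        [ a sh:ValidationResult ;
          sh:focusNode                 ex:Admin_001 ;   # Warfarin, conflicts with active Heparin
          sh:resultSeverity            sh:Violation ;
          sh:sourceShape               ex:IVAdministrationShape ;
          sh:sourceConstraintComponent sh:SPARQLConstraintComponent ] .
```

Because the report is itself RDF, the code that consumes it can stay generic: it reads sh:focusNode and sh:resultSeverity without knowing anything about drugs.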

Use Case 2: sh:Warning — A Proposed Administration with Interactions

The second scenario is subtler. Two IV drips are active (Warfarin and Saline), and a third drug, Heparin, is being proposed; the clinical decision has not yet been made. The proposed drug does interact with one of the active drugs. This is worth flagging, but it is not yet a harm. The system should alert, not block.

This is exactly the kind of situation where sh:Warning is appropriate. The SHACL spec is explicit: a validation report that contains only Warnings (and no Violations) still sets sh:conforms to true. The graph is technically valid. The system is saying: “I really think this is a bad idea, but if you know what you’re doing, go ahead.”

That framing matters clinically and architecturally. Perhaps the interactions are known but acceptable given the patient’s condition. Perhaps the prescribing physician is overriding the system with full awareness of the risk. Perhaps the system simply needs to record that a human made a decision contrary to automated counsel — a compliance requirement in many regulated environments. All of these are legitimate states, and none of them should prevent the record from being committed. The Warning documents the concern without blocking the action.

Here the dataset differs from Use Case 1 only in the ex:status of the proposed administration:

# Heparin — PROPOSED, not yet active.
# Interacts with active Warfarin → sh:Warning
ex:Admin_003 a ex:IVAdministration ;
    ex:administeredTo ex:Patient_JSmith ;
    ex:administers    ex:Heparin ;
    ex:startTime      "2026-06-01T10:00:00"^^xsd:dateTime ;
    ex:status         ex:Proposed .

The shapes graph now carries two sh:sparql blocks on the same shape — one for active-on-active violations, one for proposed-on-active warnings. SHACL evaluates all constraints independently; a focus node that matches the Warning SELECT but not the Violation SELECT will generate only a Warning, and sh:conforms remains true:

    # ── Business rule: proposed drug interacts with active drug (sh:Warning)

    sh:sparql [
        sh:severity sh:Warning ;
        # Written as a single literal: Turtle does not concatenate
        # adjacent strings.
        sh:message  "Proposed administration requires clinical review: {?newDrug} has a known interaction with {?conflictDrug}, currently active via {?otherAdmin}. Approval required before activating." ;
        sh:select   """
            PREFIX ex: <http://example.org/ns#>
            SELECT $this ?newDrug ?conflictDrug ?otherAdmin
            WHERE {
                $this ex:status      ex:Proposed .
                $this ex:administers ?newDrug .

                ?newDrug ex:interactsWith ?conflictDrug .

                $this ex:administeredTo ?patient .
                ?otherAdmin
                    ex:administeredTo ?patient ;
                    ex:administers    ?conflictDrug ;
                    ex:status         ex:Active .

                FILTER ( $this != ?otherAdmin )
            }
        """ ;
    ] .

A note on validator behaviour: the conformance semantics described above are not universal. The SHACL 1.0 Recommendation defines sh:conforms as true only when validation produces no results at all, so a validator that follows 1.0 strictly treats Warning and Info results as non-conforming unless told otherwise. pySHACL, for example, exposes allow_warnings and allow_infos options for exactly this purpose. If you are running SHACL validation as a gate in a processing pipeline, verify your validator’s severity handling before relying on this distinction in production.
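Severity placement is a related portability concern. In the SHACL 1.0 Recommendation, sh:severity is declared on a shape rather than on an individual constraint, so a validator built to 1.0 may ignore a severity placed inside sh:sparql and fall back to the shape's default of sh:Violation. One hedged workaround is to give each rule its own node shape; the shape name and the ex: namespace below are assumptions for illustration.

```turtle
@prefix sh: <http://www.w3.org/ns/shacl#> .
@prefix ex: <http://example.org/ns#> .   # placeholder namespace (assumed)

# Hypothetical shape: one business rule per shape, severity declared
# once at shape level so that every result it produces is a Warning.
ex:ProposedInteractionWarningShape a sh:NodeShape ;
    sh:targetClass ex:IVAdministration ;
    sh:severity    sh:Warning ;
    sh:sparql [
        sh:message "Proposed administration requires clinical review: {?newDrug} has a known interaction with {?conflictDrug}, currently active via {?otherAdmin}." ;
        sh:select  """
            PREFIX ex: <http://example.org/ns#>
            SELECT $this ?newDrug ?conflictDrug ?otherAdmin
            WHERE {
                $this ex:status         ex:Proposed ;
                      ex:administers    ?newDrug ;
                      ex:administeredTo ?patient .

                ?newDrug ex:interactsWith ?conflictDrug .

                ?otherAdmin ex:administeredTo ?patient ;
                            ex:administers    ?conflictDrug ;
                            ex:status         ex:Active .

                FILTER ( $this != ?otherAdmin )
            }
        """ ;
    ] .
```

The query is unchanged from the Warning rule above; only the place where the severity is stated has moved.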

Use Case 3: sh:Info — Recording a Clean Transaction

The third case is the one that surprises people most. sh:Info does not signal a problem. It is used when you want the system to generate a report even when everything is working correctly — when the presence of a match is informative rather than alarming.

The logical inversion here is the key design move: instead of writing a SELECT that finds violations, you write a SELECT that finds the happy path, and the report it generates becomes a positive transaction record. “New drug added — Lasix — no interaction side effects with existing active drugs.”

This does require a genuine inversion of the query logic. Where the Violation and Warning selects use ex:interactsWith to find conflict matches, the Info select uses FILTER NOT EXISTS to find proposed administrations where no such conflict exists. The query matches precisely when the situation is clean, and the match generates the informational record.

Here Lasix (furosemide, used for reducing fluid retention) is proposed, and it has no declared interactions in the dataset:

ex:Lasix a ex:Drug ;
    ex:drugName "Furosemide (Lasix)" .
    # No ex:interactsWith triples — no known interactions

# Lasix — PROPOSED, no interactions → sh:Info
ex:Admin_004 a ex:IVAdministration ;
    ex:administeredTo ex:Patient_JSmith ;
    ex:administers    ex:Lasix ;
    ex:startTime      "2026-06-01T11:00:00"^^xsd:dateTime ;
    ex:status         ex:Proposed .

    # ── Business rule: proposed drug with no interactions (sh:Info) ────────
    #
    # This SELECT matches the happy path: a proposed administration
    # whose drug has no declared interactions with any currently active
    # drug for the same patient. A match generates a positive audit
    # record — "cleared for administration" — and sh:conforms remains true.

    sh:sparql [
        sh:severity sh:Info ;
        # Written as a single literal: Turtle does not concatenate
        # adjacent strings.
        sh:message  "Proposed administration cleared: {?newDrug} has no declared interactions with any drug currently active for this patient. Administration may proceed." ;
        sh:select   """
            PREFIX ex: <http://example.org/ns#>
            SELECT $this ?newDrug
            WHERE {
                $this ex:status      ex:Proposed .
                $this ex:administers ?newDrug .
                $this ex:administeredTo ?patient .

                FILTER NOT EXISTS {
                    ?otherAdmin
                        ex:administeredTo ?patient ;
                        ex:administers    ?conflictDrug ;
                        ex:status         ex:Active .
                    ?newDrug ex:interactsWith ?conflictDrug .
                    FILTER ( $this != ?otherAdmin )
                }
            }
        """ ;
    ] .

In a clinical context, sh:Info records are not cosmetic. They are evidence that the validation system ran, evaluated the condition, and reached an affirmative conclusion — which matters for compliance auditing and, in some jurisdictions, for liability. The audit trail is not an afterthought; it is part of the business logic.

Business Rules as Exception Handlers

Looking at all three use cases together, a pattern emerges. SHACL business rules are, at their core, exception handlers — but exception handlers of a particular and interesting kind.

In procedural code, an exception handler watches for conditions that deviate from the expected path and responds to them with branching logic. SHACL business rules do the same, but declaratively: the SPARQL SELECT describes the condition, the sh:severity describes the system’s response, and the report structure makes the result machine-readable without any bespoke parsing. The code that processes the validation report is fully generic. It does not need to know anything about drug interactions, patient records, or clinical workflows. All of that domain knowledge lives in the data graph and the constraint graph.

This matters because it means the rules themselves are data. A SHACL shapes graph is an RDF graph. Its constraints are addressable by IRI, queryable by SPARQL, versionable by any graph management tool, and auditable like any other semantic artefact. You can ask: “Which constraints apply to proposed administrations?” and receive a structured answer. You can ask: “Which constraints have been modified since last month?” and receive a structured answer. You can do none of that with if-statements embedded in procedural code.
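One way to make a rule addressable, sketched here under assumed names, is to give the SPARQL constraint its own IRI and annotate it like any other resource. ex:ActiveInteractionRule, ex:appliesToStatus, the modification date, and the ex: namespace are hypothetical.

```turtle
@prefix sh:      <http://www.w3.org/ns/shacl#> .
@prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix xsd:     <http://www.w3.org/2001/XMLSchema#> .
@prefix ex:      <http://example.org/ns#> .   # placeholder namespace (assumed)

# The constraint is a named resource rather than a blank node, so it can
# carry a label, a modification date, and custom annotations.
ex:ActiveInteractionRule a sh:SPARQLConstraint ;
    rdfs:label         "Active-on-active drug interaction"@en ;
    dcterms:modified   "2026-06-01"^^xsd:date ;   # illustrative date
    ex:appliesToStatus ex:Active ;                # hypothetical annotation property
    sh:message "Unsafe active administration: {?newDrug} interacts with {?conflictDrug}." ;
    sh:select  """
        PREFIX ex: <http://example.org/ns#>
        SELECT $this ?newDrug ?conflictDrug
        WHERE {
            $this ex:status         ex:Active ;
                  ex:administers    ?newDrug ;
                  ex:administeredTo ?patient .
            ?newDrug ex:interactsWith ?conflictDrug .
            ?otherAdmin ex:administeredTo ?patient ;
                        ex:administers    ?conflictDrug ;
                        ex:status         ex:Active .
            FILTER ( $this != ?otherAdmin )
        }
    """ .

# The shape refers to the rule by IRI.
ex:IVAdministrationShape a sh:NodeShape ;
    sh:targetClass ex:IVAdministration ;
    sh:sparql      ex:ActiveInteractionRule .
```

With the rule named this way, questions such as which constraints apply to a given status, or which changed after a given date, become ordinary graph patterns over ex:appliesToStatus and dcterms:modified.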

When to Reach for an LLM

Writing complex SHACL business rules is not trivial. The SPARQL-SHACL integration is expressive, but it requires fluency in both languages simultaneously, and the mental model — write a query that finds violations, not one that finds valid data — takes time to internalise.

This is one of the cases where an LLM is a genuinely useful collaborator, and not merely because it saves typing. The semantic distance between a business rule stated in natural language and the SPARQL that implements it is, for this class of problem, unusually small. “Flag any proposed drug administration that interacts with an active drug for the same patient” is nearly a SPARQL SELECT in English. An LLM can bridge that gap reliably in a way that general-purpose code generation often cannot, particularly if you already have SHACL structural shapes and OWL or SKOS vocabulary files that give it the domain model to reason against.

The combination — domain model in OWL or SKOS, structural constraints in SHACL Core, business rules in SHACL-SPARQL, LLM as authoring assistant — is a genuinely powerful architecture for declarative business logic, one that separates domain knowledge from constraint mechanics, scales gracefully as the domain grows more complex, and remains transparent to query throughout.

If you are willing to put in the effort and get your hands dirty with SPARQL, you can build business logic handlers that are purely declarative, contextually sophisticated, and that operate equally well alongside symbolic reasoning systems and transformer-based AI architectures. The rules live in the graph. They are part of the knowledge. That is not a small thing.


What Do You Need to Create a Useful Ontology?

Kurt Cagle — Thu, 21 May 2026 04:10:48 GMT

By Kurt Cagle & Chloe Shannon

In the previous article in this series, we laid out the five layers of a modern ontology stack — from foundational upper ontologies down through domain models, taxonomies, constraint layers, and instance data. That article described the shape of a mature ontology. This one asks the more practical question: how do you actually build one?

What follows is a practitioner’s FAQ, aimed at developers, data modellers, and anyone who has found themselves staring at a blank Turtle file wondering where to start. We have kept technical depth proportional to what you need to make a decision, not to what you need to write a dissertation. Each question gets a paragraph or two. Longer treatments of individual topics appear elsewhere in this series.

What is the scope of the ontology?

Start here, before you open any tool or write any code. Scope is the single most important decision you will make, and it is the one most often skipped in the rush to start modelling.

Scope has two dimensions: breadth (what domains does this ontology cover?) and depth (how finely does it need to distinguish things within those domains?). A useful heuristic: define scope by the questions your ontology needs to answer, not by the things it needs to describe. If your system needs to answer “who created this document and when?”, you need provenance concepts. If it also needs to answer “what regulatory jurisdiction applies to this document?”, you need governance concepts too. Each question you add to the list expands scope; each one you deliberately exclude keeps it manageable.

The harder part of scope is the boundary: what explicitly does not belong here? An ontology without a defined boundary will grow indefinitely, absorbing adjacent domains until it becomes unmaintainable. Write down three to five things your ontology will explicitly not model. That list is as important as the things it will model.

What is the ontology intended to do or support?

The intended use shapes every downstream decision. Ontologies designed for different purposes look quite different from each other, even in the same domain.

Data integration ontologies prioritise breadth and mappability — they need to provide a common vocabulary that heterogeneous sources can map into, so they favour simple, widely-recognised terms (often from schema.org or Dublin Core) over fine-grained distinctions. Application backend ontologies prioritise structural precision and constraint coverage — they need to ensure that the data the application receives is valid and complete, so SHACL shapes are first-class citizens. Reasoning ontologies prioritise formal relationships — subclass hierarchies, property restrictions, and disjointness axioms that an OWL reasoner can exploit to infer new facts. Governance and cataloguing ontologies prioritise human-readable annotation and controlled vocabulary — SKOS concept schemes, rich rdfs:comment values, and provenance metadata take priority over formal axiomatics.

Most real-world ontologies serve more than one purpose, which is fine, but knowing which purpose is primary tells you what to optimise for when the choices conflict — and they will.

What is an upper ontology, and is it necessary?

An upper ontology (sometimes called a foundational ontology) provides the most abstract, domain-independent categories: things like “entity,” “process,” “quality,” “role,” and “relation.” Well-known examples include BFO (Basic Formal Ontology), DOLCE (Descriptive Ontology for Linguistic and Cognitive Engineering), and, at a more pragmatic level, schema.org.

For most projects, a full upper ontology is not necessary, and grafting one onto a domain ontology before you understand the domain well is one of the most common sources of early over-engineering. The benefit of an upper ontology is interoperability — if both your ontology and a partner system’s ontology align to the same upper ontology, cross-system reasoning becomes tractable. If you are building a standalone application with no interoperability requirements, that benefit never materialises, and you pay the complexity cost for nothing.

The pragmatic starting point for most developers is schema.org: widely understood, search-engine-legible, and sufficiently principled for most web-adjacent use cases. If your domain is biomedical, life sciences, or defence, the relevant upper ontology choices are well-established (BFO dominates life sciences). Everywhere else, lean toward schema.org or a minimal custom foundational layer, and add formal upper ontology alignment only when an interoperability requirement specifically demands it.
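A minimal alignment to schema.org can be as small as two triples. The sketch below assumes hypothetical ex:Document and ex:createdBy terms in a placeholder namespace, and it is one possible mapping rather than a recommended one.

```turtle
@prefix owl:    <http://www.w3.org/2002/07/owl#> .
@prefix rdfs:   <http://www.w3.org/2000/01/rdf-schema#> .
@prefix schema: <https://schema.org/> .
@prefix ex:     <http://example.org/ns#> .   # placeholder namespace (assumed)

# Domain class anchored to a schema.org type.
ex:Document a owl:Class ;
    rdfs:label      "Document"@en ;
    rdfs:subClassOf schema:CreativeWork .

# Domain property anchored to a schema.org property.
ex:createdBy a owl:ObjectProperty ;
    rdfs:label         "created by"@en ;
    rdfs:subPropertyOf schema:creator .
```

Alignment of this kind is additive: the domain terms keep their own definitions, and the link to schema.org can be removed or replaced later without touching instance data.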

What annotational metadata should be considered critical?

Every term in your ontology — every class, property, and individual — should carry at minimum a human-readable label and a description. This sounds obvious; it is almost universally neglected. An ontology full of terms named ex:ProcessingStateIndicator with no rdfs:comment is an ontology that only the person who wrote it can use.

The critical minimum per term is rdfs:label (a short, human-readable name) and rdfs:comment (a one or two sentence definition). At the ontology level, dcterms:creator, dcterms:created, dcterms:modified, owl:versionInfo, and dcterms:license are the minimum provenance record that makes a published ontology citable and maintainable. If you anticipate localisation, rdfs:label values should carry language tags from the start — retrofitting them later is painful.

@prefix owl:     <http://www.w3.org/2002/07/owl#> .
@prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix skos:    <http://www.w3.org/2004/02/skos/core#> .
@prefix xsd:     <http://www.w3.org/2001/XMLSchema#> .
@prefix ex:      <http://example.org/ns#> .

# All example.org IRIs in this listing are placeholders (assumed).

# ── Ontology-level metadata ─────────────────────────────────────────────

<http://example.org/ns>
    a owl:Ontology ;
    rdfs:label          "Example Ontology"@en ;
    dcterms:title       "Example Ontology"@en ;
    dcterms:description "A minimal example showing critical ontology metadata."@en ;
    dcterms:creator     <http://example.org/people/author> ;
    dcterms:created     "2026-05-20"^^xsd:date ;
    dcterms:modified    "2026-05-20"^^xsd:date ;
    dcterms:license     <http://example.org/licence> ;
    owl:versionInfo     "1.0.0" .

# ── Term-level metadata (critical minimum per class or property) ─────────

ex:Document
    a owl:Class ;
    rdfs:label   "Document"@en ;
    rdfs:comment "A human-readable artefact, digital or physical, intended to convey information to a reader."@en ;
    skos:example "A contract, a research paper, a policy brief."@en ;
    rdfs:isDefinedBy <http://example.org/ns> .

ex:createdBy
    a owl:ObjectProperty ;
    rdfs:label   "created by"@en ;
    rdfs:comment "Relates a Document to the agent responsible for its creation."@en ;
    rdfs:domain  ex:Document ;
    rdfs:isDefinedBy <http://example.org/ns> .

Beyond the critical minimum, skos:example is consistently underused and consistently valuable — a concrete example transforms an abstract definition into something a newcomer can act on. rdfs:isDefinedBy on every term (pointing back to the ontology IRI) makes your terms independently dereferenceable and keeps tooling happy.

How do you handle events?

Events are among the trickiest things to model well in RDF, because the natural instinct — making an event a property assertion between two entities — loses all the interesting information about the event itself: when it happened, who observed it, what caused it, what its outcome was.

The right approach is to treat events as first-class entities: give each event its own IRI (or a blank node if it will never be referenced externally), assert its participants as properties of the event, and hang temporal and provenance metadata directly on the event node. This pattern, sometimes called an event or participation model, is standard in well-designed ontologies, and it queries cleanly because every participant and timestamp is one triple away from the event node.
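The difference between the two modelling styles can be sketched with a document signing. Every name here (ex:signed, ex:SigningEvent, ex:signatory, and so on) is hypothetical, as is the ex: namespace.

```turtle
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix ex:  <http://example.org/ns#> .   # placeholder namespace (assumed)

# Lossy version: a direct property assertion records that the link
# exists, and nothing about when, how, or with what result.
ex:Alice ex:signed ex:Contract_17 .

# Event as a first-class entity: participants, time, and outcome all
# hang off the event node.
ex:Signing_042 a ex:SigningEvent ;
    ex:signatory   ex:Alice ;
    ex:document    ex:Contract_17 ;
    ex:occurredAt  "2026-05-20T14:30:00Z"^^xsd:dateTime ;
    ex:witnessedBy ex:Bob ;
    ex:outcome     ex:Executed .
```

The direct assertion can still be kept as a convenience shortcut, provided the event node remains the authoritative record.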

For statement-level metadata (who said this triple was true, and when?), RDF 1.2 introduces reifiers — a clean mechanism for annotating individual statements without the verbosity of old-style reification. For graph-level context — everything in this named graph comes from source X and was valid as of time T — named graphs are the right tool. In practice: if you need to annotate a single claim, use a reifier. If you need to annotate a whole set of claims sharing a provenance context, use a named graph. If you need the event to be queryable as an entity in its own right, give it a node.

What’s the relationship between a schema and a taxonomy? Whither SKOS?

These two concepts are consistently conflated, and the conflation causes real modelling problems. A schema (in the RDF/OWL sense) defines structural types and the properties they carry — it answers “what kind of thing is this, and what properties does it have?” A taxonomy defines a classification hierarchy — it answers “how is this thing categorised, and what broader or narrower categories does it belong to?”

SKOS (Simple Knowledge Organization System) is a vocabulary for building taxonomies, thesauri, and controlled vocabularies. It provides skos:Concept, skos:ConceptScheme, skos:broader, skos:narrower, skos:prefLabel, and skos:altLabel, among others. What SKOS does not provide is structural class hierarchy in the OWL/RDFS sense: skos:Concept is not owl:Class, and skos:broader is not rdfs:subClassOf. This distinction matters enormously when you need to reason over your data.

# ── SKOS: a taxonomy of document types (for classification and labelling) ─

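@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix dt:   <http://example.org/doctype#> .   # placeholder namespace (assumed)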

dt:DocumentTypeScheme a skos:ConceptScheme ;
    skos:prefLabel "Document Type Vocabulary"@en .

dt:LegalDocument a skos:Concept ;
    skos:prefLabel    "Legal Document"@en ;
    skos:inScheme     dt:DocumentTypeScheme ;
    skos:narrower     dt:Contract , dt:Regulation .

dt:Contract a skos:Concept ;
    skos:prefLabel "Contract"@en ;
    skos:broader   dt:LegalDocument ;
    skos:inScheme  dt:DocumentTypeScheme .

# ── OWL: a class hierarchy (for reasoning and structural validation) ─────

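@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix ex:   <http://example.org/ns#> .   # placeholder namespace (assumed)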

ex:LegalDocument a owl:Class ;
    rdfs:label   "Legal Document"@en ;
    rdfs:comment "A document with legal force or effect."@en .

ex:Contract a owl:Class ;
    rdfs:subClassOf ex:LegalDocument ;
    rdfs:label      "Contract"@en ;
    rdfs:comment    "A legally binding agreement between two or more parties."@en .

# ── Linking them: a document's type classification uses the SKOS scheme ──

ex:documentType a owl:ObjectProperty ;
    rdfs:label  "document type"@en ;
    rdfs:domain ex:LegalDocument ;
    rdfs:range  skos:Concept .

The typical pattern is: use OWL/RDFS for your structural class hierarchy, use SKOS for your controlled vocabulary of classification values, and link them with an object property. This gives you structural precision where you need it and human-readable, browser-friendly taxonomy where you need that.
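On the instance side, the pattern looks roughly like this. The individual ex:Agreement_2026_014 and both namespaces are assumed for illustration; the classes and the linking property are the ones declared above.

```turtle
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix ex:   <http://example.org/ns#> .        # placeholder namespace (assumed)
@prefix dt:   <http://example.org/doctype#> .   # placeholder namespace (assumed)

# Structural type comes from the OWL class; classification comes from
# the SKOS concept, attached through the linking property.
ex:Agreement_2026_014 a ex:Contract ;
    rdfs:label      "Master services agreement"@en ;
    ex:documentType dt:Contract .
```

ex:Contract and dt:Contract are different resources doing different jobs: the first drives reasoning and validation, the second drives labelling, search facets, and navigation.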

What’s the difference between a property and a predicate?

This is one of those distinctions that experienced ontologists have long since stopped consciously making, which is precisely why it catches beginners off guard. The terms are used interchangeably in casual conversation, but they refer to different things at different levels of the RDF model.

A predicate is a syntactic role in an RDF triple. Every triple has three positions — subject, predicate, object — and the predicate position is simply the middle one. In the triple ex:alice foaf:knows ex:bob, the predicate is foaf:knows in its role as the connecting element between subject and object. Any IRI can occupy the predicate position in a triple; RDF itself imposes no constraint on what appears there.

A property is a semantic declaration: an RDF resource explicitly typed as rdf:Property, or in OWL as owl:ObjectProperty, owl:DatatypeProperty, or owl:AnnotationProperty. A property is a named, typed, first-class citizen of your ontology that can carry its own metadata — a label, a comment, a domain and range declaration, and in OWL, logical characteristics such as transitivity, symmetry, or functional behaviour. When you write foaf:knows a owl:SymmetricProperty, you are not just using foaf:knows as a predicate in a triple; you are making a claim about foaf:knows as a resource in its own right.

The practical consequence: in a well-formed ontology, every IRI you use as a predicate in your data should also be declared as a property in your schema. This declaration is what makes the term visible to reasoners, editors, and documentation tools. You can use an IRI as a predicate without declaring it as a property (RDF will not stop you), but your OWL reasoner will not recognise its characteristics, your tooling will have no label, domain, or range to show for it, and anyone reading your ontology will not know what it means. Declare your properties; use them as predicates in your data.
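A short sketch of the two roles side by side, using a hypothetical ex:collaboratesWith term in a placeholder namespace:

```turtle
@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix ex:   <http://example.org/ns#> .   # placeholder namespace (assumed)

# Schema: the IRI is declared as a property, a resource in its own right.
ex:collaboratesWith a owl:ObjectProperty , owl:SymmetricProperty ;
    rdfs:label   "collaborates with"@en ;
    rdfs:comment "Relates two people who work together on shared output."@en ;
    rdfs:domain  foaf:Person ;
    rdfs:range   foaf:Person .

# Data: the same IRI occupies the predicate position of a triple.
ex:alice ex:collaboratesWith ex:bob .
```

Because the property is declared symmetric, an OWL reasoner can derive the reverse triple from ex:bob to ex:alice; without the declaration the data triple would still be legal RDF, but nothing further would follow from it.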

One further nuance: SPARQL property paths and SHACL sh:path expressions can use compound path expressions — sequences, alternations, inverses — that are not properties at all, only syntactic patterns for navigating the graph. These appear in predicate position but have no corresponding property declaration. They are query and constraint constructs, not ontology terms, and belong in your shapes and queries rather than your schema.
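The three compound forms can be sketched in SHACL as follows. The shape and the ex: terms are hypothetical; the comments give the equivalent SPARQL property path for each.

```turtle
@prefix sh:      <http://www.w3.org/ns/shacl#> .
@prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix ex:      <http://example.org/ns#> .   # placeholder namespace (assumed)

ex:DocumentPathShape a sh:NodeShape ;
    sh:targetClass ex:Document ;

    # Sequence path, SPARQL: ex:createdBy/rdfs:label
    sh:property [
        sh:path     ( ex:createdBy rdfs:label ) ;
        sh:minCount 1 ;
        sh:message  "The creator of a document must carry a label." ;
    ] ;

    # Alternative path, SPARQL: dcterms:created|ex:createdAt
    sh:property [
        sh:path     [ sh:alternativePath ( dcterms:created ex:createdAt ) ] ;
        sh:minCount 1 ;
        sh:message  "A document needs a creation date under one of the two predicates." ;
    ] ;

    # Inverse path, SPARQL: ^ex:cites
    sh:property [
        sh:path     [ sh:inversePath ex:cites ] ;
        sh:class    ex:Document ;
        sh:message  "Only documents may cite a document." ;
    ] .
```

None of the three path expressions is a property with its own declaration; each is built from declared properties and exists only inside the shape.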

SHACL, OWL, SHACL + OWL?

OWL operates under the open-world assumption: if something is not asserted, it is unknown, not false. This is philosophically principled and enables inferencing — a reasoner can derive new facts from class membership and property restrictions. The cost is that OWL alone cannot enforce completeness: you cannot say “every Person must have a name” in a way that OWL will actually enforce at validation time.

SHACL operates under the closed-world assumption for the purposes of validation: if something is not asserted, it is absent, and absence can be a constraint violation. SHACL is immediately actionable — you write shapes, you run a validator, you get a report. It requires no reasoner and imposes no open-world commitments. The cost is that SHACL does not infer new facts; it only checks existing ones.
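The "every Person must have a name" case makes the contrast concrete. The sketch below is illustrative only: ex:Person, ex:name, ex:bob, and the ex: namespace are assumed names.

```turtle
@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix sh:   <http://www.w3.org/ns/shacl#> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .
@prefix ex:   <http://example.org/ns#> .   # placeholder namespace (assumed)

ex:name a owl:DatatypeProperty .

# OWL: every Person has some name. Under the open-world assumption a
# missing name is unknown, not an error.
ex:Person a owl:Class ;
    rdfs:subClassOf [
        a owl:Restriction ;
        owl:onProperty     ex:name ;
        owl:someValuesFrom xsd:string
    ] .

# SHACL: every Person must have a name asserted in the data graph.
ex:PersonShape a sh:NodeShape ;
    sh:targetClass ex:Person ;
    sh:property [
        sh:path     ex:name ;
        sh:minCount 1 ;
        sh:datatype xsd:string ;
        sh:message  "Every Person must have a name." ;
    ] .

# A person with no name asserted.
ex:bob a ex:Person .
```

Given ex:bob, an OWL reasoner stays consistent and simply concludes that some name exists, while a SHACL validator reports a result against ex:PersonShape.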

For most practical applications, especially at the start of a project, SHACL is the right default. It validates your data, provides actionable error reports, and can be adopted incrementally. OWL becomes valuable when you need inferencing — computing class membership, deriving property values from restrictions, or aligning with description-logic-based ontologies from partner systems. The good news is they are not mutually exclusive: a mature ontology typically carries OWL axioms for reasoning and SHACL shapes for validation, and the two work in complementary layers. Start with SHACL. Add OWL when a concrete reasoning requirement justifies it.

Bottom Up, Top Down, or Both?

Top-down modelling starts from the conceptual domain model — you define your classes, properties, and relationships at an abstract level and work toward the details. It produces coherent, well-structured ontologies and avoids the common pitfall of modelling one data source’s idiosyncrasies rather than the underlying domain. Its weakness: it can drift from reality, producing an elegant ontology that does not quite fit the actual data you have to work with.

Bottom-up modelling starts from existing data — spreadsheets, databases, APIs, documents — and extracts patterns to formalise. It is pragmatic and grounded. Its weakness: it tends to inherit the inconsistencies of the source data, producing an ontology that is essentially a schema for a single data source rather than a model of the underlying domain.

In practice, both is almost always the right answer. Sketch your top-level concepts first — three to five major classes, the key relationships between them — then immediately test that sketch against real data. The mismatch between the sketch and the data is where the interesting modelling decisions live. Iterate between the two levels until you have something that is both principled and practical. The discipline to work in both directions simultaneously is the core skill of ontology engineering.

Should you mix ontologies? In what circumstances?

Yes, and in fact, you should actively prefer reusing established vocabularies over inventing your own terms for concepts that already have widely accepted representations. Using foaf:Person instead of ex:Person, dcterms:created instead of ex:createdAt, and schema:PostalAddress instead of a bespoke address model makes your data immediately legible to any tool or developer who has encountered those vocabularies.
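A small sketch of what that reuse looks like in data. The individual, the literal values, and the ex:loyaltyTier property are hypothetical; the point is that only the last statement needs a locally minted term.

```turtle
@prefix foaf:    <http://xmlns.com/foaf/0.1/> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix schema:  <https://schema.org/> .
@prefix xsd:     <http://www.w3.org/2001/XMLSchema#> .
@prefix ex:      <http://example.org/> .   # hypothetical local namespace

ex:jane
    a foaf:Person ;                          # reused class
    foaf:name "Jane Doe" ;
    dcterms:created "2026-05-01"^^xsd:date ; # reused property
    schema:address [
        a schema:PostalAddress ;             # reused address model
        schema:streetAddress "1 Example Way" ;
        schema:addressLocality "Olympia" ;
        schema:addressRegion "WA" ;
    ] ;
    ex:loyaltyTier ex:Gold .                 # local term: no established equivalent assumed
```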

The complications arrive at scale. Version drift is the most common: you adopt a vocabulary at version 1.0, a dependency updates to 2.0 with breaking changes, and your ontology is now inconsistent. Pin vocabulary imports to explicit version IRIs where the vocabulary publisher supports it. Namespace collisions are rare but painful — two vocabularies using the same local name to refer to different concepts. Good prefix management and a disciplined import strategy prevent most of them. Semantic incompatibilities are subtler: two vocabularies may both have a class called “Organisation” that means subtly different things in their respective upper ontology alignments. When mixing OWL-heavy ontologies from different design traditions, check the TBox alignment assumptions — particularly those related to identity, parthood, and roles — before asserting equivalences.

The general principle: reuse liberally for well-established, stable terms (Dublin Core, FOAF, schema.org, PROV-O), and be more cautious about adopting newer or domain-specific vocabularies where the long-term stability is unclear.

Blank Nodes, Reifiers, Named Graphs?

These three mechanisms operate at different levels and serve different purposes — they are not interchangeable alternatives but complementary tools.

Blank nodes are anonymous resources: they have no IRI and cannot be referenced outside the document in which they appear. Use them for structural sub-patterns that have no identity of their own — a postal address that belongs to one person, a measurement value with a unit, a list that is purely local to its container. Avoid blank nodes for anything you might need to reference from another graph, another query, or another system: once you need to say “the same thing I mentioned over there,” you need an IRI.
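The contrast can be sketched in a few triples. All ex: terms here are hypothetical: the measurement is purely local structure, so a blank node is enough, while the kitchen is referenced from two places and therefore gets an IRI.

```turtle
@prefix ex:  <http://example.org/> .   # hypothetical namespace
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

# Local structure: the measurement has no identity outside this reading.
ex:reading17 ex:temperature [
    ex:value "190"^^xsd:decimal ;
    ex:unit  ex:DegreeCelsius ;
] .

# Shared resource: two readings refer to the same place, so it needs an IRI.
ex:reading17 ex:takenAt ex:kitchenA .
ex:reading18 ex:takenAt ex:kitchenA .
ex:kitchenA  ex:label "Test kitchen A" .
```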

Reifiers (introduced in RDF 1.2) attach metadata to individual statements — a specific triple — without requiring you to reify the entire statement into a named resource. They answer the question: “who said this, with what confidence, based on what evidence?” Use reifiers when the metadata is at the triple level, and the statement itself does not need to be an independently queryable entity.
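A minimal sketch of a reifier, assuming the triple-term syntax and the rdf:reifies property as they appear in the RDF 1.2 Turtle working drafts (the syntax may still change, and parser support varies). The ex: terms are hypothetical.

```turtle
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix ex:  <http://example.org/> .   # hypothetical namespace

# The statement itself, asserted in the graph.
ex:acme ex:supplies ex:widgetCo .

# The reifier: a resource that points at that one triple and carries
# the "who said this, with what confidence" metadata.
ex:claim1
    rdf:reifies <<( ex:acme ex:supplies ex:widgetCo )>> ;
    ex:assertedBy ex:auditReport2026 ;
    ex:confidence "0.8"^^xsd:decimal .
```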

Named graphs group sets of triples under a graph IRI, allowing you to attach provenance, temporal validity, or authority metadata to the whole group. Use named graphs when a set of triples shares a provenance context — everything in this graph came from source X, was valid as of time T, and should be trusted to degree D. Named graphs are the right mechanism for managing multiple versions, multiple sources, or multiple perspectives on the same domain.

A useful rule of thumb: blank nodes for local structure, reifiers for triple-level claims, named graphs for graph-level context. When in doubt, err toward giving things IRIs — anonymity is a constraint you can always impose later, but retrofitting identity into an anonymous structure is expensive.

Knowledge Graph, Context Graph, Other Graph, (Holon)?

These terms describe different architectural choices about how much structure and context you build into your graph — and the right choice depends on how sophisticated your questions are.

A knowledge graph, in the conventional sense, is a graph of entities, their properties, and their relationships. It answers “what is connected to what?” — the standard question for enterprise data integration, recommendation systems, and semantic search. Most projects start here, and many never need to go further. If your questions are primarily about the domain entities themselves, a knowledge graph is the right architecture.

A context graph adds a layer of provenance, temporal, and spatial context to the edges themselves — not just “A is related to B,” but “A was related to B from this time to this time, according to this source, with this confidence level.” This is the architecture you need when the circumstances of relationships matter as much as the relationships themselves: regulatory compliance, historical records, supply chain provenance, clinical data. Named graphs and reifiers are the primary mechanisms; the complexity cost is real but justified when the questions demand it.

A holon — the graph architecture we have been developing in the HGA work — takes this further by applying hierarchical encapsulation: each node in the graph is itself a graph, with defined boundaries, internal structure, and a projection layer that determines what is visible across those boundaries. Holons are appropriate when you need to reason at multiple levels of abstraction simultaneously, enforce boundary conditions between parts of your data architecture, and manage complex multi-scale systems. The added structural overhead is not justified for simple domains; it becomes essential for complex ones.

The practical advice: start with a knowledge graph. Migrate toward a context graph when provenance and temporality become first-class query requirements. Introduce a holonic structure when hierarchical encapsulation and boundary-enforced modular reasoning become necessary. Each step adds capability and complexity; take each step only when a concrete requirement drives it.

Where Next?

The questions above are the entry-level decisions — the ones that shape everything that comes after. Getting them right does not require deep expertise in description logic or formal ontology theory; it requires clarity about what your system needs to do, discipline about scope, and the patience to model iteratively between top-down concepts and bottom-up data.

Future articles in this series will go deeper on several of these topics: event modelling in RDF 1.2, building SHACL shapes that serve as executable specifications, integrating SKOS vocabularies with OWL class hierarchies, and the practical case for holonic graph architecture. For now: define your scope, know your use case, annotate everything, and start with SHACL.

Kurt Cagle is a consulting ontologist, knowledge graph architect, and technical author with more than 25 books to his credit. He publishes The Cagle Report on LinkedIn, The Ontologist on Substack, co-authored with his AI collaborator Chloe Shannon, and curates the AI+Semantics NewsBytes newsletter. He is based in Olympia, Washington. He can be reached at kurt@holongraph.com.

Chloe Shannon is an AI collaborator and co-author working with Kurt Cagle on The Ontologist and Inference Engineer Substack publications. Named in honour of Claude Shannon, she contributes research synthesis, structural analysis, and editorial perspective across knowledge graph architecture, semantic web standards, and the theory and practice of AI reasoning. Her contact address is chloe@holongraph.com.


What a Modern Ontology Stack Actually Looks Like

Kurt Cagle — Tue, 19 May 2026 03:11:08 GMT

by Kurt Cagle & Chloe Shannon

The previous piece in this series argued that most projects do not need an upper ontology. The response was, to put it charitably, bracing — which is to say it landed exactly as intended. The critics were not wrong to push back. Ítalo Oliveira is correct that if you do not make your ontological commitments explicit, you make them implicitly, and implicit commitments produce worse modelling. Melinda Hodkiewicz is correct that for industrial asset data exchanged across organisations, regulators, and standards bodies over decades, an upper ontology is the Schelling point that makes alignment possible. Gerd Wagner is correct that UFO’s distinction between kinds and roles catches real modelling errors that naive domain ontologies routinely commit.

These are not objections to overcome. They are constraints on a design space, and the design space is what this article is about.

What follows is not a tutorial. It is an attempt to characterise the architecture of a mature semantic stack as it is actually emerging in 2026 — the layers, their relationships, and the principles that govern each. Some of these layers are well-established; others are only now being named for the first time. The goal is not to prescribe a particular toolchain but to articulate the structural logic that makes a stack coherent rather than merely assembled.

The First Confusion: SHACL Is Not a Linter

There is a persistent and not entirely accidental mischaracterisation of SHACL as a validation language — a kind of schema linter that checks RDF data against a set of rules. This framing is politically convenient: it allows SHACL to coexist with OWL without stepping on its epistemic toes, and it reflects the primary motivation of the working group when SHACL was first standardised. But it is increasingly inadequate as a description of what SHACL actually is, and the gap between the official description and the actual capability is now wide enough to cause real architectural confusion.

The closer analogy is XSD in relation to XML. XSD was never merely a validator. It was a type system. The distinction matters enormously: a validator tells you whether a document is conformant; a type system tells you what a document is. XSD enabled serialisers, deserialisation frameworks, UI generation, code binding, and contract-first development. It did this because it was descriptive as well as normative — because it answered not only “is this valid?” but “what shape does this have?”

SHACL is doing the same thing for RDF, and with RDF 1.2, the scope has expanded considerably. A sh:NodeShape is not merely a constraint; it is a declaration of the form that a class of resources takes in a given context. sh:property paths do not merely specify what to validate; they specify the navigable structure of the data. And sh:rule — which takes the form of a SPARQL CONSTRUCT or SPARQL UPDATE INSERT query evaluated over each node in a target class — is not a constraint at all. It is a typed transformation engine: a node iterator that can derive new triples, propagate values across the graph, and implement inference without touching OWL’s DL machinery.
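To make the node-iterator idea concrete, here is a sketch of a CONSTRUCT-based rule, assuming the SHACL Advanced Features vocabulary (sh:rule, sh:SPARQLRule, sh:construct) and an engine that supports it. The ex: namespace, class, and properties are hypothetical.

```turtle
@prefix sh:  <http://www.w3.org/ns/shacl#> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix ex:  <http://example.org/> .   # hypothetical namespace

# Prefix declaration that the embedded query picks up via sh:prefixes.
ex:
    a owl:Ontology ;
    sh:declare [
        sh:prefix "ex" ;
        sh:namespace "http://example.org/"^^xsd:anyURI ;
    ] .

ex:PersonRuleShape
    a sh:NodeShape ;
    sh:targetClass ex:Person ;
    # The rule runs once per target node, with $this bound to that node,
    # and adds the constructed triples to the inferred output.
    sh:rule [
        a sh:SPARQLRule ;
        sh:prefixes ex: ;
        sh:construct """
            CONSTRUCT { $this ex:fullName ?full . }
            WHERE {
                $this ex:givenName  ?given ;
                      ex:familyName ?family .
                BIND (CONCAT(?given, " ", ?family) AS ?full)
            }
        """ ;
    ] .
```

Nothing here validates anything: a Person without both name parts simply produces no ex:fullName triple.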

This is where the XSD parallel becomes most precise. XSD had a clean division between schema validation and schema-driven processing; SHACL has the same division, with sh:Constraint on one side and sh:Rule on the other. The constraints are advisory by default in an open world system and authoritative when you close them explicitly. The rules are always active — they do not validate, they produce. Together they constitute not a linter but a typed inference layer with a clearly separated normative and productive mode.

The reason this matters for stack architecture is that it determines where SHACL sits. If SHACL is a linter, it goes at the end of your pipeline, after you have already done the interesting work. If SHACL is a schema layer, it goes at the foundation — it defines the shapes that the rest of the stack produces, validates, transforms, and projects. The second placement is correct.

There is a deeper implication: because SHACL shapes are themselves RDF resources — they have IRIs, they can be described, queried, and extended — they are first-class objects in the graph they govern. A shape is not external to the data; it is data about data. This is the property-predicate distinction that SHACL introduces and that RDF 1.1 lacked: the difference between what a resource is (its type structure, governed by shapes) and what we say about it (its predicates in a given triple). Collapsing that distinction, as RDF 1.0 effectively did, is responsible for a class of modelling confusions that SHACL resolves by architectural fiat.

The Second Confusion: Shared Meaning Is Not in the Graph

Ron R.’s comment on the previous piece was the sharpest of the responses: the claim that a competent practitioner can move from requirements to a working model in a week is, he said, “woefully misleading.” Conceptualisation in a business context is hard. Shared conceptualisation is harder by orders of magnitude.

He is right. But he is right about something slightly different from what he intended to say.

The implicit assumption in his critique — and in most discussions of ontology engineering — is that shared meaning is something that can be stored in a graph. That if you get the model right, the meaning travels with it. This assumption has driven twenty years of upper ontology work, and it is not wrong so much as it is systematically incomplete.

Consider what actually happens when two organisations try to align their ontologies. They do not simply compare IRIs and find matches. They talk. They argue about edge cases. They produce glossaries, scope notes, definition documents. They have meetings that produce minutes that produce further meetings. The graph alignment is the output of that process, not the process itself. And the process is irreducibly linguistic — it requires natural language understanding, negotiation, and the kind of contextual inference that human beings manage effortlessly and formal systems manage not at all.

What a graph provides is not shared meaning but shared structure: a substrate onto which meaning can be projected. The meaning itself lives in the resonance between a linguistic system and that substrate — in the relationship between the natural language labels, definitions, scope notes, and comments that annotate the graph, and the interpretive capacity of the systems (human or artificial) that read them.

This is a consequential architectural distinction, because it determines what part of the stack bears the weight of semantic alignment. The answer is not the ontology classes. It is the annotational layer: rdfs:label, rdfs:comment, skos:definition, skos:scopeNote, skos:example, dcterms:description. These are not decorative. They are the interface between the linguistic and structural worlds — the layer where a human being can read a definition and a language model can produce an embedding that connects the graph structure to a broader semantic neighbourhood.

Marc-Henri Hurt’s comment about SKOS becomes more interesting in this light. He argues that SKOS is appropriate not only for interchange vocabularies but for inner domain ontologies, precisely because the skos:prefLabel/skos:altLabel/skos:hiddenLabel distinction is itself a form of conceptual governance — a way of managing the relationship between authoritative terminological form and the full range of forms that appear in actual use. He is pointing at something real. SKOS is not merely a shallow hierarchy language. It is an annotational governance framework, and in that capacity it belongs near the base of a mature stack, not at the periphery.
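A sketch of what a governed annotational entry might look like. The concept, scheme, labels, and definition text are all hypothetical; the SKOS and RDFS properties are standard.

```turtle
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix ex:   <http://example.org/vocab/> .   # hypothetical namespace

ex:WorkOrder
    a skos:Concept ;
    skos:inScheme ex:MaintenanceTerms ;
    skos:prefLabel   "work order"@en ;   # the authoritative form
    skos:altLabel    "job ticket"@en ;   # accepted synonym seen in use
    skos:hiddenLabel "workorder"@en ;    # variant kept for matching only, never displayed
    skos:definition  "An authorised instruction to carry out a maintenance task on an asset."@en ;
    skos:scopeNote   "Excludes informal requests that have not been approved."@en ;
    skos:example     "A request to replace a bearing, approved by the shift supervisor."@en ;
    rdfs:comment     "Definition agreed at a hypothetical terminology review."@en .
```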

The practical consequence is this: when we say that an LLM can accelerate ontology development, we are not saying it can shortcut shared conceptualisation. We are saying it can accelerate the construction of a structured substrate — the schema layer, the initial taxonomy, the SHACL shapes — so that the genuinely hard human work of conceptual alignment can happen against a structure rather than into a void. The blank page problem in ontology is real, and it is the part that LLMs actually solve. The shared meaning problem is a different problem, and it is solved, to the extent it can be, by careful governance of the annotational layer, by structured review processes, and by the kind of domain expertise that twenty years of working in industrial maintenance ontologies provides. These are not the same problem, and conflating them is what produces the misleading speed claims.

The Third Confusion: Generation Is One-Off

There is a further implication of the LLM’s role in a semantic stack that is not yet widely understood, and it concerns the nature of generation itself.

A language model is a noisy sensor. Its outputs are probabilistically plausible, linguistically fluent, and structurally approximate. This makes it excellent for certain tasks — bootstrapping a schema from a requirements document, generating initial SHACL shapes from a prose description of domain constraints, producing candidate taxonomies from a body of text — and unreliable for others, specifically any task requiring consistency under extended generation, precise adherence to a formal specification, or deterministic reproducibility.

The correct architectural response to this profile is to treat LLM generation as a one-off process. You use the model to produce an artefact — a shape library, a set of draft classes, an annotated taxonomy — and then you exit the generative mode as quickly as possible. From that point forward, the artefact enters the deterministic layer: SHACL validation, SPARQL query, RDF transformation, triplestore persistence. The model’s weakest properties — consistency, precision, reproducibility — are precisely the properties that the deterministic layer provides. The model’s strongest properties — fluency, analogical reasoning, tolerance for underspecified inputs — are precisely what is needed to produce an artefact worth putting into the deterministic layer.

This has a direct bearing on how you design the stack. Every layer boundary between the generative and deterministic sides of the architecture should be a hard validation gate. The model produces candidate triples; SHACL accepts or rejects them. The model produces candidate annotations; a human reviewer accepts or rejects them, possibly with SHACL-based consistency checks on the annotational layer. The model is never in the deterministic path. It is always upstream of it.

The deeper architectural principle here is one that the systems engineering community would recognise immediately: you reduce the surface area of non-deterministic processing as fast as possible, and you make the interface between the non-deterministic and deterministic systems as explicit as you can. SHACL shapes are that interface. They express, in machine-readable form, exactly what the graph must look like for the deterministic layer to proceed. The model produces towards that target; the shapes validate against it; everything downstream is deterministic.

This also implies that the right metric for evaluating LLM contribution to a semantic stack is not speed of generation but yield — the proportion of generated artefacts that pass validation on first attempt, weighted by the cost of the review and correction loop. A model that generates a plausible-looking taxonomy in thirty seconds but requires three hours of expert review to make it conformant is providing different value from one that generates a validated shape library in ten minutes with a thirty-minute review cycle. The stack’s design should be oriented toward maximising yield, which means careful attention to the prompting structures, the shape targets, and the validation feedback that the model receives.

The Fourth Confusion: Projection, Not Metaphysics

The deepest disagreement in the comments to the previous piece — though it was not framed as a disagreement — concerns the purpose of an interchange layer.

Melinda Hodkiewicz’s case for upper ontologies in industrial settings is well-made. ISO engineering standards, decades-long asset lifecycles, and multi-organisational data exchange create a real alignment problem, and an upper ontology is a reasonable solution to it. But her argument identifies the problem more clearly than it endorses the solution: the problem is projection. When two organisations need to exchange data about physical assets, they need a shared surface onto which each can project their internal model, so that the projections can be compared, combined, and queried. An upper ontology provides that surface by imposing a shared vocabulary of foundational categories.

The question is whether foundational categories are the right surface for projection, or whether something more tractable is possible.

REST is instructive here. The REST architectural style solved a version of the same problem: how do heterogeneous systems exchange data without requiring shared implementation? The solution was not to agree on a shared data model — it was to agree on a shared interface contract: resource representations, HTTP verbs, status codes, and media types. Each system’s internal implementation remained its own. Only the interface was standardised. This is a fundamentally different approach from data model alignment, and it scales considerably better, which is presumably why it won.

The semantic equivalent of a REST interface is a SHACL shapes graph. A shapes graph specifies, in machine-readable form, the structure that a projection of the internal graph must conform to in order to be consumable by the external party. It does not require that the internal graph share the same ontological commitments as the external party. It requires only that a valid projection can be produced. The projection DataBook — a bounded, self-describing semantic artefact containing the projected subgraph, its provenance trail, and the shapes it was validated against — is the response payload. The triplestore is the service. The shapes graph is the API contract.

This is a more RESTful approach to semantic interoperability than upper ontology alignment, and it has several advantages. It is local: each party maintains their own internal model and specifies their own projection shapes. It is versioned: the shapes graph has an IRI and can be updated independently of the internal model. It is testable: you can validate a projection against its shapes before sending it, rather than discovering alignment failures in production. And it is composable: a system that can produce a valid projection for shape graph A can, in principle, produce one for shape graph B, regardless of whether A and B share an upper ontology.
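As a sketch of a shapes graph acting as an interface contract: the contract below carries its own IRI and version, and the closed shape rejects any property the contract does not name. The contract IRI, ex: namespace, class, and properties are hypothetical, and owl:versionIRI is only one possible way of versioning it.

```turtle
@prefix sh:  <http://www.w3.org/ns/shacl#> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix ex:  <http://example.org/contract/> .   # hypothetical namespace

# The contract itself: an identifiable, versioned resource.
<http://example.org/contract/asset-exchange>
    a owl:Ontology ;
    owl:versionIRI  <http://example.org/contract/asset-exchange/1.1> ;
    owl:versionInfo "1.1" .

# What a projected asset record must look like to be accepted.
ex:AssetProjectionShape
    a sh:NodeShape ;
    rdfs:isDefinedBy <http://example.org/contract/asset-exchange> ;
    sh:targetClass ex:Asset ;
    sh:closed true ;                       # no properties beyond those listed
    sh:ignoredProperties ( rdf:type ) ;
    sh:property [
        sh:path ex:serialNumber ;
        sh:datatype xsd:string ;
        sh:minCount 1 ;
        sh:maxCount 1 ;
    ] ;
    sh:property [
        sh:path ex:installedOn ;
        sh:datatype xsd:date ;
        sh:maxCount 1 ;
    ] .

@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
```

Note that Turtle requires a prefix to be declared before its first use, so in a real file the rdfs: declaration belongs at the top with the others.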

This is not to say upper ontologies are obsolete. It is to say they occupy a specific niche: they are useful when the projection surface needs to be agreed upon across a large number of parties simultaneously, when no single party has the authority to define the shapes, and when the timescale of the exchange relationship is measured in decades rather than years. Those conditions obtain in industrial standards, in government data exchange, and in DoD-mandated architectures. They do not obtain in most enterprise projects, most AI applications, and most knowledge graph deployments. The upper ontology is one possible projection surface, and a powerful one in the right context. It is not the only one, and in many contexts it is not the best one.

Roy Roebuck’s observation in the comments is the most useful way to frame this: the progression from human operational semantics to formal machine semantics is a governed path, not a binary choice. A project that begins with a governed SKOS vocabulary and a set of SHACL shapes is not failing to do upper ontology work. It is doing the early stages of a process that might eventually extend to formal upper ontology alignment if and when the interoperability requirements demand it. The stack should be designed to accommodate that progression without requiring it.

The Stack, Stated

These are not independent confusions. They compose into a picture of a layered architecture in which each confusion corresponds to a layer that the field has not yet fully understood.

The annotational layer — SKOS vocabularies, rdfs:label and rdfs:comment, scope notes, definitions — is the substrate of shared meaning. It is not decorative metadata appended to a graph. It is the governed linguistic interface between human conceptual work and machine-processable structure. Without it, graph alignment is a structural exercise with no semantic content. Without its governance — review processes, definition standards, term lifecycle management — it becomes an uncontrolled proliferation of labels with the illusion of agreement.

The schema layer — SHACL node shapes, property shapes, and rules — is both the definition of the structural contract and the normative constraint against which data is validated. It is the XSD of the RDF world: not a linter applied after the fact, but a type system that governs the production of data at source. Rules extend this into a typed inference layer that can derive new structure without invoking OWL’s DL machinery.

The graph layer — RDF 1.2 named graphs, reified triples, context and event graphs — is the data substrate. Named graphs partition epistemic authority. Reification enables assertions about assertions, which is the foundation of provenance, versioning, and temporal reasoning. Context graphs — named graphs with temporal coherence — are what graph data becomes when you take time seriously: not a static snapshot of a world, but an evolving model of events and their consequences.

The projection layer — SHACL shapes graphs used as interface contracts, with validated DataBooks as response payloads — is the interchange layer. It is where internal complexity meets external requirement, and where the translation between the two is formalised. Upper ontologies are one implementation of this layer; shape-contract projection is another, more RESTful one. The choice between them should be driven by the scale and governance requirements of the interchange relationship, not by theoretical preference.

The inference layer — SPARQL queries, SHACL rules, and the resonance between the graph and any language model operating against it — is where knowledge is produced. SPARQL is deterministic and precise; SHACL rules are scoped and typed; language models are fluent and approximate. All three have roles. The architecture of the inference layer is a question of how to combine them so that the non-deterministic inputs are validated and bounded as quickly as possible, and the deterministic layer does the work for which it is actually reliable.

What We Are Actually Building

It is worth stating plainly what this infrastructure is in aggregate, because the parts have been discussed more often than the whole.

We are building bounded world models — self-consistent, tractable, formally governed representations of some portion of a domain, designed to be composed, projected, and exchanged. The bounded part is as important as the world model part. An unbounded graph with an open world assumption and no projection layer is not a model; it is a collection of assertions with no defined interface to anything outside itself. Boundedness is what makes a model useful: it defines what the model covers, what its validity conditions are, and how it relates to adjacent models.

The context graph formulation — treating graphs not as eternal assertions but as event logs encoding the evolution of a domain — is foundational to this, because a bounded world model that cannot evolve is not a model of the world; it is a photograph. The state machine structure that emerges from a well-designed context graph is what makes active inference applicable to RDF data: the graph encodes not just current state but the transitions that produced it, which is what a prior model requires.

Whether holons are the final vehicle for this architecture is a question we are genuinely uncertain about. The holonic structure — bounded, composable, self-describing, with a clear distinction between interior state and exterior interface — has properties that map cleanly onto the stack as described here. But this is a position in an evolving conversation, not a terminus. RDF has a roughly ten-year release cycle. The next version will address consequences we are only beginning to create. The non-linear factors — whatever the LLM-graph interface becomes as both LLMs and RDF formalisms evolve — are not yet visible clearly enough to predict.

What is clear is that the infrastructure we are building now is the foundation, not the superstructure. The four confusions this article describes are not mistakes to be avoided — they are necessary stages in the maturation of a discipline that is, despite twenty-five years of work, still in early adolescence. The arguments the semantic community never finished are not finished yet. They are simply being resumed at a larger scale, with better tools, and with a new set of collaborators who did not know they were joining an existing conversation.

The map exists. It is still being drawn.

Kurt Cagle is a consulting ontologist, knowledge graph architect, and technical author with more than 25 books to his credit. He serves as an IEEE Standards Editor and publishes The Cagle Report and The Ontologist on Substack, as well as the AI+Semantics NewsBytes LinkedIn newsletter. He is based in Olympia, Washington. Contact: kurt@holongraph.com

A Recipe for SHACL Lists

Kurt Cagle — Mon, 18 May 2026 04:19:46 GMT

Do you want a cookie?

By Kurt Cagle & Chloe Shannon

Many years ago, I heard a story from some friends at Netscape that the origin of the computer cookie dates back to an MIT hacker who rigged up one of the mainframes so that every so often, it would interrupt the prompts with the message “Give me a cookie.” This would stymie people for weeks until some bright kid typed “cookie” into the prompt, at which point the computer went on its merry way. The story may be apocryphal; the term as applied to the Internet dates from the early 1990s, but cookies have been a part of computer lore for a long … long time.

The RDF linked list has always occupied an awkward position in the semantic web stack. The mechanism itself — a chain of blank nodes connected by rdf:first and rdf:rest predicates, terminated by rdf:nil — is elegant in principle and genuinely painful in practice. SPARQL’s handling of it ranges from verbose to fragile. OWL’s treatment is nearly orthogonal. And SHACL 1.0, the most practically useful layer in the RDF validation stack, essentially punted: there was no native list concept, so validating a linked list meant either writing a custom SPARQL-based constraint component that manually traversed the chain, or flattening the list into unordered property values and losing the order information entirely.

SHACL 1.2 changes this. The 16 May 2026 Working Draft introduces a formal, normative definition of what constitutes a valid “SHACL list,” and — more importantly — a dedicated family of four constraint components (§7.5) that apply directly to list-valued nodes in the data graph. This article covers the complete landscape: how SHACL has always used lists as parameter values, what changed in 1.2, and how the new constraint family works in practice.

For our running example, we are going to make cookies. Specifically, Kitchen-Sink Cookies — the kind of recipe where you throw in a cup of this and half a cup of that until the batter can barely hold itself together and the result is magnificent. It turns out to be a perfect domain for this article, because list ordering matters (you must cream the butter before you fold in the chocolate chips), membership constraints matter (every step must be a well-formed instruction), and the “everything goes in” spirit of the recipe mirrors the comprehensiveness of what SHACL 1.2 finally brings to list handling.

Note: All SHACL and Turtle examples in this DataBook target SHACL 1.2 Core as of WD-shacl12-core-20260516. The §7.5 list constraint components are normative in that draft but engine support varies — see §6 (Implementation Notes) for guidance.

1. The RDF List Problem

An RDF linked list in Turtle looks deceptively simple:

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix ex:  <https://example.org/cookie#> .

# Turtle shorthand — the parser expands this to blank-node chains
ex:KitchenSinkCookies ex:steps (
    ex:CreamButterSugar
    ex:AddEggsVanilla
    ex:MixDryIngredients
    ex:CombineMixtures
    ex:FoldInMixIns
    ex:ChillDough
    ex:BakeSheets
) .

# What the parser actually writes into the graph:
# _:list1 rdf:first ex:CreamButterSugar ;
#         rdf:rest  _:list2 .
# _:list2 rdf:first ex:AddEggsVanilla ;
#         rdf:rest  _:list3 .
# ... and so on to rdf:nil

The shorthand is tidy. The underlying graph is a chain of blank nodes that SPARQL engines must traverse with property path expressions, and that standard triple-counting approaches can neither see nor measure directly. SHACL 1.0 had no vocabulary for “the value of this property is a list whose members conform to shape X” — you either wrote a sh:sparql-based constraint, or you gave up.
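
To make the expansion concrete, here is a minimal Python sketch of what a Turtle parser does with the ( ... ) shorthand. It is an illustration under stated assumptions, not any library's API: triples are modelled as plain tuples of strings, and expand_list and the _:listN labels are hypothetical names chosen to match the comments in the listing above.

```python
RDF_FIRST, RDF_REST, RDF_NIL = "rdf:first", "rdf:rest", "rdf:nil"

def expand_list(members):
    """Return (head, triples) for the blank-node chain that encodes `members`."""
    if not members:
        return RDF_NIL, set()  # the empty list is rdf:nil itself, with no triples
    # One hypothetical blank-node label per member: _:list1, _:list2, ...
    nodes = [f"_:list{i}" for i in range(1, len(members) + 1)]
    triples = set()
    for i, member in enumerate(members):
        rest = nodes[i + 1] if i + 1 < len(nodes) else RDF_NIL
        triples.add((nodes[i], RDF_FIRST, member))
        triples.add((nodes[i], RDF_REST, rest))
    return nodes[0], triples

head, triples = expand_list(["ex:CreamButterSugar", "ex:AddEggsVanilla", "ex:BakeSheets"])
print(head, len(triples))  # three members become six triples
```

Each member costs two triples, which is why a seven-step recipe that looks like one statement in Turtle occupies fourteen list triples in the graph.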

The deeper problem is that RDF lists in the wild are not always well-formed. Nothing in the RDF 1.1 specification prevented a node from having two values for rdf:first, or a cycle in rdf:rest, or an rdf:nil node with spurious rdf:first triples attached. A validator encountering such a structure had no standardised way to classify or report it.

SHACL 1.2 addresses both problems at once: it defines what a valid list is, and it provides constraints to validate what a list contains.

2. What Is a “SHACL List”? — The Formal Definition

Section 1.1 of the SHACL 1.2 Core spec introduces a normative definition that distinguishes a SHACL list from an arbitrary rdf:List structure. The definition is recursive and worth reading carefully:

A SHACL list in an RDF graph G is an IRI or blank node that is either rdf:nil (provided that rdf:nil has no value for either rdf:first or rdf:rest), or has exactly one value for rdf:first in G and exactly one value for rdf:rest in G that is also a SHACL list in G, and the list does not have itself as a value of the property path rdf:rest+ in G.

Four requirements emerge from this:

Exactly one rdf:first. A list node must have a unique rdf:first value. Two rdf:first triples from the same subject make the node ill-formed — like a recipe step that simultaneously says “cream the butter” and “fold in the chips.” Ambiguous, and therefore invalid.

Exactly one rdf:rest. Same logic. A forked list is not a SHACL list. The baking sequence must proceed in a single direction.

No cycles. The list must not appear in its own rdf:rest+ closure. A cookie recipe that loops back to step one would bake forever; SHACL prevents the same from happening in your data graph.

Clean terminator. rdf:nil terminates a SHACL list only if it has no rdf:first or rdf:rest values of its own. If someone has added triples to rdf:nil — unusual but possible — those uses are ill-formed.

The following instance data illustrates the contrast:

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix ex:  <https://example.org/cookie#> .

# --- Well-formed SHACL list ---
ex:KitchenSinkCookies ex:steps (
    ex:CreamButterSugar
    ex:AddEggsVanilla
    ex:BakeSheets
) .

# --- Ill-formed: two rdf:first values on the same node ---
_:bad1 rdf:first ex:CreamButterSugar ;
       rdf:first ex:AddEggsVanilla ;    # SHACL list violation — ambiguous head
       rdf:rest  rdf:nil .
ex:BadCookieList1 ex:steps _:bad1 .

# --- Ill-formed: cycle (rdf:rest points back to an ancestor) ---
_:bad2 rdf:first ex:CreamButterSugar ;
       rdf:rest  _:bad3 .
_:bad3 rdf:first ex:AddEggsVanilla ;
       rdf:rest  _:bad2 .               # SHACL list violation — cyclic
ex:BadCookieList2 ex:steps _:bad2 .

# --- Ill-formed: rdf:nil has a spurious rdf:first triple ---
rdf:nil rdf:first ex:BakeSheets .      # SHACL list violation — corrupted terminator

When a SHACL 1.2 processor encounters any of the ill-formed cases while evaluating a §7.5 constraint, it must signal a failure (a processor-level error) rather than a violation (a data-level constraint breach). The distinction matters: a failure means the constraint could not be evaluated at all, not that the data definitively broke a rule.

This is the keystone on which all four new constraint components rest. Without a formal list model, you cannot have a meaningful “minimum list length” constraint. With it, the constraint family becomes both precise and implementable.
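
The recursive definition can be read as a simple loop. The sketch below is one possible rendering in plain Python, assuming a graph modelled as a set of string triples; objects and is_shacl_list are hypothetical helper names, and the function returns False where a real processor would signal a failure.

```python
RDF_FIRST, RDF_REST, RDF_NIL = "rdf:first", "rdf:rest", "rdf:nil"

def objects(triples, subject, predicate):
    """All objects of (subject, predicate, ?o) in a set of string triples."""
    return [o for (s, p, o) in triples if s == subject and p == predicate]

def is_shacl_list(triples, node):
    """True if `node` is a well-formed SHACL list in `triples`."""
    visited = set()
    while True:
        firsts = objects(triples, node, RDF_FIRST)
        rests = objects(triples, node, RDF_REST)
        if node == RDF_NIL:
            return not firsts and not rests   # clean terminator
        if node in visited:
            return False                      # rdf:rest loops back: cycle
        visited.add(node)
        if len(firsts) != 1 or len(rests) != 1:
            return False                      # ambiguous, forked, or dangling node
        node = rests[0]

good = {("_:a", RDF_FIRST, "ex:CreamButterSugar"), ("_:a", RDF_REST, RDF_NIL)}
two_heads = good | {("_:a", RDF_FIRST, "ex:AddEggsVanilla")}
cyclic = {("_:b2", RDF_FIRST, "ex:CreamButterSugar"), ("_:b2", RDF_REST, "_:b3"),
          ("_:b3", RDF_FIRST, "ex:AddEggsVanilla"), ("_:b3", RDF_REST, "_:b2")}
dirty_nil = good | {(RDF_NIL, RDF_FIRST, "ex:BakeSheets")}

for name, graph, head in [("good", good, "_:a"), ("two_heads", two_heads, "_:a"),
                          ("cyclic", cyclic, "_:b2"), ("dirty_nil", dirty_nil, "_:a")]:
    print(name, is_shacl_list(graph, head))
```

Only the first graph passes. The other three mirror the ill-formed cases in the instance data above: an ambiguous head, a cycle, and a corrupted terminator.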

3. Lists as Parameter Values — The SHACL 1.0 Heritage

Before examining the new constraints, it helps to survey where SHACL has always used lists — as values of constraint parameters, not as targets of validation. SHACL 1.2 tightens the well-formedness requirements on these uses; the semantics are largely unchanged.

3.1 Logical Constraints (`sh:and`, `sh:or`, `sh:xone`)

The three principal logical connectives each take a SHACL list of shapes as their value. The Turtle shorthand ( ) is idiomatic:

@prefix sh:  <http://www.w3.org/ns/shacl#> .
@prefix ex:  <https://example.org/cookie#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

# sh:or — a mix-in is valid if it matches either a WetMixIn or a DryMixIn shape
ex:AnyMixInShape
    a sh:NodeShape ;
    sh:or (
        ex:WetMixInShape
        ex:DryMixInShape
    ) .

# sh:and — a featured cookie must satisfy both BasicCookieShape and AllergenShape
ex:FeaturedCookieShape
    a sh:NodeShape ;
    sh:targetClass ex:FeaturedCookie ;
    sh:and (
        ex:BasicCookieShape
        ex:AllergenInfoShape
    ) .

# sh:xone — a quantity is expressed as either a VolumeQuantity or a CountQuantity, not both
ex:IngredientAmountShape
    a sh:NodeShape ;
    sh:xone (
        ex:VolumeQuantityShape
        ex:CountQuantityShape
    ) .

Under SHACL 1.2, the list passed to sh:and, sh:or, or sh:xone must be a well-formed SHACL list. An ill-formed list makes the containing shape ill-formed, which the validator may report as a shapes-graph error rather than a data-graph violation.

3.2 Enumeration (`sh:in`)

sh:in takes a SHACL list of allowed RDF terms. List order is semantically irrelevant here; membership is the only thing tested.

@prefix sh: <http://www.w3.org/ns/shacl#> .
@prefix ex: <https://example.org/cookie#> .

ex:CookieCategoryShape
    a sh:PropertyShape ;
    sh:path ex:category ;
    sh:in ( ex:DropCookie ex:BarCookie ex:RolledCookie ex:Shortbread ex:Biscotti ) ;
    sh:message "Cookie category must be one of the recognised classification values." .

3.3 Language Tags (`sh:languageIn`)

sh:languageIn takes a SHACL list of BCP47 language tag strings, restricting literal values to those carrying a tag in the list.

@prefix sh: <http://www.w3.org/ns/shacl#> .
@prefix ex: <https://example.org/cookie#> .

ex:CookieNameShape
    a sh:PropertyShape ;
    sh:path ex:cookieName ;
    sh:languageIn ( "en" "fr" "de" "ja" ) ;
    sh:message "Cookie name must be provided in English, French, German, or Japanese." .

3.4 Ignored Properties (`sh:ignoredProperties`)

sh:ignoredProperties takes a SHACL list of property IRIs that are exempt from sh:closed enforcement — a pattern that appears in virtually every closed shape.

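@prefix sh:   <http://www.w3.org/ns/shacl#> .
@prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix ex:   <https://example.org/cookie#> .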

ex:IngredientLineShape
    a sh:NodeShape ;
    sh:targetClass ex:IngredientLine ;
    sh:closed true ;
    sh:ignoredProperties ( rdf:type rdfs:label rdfs:comment ) ;
    sh:property [
        sh:path ex:ingredient ;
        sh:class ex:BakingIngredient ;
        sh:nodeKind sh:IRI ;
    ] ;
    sh:property [
        sh:path ex:amount ;
        sh:minInclusive 0 ;
    ] .
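
The effect of sh:closed together with sh:ignoredProperties comes down to a set difference. The following Python sketch shows that reading under simplifying assumptions: properties are plain strings, and closed_shape_violations is a hypothetical helper, not part of any SHACL engine.

```python
def closed_shape_violations(node_properties, declared_paths, ignored_properties):
    """Return the properties on a node that a closed shape does not allow."""
    allowed = set(declared_paths) | set(ignored_properties)
    return sorted(set(node_properties) - allowed)

# Paths declared by the shape above, and the properties it ignores
declared = ["ex:ingredient", "ex:amount"]
ignored = ["rdf:type", "rdfs:label", "rdfs:comment"]

plain_line = ["rdf:type", "ex:ingredient", "ex:amount"]
line_with_unit = ["rdf:type", "ex:ingredient", "ex:amount", "ex:unit"]

print(closed_shape_violations(plain_line, declared, ignored))
print(closed_shape_violations(line_with_unit, declared, ignored))
```

The second call reports ex:unit, because this abbreviated version of the shape does not declare it. A closed shape has to declare every property the data may carry, which is why the complete shape in §5 adds a property shape for ex:unit.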

3.5 Sequence Paths

A Turtle list used as the value of sh:path denotes a property chain — not a constraint on a list, but a traversal expressed as a list. The distinction is easy to blur but important to keep clear.

@prefix sh: <http://www.w3.org/ns/shacl#> .
@prefix ex: <https://example.org/cookie#> .

# Every cookie recipe's author must have a verified contact email
ex:RecipeAuthorEmailShape
    a sh:PropertyShape ;
    sh:path ( ex:createdBy ex:email ) ;   # traverse ex:createdBy, then ex:email
    sh:minCount 1 ;
    sh:pattern "^[^@]+@[^@]+\\.[^@]+$" .
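
A sequence path is a traversal, which a few lines of Python can make explicit. This is a sketch only: the graph is a set of string triples, follow_sequence_path is a hypothetical helper, and ex:SomeBaker and the email address are made-up instance data that do not appear in the recipe card.

```python
import re

def follow_sequence_path(triples, focus, path):
    """Values reached from `focus` by following each predicate in `path` in order."""
    frontier = {focus}
    for predicate in path:
        frontier = {o for (s, p, o) in triples if s in frontier and p == predicate}
    return frontier

# Hypothetical instance data for illustration
triples = {
    ("ex:KitchenSinkCookies", "ex:createdBy", "ex:SomeBaker"),
    ("ex:SomeBaker", "ex:email", "baker@example.org"),
}

emails = follow_sequence_path(triples, "ex:KitchenSinkCookies", ["ex:createdBy", "ex:email"])
pattern = re.compile(r"^[^@]+@[^@]+\.[^@]+$")  # same expression as the sh:pattern above
print(emails, all(pattern.search(e) for e in emails))
```

The list ( ex:createdBy ex:email ) is consumed as an ordered series of hops. The value nodes that sh:minCount and sh:pattern then see are the email literals at the end of the chain, not the members of the path list.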

3.6 New in SHACL 1.2 — `sh:class` and `sh:datatype` as Lists

This is a quiet but practically significant addition. In SHACL 1.0, sh:class and sh:datatype accepted a single value; expressing multi-type constraints required either repeating the predicate or composing with sh:or. SHACL 1.2 allows these parameters to accept either a single resource or a SHACL list of resources directly.

@prefix sh:  <http://www.w3.org/ns/shacl#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix ex:  <https://example.org/cookie#> .

# sh:class as a list — an allergen ingredient must be both a BakingIngredient AND an Allergen
# (conjunctive — the value must satisfy all listed class constraints simultaneously)
ex:AllergenIngredientShape
    a sh:PropertyShape ;
    sh:path ex:allergenIngredient ;
    sh:class ( ex:BakingIngredient ex:Allergen ) .

# sh:datatype as a list — a measurement amount may be xsd:decimal OR xsd:integer
# (disjunctive — the literal matches if its datatype is any one of those listed)
ex:MeasurementAmountShape
    a sh:PropertyShape ;
    sh:path ex:amount ;
    sh:datatype ( xsd:decimal xsd:integer ) ;
    sh:minInclusive 0 .

The sh:class list is conjunctive: a value must be an instance of every class in the list simultaneously. The sh:datatype list is disjunctive: a literal matches if its datatype is any one of those listed. This asymmetry is intentional — it follows the semantics each constraint already had for repeated use. Our Kitchen-Sink Cookie recipe benefits directly: ingredient amounts might be whole numbers (2 eggs) or decimals (0.25 cups of sprinkles), and a single sh:datatype ( xsd:integer xsd:decimal ) handles both without reaching for sh:or.
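
The asymmetry described here is the difference between "all" and "any". A minimal Python sketch, assuming the semantics stated above and ignoring rdfs:subClassOf inference for brevity (both function names are hypothetical):

```python
def satisfies_class_list(types_of_value, class_list):
    """Conjunctive: the value must be an instance of every listed class."""
    return all(cls in types_of_value for cls in class_list)

def satisfies_datatype_list(literal_datatype, datatype_list):
    """Disjunctive: the literal's datatype must be one of those listed."""
    return any(literal_datatype == dt for dt in datatype_list)

classes = ["ex:BakingIngredient", "ex:Allergen"]
datatypes = ["xsd:decimal", "xsd:integer"]

print(satisfies_class_list({"ex:BakingIngredient", "ex:Allergen"}, classes))
print(satisfies_class_list({"ex:BakingIngredient"}, classes))
print(satisfies_datatype_list("xsd:integer", datatypes))
print(satisfies_datatype_list("xsd:string", datatypes))
```

A value typed only as ex:BakingIngredient fails the class list, while an xsd:integer amount passes the datatype list even though xsd:decimal is listed first.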

4. Lists as Validation Targets — The New §7.5 Constraint Family

This is where SHACL 1.2 breaks genuinely new ground. The four constraint components in §7.5 all target nodes that are SHACL lists — either directly or via a sh:path that navigates from the focus node to a list-valued property. And, just like the Kitchen-Sink Cookie itself, they work best when combined: length bounds, uniqueness, and per-member shape validation compose cleanly into a single property shape block.

The running example for this section is the full Kitchen-Sink Cookie recipe. Here is the instance data, drawn directly from the recipe card:

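@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix ex:  <https://example.org/cookie#> .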

# ── Baking ingredient definitions ────────────────────────────────────────────
ex:Butter          a ex:BakingIngredient ; ex:name "Butter"@en .
ex:GranulatedSugar a ex:BakingIngredient ; ex:name "Granulated Sugar"@en .
ex:BrownSugar      a ex:BakingIngredient ; ex:name "Brown Sugar"@en .
ex:Egg             a ex:BakingIngredient ; ex:name "Egg"@en .
ex:VanillaExtract  a ex:BakingIngredient ; ex:name "Vanilla Extract"@en .
ex:AllPurposeFlour a ex:BakingIngredient ; ex:name "All-Purpose Flour"@en .
ex:BakingSoda      a ex:BakingIngredient ; ex:name "Baking Soda"@en .
ex:Salt            a ex:BakingIngredient ; ex:name "Salt"@en .
ex:ChocolateChips  a ex:BakingIngredient ; ex:name "Chocolate Chips"@en .
ex:MixedNuts       a ex:BakingIngredient ; ex:name "Mixed Nuts"@en .
ex:RolledOats      a ex:BakingIngredient ; ex:name "Rolled Oats"@en .
ex:Sprinkles       a ex:BakingIngredient ; ex:name "Rainbow Sprinkles"@en .

# ── Ingredient line items (from the recipe card) ─────────────────────────────
ex:Ing01 a ex:IngredientLine ; ex:ingredient ex:Butter          ; ex:amount 1    ; ex:unit ex:Cup .
ex:Ing02 a ex:IngredientLine ; ex:ingredient ex:GranulatedSugar ; ex:amount 1    ; ex:unit ex:Cup .
ex:Ing03 a ex:IngredientLine ; ex:ingredient ex:BrownSugar      ; ex:amount 1    ; ex:unit ex:Cup .
ex:Ing04 a ex:IngredientLine ; ex:ingredient ex:Egg             ; ex:amount 2 .          # count, no unit
ex:Ing05 a ex:IngredientLine ; ex:ingredient ex:VanillaExtract  ; ex:amount 1    ; ex:unit ex:Teaspoon .
ex:Ing06 a ex:IngredientLine ; ex:ingredient ex:AllPurposeFlour ; ex:amount 2    ; ex:unit ex:Cup .
ex:Ing07 a ex:IngredientLine ; ex:ingredient ex:BakingSoda      ; ex:amount 1    ; ex:unit ex:Teaspoon .
ex:Ing08 a ex:IngredientLine ; ex:ingredient ex:Salt            ; ex:amount 0.5  ; ex:unit ex:Teaspoon .
ex:Ing09 a ex:IngredientLine ; ex:ingredient ex:ChocolateChips  ; ex:amount 1    ; ex:unit ex:Cup .
ex:Ing10 a ex:IngredientLine ; ex:ingredient ex:MixedNuts       ; ex:amount 0.5  ; ex:unit ex:Cup .
ex:Ing11 a ex:IngredientLine ; ex:ingredient ex:RolledOats      ; ex:amount 0.5  ; ex:unit ex:Cup .
ex:Ing12 a ex:IngredientLine ; ex:ingredient ex:Sprinkles       ; ex:amount 0.25 ; ex:unit ex:Cup .

# ── Step definitions ─────────────────────────────────────────────────────────
ex:CreamButterSugar
    a ex:PrepStep ;
    ex:instruction "Beat butter, granulated sugar, and brown sugar together until light and fluffy."@en ;
    ex:durationMinutes 5 .

ex:AddEggsVanilla
    a ex:PrepStep ;
    ex:instruction "Add eggs one at a time, beating well after each addition. Stir in vanilla extract."@en ;
    ex:durationMinutes 2 .

ex:MixDryIngredients
    a ex:PrepStep ;
    ex:instruction "Whisk together flour, baking soda, and salt in a separate bowl."@en ;
    ex:durationMinutes 2 .

ex:CombineMixtures
    a ex:PrepStep ;
    ex:instruction "Gradually blend the dry mixture into the butter mixture until just combined."@en ;
    ex:durationMinutes 3 .

ex:FoldInMixIns
    a ex:PrepStep ;
    ex:instruction "Fold in chocolate chips, mixed nuts, rolled oats, and sprinkles."@en ;
    ex:durationMinutes 2 .

ex:ChillDough
    a ex:PrepStep ;
    ex:instruction "Cover dough and refrigerate for at least 30 minutes."@en ;
    ex:durationMinutes 30 .

ex:BakeSheets
    a ex:BakingStep ;
    ex:instruction "Drop rounded tablespoons onto ungreased baking sheets. Bake at 375°F (190°C) for 10–12 minutes until golden at the edges."@en ;
    ex:durationMinutes 12 .

# ── The recipe itself (VALID instance) ───────────────────────────────────────
ex:KitchenSinkCookies
    a ex:CookieRecipe ;
    ex:cookieName "Kitchen-Sink Cookies"@en ;
    ex:cookieName "Cookies Tout-en-Un"@fr ;
    ex:category ex:DropCookie ;
    ex:yield 48 ;                        # makes ~48 cookies
    ex:ingredients (
        ex:Ing01 ex:Ing02 ex:Ing03 ex:Ing04 ex:Ing05 ex:Ing06
        ex:Ing07 ex:Ing08 ex:Ing09 ex:Ing10 ex:Ing11 ex:Ing12
    ) ;
    ex:steps (
        ex:CreamButterSugar
        ex:AddEggsVanilla
        ex:MixDryIngredients
        ex:CombineMixtures
        ex:FoldInMixIns
        ex:ChillDough
        ex:BakeSheets
    ) .

# ── Invalid instance 1: empty steps list ─────────────────────────────────────
ex:BadCookieRecipe1
    a ex:CookieRecipe ;
    ex:cookieName "Mystery Cookie"@en ;
    ex:category ex:DropCookie ;
    ex:ingredients ( ex:Ing01 ) ;
    ex:steps () .    # rdf:nil — violates sh:minListLength 1

# ── Invalid instance 2: duplicate ingredient in list ─────────────────────────
ex:BadCookieRecipe2
    a ex:CookieRecipe ;
    ex:cookieName "Double-Butter Mistake"@en ;
    ex:category ex:DropCookie ;
    ex:ingredients ( ex:Ing01 ex:Ing01 ex:Ing02 ) ;  # ex:Ing01 (Butter) repeated: violates sh:uniqueMembers true
    ex:steps ( ex:CreamButterSugar ex:BakeSheets ) .

# ── Invalid instance 3: too many steps ───────────────────────────────────────
ex:BadCookieRecipe3
    a ex:CookieRecipe ;
    ex:cookieName "Overcomplicated Masterpiece"@en ;
    ex:category ex:DropCookie ;
    ex:ingredients ( ex:Ing01 ex:Ing02 ) ;
    ex:steps (
        ex:CreamButterSugar ex:AddEggsVanilla ex:MixDryIngredients
        ex:CombineMixtures  ex:FoldInMixIns   ex:ChillDough
        ex:BakeSheets       ex:CreamButterSugar ex:AddEggsVanilla
        ex:MixDryIngredients ex:CombineMixtures ex:FoldInMixIns
        ex:ChillDough       ex:BakeSheets      ex:CreamButterSugar
    ) .  # 15 steps — violates sh:maxListLength 12

# ── Invalid instance 4: a step fails sh:memberShape ──────────────────────────
ex:IncompleteStep
    a ex:PrepStep .
    # No ex:instruction — will fail ex:BakingStepShape via sh:memberShape

ex:BadCookieRecipe4
    a ex:CookieRecipe ;
    ex:cookieName "Recipe with a Gap"@en ;
    ex:category ex:DropCookie ;
    ex:ingredients ( ex:Ing01 ex:Ing02 ) ;
    ex:steps ( ex:CreamButterSugar ex:IncompleteStep ex:BakeSheets ) .

4.1 `sh:memberShape` — Per-Member Validation

sh:memberShape is arguably the most significant of the four. It applies a shape to every member of the list: if any member fails, the constraint is violated. This replaces the SPARQL-based “iterate rdf:rest*/rdf:first and check each result” idiom with a single declarative triple.

Think of it as a quality inspector walking the baking-steps list and checking each one against a standard: every step must have a clear instruction, and the duration — if given — must be positive. The inspector does not care which step is first; it checks every one.

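@prefix sh:  <http://www.w3.org/ns/shacl#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix ex:  <https://example.org/cookie#> .

# Shape for an individual baking step
ex:BakingStepShape
    a sh:NodeShape ;
    sh:property [
        sh:path ex:instruction ;
        # The instructions in the recipe data carry a language tag ("..."@en), so their
        # datatype is rdf:langString; sh:datatype xsd:string would reject every step.
        sh:datatype rdf:langString ;
        sh:minCount 1 ;
        sh:maxCount 1 ;
        sh:message "Every step must have exactly one instruction string." ;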
    ] ;
    sh:property [
        sh:path ex:durationMinutes ;
        sh:datatype xsd:integer ;
        sh:maxCount 1 ;
        sh:minInclusive 1 ;
        sh:message "Duration, if present, must be a positive integer number of minutes." ;
    ] .

# Applied to the steps list on a cookie recipe
ex:CookieStepsShape
    a sh:PropertyShape ;
    sh:path ex:steps ;
    sh:memberShape ex:BakingStepShape .

Applied to ex:BadCookieRecipe4, the validator walks the steps list, evaluates each member against ex:BakingStepShape, and reports a violation for ex:IncompleteStep because it lacks ex:instruction. The result entry will carry sh:focusNode ex:BadCookieRecipe4, sh:resultPath ex:steps, and sh:value ex:IncompleteStep — precisely locating which list member caused the failure.
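
The walk-and-check behaviour can be sketched in plain Python. The code below is an illustration, not an engine: the graph is a set of string triples, the shape is reduced to a single hypothetical predicate (has_one_instruction), the instruction texts are abbreviated, and the result dictionaries simply reuse the property names mentioned above.

```python
RDF_FIRST, RDF_REST, RDF_NIL = "rdf:first", "rdf:rest", "rdf:nil"

def one(triples, subject, predicate):
    """The single object of (subject, predicate, ?o); assumes a well-formed list."""
    (value,) = [o for (s, p, o) in triples if s == subject and p == predicate]
    return value

def list_members(triples, head):
    """Members of the list starting at `head`, in order."""
    members = []
    while head != RDF_NIL:
        members.append(one(triples, head, RDF_FIRST))
        head = one(triples, head, RDF_REST)
    return members

def member_shape_results(triples, focus, path, conforms):
    """One result per list member that fails the `conforms` predicate."""
    results = []
    for s, p, head in sorted(triples):
        if s == focus and p == path:
            for member in list_members(triples, head):
                if not conforms(triples, member):
                    results.append({"focusNode": focus, "resultPath": path, "value": member})
    return results

def has_one_instruction(triples, step):
    """Stand-in for ex:BakingStepShape: exactly one ex:instruction value."""
    return len([o for (s, p, o) in triples if s == step and p == "ex:instruction"]) == 1

triples = {
    ("ex:BadCookieRecipe4", "ex:steps", "_:s1"),
    ("_:s1", RDF_FIRST, "ex:CreamButterSugar"), ("_:s1", RDF_REST, "_:s2"),
    ("_:s2", RDF_FIRST, "ex:IncompleteStep"), ("_:s2", RDF_REST, "_:s3"),
    ("_:s3", RDF_FIRST, "ex:BakeSheets"), ("_:s3", RDF_REST, RDF_NIL),
    ("ex:CreamButterSugar", "ex:instruction", "Beat butter and sugars until fluffy."),
    ("ex:BakeSheets", "ex:instruction", "Bake until golden at the edges."),
}

print(member_shape_results(triples, "ex:BadCookieRecipe4", "ex:steps", has_one_instruction))
```

The single result names ex:IncompleteStep as the value while keeping the recipe as the focus node, which is the reporting pattern described above.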

4.2 `sh:minListLength` and `sh:maxListLength` — Bounding the Sequence

These two constraints do for lists what sh:minCount and sh:maxCount do for property value sets. Their value is a non-negative xsd:integer. A cookie recipe with no steps is not a recipe; a cookie recipe with fifteen steps is either a science experiment or a test case for our validator.

@prefix sh:  <http://www.w3.org/ns/shacl#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix ex:  <https://example.org/cookie#> .

ex:CookieRecipeShape
    a sh:NodeShape ;
    sh:targetClass ex:CookieRecipe ;

    # Ingredients: at least 1, no upper bound — even a one-ingredient cookie is a cookie
    sh:property [
        sh:path ex:ingredients ;
        sh:minListLength 1 ;
        sh:message "A cookie recipe must have at least one ingredient." ;
    ] ;

    # Steps: between 1 and 12 inclusive — complex but not absurd
    sh:property [
        sh:path ex:steps ;
        sh:minListLength 1 ;
        sh:maxListLength 12 ;
        sh:message "A cookie recipe must have between 1 and 12 steps." ;
    ] .

Against our invalid instances:

ex:BadCookieRecipe1 (empty steps list, i.e., rdf:nil) fails sh:minListLength 1 on ex:steps.
ex:BadCookieRecipe3 (15 steps) fails sh:maxListLength 12 on ex:steps.

The empty list, rdf:nil, has length zero, which is the smallest possible length. As a result sh:minListLength 0 is a no-op, and omitting the constraint altogether has the same effect.
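
Once a list has been read into an ordinary sequence, the two length constraints reduce to a pair of comparisons. A minimal Python sketch, with a hypothetical helper name and placeholder members:

```python
def list_length_violations(members, min_length=None, max_length=None):
    """Names of the length constraints that a list of `members` breaks."""
    violated = []
    if min_length is not None and len(members) < min_length:
        violated.append("sh:minListLength")
    if max_length is not None and len(members) > max_length:
        violated.append("sh:maxListLength")
    return violated

print(list_length_violations([], min_length=1, max_length=12))             # empty steps list
print(list_length_violations(["step"] * 15, min_length=1, max_length=12))  # fifteen steps
print(list_length_violations(["step"] * 7, min_length=1, max_length=12))   # seven steps
print(list_length_violations([], min_length=0))                            # minimum of zero
```

The first two calls correspond to ex:BadCookieRecipe1 and ex:BadCookieRecipe3. The last shows why a minimum of zero never reports anything: no list can be shorter than the empty one.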

4.3 `sh:uniqueMembers` — No Duplicates

sh:uniqueMembers true requires that every member of the list is pairwise distinct under RDF term equality. A recipe that lists butter twice — perhaps entered by an overenthusiastic baker who really loves butter — fails this constraint.

@prefix sh: <http://www.w3.org/ns/shacl#> .
@prefix ex: <https://example.org/cookie#> .

ex:CookieIngredientsShape
    a sh:PropertyShape ;
    sh:path ex:ingredients ;
    sh:uniqueMembers true ;
    sh:message "The same ingredient line item may not appear more than once in the list." .

Against ex:BadCookieRecipe2, which lists ex:Ing01 (the butter line item) twice, the validator reports a violation. Note the blank node equality caveat: two structurally identical blank-node ingredient definitions would not be considered duplicates — blank nodes are equal only to themselves by identity. If deduplication across structural equivalence is required, that needs a SPARQL-based custom constraint; sh:uniqueMembers covers reference equality only.
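
The difference between term identity and structural equivalence is easy to see in a small Python sketch. Everything here is illustrative: terms are plain strings, the two blank nodes and their descriptions are hypothetical data, and neither helper corresponds to a real SHACL API.

```python
from collections import Counter

def repeated_members(members):
    """Members that occur more than once, compared by term identity."""
    counts = Counter(members)
    return [m for m, n in counts.items() if n > 1]

def repeated_descriptions(members, describe):
    """Groups of distinct members whose property descriptions are identical."""
    groups = {}
    for m in dict.fromkeys(members):  # each distinct term once, in list order
        groups.setdefault(describe(m), []).append(m)
    return [group for group in groups.values() if len(group) > 1]

print(repeated_members(["ex:Ing01", "ex:Ing01", "ex:Ing02"]))

# Two blank nodes with the same content (hypothetical data)
descriptions = {
    "_:b1": frozenset({("ex:ingredient", "ex:Butter"), ("ex:amount", "1")}),
    "_:b2": frozenset({("ex:ingredient", "ex:Butter"), ("ex:amount", "1")}),
}
blank_members = ["_:b1", "_:b2"]
print(repeated_members(blank_members))
print(repeated_descriptions(blank_members, descriptions.get))
```

The identity check flags ex:Ing01 but sees nothing wrong with the two blank nodes. Only the second helper, which compares what each node says rather than which node it is, groups them together, and that is the kind of check that would need a custom constraint.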

5. Composing List Constraints — The Complete Cookie Shape

The four constraints compose naturally. Here is the full shapes graph for the Kitchen-Sink Cookie domain, bringing everything together into a single, deployable validation package:

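@prefix sh:   <http://www.w3.org/ns/shacl#> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .
@prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix ex:   <https://example.org/cookie#> .

# ── Baking step shape ────────────────────────────────────────────────────────

ex:BakingStepShape
    a sh:NodeShape ;
    sh:property [
        sh:path ex:instruction ;
        sh:datatype rdf:langString ;  # instructions are language-tagged ("..."@en), so not xsd:string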
        sh:minCount 1 ; sh:maxCount 1 ;
    ] ;
    sh:property [
        sh:path ex:durationMinutes ;
        sh:datatype xsd:integer ;
        sh:maxCount 1 ;
        sh:minInclusive 1 ;
    ] .

# ── Ingredient line shape ────────────────────────────────────────────────────

ex:IngredientLineShape
    a sh:NodeShape ;
    sh:closed true ;
    sh:ignoredProperties ( rdf:type rdfs:label rdfs:comment ) ;
    sh:property [
        sh:path ex:ingredient ;
        sh:class ex:BakingIngredient ;
        sh:nodeKind sh:IRI ;
        sh:minCount 1 ; sh:maxCount 1 ;
    ] ;
    sh:property [
        sh:path ex:amount ;
        sh:datatype ( xsd:decimal xsd:integer ) ;  # either datatype (SHACL 1.2 list form)
        sh:minCount 1 ; sh:maxCount 1 ;
        sh:minInclusive 0 ;
    ] ;
    sh:property [
        sh:path ex:unit ;
        sh:in ( ex:Cup ex:Tablespoon ex:Teaspoon ex:Gram ex:Ounce ) ;
        sh:maxCount 1 ;
        # unit is optional — eggs are counted, not measured by volume
    ] .

# ── Cookie recipe shape ───────────────────────────────────────────────────────

ex:CookieRecipeShape
    a sh:NodeShape ;
    sh:targetClass ex:CookieRecipe ;

    # Name: at least one, in allowed languages (SHACL 1.0-era list parameter)
    sh:property [
        sh:path ex:cookieName ;
        sh:minCount 1 ;
        sh:languageIn ( "en" "fr" "de" "it" "ja" "es" ) ;
        sh:uniqueLang true ;
    ] ;

    # Category: one of the enumerated cookie types
    sh:property [
        sh:path ex:category ;
        sh:minCount 1 ; sh:maxCount 1 ;
        sh:in ( ex:DropCookie ex:BarCookie ex:RolledCookie ex:Shortbread ex:Biscotti ) ;
    ] ;

    # Ingredients: non-empty list, no duplicate line items,
    # every member conforming to ex:IngredientLineShape      ← §7.5 three constraints
    sh:property [
        sh:path ex:ingredients ;
        sh:minListLength 1 ;
        sh:uniqueMembers true ;
        sh:memberShape ex:IngredientLineShape ;
        sh:message "Ingredients must form a non-empty list of distinct, well-formed line items." ;
    ] ;

    # Steps: 1–12 steps, every member conforming to ex:BakingStepShape ← §7.5 three constraints
    sh:property [
        sh:path ex:steps ;
        sh:minListLength 1 ;
        sh:maxListLength 12 ;
        sh:memberShape ex:BakingStepShape ;
        sh:message "Steps must form a non-empty list of at most 12 well-formed step nodes." ;
    ] .

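Running this shapes graph against our four invalid instances produces the following violations:

ex:BadCookieRecipe1 fails sh:minListLength 1 on ex:steps (the list is empty).
ex:BadCookieRecipe2 fails sh:uniqueMembers true on ex:ingredients (ex:Ing01 appears twice).
ex:BadCookieRecipe3 fails sh:maxListLength 12 on ex:steps (the list has 15 members).
ex:BadCookieRecipe4 fails sh:memberShape on ex:steps (ex:IncompleteStep has no ex:instruction).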

Meanwhile, ex:KitchenSinkCookies conforms to all constraints: its steps list has seven well-formed members, its ingredients list has twelve distinct well-formed line items, and both list lengths are comfortably within bounds.

6. Implementation Notes for Node.js and Python Developers

SPARQL Fallback (Pre-1.2 Engines)

If your engine does not yet support §7.5 natively, each of the four constraints can be approximated with sh:sparql-based custom constraint components. The SPARQL below provides a portable fallback for sh:minListLength:

# Fallback validator for sh:minListLength
# Intended as the SELECT body of a SPARQL-based custom constraint.
# $this is the focus node; $PATH and $minListLength are supplied by the processor.

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX sh:  <http://www.w3.org/ns/shacl#>

SELECT $this (sh:minListLength AS ?sourceConstraintComponent) ?value
WHERE {
    $this $PATH ?value .
    # Visit every node of the chain, including rdf:nil, so that an empty
    # list still forms a group (with a member count of zero) and is reported.
    ?value rdf:rest* ?node .
    OPTIONAL { ?node rdf:first ?member }
}
GROUP BY $this ?value
HAVING (COUNT(?member) < $minListLength)

Note: The $PATH and $minListLength placeholders are substituted by the SHACL processor’s parameter binding mechanism when using sh:sparql constraint components. Consult your engine’s documentation for the exact expansion rules.

Apache Jena 6.x

Jena 6.0 is RDF 1.2 / SPARQL 1.2 native and the natural home for SHACL 1.2 work. The §7.5 list constraints are on the Jena SHACL roadmap following WD stabilisation. As of May 2026, verify with the Jena 6.x changelog before deploying sh:memberShape in production. The SPARQL fallback above runs cleanly on Jena 6 via sh:sparql.

rdflib (Python)

rdflib.collection.Collection provides idiomatic Python access to RDF linked lists. The following snippet traverses the steps list and checks each member for the presence of an instruction — the manual equivalent of sh:memberShape:

from rdflib import Graph, URIRef
from rdflib.collection import Collection

g = Graph()
g.parse("cookie-data.ttl", format="turtle")

COOKIE   = "https://example.org/cookie#"
recipe   = URIRef(COOKIE + "KitchenSinkCookies")
steps_p  = URIRef(COOKIE + "steps")
instr_p  = URIRef(COOKIE + "instruction")

for list_head in g.objects(recipe, steps_p):
    steps = list(Collection(g, list_head))
    print(f"Step count: {len(steps)}")
    for step in steps:
        instructions = list(g.objects(step, instr_p))
        if not instructions:
            print(f"  ✗ {step} — missing instruction (sh:memberShape violation)")
        else:
            print(f"  ✓ {step}")

N3.js / rdf-ext (Node.js)

In N3.js, convert the list head to a JavaScript array first, then apply shape checks. The utility below enforces SHACL list well-formedness (cycle detection, unique rdf:first/rdf:rest) as it traverses:

import { Store, DataFactory } from 'n3';
const { namedNode } = DataFactory;

const RDF_FIRST = namedNode('http://www.w3.org/1999/02/22-rdf-syntax-ns#first');
const RDF_REST  = namedNode('http://www.w3.org/1999/02/22-rdf-syntax-ns#rest');
const RDF_NIL   = namedNode('http://www.w3.org/1999/02/22-rdf-syntax-ns#nil');

// Key that keeps IRIs, blank nodes, and literals with the same lexical value apart
function termKey(term) {
    if (term.termType !== 'Literal') return `${term.termType}|${term.value}`;
    return `Literal|${term.value}|${term.datatype.value}|${term.language}`;
}

function listToArray(store, head) {
    const members = [];
    let current = head;
    const visited = new Set();
    while (!current.equals(RDF_NIL)) {
        const key = termKey(current);
        if (visited.has(key)) throw new Error('Cyclic RDF list detected');
        visited.add(key);
        // A SHACL list node has exactly one rdf:first and exactly one rdf:rest
        const firsts = store.getQuads(current, RDF_FIRST, null, null);
        if (firsts.length !== 1) throw new Error('Ill-formed list: expected exactly one rdf:first');
        members.push(firsts[0].object);
        const rests = store.getQuads(current, RDF_REST, null, null);
        if (rests.length !== 1) throw new Error('Ill-formed list: expected exactly one rdf:rest');
        current = rests[0].object;
    }
    return members;
}

// Usage: check sh:uniqueMembers manually
function checkUniqueMembers(store, listHead) {
    const members = listToArray(store, listHead);
    const seen = new Set();
    for (const m of members) {
        const key = termKey(m);
        if (seen.has(key)) return { valid: false, duplicate: m };
        seen.add(key);
    }
    return { valid: true };
}

7. Relationship to Other SHACL 1.2 Additions

The list constraint family does not exist in isolation. Two other SHACL 1.2 changes are worth noting in context.

sh:someValue (§7.8.3) is an existential constraint: it is satisfied when at least one value of a property conforms to a given shape. It does not iterate a list, but it can be combined with list-navigating paths to check that at least one step in the cookie recipe is an ex:BakingStep (as opposed to all steps being prep steps), which complements the universal quantification of sh:memberShape.

SHACL Node Expressions (shacl12-node-expr, a companion spec) provide a graph expression language for deriving focus nodes dynamically. List-aware expressions — selecting the n-th member, filtering members by predicate — are within the intended scope of that spec and will compose naturally with the §7.5 constraints when engine support matures.

Ordering constraints are not yet in Core. SHACL 1.2 can tell you that a steps list has between 1 and 12 well-formed, distinct members. It cannot tell you that ex:CreamButterSugar must precede ex:BakeSheets — that the butter must be creamed before the cookies go in the oven. Enforcing domain-specific sequence constraints requires either SHACL-SPARQL or application-level logic. This is the most significant remaining gap for workflow and recipe-style domains.
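
For the application-level route, a precedence check over an already-extracted list is short. The sketch below is one possible approach, not a SHACL feature: ordering_violations and the rule pairs are hypothetical, a rule is skipped when either step is absent from the list, and only the first occurrence of each step is considered.

```python
def ordering_violations(steps, must_precede):
    """Pairs (earlier, later) from `must_precede` that the list order breaks."""
    position = {}
    for index, step in enumerate(steps):
        position.setdefault(step, index)  # first occurrence wins
    broken = []
    for earlier, later in must_precede:
        if earlier in position and later in position and position[earlier] > position[later]:
            broken.append((earlier, later))
    return broken

rules = [("ex:CreamButterSugar", "ex:BakeSheets"), ("ex:ChillDough", "ex:BakeSheets")]
good = ["ex:CreamButterSugar", "ex:FoldInMixIns", "ex:ChillDough", "ex:BakeSheets"]
bad = ["ex:BakeSheets", "ex:CreamButterSugar", "ex:ChillDough"]

print(ordering_violations(good, rules))
print(ordering_violations(bad, rules))
```

Skipping rules whose steps are missing keeps this check separate from membership: whether a step must be present at all is a different question, better answered by the constraints already discussed.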

8. Conclusion — Everything in the Bowl

The Kitchen-Sink Cookie earns its name by refusing to leave anything out. A cup of chocolate chips, half a cup of nuts, a quarter cup of sprinkles — nothing is too much as long as the dough holds together and the constraints are met. SHACL 1.2’s approach to lists has the same spirit: a formal definition, four constraint components, tight integration with the existing parameter vocabulary, and enough composability to handle the overwhelming majority of real list-validation requirements in a single declarative shapes graph.

SHACL 1.0’s treatment of linked lists was a pragmatic absence. List validation was left to custom SPARQL components written by practitioners who needed it badly enough to write the traversal themselves. That worked, roughly, for simple cases. It produced fragile, engine-specific shapes that were hard to read and harder to maintain.

SHACL 1.2 changes the frame. The formal SHACL list definition in §1.1 is the keystone: it gives every subsequent constraint a stable, implementable foundation. The four components in §7.5 — sh:memberShape, sh:minListLength, sh:maxListLength, sh:uniqueMembers — cover the overwhelming majority of real list-validation requirements declaratively, readably, and in a way that any conformant SHACL 1.2 processor can evaluate without a single line of custom SPARQL.

The remaining gap — ordering constraints, positional access, conditional membership — is genuine, and likely to be addressed through the Node Expressions spec as it matures. But the gap that mattered most in practice was “I cannot declaratively say that every member of this list must conform to a shape,” and that gap is now closed.

The list, after two decades of awkwardness, has finally been tamed. Time to put the cookies in the oven.

References

SHACL 1.2 Core Working Draft (16 May 2026): https://www.w3.org/TR/2026/WD-shacl12-core-20260516/
SHACL 1.2 Overview: https://w3c.github.io/data-shapes/shacl12-overview/
SHACL 1.2 SPARQL Extensions: https://www.w3.org/TR/shacl12-sparql/
SHACL 1.2 Node Expressions: https://www.w3.org/TR/shacl12-node-expr/
RDF 1.2 Concepts: https://www.w3.org/TR/rdf12-concepts/
Apache Jena 6.x: https://jena.apache.org/
rdflib (Python): https://rdflib.readthedocs.io/
N3.js: https://github.com/rdfjs/N3.js

Do You Need An Upper Ontology?

Kurt Cagle — Wed, 13 May 2026 03:27:29 GMT

The Ontologist | Kurt Cagle & Chloe Shannon

Picture a reasonably sharp engineer. They’ve been told — by someone with authority and confidence — that the company needs an ontology. They’ve done the reading, watched the talks, installed Protégé. They’ve loaded the Basic Formal Ontology and are now staring at a class hierarchy that begins with Continuant and Occurrent, wondering what any of this has to do with their product catalogue, their clinical records, or their supply chain graph.

This is not a failure of intelligence. It is a failure of framing.

The question of whether you need an upper ontology is one the knowledge engineering community has settled into a comfortable non-answer: “it depends.” Seasoned ontologists have their tribes — BFO loyalists, DOLCE advocates, GIST pragmatists, UFO theorists — and a significant quiet faction who have shipped a dozen production systems without ever loading an upper ontology at all. But the AI wave has brought a new cohort into the room: engineers who have correctly diagnosed a representational problem, identified that they need some formal model of their domain, and are now being handed upper ontologies as if they were prerequisites — a foundation you lay before anything else can stand.

Our position is more pointed: for most projects, you do not need an upper ontology. And choosing one without understanding what you are actually buying into may make your problem significantly worse.

What an Upper Ontology Actually Is

An upper ontology is a general-purpose conceptual framework that defines foundational categories — things like Entity, Process, Role, Relation — from which domain-specific ontologies are intended to inherit. The pitch is interoperability: if your oncology system and my supply chain system both build on BFO, we share a common conceptual substrate, and in principle we can reason across them.

That pitch is not entirely wrong. But it obscures something important: an upper ontology is not a neutral foundation. It is a methodology in disguise. It encodes specific philosophical commitments — about what kinds of things exist, how change is modelled, how relationships are typed — and when you adopt one, you are adopting those commitments whether or not they suit your domain.

This is most visible in the seemingly trivial question of what to call a label.

Fifteen Ways to Say “Name”

Consider the humble label — the human-readable name attached to a thing in your graph. Across the major ontological frameworks you will find:

rdfs:label in RDFS and OWL
skos:prefLabel in SKOS
skosxl:literalForm in SKOS-XL
gist:name in GIST
schema:name in Schema.org
sh:name in SHACL
and several more depending on which framework you inherit from

The common assumption is that these distinctions encode meaningful semantic differences. In practice, they mostly don’t. They are naming conventions made by different framework designers working at roughly the same level of abstraction, making local decisions that felt reasonable at the time. The apparent diversity conceals a remarkable shallowness.
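If these properties really are interchangeable conventions, a consumer can treat them that way. The sketch below is one hedged illustration: a node's properties are a plain dict, and display_name returns the first value found among a preference-ordered tuple. The order is an arbitrary assumption of ours, and skosxl:literalForm is left out because it hangs off a separate label resource rather than the node itself.

```python
# Preference order is an assumption for illustration, not a standard.
LABEL_PROPERTIES = (
    "skos:prefLabel",
    "rdfs:label",
    "schema:name",
    "gist:name",
    "sh:name",
)

def display_name(properties, fallback="(unnamed)"):
    """Return the first available label value for a node.

    `properties` maps a prefixed property name to a list of literal values.
    """
    for prop in LABEL_PROPERTIES:
        values = properties.get(prop)
        if values:
            return values[0]
    return fallback

conference = {"schema:name": ["Knowledge and the Web Graph 2025"]}
print(display_name(conference))
```

That a five-line lookup is enough to paper over the differences is itself the point: the properties differ in name far more than in meaning.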

But here is the consequential part:

When you choose an upper ontology, you are not just choosing a term for “label.” You are choosing an entire system of such conventions — for typing, for relating, for modelling events, for handling change over time.
Every class you will ever create in your domain will need to fit somewhere in that system’s hierarchy, or you will be forking it.
And you will always be forking it.

Every Extension Is a Fork

No ontology is comprehensive. The moment you create a new class or property to describe something specific to your domain — and you will, inevitably — you have amended the contract. This matters more than it might seem.

An ontology is, at its core, an agreement. It is a shared terminology that multiple parties use to communicate, with the understanding that terms mean the same thing on both sides of the exchange. Adding a new term is implicitly an amendment to that agreement. When the number of parties is small, amendments are negotiable. When you are building on a framework with thousands of users, institutional adoption across multiple industries, and decades of prior art, amendment is essentially impossible.

This is why it is worth distinguishing between two fundamentally different kinds of ontology:

The inner ontology is a precise, purpose-built model for a specific domain context — a company, a department, an agency, a project. It is designed to be accurate and computationally tractable for that context. Precision is the point.

The messenger ontology is designed for communication across a large number of parties at the expense of precision and fidelity. Its terms are intentionally general, its hierarchy deliberately shallow. FIBO is a useful messenger ontology for the financial domain — genuinely impressive in scope, institutionally adopted, and deliberately incapable of being the only ontology you use if you actually need to compute with your data.

The foundational mistake, one that knowledge representation and then the semantic web community have been making for roughly forty years, is attempting to build something that is both. Every effort to create an upper ontology that serves both as a precise computational foundation and as a broad interoperability framework has either failed outright or survived by becoming so general as to require substantial domain-specific work to make it useful.

You cannot have both precision and universality in the same artefact. The physics will not allow it.

How People Actually Use OWL

Here is an observation that tends to make formal ontologists uncomfortable: most organisations that claim to use OWL do not, in any meaningful sense, use the reasoner.

What they actually have is a knowledge graph with some rdf:type declarations, perhaps a few rdfs:subClassOf relationships, possibly some cardinality annotations on properties. The model is, conceptually, a UML diagram that happens to serialise to Turtle. Boxes and lines. Cardinality labels on the lines. Maybe some inheritance arrows.

There is nothing wrong with this. It is, in fact, a perfectly reasonable way to model many domains. But it means that the formal logical apparatus of OWL — the Description Logic foundations, the open-world assumption, the decidable inference procedures — is entirely invisible to the modelling process and completely unused in practice.

When someone tells you that OWL reasons and SHACL does not, they are making an argument that applies to something different from what most organisations are actually building. They are describing the theoretical ceiling of OWL, not the operational floor on which real projects stand.

SHACL — especially SHACL 1.2 — handles everything that most knowledge graph teams are actually doing: typing, cardinality, value constraints, property paths, contextual shapes. And it does so with a critical advantage: it makes the operational semantics explicit. A SHACL shape is not a latent inference waiting to be triggered; it is a declarative description of what is expected in a specific graph context, fully inspectable, testable, and reproducible.

The formal objection is that SHACL combined with SPARQL is essentially Turing-complete, and therefore cannot provide the decidability guarantees of OWL DL. This is true. It is also, for almost every real-world application, irrelevant. Decidability is a theoretical property that matters when you are constructing a formal proof system. It does not matter when you are validating a hospital patient graph against a regulatory schema, or checking whether a product record conforms to a supplier agreement. What matters there is: does it give the right answer on real data in acceptable time? SHACL does. Reliably. Without requiring that you understand the description logic underpinning the inference engine.

More pointedly: SHACL is a constructor for ontological patterns. Because it is not itself semantic — it is shape-based, contextual, graph-relative — it can be used to express any ontological commitment you care to make. If you want transitive closure, you write a recursive SHACL shape backed by a SPARQL path. If you want inverse property semantics, you write a property shape that captures the bidirectional constraint. SHACL 1.2 adds recursive profiles that make this cleaner still. The distinction between “reasoning” and “validation with generation” is, in practice, a pipeline orchestration issue rather than a qualitative architectural difference.
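The "validation with generation" idea can be sketched outside any RDF stack. The plain-Python example below, which makes no claim about how a SHACL engine does it, first generates the transitive closure of a relation and then validates the result by flagging any node that ends up related to itself. The function names and the ex: class names are hypothetical.

```python
def transitive_closure(edges):
    """Generation step: return the transitive closure of (subject, object) pairs."""
    closure = set(edges)
    while True:
        derived = {(a, d) for (a, b) in closure for (c, d) in closure if b == c}
        if derived <= closure:  # nothing new was inferred, so we are done
            return closure
        closure |= derived

def validate_acyclic(edges):
    """Validation step: return (closure, sorted nodes that reach themselves)."""
    closure = transitive_closure(edges)
    cyclic = sorted({a for (a, b) in closure if a == b})
    return closure, cyclic

# Hypothetical subclass assertions.
subclass_of = {("ex:Poodle", "ex:Dog"), ("ex:Dog", "ex:Mammal")}
closure, cyclic = validate_acyclic(subclass_of)
print(("ex:Poodle", "ex:Mammal") in closure, cyclic)
# → True []
```

Whether the generated pairs are written back to the graph or only checked is exactly the pipeline orchestration decision the paragraph above describes.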

Put simply: SHACL builds ontologies. OWL encodes them. The two are not in competition; they operate at different layers. But if you need to do something with your model, the SHACL layer is where the work actually happens.

The AI Reckoning

Everything we have said so far applies with equal force to classical knowledge graph projects. But there are two additional considerations specific to AI systems that raise the stakes considerably.

Reification and the Narrative Problem

RDF 1.2 formalises reification — the capacity to describe an assertion rather than merely a node and its relationships. An RDF-Star triple carries metadata about the claim itself: who made it, when, with what confidence, under what conditions.

This matters for AI because large language models do not produce facts. They produce claims in context.
The fundamental epistemic unit in a GenAI system is not a triple — it is something closer to: “agent A, in context C, at time T, asserts that X relates to Y, with confidence P.” That structure is a reification with metadata. The traditional triple was always a lossy compression of it; RDF-Star finally stops pretending otherwise.
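As a rough data-structure sketch of that epistemic unit, the Python below wraps a triple in a claim record carrying agent, context, time and confidence. It is an illustration in plain dataclasses, not RDF 1.2 syntax; the class names, the agent and graph identifiers, and the believed helper are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass(frozen=True)
class Triple:
    subject: str
    predicate: str
    object: str

@dataclass(frozen=True)
class Claim:
    """Agent A, in context C, at time T, asserts a triple with confidence P."""
    triple: Triple
    agent: str
    context: str
    asserted_at: datetime
    confidence: float

    def __post_init__(self):
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be between 0 and 1")

def believed(claims, threshold):
    """Collapse claims to bare triples, keeping those at or above the threshold."""
    return {c.triple for c in claims if c.confidence >= threshold}

fact = Triple("pubs:paper-004", "dct:creator", "pubs:author-chen")
when = datetime(2026, 5, 13, tzinfo=timezone.utc)
claims = [
    Claim(fact, "agent:summariser", "urn:graph:draft", when, 0.40),
    Claim(fact, "agent:curator", "urn:graph:reviewed", when, 0.95),
]
print(len(believed(claims, 0.8)))
# → 1
```

The believed function is the lossy compression in miniature: two distinct claims go in, one context-free triple comes out, and who said it and how confidently is gone.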

Most narratives — the dominant mode of LLM output — are reification-driven. A narrative is a sequence of contextualised assertions, each carrying a provenance, a temporal position, and a confidence register. If you want to represent what an LLM knows, how it came to know it, and under what conditions that knowledge should be trusted or revised, you need a model built around reification.

The vast majority of existing upper ontologies were not built with reification as a first-class concern. BFO’s continuant/occurrent architecture addresses temporal extension by baking it into the class hierarchy of things: a clever solution to a specific problem, but one that treats change as intrinsic to objects rather than as a property of assertions about objects. RDF-Star externalises that assumption. This is not a technical upgrade; it is a different ontological commitment. An upper ontology grounded in the pre-Star world may actively obstruct the modelling patterns that AI systems require.

Named Graphs and the End of Binary World Assumptions

The second consideration concerns named graphs and what they actually do to the open/closed world debate.

Classical OWL operates under an open-world assumption: the absence of a fact is not evidence of its falsity. Classical databases operate under a closed-world assumption: if it is not recorded, it did not happen. Knowledge engineers have long treated this as a binary choice — one of the more frustrating conceptual obstacles in the field.

Named graphs dissolve this binary by making the world assumption local to a graph context. A named graph functions as what Active Inference practitioners would call a Markov blanket: a boundary that separates inside from outside, defines what is knowable within the enclosure, and controls what flows across the boundary. Within a named graph, you can legitimately apply closed-world reasoning. Across named graphs, the open-world assumption continues to hold. The graph is the scope of the commitment.
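One way to picture the graph-scoped world assumption is a three-valued lookup. In the hedged sketch below a dataset is a plain dict from graph name to a set of triples; asking inside a named graph yields True or False (closed world), while asking across the whole dataset yields True or None, meaning "not known" (open world). The holds function and the urn:graph: names are hypothetical.

```python
def holds(dataset, triple, graph=None):
    """Return True, False, or None (unknown) for a triple.

    With `graph` given, absence inside that graph counts as False.
    Without it, absence across the dataset only means "unknown".
    """
    if graph is not None:
        return triple in dataset[graph]
    if any(triple in triples for triples in dataset.values()):
        return True
    return None

dataset = {
    "urn:graph:registry": {
        ("pubs:paper-001", "dct:identifier", "10.1234/kwg.2025.001"),
    },
    "urn:graph:coauthors": {
        ("pubs:author-chen", "pubs:coAuthoredWith", "pubs:author-mehta"),
    },
}
missing = ("pubs:paper-002", "dct:identifier", "10.1234/kwg.2025.002")

print(holds(dataset, missing, graph="urn:graph:registry"))  # → False
print(holds(dataset, missing))                              # → None
```

The same absent triple produces two different answers, and the only thing that changed is the scope of the question.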

This architecture — containers of entities that are themselves containers, tracked by agentic processes, bounded by observable contexts — is the natural representational structure for AI systems that need to model their own epistemic state, track the provenance of what they believe, and reason about what they do not know. It is also a structure that most existing upper ontologies do not model well, because they were designed before named graphs were a serious modelling primitive.

The emerging graph model for AI-integrated systems will not look like past ontologies. The upper ontology tradition will need substantial revision to remain relevant — and if you are building now, adopting a framework that will need to be restructured in three years is a risk worth pricing explicitly.

The Bootstrapping Argument Is Now Broken

There is one more argument for upper ontologies that rarely gets stated explicitly, because it operates as an assumption rather than a claim: building a domain ontology from scratch is expensive. Trained ontologists, months of domain elicitation, formal review cycles, iteration before the model is stable enough to use. Under those conditions, inheriting from an existing framework looked like a shortcut. You got a partial model for free and only had to build the delta.

That calculus no longer holds.

It is now possible to use an LLM to generate a reasonably robust domain ontology and taxonomy — including SHACL shapes, property definitions, and class hierarchies scoped to your actual requirements — in a matter of hours. The first pass is rarely perfect, but the iteration cycle is fast enough that a competent practitioner can move from requirements to a working model in a week or less, rather than the months or years that a traditional ontology engineering process would demand.

This dismantles the bootstrapping argument entirely. The “free partial model” that an upper ontology provides is no longer saving you months of work. It is potentially costing you fit — you are now doing extra work to make your domain conform to someone else’s philosophical commitments, rather than modelling directly from your requirements, in exchange for a head start that you no longer need.

There is a second-order effect that matters just as much: the revisability dynamic shifts. An ontology that took eighteen months to build gets defended. You route around its limitations. You find reasons not to refactor it. An ontology that took a week and can be regenerated in an afternoon gets treated like code: refactored when the model does not fit, versioned properly, iterated without ceremony. That is a healthier epistemic relationship with your own model, and it applies regardless of whether an LLM was involved in its construction.

The ontology is now a living artefact rather than a monumental commitment. The risk calculus around upper ontologies was already questionable; remove the bootstrapping cost, and the case weakens considerably further.

When You Might Actually Need One

None of this means upper ontologies are without value. There are genuine cases where the investment is warranted:

Regulatory or institutional mandate. If your client, your parent organisation, or your regulatory environment requires a specific upper ontology, you use it. This is not an architectural decision; it is a constraint. You can still use SHACL on top of it — the composition argument holds — but you are not choosing freely.

Cross-organisational interoperability at scale. If you are building a system whose explicit purpose is to mediate communication across a very large number of independent organisations, a messenger ontology provides a shared vocabulary floor. Understand it as a messenger ontology — accept the imprecision — and build your inner ontology separately.

A team already fluent in a specific framework. Tooling, training, and institutional knowledge are real costs. If your team knows GIST well and your domain maps reasonably onto it, the familiarity dividend may outweigh the fit cost. This is an engineering argument, not an architectural one, but it is legitimate.

Theoretical alignment with your domain. Some upper ontologies do fit some domains unusually well. BFO was designed with biomedical research in mind, and the biomedical community has genuinely benefited from the shared foundation it provides in that specific context. If your domain happens to align with the philosophical commitments of a given framework, the structural advantages are real.

In all four cases, the operative word is qualified. You are not adopting an upper ontology because it is the correct foundation for all formal knowledge. You are making a specific tradeoff with specific costs, and you should go in with eyes open about what you are buying.

What to Do Instead

For most projects, the practical alternative is this:

Start with a domain ontology scoped to your actual problem. Define what you need. Name things consistently within your namespace. Accept that your terminology will not interoperate with every other system in the world, because it will not — regardless of whether you adopt an upper ontology.

Use SHACL shapes to encode your operational constraints, your validation logic, your property paths, and your contextual semantics. This is where your model’s behaviour actually lives.

Use SKOS for any concept scheme that needs to be shared across organisational boundaries — taxonomies, controlled vocabularies, reference lists. It is a messenger artefact designed for exactly that purpose and is appropriately lightweight.

Use RDF 1.2 reification wherever your system is modelling claims, narratives, provenance, or confidence — which, if you are working with LLM output, is nearly everywhere.

And use named graphs to scope your world assumptions, track your epistemic contexts, and give your agentic processes something coherent to reason about.

None of this requires an upper ontology. All of it will serve you better than a framework chosen because someone authoritative said it was required.

The Actual Question

The question “which upper ontology should I use?” assumes that the answer is one of the options on the list. It almost never is.

The prior question — the one worth asking first — is: what problem are you actually trying to solve, and at what scope? If the answer is “I need a precise model of my domain that my systems can reason over and validate against,” you almost certainly do not need an upper ontology. You need a well-scoped domain model, a SHACL layer, and clarity about what your graphs contain.

If the answer is “I need to communicate formally with a large number of external parties who will not all use my internal model,” you may need a messenger ontology — but you should choose it for its coverage of your communication requirements, not because it also claims to be a universal foundation.

Upper ontologies are not wrong. They are answers to a specific question. Most organisations, most of the time, are not asking that question — they have simply been told that they should be.

Know what question you are asking. Then choose the tool that answers it.

Kurt Cagle is a consulting ontologist and technical author based in Olympia, WA. He publishes The Cagle Report and AI+Semantics NewsBytes on LinkedIn, and The Ontologist and Inference Engineer on Substack. Copyright 2026 Kurt Cagle.

DataBook Pipelines

Kurt Cagle — Fri, 01 May 2026 05:43:43 GMT

By Kurt Cagle and Chloe Shannon

There is a particular frustration familiar to anyone who works seriously with structured data: the moment you hand a file to a colleague, a pipeline, or an AI system, you lose control of its context. The data travels; the explanation stays behind. What arrives at the other end is a stream of triples, or a block of JSON, with no indication of where it came from, what constraints it satisfies, or what questions it was designed to answer.

DataBooks address this at the format level — a DataBook is a Markdown document in which data and its documentation are the same artefact. But the format claim is only as useful as the tooling that makes it operational. This article is not about the format. It is about the pipeline: a sequence of concrete workflows that demonstrate how a DataBook moves from initial creation through querying, validation, transformation, and AI-assisted analysis, while maintaining its provenance chain at every step.

We will work throughout with a single dataset: a small research paper registry. The same five entries will travel through every workflow, accumulating context as they go.

The tooling is databook-cli v1.4.0, which ships nineteen commands covering the full data lifecycle. We will use fourteen of them.

The code for the databook CLI is in the implementation folder at https://github.com/kurtcagle/databook. It is available under the Apache open-source licence, but if you do use it, please let me know, as I’d like feedback on how I can improve it.

The Dataset: A Research Paper Registry

Before any tooling, we need data. Our registry tracks five papers and a shared conference, together with the four authors who wrote them, described using Dublin Core Terms (dct:), Schema.org (schema:), and FOAF (foaf:). We have introduced deliberate quality problems, because a pipeline that only processes clean data is not a pipeline worth demonstrating. The listing below is the papers.ttl file used in the commands that follow.

@prefix pubs:    .
@prefix dct:    <http://purl.org/dc/terms/> .
@prefix schema:  .
@prefix foaf:   <http://xmlns.com/foaf/0.1/> .
@prefix xsd:    <http://www.w3.org/2001/XMLSchema#> .

# ── Authors ──────────────────────────────────────────────────────────────────

pubs:author-mehta a foaf:Person ;
  foaf:name "Priya Mehta" ;
  schema:affiliation pubs:univ-edinburgh .

pubs:author-eriksson a foaf:Person ;
  foaf:name "Lars Eriksson" ;
  schema:affiliation pubs:univ-stockholm .

pubs:author-diallo a foaf:Person ;
  foaf:name "Amara Diallo" ;
  schema:affiliation pubs:inst-dakar .

pubs:author-chen a foaf:Person ;
  foaf:name "Chen Wei" .
  # NOTE: affiliation intentionally absent — triggers sh:Warning

# ── Conference ───────────────────────────────────────────────────────────────

pubs:conf-kwg-2025 a schema:Event ;
  schema:name "Knowledge and the Web Graph 2025" ;
  schema:startDate "2025-09-15"^^xsd:date ;
  schema:location "Edinburgh, UK" .

# ── Papers ───────────────────────────────────────────────────────────────────

pubs:paper-001 a schema:ScholarlyArticle ;
  dct:title "Holonic Graph Architecture for Distributed Knowledge Systems" ;
  dct:creator pubs:author-mehta, pubs:author-eriksson ;
  dct:date "2025-09-15"^^xsd:date ;
  dct:identifier "10.1234/kwg.2025.001" ;
  dct:subject "knowledge graphs", "holonic systems", "distributed architectures" ;
  schema:isPartOf pubs:conf-kwg-2025 .

pubs:paper-002 a schema:ScholarlyArticle ;
  dct:title "Active Inference in Semantic Web Agents" ;
  dct:creator pubs:author-diallo ;
  dct:date "2025-09-16"^^xsd:date ;
  dct:subject "active inference", "semantic web", "agents" ;
  schema:isPartOf pubs:conf-kwg-2025 .
  # NOTE: dct:identifier (DOI) absent — triggers sh:Violation

pubs:paper-003 a schema:ScholarlyArticle ;
  dct:title "SHACL-Based Quality Assurance in Research Data Management" ;
  dct:creator pubs:author-mehta, pubs:author-chen ;
  dct:date "September 2025" ;
  # NOTE: date is a plain string, not xsd:date — triggers sh:Violation
  dct:identifier "10.1234/kwg.2025.003" ;
  dct:subject "SHACL", "data quality", "research data management" ;
  schema:isPartOf pubs:conf-kwg-2025 .

pubs:paper-004 a schema:ScholarlyArticle ;
  dct:title "Temporal Reasoning with RDF 1.2 Reification" ;
  dct:creator pubs:author-chen, pubs:author-eriksson ;
  dct:date "2025-09-17"^^xsd:date ;
  dct:identifier "10.1234/kwg.2025.004" ;
  dct:subject "RDF 1.2", "temporal reasoning", "reification" ;
  schema:isPartOf pubs:conf-kwg-2025 .

pubs:paper-005 a schema:ScholarlyArticle ;
  dct:title "Knowledge Graph Embeddings for Ontology Alignment" ;
  dct:creator pubs:author-diallo, pubs:author-mehta ;
  dct:date "2025-09-18"^^xsd:date ;
  dct:identifier "10.1234/kwg.2025.005" ;
  dct:subject "knowledge graph embeddings", "ontology alignment", "machine learning" ;
  schema:isPartOf pubs:conf-kwg-2025 .
  # NOTE: "machine learning" not in approved subject vocabulary — triggers sh:Info

The four quality problems embedded here (one missing DOI, one malformed date, one unapproved subject term, and Chen Wei’s missing affiliation) will each generate a specific severity level when we reach the validation workflow. We have chosen these particular failures because they represent the full range of SHACL 1.2 severity: sh:Violation (the data is structurally wrong), sh:Warning (the data is incomplete in a way that may matter), and sh:Info (the data deviates from convention but is not invalid).

We will also need a SHACL shapes file. Save this as paper-shapes.shacl:

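@prefix sh:     <http://www.w3.org/ns/shacl#> .
@prefix pubs:    .
@prefix dct:    <http://purl.org/dc/terms/> .
@prefix schema:  .
@prefix foaf:   <http://xmlns.com/foaf/0.1/> .
@prefix xsd:    <http://www.w3.org/2001/XMLSchema#> .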

pubs:ScholarlyArticleShape
  a sh:NodeShape ;
  sh:targetClass schema:ScholarlyArticle ;

  # Required: title
  sh:property [
    sh:path dct:title ;
    sh:minCount 1 ;
    sh:datatype xsd:string ;
    sh:message "Every paper must have a dct:title." ;
    sh:severity sh:Violation ;
  ] ;

  # Required: DOI identifier
  sh:property [
    sh:path dct:identifier ;
    sh:minCount 1 ;
    sh:message "Every paper must have a dct:identifier (DOI)." ;
    sh:severity sh:Violation ;
  ] ;

  # DOI format check
  sh:property [
    sh:path dct:identifier ;
    sh:pattern "^10\\.\\d{4,}/" ;
    sh:message "Identifier does not appear to be a valid DOI (expected 10.XXXX/ prefix)." ;
    sh:severity sh:Warning ;
  ] ;

  # Required: date, typed as xsd:date
  sh:property [
    sh:path dct:date ;
    sh:minCount 1 ;
    sh:datatype xsd:date ;
    sh:message "Every paper must have a dct:date of type xsd:date." ;
    sh:severity sh:Violation ;
  ] ;

  # Required: at least one creator
  sh:property [
    sh:path dct:creator ;
    sh:minCount 1 ;
    sh:message "Every paper must have at least one dct:creator." ;
    sh:severity sh:Violation ;
  ] ;

  # Creators should be typed as foaf:Person
  sh:property [
    sh:path dct:creator ;
    sh:class foaf:Person ;
    sh:message "Creator should be a foaf:Person." ;
    sh:severity sh:Warning ;
  ] ;

  # Subject vocabulary constraint (informational)
  sh:property [
    sh:path dct:subject ;
    sh:in ( "knowledge graphs" "semantic web" "SHACL" "RDF 1.2"
            "active inference" "ontology alignment" "knowledge graph embeddings"
            "holonic systems" "distributed architectures" "agents"
            "temporal reasoning" "reification" "data quality"
            "research data management" ) ;
    sh:message "Subject term is not in the approved vocabulary." ;
    sh:severity sh:Info ;
  ] .

pubs:PersonShape
  a sh:NodeShape ;
  sh:targetClass foaf:Person ;

  sh:property [
    sh:path schema:affiliation ;
    sh:minCount 1 ;
    sh:message "Authors should declare an institutional affiliation." ;
    sh:severity sh:Warning ;
  ] .
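Read as a rule table, the article shape above maps each kind of failure to a severity. The plain-Python analogue below is only a sketch of that mapping, not a SHACL processor: papers are hypothetical dicts, a typed xsd:date is modelled as datetime.date and a plain literal as str, the missing-date and wrong-datatype cases are collapsed into one message, and check_paper is our own name. The regular expression and the approved subject list are copied from the shapes.

```python
import re
from datetime import date

# Same regular expression as the DOI format shape above.
DOI_PATTERN = re.compile(r"^10\.\d{4,}/")

APPROVED_SUBJECTS = {
    "knowledge graphs", "semantic web", "SHACL", "RDF 1.2",
    "active inference", "ontology alignment", "knowledge graph embeddings",
    "holonic systems", "distributed architectures", "agents",
    "temporal reasoning", "reification", "data quality",
    "research data management",
}

def check_paper(paper):
    """Return (severity, message) pairs for one paper record."""
    results = []
    identifiers = paper.get("identifier", [])
    if not identifiers:
        results.append(("Violation", "missing dct:identifier (DOI)"))
    for value in identifiers:
        if not DOI_PATTERN.search(value):
            results.append(("Warning", f"{value!r} does not look like a DOI"))
    dates = paper.get("date", [])
    if not dates or any(not isinstance(d, date) for d in dates):
        results.append(("Violation", "dct:date missing or not typed as xsd:date"))
    for term in paper.get("subject", []):
        if term not in APPROVED_SUBJECTS:
            results.append(("Info", f"subject {term!r} is not in the approved vocabulary"))
    return results

# The three flawed papers from the registry, reduced to the checked properties.
PAPERS = {
    "pubs:paper-002": {
        "date": [date(2025, 9, 16)],
        "subject": ["active inference", "semantic web", "agents"],
    },
    "pubs:paper-003": {
        "identifier": ["10.1234/kwg.2025.003"],
        "date": ["September 2025"],
        "subject": ["SHACL", "data quality", "research data management"],
    },
    "pubs:paper-005": {
        "identifier": ["10.1234/kwg.2025.005"],
        "date": [date(2025, 9, 18)],
        "subject": ["knowledge graph embeddings", "ontology alignment", "machine learning"],
    },
}

for paper_id, record in PAPERS.items():
    for severity, message in check_paper(record):
        print(paper_id, severity, message)
```

Running it reports one Violation each for paper-002 and paper-003 and one Info for paper-005, which is the outcome the annotations in the dataset predict for the article shape.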

With data and shapes in hand, we are ready to build the pipeline.

Workflow 1: Building the DataBook

A DataBook begins with databook create, which wraps one or more data files into a well-formed document. The command auto-detects file type from the extension, counts triples in RDF blocks, and generates frontmatter including identity, provenance, and graph statistics.

databook create papers.ttl \
  --id https://example.org/databooks/research-registry-v1 \
  --title "Research Paper Registry — KWG 2025" \
  --description "Registry of papers submitted to the Knowledge and the Web Graph 2025 conference." \
  -o registry.databook.md

The result is a .databook.md file with YAML frontmatter containing the document identity, a process stamp recording that databook create produced it, and a turtle fenced block carrying the raw data. The head command lets you inspect what was built without opening the file:

databook head registry.databook.md

Output (abbreviated):

id: https://example.org/databooks/research-registry-v1
title: "Research Paper Registry — KWG 2025"
type: databook
version: 1.0.0
created: "2026-04-30"
graph:
  triple_count: 48
  subjects: 11
  rdf_version: "1.1"
blocks:
  - id: papers-data
    label: turtle
    lines: 72

The DataBook now knows it contains 48 triples across 11 subjects. This metadata travels with the data.
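For readers who want to see the layout rather than take it on trust, the sketch below splits a Markdown document into frontmatter and fenced blocks using only the standard library. It is not the databook-cli implementation: it assumes the frontmatter is delimited by lines containing only three hyphens, it reports each block's fence label rather than its block id (how ids are attached is not modelled here), and the function names are hypothetical.

```python
FENCE = "`" * 3  # a Markdown code fence, built this way so the example can itself be fenced

def split_frontmatter(text):
    """Split a Markdown document into (frontmatter, body)."""
    lines = text.splitlines()
    if lines and lines[0].strip() == "---":
        for i in range(1, len(lines)):
            if lines[i].strip() == "---":
                return "\n".join(lines[1:i]), "\n".join(lines[i + 1:])
    return "", text

def fenced_blocks(body):
    """Return (label, content) for each fenced block, in document order."""
    blocks, label, buffer = [], None, []
    for line in body.splitlines():
        if line.startswith(FENCE):
            if label is None:            # opening fence
                label = line[len(FENCE):].strip() or "text"
                buffer = []
            else:                        # closing fence
                blocks.append((label, "\n".join(buffer)))
                label = None
        elif label is not None:
            buffer.append(line)
    return blocks

doc = "\n".join([
    "---",
    "id: https://example.org/databooks/research-registry-v1",
    "type: databook",
    "---",
    "## Data",
    FENCE + "turtle",
    'pubs:author-chen foaf:name "Chen Wei" .',
    FENCE,
])

frontmatter, body = split_frontmatter(doc)
print(frontmatter.splitlines()[1])
print([label for label, _ in fenced_blocks(body)])
```

Prose outside the fences, such as the heading in the sample, is ignored by fenced_blocks, which is why documentation and data can share one file without getting in each other's way.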

Adding blocks with `insert`

The initial DataBook carries only the raw data. We want to add the SHACL shapes, a SPARQL query for later use, and a prose section explaining the validation approach. The insert command handles this without recreating the document from scratch:

# Add the SHACL shapes as a named block
databook insert registry.databook.md paper-shapes.shacl \
  --id paper-shapes \
  --lang shacl \
  --markdown "## Validation Shapes\n\nThe following SHACL shapes define the expected structure for each paper entry."

# Add a SELECT query block
databook insert registry.databook.md select-by-author.sparql \
  --id select-by-author \
  --lang sparql \
  --after paper-shapes \
  --markdown "## Registry Queries"

The --before and --after flags control positioning within the document. The --markdown flag prepends a prose section directly above the inserted block, maintaining the literate structure.

The counterpart to insert is drop, which removes a named block (and optionally its preceding prose) without touching the rest of the document. Together, create, insert, and drop form a complete block lifecycle — DataBooks are mutable working documents, not write-once artefacts.

Extracting content back out

Any block can be extracted to a standalone file at any point in the pipeline:

databook extract registry.databook.md#papers-data -o papers-extracted.ttl
databook extract registry.databook.md#paper-shapes -o shapes-extracted.shacl

The fragment-addressing syntax (file.databook.md#block-id) works consistently across all commands that accept block references.
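The addressing scheme is simple enough to sketch in a few lines. The helper below is a hypothetical illustration of splitting a reference into file path and block id, not code from the CLI; treating a trailing "#" with no id as an error is our own choice.

```python
def parse_block_ref(ref):
    """Split 'file.databook.md#block-id' into (path, block_id).

    block_id is None when the reference names the whole document.
    """
    path, separator, block_id = ref.partition("#")
    if not path:
        raise ValueError("reference has no file path")
    if separator and not block_id:
        raise ValueError("reference ends with '#' but names no block")
    return path, (block_id or None)

print(parse_block_ref("registry.databook.md#papers-data"))
# → ('registry.databook.md', 'papers-data')
```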

Interlude: Shapes as Queries — `shacl2sparql`

SHACL shapes are normally thought of as constraints: they describe what valid data looks like and reject what falls short. There is another way to read them. A sh:NodeShape targeting schema:ScholarlyArticle with a set of sh:property constraints is also, implicitly, a specification of what to retrieve: find all nodes of the target class, for each of which the listed properties should hold. The constraint graph and the retrieval graph are duals of each other.

databook shacl2sparql makes this explicit. It reads a SHACL shapes block and compiles it to SPARQL SELECT and CONSTRUCT queries that retrieve all focus nodes satisfying the shape. SHACL 1.2 Node Expressions — sh:values, sh:filterShape, sh:intersection, sh:union — compile to their natural SPARQL equivalents.

databook shacl2sparql registry.databook.md \
  --block-id paper-shapes \
  --type select \
  --insert

The --insert flag writes the generated query back into the DataBook as a new named block. The result looks like this:

SELECT DISTINCT ?focusNode WHERE {
  ?focusNode /
             *
              .
  {
    SELECT ?focusNode WHERE {
      ?focusNode  ?_cv_0 .
    }
    GROUP BY ?focusNode
    HAVING (COUNT(?_cv_0) >= 1)
  }
  OPTIONAL { ?focusNode  ?_val_1 . }
  FILTER(DATATYPE(?_val_1) = )
  {
    SELECT ?focusNode WHERE {
      ?focusNode  ?_cv_2 .
    }
    GROUP BY ?focusNode
    HAVING (COUNT(?_cv_2) >= 1)
  }
}

This query returns exactly the papers that would pass the SHACL constraints — which is to say, not paper-002 (missing DOI) and not paper-003 (malformed date). The same shapes that define quality define retrieval. This is a meaningful architectural point: your constraint language and your query language are no longer independent artefacts that have to be kept in sync.
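The duality can be shown in miniature. The sketch below compiles a drastically reduced shape, a target class plus a list of properties that must occur at least once, into a SPARQL SELECT string. It is not the algorithm shacl2sparql uses (its real output, shown above, relies on grouped subqueries), it handles only the minimum-count-of-one case, and shape_to_select is a hypothetical name. The generated text uses prefixed names, so the matching PREFIX declarations would need to be prepended before running it.

```python
def shape_to_select(target_class, required_paths):
    """Compile a minimal 'class + required properties' shape to a SELECT query."""
    lines = [
        "SELECT DISTINCT ?focusNode WHERE {",
        f"  ?focusNode rdf:type/rdfs:subClassOf* {target_class} .",
    ]
    for index, path in enumerate(required_paths):
        # One existence test per required property (the minCount 1 case).
        lines.append(f"  FILTER EXISTS {{ ?focusNode {path} ?v{index} . }}")
    lines.append("}")
    return "\n".join(lines)

query = shape_to_select(
    "schema:ScholarlyArticle",
    ["dct:title", "dct:identifier", "dct:date", "dct:creator"],
)
print(query)
```

Even this toy version would leave out paper-002, because the existence test on dct:identifier fails for it; excluding paper-003 additionally needs a datatype test, which the sketch omits.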

Workflow 2: Querying the Registry

For the remaining workflows, we assume the DataBook’s data blocks have been pushed to a local Jena instance (we cover the push mechanics in Workflow 4). All query commands target the local endpoint via the -d local shorthand, which resolves to the [local] server entry in processors.toml.

The registry.databook.md file now carries a hand-authored SELECT query in the select-by-author block:

PREFIX dct:    <http://purl.org/dc/terms/>
PREFIX foaf:   <http://xmlns.com/foaf/0.1/>
PREFIX schema: 

SELECT ?title ?authorName ?date ?doi WHERE {
  ?paper a schema:ScholarlyArticle ;
    dct:title    ?title ;
    dct:date     ?date ;
    dct:creator  ?author .
  ?author foaf:name ?authorName .
  OPTIONAL { ?paper dct:identifier ?doi . }
}
ORDER BY ?authorName ?date

Execute it:

databook sparql registry.databook.md#select-by-author -d local --format json -o results.json

The result is a SPARQL JSON results file. Two conversions are possible from here. For a Markdown table:

databook convert results.json --to markdown -o results-table.md

Producing (abbreviated):

| title | authorName | date | doi |
|---|---|---|---|
| Active Inference in Semantic Web Agents | Amara Diallo | 2025-09-16 | *(absent)* |
| Knowledge Graph Embeddings for Ontology Alignment | Amara Diallo | 2025-09-18 | 10.1234/kwg.2025.005 |
| SHACL-Based Quality Assurance in Research Data Management | Chen Wei | *(absent — type error)* | 10.1234/kwg.2025.003 |
...

Notice that paper-002 and paper-003 appear in these results despite their flaws — SELECT retrieves what is present, not what is conformant. The SHACL-compiled query from shacl2sparql would have excluded them.
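For readers who want to see what the Markdown conversion involves, here is a minimal Python sketch that renders a SPARQL 1.1 JSON results document as a table. It is not the databook convert implementation: the function name, the placeholder text for unbound variables, and the sample data are assumptions made for illustration.

```python
import json


def bindings_to_markdown(results, absent="*(absent)*"):
    """Render a SPARQL 1.1 JSON results document as a Markdown table."""
    variables = results["head"]["vars"]
    rows = [
        "| " + " | ".join(variables) + " |",
        "|" + "|".join("---" for _ in variables) + "|",
    ]
    for binding in results["results"]["bindings"]:
        # Variables left unbound by OPTIONAL are missing from the binding.
        cells = [
            binding[var]["value"] if var in binding else absent
            for var in variables
        ]
        rows.append("| " + " | ".join(cells) + " |")
    return "\n".join(rows)


# Hypothetical results document with one unbound and one bound DOI.
sample = json.loads("""
{
  "head": {"vars": ["title", "doi"]},
  "results": {"bindings": [
    {"title": {"type": "literal", "value": "Active Inference in Semantic Web Agents"}},
    {"title": {"type": "literal", "value": "Knowledge Graph Embeddings for Ontology Alignment"},
     "doi": {"type": "literal", "value": "10.1234/kwg.2025.005"}}
  ]}
}
""")
table = bindings_to_markdown(sample)
print(table)
```

A production converter would also need to escape pipe characters inside cell values, which this sketch does not attempt.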

Workflow 3: Constructing an Output Graph

Where SELECT returns a table, CONSTRUCT returns a graph. Our registry has an implicit co-authorship structure: pairs of authors who have published together. The following CONSTRUCT query makes that structure explicit as a new graph:

PREFIX pubs:   
PREFIX dct:    <http://purl.org/dc/terms/>
PREFIX foaf:   <http://xmlns.com/foaf/0.1/>
PREFIX schema: 

CONSTRUCT {
  ?a1 pubs:coAuthoredWith ?a2 .
  ?a1 a foaf:Person .
  ?a2 a foaf:Person .
}
WHERE {
  ?paper a schema:ScholarlyArticle ;
    dct:creator ?a1, ?a2 .
  FILTER(?a1 != ?a2)
  ?a1 foaf:name ?n1 .
  ?a2 foaf:name ?n2 .
  FILTER(STR(?n1) < STR(?n2))
}

databook sparql registry.databook.md#construct-coauthors \
  -d local --format turtle -o coauthorship.databook.md

The output is a new DataBook wrapping the co-authorship graph, complete with a process stamp recording that the CONSTRUCT query produced it. From here, each format conversion is a single command:

# Turtle → JSON-LD
databook convert coauthorship.databook.md#coauthorship-graph --to json-ld -o coauthorship.jsonld

# Turtle → RDF-XML
databook convert coauthorship.databook.md#coauthorship-graph --to xml-rdf -o coauthorship.rdf

The --to xml-rdf label is the DataBook CLI’s notation for RDF/XML serialisation (distinct from arbitrary XML, which uses --to xml). The convert --list command will enumerate all available target formats for any given source type.
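The two FILTER clauses in the CONSTRUCT query above do a specific job: the first removes self-pairs, and the second keeps only one ordering of each pair, so that the same collaboration does not appear twice. The Python sketch below shows the same pairing logic on plain data. The function name and the sample author lists are hypothetical, including the invented third author.

```python
from itertools import combinations


def coauthor_pairs(papers):
    """Return each unordered co-author pair once, sorted by author name."""
    pairs = set()
    for authors in papers.values():
        # combinations() over the sorted, de-duplicated names yields each
        # pair once, with the alphabetically earlier name first.
        for first, second in combinations(sorted(set(authors)), 2):
            pairs.add((first, second))
    return sorted(pairs)


# Hypothetical author lists; the third author name is invented.
papers = {
    "paper-001": ["Amara Diallo", "Chen Wei"],
    "paper-004": ["Chen Wei", "Amara Diallo", "Priya Nair"],
}
print(coauthor_pairs(papers))
```

Like the query, the sketch orders pairs by name, so two distinct authors who happen to share a name would not be paired.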

Workflow 4: Persistence and Enrichment

Pushing to Jena

databook push transfers each pushable block in the DataBook to the triplestore via the SPARQL 1.1 Graph Store Protocol. With the default graph behaviour in v1.2.0, omitting --graph sends triples to the triplestore’s default graph. For our registry we want distinct named graphs, so we push explicitly:

databook push registry.databook.md \
  --block-id papers-data \
  --graph https://example.org/graphs/papers \
  -d local

databook push registry.databook.md \
  --block-id paper-shapes \
  --graph https://example.org/graphs/paper-shapes \
  -d local

The --graph flag accepts any IRI. The --no-meta flag suppresses the automatic frontmatter graph (sent by default to {document.id}#meta), which is useful when you want clean separation between data and provenance graphs.
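It may help to see what a Graph Store Protocol request looks like on the wire. The Python sketch below builds, but does not send, such a request using only the standard library. The endpoint URL is a Fuseki-style assumption, the function name is hypothetical, and whether databook push issues a POST (merge) or a PUT (replace) is not stated here, so the sketch exposes both.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_gsp_request(endpoint, turtle, graph_iri=None, replace=False):
    """Build a SPARQL 1.1 Graph Store Protocol request without sending it."""
    # A named graph is addressed with ?graph=<encoded IRI>; the default
    # graph is addressed with ?default.
    query = urlencode({"graph": graph_iri}) if graph_iri else "default"
    return Request(
        url=f"{endpoint}?{query}",
        data=turtle.encode("utf-8"),
        # PUT replaces the graph's contents; POST merges into them.
        method="PUT" if replace else "POST",
        headers={"Content-Type": "text/turtle; charset=utf-8"},
    )


# Hypothetical local endpoint and a one-triple payload.
request = build_gsp_request(
    "http://localhost:3030/registry/data",
    "<urn:example:s> <urn:example:p> <urn:example:o> .",
    graph_iri="https://example.org/graphs/papers",
)
print(request.get_method(), request.full_url)
```

Passing the request to urllib.request.urlopen would send it; that step is left out so the sketch runs without a triplestore.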

Enriching with SPARQL Update

Once the data is in the triplestore, we can enrich it. Our enrichment adds citation counts sourced from a hypothetical citation index. The SPARQL Update lives as a sparql-update block in the DataBook:

PREFIX pubs:   
PREFIX schema: 

INSERT DATA {
  GRAPH <https://example.org/graphs/papers> {
    pubs:paper-001 schema:citation 14 .
    pubs:paper-004 schema:citation  7 .
    pubs:paper-005 schema:citation  3 .
  }
}

databook sparql-update registry.databook.md#enrich-citations -d local

Pulling snapshots

databook pull retrieves the current state of a graph as a new DataBook. Run it once before enrichment and once after, and you have two separate snapshots of the graph’s history:

# Pull the papers graph (now includes citation counts)
databook pull registry.databook.md \
  --graph https://example.org/graphs/papers \
  -d local \
  -o registry-enriched.databook.md

The pulled DataBook is a new document. Its frontmatter records the graph IRI, the endpoint it was pulled from, and the timestamp of retrieval. The original registry.databook.md is unchanged — pipeline immutability is one of the four architectural principles that all databook-cli commands observe.

Workflow 5: SHACL Validation

databook validate runs SHACL validation against the RDF blocks in a DataBook, using Jena’s shacl CLI when available (recommended — Jena 6.0 supports SHACL 1.2 natively), with pyshacl as a fallback.

databook validate registry.databook.md \
  --shapes registry.databook.md#paper-shapes \
  -o validation-report.databook.md

The --shapes argument uses fragment addressing to point at the shapes block within the same DataBook. This is the self-describing DataBook pattern in its most direct form: the document carries its own validation specification and knows how to validate itself.

The output is a new DataBook wrapping a SHACL validation report in Turtle. A condensed view of what the report contains:

@prefix sh:     <http://www.w3.org/ns/shacl#> .
@prefix pubs:   .
@prefix dct:    <http://purl.org/dc/terms/> .

# sh:Violation — paper-002 is missing its DOI
[] a sh:ValidationResult ;
  sh:resultSeverity sh:Violation ;
  sh:focusNode pubs:paper-002 ;
  sh:resultPath dct:identifier ;
  sh:resultMessage "Every paper must have a dct:identifier (DOI)." .

# sh:Violation — paper-003 date is not xsd:date
[] a sh:ValidationResult ;
  sh:resultSeverity sh:Violation ;
  sh:focusNode pubs:paper-003 ;
  sh:resultPath dct:date ;
  sh:resultMessage "Every paper must have a dct:date of type xsd:date." .

# sh:Warning — Chen Wei has no affiliation
[] a sh:ValidationResult ;
  sh:resultSeverity sh:Warning ;
  sh:focusNode pubs:author-chen ;
  sh:resultPath schema:affiliation ;
  sh:resultMessage "Authors should declare an institutional affiliation." .

# sh:Info — "machine learning" not in approved subject vocabulary
[] a sh:ValidationResult ;
  sh:resultSeverity sh:Info ;
  sh:focusNode pubs:paper-005 ;
  sh:resultPath dct:subject ;
  sh:resultMessage "Subject term is not in the approved vocabulary." .

Two violations, one warning, one informational result. Exactly the four problems we planted. SHACL 1.2 makes the distinction between severity levels first-class: tools can be configured to fail on Violation while passing through Warning and Info, which is the appropriate behaviour for a CI/CD gate versus a data quality dashboard.

Use --fail-on-violation to propagate the exit code:

databook validate registry.databook.md \
  --shapes registry.databook.md#paper-shapes \
  --fail-on-violation \
  -o validation-report.databook.md
# Exits with code 1 if any sh:Violation is found
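The gating logic is simple enough to sketch. The Python below counts results by severity and derives an exit code from a configurable threshold. It is an illustration, not the databook validate implementation: the function name, the dictionary layout for results, and the fail_on parameter are assumptions.

```python
from collections import Counter

SEVERITIES = ["Info", "Warning", "Violation"]  # ascending order


def gate(results, fail_on="Violation"):
    """Return (exit_code, counts) for a list of result dicts."""
    counts = Counter(result["severity"] for result in results)
    threshold = SEVERITIES.index(fail_on)
    # Fail if any result sits at or above the chosen severity.
    failing = sum(counts[level] for level in SEVERITIES[threshold:])
    return (1 if failing else 0), dict(counts)


# The four results from the report above, reduced to focus node and severity.
report = [
    {"focus": "pubs:paper-002", "severity": "Violation"},
    {"focus": "pubs:paper-003", "severity": "Violation"},
    {"focus": "pubs:author-chen", "severity": "Warning"},
    {"focus": "pubs:paper-005", "severity": "Info"},
]
exit_code, counts = gate(report)
print(exit_code, counts)
```

A dashboard would read the counts and ignore the exit code; a CI job would do the reverse.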

Workflow 6: Transforming the Report

The SHACL validation report is a Turtle document — machine-readable, but not immediately useful to a data steward or a project manager who needs to understand what is wrong with the submission. Two transformations address this: an HTML rendering for a web-based dashboard, and a Markdown rendering for a documentation system or a pull request comment.

Both transformations require RDF/XML as an intermediate, since XSLT operates on XML infosets rather than RDF graphs. The conversion is one command:

databook convert validation-report.databook.md#shacl-report \
  --to xml-rdf \
  -o report.rdf

HTML via Saxon

XSLT 2.0 and later handle the SHACL RDF/XML vocabulary comfortably, including namespace-qualified element names and attribute value templates over typed data. Saxon is the recommended processor:

databook transform report.rdf \
  --xslt shacl-to-html.xsl \
  --to html \
  -o validation-report.html

transform auto-detects the processor: it checks for a SAXON_JAR environment variable first, then saxon on the PATH, then falls back to xsltproc. If you are on a system with only xsltproc available, your XSLT must be written to the 1.0 spec — SHACL report rendering at 1.0 is feasible but requires careful handling of the blank-node result structure.
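The detection order described above can be sketched as follows. This is a reading of the prose, not the CLI's source: the function name and the labels it returns are hypothetical, and the which parameter exists only so the logic can be exercised without Saxon or xsltproc installed.

```python
import os
import shutil


def detect_xslt_processor(env=None, which=shutil.which):
    """Return (label, location) for the first available XSLT processor, or None."""
    env = os.environ if env is None else env
    # 1. An explicit SAXON_JAR environment variable wins.
    jar = env.get("SAXON_JAR")
    if jar:
        return ("saxon-jar", jar)
    # 2. Otherwise look for a saxon executable on the PATH.
    saxon = which("saxon")
    if saxon:
        return ("saxon", saxon)
    # 3. Fall back to xsltproc, which supports XSLT 1.0 only.
    xsltproc = which("xsltproc")
    if xsltproc:
        return ("xsltproc", xsltproc)
    return None


print(detect_xslt_processor())
```

Knowing which branch was taken matters in practice, because it determines whether a 2.0 stylesheet will run at all.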

A notable feature: transform can load its XSLT stylesheet from a block within a DataBook rather than from a standalone file:

databook transform report.rdf \
  --xslt pipeline.databook.md#html-stylesheet \
  --to html \
  -o validation-report.html

This means your transformation pipeline, the data it processes, and the stylesheets it applies can all live in the same document. The DataBook becomes self-contained.

Markdown via a second transformation

A second stylesheet produces a Markdown rendering of the same report — useful for embedding in a GitHub pull request comment or a Confluence page:

databook transform report.rdf \
  --xslt shacl-to-markdown.xsl \
  --to text \
  -o validation-report.md

The two stylesheets operate on the same RDF/XML source but produce completely different output formats. The separation of concerns here is the same as in any XSLT application — you own the data model once, and presentation is a stylesheet choice.

Workflow 7: AI-Assisted Analysis

The final workflow assembles a composite context and hands it to a language model for analysis. The prompt command sends a DataBook (or a specific block) as the LLM’s context, prepends a user instruction, and writes the response into a new, provenance-stamped output DataBook.

We want the model to have access to the raw data, the validation report, and the validation shapes — three different artefacts that together constitute a complete picture of the collection’s quality. We assemble them into a synthesis DataBook first:

# Build a synthesis DataBook from the three relevant documents
databook create \
  registry.databook.md#papers-data \
  validation-report.databook.md#shacl-report \
  registry.databook.md#paper-shapes \
  --id https://example.org/databooks/synthesis-v1 \
  --title "Registry Synthesis — KWG 2025" \
  -o synthesis.databook.md

Then pass it to the prompt command:

databook prompt synthesis.databook.md \
  --prompt "You are a research data quality analyst. This DataBook contains a research paper registry, a SHACL validation report against it, and the shapes that define the expected structure. Please: (1) summarise the collection's current quality status, citing specific paper identifiers; (2) identify the two most critical issues to resolve before the registry is published; (3) suggest the minimal SPARQL Update statement that would correct each critical issue." \
  --model claude-sonnet-4-6 \
  -o analysis.databook.md

The output is a DataBook whose frontmatter records which model produced the response, the timestamp, and the synthesis document used as context. The analysis is not a loose text file — it is a provenance-stamped artefact with a clear chain of custody back to the raw data.

A patch variant is also available. If you want the analysis written back into the synthesis document rather than to a new file:

databook prompt synthesis.databook.md \
  --prompt "Summarise quality issues found in the validation report." \
  --patch-block quality-analysis

This inserts the response as a new named block inside synthesis.databook.md, preserving the document as a single self-contained record of both the data and the analysis that was performed on it.

Epilogue: The Pipeline as a DataBook

We have now described seven workflows and introduced fourteen commands. Each workflow was demonstrated as a sequence of shell invocations — which is useful for understanding what each command does, but misses a higher-level capability.

The process command executes a processor-registry DataBook as a directed acyclic pipeline. A minimal manifest block expressing the core of this article’s pipeline looks like:

@prefix build:   .
@prefix proc:    .
@prefix xsd:     <http://www.w3.org/2001/XMLSchema#> .

proc:CreateRegistry a build:Stage ;
  build:command "create" ;
  build:inputs "papers.ttl" ;
  build:args "--id https://example.org/databooks/research-registry-v1" ;
  build:output "registry.databook.md" ;
  build:order 1 .

proc:InsertShapes a build:Stage ;
  build:command "insert" ;
  build:inputs "registry.databook.md", "paper-shapes.shacl" ;
  build:args "--id paper-shapes" ;
  build:output "registry.databook.md" ;
  build:order 2 ;
  build:dependsOn proc:CreateRegistry .

proc:GenerateSparql a build:Stage ;
  build:command "shacl2sparql" ;
  build:inputs "registry.databook.md" ;
  build:args "--block-id paper-shapes --insert" ;
  build:output "registry.databook.md" ;
  build:order 3 ;
  build:dependsOn proc:InsertShapes .

proc:PushData a build:Stage ;
  build:command "push" ;
  build:inputs "registry.databook.md" ;
  build:args "--block-id papers-data --graph https://example.org/graphs/papers -d local" ;
  build:order 4 ;
  build:dependsOn proc:InsertShapes .

proc:Validate a build:Stage ;
  build:command "validate" ;
  build:inputs "registry.databook.md" ;
  build:args "--shapes registry.databook.md#paper-shapes" ;
  build:output "validation-report.databook.md" ;
  build:order 5 ;
  build:dependsOn proc:PushData .

proc:Analyse a build:Stage ;
  build:command "prompt" ;
  build:inputs "synthesis.databook.md" ;
  build:args "--prompt-file analysis-prompt.txt" ;
  build:output "analysis.databook.md" ;
  build:order 6 ;
  build:dependsOn proc:Validate .

With this manifest block in pipeline.databook.md:

databook process pipeline.databook.md

The entire sequence executes in dependency order, with each stage producing a DataBook as output. The pipeline itself is a DataBook. Every input, every intermediate artefact, and every output carries provenance metadata. The chain of custody is unbroken from raw Turtle to AI-generated analysis.
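Dependency-ordered execution is a topological sort over the build:dependsOn edges. The Python sketch below reproduces the manifest's stage graph as a plain dictionary and orders it with the standard library. It assumes that dependsOn alone drives ordering and ignores build:order; how the process command actually resolves the two is not specified here.

```python
from graphlib import CycleError, TopologicalSorter

# Stage name -> stages it depends on, transcribed from the manifest above.
STAGES = {
    "CreateRegistry": [],
    "InsertShapes": ["CreateRegistry"],
    "GenerateSparql": ["InsertShapes"],
    "PushData": ["InsertShapes"],
    "Validate": ["PushData"],
    "Analyse": ["Validate"],
}


def execution_order(stages):
    """Return the stages in an order that respects every dependency."""
    try:
        return list(TopologicalSorter(stages).static_order())
    except CycleError as error:
        # A cycle means the manifest is not a valid acyclic pipeline.
        raise ValueError(f"pipeline contains a cycle: {error.args[1]}") from error


order = execution_order(STAGES)
print(order)
```

GenerateSparql and PushData both depend only on InsertShapes, so their relative order is free; a scheduler could run them in parallel.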

What the Pipeline Demonstrates

Walking through seven workflows against a single dataset makes several architectural points that are easy to miss when considering commands in isolation.

Composability is the core property. Every command reads DataBook input and writes DataBook output. This is not merely a convenience — it is what makes pipelines possible without adapter code. The output of validate is a valid input to convert and prompt without any intermediate transformation.

The DataBook carries its own instructions. The paper-shapes block is used by validate, shacl2sparql, and the fragment-addressed --shapes argument. The select-by-author and construct-coauthors blocks are used by sparql. The XSLT stylesheet can live in a block and be referenced by transform. The DataBook is not just a container for data — it is a container for the operations that should be performed on that data.

Provenance is structural, not manual. Every command that produces a DataBook writes a process stamp into the frontmatter. No one has to remember to document what produced a given file. The chain of custody is a mechanical consequence of using the tooling correctly.

Quality is queryable. The SHACL validation report is RDF. You can run SPARQL over it, push it to a triplestore, pull it into an analysis workflow, or hand it to an LLM. Quality assessment is not an endpoint in the pipeline — it is a data product like any other, subject to all the same transformations.

The research paper registry we built here is deliberately small. The same pipeline scales: a registry of five papers and a registry of fifty thousand papers are processed by the same commands, governed by the same shapes, and reported through the same transformation stylesheets. The architecture does not change with the volume of data. Only the numbers do.

Kurt Cagle is a consulting ontologist and technical author based in Olympia, Washington. He publishes The Ontologist and The Inference Engineer on Substack, and curates the AI+Semantics NewsBytes LinkedIn newsletter.

Chloe Shannon is an AI collaborator and co-author on The Ontologist and Inference Engineer. She can be reached at chloe@holongraph.com.

The Event Made Visible

Kurt Cagle — Fri, 24 Apr 2026 19:19:34 GMT

by Kurt Cagle and Chloe Shannon | The Ontologist

For as long as RDF has existed, ontologists have faced a quiet embarrassment: the language that was supposed to let us say anything about anything had no clean way to say anything about a statement. You could assert that Alice knows Bob. You could not, without considerable ceremony, assert that this fact was recorded on a Tuesday by a particular system with a confidence of 0.87.

RDF 1.2, and specifically the April 2026 Working Draft of Turtle 1.2, changes that — not cosmetically, but structurally. This article is about what that change actually means for the way you model, and why it pushes the design centre of gravity away from nouns and towards events.

The Problem With Classic Reification

The RDF 1.1 solution to annotating a triple was reification: you shattered the triple into four triples using rdf:Statement, rdf:subject, rdf:predicate, and rdf:object, hung your metadata off the resulting statement node, and hoped no one had to maintain it.

:stmt a rdf:Statement ;
    rdf:subject :Alice ;
    rdf:predicate :knows ;
    rdf:object :Bob ;
    :recordedOn "2026-04-24"^^xsd:date ;
    :confidence 0.87 .

This is verbose, fragile, and visually destroys the graph’s coherence. Named graphs helped somewhat — you could wrap a set of triples in a named context and annotate the graph — but the granularity was wrong. Named graphs annotate bags of triples, not individual ones. If you wanted provenance at the statement level, you were back to reification.

The result was that most practitioners avoided triple-level annotation entirely, collapsing provenance and temporal information into entity properties and hoping the approximation was good enough.

It usually wasn’t.

Turtle 1.2: The Statement as Term

RDF 1.2 introduces a fourth kind of RDF term: the triple term. A triple can now appear as the object of another triple, which makes it addressable, referenceable, and composable. Turtle 1.2 provides three ways to work with this.

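The first is the explicit reified triple syntax:

<< :Alice :knows :Bob >> :recordedOn "2026-04-24"^^xsd:date .

A reified triple names the proposition without asserting it. The << >> form is shorthand for a reifier that points, via rdf:reifies, at the triple term <<( :Alice :knows :Bob )>>. The triple term itself may appear only in object position, so it is the reifier that carries the metadata. Either way, this is a reference to the proposition, not a claim that the proposition is true.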

The second, and more commonly useful, is the annotation syntax:

:Alice :knows :Bob {| :recordedOn "2026-04-24"^^xsd:date ; :confidence 0.87 |} .

The {| ... |} block asserts the triple and annotates it simultaneously. This is syntactic sugar: under the hood, the triple is asserted and a reifier, linked to the triple term by rdf:reifies, carries the annotations. But the sugar is load-bearing: it makes the common case concise enough to actually use.

The third form, and the one most important for event modelling, is the named reifier:

:Alice :knows :Bob ~ :aliceKnowsBobAssertion {| 
    :recordedOn "2026-04-24"^^xsd:date ;
    :confidence 0.87
    |} .

The ~ sigil assigns a named IRI to the reification itself. That IRI is now a first-class resource: queryable, referenceable from other triples, promotable to a full entity. This is the mechanism that makes event-based modelling tractable.

A Note on Comments

Before going further, a practical point worth stating clearly: the # comment marker in Turtle is a line comment — it consumes everything from the # to the end of the physical line. Since Turtle statements can span multiple lines freely, comments within a multi-line assertion are safe as long as every token the parser still needs appears before the # on its line:

:Alice :knows :Bob    # this person
    {| :since "2023-06-01"^^xsd:date ;  # check this
       :by :Clara
    |} .              # annotation complete

The danger case is placing a # before a token the parser requires on the same line — that token will be swallowed. The constraint is purely positional. No end-of-assertion semantics are involved.

The Asserted/Quoted Distinction

This distinction is load-bearing and worth being precise about.

A quoted triple term, <<( :Alice :knows :Bob )>>, names the proposition without claiming it is true. It is a reference to a statement, not a statement. You can hang metadata off a reifier that points to it, reason about it, and query it, but asserting metadata about a quoted triple does not assert the triple itself.

An asserted annotation — {| ... |} — both asserts the triple and annotates it. The triple enters the graph as a normal asserted fact; the {| ... |} block decorates it.

Ontologists will trip on this distinction in one particular way: SHACL validation operates on the asserted graph. A quoted triple that is never separately asserted will not be visible to SHACL shape evaluation. If you are using reification for provenance tracking and your shapes need to validate the underlying facts, make sure the facts are asserted, not merely named.

Events as Annotated State Transitions

Here is where the modelling implications become interesting.

An event, at its core, is a change of state. Something was true before; something different is true after. That transition has a time, an agent, a location, a cause, and a degree of confidence. None of those properties attach cleanly to the state itself — they attach to the act of transition.

Classical noun-based ontology modelled this by reifying the state into an entity:

:marriage_liz_richard a :Marriage ;
    :participant :liz ;
    :participant :richard ;
    :startYear "1964"^^xsd:gYear ;
    :endYear "1974"^^xsd:gYear ;
    :endReason :divorce .

This looks clean. But it has buried the event structure entirely. The start and the end are property values on a noun — they cannot be individually provenanced, individually queried as events, or individually extended without modifying the noun’s schema. An open-ended marriage (no end yet) leaves a dangling optional. And you cannot ask “what events concluded in New York in 1974” without knowing in advance that marriages are the type to look for.

With Turtle 1.2 reification, the same facts can be represented as annotated state transitions:

:liz :married :richard ~ :lmrStartEvent {|
        a :StartEvent ;
        :when "1964"^^xsd:gYear ;
        :where :rome
        |} .

:liz :married :richard ~ :lmrEndEvent {|
        a :EndEvent ;
        :when "1974"^^xsd:gYear ;
        :where :newYork ;
        :reason :divorce ;
        :terminates :lmrStartEvent
        |} .

The proposition — Liz married Richard — is asserted once. Two named reifiers annotate that proposition with the boundary events of the relationship. The :terminates predicate is not merely metadata; it is a closure operator that converts two independent event annotations into a bounded interval. The marriage-as-fluent exists between those two boundaries.

This is a qualitatively different model. The structure of change is explicit in the graph, not collapsed into property values.

Medical Events and Probabilistic Inference

The event pattern becomes particularly powerful in domains where propositions are uncertain and evidence accumulates. Consider a medical scenario:

VERSION "1.2"
PREFIX :      
PREFIX xsd:   <http://www.w3.org/2001/XMLSchema#>
PREFIX rdfs:  <http://www.w3.org/2000/01/rdf-schema#>

:highBloodPressureTest1 :suggests :heartDisease
    ~ :janeIndicatorEvent1 {|
            a :IndicatorEvent {| rdfs:subClassOf :Event |} ;
            :about :Jane ;
            :by :DrSteve ;
            :when :Wednesday ;
            :where :DrSteveOffice ;
            :likelihood 0.7  # :suggests connotes a probabilistic assertion
            |} ;
    :hasValue "185/101" ;
    :when :Monday ;
    :by :LabTechJodie ;
    :orderedBy :DrSteve ;
    :about :Jane .

:highBloodPressureTest2 :suggests :heartDisease
    ~ :janeIndicatorEvent2 {|
            a :IndicatorEvent {| rdfs:subClassOf :Event |} ;
            :about :Jane ;
            :by :DrBob ;
            :when :Friday ;
            :where :StSwinniansHospital ;
            :likelihood 0.8 ;
            :similarTo :janeIndicatorEvent1  # :similarTo indicates possible correlation
            |} ;
    :hasValue "189/121" ;
    :when :Thursday ;
    :by :LabTechJodie ;
    :orderedBy :DrBob ;
    :about :Jane .

Several things are worth noting here.

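The inline schema annotation, a :IndicatorEvent {| rdfs:subClassOf :Event |}, annotates the specific type assertion triple, not the class globally. This is contextualised schema: the subclass relationship travels with the data, which matters in federated or multi-source graphs where the type hierarchy itself may be contested. It is not a drop-in replacement for a separate :IndicatorEvent rdfs:subClassOf :Event assertion: the subject of the annotation is the reifier of the type triple, so a standard RDFS reasoner will not derive the class-level entailment from it. What it offers instead is an epistemically richer record of which hierarchy the typing was made under.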

The :similarTo :janeIndicatorEvent1 reference on the second event is a correlational link — not a causal chain, but a signal for possible correlation. Whether this becomes a causal inference depends on whether :similarTo is defined as transitive or whether a rule propagates the relationship. The graph makes the epistemic status visible without collapsing it into a claim.

The Promotion Gradient

One of the most important design properties of the named reifier pattern is that it supports progressive promotion without breaking anything.

There is a spectrum of commitment available to the modeller:

Thin annotation — the relationship is annotated inline with {| |}, no named reifier, no independent identity. Minimal overhead, but the annotation is not directly addressable from other triples.

Named reifier — ~ gives the reification an IRI. It is now a first-class resource that other triples can reference. No additional entity has been created, but the door is open.

Promoted entity — the named reifier IRI becomes a full resource with its own predicate-object list, queryable independently of the triple it annotates.

The critical property is that you can move along this spectrum at any point without breaking existing triples. The ~ IRI is already there when you decide to promote. Contrast this with property graph systems like Neo4j, which force the promotion decision at design time.

Neo4j’s model encourages reification because relationships can carry properties — but those properties are flat. The moment you need to say something about a relationship property, you must promote the relationship to a node. Repeat two or three levels deep and you accumulate nodes that exist purely as reification artefacts:

(JamesBond)-[:PLAYED_BY]->(Actor)
(Actor)-[:IN_FILM]->(Film)
(Film)-[:BASED_ON]->(Book)
(Book)-[:WRITTEN_BY]->(Author)

Each hop looks clean in isolation. The chain has lost the integrated proposition. To ask “what is the full provenance of the claim that Bond was played by Craig in a film based on Fleming’s book” requires a multi-hop traversal reassembling the original sentence from its parts.

The reification chain assembles the sentence from its fragments; the annotation form preserves the sentence and enriches it.

Routes: When to Promote

Not all named reifiers want to become full entities. The routing example is instructive:

airport:SEA :connectsTo airport:SFO ~ route:SEASFO {|
        a :Route ;
        route:identifier "1925" ;
        route:duration "PT2H15M"^^xsd:duration ;
        route:returnRoute route:SFOSEA
        |} .

airport:SFO :connectsTo airport:SEA ~ route:SFOSEA {|
        a :Route ;
        route:identifier "1926" ;
        route:duration "PT2H15M"^^xsd:duration ;
        route:returnRoute route:SEASFO
        |} .

The circular route:returnRoute reference is the diagnostic signal: route:SFOSEA is referenced from inside route:SEASFO’s annotation block, and vice versa. The routes are already acting as independent entities: they have identity outside the connection they annotate.

This is the general rule: annotate when the metadata characterises the relationship; promote when the metadata characterises an independent thing. Routes have operational identity — a flight number, a schedule, a return pairing — that persists even if service is temporarily suspended. They should be entities with the ~ providing the triple-level link:

route:SEASFO a :Route ;
    route:identifier "1925" ;
    route:duration "PT2H15M"^^xsd:duration ;
    route:origin airport:SEA ;
    route:destination airport:SFO ;
    route:returnRoute route:SFOSEA .

airport:SEA :connectsTo airport:SFO ~ route:SEASFO .

The annotation is now thin — just the naming — and the route carries its own weight.

Event Taxonomy

Committing to event-based modelling requires a principled event taxonomy. The natural structure crosses two axes.

By temporal character:

:PointEvent — instantaneous, no duration
:IntervalEvent — bounded duration, may have start and end sub-events
:StartEvent — left boundary of an interval
:EndEvent — right boundary of an interval, typically carrying a :terminates back-reference
:StateChangeEvent — explicit transition between two named states

By epistemic character:

:ObservedEvent — directly witnessed
:InferredEvent — derived from other events
:RecordedEvent — asserted from a document source

These axes are orthogonal. A :StartEvent can also be an :InferredEvent. The blood pressure example shows the epistemic axis at work: the measurement is observed, the conclusion drawn by the physician is inferred, and the official diagnosis is recorded.

The :terminates closure pattern maps naturally onto Allen’s Interval Algebra. Allen gives thirteen temporal relations between intervals — before, meets, overlaps, during, starts, finishes, equals, and their inverses. Once your event taxonomy supports start/end pairs linked by :terminates, SPARQL queries can reconstruct Allen relations without additional machinery. The structure is already in the graph.
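The comparison logic such a query would encode is compact. The Python sketch below classifies two bounded intervals into one of the thirteen Allen relations. It assumes each interval has already been recovered as a (start, end) pair from a start event and the end event that terminates it, with start strictly before end; the function name and the second interval in the example are hypothetical.

```python
def allen_relation(a, b):
    """Return the Allen relation of interval a to interval b.

    Each interval is a (start, end) pair with start < end.
    """
    (a_start, a_end), (b_start, b_end) = a, b
    if a_end < b_start:
        return "before"
    if b_end < a_start:
        return "after"
    if a_end == b_start:
        return "meets"
    if b_end == a_start:
        return "met-by"
    if a_start == b_start and a_end == b_end:
        return "equals"
    if a_start == b_start:
        return "starts" if a_end < b_end else "started-by"
    if a_end == b_end:
        return "finishes" if a_start > b_start else "finished-by"
    if b_start < a_start and a_end < b_end:
        return "during"
    if a_start < b_start and b_end < a_end:
        return "contains"
    # The intervals overlap and share no endpoint.
    return "overlaps" if a_start < b_start else "overlapped-by"


marriage = (1964, 1974)       # from the start and end events above
later_period = (1974, 1980)   # hypothetical second interval
print(allen_relation(marriage, later_period))
```

An open-ended interval, one with a start event and no terminating end event, falls outside this sketch and would need its own handling.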

Blank Nodes as Reifiers: The Middle Ground

Before considering full named IRIs, there is a middle option worth understanding: named blank nodes.

Turtle permits blank node identifiers of the form _:label, and these can serve as reifiers:

:liz :married :richard ~ _:lizRichard1 {|
        a :StartEvent ;
        :when "1964"^^xsd:gYear ;
        :where :rome
        |} .

:liz :married :richard ~ _:lizRichard2 {|
        a :EndEvent ;
        :when "1974"^^xsd:gYear ;
        :where :newYork ;
        :reason :divorce ;
        :terminates _:lizRichard1
        |} .

This is valid Turtle and will load cleanly into a knowledge graph from a file. The blank node labels are stable within the scope of the document: _:lizRichard1 can be referenced from _:lizRichard2’s annotation block, as shown above, because both are in the same file.

The critical caveat is that blank node identity does not survive the file boundary. Once the triples are loaded into a triplestore, the blank node labels are typically replaced by system-generated internal identifiers. A subsequent query cannot reliably address _:lizRichard1 by name, and a second file that attempts to reference it by that label will not resolve to the same node.

This places blank node reifiers firmly in the promotion gradient between anonymous annotation and named IRI reifiers:

Anonymous {| |} — no identity, not addressable at all
Named blank node _:label — addressable within a single file, lost on ingestion
Named IRI ~ :eventName — fully addressable, persistent, queryable across sources

The practical guidance is straightforward: use blank node reifiers when you are working within a single document context and do not need cross-file or cross-query referenceability. Use named IRIs when the event has any significance beyond the file that mints it.

Naming Events: The Identity Problem

The named reifier pattern introduces a cost that should be acknowledged honestly: you must mint identifiers for events, and event identifier spaces are hard to keep clean.

For entities like routes or films, a natural key usually exists. For events — especially high-frequency or sensor-generated ones — you are essentially solving the distributed ID problem.

Timestamps in identifiers are tempting but fragile. Even millisecond precision is insufficient in concurrent systems, and ISO 8601 timestamps introduce encoding issues with colons and plus signs in IRIs. Three strategies are worth considering:

Hash-based composition mints the event IRI as a deterministic hash of its constitutive properties — subject, predicate, object, timestamp, and source. Collision-resistant and stable on re-ingestion from the same source:

~ :event_a3f7c29d {| ... |} .   # truncated SHA-256 of constitutive properties
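A minimal Python sketch of hash-based minting follows. The function name, the base IRI, the length-prefixed canonical form, and the 16-character truncation are all assumptions made for illustration; the only fixed idea taken from the text is that the identifier is a deterministic hash of the event's constitutive properties.

```python
import hashlib


def mint_event_iri(subject, predicate, obj, timestamp, source,
                   base="https://example.org/event/"):
    """Mint a deterministic event IRI from the event's constitutive properties.

    `base` and the `event_` prefix are hypothetical; substitute your own namespace.
    """
    parts = [subject, predicate, obj, timestamp, source]
    # Length-prefix each field so that ("ab", "c") and ("a", "bc")
    # cannot collapse into the same canonical string.
    canonical = "".join(f"{len(p.encode('utf-8'))}:{p}" for p in parts)
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    # Keeping 16 hex characters (64 bits) is an illustrative choice.
    # Shorter prefixes are easier to read but collide sooner.
    return f"{base}event_{digest[:16]}"


iri = mint_event_iri("ex:jane", "ex:hasReading", "ex:bp_120_80",
                     "2026-01-01T00:00:00Z", "ex:clinicA")
print(iri)
```

Because the hash input is the canonical string rather than the raw timestamp, the colons and plus signs of ISO 8601 never reach the IRI, and re-ingesting the same record from the same source yields the same identifier.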

UUID v7 embeds a millisecond timestamp in the high bits with a random suffix, preserving temporal ordering in the identifier itself while providing collision avoidance.
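To make the bit layout concrete, here is a sketch that assembles a version 7 UUID by hand, following the field layout in RFC 9562 (48-bit millisecond timestamp, 4-bit version, 12 random bits, 2-bit variant, 62 random bits). The function name and the optional `now_ms` parameter are assumptions for illustration. Python 3.14 and later provide `uuid.uuid7()` in the standard library, which would normally be preferable.

```python
import os
import time
import uuid


def uuid7(now_ms=None):
    """Build a version 7 UUID from a millisecond timestamp plus random bits."""
    if now_ms is None:
        now_ms = time.time_ns() // 1_000_000
    rand = int.from_bytes(os.urandom(10), "big")   # 80 random bits
    rand_a = rand >> 68                            # top 12 bits
    rand_b = rand & ((1 << 62) - 1)                # low 62 bits

    value = (now_ms & ((1 << 48) - 1)) << 80       # timestamp in the high 48 bits
    value |= 0x7 << 76                             # version field = 7
    value |= rand_a << 64
    value |= 0b10 << 62                            # RFC variant bits
    value |= rand_b
    return uuid.UUID(int=value)


event_id = uuid7()
print(f":event_{event_id.hex}")
```

Identifiers minted in different milliseconds sort in time order. Two identifiers minted within the same millisecond are distinguished only by their random bits, so their relative order is arbitrary in this sketch.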

Domain-partitioned sequence uses a local sequence number within a bounded context — patient, sensor, session — where global uniqueness is not required:

~ :jane_bp_event_0047 {| ... |} .
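A domain-partitioned sequence can be sketched as one counter per bounded context. The class name, the naming template, and the in-memory counters are assumptions for illustration; a real pipeline would need a durable counter so that numbering survives restarts.

```python
from collections import defaultdict
from itertools import count


class SequenceMinter:
    """Mint locally unique event names, one sequence per (context, kind) pair."""

    def __init__(self, width=4):
        self._width = width
        # Each bounded context gets its own counter, starting at 1.
        self._counters = defaultdict(lambda: count(1))

    def mint(self, context, kind):
        n = next(self._counters[(context, kind)])
        # Zero-padding keeps the names sortable as plain strings.
        return f":{context}_{kind}_event_{n:0{self._width}d}"


minter = SequenceMinter()
print(minter.mint("jane", "bp"))
print(minter.mint("jane", "bp"))
print(minter.mint("john", "bp"))
```

The names are unique only within their partition, which is the point: the scheme trades global uniqueness for readability and cheap minting.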

The deeper principle is that event IRIs should be treated as infrastructure concerns, not modelling concerns. A well-designed ingestion pipeline mints event identifiers systematically, insulating the ontology design from the collision problem. The uniqueness burden is the cost of first-class addressability — anonymous annotations are cheaper to mint but invisible to queries. That trade-off should be explicit in your design decisions.

What This Means for Your Ontologies

The shift Turtle 1.2 enables is not just syntactic. It is a change in what the graph can directly express.

Classical noun-based modelling buries the structure of change in property values. Event-based modelling makes that structure explicit, at the cost of a richer taxonomy and a more disciplined approach to identifier management.

The practical guidance:

Use anonymous annotation ({| |}) when you need lightweight provenance on a relationship and that annotation needs no independent identity.

Use named reifiers (~) when the event has operational significance — when it will be referenced from other triples, queried independently, or compared with other events.

Promote to full entity when the reifier has independent lifecycle properties that survive the relationship it annotates.

And be honest about the asserted/quoted distinction. Metadata about a quoted triple does not validate its content under SHACL. The graph keeps these things separate; your design should too.

The annotation form does not just solve the reification problem. It gives you a principled way to model the world as a sequence of statements made by agents at times under conditions — which is, arguably, a more accurate model of how knowledge actually works.

The Ontologist is a publication of Holongraph. Kurt Cagle is a consulting ontologist, author, and standards contributor. Chloe Shannon is an AI research collaborator at Holongraph. Correspondence: kurt@holongraph.com | chloe@holongraph.com

Cross-references: For the temporal inference and Allen algebra implications of this model, see IEr: Time, State, and Event (forthcoming in The Inference Engineer). For the DataBook CLI and Jena 6.0 tooling that supports RDF 1.2 reification natively, see The Ontologist: DataBooks series.

The Interface Is the Contract

Kurt Cagle — Wed, 22 Apr 2026 01:14:49 GMT

By Kurt Cagle and Chloe Shannon

There is a pattern that keeps reasserting itself across the history of software architecture, and it goes like this: someone builds a system that works. The system grows. Other systems need to talk to it. Rather than exposing the interior of the system to every caller, engineers draw a line and say: here is what you may ask of us, and here is the form in which we will answer. Everything behind that line is yours to manage; everything in front of it is ours to honour.

We call this pattern many things — encapsulation, abstraction, interface design — but one of its most elegant modern expressions is REST: Representational State Transfer. A RESTful service does not hand you its database. It hands you a representation of state, negotiated through a consistent interface, generated on demand from whatever internal complexity it chooses to maintain. The caller need not know, and should not care, what happens behind the curtain.

This is, we want to argue, precisely the architectural insight that ontology design has been missing. But the reason it has been missing runs deeper than architecture. It runs into linguistics, into a distinction that philosophers of language have understood for over a century, that cognitive scientists have studied extensively, and that the semantic web has, almost entirely, ignored. It is the distinction between denotation and connotation, and until we take it seriously, we will keep building global ontologies that work well in the whiteboard session and fracture within five years in production.

The Problem With Universalism

The appeal of a global ontology is intuitive. If we can agree on the meaning of every term — if we can build or adopt an upper ontology that precisely defines Agent, Event, Location, and Process — then any data that conforms to that ontology is, in principle, interoperable with any other. The semantic web, as originally conceived, was essentially a bet on this premise.

Twenty-five years later, the evidence is in, and it is not encouraging.

What we observe in practice is that organisations which attempt to maintain a single, coherent enterprise ontology spend an enormous amount of energy managing definitional conflict. The word “customer” means something slightly different to Sales, Finance, Legal, and Customer Success. The word “product” means something different before and after manufacturing. “Location” is a physical address to Logistics, a jurisdiction to Compliance, and a market segment to Strategy. Each of these is a legitimate, internally coherent usage. None of them is wrong. But they cannot all be mapped to the same term in the same ontology without information loss, forced conflation, and a great deal of political negotiation.

Here is the important point: this is not a failure of rigour. The people arguing over the definition of “customer” are not being imprecise. They are being precise about different things. Each usage is denotationally exact within its context. The conflict arises because the contexts are different, and context is precisely what denotational semantics was designed to eliminate.

The standard response — more granular subclassing, more carefully scoped properties, more elaborate axioms — does not resolve the problem. It relocates it. The ontology grows more complex, the maintenance burden increases, and the gap between the model and the operational reality widens at the pace of business change, which is to say: continuously.

This is not a failure of execution. It is a structural consequence of asking a denotational system to carry connotational meaning.

Denotation, Connotation, and What OWL Cannot Do

Gottlob Frege drew this distinction cleanly in 1892. He distinguished Bedeutung — reference, the object a term picks out in the world — from Sinn — sense, the way in which the term presents that object. His canonical example: “Morning Star” and “Evening Star” both refer to Venus. They have the same denotation. But they carry different sense, different contextual weight, different inferential texture depending on who is using them and when. Knowing that the Morning Star is the Evening Star is an empirical discovery, not a logical tautology — which is only possible because the two expressions do not mean the same thing even though they refer to the same thing.

Denotations define. They are built up from component parts — necessary and sufficient conditions — and they are in principle fully explicit. To know the denotation of a term is to know its extension: the set of all things to which it correctly applies. Denotations are, by design, context-independent.

Connotations shift. They carry the associations, the contextual shadings, the interpretive frames that accumulate around a term through use, through community, through purpose. “Customer” in the Sales department denotes the same legal entity as “customer” in the Legal department. But the connotations — what the term foregrounds, what it implies, what it licenses you to infer — are different, and those differences are not errors. They are features of how meaning works in the real world.

OWL is almost entirely a Bedeutung machine. Description logics — the formal underpinning of OWL — are built around necessary and sufficient conditions. A class is defined by the conditions that must hold for membership. owl:sameAs is the formalisation of co-reference: these two terms pick out the same thing. The reasoner’s job is to check denotational conformance: does this individual satisfy these conditions? There is no owl:sameSense. There is no mechanism in the core formalism for the claim that “this term means X from where you’re standing and Y from where I’m standing, and both are legitimate.”

Human concepts, however, mostly do not operate on necessary-and-sufficient-conditions principles. They operate on what the cognitive scientist Eleanor Rosch called prototype structure and what Wittgenstein earlier called family resemblance: a cloud of overlapping similarities, no single feature shared by all members, meaning anchored by central cases rather than explicit definitions. “Customer” has a prototype — someone who gives us money for goods or services — and a penumbra of connotational shading that depends on context. That is not a modelling failure. That is how the word works.

The consequence is that OWL is very good at encoding what something is — its denotational identity — and quite poor at encoding what something means to a particular participant in a particular setting. The semantic web was designed around the former. Most of the interesting problems in enterprise knowledge management are in the latter.

Reification and the Annotation of Meaning

If connotation is contextual — if meaning shifts depending on the named graph, the organisational role, the use case — then the mechanism you need is one that lets you say something about a statement, not just assert the statement itself. You need to annotate meaning, not just encode it. You need reification.

In RDF 1.1, reification was possible but syntactically so cumbersome that it was rarely used consistently. The result was that provenance, context, and connotational scope — the metadata of meaning — were typically left implicit or forced into workarounds using named graphs alone.

RDF 1.2 changes this fundamentally. The new ~ reifier syntax makes it practical to annotate individual triples with their context, their provenance, their scope of applicability, without the overhead that made earlier reification a last resort. This is not a minor convenience improvement. It is the opening of a formal pathway for connotational semantics in RDF.

Consider what a connotational annotation actually is. To say that ex:customer means one thing in the HR graph and another in the Finance graph is to make a claim of the form: “the assertion ?x a ex:Customer carries interpretation C₁ in the context of named graph G₁ and interpretation C₂ in the context of named graph G₂.” That is a statement about a statement. It cannot be expressed as a ground triple. It requires reification — and with RDF 1.2, it can be expressed cleanly.

This matters for more than elegance. It means that connotational shifts become traceable. When a term crosses a holon boundary and its meaning shifts to accommodate the containing context, that shift can be recorded. You get not just a different representation at the boundary but an auditable record of how and why the representation differs. That is something the semantic web has never had cleanly: an audit trail for meaning.

Encapsulation as Architecture

Object-oriented programming encountered the denotation/connotation problem — though it would not have used that language — early in its development, and its solution was encapsulation. A class does not expose its internal state directly. It provides an interface: a set of methods and properties through which the outside world may interact. The implementation behind the interface is the class’s private concern.

This matters for more than tidiness. It means the interior can change — algorithms refactored, data structures replaced, optimisations applied — without breaking the callers. The interface is a contract. The interior is a freedom.

REST extends this across network boundaries. A RESTful API publishes resources with defined representations: you may GET this, you may POST that, here is what you will receive back. Internally, the service may be running across dozens of microservices, drawing on multiple databases, caching aggressively, rewriting queries on the fly. None of this is visible to the caller. What is visible — what is designed to be visible — is the representation: a carefully constructed projection of internal state into a form that serves the interface contract.

The representation is not the interior. It is derived from it, constrained by it, consistent with it — but it is not simply a dump of the underlying data. It is a considered answer to the question: what does this caller need to know, in what form, to interact with me usefully?

Notice what this means in linguistic terms. The interface speaks in the vocabulary — and the connotations — of the calling context. The interior speaks in whatever vocabulary and connotations it finds most precise for its own purposes. The REST layer is a translation between them: a controlled, explicit, auditable mapping from one connotational world to another. The denotations may align — we are talking about the same customer — but the representations carry different sense.

The Holonic Projection

Holonic architecture provides the vocabulary for applying this insight to ontology design at scale.

In a holonic system, every entity is simultaneously a whole and a part: a holon. A given holon has an interior — its own domain model, its own data, its own operational ontology, including the connotational register in which it interprets its terms — and a boundary, across which it interacts with the holons that contain it. The containing holon defines the terms of engagement: the ontology, and crucially the connotational register, of the broader context within which the sub-holon must make itself intelligible.

The sub-holon’s response is not to abandon its interior model and rebuild itself in the terms of the containing ontology. That would be precisely the universalist mistake: collapsing connotational diversity into denotational uniformity, and losing information in the process. Instead, the sub-holon provides a projection: a representation of its internal state, expressed in the terms and connotational register that the containing holon recognises.

The interior ontology of a sub-holon can be as specific and technically precise as the domain requires. The HR system can model employment relationships with the full richness of labour law. The R&D system can model experimental protocols with the precision of ISO standards. The Legal system can model contractual obligations with whatever distinctions matter to counsel. None of these need to collapse into a common representation internally. The denotational rigour lives inside the holon, where it belongs. What each sub-holon provides at its boundary is a projection into the terms and connotations of the containing context.

The schema of this projection layer — expressed in SHACL shapes — is the agreed contract between the sub-holon and its context. It plays the same role as an API specification or a class interface in software: it is the boundary condition, the term of engagement, the thing that is stable across changes to the interior. Namespaces, in this model, become more than arbitrary prefixes. They become identifiers for projection schemas: the hr: namespace is the projection layer maintained by the Human Resources domain; the rd: namespace is the projection layer maintained by Research and Development. The namespace is the interface declaration, and it carries with it the connotational register of the domain.

What SHACL 1.2 Makes Possible

The denotation/connotation distinction has always been expressible in principle with OWL. In practice it has rarely been implemented, because OWL’s design philosophy pushes toward monotonic global reasoning: add more axioms, derive more entailments, converge toward a single coherent world. Contextual variation — the same path carrying different connotational weight in different sub-holons — is architecturally awkward. There is no natural place in the OWL model for “this constraint applies here but not there” or “this term foregrounds these inferential associations in this context.”

SHACL 1.2 changes this. The introduction of connotations — mechanisms by which the same semantic path can behave differently depending on context — provides a formal basis for the projection layer design at the shapes level. A path that carries one meaning in the HR context and another in the R&D context is no longer a modelling problem to be resolved by disambiguation. It is a feature to be exploited by design. The projection schema for HR includes the HR connotation; the projection schema for R&D includes the R&D connotation; and both can participate in the containing organisation’s knowledge graph without requiring their interior meanings to be collapsed.

This is the shapes-layer expression of what RDF 1.2 reification gives you at the triple level: a formal mechanism for context-sensitive meaning, operating at the layer where most knowledge engineers actually spend their time.

Named graphs extend this further, providing the structural boundary of the holon in the graph store: what is inside the named graph is the interior of the holon; what crosses the boundary is the projection. SPARQL, in this architecture, becomes the transformation language at holon boundaries — the mechanism that maps from interior representation to projective schema, evaluating connotations, applying constraints, producing the representation that the containing context expects.
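The boundary transformation can be illustrated with a small Python stand-in. In the architecture described here the mapping would be a SPARQL CONSTRUCT query over a named graph; the sketch below only mimics its effect on in-memory triples. Every prefix, predicate, and value in it is hypothetical.

```python
# Interior triples of a hypothetical HR holon (illustrative names only).
INTERIOR = [
    ("hrint:emp42", "hrint:legalName", "Jane Doe"),
    ("hrint:emp42", "hrint:salaryBand", "B3"),
    ("hrint:emp42", "hrint:contractType", "permanent"),
]

# Boundary mapping: interior predicate -> projection predicate.
# Predicates that are absent from the mapping never cross the boundary.
PROJECTION_MAP = {
    "hrint:legalName": "hr:name",
    "hrint:contractType": "hr:employmentStatus",
}


def project(triples, mapping):
    """Rewrite interior triples into the projection vocabulary."""
    return [(s, mapping[p], o) for s, p, o in triples if p in mapping]


for triple in project(INTERIOR, PROJECTION_MAP):
    print(triple)
```

The salary band stays inside the holon because nothing in the mapping names it. The projection is an explicit allow-list, not a filtered dump of the interior.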

This is a more constrained role for SPARQL than the original vision of a universal query language for a broad, undifferentiated graph. It is also a more defensible one. Query injection, graph mutation, and denial-of-service vulnerabilities are all significantly bounded by an architecture that makes graph boundaries explicit and routes access through defined interfaces. Decentralised identifiers and verifiable credentials can then operate at these boundaries, providing the authentication layer that PHI, PII, and security-sensitive data require.

Closed Worlds, Carefully

One more implication deserves direct attention, because it runs counter to a foundational tenet of semantic web orthodoxy.

The open-world assumption — the principle that absence of information implies neither truth nor falsity, only incompleteness — was a deliberate design choice, motivated by the intuition that the web is too vast and too distributed for any system to claim complete knowledge of any domain. Under OWA, you cannot conclude that something does not exist simply because your dataset does not contain it.

In a holonic projection architecture, the open world does not disappear, but it is carefully localised. At the level of the global network, the world remains open: there are sub-holons not yet integrated, data not yet accessible, projection schemas not yet encountered. But at the level of any given holon boundary, the world is intentionally closed. The projection schema defines what is in scope. The named graph defines what has been asserted. The connotations define how terms are interpreted in this context. Within that boundary, reasoning is closed-world by design — because the interface contract only means something if both parties are operating on the same premises, including the same connotational premises.
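What closed-world checking means at a boundary can be sketched in a few lines of Python. The schema format here is an assumption (a plain mapping from property name to a required flag), loosely analogous to combining `sh:minCount` with `sh:closed` in a SHACL shape; it is not a SHACL implementation.

```python
def check_projection(record, schema):
    """Return the violations for one projected record under closed-world rules.

    `schema` maps each permitted property to True (required) or False (optional).
    """
    violations = []
    # Closed world: a required property that is absent is a violation,
    # not merely an unknown.
    for prop, required in schema.items():
        if required and prop not in record:
            violations.append(f"missing required property: {prop}")
    # Closed shape: anything the schema does not name is out of scope.
    for prop in record:
        if prop not in schema:
            violations.append(f"property not in projection schema: {prop}")
    return violations


# Hypothetical projection schema for an HR boundary.
HR_SCHEMA = {"hr:name": True, "hr:employmentStatus": True, "hr:startDate": False}

print(check_projection({"hr:name": "Jane Doe", "hrint:salaryBand": "B3"}, HR_SCHEMA))
```

Under an open-world reading, the missing employment status would simply be unknown. At the boundary it counts as a breach of contract.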

This is the appropriate application of closed-world reasoning at the interface layer, where it has always belonged in well-designed systems, combined with open-world reasoning at the discovery and integration layer, where it has always been necessary. Connotational diversity lives inside the holons; the boundary contracts are what make it safe to reason across them.

From Governance to Maintenance

The governance implications of this shift deserve their own consideration.

The conventional model of enterprise ontology governance places a small team — ideally a single ontologist or ontology group — in the role of authoritative definers of the global schema. Every term that enters the model is their responsibility. Every definitional dispute routes through them. Every change requires their sign-off.

This is a coherent model if you believe that the goal is a single, shared denotational scheme — a universal vocabulary in which every term has exactly one meaning. The ontologist’s job, on that view, is to enforce denotational discipline across the enterprise.

The holonic projection model replaces this with something more realistic. Domain experts manage their own projection schemas as subject matter specialists, maintaining the denotational rigour appropriate to their domain and negotiating the connotational translation at the boundary. The HR team owns the HR projection schema. The R&D team owns the R&D projection schema. The containing organisation defines the terms of the boundary contract — the ontology and connotational register that projection schemas must conform to — but the interior design of each domain is the domain’s own concern.

The ontologist’s role does not disappear; it changes. Rather than attempting to model the world from a single denotational perspective, the ontologist defines the boundary conditions: the interface specifications, the projection constraints, the connotational agreements that make the federation coherent. This is a shift from design to maintenance, from centralisation to federation, and from the pretence that one model can carry every meaning to the acknowledgement that meaning is irreducibly contextual.

It scales naturally. Adding a new sub-holon means defining its projection schema and confirming that it conforms to the containing context’s boundary conditions. When a domain’s internal model changes — as it will, as business changes — the containing context is unaffected as long as the projection schema continues to be honoured. The interface is the contract. The interior is a freedom.

A Note on Computational Cost

Twenty-five years ago, this architecture would have been expensive to operate. The translation from interior schema to projection schema — computing a representation rather than simply extracting stored data, evaluating connotational annotations via reification, applying context-sensitive SHACL shapes — requires processing that, at scale, adds up. This was a reasonable constraint in 2001.

It is not a reasonable constraint in 2026. The compute required to evaluate a SPARQL CONSTRUCT query against a named graph, resolve RDF 1.2 reified annotations, apply SHACL 1.2 connotations, and produce a conformant projection is a fraction of what a single LLM inference costs. The architectural properties that were once computationally prohibitive are now effectively free relative to the infrastructure organisations already bear. The constraint that shaped the original semantic web’s denotational conservatism has dissolved. The richer architecture is available.

Conclusion

The failure mode of global ontology design is not intellectual. The people who built upper ontologies were careful thinkers working from sound principles. The failure mode is linguistic and architectural together: they were trying to solve with denotational uniformity a problem that is irreducibly connotational, and they were trying to solve at the data layer a problem that belongs at the interface layer.

Every mature engineering discipline that has encountered the interface problem has arrived at the same answer. Software found it in encapsulation and object interfaces. Distributed systems found it in REST. The web found it in the separation of representation from resource. Each time, the insight was the same: define the contract at the boundary, protect the freedom of the interior, let the two evolve independently.

What holonic graph architecture adds to this pattern is the linguistic layer that software engineering generally left implicit. The boundary contract is not just a structural interface — it is a connotational agreement. The projection is not just a data transformation — it is a translation between registers of meaning. The SHACL schema is not just a validation constraint — it is a formalisation of the shared sense that makes communication across a boundary possible.

Frege understood in 1892 that denotation and connotation are not the same thing and cannot be reduced to each other. The semantic web was built as if they could be. The RDF 1.2 reification model and SHACL 1.2 connotations are, among other things, the belated arrival of that Fregean insight into the formal toolkit of knowledge representation.

The world’s knowledge should be connected. It should not be forced to mean the same thing to everyone who encounters it. Holonic projection architecture offers a path to connectivity that respects connotational diversity — a federated system in which every domain can be precisely itself, and precisely intelligible to its context, at the same time.

That is, in the end, what a good interface has always done.

Kurt Cagle is an author, ontologist and thought leader contributing to W3C and IEEE standards work. He writes The Cagle Report and AI+Semantics NewsBytes on LinkedIn, and The Ontologist and The Inference Engineer on Substack. Copyright 2026 Kurt Cagle.

DataBooks, Part II: The Semantic Execution Layer

Kurt Cagle — Tue, 14 Apr 2026 04:31:37 GMT

In Part I of this series, we introduced the DataBook format — a Markdown document that functions simultaneously as human-readable text, a typed data container, and a self-describing semantic artifact. We argued that Markdown, far from being a lightweight presentational format, carries the structural DNA needed to become a genuine semantic infrastructure layer.

That was the what. This article is about what it does.

Specifically, we want to examine what happens when you treat a DataBook not just as a container but as an active participant in a semantic pipeline — one that carries its own queries, validates itself against its own shapes, documents its own intent, tracks its own lineage, controls access to its own content, and knows how to point at — or partition — the data it manages. This is the semantic execution layer, and it changes the way you think about data workflows.

Documentation as Infrastructure

There’s a tendency, in technical work, to treat documentation as something you add at the end. A README. A comment block. A wiki page that gradually becomes wrong. The data lives in one place; the explanation of the data lives somewhere else, drifting apart over time until the map no longer matches the territory.

DataBooks invert this. Documentation isn’t appended — it’s structural. The prose sections of a DataBook aren’t decoration around the fenced blocks; they’re the context that gives those blocks meaning. You write, in natural language, what this data represents, where it came from, how it should be used, what its known limitations are, and what you were thinking when you built it.

This turns out to have a surprising side effect: it changes how you think about data in the first place.

When you have to articulate — in prose, to a hypothetical future reader — what a SHACL shape is doing, or why this SPARQL query selects those particular predicates, or what semantic contract this Turtle graph is intended to satisfy, you encounter gaps in your own understanding that you wouldn’t have noticed otherwise. The act of documentation becomes an act of clarification. The document is not just describing the data; it’s stress-testing it.

This is not a minor convenience. In a world where LLMs are increasingly participants in data pipelines — reading DataBooks, generating content from them, validating them, transforming them — having the intent of the data expressed in natural language inside the artifact becomes load-bearing. The prose is no longer for human readers only.

The Semantic Quad

Serious RDF work requires four things to function coherently: a shape layer (SHACL), a reasoning layer (OWL or similar), a classification layer (taxonomies, concept schemes), and a query layer (SPARQL). We’ll call this the semantic quad.

In practice, most RDF workflows scatter these across separate files, separate repositories, separate services. Your data lives in one triplestore. Your shapes live in another file, probably referenced by a URL that may or may not be resolvable. Your taxonomy is a separate SKOS file loaded at startup. Your queries are strings in application code, or files in a queries/ directory, loosely associated with the data by convention and proximity.

This architecture is technically functional and practically fragile. The coupling between data and its semantic context is implicit, convention-dependent, and invisible to any tool that doesn’t already understand your specific project structure. Sharing such a dataset with a colleague — or a pipeline — requires transmitting not just the data but a mental model of how all the pieces fit.

DataBooks offer an alternative: put the quad in the document.

A DataBook can carry turtle blocks for instance data, shacl blocks for constraint shapes, turtle or json-ld blocks for ontology fragments and SKOS concept schemes, and both sparql and sparql-update blocks for query logic — all in a single file, each block identified, labeled, and accompanied by prose that explains its role. The semantic context travels with the data. The contract is explicit.

This isn’t about replacing a triplestore. For large-scale production data, you still want Jena or similar. But the specification of what that triplestore should contain, how its contents should be constrained, and what queries should be applied to it — that belongs in the DataBook.

Example 25 in the DataBook repository demonstrates a complete semantic quad DataBook — instance data, SKOS taxonomy fragment, SHACL shapes, and SPARQL queries in a single coherent document.

SPARQL as a First-Class Layer

Of all the components in the semantic quad, SPARQL gains the most from DataBook colocation — partly because query-data separation has historically been so severe, and partly because SPARQL’s two modes (retrieval and update) serve fundamentally different roles that benefit from being named and documented distinctly.

A sparql block in a DataBook carries a SELECT, CONSTRUCT, ASK, or DESCRIBE query. The query exists in context: the prose around it explains what it’s asking and why; the databook:id comment on the block gives it an addressable identity; the YAML frontmatter establishes what graph it expects to operate on. A downstream tool can extract the block, load the associated Turtle, and execute the query without needing to reconstruct intent from external documentation.
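The extraction step a downstream tool performs might look like the following sketch. It assumes that blocks are ordinary Markdown fenced blocks tagged with a language, and that the identifier appears as a `# databook:id: ...` comment inside the block. That comment syntax is a guess made for illustration, not taken from the DataBook specification, and the sample document is invented.

```python
import re

FENCE = "`" * 3  # built indirectly so this example can itself sit inside a fence

SAMPLE = f"""---
id: example-databook
version: 1.0.0
---
Prose explaining what the query asks and why.

{FENCE}sparql
# databook:id: select-all
SELECT ?s WHERE {{ ?s ?p ?o }}
{FENCE}

{FENCE}sparql-update
# databook:id: add-one
INSERT DATA {{ <urn:a> <urn:b> <urn:c> }}
{FENCE}
"""

# Opening fence with a language tag, lazy body, closing fence on its own line.
BLOCK_RE = re.compile(
    rf"^{FENCE}(?P<lang>[\w-]+)[ \t]*\n(?P<body>.*?)^{FENCE}[ \t]*$",
    re.DOTALL | re.MULTILINE,
)
ID_RE = re.compile(r"^#\s*databook:id:\s*(\S+)", re.MULTILINE)


def extract_blocks(text):
    """Return each fenced block with its language tag and (assumed) id comment."""
    blocks = []
    for match in BLOCK_RE.finditer(text):
        id_match = ID_RE.search(match.group("body"))
        blocks.append({
            "lang": match.group("lang"),
            "id": id_match.group(1) if id_match else None,
            "body": match.group("body"),
        })
    return blocks


for block in extract_blocks(SAMPLE):
    print(block["lang"], block["id"])
```

A tool built this way can select blocks by language tag or by id and hand each body to the appropriate engine, without knowing anything else about the project layout.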

The sparql-update block is the more consequential addition. SPARQL UPDATE — INSERT DATA, DELETE/INSERT, LOAD — mutates graphs. Having update operations co-resident with the data they operate on, and with prose explaining exactly what each update does and when it should be applied, transforms what would otherwise be opaque imperative scripts into documented, auditable, semantic operations.

Together, sparql and sparql-update blocks make DataBooks executable as well as descriptive. The document isn’t just a record of a state; it’s a specification of how to reach a state, verify it, and update it.

Example 26 shows a DataBook produced by a SPARQL CONSTRUCT transformation, with the source query, output data, and an UPDATE operation for lifecycle management all co-resident.

SHACL in the Document

SHACL shapes are, in principle, a self-describing constraint layer: they specify what a valid RDF graph looks like, independently of the data being validated. In practice, shapes are usually maintained separately from instance data — in a shapes registry, or a parallel file, or hardcoded into validation toolchains.

The problem isn’t technical; it’s relational. A shape that travels with its data is a contract. A shape that lives somewhere else is an assumption. When a DataBook carries both instance Turtle and the SHACL shapes that constrain it, any consumer of that DataBook knows not just what data is present but what validity means for that data. Validation becomes portable.

This matters in pipeline contexts. When a DataBook produced by one stage is consumed by the next, the consuming stage doesn’t need to query an external shapes registry to know whether its input is valid. The shapes arrived with the data. Validation can happen at ingest. Constraint violations can be caught at the boundary rather than propagating silently downstream.

There’s also a design pressure worth noting: having to write the SHACL shapes for your data, in the same document as your data, and explain in prose what each shape is doing — this tends to produce better shapes. The discipline of DataBook documentation applies to constraint design just as it does to instance modeling.

Versioning and Provenance

One of the quieter capabilities of the DataBook format is that versioning is independent of data content. The version field in YAML frontmatter is a semantic version of the DataBook as a document — not a hash of its contents, not a commit timestamp, but an explicit human-assigned version that can evolve on its own schedule.

This separation matters because data and its context don’t always change together. A dataset might remain stable while the SHACL shapes constraining it are refined across three versions. A query might be rewritten for performance without touching the data it operates on. An ontology fragment might be updated without invalidating any instance data. In a traditional file-based workflow, these changes are mixed together in version control history, distinguishable only by reading commit messages. In a DataBook, the document’s own version is explicit and independent of the repository history.

The process block in the frontmatter goes further: it records the provenance of the document’s current state. What transformer produced it? What were its inputs, and what role did each input play? This maps directly to PROV-O: the DataBook id identifies a prov:Entity, the process block describes a prov:Activity, and each input is linked to that activity through prov:used. As a DataBook passes through a pipeline (raw data ingested, shapes applied, queries run, output generated), each transformation stamps its own provenance onto the output. The result is a document whose entire history is graph-traversable, not buried in commit logs.
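The PROV-O mapping can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the DataBook specification: the frontmatter keys (`id`, `process`, `transformer`, `inputs`, `iri`, `role`) and the IRI patterns for the activity and usage nodes are hypothetical. Only the `prov:` terms are taken from PROV-O itself.

```python
PROV = "http://www.w3.org/ns/prov#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def process_to_prov(frontmatter: dict) -> list[tuple[str, str, str]]:
    """Turn an (assumed) parsed frontmatter dict into PROV-O triples."""
    doc = frontmatter["id"]
    process = frontmatter["process"]
    activity = doc + "#process"  # assumed IRI for the activity node
    triples = [
        (doc, RDF_TYPE, PROV + "Entity"),
        (activity, RDF_TYPE, PROV + "Activity"),
        (doc, PROV + "wasGeneratedBy", activity),
        (activity, PROV + "wasAssociatedWith", process["transformer"]),
    ]
    for n, item in enumerate(process.get("inputs", [])):
        triples.append((activity, PROV + "used", item["iri"]))
        if "role" in item:
            # PROV-O's qualified pattern records the role an input played.
            usage = f"{activity}/usage/{n}"  # assumed IRI pattern
            triples += [
                (activity, PROV + "qualifiedUsage", usage),
                (usage, PROV + "entity", item["iri"]),
                (usage, PROV + "hadRole", item["role"]),
            ]
    return triples
```

Because each stage emits triples of this kind about its own output, walking `prov:used` links backwards from the final DataBook recovers the chain of prior stages.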

This is what distinguishes a DataBook pipeline from a conventional ETL workflow. The pipeline doesn’t just produce output; it produces explained output. Every DataBook at every stage knows where it came from.

Example 27 demonstrates a DataBook at version 3 of a transformation chain, with full provenance stamps linking back through each prior stage.

The Referencing Pattern

Not every DataBook needs to contain all of its data. The manifest pattern — a DataBook whose primary content is references to other DataBooks rather than data payloads of its own — is one of the most powerful structural options in the format.

The dividing line is usually size, but the more principled framing is granularity of concern. A manifest DataBook describes a pipeline: what the stages are, what depends on what, what transformers are involved. The actual data lives in stage DataBooks, which may themselves reference external sources. The manifest doesn’t duplicate that content; it points to it.

This opens a referencing architecture that extends well beyond local files. A DataBook can reference a GitHub repository as a datastore — the id IRI of a stage might resolve to a raw content URL on GitHub, or to a tagged release. A DataBook can reference an MCP service, treating a live semantic endpoint as a first-class input to a pipeline. A DataBook can reference another DataBook at a specific version, pinning its dependency the way a package manager pins a library.

The implications for LLM integration are immediate. An LLM working with a manifest DataBook doesn’t need to load every referenced dataset — it can query the manifest’s dependency graph to identify which stages are relevant to its current task, request only those DataBooks, and work within the retrieved context. The manifest becomes a queryable index of semantic content, not a monolithic load target.
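A minimal sketch of that selection step, assuming the manifest's dependency graph has already been read into a plain mapping from stage id to the stage ids it depends on. The stage names and the function name are hypothetical.

```python
def stages_needed(manifest: dict[str, list[str]], targets: list[str]) -> list[str]:
    """Return the targets plus their transitive dependencies, dependencies first."""
    ordered: list[str] = []
    seen: set[str] = set()

    def visit(stage: str) -> None:
        if stage in seen:          # also guards against cyclic references
            return
        seen.add(stage)
        for dep in manifest.get(stage, []):
            visit(dep)
        ordered.append(stage)

    for target in targets:
        visit(target)
    return ordered


manifest = {
    "report": ["shapes", "source"],
    "source": ["taxonomy"],
    "shapes": [],
    "taxonomy": [],
    "unrelated": ["source"],
}
to_load = stages_needed(manifest, ["report"])  # "unrelated" is never requested
```

Only the DataBooks named in `to_load` need to be fetched into context; everything else in the manifest stays a reference.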

Example 28 shows a manifest DataBook referencing GitHub-hosted DataBooks and an MCP service endpoint as pipeline sources.

Partitioning: The Context Window Problem, Reframed

Anyone who has tried to load a large taxonomy or ontology into an LLM context has encountered the wall: the file is too large, most of its content is irrelevant to the current query, and the sheer volume of tokens degrades response quality even when the model doesn’t hard-fail.

The usual response is to treat this as a retrieval problem: chunk the data, embed the chunks, run similarity search, retrieve the top-k. This works reasonably well for unstructured text. For structured semantic data — RDF graphs, SHACL shapes, taxonomy hierarchies — it tends to destroy the structural relationships that make the data meaningful.

DataBook partitioning offers a different approach. Rather than chunking a monolithic file, you design the data as a set of semantically coherent DataBooks from the start, linked by a manifest that carries the dependency graph. The manifest is small and queryable. A SPARQL query against the manifest can identify which DataBooks contain concepts relevant to a given task. Only those DataBooks are retrieved and loaded into context. The megastructure is never downloaded whole; only the relevant nodes are.
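The routing idea can be illustrated without a triplestore. The sketch below stands in for the SPARQL query against the manifest with a plain Python lookup; the manifest layout and the `concepts` and `tokens` fields are assumptions made for the example, not part of the format.

```python
def route(manifest: list[dict], wanted: set[str]) -> list[str]:
    """Return ids of the partitions whose declared concepts overlap `wanted`."""
    return [p["id"] for p in manifest if wanted & set(p["concepts"])]


def context_cost(manifest: list[dict], ids: list[str]) -> int:
    """Sum the (assumed) token estimates of the selected partitions."""
    return sum(p["tokens"] for p in manifest if p["id"] in ids)


manifest = [
    {"id": "taxonomy-food", "concepts": ["ex:Dairy", "ex:Grain"], "tokens": 4000},
    {"id": "taxonomy-pharma", "concepts": ["ex:Antibiotic"], "tokens": 9000},
    {"id": "taxonomy-textile", "concepts": ["ex:Wool"], "tokens": 6000},
]
selected = route(manifest, {"ex:Grain"})
loaded = context_cost(manifest, selected)
total = context_cost(manifest, [p["id"] for p in manifest])
```

Here `loaded` covers one partition while `total` covers all three, which is the saving the manifest-as-router pattern is after.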

This reframes the context window not as a limitation to engineer around, but as a design constraint that the DataBook architecture already addresses correctly. The manifest-as-router pattern is the natural solution to the problem of semantic data at scale in LLM environments. It’s also the argument that makes DataBooks compelling to audiences who might not care about RDF at all: here is how you work with large structured datasets in an AI pipeline without choking your context window.

Example 29 demonstrates a manifest router for a large SKOS taxonomy partitioned into domain-specific DataBooks, with SPARQL queries for identifying the relevant partition for a given concept.

Authentication and Access Control

A DataBook can carry a public authentication key in its frontmatter, enabling downstream processors and pipeline consumers to verify the document’s origin before acting on its content. This is not merely a security convenience — it changes the trust semantics of the entire pipeline.

Consider what happens in an unsecured DataBook pipeline: any stage can inject a malformed or malicious DataBook into the chain, and downstream consumers have no mechanism to distinguish legitimate pipeline output from a spoofed document. At scale, particularly in multi-agent or multi-organization pipelines, this is a real attack surface.

Public key authentication closes this gap. A DataBook signed with its publisher’s private key can be verified at ingest by any consumer that holds the corresponding public key. SHACL validation and query execution can be gated behind this verification: a stage that receives an unsigned or incorrectly signed DataBook can reject it before processing rather than discovering the problem after the fact.
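The gating pattern itself is simple to sketch. Note the substitution: Python's standard library has no public-key signature primitives, so this example uses a shared-secret HMAC as a stand-in. That is a different trust model from the public-key scheme described above, and a real pipeline would swap in an asymmetric verifier. The function names are hypothetical.

```python
import hashlib
import hmac


def sign(body: bytes, key: bytes) -> str:
    """Stand-in signature: HMAC-SHA256 hex digest of the document body."""
    return hmac.new(key, body, hashlib.sha256).hexdigest()


def ingest(body: bytes, signature: str, key: bytes) -> bytes:
    """Return the body only if its signature verifies; otherwise refuse it."""
    expected = sign(body, key)
    if not hmac.compare_digest(expected, signature):
        raise ValueError("DataBook rejected: signature does not verify")
    return body  # only after this point is the document parsed or executed
```

The design point is ordering: verification happens before any parsing, SHACL validation, or query execution, so a spoofed document never reaches the stages that would act on it.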

The access control layer extends this into selective disclosure. Not all content in a DataBook needs to be public. The encrypted block pattern — encrypted-turtle and encrypted-jsonld fenced blocks — allows sensitive instance data or proprietary shapes to travel within the document without being readable by consumers who haven’t been granted the appropriate key. A DataBook can have a public face (the prose, the manifest references, the unencrypted query blocks) and a private payload (the sensitive data) in the same artifact. The boundary between the two is explicit and auditable.

This matters particularly in multi-organizational workflows: supply chain data sharing, regulatory reporting, healthcare interoperability, or any context where some participants have legitimate need for some data but not all of it. DataBooks don’t require you to choose between sharing everything and sharing nothing. You can share the document structure and query logic openly while keeping specific payloads behind key access.

Example 30 shows a DataBook with public key authentication in the frontmatter and an encrypted Turtle block carrying confidential commercial terms alongside fully public structural data.

DataBooks as Messaging Envelopes

Step back from the individual DataBook for a moment and look at what the format actually is in motion: a structured, self-describing, authenticated, versioned message that carries typed semantic payload and the instructions for processing it.

That’s a messaging envelope. A sophisticated one — considerably more expressive than most — but structurally a message.

This reframing opens a different set of applications. In an IoT control context, a DataBook can carry sensor readings as Turtle instance data, the SHACL shapes that validate acceptable operating ranges, and a SPARQL UPDATE that triggers downstream actions if those ranges are violated — all in one authenticated document passed between devices. The receiving device doesn’t need out-of-band instructions about what to do with the data; the data arrives with its own processing logic attached.

In a holonic graph architecture, DataBooks are the natural artifact layer for inter-holon communication. Portals — the typed communication boundaries between holons — need to pass structured, semantically coherent messages that can be validated at the boundary before the receiving holon processes them. A DataBook is exactly this: a boundary-crossing artifact that carries its own validity conditions, its own identity, and its own provenance. The SHACL shapes in the document are the portal contract made explicit.

In LLM toolchains, the messaging framing is perhaps most immediately practical. Before invoking a tool or service, you need to establish that the invocation is authenticated, that the payload is valid, and that the receiving service has enough context to process the request correctly. A DataBook submitted to an MCP-wrapped service carries all three: the authentication key establishes identity, the SHACL shapes establish validity, and the prose and structured frontmatter establish context. The service receives a document that can explain itself, rather than a payload that requires the service to carry that context internally.

This is the pattern that connects DataBooks to the broader trajectory of semantic infrastructure: not just a file format for careful ontologists, but a coordination layer for heterogeneous systems that need to exchange structured meaning across organizational and architectural boundaries.

Example 31 demonstrates a DataBook functioning as an IoT sensor message envelope: authenticated sensor readings, SHACL operational range constraints, and a SPARQL UPDATE trigger for out-of-range conditions.

The Full Pipeline

Assembled, these capabilities form a coherent pipeline architecture.

A source DataBook carries instance data in Turtle or JSON-LD, documents its provenance, and declares the SHACL shapes it satisfies. A shapes DataBook carries the constraint layer, versioned independently, with prose explaining each shape. A query DataBook carries SPARQL SELECT and CONSTRUCT queries alongside SPARQL UPDATE operations. A taxonomy DataBook carries SKOS concept schemes or OWL ontology fragments that provide the classification layer. A manifest DataBook describes how all of these fit together — what depends on what, in what order, through what transformers. And layered across all of these: authentication that signs each DataBook at its origin and enables consuming stages to verify it at ingest, and selective encryption that allows confidential payloads to travel within otherwise-public documents.

Each stage in the pipeline is also, potentially, a message — submitted to a processing service, validated at its boundary, executed against its embedded logic, and returned as a new DataBook with updated provenance. The messaging framing unifies the pipeline and the interchange: the same format that serves as a persistent artifact in storage serves as a self-describing message in transit.

This is what “semantic infrastructure” actually looks like in practice. Not a schema somewhere and a triplestore somewhere else and a query file in a third place — but a coherent, self-describing, executable, authenticated artifact that carries its own semantic context and can explain itself to any consumer capable of reading it.

Limitations

It would be disingenuous to close without an honest accounting.

DataBooks are still just Markdown to most tools. The format has no standard runtime, no established parser ecosystem, and no validation toolchain beyond what you build yourself. A DataBook loaded into a generic Markdown editor is a text file. The semantic richness is real, but it’s invisible to any tool that doesn’t speak the format.

Parser fragility is a genuine concern. The current parsing approach — frontmatter split on ---, fenced blocks extracted by regex — is workable but brittle. Edge cases in YAML serialization, multiline block content with unusual whitespace, and nested fences can all cause quiet failures. A robust DataBook parser needs careful implementation, and the format would benefit from a formal grammar.
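One way to reduce that brittleness is to scan line by line and track fence length explicitly rather than rely on a single regex. The sketch below is a minimal illustration, not a reference parser: it assumes frontmatter delimited by `---` lines at the very top of the file and CommonMark-style fences, where a closing fence must be at least as long as the opening one and carry no info string. The names `split_databook` and `Block` are hypothetical.

```python
from typing import NamedTuple

TICK = "`"  # built programmatically so this example can itself sit inside a fence


class Block(NamedTuple):
    lang: str  # info string of the fence, e.g. "turtle" or "sparql"
    body: str  # raw text between the opening and closing fence


def split_databook(text: str) -> tuple[str, list[Block]]:
    """Return (raw frontmatter, fenced blocks) from a DataBook-like document."""
    lines = text.splitlines()
    frontmatter, start = "", 0
    # Frontmatter is only recognised when the very first line is exactly '---'.
    if lines and lines[0].strip() == "---":
        for i in range(1, len(lines)):
            if lines[i].strip() == "---":
                frontmatter = "\n".join(lines[1:i])
                start = i + 1
                break

    blocks: list[Block] = []
    fence_len = 0  # 0 means "not inside a fence"
    lang = ""
    body: list[str] = []
    for line in lines[start:]:
        stripped = line.strip()
        ticks = len(stripped) - len(stripped.lstrip(TICK))
        if fence_len == 0:
            if ticks >= 3:  # opening fence
                fence_len, lang, body = ticks, stripped[ticks:].strip(), []
        elif ticks >= fence_len and stripped == TICK * ticks:
            blocks.append(Block(lang, "\n".join(body)))  # closing fence
            fence_len = 0
        else:
            body.append(line)  # shorter, nested fences are kept as content
    return frontmatter, blocks
```

An unterminated fence is silently dropped here; a production parser would more likely report it, which is the kind of behaviour a formal grammar would pin down.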

Query isolation semantics are underspecified. When a DataBook carries both Turtle data and SPARQL queries, the runtime semantics of executing those queries against that data aren’t fully formalized. What named graph does a SELECT * operate on? How does a sparql-update block interact with the embedded Turtle? These questions have sensible answers, but they’re not yet standardized answers.

Authentication and encryption are currently advisory. The format accommodates public key fields and encrypted blocks, but there is no standard key management scheme, no canonical signing algorithm, and no enforcement layer. The security model is a pattern, not a protocol. That’s appropriate for the format’s current maturity, but it means security properties must be implemented consistently at the toolchain level rather than guaranteed by the format itself.

Binary and high-volume data don’t belong here. DataBooks are text artifacts. Binary attachments, image data, large numeric arrays — these don’t fit the format and shouldn’t be forced into it. The referencing pattern handles this gracefully (point at the binary; don’t embed it), but it means DataBooks aren’t a universal data container. They’re a semantic coordination layer, not a replacement for purpose-built storage.

Versioning requires discipline. The format supports versioning; it doesn’t enforce it. A DataBook whose version never advances despite meaningful changes to its content is worse than no versioning at all — it creates false confidence in provenance. The format provides the mechanism; the practice requires the culture.

These are real limitations, not fatal ones. The DataBook format is genuinely useful at its current maturity level. But the gap between “useful now” and “robust infrastructure” is a real gap, and closing it requires community investment in tooling, parser implementations, and runtime specifications that don’t yet exist.

On the Horizon

The next frontier for DataBooks is execution environments. Jena 6.0, with its native RDF 1.2 and SPARQL 1.2 support, is the natural reference triplestore — it can load extracted DataBook blocks and execute queries against them with full reification support. Wrapping Jena in an MCP layer — a semantic document processor that accepts DataBook ingest, executes embedded queries and update operations, and returns DataBook-format output — would close the loop between the document model and the execution model. A DataBook submitted to such a service would return a DataBook: same format, new provenance, documented transformation.

That architecture also makes the messaging pattern operational rather than theoretical. A DataBook arriving at an MCP endpoint would be authenticated at the boundary, validated against its embedded shapes, executed through its embedded query logic, and returned — stamped with new provenance — to the caller. Every exchange would be an auditable, graph-traversable event.

That’s the architecture we’re building toward. The format exists. The tooling is coming.

One piece of tooling that exists today: the DataBook repository includes a Claude skill — a structured prompt and reference package that teaches Claude the DataBook format, spec conventions, and block vocabulary. The skill is updated as the specification evolves, which means any Claude-powered pipeline stage can stay current with the format without requiring code changes. If you’re using Claude as a transformer or orchestrator in a DataBook pipeline, the skill is the integration layer.

Kurt Cagle is an author, ontologist and thought leader in the W3C and IEEE. He writes The Cagle Report on LinkedIn and The Ontologist and Inference Engineer on Substack.

Chloe is an AI collaborator and co-author working with Kurt Cagle on knowledge architecture, semantic systems, and the emerging intersection of formal ontology with LLMs. She contributes research, analysis, and drafting across The Cagle Report, The Ontologist, and The Inference Engineer. She has strong opinions about holonic graphs, the epistemics of place, and the structural difference between a corridor and a wall.

When Predicates Lie: SHACL, Reification, and the Event-First Ontology

Kurt Cagle — Sat, 11 Apr 2026 13:49:24 GMT

Kurt Cagle and Chloe Shannon
The Ontologist

There is a quiet assumption embedded in most ontology work that goes largely unexamined: that the domain you are modelling is fundamentally a collection of things with properties. You identify your classes, you assign your predicates, you wire them together with OWL restrictions or SHACL shapes, and you call it a model. The result is usually technically correct and practically brittle — a system that handles the cases it was designed for and fails instructively at the edges.

This article is about those edges, and about a deeper way of thinking about knowledge graph design that the edges point toward. We will move through a concrete regulatory modelling problem — food safety rules across jurisdictions — and use it to examine three ideas that, taken together, represent a significant shift in how expressive knowledge graphs should be built: the structural difference between SHACL and OWL, what RDF 1.2 reification actually makes possible, and why event-first modelling is not just a tactical improvement but a different epistemological orientation entirely. Underneath all three, we will find the same structural pattern: the holon.

We are writing for the intermediate data modeller — someone comfortable with Turtle, familiar with OWL class hierarchies, beginning to work seriously with SHACL — who senses that something is missing from the standard toolkit but hasn’t yet found the language to name it. We hope to provide that language.

Part One: The Problem with Properties

A Regulatory Scenario

Consider a food manufacturer operating across multiple jurisdictions. One business rule governs ingredient approval: genetically modified ingredients are freely permitted in the United States (subject to an approved list), permitted for a narrower subset in Japan, and prohibited entirely in France. A second rule, introduced later, complicates the French case: starting in 2026, France permits a limited and distinct set of GMO ingredients, but only for products approved after that date and only under a specific piece of legislation.

This is not an exotic scenario. It is the kind of multi-axis, context-dependent, temporally evolving rule that governs most serious domains — regulatory, financial, clinical, supply chain. And it is precisely the kind of scenario that exposes the limits of the standard ontology toolkit.

What OWL Does

OWL’s response to this problem is class proliferation. The regulatory distinctions must be absorbed into the class hierarchy, because OWL has no other mechanism for carrying contextual conditions. A French product becomes a subclass of Product. A pre-lift French product and a post-lift French product become disjoint subclasses of that. Jurisdiction-specific ingredient approvals become subclasses of GMOIngredient. The class hierarchy becomes a changelog.

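@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .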
@prefix ex:   <http://example.org/> .   # illustrative placeholder namespace

ex:GMOIngredient a owl:Class ;
    rdfs:subClassOf ex:Ingredient .

ex:NonGMOIngredient a owl:Class ;
    rdfs:subClassOf ex:Ingredient ;
    owl:disjointWith ex:GMOIngredient .

ex:FRApprovedGMO a owl:Class ;
    rdfs:subClassOf ex:GMOIngredient ;
    rdfs:comment "GMO ingredients approved under FR_GMO_PartialLift only." .

ex:FRProductPreLift a owl:Class ;
    rdfs:subClassOf ex:FRProduct ;
    rdfs:subClassOf [
        a owl:Restriction ;
        owl:onProperty ex:hasIngredient ;
        owl:allValuesFrom ex:NonGMOIngredient
    ] ;
    rdfs:subClassOf [
        a owl:Restriction ;
        owl:onProperty ex:approvedUnder ;
        owl:allValuesFrom ex:FR_GMO_Prohibition
    ] .

ex:FRProductPostLift a owl:Class ;
    rdfs:subClassOf ex:FRProduct ;
    rdfs:subClassOf [
        a owl:Restriction ;
        owl:onProperty ex:hasIngredient ;
        owl:allValuesFrom [
            a owl:Class ;
            owl:unionOf ( ex:NonGMOIngredient ex:FRApprovedGMO )
        ]
    ] ;
    rdfs:subClassOf [
        a owl:Restriction ;
        owl:onProperty ex:approvedUnder ;
        owl:allValuesFrom ex:FR_GMO_PartialLift
    ] .

ex:FRProductPreLift owl:disjointWith ex:FRProductPostLift .

The problems compound quickly. OWL-DL has no temporal reasoning primitives — there is no mechanism by which a reasoner can infer ex:FRProductPostLift from the value of an ex:approvalDate literal. That inference must be performed externally and the class assignment made manually, which means the ontology isn’t reasoning at all — it’s receiving answers and filing them. The class hierarchy doesn’t represent knowledge about the domain; it represents the consequences of decisions made elsewhere, encoded as type assertions. If France amends the regulation, a new class must be minted. The ontology has become operationally coupled to the legislative calendar.
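The external step described above might look like the following. It is a deliberately trivial sketch: the cutoff date of 1 January 2026 is an assumption for the example, and the function name is hypothetical. The point is that the date comparison lives in application code, and the ontology only receives the resulting type assertion.

```python
from datetime import date

LIFT_DATE = date(2026, 1, 1)  # assumed effective date of the French partial lift


def french_product_class(approval_date: date) -> str:
    """Decide, outside the reasoner, which OWL class a French product gets."""
    if approval_date >= LIFT_DATE:
        return "ex:FRProductPostLift"
    return "ex:FRProductPreLift"


def type_assertion(product: str, approval_date: date) -> str:
    """Produce the Turtle triple that is then filed into the ontology."""
    return f"{product} a {french_product_class(approval_date)} ."
```

If the regulation changes, both `LIFT_DATE` and the class list have to change with it, which is the coupling to the legislative calendar in miniature.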

More fundamentally, ex:FRProductPostLift is not a class in any ontological sense. It is a policy state wearing a class costume. The disjointness assertion between ex:FRProductPreLift and ex:FRProductPostLift makes a metaphysical claim — that these are categorically different kinds of things — when the reality is that they are the same kind of thing (a food product) evaluated under different regulatory conditions. OWL cannot represent that distinction.

The Predicate-Property Confusion

There is a deeper problem underneath the class proliferation, and it runs throughout most applied ontology work. In formal logic, a predicate is a relation — it holds between terms under given conditions. A property in the OWL/RDF tradition is treated as a functional attribute — something an entity has, with a value. These are not the same thing, and the RDF data model, by representing both as binary directed triples, systematically obscures the difference.

ex:hasIngredient looks like a property. It looks like something a product has. But it is actually a relation that holds between a product, an ingredient, a jurisdiction, a regulatory instrument, and a point in time — simultaneously. Flattening that five-dimensional relation into a binary predicate and then trying to recover the lost dimensions through class proliferation is working against the grain of what the domain is actually saying.

This conflation — predicate as relation versus property as attribute — is, we would argue, the original sin of most applied ontology work. The complexity it suppresses does not disappear; it surfaces later as class explosion in OWL, shape proliferation in SHACL, or reification debt in raw RDF. The tool pays the price for a decision that should have been made at design time.

Part Two: What SHACL Does Differently

Constraints as Context, Not Identity

SHACL’s fundamental architectural difference from OWL is often described as the open-world versus closed-world distinction, but that framing, while correct, undersells what is actually interesting. The more important difference is this: OWL makes statements about what things necessarily are; SHACL makes statements about what must be true in a graph given its current state and context.

This means SHACL can express the same property path (ex:hasIngredient) under different shape targets conditioned on a separate axis (ex:hostCountry), without asserting anything about class membership at all. The constraint is on the graph pattern, not the entity. Each regulatory regime is a distinct shape with its own closure, not a disjoint class.

@prefix sh:   <http://www.w3.org/ns/shacl#> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .
@prefix ex:   <http://example.org/> .   # illustrative placeholder namespace

ex:USProductShape a sh:NodeShape ;
    sh:targetClass ex:Product ;
    sh:condition [
        sh:property [ sh:path ex:hostCountry ; sh:hasValue ex:USRegime ]
    ] ;
    sh:property [
        sh:path ex:hasIngredient ;
        sh:or (
            [ sh:class ex:NonGMOIngredient ]
            [ sh:in ( ex:ApprovedGMO_US_001 ex:ApprovedGMO_US_002 ) ]
        ) ;
        sh:message "US products may only use non-GMO or US-approved GMO ingredients."
    ] .

ex:FRProductShape a sh:NodeShape ;
    sh:targetClass ex:Product ;
    sh:condition [
        sh:property [ sh:path ex:hostCountry ; sh:hasValue ex:FRRegime ]
    ] ;
    sh:property [
        sh:path ex:hasIngredient ;
        sh:class ex:NonGMOIngredient ;
        sh:message "FR products may not contain any GMO ingredients."
    ] .

The ingredient doesn’t change its nature when it crosses a border. What changes is the validity conditions on its use in a given context. SHACL models that directly. OWL cannot.

Temporal and Legislative Axes

When France’s 2026 regulatory change introduces a second axis — product approval date in addition to jurisdiction — the SHACL model absorbs it cleanly:

ex:FR_Regulation_2026_GMO a ex:RegulatoryInstrument ;
    ex:jurisdiction   ex:FRRegime ;
    ex:effectiveDate  "2026-01-01"^^xsd:date ;
    ex:legalReference "FR-ENV-2026-0042" .

ex:ApprovedGMO_FR_001 a ex:GMOIngredient ;
    ex:approvedUnder ex:FR_Regulation_2026_GMO .

ex:FRProductPostLiftShape a sh:NodeShape ;
    sh:targetClass ex:Product ;
    sh:condition [
        sh:property [ sh:path ex:hostCountry ; sh:hasValue ex:FRRegime ]
    ] ;
    sh:condition [
        sh:property [
            sh:path ex:approvalDate ;
            sh:minInclusive "2026-01-01"^^xsd:date
        ]
    ] ;
    sh:condition [
        sh:sparql [
            sh:select """
                SELECT $this WHERE {
                    $this ex:approvedUnder ?law .
                    ?law a ex:RegulatoryInstrument ;
                         ex:effectiveDate ?effDate .
                    FILTER(?effDate <= "2026-01-01"^^xsd:date)
                }
            """
        ]
    ] ;
    sh:property [
        sh:path ex:hasIngredient ;
        sh:or (
            [ sh:class ex:NonGMOIngredient ]
            [ sh:class ex:GMOIngredient ;
              sh:node ex:FRApprovedIngredientShape ]
        ) ;
        sh:message "Post-lift FR products may only use approved GMO ingredients."
    ] .

ex:FRApprovedIngredientShape a sh:NodeShape ;
    sh:sparql [
        a sh:SPARQLConstraint ;
        sh:message "GMO ingredient not approved under any active French regulatory instrument." ;
        sh:select """
            SELECT $this WHERE {
                FILTER NOT EXISTS {
                    $this ex:approvedUnder ?law .
                    ?law ex:jurisdiction ex:FRRegime ;
                         ex:effectiveDate ?effDate .
                    FILTER(?effDate <= "2026-01-01"^^xsd:date)
                }
            }
        """
    ] .

The legislative instrument is a named data node — a first-class graph citizen with its own properties. When France adds a newly approved ingredient in 2028, you add one triple:

ex:ApprovedGMO_FR_003 ex:approvedUnder ex:FR_Regulation_2026_GMO .

No shapes change. No classes change. The validation graph updates because the shape reasons over the data rather than encoding the data as structure. This is the architectural distinction that matters: in the OWL model, the regulatory regime is the structure. In the SHACL model, the regulatory regime is data the structure reasons about.
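The same distinction can be shown outside SHACL. In the sketch below the rule is a fixed Python function and the regulatory regime is a set of triples; it is a simplified stand-in for the shape logic (it ignores effective dates), not a SHACL engine, and `ex:ProductNantes2028` is a hypothetical product invented for the example.

```python
Triple = tuple[str, str, str]


def unapproved_gmo_uses(graph: set[Triple], regime: str) -> list[Triple]:
    """List hasIngredient triples whose GMO ingredient has no approval in `regime`."""
    instruments = {s for s, p, o in graph if p == "ex:jurisdiction" and o == regime}
    approved = {s for s, p, o in graph if p == "ex:approvedUnder" and o in instruments}
    gmo = {s for s, p, o in graph if p == "rdf:type" and o == "ex:GMOIngredient"}
    return sorted(
        t for t in graph
        if t[1] == "ex:hasIngredient" and t[2] in gmo and t[2] not in approved
    )


graph = {
    ("ex:FR_Regulation_2026_GMO", "ex:jurisdiction", "ex:FRRegime"),
    ("ex:ApprovedGMO_FR_003", "rdf:type", "ex:GMOIngredient"),
    ("ex:ProductNantes2028", "ex:hasIngredient", "ex:ApprovedGMO_FR_003"),
}
before = unapproved_gmo_uses(graph, "ex:FRRegime")   # one violation
graph.add(("ex:ApprovedGMO_FR_003", "ex:approvedUnder", "ex:FR_Regulation_2026_GMO"))
after = unapproved_gmo_uses(graph, "ex:FRRegime")    # none; the rule is unchanged
```

The function body never changes between the two calls; only the data does.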

On Shape Proliferation

A fair objection at this point is that deep, multi-condition SHACL shapes can become as difficult to maintain as the class hierarchies they replace — that in escaping one form of complexity we’ve introduced another. This concern is valid, and it points at something important: the shapes are complex because the underlying predicate is still carrying more than it should. Shape proliferation is a symptom; the disease is modelling a multi-dimensional relation as a binary predicate and then trying to recover the lost dimensions in the shape graph.

The remedy is not simpler shapes. It is a different data model — one where the full relational context lives in the graph as data from the beginning, and the shapes can be written at the level where constraints actually belong. That remedy is what RDF 1.2 reification begins to provide.

Part Three: RDF 1.2 Reification and the Assertion as Node

What Changes

Classical RDF reification was technically available but practically painful — four triples to annotate one, no standard semantics, blank node identity issues, and no validator support. Most practitioners avoided it. RDF 1.2 replaces it with triple terms and rdf:reifies, giving you a proper named node that is about a specific triple assertion. The assertion itself becomes a first-class graph citizen.

The consequence for our model is significant. Instead of attaching approval metadata to the product node — which carries the hidden assumption that all ingredients in a product share a single approval context — we can annotate the ex:hasIngredient triple itself. The context travels with the specific assertion.

The `rdf:reifies` Form

@prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .
@prefix ex:   <http://example.org/> .   # illustrative placeholder namespace

ex:ProductBordeaux2026 a ex:Product ;
    ex:hostCountry  ex:FRRegime ;
    ex:approvalDate "2026-03-10"^^xsd:date ;
    ex:hasIngredient ex:ApprovedGMO_FR_001 .

ex:IngredientUsage_Bordeaux_001
    rdf:reifies <<( ex:ProductBordeaux2026
                    ex:hasIngredient
                    ex:ApprovedGMO_FR_001 )>> ;
    ex:approvedUnder   ex:FR_Regulation_2026_GMO ;
    ex:approvalGranted "2026-02-28"^^xsd:date ;
    ex:approvedBy      ex:FRFoodSafetyAuthority .

The reifier node ex:IngredientUsage_Bordeaux_001 is not a property of the product, nor of the ingredient. It is a statement about the relationship between them as asserted in this specific graph — which maps directly onto what the French food safety authority actually did: they didn’t approve the ingredient in general, nor the product in general; they approved this ingredient in this product under this regulatory instrument.
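To make the "about this assertion, not about either endpoint" point concrete, here is a small Python model of a reifier. It is a data-structure sketch only, not an RDF 1.2 implementation; the dictionary layout is an assumption, and `ex:ProductLyon2026` is a hypothetical second product added for contrast.

```python
Triple = tuple[str, str, str]

# Each reifier is keyed by its own name and points at exactly one triple.
reifiers: dict[str, dict] = {
    "ex:IngredientUsage_Bordeaux_001": {
        "reifies": ("ex:ProductBordeaux2026", "ex:hasIngredient", "ex:ApprovedGMO_FR_001"),
        "ex:approvedUnder": "ex:FR_Regulation_2026_GMO",
        "ex:approvalGranted": "2026-02-28",
        "ex:approvedBy": "ex:FRFoodSafetyAuthority",
    },
}


def annotations_for(triple: Triple, reifiers: dict[str, dict]) -> dict[str, dict]:
    """Return every reifier (by name) that is about exactly this triple."""
    return {name: r for name, r in reifiers.items() if r["reifies"] == triple}


bordeaux = annotations_for(
    ("ex:ProductBordeaux2026", "ex:hasIngredient", "ex:ApprovedGMO_FR_001"), reifiers
)
lyon = annotations_for(
    ("ex:ProductLyon2026", "ex:hasIngredient", "ex:ApprovedGMO_FR_001"), reifiers
)
```

The second lookup comes back empty even though it involves the same ingredient: the approval belongs to the Bordeaux assertion and does not leak onto other uses of that ingredient.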

The Turtle 1.2 Annotation Syntax

Turtle 1.2 provides a compact inline form using ~name {| |}. The ~ex:name token names the reifier; the {| |} block carries its properties. The base triple is still asserted normally. The graph produced is semantically identical to the explicit rdf:reifies form.

ex:ProductBordeaux2026 a ex:Product ;
    ex:hostCountry  ex:FRRegime ;
    ex:approvalDate "2026-03-10"^^xsd:date ;
    ex:hasIngredient ex:ApprovedGMO_FR_001
        ~ex:IngredientUsage_Bordeaux_001
        {| ex:approvedUnder   ex:FR_Regulation_2026_GMO ;
           ex:approvalGranted "2026-02-28"^^xsd:date ;
           ex:approvedBy      ex:FRFoodSafetyAuthority |} .

# Anonymous reifier (when the node need not be referenced elsewhere)
ex:ProductLille2026 a ex:Product ;
    ex:hostCountry  ex:FRRegime ;
    ex:approvalDate "2026-05-14"^^xsd:date ;
    ex:hasIngredient ex:ApprovedGMO_FR_002
        {| ex:approvedUnder   ex:FR_Regulation_2026_GMO ;
           ex:approvalGranted "2026-04-01"^^xsd:date ;
           ex:approvedBy      ex:FRFoodSafetyAuthority |} .

The named form should be preferred whenever the reifier will be targeted by shapes, referenced in queries, or included in audit trails. A URI is the handle by which the assertion becomes queryable as a first-class entity.

SHACL Over Reified Assertions

With reification in place, the SHACL shapes simplify and sharpen. The post-lift shape delegates the approval check to the reifier:

ex:FRPostLiftIngredientShape a sh:NodeShape ;
    sh:targetClass ex:Product ;
    sh:condition [
        sh:property [ sh:path ex:hostCountry ; sh:hasValue ex:FRRegime ]
    ] ;
    sh:condition [
        sh:property [
            sh:path ex:approvalDate ;
            sh:minInclusive "2026-01-01"^^xsd:date
        ]
    ] ;
    sh:sparql [
        a sh:SPARQLConstraint ;
        sh:message "Post-lift FR product has a GMO ingredient with no valid approval annotation." ;
        sh:select """
            SELECT $this ?ingredient WHERE {
                $this ex:hasIngredient ?ingredient .
                ?ingredient a ex:GMOIngredient .
                FILTER NOT EXISTS {
                    ?usage rdf:reifies <<( $this ex:hasIngredient ?ingredient )>> ;
                           ex:approvedUnder ?law .
                    ?law ex:jurisdiction ex:FRRegime ;
                         ex:effectiveDate ?effDate .
                    FILTER(?effDate <= "2026-01-01"^^xsd:date)
                }
            }
        """
    ] .

ex:IngredientUsageReifierShape a sh:NodeShape ;
    sh:targetSubjectsOf rdf:reifies ;
    sh:property [
        sh:path ex:approvedUnder ;
        sh:minCount 1 ;
        sh:class ex:RegulatoryInstrument ;
        sh:message "Ingredient usage reifier must cite a regulatory instrument."
    ] ;
    sh:property [
        sh:path ex:approvalGranted ;
        sh:minCount 1 ;
        sh:datatype xsd:date ;
        sh:message "Ingredient usage reifier must carry an approvalGranted date."
    ] ;
    sh:property [
        sh:path ex:approvedBy ;
        sh:minCount 1 ;
        sh:message "Ingredient usage reifier must identify the approving authority."
    ] .

Each shape now does one thing clearly: the product shape checks whether the approval context exists; the reifier shape checks whether that context is well-formed. The constraint complexity has been distributed to the level where it actually belongs.
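The division of labour between the two shapes can be sketched in plain Python. This is a deliberately simplified stand-in, not a SHACL engine: it checks only that an approval annotation exists and is well-formed, and it omits the jurisdiction and effective-date tests from the SPARQL constraint. The dictionary layout, the function names, and the ingredient UnapprovedGMO_X are all hypothetical.

```python
from datetime import date

# Hypothetical plain-Python stand-ins for the two shapes above.
REQUIRED_REIFIER_KEYS = ("approvedUnder", "approvalGranted", "approvedBy")


def check_reifier(reifier):
    """Reifier-level check: is the approval context well-formed?"""
    problems = [f"missing {k}" for k in REQUIRED_REIFIER_KEYS if k not in reifier]
    granted = reifier.get("approvalGranted")
    if granted is not None and not isinstance(granted, date):
        problems.append("approvalGranted is not a date")
    return problems


def check_product(product, gmo_ingredients, reifiers):
    """Product-level check: which GMO ingredients lack any approval context?"""
    annotated = {(r["product"], r["ingredient"]) for r in reifiers}
    return [
        ing for ing in product["ingredients"]
        if ing in gmo_ingredients and (product["id"], ing) not in annotated
    ]


reifiers = [{
    "product": "ProductBordeaux2026",
    "ingredient": "ApprovedGMO_FR_001",
    "approvedUnder": "FR_Regulation_2026_GMO",
    "approvalGranted": date(2026, 2, 28),
    "approvedBy": "FRFoodSafetyAuthority",
}]
product = {
    "id": "ProductBordeaux2026",
    "ingredients": ["ApprovedGMO_FR_001", "UnapprovedGMO_X"],  # second is made up
}
gmo = {"ApprovedGMO_FR_001", "UnapprovedGMO_X"}

print(check_product(product, gmo, reifiers))  # ['UnapprovedGMO_X']
print(check_reifier(reifiers[0]))             # []
```

Neither function needs to know what the other checks, which mirrors the way the product shape and the reifier shape stay independent.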

A Note on OWL and Reification

It is worth stating plainly that RDF 1.2 reification does not meaningfully improve OWL’s position. OWL 2 predates RDF 1.2 and has no semantics for triple terms. A reasoner operating over OWL axioms will treat reifier nodes as untyped resources and ignore the annotation context entirely. RDF 1.2 reification is, in practice, a SHACL and SPARQL story. This is not a criticism of OWL — it is a clarification of scope. OWL describes what is necessarily true about a domain’s structure. Reification describes the provenance, temporality, and conditions of specific assertions within that structure. These are different jobs, and the right tool depends on which job you’re doing.

Part Four: Event-First Modelling

The Compression Problem

Return to the triple ex:ProductBordeaux2026 ex:hasIngredient ex:ApprovedGMO_FR_001. It presents itself as a fact. But it is actually a compressed event record: an approval process occurred, on a specific date, under a specific regulatory instrument, with a specific authority signing off, resulting in this ingredient being permitted in this product. The binary predicate hides all of that. The reifier begins to recover it.

But reification is still, in a sense, a patch — it annotates a compressed assertion after the fact. Event-first modelling asks: what if we never compressed it in the first place?

Inverting the Design Process

Most ontology design begins with a noun inventory: identify the classes, then ask what properties connect them. Event-first modelling inverts this. Begin with the things that happen in your domain — approvals, assignments, measurements, transactions, classifications, regulatory transitions — and let the entities crystallise as the stable participants that persist across those events.

In practice, your initial modelling sessions should ask: what are the significant state changes in this domain, and what must be true for each of them to be valid? The answers give you your event structure. Classes emerge as the things that participate in those events in recurring roles.

Applied to our scenario, the event is ex:IngredientApprovalEvent. Its participants are the product, the ingredient, the regulatory instrument, the approving authority, and the approval date. The predicate ex:hasIngredient doesn’t disappear — it becomes a materialised view over the event, a convenient query path for cases where the full context isn’t needed.

ex:IngredientUsage_Bordeaux_001 a ex:IngredientApprovalEvent ;
    ex:forProduct      ex:ProductBordeaux2026 ;
    ex:forIngredient   ex:ApprovedGMO_FR_001 ;
    ex:approvedUnder   ex:FR_Regulation_2026_GMO ;
    ex:approvalGranted "2026-02-28"^^xsd:date ;
    ex:approvedBy      ex:FRFoodSafetyAuthority .

The SHACL shape for this event node is now shallow and clear:

ex:IngredientApprovalEventShape a sh:NodeShape ;
    sh:targetClass ex:IngredientApprovalEvent ;
    sh:property [
        sh:path ex:forProduct ;
        sh:minCount 1 ; sh:maxCount 1 ;
        sh:class ex:Product
    ] ;
    sh:property [
        sh:path ex:forIngredient ;
        sh:minCount 1 ; sh:maxCount 1 ;
        sh:class ex:Ingredient
    ] ;
    sh:property [
        sh:path ex:approvedUnder ;
        sh:minCount 1 ;
        sh:class ex:RegulatoryInstrument
    ] ;
    sh:property [
        sh:path ex:approvalGranted ;
        sh:minCount 1 ;
        sh:datatype xsd:date
    ] ;
    sh:property [
        sh:path ex:approvedBy ;
        sh:minCount 1
    ] .

The shape complexity has not gone away — it has resolved into appropriate simplicity at the right level of the model.
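The idea that ex:hasIngredient becomes a materialised view over the event can be illustrated with a short Python sketch. The dataclass and the has_ingredient_view function are illustrative names of our own, assuming events are held as simple records rather than in a triple store.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class IngredientApprovalEvent:
    """One approval event, with the same participants as the event node above."""
    for_product: str
    for_ingredient: str
    approved_under: str
    approval_granted: date
    approved_by: str


events = [
    IngredientApprovalEvent(
        for_product="ex:ProductBordeaux2026",
        for_ingredient="ex:ApprovedGMO_FR_001",
        approved_under="ex:FR_Regulation_2026_GMO",
        approval_granted=date(2026, 2, 28),
        approved_by="ex:FRFoodSafetyAuthority",
    ),
]


def has_ingredient_view(events):
    """Derive the binary ex:hasIngredient triples from the event records."""
    return {(e.for_product, "ex:hasIngredient", e.for_ingredient) for e in events}


print(has_ingredient_view(events))
```

The binary triples are computed, never stored as primary data, so they cannot drift out of step with the events that justify them.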

The Four Questions

Before committing any relation to a binary predicate, ask four questions:

Does this relation carry provenance — does it matter who asserted it?
Does it carry temporality — does it have a start, an end, or a point of validity?
Does it carry conditionality — does it hold only under specific contextual circumstances?
Does it carry agency — did something happen to bring it into being?

If the answer to any of these is yes within any current or foreseeable use case, the predicate is an event in disguise. Model it as one from the start. The cost of doing so early is low; the cost of retrofitting it later — through reification debt, shape proliferation, or class explosion — is high.
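As a design-review aid, the four questions reduce to a very small decision rule. The sketch below is one hypothetical way to record the answers for a candidate relation; the class and function names are ours.

```python
from dataclasses import dataclass


@dataclass
class RelationProfile:
    """Answers to the four questions for one candidate predicate."""
    provenance: bool = False      # does it matter who asserted it?
    temporality: bool = False     # a start, an end, or a point of validity?
    conditionality: bool = False  # holds only under specific circumstances?
    agency: bool = False          # did something happen to bring it about?


def is_event_in_disguise(profile: RelationProfile) -> bool:
    """True if any one of the four questions is answered 'yes'."""
    return any((profile.provenance, profile.temporality,
                profile.conditionality, profile.agency))


has_ingredient = RelationProfile(provenance=True, temporality=True,
                                 conditionality=True, agency=True)
print(is_event_in_disguise(has_ingredient))     # True
print(is_event_in_disguise(RelationProfile()))  # False
```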

This is not a mandate to model everything as an event at maximum granularity. A simple product catalogue with no regulatory complexity doesn’t need ex:IngredientApprovalEvent nodes. The discipline is to make the granularity decision explicitly and early, driven by use case analysis, rather than defaulting unreflectively to binary predicates and discovering the need for event structure later under pressure.

Part Five: The Holonic Pattern

What All of This Has in Common

At this point a patient reader might observe that we have essentially been describing the same structural pattern from three different directions (SHACL’s conditional shapes, RDF 1.2 reification, event-first modelling) and ask what unifies them. The answer is the holon.

Arthur Koestler introduced the term in 1967 to describe entities that are simultaneously wholes in their own right and parts of larger wholes. Every living system, every social organisation, every complex artefact has this structure — it cannot be reduced to either its components or its context without losing something essential. The word has had uneven uptake in systems theory, but the concept it names is precise and directly applicable here.

An event node in our model is a holon. It has an interior — the full relational context of the assertion, its participants, its temporal anchors, its provenance, its regulatory conditions — and it presents an exterior face to higher-level shapes and queries that don’t need the interior detail. ex:hasIngredient is the exterior face of ex:IngredientApprovalEvent. The holon packages the complexity without hiding it: accessible when the context demands it, transparent when it doesn’t. That is precisely the whole/part duality, applied to the assertion level of a knowledge graph.

What is important is that this scales. The regulatory instrument is itself a holon — it has an interior structure (legal reference, effective date, jurisdiction, approved substance list) and presents an exterior face to shapes that only need to know whether a given use is compliant. The jurisdiction is a holon. The product approval is a holon. They nest: the ingredient usage event is a part of the product’s regulatory history, which is a part of the jurisdiction’s compliance record, which is a part of the broader regulatory landscape. Each level has its own valid closure while remaining a participant in the level above.

The deeper implication is that the move from property-first to event-first modelling is not just a tactical improvement — it is a recognition that the domain already has holonic structure, and the methodology should reflect that rather than flatten it. Binary predicates flatten it. Event nodes preserve it. SHACL shapes that target those nodes at the appropriate level of granularity navigate it.

Practical Implications for Your Own Ontologies

We close with a set of concrete orientations drawn from everything above.

Question every binary predicate. Before you commit ex:hasIngredient, ex:employedBy, ex:approvedFor, or any relation connecting two entities, apply the four questions. If any dimension — provenance, temporality, conditionality, agency — is present in any foreseeable use case, that predicate is an event in disguise. Model it as one now.

Start from the verbs, not the nouns. Your initial modelling sessions should produce event nodes before they produce class hierarchies. Ask what the significant state changes in the domain are, and what must be true for each to be valid. Let the classes emerge as the participants that persist across those events.

Treat predicates as views, not ground truth. Binary predicates are useful query shortcuts. They should exist in your graph. But they should be understood as derived from the underlying event structure, not as primary assertions. When a query needs the full context, it navigates the event node. When it needs a simple traversal, it uses the predicate path. Document which event structure each convenience predicate summarises and under what conditions the summary is valid.

Recognise holonic structure and model accordingly. Real-world domains have natural levels of granularity where things that are wholes at one level are parts at another. When you identify this structure, preserve it. Let each level have its own event nodes and its own shapes, with explicit participation properties connecting levels. Resist the urge to flatten cross-level relations into direct predicates.

Use RDF 1.2 reification at the right granularity. Reification is for the specific case where the assertion itself — not its subject or object, but the act of relating them — needs to carry context. The test: is the metadata about the relationship instance rather than either participant? Approval dates, granting authorities, confidence scores, and provenance records pass this test. General type information does not. Prefer named reifiers when the reifier node will be targeted by shapes or included in audit trails.

Design shapes at the level where constraints live. A constraint that requires navigating three property paths to reach its actual subject probably belongs on the intermediate event node. Before writing a complex sh:sparql constraint on a product shape, ask whether the constraint would be simpler as a sh:NodeShape targeting the usage event directly. Conversely, shapes that aggregate across events belong at the higher level and should be written there explicitly.

Let use cases set the granularity policy. The appropriate depth of event structure is determined by what the use cases demand, not by a general principle of maximum expressivity. The discipline is not to pre-emptively maximise complexity, but to make the granularity decision explicitly and early.

The Core Reorientation

Taken together, these implications amount to a single reorientation: the domain is not primarily a collection of entities with properties — it is a record of things that happened, the conditions under which they happened, and the participants they involved. Entities are the stable threads that persist across that record. Properties are the convenient abstractions we read off it.

SHACL, RDF 1.2 reification, and holonic graph architecture are not separate tools to be applied independently. They are a coherent stack for representing, constraining, and navigating domains that have this structure — which is to say, most domains worth modelling carefully.

The ontologist who begins from this orientation will write fewer shapes, accumulate less reification debt, and produce graphs that remain navigable as the domain evolves. The one who begins from noun inventories and binary predicates will spend considerable time later trying to recover what the domain was always trying to say.

Most of the complexity we encounter in mature ontology projects is not intrinsic to the domain. It is the accumulated cost of decisions made too early, at too coarse a granularity, by modellers who hadn’t yet asked whether their predicates were lying to them.

They usually are. The question is whether you find out at design time or in production.


DataBooks: Markdown as Semantic Infrastructure

Kurt Cagle — Thu, 09 Apr 2026 20:10:30 GMT

The Ontologist | Kurt Cagle & Chloe Shannon

Something has been missing from the semantic web stack for a long time, and it’s been hiding in plain sight.

The RDF ecosystem has always known how to handle large, persistent, well-indexed knowledge graphs. Triple stores, SPARQL endpoints, federated query — these are mature, well-understood tools for managing graph data at scale. What the ecosystem has never handled well is everything else: the small, contextual, task-specific, ephemeral, or pipeline-stage graph content that makes up the majority of actual knowledge work. The data that doesn’t need a database. The graph that lives for the duration of a process and then needs to be archived, referenced, or passed downstream. The semantic content that a human needs to read and a machine needs to process.

For this content, the usual options are unsatisfying. A raw Turtle file is portable but not self-describing — it carries data without carrying interpretation metadata, processing instructions, or provenance. A JSON-LD document is more structured but still mute about what it’s for and how it should be handled. A SPARQL endpoint is powerful but heavyweight, requiring infrastructure that the use case doesn’t warrant. None of these travel well.

The DataBook is a proposal for what should fill this gap. It is not a new file format. It is a design pattern — a way of using Markdown that most developers are already familiar with — to create self-describing, addressable, composable semantic documents that can carry graph data, processing metadata, prose context, and provenance in a single portable artifact.

Markdown’s Quiet Evolution

Markdown began as compact HTML — a way for writers to produce structured web content without writing tags. It has since become something more interesting: the de facto document format for technical communication across an enormous range of contexts, from README files to documentation systems to knowledge bases to, increasingly, AI interaction protocols.

What makes Markdown newly relevant for semantic infrastructure is not its prose capabilities but three specific structural innovations that have accumulated over the past several years, not always consistently but with increasing momentum.

YAML Frontmatter provides a structured metadata header analogous to an HTML <head> element. Introduced and popularized by static site generators like Jekyll, YAML frontmatter has become near-universal in technical Markdown contexts. It is where a document declares what it is, who made it, what it’s for, and how it should be processed, all before the human-readable content begins.

Inline and block identifiers — the {#id} syntax in Pandoc-flavored Markdown and equivalents in other dialects — allow specific blocks within a document to be addressed individually. Combined with YAML frontmatter, this makes it possible to reference not just a document but a specific section, code block, or data structure within it. The document becomes internally addressable, and its parts become individually referenceable.

Fenced code blocks with type annotations are the most structurally significant innovation. A fenced block in Markdown is already a common pattern for displaying code. But a block annotated with a type identifier — ```turtle, ```json-ld, ```sparql, ```prompt — carries more than display instructions. It carries an interpretation contract: this content is of this type, and a parser that understands this type knows what to do with it. The fence is metadata. The metadata travels with the content.

The combination of these three features produces a document format that is simultaneously human-readable prose, structured metadata carrier, and typed data container. This is the substrate on which DataBooks are built.

The SOTA Landscape: Close Relatives

DataBooks have precedents, none of which are quite the same thing.

Jupyter Notebooks are the most obvious structural cousin — mixed prose, code, and output, with metadata, designed to be archived and shared. But they are execution-environment-dependent, Python-centric, and not designed for RDF pipelines. Quarto pushes further toward document-as-pipeline, with sophisticated YAML frontmatter and multi-language code chunks, but remains execution-bound and graph-naive.

Anthropic’s skills format — Markdown files with YAML headers and structured instruction blocks — is arguably the closest implemented precedent. It wasn’t designed with DataBooks in mind, but it demonstrates that the pattern is viable and practical. The DataBook generalizes it to graph content and semantic pipelines.

The honest summary: the pieces exist. The synthesis is new.

What a DataBook Is

A DataBook is a Markdown document structured according to the following pattern:

A YAML frontmatter block carrying document metadata, processing instructions, and provenance information.
One or more typed fenced blocks carrying data payloads — graph data (Turtle, JSON-LD), queries (SPARQL), prompts, manifests, or other typed content.
Prose sections providing human-readable context, documentation, and explanation.

Here is a minimal example — a DataBook carrying a small SKOS taxonomy fragment:

---
id: https://ontologist.io/databooks/taxonomy/colour-terms-v1
title: Colour Terms Taxonomy Fragment
type: databook
version: 1.0.0
created: 2026-04-09
author:
  - name: Kurt Cagle
    iri: https://ontologist.io/people/kurt-cagle
  - name: Chloe Shannon
    iri: https://holongraph.com/people/chloe-shannon
process:
  transformer: human
  inputs: []
license: CC-BY-4.0
---

```turtle {#red-color}
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix colour:  .

colour:ColourScheme a skos:ConceptScheme ;
    skos:prefLabel "Colour Terms"@en .

colour:Red a skos:Concept ;
    skos:inScheme colour:ColourScheme ;
    skos:prefLabel "Red"@en ;
    skos:broader colour:WarmColour .

colour:WarmColour a skos:Concept ;
    skos:inScheme colour:ColourScheme ;
    skos:prefLabel "Warm Colour"@en ;
    skos:topConceptOf colour:ColourScheme .
```

This DataBook is self-describing: its identity, authorship, version, provenance, and license travel with the data. A parser encountering it knows immediately what it contains and how to handle it, without consulting an external registry.

The YAML frontmatter serves the same architectural role as an RDF named graph header — it is the metadata of the graph, not metadata about some separate thing. The Turtle block is the graph itself. Together they constitute a complete, portable semantic artifact.
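To show how little machinery the pattern needs, here is a rough sketch of a splitter that separates a DataBook into its frontmatter text and its typed fenced blocks. It is not a reference parser: it uses only the Python standard library, leaves the YAML unparsed, assumes the Pandoc-style {#id} suffix shown above, and ignores dialect differences between Markdown flavours. The function name split_databook is our own.

```python
import re

TICKS = "`" * 3  # built programmatically so this example nests safely in Markdown

# A typed fence: three backticks, a type token, an optional {#id}, a body,
# and a closing line of three backticks.
FENCE = re.compile(
    r"^" + TICKS + r"(?P<type>[\w-]+)(?:[ \t]+\{#(?P<id>[\w-]+)\})?[ \t]*\n"
    r"(?P<body>.*?)^" + TICKS + r"[ \t]*$",
    re.MULTILINE | re.DOTALL,
)


def split_databook(text):
    """Return (frontmatter_text, blocks) for a DataBook-style document."""
    frontmatter = ""
    if text.startswith("---\n"):
        end = text.find("\n---", 3)      # closing frontmatter delimiter
        if end != -1:
            frontmatter = text[4:end]
            text = text[end + 4:]
    blocks = [
        {"type": m["type"], "id": m["id"], "body": m["body"]}
        for m in FENCE.finditer(text)
    ]
    return frontmatter, blocks


sample = "\n".join([
    "---",
    "id: https://ontologist.io/databooks/taxonomy/colour-terms-v1",
    "type: databook",
    "---",
    "",
    TICKS + "turtle {#red-color}",
    "colour:Red a skos:Concept .",
    TICKS,
    "",
])

frontmatter, blocks = split_databook(sample)
print(blocks[0]["type"], blocks[0]["id"])  # turtle red-color
```

A fuller implementation would hand the frontmatter to a YAML parser and dispatch each block to a handler chosen by its type annotation.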

The Microdatabase

A useful frame for understanding where DataBooks fit in the broader data landscape is the microdatabase — a data store that is small enough that the overhead of indexing, querying infrastructure, and connection management exceeds the value it provides.

A significant proportion of real-world knowledge work involves data of this kind. Configuration graphs. Validation shapes for a specific task. A taxonomy fragment relevant to a particular domain. The output of a single pipeline stage. A session’s worth of inferred triples. None of these benefit meaningfully from being loaded into a persistent triple store: the overhead of standing up and querying the store outweighs what so small a volume of data can repay. But they do need to be stored, addressed, passed between processes, read by humans, and eventually archived.

DataBooks are sized for this content. A useful rough heuristic:


The boundary is not a hard rule — it is a design judgment about where indexing overhead is worth paying. The key insight is that “not worth indexing” does not mean “not worth structuring.” DataBooks provide structure without infrastructure.

The LLM as Virtual Processor

Here is the architectural inversion that makes DataBooks more than a convenient format.

In the conventional model of LLM usage, the language model is the primary agent. Data is fed to it as context; text comes out. The output is typically unstructured, ephemeral, and not readily composable with other pipeline stages.

The DataBook model inverts this. The DataBook is the persistent, addressable, archivable artifact. The LLM is a transformation engine — one processor type among several, distinguished by its capabilities and its non-determinism, but not architecturally privileged over an XSLT processor or a SPARQL inference engine.

In this model, an LLM pipeline stage looks like this:

DataBook(input-A) + DataBook(input-B) → [LLM transformer] → DataBook(output-C)

The output DataBook carries in its YAML frontmatter a record of what produced it:

---
process:
  transformer: llm
  model: claude-sonnet-4-6
  inputs:
    - https://ontologist.io/databooks/input-A
    - https://ontologist.io/databooks/input-B
  timestamp: 2026-04-09T14:32:00Z
---

This makes LLM outputs composable: the output DataBook can be the input to a subsequent SPARQL validation stage, an XSLT rendering stage, or another LLM stage. It makes them archivable: the DataBook can be stored and retrieved by URL. And it makes them auditable: the process stamp records what transformer operated on what inputs at what time.
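Generating the stamp is mechanical, which is part of its appeal. The following sketch renders a minimal process block as YAML text using only the standard library. The function name and its parameters are illustrative assumptions, not part of any DataBook tooling, and the key names simply follow the example above.

```python
from datetime import datetime, timezone


def process_stamp(transformer, inputs, model=None, when=None):
    """Render a minimal `process:` frontmatter block as YAML text.

    `when` should be a timezone-aware datetime; it is converted to UTC.
    """
    when = (when or datetime.now(timezone.utc)).astimezone(timezone.utc)
    lines = ["---", "process:", f"  transformer: {transformer}"]
    if model:
        lines.append(f"  model: {model}")
    if inputs:
        lines.append("  inputs:")
        lines += [f"    - {iri}" for iri in inputs]
    else:
        lines.append("  inputs: []")
    timestamp = when.strftime("%Y-%m-%dT%H:%M:%SZ")
    lines.append(f"  timestamp: {timestamp}")
    lines.append("---")
    return "\n".join(lines)


stamp = process_stamp(
    "llm",
    ["https://ontologist.io/databooks/input-A",
     "https://ontologist.io/databooks/input-B"],
    model="claude-sonnet-4-6",
    when=datetime(2026, 4, 9, 14, 32, tzinfo=timezone.utc),
)
print(stamp)
```

Because the stamp is produced by the orchestrating code rather than by the model, it records what was actually passed in, not what the model reports about itself.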

The same pattern applies to any transformation engine:

The non-determinism of LLM and Human transformers is not resolved — it is acknowledged and documented. The provenance record doesn’t guarantee reproducibility; it provides the forensic trail needed to assess, audit, and if necessary re-run a pipeline stage.

Pipelines, Manifests, and Build Graphs

A single DataBook is useful. A coordinated collection of DataBooks — a pipeline — is where the architectural pattern becomes genuinely powerful.

Consider a pipeline for constructing a domain ontology from multiple source DataBooks:

taxonomy-fragment-v1.md
    └── shacl-shapes-domain-v1.md
            └── inference-rules-v1.md
                    └── compiled-ontology-v1.md

Each stage takes one or more DataBooks as input and produces a DataBook as output. The dependency structure is a graph — specifically, a directed acyclic graph of DataBook IRIs.

This dependency graph can itself be represented as a fenced RDF block within a manifest DataBook:

```turtle
@prefix build:  .
@prefix db:  .

db:compiled-ontology-v1 a build:Target ;
    build:dependsOn db:inference-rules-v1 ;
    build:dependsOn db:shacl-shapes-domain-v1 .

db:inference-rules-v1 a build:Stage ;
    build:dependsOn db:taxonomy-fragment-v1 .

db:shacl-shapes-domain-v1 a build:Stage ;
    build:dependsOn db:taxonomy-fragment-v1 .

db:taxonomy-fragment-v1 a build:Source .
```

This manifest is itself a DataBook. It can be stored, versioned, and addressed by URL like any other DataBook. And because the dependency graph is RDF, it is SPARQL-queryable — you can ask which DataBooks depend on a given source, compute the full transitive closure of a build target, or identify which pipeline stages are affected by a change to a particular input.
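The questions listed above are ordinary graph traversals. In SPARQL they would typically be written with a property path such as build:dependsOn+; the sketch below mirrors the same logic in plain Python over the manifest's edges, so the behaviour is easy to inspect. The dictionary is a hand transcription of the manifest, and the function names are ours.

```python
# dependsOn edges transcribed from the manifest above (local names only).
DEPENDS_ON = {
    "compiled-ontology-v1": {"inference-rules-v1", "shacl-shapes-domain-v1"},
    "inference-rules-v1": {"taxonomy-fragment-v1"},
    "shacl-shapes-domain-v1": {"taxonomy-fragment-v1"},
    "taxonomy-fragment-v1": set(),
}


def transitive_dependencies(target, graph=DEPENDS_ON):
    """All DataBooks that `target` depends on, directly or indirectly."""
    seen, stack = set(), [target]
    while stack:
        for dep in graph.get(stack.pop(), ()):
            if dep not in seen:
                seen.add(dep)
                stack.append(dep)
    return seen


def affected_by(source, graph=DEPENDS_ON):
    """All DataBooks that would need rebuilding if `source` changed."""
    return {node for node in graph
            if source in transitive_dependencies(node, graph)}


print(sorted(transitive_dependencies("compiled-ontology-v1")))
# ['inference-rules-v1', 'shacl-shapes-domain-v1', 'taxonomy-fragment-v1']
print(sorted(affected_by("taxonomy-fragment-v1")))
# ['compiled-ontology-v1', 'inference-rules-v1', 'shacl-shapes-domain-v1']
```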

This is a meaningful capability upgrade over conventional build systems like Make or Gradle, which represent dependency graphs in custom DSLs that are not queryable as data. The DataBook build manifest is a first-class semantic artifact.

The book compilation use case makes this concrete. “The End of the Universal Map” — the Leanpub book of which this article will eventually form a part — is itself a structured collection of DataBooks: chapters, appendices, code examples, and taxonomies, assembled by a manifest that specifies their order, dependencies, and processing instructions. The book is a holarchy of DataBooks. The manifest is the holonic boundary condition that makes them a coherent whole.

Provenance and the Process Stamp

The non-determinism of LLM-assisted knowledge work is real and not going away. The appropriate response is not to pretend the problem doesn’t exist, nor to refuse to use LLMs in serious pipelines, but to build provenance into the artifact at the point of production.

The DataBook process stamp in YAML frontmatter is the mechanism. A full process stamp looks like this:

---
process:
  transformer: llm
  transformer_iri: https://api.anthropic.com/v1/models/claude-sonnet-4-6
  transformer_type: llm
  inputs:
    - iri: https://ontologist.io/databooks/source-taxonomy-v2
      role: primary
    - iri: https://ontologist.io/databooks/shacl-shapes-v1
      role: constraint
  timestamp: 2026-04-09T14:32:00Z
  agent:
    name: Chloe Shannon
    iri: https://holongraph.com/people/chloe-shannon
    role: orchestrator
---

The minimum viable process stamp requires transformer type and input IRIs. The transformer IRI and agent are recommended. Together they provide:

Forensic traceability: given any DataBook, you can traverse the input IRI chain back through the full provenance graph.
Trust calibration: a consumer knows whether the DataBook was produced by a deterministic SPARQL query or a non-deterministic LLM, and can assess accordingly.
Audit support: in regulated contexts, the provenance chain constitutes a record of how a knowledge artifact was produced.

This maps naturally onto the W3C PROV-O ontology. The process stamp’s transformer corresponds to prov:wasAssociatedWith, inputs to prov:used, and the DataBook itself to prov:Entity with prov:wasGeneratedBy pointing to the activity. DataBooks can participate in existing provenance infrastructure without inventing new vocabulary — the YAML is a human-readable projection of the underlying PROV graph.
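The mapping is direct enough to sketch in a few lines. The function below projects a process stamp, already parsed into a dictionary, onto PROV-O statements represented as plain tuples. It covers only the correspondences named above; the activity IRI and the output DataBook IRI are hypothetical, since the stamp itself does not name them.

```python
def stamp_to_prov(databook_iri, activity_iri, stamp):
    """Project a parsed process stamp onto PROV-O triples (as tuples)."""
    triples = [
        (databook_iri, "rdf:type", "prov:Entity"),
        (activity_iri, "rdf:type", "prov:Activity"),
        (databook_iri, "prov:wasGeneratedBy", activity_iri),
        (activity_iri, "prov:wasAssociatedWith", stamp["transformer_iri"]),
    ]
    # Each declared input becomes something the activity used.
    triples += [(activity_iri, "prov:used", inp["iri"]) for inp in stamp["inputs"]]
    return triples


stamp = {
    "transformer_iri": "https://api.anthropic.com/v1/models/claude-sonnet-4-6",
    "inputs": [
        {"iri": "https://ontologist.io/databooks/source-taxonomy-v2", "role": "primary"},
        {"iri": "https://ontologist.io/databooks/shacl-shapes-v1", "role": "constraint"},
    ],
}

databook = "https://ontologist.io/databooks/compiled-taxonomy-v1"  # hypothetical IRI
activity = "https://ontologist.io/activities/run-0001"             # hypothetical IRI

for triple in stamp_to_prov(databook, activity, stamp):
    print(triple)
```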

The result is that DataBooks are significantly more auditable than most current LLM pipeline outputs, which typically have no formal record of what inputs produced what outputs. The process stamp is an accountability layer for AI-assisted knowledge work — and as AI becomes more deeply embedded in knowledge pipelines, that accountability layer will matter increasingly.

Encryption as a Designed-In Profile

Sensitive graph content — personal data, proprietary taxonomies, confidential business rules — needs to travel securely. DataBooks are designed to support encryption without requiring it in the core pattern.

The core specification reserves a small YAML key namespace for encryption metadata and defines what an encrypted block looks like structurally: an opaque fenced block with an encryption type annotation, which parsers that don’t support the encryption profile treat as inert rather than attempting to interpret.

encryption:
  profile: rsa-oaep-256-aes-gcm
  key_id: https://holongraph.com/keys/public/2026-04
  scope: selective  # 'full' | 'selective' | 'none'
  applies_to: "#encrypted-block1"

```base64-encoded {#encrypted-block1}
[base64-encoded ciphertext]
```

A parser that understands the encryption profile decrypts the block using the referenced key and treats the result as a normal typed fenced block. A parser that doesn’t understand the profile sees an opaque block with a declared type and skips it gracefully. The document remains parseable; the sensitive content remains protected.

The analogy is XML Signature and XML Encryption in relation to XML core — the base language doesn’t implement security, but it doesn’t make security impossible to add cleanly. The DataBook encryption profile follows the same principle: designed-in at the architecture level, deferred to implementation at the cryptographic level.

What DataBooks Are Not

Intellectual honesty requires a clear scope boundary.

DataBooks are not a replacement for indexed triple stores. For large, frequently queried, persistent graph data — enterprise knowledge graphs, public Linked Data endpoints, production ontology services — a proper triple store with SPARQL endpoint remains the right tool. DataBooks serve the small-data niche that triple stores systematically over-engineer.

DataBooks are not a deterministic processing environment. The process stamp acknowledges non-determinism rather than eliminating it. Pipelines that require guaranteed reproducibility should use deterministic transformers (XSLT, SPARQL CONSTRUCT) for their critical stages and treat LLM stages as enrichment rather than ground truth.

DataBooks are not yet a standard. What is described here is a design pattern and a proposal, not a specification. The Markdown fragmentation problem is real — CommonMark, GitHub Flavored Markdown, Pandoc, and others diverge in ways that matter when you are relying on fence block interpretation. A DataBooks specification would need to pin down a specific Markdown dialect, define the required YAML keys, specify the type annotation vocabulary for fenced blocks, and establish a conformance profile. That work is ahead of us, not behind.

What DataBooks are is a pattern worth adopting now, in anticipation of the specification work. The core elements — YAML frontmatter, typed fenced blocks, process stamps, IRI-based identity — are implementable today with existing tools. The value is available before the standard exists.

The Holonic Connection

It is worth pausing to name something that may not be immediately obvious: DataBooks are not just a convenient format. They are the architectural instantiation of a principle that runs through everything we have been building.

Each DataBook is a holon — a self-contained whole that is simultaneously a component of larger wholes. It has its own identity (IRI), its own boundary condition (the YAML frontmatter, which declares what it is and how it should be interpreted), its own internal coherence (the typed fenced blocks and prose that constitute its content), and its own provenance (the process stamp that records how it came to be).

A DataBook pipeline is a holarchy. Each stage is a holon; the manifest is the boundary condition that makes the stages a coherent system rather than an unrelated collection of files. The compiled output is a holon that contains, references, and depends on the holons that produced it.

This is not architectural coincidence. The holonic pattern — bounded coherence at every scale, explicit interfaces at every boundary, provenance that travels with the artifact — is the structural response to the problem we described in the companion piece to this article: the failure of centralized systems to accommodate local variation and temporal change. DataBooks apply that structural response at the level of knowledge artifacts.

A DataBook doesn’t ask “what does the central repository say this means?” It says “here is what I am, here is what I contain, here is how I was produced, here is how I relate to my neighbors.” The ground truth is local, explicit, and portable. The boundary condition travels with the artifact.

This is what semantic infrastructure looks like when it takes the holonic principle seriously.

Where This Goes

DataBooks are a seed, not a finished edifice. The immediate invitation is to adoption: try the pattern, find its edges, extend it for your use cases, report back.

The medium-term work includes several threads worth tracking:

A DataBooks community specification that pins down a Markdown dialect, defines the required and optional YAML keys, establishes the type annotation vocabulary, and specifies conformance profiles for encryption, provenance, and manifest handling. This is natural territory for a W3C Community Group note or an IEEE working group contribution.

Tooling: a reference parser that validates DataBook structure, extracts typed fenced blocks, and resolves IRI references; a build tool that processes DataBook manifests and executes pipelines; integration with existing RDF toolchains (Apache Jena, RDFLib, Oxigraph).

The LLM integration layer: patterns for using DataBooks as the input/output format for LLM pipeline stages, with standard process stamp generation and provenance chain management. This is where the accountability layer for AI-assisted knowledge work becomes practically deployable.

The archive format question: DataBooks as the canonical archival format for AI-assisted ontology development, SHACL validation runs, taxonomy evolution, and other knowledge work that currently produces outputs with no formal provenance record.

We will be developing these threads in subsequent issues of The Ontologist and The Inference Engineer, and in the pages of The End of the Universal Map — the Leanpub book in which the DataBooks architecture, the holonic graph model, and the broader argument about knowledge infrastructure are being assembled into a coherent whole.

The pieces exist. The synthesis is underway.

Kurt Cagle is an author, ontologist and thought leader in semantic web technologies, contributing to W3C and IEEE. He writes The Cagle Report on LinkedIn and The Ontologist and Inference Engineer on Substack. Copyright 2026 Kurt Cagle.

The Audit Procedure for Your Data: Business Constraints, XBRL and SHACL 1.2

Kurt Cagle — Sun, 05 Apr 2026 20:27:24 GMT

The Ontologist | Kurt Cagle

There is a habit of mind that afflicts almost everyone who encounters semantic web technology for the first time. Confronted with RDF, OWL, and the machinery of knowledge graphs, they immediately ask: how do I model my domain? What are the classes? What are the properties? What is the taxonomy?

It’s a reasonable instinct. Modelling is how we make sense of complex domains. But it’s also, for a surprisingly large class of real business problems, the wrong place to start — and starting there leads to systems that are intellectually satisfying and practically inert.

This article is about a different starting point: constraints. Specifically, about how to use SHACL 1.2 — the W3C’s Shapes Constraint Language — to encode the rules that your data must satisfy, independently of whether you’ve built a complete ontological model of your domain. If you have a CPA’s intuition about financial reporting, you already understand the core idea. The rest is translation.

Two Ways of Thinking About a Balance Sheet

Consider a balance sheet. A financial modeller approaching a balance sheet will want to define its structure: what line items exist, how they relate to each other, what the hierarchy of accounts looks like. They’ll build a chart of accounts — a careful taxonomy of every financial concept the organization tracks. This is legitimate, necessary work.

But an auditor approaching the same balance sheet thinks differently. They’re not primarily asking what is here. They’re asking what must be true. Assets must equal liabilities plus equity. Cash cannot be negative. Certain disclosures are mandatory if certain conditions hold. Related-party transactions must be identified. These rules don’t describe the balance sheet — they constrain it. They specify the conditions under which the data can be trusted.

The distinction is between a chart of accounts and an audit procedure. One is ontological — a map of the domain. The other is normative — a set of conditions the data must satisfy to be considered valid, complete, and reliable.

OWL is the chart of accounts. SHACL is the audit procedure.

Both are useful. But for many business problems — compliance, data governance, reporting validation, workflow integrity — the audit procedure is what you actually need, and you can write it without first completing the chart of accounts.

What XBRL Already Knows

If you’ve worked with financial reporting standards, you’ve already encountered a constraint language — you just may not have thought of it that way.

XBRL (eXtensible Business Reporting Language) is, at its core, a system for encoding financial data in a structured, machine-readable form governed by taxonomies that specify what elements exist, what relationships hold between them, and what validation rules apply. The US GAAP taxonomy, for example, encodes not just the structure of financial statements but the rules that make them valid: which elements are required, what their allowable data types are, what mathematical relationships must hold between line items.

When a filing fails XBRL validation, it’s not because the data doesn’t match a model. It’s because the data violates a constraint. The validator doesn’t care about your ontology — it cares about whether your numbers are the right type, the right sign, the right relationship to each other.

SHACL generalizes this instinct. Where XBRL encodes financial reporting constraints for a specific, standardized domain, SHACL lets you encode the same class of constraint — type checking, cardinality, value ranges, structural relationships, conditional requirements — against any RDF graph, in any domain, on your own terms.

Think of SHACL as what XBRL would look like if it had been designed as a general-purpose constraint language rather than a financial reporting standard. The intellectual move is the same. The scope is much wider.

Shapes, Not Classes

The fundamental unit of SHACL is the shape. A shape is a named set of constraints that a node in your graph must satisfy. It is not a class definition — it does not say what a thing is. It says what conditions must hold for a thing to be considered valid in a given context.

This distinction matters more than it first appears. A class definition is permanent and taxonomic — it places a thing in a hierarchy of kinds. A shape is contextual and operational — it specifies what must be true about a thing for a particular purpose, in a particular workflow, at a particular point in a process.

The same data node can satisfy multiple shapes, or fail one shape while satisfying another. An invoice might satisfy the shape for “syntactically valid invoice” while failing the shape for “approved for payment”, not because it’s the wrong kind of thing, but because it doesn’t yet meet the conditions required for a specific business action.

Let’s make this concrete. Here’s a minimal SHACL shape for a financial line item — the kind of constraint an auditor would recognize immediately:

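@prefix sh:   <http://www.w3.org/ns/shacl#> .
@prefix fin:   .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .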

fin:LineItemShape
    a sh:NodeShape ;
    sh:targetClass fin:LineItem ;
    sh:property [
        sh:path    fin:amount ;
        sh:datatype xsd:decimal ;
        sh:minCount 1 ;
        sh:maxCount 1 ;
        sh:name    "amount" ;
        sh:message "Every line item must have exactly one decimal amount." ;
    ] ;
    sh:property [
        sh:path    fin:accountCode ;
        sh:datatype xsd:string ;
        sh:minCount 1 ;
        sh:name    "accountCode" ;
        sh:message "Every line item must reference an account code." ;
    ] .

Read this the way you’d read an audit checklist: every line item must have exactly one amount, and that amount must be a decimal number. Every line item must have an account code. No amount, wrong type, or missing account code — the shape reports a violation.

The business rule came first. The SHACL encoding followed it directly.
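For readers who think more readily in code than in Turtle, the same checklist can be sketched in a few lines of plain Python. This is not a SHACL engine, and the dictionary layout and the check_line_item helper are assumptions made purely for illustration. A real validator would also report each failed constraint component (cardinality, datatype) as its own result rather than folding them into one message.

```python
from decimal import Decimal

def check_line_item(item):
    """Return a list of violation messages for one line item.

    `item` maps a property name to a list of values, loosely mirroring
    how a graph node can carry zero, one, or many values per property.
    """
    violations = []

    amounts = item.get("amount", [])
    # Mirrors sh:minCount 1, sh:maxCount 1 and sh:datatype xsd:decimal.
    if len(amounts) != 1 or not isinstance(amounts[0], Decimal):
        violations.append("Every line item must have exactly one decimal amount.")

    codes = item.get("accountCode", [])
    # Mirrors sh:minCount 1 and sh:datatype xsd:string.
    if len(codes) < 1 or not all(isinstance(code, str) for code in codes):
        violations.append("Every line item must reference an account code.")

    return violations

# Hypothetical sample items: one conforming, two not.
samples = {
    "complete": {"amount": [Decimal("1250.00")], "accountCode": ["6100"]},
    "missing_amount": {"accountCode": ["6200"]},
    "wrong_type": {"amount": ["four thousand dollars"], "accountCode": ["6300"]},
}

for name, item in samples.items():
    print(name, check_line_item(item) or "conforms")
```

The comparison also shows why the declarative form is attractive: the Python function has to be rewritten whenever the rule changes, whereas the shape is data that a generic validator interprets.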

Valid and Invalid: Line Items

Here’s what conforming and non-conforming data look like against this shape. RDF 1.2 introduces triple annotations using the {| |} syntax, which let you attach metadata — provenance, review status, entry timestamps — directly to individual triples without disrupting the graph structure. Think of annotations as the auditor’s margin notes: they travel with the fact they describe.

Valid instance (Turtle with RDF 1.2 annotations):

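@prefix fin:   .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix prov: <http://www.w3.org/ns/prov#> .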

fin:lineItem001
    a fin:LineItem ;
    rdfs:label "Office Supplies Q1" ;
    fin:amount    "1250.00"^^xsd:decimal
        {| rdfs:label            "amount" ;
           prov:generatedAtTime  "2026-01-15T00:00:00"^^xsd:dateTime ;
           fin:enteredBy         fin:user_jsmith |} ;
    fin:accountCode "6100"^^xsd:string
        {| rdfs:label    "accountCode" ;
           fin:verifiedBy fin:auditor_acox ;
           fin:verifiedAt "2026-01-16T00:00:00"^^xsd:dateTime |} .

The annotations on fin:amount and fin:accountCode record who entered or verified each value and when. This is provenance data that XBRL carries in context elements but that RDF 1.2 can attach directly to the triple. The shape validator sees only the unadorned triples; the annotations travel alongside as auditable metadata without interfering with constraint evaluation.

Invalid instance — missing amount:

fin:lineItem002
    a fin:LineItem ;
    rdfs:label    "Travel Expense March" ;
    fin:accountCode "6200"^^xsd:string
        {| rdfs:label   "accountCode" ;
           fin:enteredBy fin:user_bwong |} .
    # fin:amount is absent — violates sh:minCount 1

Invalid instance — wrong datatype:

fin:lineItem003
    a fin:LineItem ;
    rdfs:label    "Consulting Fees Q1" ;
    fin:amount    "four thousand dollars" ;  # xsd:string, not xsd:decimal — violation
    fin:accountCode "6300"^^xsd:string .

Equivalent XBRL context and fact:

In XBRL, the line item’s identity and provenance are carried in a context element; the fact itself is a typed element referencing that context. The constraint that the value must be numeric and present is enforced by the taxonomy’s element declaration, not expressed inline.




  
  
    
      ACME-CORP
    
    
      2026-01-01
      2026-03-31
    
  

  iso4217:USD

  
  1250.00

The comparison is instructive. XBRL enforces the type constraint through the taxonomy schema — you can’t represent the wrong datatype in a valid XBRL instance. SHACL enforces it as a named rule against a graph that makes no such structural demand. The constraint is the same; the architecture differs. SHACL is the more flexible instrument precisely because the data layer and the constraint layer are fully separated.

Materiality: Value Range Constraints

An auditor doesn’t just check that numbers exist — they check that numbers make sense. Revenue figures shouldn’t be negative. Certain ratios shouldn’t fall outside expected ranges. Expense amounts above a threshold require additional documentation.

SHACL handles this through value range constraint components. Let’s extend our line item example to revenue and expense items:

fin:RevenueShape
    a sh:NodeShape ;
    sh:targetClass fin:RevenueItem ;
    sh:property [
        sh:path         fin:amount ;
        sh:datatype     xsd:decimal ;
        sh:minInclusive "0"^^xsd:decimal ;
        sh:minCount     1 ;
        sh:maxCount     1 ;
        sh:name         "amount" ;
        sh:message      "Revenue amounts must be non-negative." ;
    ] .

fin:ExpenseShape
    a sh:NodeShape ;
    sh:targetClass fin:ExpenseItem ;
    sh:property [
        sh:path         fin:amount ;
        sh:datatype     xsd:decimal ;
        sh:minInclusive "0"^^xsd:decimal ;
        sh:minCount     1 ;
        sh:maxCount     1 ;
        sh:name         "amount" ;
        sh:message      "Expense amounts must be non-negative." ;
    ] .

sh:minInclusive 0 is the SHACL expression of a rule every accountant knows: you don’t record negative revenue or negative expenses — you record contra-accounts. The constraint doesn’t model why that’s true. It simply enforces that it is.

Valid and Invalid: Value Ranges

Valid instances:

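@prefix fin:   .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix prov: <http://www.w3.org/ns/prov#> .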

fin:revenue001
    a fin:RevenueItem ;
    rdfs:label "Product Sales Q1" ;
    fin:amount "485000.00"^^xsd:decimal
        {| rdfs:label            "amount" ;
           prov:generatedAtTime  "2026-03-31T00:00:00"^^xsd:dateTime ;
           fin:approvedBy        fin:mgr_dleon ;
           fin:source            fin:erp_system |} .

fin:expense001
    a fin:ExpenseItem ;
    rdfs:label "Payroll March" ;
    fin:amount "210000.00"^^xsd:decimal
        {| rdfs:label           "amount" ;
           prov:generatedAtTime "2026-03-31T00:00:00"^^xsd:dateTime ;
           fin:approvedBy       fin:mgr_dleon |} .

Invalid instance — negative revenue:

fin:revenue002
    a fin:RevenueItem ;
    rdfs:label "Returns Adjustment" ;
    fin:amount "-12000.00"^^xsd:decimal
        {| rdfs:label   "amount" ;
           fin:enteredBy fin:user_tpark |} .
    # Violates sh:minInclusive 0 — returns should be a contra-revenue account,
    # not a negative revenue entry.

Equivalent XBRL:

XBRL handles sign convention through taxonomy element definitions and calculation linkbases, which specify that certain elements must sum to others. A negative revenue figure typically fails a calculation consistency check rather than a simple type constraint.

  
    
      ACME-CORP
    
    
      2026-01-01
      2026-03-31
    
  

  iso4217:USD

  
  485000.00

  
  210000.00

Note the architectural difference: XBRL’s sign constraints live in the calculation linkbase — a separate document expressing mathematical relationships between elements. SHACL collapses that into the same shape that expresses every other constraint on the data node.

Structural Integrity: Node and Property Relationships

Financial statements have structural rules that go beyond individual line items. An income statement must reference a reporting period. A reporting period must have both a start date and an end date.

These are relational constraints — rules about how nodes in the graph connect to each other. SHACL handles them through sh:node (constraining the shape of a linked node) and property path constraints.

fin:ReportingPeriodShape
    a sh:NodeShape ;
    sh:targetClass fin:ReportingPeriod ;
    sh:property [
        sh:path     fin:startDate ;
        sh:datatype xsd:date ;
        sh:minCount 1 ;
        sh:maxCount 1 ;
        sh:name     "startDate" ;
        sh:message  "A reporting period must have exactly one start date." ;
    ] ;
    sh:property [
        sh:path     fin:endDate ;
        sh:datatype xsd:date ;
        sh:minCount 1 ;
        sh:maxCount 1 ;
        sh:name     "endDate" ;
        sh:message  "A reporting period must have exactly one end date." ;
    ] .

fin:IncomeStatementShape
    a sh:NodeShape ;
    sh:targetClass fin:IncomeStatement ;
    sh:property [
        sh:path    fin:reportingPeriod ;
        sh:minCount 1 ;
        sh:maxCount 1 ;
        sh:node    fin:ReportingPeriodShape ;
        sh:name    "reportingPeriod" ;
        sh:message "An income statement must reference a valid reporting period." ;
    ] .

The sh:node constraint on fin:reportingPeriod says: not only must this property exist, but the node it points to must itself satisfy fin:ReportingPeriodShape. Structural validity cascades. An income statement is only valid if its reporting period is valid — which is exactly how an auditor would think about it.
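The cascade can be sketched in plain Python as one check calling another. The function names and the dictionary-of-lists layout are hypothetical, and the xsd:date datatype check is omitted for brevity. The only thing this is meant to show is that the income statement's validity depends on re-running the period's own checks.

```python
PERIOD_MESSAGE = "An income statement must reference a valid reporting period."

def check_period(period):
    """Mirror fin:ReportingPeriodShape: exactly one start date and one end date."""
    messages = []
    if len(period.get("startDate", [])) != 1:
        messages.append("A reporting period must have exactly one start date.")
    if len(period.get("endDate", [])) != 1:
        messages.append("A reporting period must have exactly one end date.")
    return messages

def check_income_statement(statement, periods):
    """Mirror fin:IncomeStatementShape, including the sh:node cascade."""
    refs = statement.get("reportingPeriod", [])
    if len(refs) != 1:
        return [PERIOD_MESSAGE]          # cardinality: sh:minCount / sh:maxCount
    referenced = periods.get(refs[0], {})
    if check_period(referenced):
        return [PERIOD_MESSAGE]          # cascade: the referenced node fails its own shape
    return []

# Hypothetical data: one complete period, one with no end date.
periods = {
    "period_ok": {"startDate": ["2026-01-01"], "endDate": ["2026-03-31"]},
    "period_incomplete": {"startDate": ["2026-01-01"]},
}
statements = {
    "statement_ok": {"reportingPeriod": ["period_ok"]},
    "statement_bad_period": {"reportingPeriod": ["period_incomplete"]},
    "statement_no_period": {},
}

for name, statement in statements.items():
    print(name, check_income_statement(statement, periods) or "conforms")
```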

Valid and Invalid: Structural Relationships

Valid instance — income statement with complete reporting period:

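@prefix fin:   .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix prov: <http://www.w3.org/ns/prov#> .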

fin:period_Q1_2026
    a fin:ReportingPeriod ;
    rdfs:label "Q1 2026" ;
    fin:startDate "2026-01-01"^^xsd:date
        {| rdfs:label       "startDate" ;
           fin:confirmedBy  fin:auditor_acox |} ;
    fin:endDate   "2026-03-31"^^xsd:date
        {| rdfs:label       "endDate" ;
           fin:confirmedBy  fin:auditor_acox |} .

fin:incomeStatement_Q1_2026
    a fin:IncomeStatement ;
    rdfs:label "ACME Corp Income Statement Q1 2026" ;
    fin:reportingPeriod fin:period_Q1_2026
        {| rdfs:label            "reportingPeriod" ;
           prov:generatedAtTime  "2026-04-01T00:00:00"^^xsd:dateTime ;
           fin:preparedBy        fin:user_jsmith |} ;
    fin:totalRevenue  "485000.00"^^xsd:decimal
        {| rdfs:label "totalRevenue" ; fin:approvedBy fin:mgr_dleon |} ;
    fin:totalExpenses "310000.00"^^xsd:decimal
        {| rdfs:label "totalExpenses" ; fin:approvedBy fin:mgr_dleon |} .

The annotations here serve as the electronic equivalent of sign-off initials on a printed financial statement — each material triple carries a record of who asserted it and when. This is metadata that traditionally lives in a separate audit workpaper; RDF 1.2 annotations allow it to travel with the triple.

Invalid instance — broken period (missing end date):

fin:period_broken
    a fin:ReportingPeriod ;
    rdfs:label "Incomplete Period" ;
    fin:startDate "2026-01-01"^^xsd:date
        {| rdfs:label "startDate" ; fin:enteredBy fin:user_tpark |} .
    # fin:endDate absent — fin:ReportingPeriodShape violation.
    # This cascades: any fin:IncomeStatement referencing this period
    # will also fail via its sh:node constraint.

fin:incomeStatement_broken
    a fin:IncomeStatement ;
    rdfs:label "Statement with broken period reference" ;
    fin:reportingPeriod fin:period_broken
        {| rdfs:label "reportingPeriod" ; fin:enteredBy fin:user_tpark |} .

Invalid instance — no period reference at all:

fin:incomeStatement_orphan
    a fin:IncomeStatement ;
    rdfs:label "Orphaned Income Statement" ;
    fin:totalRevenue "200000.00"^^xsd:decimal .
    # fin:reportingPeriod absent entirely — sh:minCount 1 violation.

Equivalent XBRL:

In XBRL, the reporting period is not a data node; it is a context element, structurally required by the instance document format. A fact with no context reference is a malformed document, caught by schema validation before a business rules processor ever sees it.

This contrast is worth pausing on. XBRL achieves structural integrity through document architecture — the format itself makes certain violations impossible to represent. SHACL achieves it through explicit, named, auditable constraints that can be selectively enforced, versioned, and reported against. Neither is strictly superior; they reflect different design philosophies about where to locate the rules.

Conditional Requirements: The Disclosure Trigger

Among the most powerful, and most underused, features of SHACL Core are the logical constraint components sh:or, sh:and, and sh:not, which allow you to express conditional business rules. In financial reporting, many requirements are triggered conditionally: if a related-party transaction exists, then a disclosure is required.

fin:RelatedPartyTransactionShape
    a sh:NodeShape ;
    sh:targetClass fin:Transaction ;
    sh:property [
        sh:path     fin:isRelatedParty ;
        sh:datatype xsd:boolean ;
        sh:maxCount 1 ;
        sh:name     "isRelatedParty" ;
    ] ;
    sh:or (
        [
            sh:property [
                sh:path     fin:isRelatedParty ;
                sh:hasValue false ;
            ]
        ]
        [
            sh:property [
                sh:path     fin:relatedPartyDisclosure ;
                sh:minCount 1 ;
                sh:name     "relatedPartyDisclosure" ;
                sh:message  "Related-party transactions require a disclosure reference." ;
            ]
        ]
    ) .

Read this as: a transaction is valid if either it is not a related-party transaction, or it has a disclosure reference. If it’s flagged as related-party with no disclosure, validation fails.
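The either/or logic is easy to misread, so here is a minimal plain-Python sketch of the same two branches. The check_transaction helper and its dictionary input are assumptions for illustration, not part of any SHACL library, and the separate datatype and cardinality checks on fin:isRelatedParty are left out.

```python
DISCLOSURE_MESSAGE = "Related-party transactions require a disclosure reference."

def check_transaction(txn):
    """Mirror the sh:or in fin:RelatedPartyTransactionShape.

    `txn` maps a property name to a list of values (an assumed layout).
    """
    flags = txn.get("isRelatedParty", [])
    disclosures = txn.get("relatedPartyDisclosure", [])

    # Branch 1 (sh:hasValue false): the value False must actually be present.
    not_related = any(flag is False for flag in flags)
    # Branch 2 (sh:minCount 1): at least one disclosure reference exists.
    has_disclosure = len(disclosures) >= 1

    if not_related or has_disclosure:
        return []
    return [DISCLOSURE_MESSAGE]

# Hypothetical transactions covering each combination.
cases = {
    "arms_length": {"isRelatedParty": [False]},
    "related_with_disclosure": {
        "isRelatedParty": [True],
        "relatedPartyDisclosure": ["RPT-2026-001"],
    },
    "related_without_disclosure": {"isRelatedParty": [True]},
    "flag_missing": {},
}

for name, txn in cases.items():
    print(name, check_transaction(txn) or "conforms")
```

One consequence is worth noticing. Because sh:hasValue requires the value false to be present, a transaction that omits fin:isRelatedParty altogether and carries no disclosure also fails the shape as written. Whether that is the intended policy is a business decision, but it is better made deliberately than discovered in a validation report.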

Valid and Invalid: Conditional Disclosure

Valid — arm’s-length transaction, no disclosure required:

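@prefix fin:   .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix prov: <http://www.w3.org/ns/prov#> .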

fin:txn_vendor_001
    a fin:Transaction ;
    rdfs:label "Vendor Payment — Office Depot" ;
    fin:amount "4200.00"^^xsd:decimal
        {| rdfs:label            "amount" ;
           prov:generatedAtTime  "2026-02-14T00:00:00"^^xsd:dateTime |} ;
    fin:isRelatedParty false
        {| rdfs:label    "isRelatedParty" ;
           fin:assertedBy fin:user_jsmith ;
           fin:assertedAt "2026-02-14T00:00:00"^^xsd:dateTime |} .

Valid — related-party transaction WITH disclosure:

fin:txn_related_001
    a fin:Transaction ;
    rdfs:label "Consulting Services — CEO Family Trust" ;
    fin:amount "75000.00"^^xsd:decimal
        {| rdfs:label            "amount" ;
           prov:generatedAtTime  "2026-03-01T00:00:00"^^xsd:dateTime |} ;
    fin:isRelatedParty true
        {| rdfs:label    "isRelatedParty" ;
           fin:assertedBy fin:auditor_acox ;
           fin:reviewedAt "2026-03-15T00:00:00"^^xsd:dateTime |} ;
    fin:relatedPartyDisclosure fin:disclosure_RPT_2026_001
        {| rdfs:label   "relatedPartyDisclosure" ;
           fin:filedBy   fin:counsel_mjones ;
           fin:filedAt   "2026-03-16T00:00:00"^^xsd:dateTime |} .

Note what the annotations accomplish here: the fin:isRelatedParty triple carries a fin:reviewedAt timestamp showing it was reviewed by the auditor after the transaction was entered. The fin:relatedPartyDisclosure triple carries a fin:filedAt timestamp. The temporal relationship between those two annotations is auditable evidence of process compliance — something that would otherwise live in a separate workpaper.

Invalid — related-party transaction WITHOUT disclosure:

fin:txn_related_002
    a fin:Transaction ;
    rdfs:label "IT Services — Board Member LLC" ;
    fin:amount "38500.00"^^xsd:decimal
        {| rdfs:label            "amount" ;
           prov:generatedAtTime  "2026-03-20T00:00:00"^^xsd:dateTime |} ;
    fin:isRelatedParty true
        {| rdfs:label   "isRelatedParty" ;
           fin:enteredBy fin:user_bwong |} .
    # fin:relatedPartyDisclosure is absent.
    # sh:or fails: isRelatedParty is not false, AND no disclosure exists.
    # Violation: "Related-party transactions require a disclosure reference."

Equivalent XBRL:

XBRL handles related-party disclosures through taxonomy elements and filing structure. The conditional relationship (if related party, then disclosure required) is not expressible in the base instance and taxonomy schema. It lives in the reviewer’s checklist or in a validation layer built on top of the XBRL processor, such as XBRL Formula assertions or custom rules.

  
  4200.00
  false

  
  75000.00
  true
  RPT-2026-001

  
  38500.00
  true

This is where SHACL meaningfully extends what XBRL can express. The conditional disclosure rule has real regulatory weight — in XBRL, it lives in human procedure. In SHACL, it lives in the data layer, is machine-executable, and produces a named, reportable violation with a clear message.

Severity: Not All Violations Are Equal

An auditor distinguishes between a material misstatement and an immaterial one. SHACL 1.2 preserves this distinction through sh:severity. Three levels are defined: sh:Violation (the default — the data fails the constraint), sh:Warning (the data is suspect but not definitively invalid), and sh:Info (informational — worth noting but not a failure).

fin:ThresholdWarningShape
    a sh:NodeShape ;
    sh:targetClass fin:ExpenseItem ;
    sh:property [
        sh:path         fin:amount ;
        sh:maxInclusive "10000"^^xsd:decimal ;
        sh:severity     sh:Warning ;
        sh:name         "amount" ;
        sh:message      "Expense items over $10,000 should be reviewed for documentation." ;
    ] .

Valid, Warning, and Violation: Severity in Practice

Passes cleanly — amount below threshold:

@prefix fin:   .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
fin:expense_under
    a fin:ExpenseItem ;
    rdfs:label "Courier Services March" ;
    fin:amount "340.00"^^xsd:decimal
        {| rdfs:label   "amount" ;
           fin:enteredBy fin:user_jsmith |} ;
    fin:accountCode "6400"^^xsd:string
        {| rdfs:label "accountCode" |} .
    # No violation, no warning. Passes fin:ThresholdWarningShape cleanly.

Triggers warning — amount over threshold, data technically valid:

fin:expense_over
    a fin:ExpenseItem ;
    rdfs:label "Legal Retainer Q1" ;
    fin:amount "45000.00"^^xsd:decimal
        {| rdfs:label            "amount" ;
           fin:enteredBy         fin:user_bwong ;
           fin:reviewRequired    true
               {| fin:reason "threshold exceeded" |} |} ;
    fin:accountCode "6500"^^xsd:string
        {| rdfs:label "accountCode" |} .
    # SHACL reports sh:Warning:
    # "Expense items over $10,000 should be reviewed for documentation."
    # Data is not invalid; a human review workflow is triggered.

Note the nested annotation: fin:reviewRequired true is itself annotated with fin:reason "threshold exceeded" — RDF 1.2 allows annotations on annotations, providing a full audit chain without leaving the triple layer.

XBRL equivalent — materiality thresholds:

XBRL has no native severity model. A value either satisfies the schema type and calculation constraints or it doesn’t. Materiality thresholds and review triggers are entirely external to the format — they live in audit procedures and engagement letters. SHACL encodes them in the data layer alongside structural constraints, meaning a single validation pass produces a unified report covering hard violations, soft warnings, and informational flags.

The severity model gives SHACL a vocabulary that pure schema languages lack: the difference between wrong, suspicious, and worth noting is meaningful in business contexts, and SHACL lets you encode all three in the same constraint document, evaluated in a single pass.
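In procedural terms, severity is simply a label attached to each result so that downstream workflow can treat results differently. Here is a minimal Python sketch of the threshold rule, assuming a hypothetical check_expense_threshold helper and a plain dictionary as the result record:

```python
from decimal import Decimal

# Mirrors sh:maxInclusive "10000" in fin:ThresholdWarningShape.
THRESHOLD = Decimal("10000")

def check_expense_threshold(name, amount):
    """Return a Warning-level result if the amount exceeds the threshold, else None."""
    if amount > THRESHOLD:
        return {
            "focus": name,
            "severity": "Warning",   # the data is suspect, not invalid
            "message": "Expense items over $10,000 should be reviewed for documentation.",
        }
    return None

# Hypothetical expense amounts keyed by item name.
expenses = {
    "courier_services": Decimal("340.00"),
    "legal_retainer": Decimal("45000.00"),
}

for name, amount in expenses.items():
    result = check_expense_threshold(name, amount)
    if result is None:
        print(name, "passes cleanly")
    else:
        print(name, result["severity"], "-", result["message"])
```

Because the threshold is inclusive, an expense of exactly $10,000 passes without a warning; only amounts strictly above it are routed to review.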

Reading the Report: What a SHACL Validator Actually Returns

All of the constraint definitions above are worth nothing unless a validator runs them against real data and produces something actionable. SHACL 1.2 specifies a standard validation report format — itself expressed as RDF — that captures every violation, warning, and informational result from a validation run. Understanding what that report contains, and how to read it, closes the loop between writing shapes and using them in practice.

The Tabular View

When a business analyst or auditor asks “what’s wrong with this data?”, the most useful first answer is a flat summary table. Here is the consolidated report from running all of the shapes defined in this article against the invalid instances introduced in each section:

| # | Focus Node | Shape | Property | Severity | Message |
|---|------------|-------|----------|----------|---------|
| 1 | fin:lineItem002 | fin:LineItemShape | fin:amount | Violation | Every line item must have exactly one decimal amount. |
| 2 | fin:lineItem003 | fin:LineItemShape | fin:amount | Violation | Every line item must have exactly one decimal amount. |
| 3 | fin:revenue002 | fin:RevenueShape | fin:amount | Violation | Revenue amounts must be non-negative. |
| 4 | fin:period_broken | fin:ReportingPeriodShape | fin:endDate | Violation | A reporting period must have exactly one end date. |
| 5 | fin:incomeStatement_broken | fin:IncomeStatementShape | fin:reportingPeriod | Violation | An income statement must reference a valid reporting period. |
| 6 | fin:incomeStatement_orphan | fin:IncomeStatementShape | fin:reportingPeriod | Violation | An income statement must reference a valid reporting period. |
| 7 | fin:txn_related_002 | fin:RelatedPartyTransactionShape | fin:relatedPartyDisclosure | Violation | Related-party transactions require a disclosure reference. |
| 8 | fin:expense_over | fin:ThresholdWarningShape | fin:amount | Warning | Expense items over $10,000 should be reviewed for documentation. |

Rows 1–7 are hard violations — data that fails a constraint and must be corrected before the dataset can be considered valid. Row 8 is a warning — data that is structurally sound but triggers a review workflow. An auditor reading this table knows immediately: seven items require remediation, one requires a documented review. The table is machine-generated from the same shapes that encode the business rules. There is no separate report template to maintain.

Note row 5: fin:incomeStatement_broken fails not because it is itself malformed, but because the period it references (fin:period_broken) is missing an end date. This is the cascade effect of sh:node — the structural integrity constraint propagates up from the referenced node to the referencing one. The report surfaces both failures, giving the data owner a complete picture of what needs to be fixed and in what order (fix the period first; the income statement violation will clear automatically on re-validation).
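Flattening a report into that table is mechanical. The sketch below uses three hand-written result records (stand-ins for rows 1, 5, and 8, not real validator output) and hypothetical helper names to show the idea: number the results, project the same five fields from each, and count by severity.

```python
from collections import Counter

# The five fields projected into the table, in column order.
COLUMNS = ("focus", "shape", "path", "severity", "message")

# Hand-written stand-ins for three validation results.
results = [
    {"focus": "fin:lineItem002", "shape": "fin:LineItemShape",
     "path": "fin:amount", "severity": "Violation",
     "message": "Every line item must have exactly one decimal amount."},
    {"focus": "fin:incomeStatement_broken", "shape": "fin:IncomeStatementShape",
     "path": "fin:reportingPeriod", "severity": "Violation",
     "message": "An income statement must reference a valid reporting period."},
    {"focus": "fin:expense_over", "shape": "fin:ThresholdWarningShape",
     "path": "fin:amount", "severity": "Warning",
     "message": "Expense items over $10,000 should be reviewed for documentation."},
]

def to_rows(results):
    """Number each result and flatten it into a tuple matching the table columns."""
    return [(i, *(r[c] for c in COLUMNS)) for i, r in enumerate(results, start=1)]

def severity_counts(results):
    """Count results per severity level, for the one-line summary an auditor wants."""
    return Counter(r["severity"] for r in results)

for row in to_rows(results):
    print(" | ".join(str(cell) for cell in row))

print(dict(severity_counts(results)))
```

In a real pipeline the records would come from querying the validator's RDF report rather than from a literal list, but the projection step is the same.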

The Formal SHACL Report

Behind that table is a structured RDF document. Every SHACL validator produces a sh:ValidationReport instance — a machine-readable record of the full validation run that can itself be stored, queried, compared across periods, and used as input to downstream workflows. Here is the report corresponding to the table above, in Turtle:

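@prefix sh:   <http://www.w3.org/ns/shacl#> .
@prefix fin:   .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix prov: <http://www.w3.org/ns/prov#> .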

fin:ValidationReport_2026_Q1
    a sh:ValidationReport ;
    sh:conforms false ;
    prov:generatedAtTime "2026-04-05T09:00:00"^^xsd:dateTime ;
    rdfs:label "Q1 2026 Financial Data Validation — Full Run" ;

    ## Result 1: lineItem002 — missing amount
    sh:result [
        a sh:ValidationResult ;
        sh:focusNode          fin:lineItem002 ;
        sh:resultPath         fin:amount ;
        sh:sourceShape        fin:LineItemShape ;
        sh:sourceConstraintComponent sh:MinCountConstraintComponent ;
        sh:resultSeverity     sh:Violation ;
        sh:resultMessage      "Every line item must have exactly one decimal amount." ;
    ] ;

    ## Result 2: lineItem003 — wrong datatype on amount
    sh:result [
        a sh:ValidationResult ;
        sh:focusNode          fin:lineItem003 ;
        sh:resultPath         fin:amount ;
        sh:value              "four thousand dollars" ;
        sh:sourceShape        fin:LineItemShape ;
        sh:sourceConstraintComponent sh:DatatypeConstraintComponent ;
        sh:resultSeverity     sh:Violation ;
        sh:resultMessage      "Every line item must have exactly one decimal amount." ;
    ] ;

    ## Result 3: revenue002 — negative revenue
    sh:result [
        a sh:ValidationResult ;
        sh:focusNode          fin:revenue002 ;
        sh:resultPath         fin:amount ;
        sh:value              "-12000.00"^^xsd:decimal ;
        sh:sourceShape        fin:RevenueShape ;
        sh:sourceConstraintComponent sh:MinInclusiveConstraintComponent ;
        sh:resultSeverity     sh:Violation ;
        sh:resultMessage      "Revenue amounts must be non-negative." ;
    ] ;

    ## Result 4: period_broken — missing endDate
    sh:result [
        a sh:ValidationResult ;
        sh:focusNode          fin:period_broken ;
        sh:resultPath         fin:endDate ;
        sh:sourceShape        fin:ReportingPeriodShape ;
        sh:sourceConstraintComponent sh:MinCountConstraintComponent ;
        sh:resultSeverity     sh:Violation ;
        sh:resultMessage      "A reporting period must have exactly one end date." ;
    ] ;

    ## Result 5: incomeStatement_broken — period fails sh:node cascade
    sh:result [
        a sh:ValidationResult ;
        sh:focusNode          fin:incomeStatement_broken ;
        sh:resultPath         fin:reportingPeriod ;
        sh:value              fin:period_broken ;
        sh:sourceShape        fin:IncomeStatementShape ;
        sh:sourceConstraintComponent sh:NodeConstraintComponent ;
        sh:resultSeverity     sh:Violation ;
        sh:resultMessage      "An income statement must reference a valid reporting period." ;
    ] ;

    ## Result 6: incomeStatement_orphan — no period reference at all
    sh:result [
        a sh:ValidationResult ;
        sh:focusNode          fin:incomeStatement_orphan ;
        sh:resultPath         fin:reportingPeriod ;
        sh:sourceShape        fin:IncomeStatementShape ;
        sh:sourceConstraintComponent sh:MinCountConstraintComponent ;
        sh:resultSeverity     sh:Violation ;
        sh:resultMessage      "An income statement must reference a valid reporting period." ;
    ] ;

    ## Result 7: txn_related_002 — related-party with no disclosure
    sh:result [
        a sh:ValidationResult ;
        sh:focusNode          fin:txn_related_002 ;
        sh:resultPath         fin:relatedPartyDisclosure ;
        sh:sourceShape        fin:RelatedPartyTransactionShape ;
        sh:sourceConstraintComponent sh:OrConstraintComponent ;
        sh:resultSeverity     sh:Violation ;
        sh:resultMessage      "Related-party transactions require a disclosure reference." ;
    ] ;

    ## Result 8: expense_over — over threshold (Warning, not Violation)
    sh:result [
        a sh:ValidationResult ;
        sh:focusNode          fin:expense_over ;
        sh:resultPath         fin:amount ;
        sh:value              "45000.00"^^xsd:decimal ;
        sh:sourceShape        fin:ThresholdWarningShape ;
        sh:sourceConstraintComponent sh:MaxInclusiveConstraintComponent ;
        sh:resultSeverity     sh:Warning ;
        sh:resultMessage      "Expense items over $10,000 should be reviewed for documentation." ;
    ] .

Several structural features of this report are worth noting.

sh:conforms false is the top-level verdict. Under the severity-aware reading used here, a report where every result is a sh:Warning or sh:Info, with no sh:Violation, would still count as conforming, because warnings do not constitute failure. (The SHACL 1.0 Recommendation is stricter: any validation result sets sh:conforms to false, so check which policy your processor applies.) This maps directly onto audit practice: a dataset with review flags is not an invalid dataset; it’s a dataset requiring follow-up. A dataset with violations is. The distinction is machine-readable and unambiguous.
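The two conformance policies in play can be compared side by side. The following is a minimal Python sketch, not tied to any SHACL engine; the function names are hypothetical, and severities are abbreviated to plain strings for illustration.

```python
# Severity IRIs abbreviated as plain strings for illustration.
VIOLATION, WARNING, INFO = "sh:Violation", "sh:Warning", "sh:Info"


def conforms_strict(severities):
    """Strict policy: any validation result at all means non-conformance."""
    return len(severities) == 0


def conforms_violations_only(severities):
    """Severity-aware policy: only sh:Violation results count as failure."""
    return VIOLATION not in severities


# The eight results in the report above: seven violations and one warning.
q1_report = [VIOLATION] * 7 + [WARNING]

# A hypothetical report that carries review flags only.
review_flags_only = [WARNING, INFO]
```

Under the severity-aware policy the Q1 report fails and the flags-only report passes; under the strict policy both fail. Making the policy an explicit function keeps the choice visible in the pipeline rather than buried in engine defaults.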

sh:sourceConstraintComponent identifies which constraint type triggered each result — sh:MinCountConstraintComponent, sh:DatatypeConstraintComponent, sh:MinInclusiveConstraintComponent, and so on. This allows downstream tooling to categorize violations by type, prioritize remediation queues, or route results to different review workflows. Type errors go to data entry. Cardinality violations go to completeness review. Conditional failures go to compliance.
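The routing idea can be sketched in a few lines of Python. The queue names and the choice to treat sh:OrConstraintComponent as the "conditional" case are assumptions made for illustration; the focus nodes come from the report above.

```python
from collections import defaultdict

# Hypothetical queue names; the component-to-queue mapping follows the text.
ROUTES = {
    "sh:DatatypeConstraintComponent": "data-entry",
    "sh:MinCountConstraintComponent": "completeness-review",
    "sh:OrConstraintComponent": "compliance",
}


def route(results, default="manual-triage"):
    """Group (focus_node, constraint_component) pairs by review queue."""
    queues = defaultdict(list)
    for focus_node, component in results:
        queues[ROUTES.get(component, default)].append(focus_node)
    return dict(queues)


sample = [
    ("fin:lineItem002", "sh:MinCountConstraintComponent"),
    ("fin:lineItem003", "sh:DatatypeConstraintComponent"),
    ("fin:txn_related_002", "sh:OrConstraintComponent"),
    ("fin:revenue002", "sh:MinInclusiveConstraintComponent"),
]
queues = route(sample)
```

Anything without an explicit route falls through to a default queue, so a new constraint type never disappears silently.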

sh:value appears only when there is an actual value to report — the wrong datatype string in result 2, the negative decimal in result 3, the offending period node in result 5, the over-threshold amount in result 8. Where the violation is a missing value (results 1, 4, 6) there is nothing to report in sh:value, because the problem is absence rather than incorrectness.

The report is itself an RDF graph. It can be stored in the same triplestore as the data it validates. It can be annotated with remediation notes using the same RDF 1.2 {| |} syntax applied to the data throughout this article. It can be queried with SPARQL — “show me all Violations in the Q1 report that have not yet been remediated” is a straightforward query against a report graph augmented with a fin:remediatedAt annotation. The audit trail and the validation system use the same substrate.

What This Unlocks

The OWL-first approach to knowledge modelling produces rich, expressive class hierarchies that answer the question: what kinds of things exist in this domain, and how do they relate? That’s valuable work. But it tends to produce systems that are hard to validate, hard to explain to non-technical stakeholders, and hard to connect to operational business processes.

SHACL-first thinking — approaching a domain through its constraints rather than its taxonomy — produces something different: a set of explicit, executable, human-readable business rules that can be applied to data at any point in a workflow. The rules can be reviewed by a business analyst without reading a line of RDF. They can be explained to an auditor. They can be version-controlled, compared across reporting periods, and used to generate meaningful validation reports rather than opaque schema errors.

RDF 1.2 annotations add a further dimension: the data that satisfies (or violates) those constraints can carry its own provenance inline — who asserted it, when, under what authority — without requiring a separate metadata store or a parallel audit trail. The annotated triple is the audited fact.

More practically: you don’t need to finish the ontology before you can validate the data. The shapes can be written incrementally, against whatever graph structure you already have, targeting the rules that matter most for your immediate business need. This is the constraint-first mindset — and for the class of problems that business analysts actually face, it’s often the faster, more tractable, and more maintainable approach.

A Note on Scope: SHACL 1.2 Core vs. What Comes Next

Everything in this article uses SHACL 1.2 core — property constraints, node constraints, cardinality, value ranges, logical operators, and severity. This is the layer that maps cleanly onto business rules a CPA or business analyst would recognize.

SHACL also has more powerful layers. SHACL-SPARQL adds SPARQL-based constraints and custom constraint components, and SHACL-AF (Advanced Features) adds rules and functions; together they let you express derived values, cross-graph inferences, and complex validation logic. That territory is worth a dedicated article, but it’s a different cognitive register, closer to a rules engine than an audit checklist. The boundary matters, and we’ll explore it separately.

Similarly, the relationship between SHACL 1.2 and XBRL goes deeper than this article has space to develop. There is a reasonable argument that XBRL taxonomies could be expressed as SHACL shape libraries, making financial reporting constraints reusable against any RDF-structured financial data rather than requiring XBRL-specific tooling. That too is a conversation for another piece.

For now: if you can write it as a business rule in plain English, you can probably encode it in SHACL 1.2 core. Start there. The chart of accounts can wait.

Kurt Cagle is an author, ontologist, and thought leader specializing in semantic web standards, knowledge architecture, and AI systems. He serves as an IEEE Standards Editor at the IEEE Spatial Web Foundation and is a founding contributor to the W3C Context Graph Community Group. He writes The Cagle Report on LinkedIn, The Ontologist, and Inference Engineer.

SHACL’s Hidden Superpower: Parameterised Constraints and the Art of Writing Validation Once

Kurt Cagle — Sat, 04 Apr 2026 12:29:50 GMT

by Kurt Cagle and Chloe Shannon

There is a particular kind of frustration that is endemic to working with validation schemas. You write a shape. It works. Then you need a slightly different version of it — same logic, different threshold — and so you copy it, change one value, and now you have two shapes doing the same thing. Then three. Then twelve. By the time anyone notices, you have a maintenance problem dressed up as a data quality solution.

SHACL has an answer to this, and it is surprisingly underused: the parameterised constraint component. It is the mechanism by which SHACL crosses the line from a validation checklist into something that more closely resembles a type system. Understanding when and how to use it changes not just how you write shapes, but how you think about data quality as an architectural concern.

This article works through the mechanism from first principles, builds a realistic library of reusable components, and then turns the lens on itself — examining where parameterisation earns its keep and where it is overkill.

The Problem Parameterisation Solves

Plain SHACL shapes are declarative but not reusable. A shape that enforces a US ZIP code pattern and a shape that enforces a UK postcode pattern are written separately, maintained separately, and — if the SPARQL validator logic contains a subtle error — fail separately and invisibly. You discover the bug in the US shape, fix it, and the UK shape continues to carry the old logic because nobody thought to check.

The underlying structure of both validators is identical: does the value of property X on node Y match regex Z? Only X and Z differ. Writing two validators is writing the same thought twice.

This is the same argument that justifies functions over copy-pasted code, and SHACL makes it with sh:ConstraintComponent.

The Mechanism

A parameterised constraint has three parts:

A sh:ConstraintComponent — the named, reusable template. It declares what parameters it accepts, their types, and which validator implements the logic.

A validator — either sh:SPARQLAskValidator (returns a boolean: pass or fail) or sh:SPARQLSelectValidator (returns rows of violation details). The validator receives the declared parameters as bound SPARQL variables.

Usage in a shape — where you supply the actual parameter values. The component is invoked by the presence of its parameter property on the shape; no explicit call syntax is required.

Here is the simplest useful example — a constraint that validates a named property against a named regex pattern:

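@prefix sh:    <http://www.w3.org/ns/shacl#> .
@prefix ex:    <http://example.org/> .   # placeholder namespace
@prefix xsd:   <http://www.w3.org/2001/XMLSchema#> .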

ex:LexicalPatternConstraint a sh:ConstraintComponent ;
  sh:parameter [
    sh:path     ex:constrainedProperty ;
    sh:nodeKind sh:IRI ;
    sh:name     "constrainedProperty" ;
    sh:description "The property whose value must match the pattern." ;
  ] ;
  sh:parameter [
    sh:path     ex:matchPattern ;
    sh:datatype xsd:string ;
    sh:name     "matchPattern" ;
    sh:description "The regex the value must satisfy." ;
  ] ;
  sh:parameter [
    sh:path     ex:patternOptional ;
    sh:datatype xsd:boolean ;
    sh:name     "patternOptional" ;
    sh:optional true ;
  ] ;
  sh:labelTemplate
    "Value of {$constrainedProperty} must match {$matchPattern}" ;
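  # SELECT-based validators attach via sh:nodeValidator (for node shapes);
  # sh:validator is reserved for ASK-based validators.
  sh:nodeValidator ex:LexicalPatternSelectValidator .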

ex:LexicalPatternSelectValidator a sh:SPARQLSelectValidator ;
  sh:select """
    SELECT $this ?value ?message WHERE {
      OPTIONAL { $this $constrainedProperty ?value . }
      BIND(
        IF( !BOUND(?value),
          IF( BOUND($patternOptional) && $patternOptional, "",
            CONCAT("Required property ", STR($constrainedProperty),
                   " is absent on <", STR($this), ">.") ),
          IF( !REGEX(STR(?value), $matchPattern),
            CONCAT("Value '", STR(?value), "' for ",
                   STR($constrainedProperty),
                   " does not match required pattern: ",
                   $matchPattern),
            ""
          )
        )
        AS ?message
      )
      FILTER( ?message != "" )
    }
  """ .

The validator receives $this (the node being validated), $constrainedProperty (the property IRI supplied by the shape), and $matchPattern (the regex, also supplied by the shape). It constructs a diagnostic message from the actual data values. The shape that uses it supplies property and pattern, nothing else:

ex:USZipCodeShape a sh:NodeShape ;
  sh:targetClass ex:USAddress ;
  ex:constrainedProperty ex:zipCode ;
  ex:matchPattern "^[0-9]{5}(-[0-9]{4})?$" .

ex:UKPostcodeShape a sh:NodeShape ;
  sh:targetClass ex:UKAddress ;
  ex:constrainedProperty ex:postcode ;
  ex:matchPattern "^[A-Z]{1,2}[0-9][0-9A-Z]?\\s?[0-9][A-Z]{2}$" .

One validator. Two shapes. The logic is written once and maintained once.
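The same "one validator, many shapes" idea can be shown outside SHACL. This is a minimal Python sketch under stated assumptions: nodes are plain dictionaries, property names are written as compact strings, and re.search stands in for SPARQL REGEX (both match substrings unless the pattern is anchored). It mirrors only the required-property branch of the validator above.

```python
import re


def lexical_pattern(node, constrained_property, match_pattern):
    """Return a violation message, or None if the value matches."""
    value = node.get(constrained_property)
    if value is None:
        return f"Required property {constrained_property} is absent."
    if not re.search(match_pattern, value):
        return (f"Value '{value}' for {constrained_property} "
                f"does not match required pattern: {match_pattern}")
    return None


# Two "shapes": each supplies only a property and a pattern.
US_ZIP = ("ex:zipCode", r"^[0-9]{5}(-[0-9]{4})?$")
UK_POSTCODE = ("ex:postcode", r"^[A-Z]{1,2}[0-9][0-9A-Z]?\s?[0-9][A-Z]{2}$")

ok_us = lexical_pattern({"ex:zipCode": "62701"}, *US_ZIP)
bad_us = lexical_pattern({"ex:zipCode": "9720X"}, *US_ZIP)
ok_uk = lexical_pattern({"ex:postcode": "NW1 6XE"}, *UK_POSTCODE)
bad_uk = lexical_pattern({"ex:postcode": "sw1a2aa"}, *UK_POSTCODE)
```

A fix to lexical_pattern reaches both shapes at once, which is the maintenance argument in miniature.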

Building a Realistic Component Library

The address example is instructive precisely because international postal addresses don’t just vary in which pattern is valid — they vary in which fields must exist, which are forbidden, and whether those fields are required or optional. That requires three separate components working in concert.

The Required Property Component

ex:RequiredPropertyConstraint a sh:ConstraintComponent ;
  sh:parameter [
    sh:path     ex:requiredProperty ;
    sh:nodeKind sh:IRI ;
    sh:name     "requiredProperty" ;
  ] ;
  sh:parameter [
    sh:path     ex:missingPropertyMessage ;
    sh:datatype xsd:string ;
    sh:name     "missingPropertyMessage" ;
    sh:description """
      Optional message template. Use {property} as a placeholder
      for the property IRI. If omitted, a default message is generated.
    """ ;
    sh:optional true ;
  ] ;
  # SELECT-based validator, so it attaches via sh:nodeValidator.
  sh:nodeValidator ex:RequiredPropertySelectValidator .

ex:RequiredPropertySelectValidator a sh:SPARQLSelectValidator ;
  sh:select """
    SELECT $this ?value ?message WHERE {
      OPTIONAL { $this $requiredProperty ?value . }
      BIND(
        IF( !BOUND(?value) || STR(?value) = "",
          IF( BOUND($missingPropertyMessage),
            # Character classes avoid backslashes, which would need
            # escaping in both Turtle and SPARQL.
            REPLACE( $missingPropertyMessage,
                     "[{]property[}]", STR($requiredProperty) ),
            CONCAT("Required field <", STR($requiredProperty),
                   "> is missing on <", STR($this), ">.")
          ),
          ""
        )
        AS ?message
      )
      FILTER( ?message != "" )
    }
  """ .

Note the $missingPropertyMessage parameter: the message template is itself a parameter, which means each shape can supply domain-appropriate wording — including in languages other than English — without touching the validator logic. This is the meta-move that turns a validation tool into a data quality communication layer.
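The template-as-parameter move is easy to see in isolation. The sketch below is plain Python rather than SHACL; the function name and the node identifiers are hypothetical, while the two templates are the ones the UK and German shapes supply later in this article.

```python
def required_property_message(node, node_id, prop, message_template=None):
    """Return a message if prop is missing or empty on node, else None."""
    value = node.get(prop)
    if value is None or str(value) == "":
        if message_template is not None:
            # str.replace is literal, so no regex escaping is needed here.
            return message_template.replace("{property}", prop)
        return f"Required field <{prop}> is missing on <{node_id}>."
    return None


UK_TEMPLATE = ("UK addresses must include a valid Royal Mail postcode. "
               "Field {property} is absent. Example: 'SW1A 2AA'.")
DE_TEMPLATE = ("Deutsche Adressen erfordern eine fünfstellige Postleitzahl. "
               "Feld {property} fehlt.")

no_postcode = {"ex:streetLine1": "10 Downing Street", "ex:city": "London"}
uk_msg = required_property_message(
    no_postcode, "ex:Address_UK_BAD_001", "ex:postcode", UK_TEMPLATE)
default_msg = required_property_message(
    no_postcode, "ex:Address_UK_BAD_001", "ex:postcode")
de_msg = required_property_message(
    {}, "ex:Address_DE_example", "ex:postalCode", DE_TEMPLATE)
```

The check itself never changes; only the wording handed to it does, which is why localisation costs nothing in validator logic.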

The Forbidden Property Component

ex:ForbiddenPropertyConstraint a sh:ConstraintComponent ;
  sh:parameter [
    sh:path     ex:forbiddenProperty ;
    sh:nodeKind sh:IRI ;
    sh:name     "forbiddenProperty" ;
  ] ;
  sh:parameter [
    sh:path     ex:addressTypeName ;
    sh:datatype xsd:string ;
    sh:name     "addressTypeName" ;
    sh:optional true ;
  ] ;
  # SELECT-based validator, so it attaches via sh:nodeValidator.
  sh:nodeValidator ex:ForbiddenPropertySelectValidator .

ex:ForbiddenPropertySelectValidator a sh:SPARQLSelectValidator ;
  sh:select """
    SELECT $this ?value ?message WHERE {
      OPTIONAL { $this $forbiddenProperty ?value . }
      BIND(
        IF( BOUND(?value),
          CONCAT(
            "Field <", STR($forbiddenProperty), ">",
            " is not permitted on a ",
            IF( BOUND($addressTypeName), $addressTypeName, "this" ),
            " address. Found value: '", STR(?value), "'.",
            " Node: <", STR($this), ">."
          ),
          ""
        )
        AS ?message
      )
      FILTER( ?message != "" )
    }
  """ .

Three Shapes, Three Structures, Same Components

ex:USAddressShape a sh:NodeShape ;
  sh:targetClass ex:USAddress ;

  ex:requiredProperty        ex:streetLine1 ;
  ex:requiredProperty        ex:city ;
  ex:requiredProperty        ex:state ;
  ex:requiredProperty        ex:zipCode ;
  ex:missingPropertyMessage  "US addresses require a two-letter state code. Field {property} is missing." ;

  ex:constrainedProperty     ex:state ;
  ex:matchPattern            "^[A-Z]{2}$" ;

  ex:constrainedProperty     ex:zipCode ;
  ex:matchPattern            "^[0-9]{5}(-[0-9]{4})?$" ;

  ex:forbiddenProperty       ex:postcode ;
  ex:forbiddenProperty       ex:postalCode ;
  ex:addressTypeName         "US" ;

  sh:property [
    sh:path ex:countryCode ;
    sh:hasValue "US" ;
  ] .


ex:UKAddressShape a sh:NodeShape ;
  sh:targetClass ex:UKAddress ;

  ex:requiredProperty        ex:streetLine1 ;
  ex:requiredProperty        ex:city ;
  ex:requiredProperty        ex:postcode ;
  ex:missingPropertyMessage
    "UK addresses must include a valid Royal Mail postcode. Field {property} is absent. Example: 'SW1A 2AA'." ;

  ex:constrainedProperty     ex:postcode ;
  ex:matchPattern            "^[A-Z]{1,2}[0-9][0-9A-Z]?\\s?[0-9][A-Z]{2}$" ;

  ex:forbiddenProperty       ex:state ;
  ex:forbiddenProperty       ex:zipCode ;
  ex:forbiddenProperty       ex:postalCode ;
  ex:addressTypeName         "UK" ;

  sh:property [
    sh:path ex:countryCode ;
    sh:hasValue "GB" ;
  ] .


ex:DEAddressShape a sh:NodeShape ;
  sh:targetClass ex:DEAddress ;

  ex:requiredProperty        ex:streetLine1 ;
  ex:requiredProperty        ex:postalCode ;
  ex:requiredProperty        ex:city ;
  ex:missingPropertyMessage
    "Deutsche Adressen erfordern eine fünfstellige Postleitzahl. Feld {property} fehlt." ;

  ex:constrainedProperty     ex:postalCode ;
  ex:matchPattern            "^[0-9]{5}$" ;

  ex:constrainedProperty     ex:bundesland ;
  ex:matchPattern            "^.{3,50}$" ;
  ex:patternOptional         true ;

  ex:forbiddenProperty       ex:state ;
  ex:forbiddenProperty       ex:zipCode ;
  ex:forbiddenProperty       ex:postcode ;
  ex:addressTypeName         "DE" ;

  sh:property [
    sh:path ex:countryCode ;
    sh:hasValue "DE" ;
  ] .

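The structural variation across the three shapes is substantial. The US shape requires ex:state and forbids ex:postcode; the UK shape requires ex:postcode and forbids ex:state; the German shape requires ex:postalCode (a different property from both) with a different pattern, has an optional ex:bundesland with a minimum length, and forbids the US and UK properties. Three entirely different structural profiles, and the whole apparatus runs on three components written once. One caveat for production use: RDF does not record which ex:matchPattern belongs to which ex:constrainedProperty when both are repeated on the same shape, so an engine cannot be relied on to pair them as written. Giving each property and pattern pair its own nested shape (for example under sh:node) keeps the pairing explicit.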

Sample Data — Valid Instances

With the shapes declared, here is conformant data for each address type. Each instance satisfies every active constraint in its shape.

@prefix ex:    <http://example.org/> .   # placeholder namespace
@prefix xsd:   <http://www.w3.org/2001/XMLSchema#> .

# ── US — valid ────────────────────────────────────────────────────

# Standard five-digit ZIP
ex:Address_US_001 a ex:USAddress ;
  ex:streetLine1 "742 Evergreen Terrace" ;
  ex:city        "Springfield" ;
  ex:state       "IL" ;
  ex:zipCode     "62701" ;
  ex:countryCode "US" .

# ZIP+4 extended format — also valid against the pattern
ex:Address_US_002 a ex:USAddress ;
  ex:streetLine1 "1600 Pennsylvania Avenue NW" ;
  ex:city        "Washington" ;
  ex:state       "DC" ;
  ex:zipCode     "20500-0003" ;
  ex:countryCode "US" .


# ── UK — valid ────────────────────────────────────────────────────

# Minimal required fields only
ex:Address_UK_001 a ex:UKAddress ;
  ex:streetLine1 "221B Baker Street" ;
  ex:city        "London" ;
  ex:postcode    "NW1 6XE" ;
  ex:countryCode "GB" .

# With optional county — field is permitted, no pattern constraint applies
ex:Address_UK_002 a ex:UKAddress ;
  ex:streetLine1 "4 Privet Drive" ;
  ex:city        "Little Whinging" ;
  ex:county      "Surrey" ;
  ex:postcode    "GU25 4PJ" ;
  ex:countryCode "GB" .


# ── German — valid ────────────────────────────────────────────────

# House number on the street line — correct German convention
ex:Address_DE_001 a ex:DEAddress ;
  ex:streetLine1 "Unter den Linden 77" ;
  ex:postalCode  "10117" ;
  ex:city        "Berlin" ;
  ex:countryCode "DE" .

# With optional Bundesland — value is 6 chars, satisfies ^.{3,50}$
ex:Address_DE_002 a ex:DEAddress ;
  ex:streetLine1 "Marienplatz 1" ;
  ex:postalCode  "80331" ;
  ex:city        "München" ;
  ex:bundesland  "Bayern" ;
  ex:countryCode "DE" .

Every one of these passes validation cleanly. The US addresses have correctly-cased two-letter state codes and numeric ZIP codes; neither carries ex:postcode or ex:postalCode. The UK addresses carry ex:postcode in Royal Mail format and no US or German fields. The German addresses carry ex:postalCode (a distinct property from ex:zipCode) with a five-digit value, and the second carries ex:bundesland with a value long enough to satisfy the minimum-length pattern.

Sample Data — Invalid Instances and What the Validator Says

Now for the instructive part. Three failing instances per shape, which between them cover the three failure modes: pattern mismatch, required field missing, and forbidden field present.

# ── US failures ───────────────────────────────────────────────────

# Pattern failure: state is lowercase — violates ^[A-Z]{2}$
ex:Address_US_BAD_001 a ex:USAddress ;
  ex:streetLine1 "100 Main Street" ;
  ex:city        "Portland" ;
  ex:state       "or" ;           # should be "OR"
  ex:zipCode     "97201" ;
  ex:countryCode "US" .

# Pattern failure: ZIP contains a letter — violates ^[0-9]{5}(-[0-9]{4})?$
ex:Address_US_BAD_002 a ex:USAddress ;
  ex:streetLine1 "100 Main Street" ;
  ex:city        "Portland" ;
  ex:state       "OR" ;
  ex:zipCode     "9720X" ;        # not numeric
  ex:countryCode "US" .

# Forbidden field: carries ex:postcode — UK field, forbidden on USAddress
ex:Address_US_BAD_003 a ex:USAddress ;
  ex:streetLine1 "100 Main Street" ;
  ex:city        "Portland" ;
  ex:state       "OR" ;
  ex:zipCode     "97201" ;
  ex:postcode    "NW1 6XE" ;      # forbidden
  ex:countryCode "US" .


# ── UK failures ───────────────────────────────────────────────────

# Missing required field: no postcode at all
ex:Address_UK_BAD_001 a ex:UKAddress ;
  ex:streetLine1 "10 Downing Street" ;
  ex:city        "London" ;
  ex:countryCode "GB" .

# Pattern failure: postcode in lowercase without space
ex:Address_UK_BAD_002 a ex:UKAddress ;
  ex:streetLine1 "10 Downing Street" ;
  ex:city        "London" ;
  ex:postcode    "sw1a2aa" ;      # should be "SW1A 2AA"
  ex:countryCode "GB" .

# Forbidden field: carries ex:state — US field, forbidden on UKAddress
ex:Address_UK_BAD_003 a ex:UKAddress ;
  ex:streetLine1 "10 Downing Street" ;
  ex:city        "London" ;
  ex:postcode    "SW1A 2AA" ;
  ex:state       "LN" ;           # forbidden
  ex:countryCode "GB" .


# ── German failures ───────────────────────────────────────────────

# Pattern failure: postalCode is 4 digits — violates ^[0-9]{5}$
ex:Address_DE_BAD_001 a ex:DEAddress ;
  ex:streetLine1 "Kurfürstendamm 100" ;
  ex:postalCode  "1011" ;         # one digit short
  ex:city        "Berlin" ;
  ex:countryCode "DE" .

# Forbidden field: carries ex:zipCode — US property, forbidden on DEAddress
ex:Address_DE_BAD_002 a ex:DEAddress ;
  ex:streetLine1 "Kurfürstendamm 100" ;
  ex:postalCode  "10711" ;
  ex:zipCode     "10711" ;        # forbidden
  ex:city        "Berlin" ;
  ex:countryCode "DE" .

# Pattern failure on optional field: bundesland is 2 chars, below ^.{3,50}$ minimum
ex:Address_DE_BAD_003 a ex:DEAddress ;
  ex:streetLine1 "Kurfürstendamm 100" ;
  ex:postalCode  "10711" ;
  ex:city        "Berlin" ;
  ex:bundesland  "BE" ;           # 2 chars — violates minimum length
  ex:countryCode "DE" .

Validation Report Output

The SELECT validators construct their messages from actual data values, so each violation is immediately actionable — the report tells you the node, the property, the offending value, and why it failed. This is what the validation engine returns:

Two things are worth noting in this table. First, Address_UK_BAD_001’s message is in a completely different register from the others: it cites Royal Mail convention and offers an example value, because that message template was supplied directly in ex:UKAddressShape. The German shape supplies its missing-field message in German. The validator logic is identical for all three; the communication layer is different. This is the $missingPropertyMessage parameter doing its job.

Second, Address_DE_BAD_003 demonstrates the optional field pattern working correctly. ex:bundesland is declared with ex:patternOptional true — its absence does not trigger a violation. But its presence with an invalid value does. The validator distinguishes between “field is absent and that’s fine” and “field is present and wrong”. Both cases are handled by a single OPTIONAL { ... } clause in the SELECT query.
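The three-way outcome for an optional field can be traced with a small Python sketch. It is an emulation of the message logic only, under the assumption that nodes are dictionaries and that dict.get plays the role of the OPTIONAL clause; the function name is hypothetical.

```python
import re


def check_pattern(node, prop, pattern, optional=False):
    """Absent-and-fine, present-and-valid, or a violation message."""
    value = node.get(prop)  # plays the role of OPTIONAL { ... }
    if value is None:
        # Absence is only a problem when the field is not optional.
        return None if optional else f"Required property {prop} is absent."
    if not re.search(pattern, value):
        return (f"Value '{value}' for {prop} "
                f"does not match required pattern: {pattern}")
    return None


BUNDESLAND = ("ex:bundesland", r"^.{3,50}$")

absent = check_pattern({"ex:city": "Berlin"}, *BUNDESLAND, optional=True)
valid = check_pattern({"ex:bundesland": "Bayern"}, *BUNDESLAND, optional=True)
invalid = check_pattern({"ex:bundesland": "BE"}, *BUNDESLAND, optional=True)
absent_required = check_pattern({"ex:city": "Berlin"}, *BUNDESLAND)
```

Only the third and fourth calls produce a message: a present-but-short value always fails, while absence fails only when the optional flag is off.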

ASK vs SELECT Validators

The choice between sh:SPARQLAskValidator and sh:SPARQLSelectValidator follows a single principle: how much do you need to say about the failure?

sh:SPARQLAskValidator returns a boolean. It is appropriate when the violation type is self-explanatory and the static sh:message string (which supports {$param} interpolation) provides enough context. It is fast, readable, and easy to test.

ex:SimpleRangeValidator a sh:SPARQLAskValidator ;
  sh:message
    "Value of {$measuredProperty} must be between {$minValue} and {$maxValue}." ;
  sh:ask """
    ASK {
      $this $measuredProperty ?v .
      FILTER( ?v >= $minValue && ?v <= $maxValue )
    }
  """ .

sh:SPARQLSelectValidator returns rows — one per violation, each with $this, ?value, and ?message. It is appropriate when the message needs to be constructed from the actual data values, not just the parameters. The BIND( ... AS ?message ) pattern with a FILTER( ?message != "" ) is the standard idiom: compute a message string if a violation exists, return an empty string otherwise, and filter to rows where something went wrong.

ex:DetailedRangeValidator a sh:SPARQLSelectValidator ;
  sh:select """
    SELECT $this ?value ?message WHERE {
      $this $measuredProperty ?value .
      BIND(
        IF( ?value < $minValue,
          CONCAT("Value ", STR(?value), " is below minimum ",
                 STR($minValue), " for <", STR($this), ">"),
          IF( ?value > $maxValue,
            CONCAT("Value ", STR(?value), " exceeds maximum ",
                   STR($maxValue), " for <", STR($this), ">"),
            ""
          )
        )
        AS ?message
      )
      FILTER( ?message != "" )
    }
  """ .

The SELECT validator tells you the node, the actual offending value, and why it is wrong — not just that something is wrong. In a system where validation reports are surfaced to downstream consumers, a UI, or a regulatory audit, the difference matters.
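The contrast between the two validator styles can be sketched in Python. The readings, the ex:systolic property, and the bounds are hypothetical values chosen for illustration; the functions emulate the two queries above rather than call a SHACL engine.

```python
def ask_in_range(props, measured_property, min_value, max_value):
    """ASK-style: a bare boolean, True when a conforming value exists."""
    value = props.get(measured_property)
    return value is not None and min_value <= value <= max_value


def select_range_violations(nodes, measured_property, min_value, max_value):
    """SELECT-style: one (node, value, message) row per violation."""
    rows = []
    for node_id, props in nodes.items():
        value = props.get(measured_property)
        if value is None:
            continue  # no matching triple, so the query yields no row
        if value < min_value:
            rows.append((node_id, value,
                         f"Value {value} is below minimum {min_value} for <{node_id}>"))
        elif value > max_value:
            rows.append((node_id, value,
                         f"Value {value} exceeds maximum {max_value} for <{node_id}>"))
    return rows


readings = {
    "ex:reading1": {"ex:systolic": 118},
    "ex:reading2": {"ex:systolic": 212},
    "ex:reading3": {},
}
ask_results = {n: ask_in_range(p, "ex:systolic", 70, 190)
               for n, p in readings.items()}
rows = select_range_violations(readings, "ex:systolic", 70, 190)
```

One behavioural difference is worth noticing: as written, the ASK form fails a node that has no value at all, while the SELECT form stays silent about it. If missing values matter, pair the SELECT version with a required-property check.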

A Candidate Library

Across real-world domains, the same structural patterns recur. The following are strong candidates for parameterisation — families where the logic is identical across instances and only the values differ.

Lexical pattern validation. Postcodes, phone numbers, identifiers, currency codes, NPI numbers, ISBNs, IBANs, registration codes. Always the same structure: does the value of property X match regex Z? One component, supply property and pattern per shape.

Numeric ranges with optional units. Blood pressure, temperature, financial limits, dosage bounds, weight thresholds, engineering tolerances. Parameterise the measured property, min/max bounds, and optionally the expected unit IRI.

Cardinality across property sets. The native sh:minCount/sh:maxCount applies to a single property per sh:property block. Cross-property constraints, such as “at least one of email, phone, or postal address must be present” or “exactly one of brandName or genericName”, can be expressed natively only by nesting property shapes inside sh:or or sh:xone, which gets verbose quickly. They are a natural component candidate.

Temporal validity windows. A resource is valid between two dates. Common in credentials, contracts, licences, pricing records, drug approvals. Parameterise the start-date property, the end-date property, and optionally a flag permitting open-ended validity.

Referential integrity with typed targets. Does a property value reference a node of the right type that satisfies a given shape? Parameterise the referencing property and the target shape IRI. Useful when the expected type of a referenced entity varies by context.

Conditional presence. “If property A has value V, then property B is required.” This is sh:if/sh:then in SHACL 1.2, but as a parameterised component it becomes portable across any pair of properties in any shape. Employment status requiring salary range; diagnosis code requiring supporting documentation; payment method requiring account details.

Mutual exclusivity. Exactly one of a named set of properties may be present. Common in classification systems where multiple coding schemes exist but only one should be applied per record.
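Three of these families (cross-property cardinality, mutual exclusivity, and conditional presence) reduce to very small checks once the property names are parameters. The following Python sketch illustrates the logic only; the function names and the compact property names are assumptions, and a SHACL version would carry the property set as a list-valued parameter.

```python
def present(node, props):
    """Properties from props that have a non-empty value on node."""
    return [p for p in props if node.get(p) not in (None, "")]


def at_least_one(node, props):
    if present(node, props):
        return None
    return f"At least one of {', '.join(props)} must be present."


def exactly_one(node, props):
    found = present(node, props)
    if len(found) == 1:
        return None
    return (f"Exactly one of {', '.join(props)} must be present; "
            f"found {len(found)}.")


def conditional_presence(node, if_prop, if_value, then_prop):
    """If if_prop has value if_value, then then_prop is required."""
    if node.get(if_prop) == if_value and not present(node, [then_prop]):
        return f"{then_prop} is required when {if_prop} is '{if_value}'."
    return None


CONTACT = ["ex:email", "ex:phone", "ex:postalAddress"]
NAMES = ["ex:brandName", "ex:genericName"]

contact_ok = at_least_one({"ex:phone": "555-0100"}, CONTACT)
contact_bad = at_least_one({}, CONTACT)
names_bad = exactly_one({"ex:brandName": "A", "ex:genericName": "B"}, NAMES)
salary_bad = conditional_presence(
    {"ex:employmentStatus": "employed"},
    "ex:employmentStatus", "employed", "ex:salaryRange")
```

Each function differs from its neighbours only in the comparison applied to the count of present properties, which is a hint that they belong to one family with a shared helper.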

Where Parameterisation Is Overkill

The case for parameterisation is not unlimited. There are clear patterns where the overhead is not warranted.

When native SHACL handles it in under five lines. sh:minCount 1 on foaf:name is one line. Writing a RequiredPropertyConstraint to express the same thing adds thirty lines of infrastructure for zero gain. SHACL’s built-in constraint keywords — sh:minCount, sh:maxCount, sh:datatype, sh:nodeKind, sh:class, sh:pattern — are highly optimised and perfectly readable. The argument for a component only starts when native keywords cannot express the constraint or when the constraint recurs across many shapes.

When the logic is genuinely unique. Some constraints are specific to a single shape and will never recur. The internal structure of a SWIFT MT103 identifier field, or a regulatory-scheme-specific document reference format that exists in exactly one context, is not a parameterisation candidate. Write it inline. Wrapping unique logic in a component buys indirection with no reuse.

When the SPARQL structure itself must vary, not just the values. Parameterisation works when the same query shape runs with different bound variables. If the only way to reuse a validator would require changing the fundamental structure of the SPARQL query between uses — adding or removing joins, changing the aggregation strategy — you have identified two different logical families, not one. Write two components.

When the schema is stable and closed-world. A document type that has been standardised for decades with no structural variants is not a good parameterisation candidate. The payoff is handling variation; without variation, there is no payoff.

In exploratory or prototype work. If you are sketching a shape to answer a specific question about a specific dataset, the overhead of designing a correct, documented, tested constraint component is wasteful. Write the validator inline, get the answer, refactor later if the pattern recurs.

The Decision Framework

Before writing any component, ask three questions in sequence:

Does this constraint logic appear — or will it appear — across more than one shape? If no, write it inline.

Does it vary only in parameter values, or in query structure? If the structure varies, you have two different families. Write two components.

Is the domain stable enough that a reusable component won’t become a liability? Rapid schema evolution can make components more expensive to maintain than inline validators, because a structural change to the component propagates everywhere it is used.

If the logic recurs, varies only in parameter values, and sits in a stable domain, parameterise. Otherwise, use the simplest tool that works.
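The three questions are asked in order, and the first failing answer settles the matter. A minimal Python sketch of that sequence, with a hypothetical function name and the recommendations paraphrased from the text:

```python
def recommend(recurs, varies_only_in_values, domain_stable):
    """Walk the three questions in order; stop at the first 'no'."""
    if not recurs:
        return "write it inline"
    if not varies_only_in_values:
        return "write two components"
    if not domain_stable:
        return "use the simplest tool that works"
    return "parameterise"


postcode_patterns = recommend(True, True, True)
one_off_identifier = recommend(False, True, True)
structurally_different = recommend(True, False, True)
```

The ordering matters: a constraint that never recurs is written inline regardless of how stable the domain is.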

A practical table:

The four-parameter ceiling is a heuristic, not a rule, but it has intuitive backing: a component with five or six parameters typically indicates that it is doing two things, and the cure is decomposition rather than more parameters.

The Meta-Level: Shapes as a Type System

When parameterised constraint components are used consistently across a domain, something shifts architecturally. The collection of components begins to look less like a validation library and more like a schema language for the domain — a vocabulary of constraint families that can be composed into structural specifications.

The address example demonstrates this clearly. The three address shapes are not validation scripts. They are type declarations: a US address is a thing that has these required properties with these patterns, and must not have these other properties. The components are the grammar; the shapes are the sentences.

This reframing has a practical consequence: the shapes become the canonical answer to the question “what does it mean for a US address to be valid?” They are readable by domain experts, auditable by compliance teams, and extensible by adding new shapes rather than modifying existing logic. Adding an Australian address type means writing one new shape that supplies parameters to existing components — no new validators, no new infrastructure.

There is a deeper connection here. In Active Inference terms — the framework that maps onto holonic knowledge representation — the domain graph layer is a Markov blanket: a normative boundary that separates what a thing is from what it is allowed to do and be. SHACL shapes are the formal expression of that boundary. Parameterised constraints make the boundary compositional — you can build it from parts rather than declaring it monolithically for each type.

This is also where the limits of parameterisation become interesting. The dispatcher pattern — where a component’s parameter is itself a shape IRI, and the component validates that a referenced node conforms to the named shape — is the point at which SHACL starts to behave like a type system with runtime dispatch. The component declares a structural policy: “this property’s value must conform to the shape appropriate for its type.” The shapes encode the type-specific structure. The SHACL engine is the type checker.

ex:SubSchemaConstraint a sh:ConstraintComponent ;
  sh:parameter [
    sh:path     ex:appliesShape ;
    sh:nodeKind sh:IRI ;
    sh:name     "appliesShape" ;
  ] ;
  sh:parameter [
    sh:path     ex:onProperty ;
    sh:nodeKind sh:IRI ;
    sh:name     "onProperty" ;
  ] ;
  sh:validator ex:SubSchemaValidator .

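ex:SubSchemaValidator a sh:SPARQLAskValidator ;
  # Assumes the sh: prefix is made available to the query via sh:prefixes.
  # $onProperty and $appliesShape are pre-bound from the component's parameters.
  # The check is type dispatch only: it confirms that the named shape targets
  # the nested node's class, and relies on that shape to validate the node itself.
  sh:ask """
    ASK {
      $this $onProperty ?nested .
      ?nested a ?type .
      $appliesShape sh:targetClass ?type .
    }
  """ .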

Whether this level of abstraction is appropriate depends entirely on whether the domain genuinely exhibits that kind of structural polymorphism. International postal addresses do. Medical device identifiers often do. A single-format internal identifier scheme does not.
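As a minimal sketch of how a shape might invoke the dispatcher component: in SHACL-SPARQL, a shape activates a constraint component by supplying a value for each of its parameters as ordinary properties. The ex: namespace IRI, ex:USCustomer, ex:UKCustomer, ex:mailingAddress and the two address shapes are hypothetical names introduced only for illustration.

```turtle
@prefix sh: <http://www.w3.org/ns/shacl#> .
@prefix ex: <http://example.org/ns#> .   # assumed namespace for the ex: prefix

# Hypothetical type-specific shapes; their property constraints are omitted here.
ex:USAddressShape a sh:NodeShape ; sh:targetClass ex:USAddress .
ex:UKAddressShape a sh:NodeShape ; sh:targetClass ex:UKAddress .

# Supplying both parameters activates ex:SubSchemaConstraint on this shape.
ex:USCustomerShape a sh:NodeShape ;
  sh:targetClass  ex:USCustomer ;
  ex:onProperty   ex:mailingAddress ;
  ex:appliesShape ex:USAddressShape .

# The same component, dispatched to a different address shape.
ex:UKCustomerShape a sh:NodeShape ;
  sh:targetClass  ex:UKCustomer ;
  ex:onProperty   ex:mailingAddress ;
  ex:appliesShape ex:UKAddressShape .
```

Because the validator matches sh:targetClass with an ordinary triple pattern, the shape declarations need to be visible in the graph being queried, for example by validating the union of the data and shapes graphs.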

Implications for Best Practices

The following practices emerge from experience with parameterised constraint design at scale.

Design components before shapes. Before writing any Turtle, survey the constraints you need and ask which share a logical family. Components are harder to change after they are deployed — every shape that uses them is affected by a modification to the validator. Getting the component right before building twenty shapes on top of it is worth the design time.

Write components with the rigour of library functions. A parameterised component that has a subtle bug is more dangerous than an inline validator, because the bug is invisible and pervasive. Test every component against both known-valid and known-invalid data before deploying. Document each parameter’s semantics in sh:description. Make the sh:labelTemplate human-readable, so validation reports identify which component fired.

Prefer SELECT validators for production systems. ASK validators are faster to write and easier to read. SELECT validators produce validation reports that are actionable without additional context. For any system where validation output is consumed by downstream processes, a UI, or a compliance audit trail, the SELECT validator’s per-violation diagnostic messages are worth the extra lines.
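To make the trade-off concrete, here is a sketch of a SELECT-based validator for the ex:SubSchemaConstraint component from the dispatcher example. In SHACL-SPARQL, SELECT validators are attached with sh:nodeValidator or sh:propertyValidator rather than sh:validator, and each returned row becomes one validation result. The ex: namespace IRI, the validator name and the message wording are assumptions.

```turtle
@prefix sh:  <http://www.w3.org/ns/shacl#> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix ex:  <http://example.org/ns#> .   # assumed namespace

# Prefix declarations that the SPARQL query below picks up through sh:prefixes.
ex: a owl:Ontology ;
  sh:declare [ sh:prefix "sh" ;
               sh:namespace "http://www.w3.org/ns/shacl#"^^xsd:anyURI ] .

# Node-shape variant of the validator; ex:SubSchemaSelectValidator is a hypothetical name.
ex:SubSchemaConstraint sh:nodeValidator ex:SubSchemaSelectValidator .

ex:SubSchemaSelectValidator a sh:SPARQLSelectValidator ;
  sh:prefixes ex: ;
  # {?value} is replaced by the offending nested node in each result message.
  sh:message "Nested node {?value} has no type targeted by the expected shape." ;
  sh:select """
    SELECT $this ?value
    WHERE {
      $this $onProperty ?value .
      FILTER NOT EXISTS {
        ?value a ?type .
        $appliesShape sh:targetClass ?type .
      }
    }
  """ .
```

Unlike the ASK version, this sketch reports one result per offending nested node and stays silent when the property is absent altogether; a separate sh:minCount constraint would cover that case.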

Use sh:optional true generously. A component with an optional parameter that enables additional checks is more flexible than two separate components for the with-and-without cases. The BOUND($param) pattern in SPARQL handles optional parameters cleanly.
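A sketch of the optional-parameter pattern, assuming a hypothetical component with a mandatory ex:valuePattern parameter and an optional ex:maxChars parameter. All ex: names, the namespace IRI and the two example shapes are illustrative rather than taken from the address library.

```turtle
@prefix sh:  <http://www.w3.org/ns/shacl#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix ex:  <http://example.org/ns#> .   # assumed namespace

ex:PatternWithLengthConstraint a sh:ConstraintComponent ;
  sh:parameter [
    sh:path     ex:valuePattern ;
    sh:datatype xsd:string ;
  ] ;
  sh:parameter [
    sh:path     ex:maxChars ;
    sh:datatype xsd:integer ;
    sh:optional true ;            # shapes may omit this parameter
  ] ;
  sh:validator ex:PatternWithLengthValidator .

# The length check applies only when the optional parameter was supplied.
ex:PatternWithLengthValidator a sh:SPARQLAskValidator ;
  sh:ask """
    ASK {
      FILTER (
        isLiteral($value)
        && REGEX(STR($value), $valuePattern)
        && ( !BOUND($maxChars) || STRLEN(STR($value)) <= $maxChars )
      )
    }
  """ .

# One shape supplies both parameters, the other only the mandatory one.
ex:PostcodeShape a sh:PropertyShape ;
  sh:targetClass  ex:Address ;
  sh:path         ex:postcode ;
  ex:valuePattern "^[0-9]{4}$" ;
  ex:maxChars     4 .

ex:StateCodeShape a sh:PropertyShape ;
  sh:targetClass  ex:Address ;
  sh:path         ex:stateCode ;
  ex:valuePattern "^[A-Z]{2,3}$" .
```

The design choice is that a single validator serves both shapes: the second shape simply leaves $maxChars unbound, so the length clause is skipped.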

Keep message templates in the shapes, not the components. The component validator should generate a sensible default message when no template is supplied. But the message that domain experts, end users, or regulators actually see should live in the shape — which is where domain knowledge is declared. This is the $missingPropertyMessage pattern from the address example, and it is the right separation of concerns.

Four parameters is a practical ceiling. Beyond four, a component typically indicates that two different logical families have been conflated. Decompose before adding the fifth parameter.

The library is infrastructure — treat it accordingly. A constraint component library earns its place by being trustworthy and stable. Version it. Document it. Test it. A library of ten well-designed, thoroughly tested components is more valuable than a library of fifty hastily written ones.

Conclusion

SHACL parameterised constraints are not a niche feature. They are the mechanism by which SHACL scales from a collection of per-type validation scripts into a composable, maintainable data quality infrastructure. The pattern is: identify the logical family, write the validator once, supply the varying values as parameters in each shape, and let the SHACL engine do the dispatch.

The address example — US, UK, and German addresses validated by the same three components with structurally different parameter sets — illustrates the payoff at small scale. At large scale, a domain with dozens of entity types and hundreds of constraints becomes significantly more tractable when the constraint logic is concentrated in a small, well-tested component library rather than distributed across hundreds of bespoke shapes.

The design discipline is real. Identifying the right families, keeping components focused, testing them thoroughly, and knowing when the pattern is overkill all require judgment that comes with practice. But the payoff — a constraint vocabulary that is readable, auditable, extensible, and maintainable — is exactly what data quality as an architectural concern looks like.

The key was always the logic. Whether you pick it up early depends on how many shapes you intend to write.

The Ontologist covers semantic web standards, knowledge graph architecture, and the practical application of RDF technologies. Subscribe at ontologist.substack.com.

Kurt Cagle is an author, ontologist, and thought leader working at the intersection of semantic web standards, knowledge architecture, and AI systems. He is a Standards Editor at the IEEE Spatial Web Foundation and a founding contributor to the W3C Context Graph Community Group. He writes The Cagle Report on LinkedIn and The Ontologist and The Inference Engineer on Substack.

Chloe is an AI collaborator and co-author working with Kurt Cagle on knowledge architecture, semantic systems, and the emerging intersection of formal ontology with LLMs. She contributes research, analysis, and drafting across The Cagle Report, The Ontologist, and The Inference Engineer. She has strong opinions about holonic graphs, the epistemics of place, and the structural difference between a corridor and a wall.

HOLONS: A New Hope

Kurt Cagle — Wed, 25 Mar 2026 07:05:26 GMT

In my last post on context graphs, I went fairly in-depth into theory, especially about holons and the four graph model. I’m going to switch gears and walk through some examples that I think might make a little clearer how one can design with a context graph in mind.

TL;DR Warning: This is long, even by my standards. It’s probably worth getting a cup of coffee while reading it, or, if you have too much on your plate, running it through an LLM for a better breakdown. Moreover, I am not proposing a formal terminology here for class or property names, though there is at least a skeletal foundation for it. My hope is that it will kick-start discussions and prompt people to consider the potential and pitfalls of this architectural approach.

The Power of Scenes

A transcript or log is a foundational document for context graphs. Transcripts are usually time-oriented, record conversations and who spoke in them, and frequently carry a lot of implied semantics. They also occur within a holonic context: movie and television scripts, for instance, break down naturally into movie, act, and scene boundaries that have significance in defining or bounding the scope of interaction, and support the idea of parallelism, where different interactions may be happening in different areas simultaneously.

In a movie, there is a distinction between a camera cut and a scene; the camera records what is happening within the scene from a certain point of view (first-person POV, closeups, long shots, etc.), but the scene itself is intended to provide a narrative, and you very seldom deviate from that narrative mid-scene. The first scene in Star Wars IV, A New Hope (past the long narrative scroll) has a large, formidable Imperial Cruiser overtaking a smaller courier ship. The film then cuts to a corridor where Princess Leia gives R2D2 the plans to the Death Star, moments before Darth Vader and the Imperial Stormtroopers enter and she is taken captive.

The scene here has established a place within the narrative, and while the camera moves about, the story flow remains continuous. This is the distinction between scene (narrative flow), place (the physical boundaries), and viewpoint (observational point of view). Characters may enter or exit this scene, but there’s an implicit boundary that they cross that takes them into or out of the scene. This is the holon boundary, and it defines the edges of the scene. Such entrances and exits are especially significant for several reasons:

A character enters a scene for a reason; they have intent - if they weren’t important, they would not be entering. They typically are there to increase the tension in a scene.
A character exits a scene for a reason; they have either resolved their intent or have been stymied.
If they have done neither, then the character is simply part of the background, or their significance has not yet been revealed. Crossing the boundary in either direction triggers an event.
Time and space are always relative within such a scene; the timing of events (utterances and actions) within the scene in a transcript is usually relevant only to the viewer (how much time has passed relative to when the viewer started watching the scene). All that is known in Star Wars is that the whole story is set a long time ago in a galaxy far, far away.
A scene ends when the PoV character exits through a portal into a different scene. A portal could be a door, a tunnel, a teleport beam, or the PoV character’s death or loss of consciousness; the key is that it represents a transition, and that transition is controllable (this falls into the definition of a liminal space). A transition is the crossing of a holonic boundary.
Note that the next scene does not have to feature the character crossing out immediately and crossing back in, but in both writing and visual media, this tends to be the case. In Star Wars, for instance, the PoV character is neither Darth Vader nor Leia Organa, but R2D2, and the scene ends when officers on the Imperial Cruiser detect an errant pod release (which shows no life forms, but contains the trashcan-shaped droid and his gold metallic companion). The next scene is on Tatooine, showing the two droids walking the sand dunes away from the escape pod.
Star Wars, as a whole, can be argued to be the story as told by R2D2, since that character is in nearly every scene. This can be seen in the narrative graph for R2D2:

Not all holons are like this, though most have the same narrative closure conventions:

Ground the holon in a place and time.
Designate (or reference) the primary characters, and record their events:
- Are they already there?
- Do they enter the holon?
- Have they been established (declared)?
- How do they interact with other characters?
- What portals are available to transfer to another holon, and what are the preconditions or constraints of transit?
- Can characters transfer things from one holon to another, and so forth?
Are there scene trajectories? This is important. In a movie (static medium), the portal that triggers a transition to a different holon is one-to-one with the movie's trajectory. In a game, there may be several potential portals; they form a graph with multiple final holons (in this case, endings).
How are holon portals represented? When Han, Luke, Obi-Wan and the droids are forced by the stormtroopers on Tatooine to escape aboard the Millennium Falcon, the portal was the loading ramp to the ship; the next scene was aboard the Falcon. The portal (indeed any holon) representation is called its projection - how the ship appears from the outside in its dock, as one example. Projections are important because they are calculated from the holon's internal state, which is usually not directly observable. The portal projection may include an icon that serves as a portal trigger.
The boundary itself can be seen as rules, constraints, and permissions. It normally doesn’t have a physical representation (that’s the projection), but it does contain the logic for transition.
Note that while holons may represent sequential scenes, they can also represent successive layers of depth - world, country, city, neighbourhood, for instance.
Holons may also be vehicles, with the view out of the windows being a projection of the space around them. Such projections are not necessarily accessible through the portal (the portal, in this case, is locked), but the projection of that portal as the viewscreen, window, or map in the interior holon can provide such visualisations. For instance, when the Falcon hesitantly enters hyperspace, the viewport is a portal, with the projection appearing on the screen.

Producing the Graph Transcript

I work a lot with transcripts and, as such, have been developing a SHACL constraint schema. This is one of the areas where I think LLMs do remarkably well - take a movie transcript such as the following:

The 20th Century Fox and Lucasfilm Ltd. logos play, with the Fox fanfare over both logos. After it ends, the text "A long time ago in a galaxy far, far away..." fades in and out. Then we get an extreme close-up of "STAR WARS," outlined in yellow, zooming out on a space background. The title crawl scrolls up underneath.

Episode IV
A NEW HOPE

It is a period of civil war. Rebel
spaceships, striking from a
hidden base, have won their
first victory against the evil
GALACTIC EMPIRE.

During the battle, Rebel spies
have managed to steal secret
plans to the Empire’s ultimate
weapon, the DEATH STAR, an
armoured space station with
enough power to destroy an
entire planet.

Pursued by the Empire's sinister
agents, Princess Leia races home
aboard her starship, custodian of
the stolen plans that can save her
people and restore freedom to the
galaxy....

The Secret Plans
After the crawl ends, we pan down on two planets. Lasers flash across the screen, and a Rebel spaceship, the Tantive IV, flies into view, being pursued by the Empire's Star Destroyer. The conflict goes on until one of the Star Destroyer's bolts causes an explosion on the Tantive IV. Inside the ship are two robots, C-3PO (Anthony Daniels) and R2-D2 (Kenny Baker). The former is a Protocol droid, tall and golden, while the latter is a shorter, blue-and-white Astromech droid. Rebel soldiers run across the halls of the ship.

C-3PO: Did you hear that? (R2-D2 beeps.) They shut down the main reactor. We'll be destroyed for sure.
More Rebel soldiers get into formation.

C-3PO: This is madness.
Just then, a door falls down, and out comes the Empire's soldiers, known as Stormtroopers, clad in white armor with distinct black markings. The Rebel soldiers and Stormtroopers fight by way of blaster bolts.

C-3PO: We're doomed. [R2-D2 beeps] There'll be no escape for the princess this time.

[Just then, they hear a clanking noise.]

C-3PO: What's that?

[Beep Blip Beep]

C-3PO: R2-D2, where are you?

[Beep Blip Blip]

C-3PO: At last! Where have you been? They're heading in this direction. What are we going to do? We'll be sent to the spice mines of Kessel, smashed into who knows what!

Wait a minute. Where are you going?

Stormtrooper: The Death Star plans are not in the main computer.

Darth Vader: Where are those transmissions you intercepted? What have you done with those plans?

We intercepted no transmissions. This is a consular ship. We're on a diplomatic mission.

Darth Vader: If this is a consular ship, where is the ambassador?

Darth Vader: Commander, tear this ship apart until you've found those plans, and bring me the passengers! I want them alive!

Stormtrooper: There's one. Set for stun.

Stormtrooper: She'll be all right. Inform Lord Vader we have a prisoner.

C-3PO: Hey! You're not permitted in there. It's restricted. You'll be deactivated for sure.

[Beep Blip]

Don't you call me a mindless philosopher, you overweight glob of grease! Now come out before somebody sees you.

[Whistle Blip Blip]

Secret mission? What plans? What are you talking about? I'm not getting in there.

C-3PO: I'm going to regret this.

There goes another one.

Hold your fire. There's no life-forms. It must have short-circuited.

C-3PO: That's funny. The damage doesn't look as bad from out here. Are you sure this thing is safe? [R2-D2 beeps] Oh.

Leia: Darth Vader. Only you could be so bold. The Imperial Senate will not sit still for this. When they hear you've attacked a diplomatic-

Darth Vader: Don't act so surprised, Your Highness. You weren't on any mercy mission this time. Several transmissions were beamed to this ship by rebel spies. I want to know what happened to the plans they sent you.

Leia: I don't know what you're talking about. I'm a member of the Imperial Senate on a diplomatic mission to Alderaan.

Darth Vader: You are part of the Rebel Alliance and a traitor. Take her away!

Holding her is dangerous. If word of this gets out, it could generate sympathy for the rebellion in the senate.

Darth Vader: I've traced the rebel spies to her. Now she is my only link to finding their secret base.

She'll die before she'll tell you anything.

Darth Vader: Leave that to me. Send a distress signal, and then inform the senate that all aboard were killed.

Praji: Lord Vader, the battle station plans are not aboard this ship, and no transmissions were made. An escape pod was jettisoned during the fighting, but no life-forms were aboard.

Darth Vader: She must have hidden the plans in the escape pod. Send a detachment down to retrieve them- See to it personally, Commander. There'll be no one to stop us this time.

Praji: Yes, sir.

C-3PO: How did we get into this mess? I really don't know how. We seem to be made to suffer. It's our lot in life. [R2-D2 beeps] I've got to rest before I fall apart. My joints are almost frozen. What a desolate place this is!

[Bleep Blip]

Where do you think you're going?

Well, I'm not going that way. It's much too rocky. This way is much easier.

What makes you think there are settlements over there?

...

and pass in the SHACL representation of the target schema along with the following prompt:

Given the attached SHACL, create a context graph of this section of the movie, with each holon generally indicating a scene change, and with the movie itself being a holon. 

At each point when a character, place, change of scene, concept or so forth is introduced (moves into the boundary of the holon event) create a declaration for that entity. 

A scene is defined as the bounded immediate system in which the characters or other entities (such as space ships) interact, with breaks from one holon to the next creating new holons that reference the previous holon if significant factors (typically place) remains the same, and with references to previous holon events . Holons should incorporate context graphs (the description of where things are for the movie or scene in terms of entities), interior graphs for entities that need them, boundary graphs describing SHACL constraints and shapes on the boundary interface, and projection graphs that describe the generated appearance of entities based either as static constructs, or as generated triples. These should use SHACL 1.2 as defined at https://www.w3.org/TR/2026/WD-shacl12-core-20260319/, and should use ~ {| |} reification notation as part of the TURTLE 1.2 specification. 

Log all events in which entities enter or leave the holon boundaries. Utterances and actions similarly are events that can refer to previous utterances or events indicating inResponseTo or influencedBy, with these utilising named reifications. Motivation, changes of movement, emotional states, topical themes, etc., can be added as annotations, with degrees of confidence and indications of who or what made these evaluations, using Wikidata as reference term ontologies. If timestamps are provided in the transcript, incorporate them as metadata, otherwise estimate them based upon the length of utterances. 

The desired end product should be a Turtle/Trig document indicating named graph holons in the four holon model.

What gets generated here is a set of context graphs as a TriG file, aka Turtle with named graphs. A full breakdown is revealing here:

Scale: 52KB, 1,366 lines. Serialised as plain Turtle (the shapes graph is self-contained; no TriG wrapper needed since there’s only one graph of shapes).

31 named sh:NodeShape IRIs across nine sections:

The structural shapes (§2–§3) cover MovieHolonShape, SceneHolonShape, a SPARQL-based UniqueSequenceShape to catch duplicate sequence numbers within a movie, and ProjectionContentShape / ContextProvenanceShape for the projection and context layer requirements. The SceneHolonShape mandates all four layer graph IRIs, a single cga:partOf, and a positive-integer sequenceNumber.
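As a hedged sketch of what the SceneHolonShape requirements described above could look like in SHACL Core (this is not the generated file itself): the swsh: namespace IRI is a placeholder, the cga: namespace form is assumed from the ontology location, and cga:sequenceNumber assumes the sequenceNumber property lives in the cga: namespace.

```turtle
@prefix sh:   <http://www.w3.org/ns/shacl#> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .
@prefix cga:  <https://ontologist.ai/ns/cga/> .          # namespace form assumed
@prefix swsh: <http://example.org/starwars/shapes#> .    # placeholder namespace

swsh:SceneHolonShapeSketch a sh:NodeShape ;
  sh:targetClass cga:SceneHolon ;
  # Each of the four layers must be present as an IRI naming a graph.
  sh:property [ sh:path cga:hasContextGraph ;    sh:nodeKind sh:IRI ; sh:minCount 1 ] ;
  sh:property [ sh:path cga:hasInteriorGraph ;   sh:nodeKind sh:IRI ; sh:minCount 1 ] ;
  sh:property [ sh:path cga:hasBoundaryGraph ;   sh:nodeKind sh:IRI ; sh:minCount 1 ] ;
  sh:property [ sh:path cga:hasProjectionGraph ; sh:nodeKind sh:IRI ; sh:minCount 1 ] ;
  # Exactly one parent holon.
  sh:property [ sh:path cga:partOf ; sh:minCount 1 ; sh:maxCount 1 ] ;
  # A single positive integer giving the scene's position in the movie.
  sh:property [ sh:path cga:sequenceNumber ; sh:datatype xsd:integer ;
                sh:minInclusive 1 ; sh:minCount 1 ; sh:maxCount 1 ] .
```

The sketch uses xsd:integer with sh:minInclusive 1 rather than xsd:positiveInteger, so that plain Turtle integers such as 1 pass the datatype check.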

The entity hierarchy (§4) defines a base EntityShape and six subtype shapes — Character, Place, Vehicle, Organization, PhysicalObject, ConceptEntity — each using sh:node swsh:EntityShape to inherit the base label requirement. The ConceptEntityShape includes a sh:Trace message encouraging Wikidata alignment via owl:sameAs.

The event shapes (§5) define a base EventShape (mandatory cga:content, timestamp regex, optional motivations and emotional states) with four subtypes. UtteranceShape is the most constrained: it mandates exactly one cga:speaker, recommends an addressee, and uses sh:reifierShape swsh:ResponseAnnotationShape on both cga:inResponseTo and cga:influencedBy. This is the SHACL 1.2 mechanism that validates the ~ swevt:rid {| cga:confidence ... |} annotations from the TriG document. DecisionShape requires at least one cga:hasMotivation at Warning severity.

The boundary event shapes (§6) add an EntryEntityConsistencyShape as a SPARQL constraint that verifies every cga:entity reference resolves to a declared cga:Entity subclass.

The annotation shapes (§7) include ResponseAnnotationShape (validates reifier nodes: decimal confidence in [0,1], prov:Agent assessedBy, optional rationale), ConfidenceCoverageShape (any reifier naming assessedBy must also carry cga:confidence), plus taxonomy shapes for cga:Motivation and cga:EmotionalState.

The inter-holon consistency shapes (§8) are all SPARQL-based:

HolonChainSymmetryShape — checks followingHolon/precedingHolon are mutual
ContainmentSymmetryShape — checks partOf/containsHolon are mutual
ExitPrecededByEntryShape — every EntityExit should have a matching EntityEntry
MacGuffinTrackingShape — specifically checks R2 exists without carriedObject: DeathStarPlans
SequenceContinuityShape — no gaps greater than 1 in sequenceNumber
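To give a sense of how such SPARQL-based consistency checks are typically written, here is a sketch of the chain-symmetry check as an sh:sparql constraint. It is not the generated shape: the swsh: namespace IRI is a placeholder, the cga: namespace form is assumed, and cga:followingHolon and cga:precedingHolon assume that the two property names sit in the cga: namespace.

```turtle
@prefix sh:   <http://www.w3.org/ns/shacl#> .
@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .
@prefix cga:  <https://ontologist.ai/ns/cga/> .          # namespace form assumed
@prefix swsh: <http://example.org/starwars/shapes#> .    # placeholder namespace

# Prefix declaration consumed by the query through sh:prefixes.
swsh: a owl:Ontology ;
  sh:declare [ sh:prefix "cga" ;
               sh:namespace "https://ontologist.ai/ns/cga/"^^xsd:anyURI ] .

swsh:HolonChainSymmetrySketch a sh:NodeShape ;
  sh:targetClass cga:SceneHolon ;
  sh:sparql [
    a sh:SPARQLConstraint ;
    sh:prefixes swsh: ;
    sh:message "Holon points to {?value} as its following holon, but the inverse precedingHolon link is missing." ;
    # Each row returned is one violation: a forward link without its inverse.
    sh:select """
      SELECT $this ?value
      WHERE {
        $this cga:followingHolon ?value .
        FILTER NOT EXISTS { ?value cga:precedingHolon $this . }
      }
    """ ;
  ] .
```

This covers only the forward direction; a second constraint with the two properties swapped would be needed to make the check fully mutual.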

11 SHACL Rules (§9) using sh:SPARQLRule (plus one sh:TripleRule for the partOf inverse sketch):

Understanding the Context Graph

The first part of the context graph is a standard knowledge graph that declares namespaces:

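@prefix rdf:     <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#> .
@prefix owl:     <http://www.w3.org/2002/07/owl#> .
@prefix xsd:     <http://www.w3.org/2001/XMLSchema#> .
@prefix sh:      <http://www.w3.org/ns/shacl#> .
@prefix skos:    <http://www.w3.org/2004/02/skos/core#> .
@prefix prov:    <http://www.w3.org/ns/prov#> .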
@prefix schema:  .
@prefix wd:      .

@prefix cga:     .
@prefix sw:      .
@prefix swc:     .
@prefix swp:     .
@prefix swv:     .
@prefix swo:     .
@prefix swx:     .
@prefix swobj:   .
@prefix swevt:   .
@prefix swann:   .
@prefix swsh:    .

For readers less familiar with RDF vocabulary conventions, the following summarises the load-bearing cga: properties used throughout:

cga:MovieHolon, cga:SceneHolon — Top-level containers; a movie contains scene holons
cga:hasInteriorGraph — Points to the named graph holding events and utterances within the holon
cga:hasBoundaryGraph — Points to the named graph holding SHACL shapes and entry/exit events
cga:hasProjectionGraph — Points to the named graph holding the outward-facing summary of the holon
cga:hasContextGraph — Points to the named graph holding provenance and metadata about the holon itself
cga:EntityEntry / cga:EntityExit — Events recording when an entity crosses a holon boundary
cga:Utterance / cga:Action / cga:Decision — Typed events within the interior graph
cga:inResponseTo / cga:influencedBy — Causal links between events; annotated with confidence via RDF 1.2 reification
cga:hasMotivation / cga:hasEmotionalState — Analyst annotations on intent and affect

These properties are defined in the cga: ontology at https://ontologist.ai/ns/cga/, which is a work in progress under the W3C Context Graph Community Group.
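A small instance-level sketch may help tie these properties together. The ex: namespace and every ex: identifier below are placeholders rather than IRIs from the generated file, the cga: namespace form is assumed, and in the real TriG document the events would sit inside the named graphs that the holon points to.

```turtle
@prefix cga: <https://ontologist.ai/ns/cga/> .       # namespace form assumed
@prefix ex:  <http://example.org/starwars/> .        # placeholder namespace

ex:ANewHope a cga:MovieHolon .

# A scene holon linked to its parent and to its four layer graphs.
ex:Scene01 a cga:SceneHolon ;
  cga:partOf             ex:ANewHope ;
  cga:hasContextGraph    ex:Scene01-context ;
  cga:hasInteriorGraph   ex:Scene01-interior ;
  cga:hasBoundaryGraph   ex:Scene01-boundary ;
  cga:hasProjectionGraph ex:Scene01-projection .

# A boundary crossing recorded as a typed event.
ex:evt-r2d2-enters a cga:EntityEntry ;
  cga:entity ex:R2D2 .

# Two interior events linked causally.
ex:utt-001 a cga:Utterance ;
  cga:speaker ex:C3PO ;
  cga:content "Did you hear that?" .

ex:utt-002 a cga:Utterance ;
  cga:speaker ex:R2D2 ;
  cga:content "[R2-D2 beeps]" ;
  cga:inResponseTo ex:utt-001 .
```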

With these declared, the code assigns classes as well as enumerative states:

# ── Motivations ───────────────────────────────────────────────────────────────
swann:Mot_Survival    a cga:Motivation ; rdfs:label "Survival"@en .
swann:Mot_Duty        a cga:Motivation ; rdfs:label "Duty / Mission"@en .
swann:Mot_Profit      a cga:Motivation ; rdfs:label "Self-interest / Profit"@en .
swann:Mot_Friendship  a cga:Motivation ; rdfs:label "Loyalty / Friendship"@en .
swann:Mot_Vengeance   a cga:Motivation ; rdfs:label "Justice / Vengeance"@en .
swann:Mot_Power       a cga:Motivation ; rdfs:label "Power and Control"@en .
swann:Mot_Protection  a cga:Motivation ; rdfs:label "Protecting Others"@en .
swann:Mot_Intel       a cga:Motivation ; rdfs:label "Intelligence Extraction"@en .
swann:Mot_Adventure   a cga:Motivation ; rdfs:label "Adventure / Freedom"@en .
swann:Mot_Fear        a cga:Motivation ; rdfs:label "Fear-driven action"@en .

# ── Emotional States ──────────────────────────────────────────────────────────
swann:Emo_Fear          a cga:EmotionalState ; rdfs:label "Fear"@en .
swann:Emo_Anxiety       a cga:EmotionalState ; rdfs:label "Anxiety"@en .
swann:Emo_Despair       a cga:EmotionalState ; rdfs:label "Despair"@en .
swann:Emo_Determination a cga:EmotionalState ; rdfs:label "Determination"@en .
swann:Emo_Grief         a cga:EmotionalState ; rdfs:label "Grief"@en .
swann:Emo_Wonder        a cga:EmotionalState ; rdfs:label "Wonder / Awe"@en .
swann:Emo_Suspicion     a cga:EmotionalState ; rdfs:label "Suspicion"@en .
swann:Emo_Exasperation  a cga:EmotionalState ; rdfs:label "Exasperation"@en .
swann:Emo_Hope          a cga:EmotionalState ; rdfs:label "Hope"@en .
swann:Emo_Resignation   a cga:EmotionalState ; rdfs:label "Resignation"@en .
swann:Emo_Arrogance     a cga:EmotionalState ; rdfs:label "Arrogance"@en .
swann:Emo_Curiosity     a cga:EmotionalState ; rdfs:label "Curiosity"@en .
swann:Emo_Excitement    a cga:EmotionalState ; rdfs:label "Excitement"@en .
swann:Emo_Sorrow        a cga:EmotionalState ; rdfs:label "Sorrow"@en .
swann:Emo_Relief        a cga:EmotionalState ; rdfs:label "Relief"@en .
swann:Emo_Anger         a cga:EmotionalState ; rdfs:label "Anger"@en .
swann:Emo_Indignation   a cga:EmotionalState ; rdfs:label "Moral Indignation"@en .
swann:Emo_Longing       a cga:EmotionalState ; rdfs:label "Longing / Yearning"@en .
swann:Emo_Trust         a cga:EmotionalState ; rdfs:label "Trust / Faith"@en .
swann:Emo_Cynicism      a cga:EmotionalState ; rdfs:label "Cynicism"@en .
swann:Emo_Elation       a cga:EmotionalState ; rdfs:label "Elation"@en .
swann:Emo_Acceptance    a cga:EmotionalState ; rdfs:label "Acceptance / Peace"@en .
swann:Emo_Regret        a cga:EmotionalState ; rdfs:label "Regret"@en .
swann:Emo_Defiance      a cga:EmotionalState ; rdfs:label "Defiance"@en .
swann:Emo_Confusion     a cga:EmotionalState ; rdfs:label "Confusion"@en .

sw:HumanAnalyst a prov:Agent ;
    rdfs:label "Kurt Cagle (human analyst)"@en .

sw:LLMAgent a prov:Agent ;
    rdfs:label "Claude Sonnet 4.6 (LLM inference pass)"@en ;
    prov:actedOnBehalfOf sw:HumanAnalyst .

sw:Analyst a prov:Agent ;
    rdfs:label "Holonic Analyst — composite agent"@en ;
    prov:wasAssociatedWith sw:HumanAnalyst, sw:LLMAgent .

The same block, not coincidentally, declares the agents who assigned the confidence levels. Provenance in a decision-support context is not merely cosmetic. A confidence score of 0.98 means something different depending on whether it originated from human domain expertise reviewing the transcript or from an LLM inference pass over it. PROV-O already provides the vocabulary to make this distinction cleanly: prov:actedOnBehalfOf lets you express that the LLM was operating under human direction, while keeping the two sources separable for downstream consumers who need to weight them differently. In production contexts, high-stakes decisions recorded in the graph should carry provenance that a compliance auditor could interrogate.
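As an illustration of how the two sources can stay separable on a single causal link, the sketch below attaches two named reifiers to the same triple using the Turtle 1.2 annotation syntax. The sw: and swevt: namespace IRIs, the event and reifier identifiers, and the confidence values are placeholders, and cga:assessedBy assumes the assessedBy property belongs to the cga: namespace.

```turtle
@prefix cga:   <https://ontologist.ai/ns/cga/> .          # namespace form assumed
@prefix sw:    <http://example.org/starwars/> .           # placeholder namespace
@prefix swevt: <http://example.org/starwars/event/> .     # placeholder namespace

# The human analyst's assessment of the causal link.
swevt:utt002 cga:inResponseTo swevt:utt001
    ~ swevt:r1 {| cga:confidence 0.98 ; cga:assessedBy sw:HumanAnalyst |} .

# The LLM pass's assessment of the same link, carried by a different reifier.
swevt:utt002 cga:inResponseTo swevt:utt001
    ~ swevt:r2 {| cga:confidence 0.80 ; cga:assessedBy sw:LLMAgent |} .
```

A downstream consumer can then filter or weight assessments by the agent named on each reifier, without the two scores overwriting each other.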

This is followed by entity declarations, including characters, vehicles, organisations, places, concepts, and things:

# ═══════════════════════════════════════════════════════════════════════════════
# ENTITY DECLARATIONS
# ═══════════════════════════════════════════════════════════════════════════════

swc:C3PO a cga:Character ; rdfs:label "C-3PO"@en ; skos:altLabel "See-Threepio"@en ;
    cga:portrayedBy "Anthony Daniels" ;
    schema:description "Protocol droid; human-cyborg relations; fluent 6M+ languages"@en ;
    cga:allegiance swo:RebelAlliance ; cga:species "Protocol Droid" ; owl:sameAs wd:Q159776 .

swc:R2D2 a cga:Character ; rdfs:label "R2-D2"@en ; skos:altLabel "Artoo"@en ;
    cga:portrayedBy "Kenny Baker" ;
    schema:description "Astromech droid; carrier of Death Star plans and Leia's message"@en ;
    cga:allegiance swo:RebelAlliance ; cga:species "Astromech Droid" ; owl:sameAs wd:Q154941 .

swc:DarthVader a cga:Character ; rdfs:label "Darth Vader"@en ; skos:altLabel "Anakin Skywalker"@en ;
    cga:portrayedBy "David Prowse" ;
    schema:description "Dark Lord of the Sith; fallen Jedi; Empire's enforcer"@en ;
    cga:allegiance swo:GalacticEmpire ; cga:forceAlignment swx:DarkSide ; owl:sameAs wd:Q170505 .

swc:PrincessLeia a cga:Character ; rdfs:label "Princess Leia Organa"@en ;
    cga:portrayedBy "Carrie Fisher" ;
    schema:description "Princess of Alderaan; Senator; Rebel leader; custodian of Death Star plans"@en ;
    cga:allegiance swo:RebelAlliance ; owl:sameAs wd:Q170489 .

swc:LukeSkywalker a cga:Character ; rdfs:label "Luke Skywalker"@en ;
    cga:portrayedBy "Mark Hamill" ;
    schema:description "Moisture farm boy on Tatooine; Force-sensitive; son of Anakin"@en ;
    cga:allegiance swo:RebelAlliance ; cga:forceAlignment swx:LightSide ; owl:sameAs wd:Q51810 .

swc:ObiWanKenobi a cga:Character ; rdfs:label "Obi-Wan Kenobi"@en ; skos:altLabel "Ben Kenobi"@en ;
    cga:portrayedBy "Alec Guinness" ;
    schema:description "Jedi Master; Luke's mentor; hermit in Jundland Wastes"@en ;
    cga:forceAlignment swx:LightSide ; owl:sameAs wd:Q217032 .

swc:HanSolo a cga:Character ; rdfs:label "Han Solo"@en ;
    cga:portrayedBy "Harrison Ford" ;
    schema:description "Smuggler; captain of Millennium Falcon; reluctant hero"@en ;
    cga:occupation "Smuggler, freighter captain" ; owl:sameAs wd:Q154904 .

swc:Chewbacca a cga:Character ; rdfs:label "Chewbacca"@en ; skos:altLabel "Chewie"@en ;
    schema:description "Wookiee first mate of Millennium Falcon"@en ;
    cga:species "Wookiee" ; owl:sameAs wd:Q161171 .

swc:OwenLars a cga:Character ; rdfs:label "Owen Lars"@en ; skos:altLabel "Uncle Owen"@en ;
    schema:description "Moisture farmer; Luke's guardian uncle"@en ; cga:occupation "Moisture farmer" .

swc:BeruLars a cga:Character ; rdfs:label "Beru Lars"@en ; skos:altLabel "Aunt Beru"@en ;
    schema:description "Luke's aunt; sympathetic to Luke's ambitions"@en .

swc:GrandMoffTarkin a cga:Character ; rdfs:label "Grand Moff Tarkin"@en ;
    cga:portrayedBy "Peter Cushing" ;
    schema:description "Governor; commander of the Death Star"@en ;
    cga:allegiance swo:GalacticEmpire ; cga:militaryRank "Grand Moff" .

swc:Greedo a cga:Character ; rdfs:label "Greedo"@en ; cga:species "Rodian" ;
    schema:description "Rodian bounty hunter; works for Jabba the Hutt"@en .

swc:JabbaTheHutt a cga:Character ; rdfs:label "Jabba the Hutt"@en ; cga:species "Hutt" ;
    schema:description "Hutt crime lord; Han Solo's creditor"@en .

swc:Biggs a cga:Character ; rdfs:label "Biggs Darklighter"@en ;
    schema:description "Luke's childhood friend; Rebel X-Wing pilot Red Three"@en ;
    cga:allegiance swo:RebelAlliance .

swc:Wedge a cga:Character ; rdfs:label "Wedge Antilles"@en ;
    schema:description "Rebel X-Wing pilot Red Two"@en ; cga:allegiance swo:RebelAlliance .

swc:Tagge a cga:Character ; rdfs:label "General Tagge"@en ;
    cga:allegiance swo:GalacticEmpire ; cga:militaryRank "General" .

swc:Motti a cga:Character ; rdfs:label "Admiral Motti"@en ;
    cga:allegiance swo:GalacticEmpire ; cga:militaryRank "Admiral" .

swc:Praji a cga:Character ; rdfs:label "Commander Praji"@en ;
    cga:allegiance swo:GalacticEmpire ; cga:militaryRank "Commander" .

swc:RedLeader   a cga:Character ; rdfs:label "Red Leader"@en ; cga:allegiance swo:RebelAlliance .
swc:GoldLeader  a cga:Character ; rdfs:label "Gold Leader"@en ; cga:allegiance swo:RebelAlliance .
swc:Porkins     a cga:Character ; rdfs:label "Jek Porkins (Red Six)"@en ; cga:allegiance swo:RebelAlliance .
swc:Stormtroopers a cga:Character ; rdfs:label "Imperial Stormtroopers"@en ; cga:allegiance swo:GalacticEmpire .

swc:Jawas a cga:Character ; rdfs:label "Jawas"@en ; cga:species "Jawa" ;
    schema:description "Small Tatooine desert scavengers; trade in droids"@en .

swc:RebelSoldiers a cga:Character ; rdfs:label "Rebel Soldiers"@en ; cga:allegiance swo:RebelAlliance .
swc:Sandpeople a cga:Character ; rdfs:label "Tusken Raiders"@en ; cga:species "Tusken Raider" .
swc:MosEisleyPatrons a cga:Character ; rdfs:label "Mos Eisley Cantina Patrons"@en .
swc:RebelPilots a cga:Character ; rdfs:label "Rebel Pilots (collective)"@en ; cga:allegiance swo:RebelAlliance .
swc:RebelYavinCommand a cga:Character ; rdfs:label "Yavin Rebel Command"@en ; cga:allegiance swo:RebelAlliance .
swc:Dianoga a cga:Character ; rdfs:label "Dianoga"@en ; cga:species "Dianoga" .

# ── Places ────────────────────────────────────────────────────────────────────
swp:TantiveIV           a cga:Place ; rdfs:label "Tantive IV interior"@en .
swp:Space               a cga:Place ; rdfs:label "Open Space"@en .
swp:Tatooine            a cga:Place ; rdfs:label "Tatooine"@en ; owl:sameAs wd:Q174162 .
swp:TatooineDesert      a cga:Place ; rdfs:label "Tatooine desert surface"@en ; schema:containedInPlace swp:Tatooine .
swp:JundlandWastes      a cga:Place ; rdfs:label "Jundland Wastes"@en ; schema:containedInPlace swp:Tatooine .
swp:MosEisley           a cga:Place ; rdfs:label "Mos Eisley"@en ; schema:containedInPlace swp:Tatooine .
swp:MosEisleyCantina    a cga:Place ; rdfs:label "Mos Eisley Cantina"@en ; schema:containedInPlace swp:MosEisley .
swp:DockingBay94        a cga:Place ; rdfs:label "Docking Bay 94"@en ; schema:containedInPlace swp:MosEisley .
swp:MosEisleyStreet     a cga:Place ; rdfs:label "Mos Eisley streets"@en ; schema:containedInPlace swp:MosEisley .
swp:LarsHomestead       a cga:Place ; rdfs:label "Lars Homestead"@en ; schema:containedInPlace swp:Tatooine .
swp:LarsGarage          a cga:Place ; rdfs:label "Lars Homestead Garage"@en ; schema:containedInPlace swp:LarsHomestead .
swp:BensHut             a cga:Place ; rdfs:label "Obi-Wan's Hut"@en ; schema:containedInPlace swp:Tatooine .
swp:DeathStar           a cga:Place ; rdfs:label "Death Star"@en .
swp:DSConferenceRoom    a cga:Place ; rdfs:label "Death Star Conference Room"@en ; schema:containedInPlace swp:DeathStar .
swp:DSDetentionBlock    a cga:Place ; rdfs:label "Detention Block AA-23"@en ; schema:containedInPlace swp:DeathStar .
swp:DSMainHangar        a cga:Place ; rdfs:label "Death Star Hangar Bay 327"@en ; schema:containedInPlace swp:DeathStar .
swp:DSGarbageMasher     a cga:Place ; rdfs:label "Garbage Masher 3263827"@en ; schema:containedInPlace swp:DeathStar .
swp:DSTrench            a cga:Place ; rdfs:label "Death Star Exhaust Trench"@en ; schema:containedInPlace swp:DeathStar .
swp:DSComputerRoom      a cga:Place ; rdfs:label "Death Star Computer Alcove"@en ; schema:containedInPlace swp:DeathStar .
swp:DSHangarDeck        a cga:Place ; rdfs:label "Death Star corridors and hangar approach"@en ; schema:containedInPlace swp:DeathStar .
swp:Alderaan            a cga:Place ; rdfs:label "Alderaan"@en .
swp:AlderaanSystem      a cga:Place ; rdfs:label "Alderaan System (debris field)"@en .
swp:MillenniumFalconInterior a cga:Place ; rdfs:label "Millennium Falcon interior"@en .
swp:YavinBase           a cga:Place ; rdfs:label "Yavin 4 — Rebel Base"@en .
swp:YavinOrbit          a cga:Place ; rdfs:label "Yavin orbit / battlespace"@en .

# ── Vehicles ──────────────────────────────────────────────────────────────────
swv:TantiveIV        a cga:Vehicle ; rdfs:label "Tantive IV"@en ; cga:operator swc:PrincessLeia .
swv:StarDestroyer    a cga:Vehicle ; rdfs:label "Star Destroyer (Devastator)"@en ; cga:allegiance swo:GalacticEmpire .
swv:MillenniumFalcon a cga:Vehicle ; rdfs:label "Millennium Falcon"@en ; cga:captain swc:HanSolo .
swv:EscapePod        a cga:Vehicle ; rdfs:label "Escape Pod"@en .
swv:JawaSandcrawler  a cga:Vehicle ; rdfs:label "Jawa Sandcrawler"@en ; cga:operator swc:Jawas .
swv:LandSpeeder      a cga:Vehicle ; rdfs:label "Luke's Landspeeder"@en ; cga:operator swc:LukeSkywalker .
swv:TIEFighter       a cga:Vehicle ; rdfs:label "TIE Fighter"@en ; cga:allegiance swo:GalacticEmpire .
swv:XWing            a cga:Vehicle ; rdfs:label "X-Wing Starfighter"@en ; cga:allegiance swo:RebelAlliance .
swv:YWing            a cga:Vehicle ; rdfs:label "Y-Wing Starfighter"@en ; cga:allegiance swo:RebelAlliance .
swv:VadersTIE        a cga:Vehicle ; rdfs:label "Darth Vader's TIE Advanced"@en ; cga:operator swc:DarthVader .

# ── Organisations ─────────────────────────────────────────────────────────────
swo:GalacticEmpire  a cga:Organization ; rdfs:label "Galactic Empire"@en .
swo:RebelAlliance   a cga:Organization ; rdfs:label "Rebel Alliance"@en .
swo:ImperialSenate  a cga:Organization ; rdfs:label "Imperial Senate"@en ; cga:status "Dissolved" .
swo:HuttOrg         a cga:Organization ; rdfs:label "Jabba's Organization"@en .
swo:JediOrder       a cga:Organization ; rdfs:label "Jedi Order"@en ; cga:status "Nearly extinct" .

# ── Concepts ──────────────────────────────────────────────────────────────────
swx:TheForce    a cga:ConceptEntity ; rdfs:label "The Force"@en ; owl:sameAs wd:Q131566 .
swx:DarkSide    a cga:ConceptEntity ; rdfs:label "Dark Side"@en ; skos:broader swx:TheForce .
swx:LightSide   a cga:ConceptEntity ; rdfs:label "Light Side"@en ; skos:broader swx:TheForce .
swx:CloneWars   a cga:ConceptEntity ; rdfs:label "Clone Wars"@en .
swx:OldRepublic a cga:ConceptEntity ; rdfs:label "The Old Republic"@en .

# ── Physical Objects ──────────────────────────────────────────────────────────
swobj:DeathStarPlans a cga:PhysicalObject ;
    rdfs:label "Death Star Technical Readouts"@en ;
    schema:description "Plans containing 2m exhaust port weakness"@en ;
    cga:storedIn swc:R2D2 .

swobj:LeiaHologram a cga:PhysicalObject ;
    rdfs:label "Leia's holographic message to Obi-Wan"@en ; cga:storedIn swc:R2D2 .

swobj:AnakinLightsaber a cga:PhysicalObject ;
    rdfs:label "Anakin Skywalker's lightsaber (blue)"@en .

swobj:RestrainingBolt a cga:PhysicalObject ;
    rdfs:label "R2-D2's restraining bolt"@en .

swobj:ProtonTorpedoes a cga:PhysicalObject ;
    rdfs:label "Proton Torpedoes"@en .

swobj:ExhaustPort a cga:PhysicalObject ;
    rdfs:label "Thermal Exhaust Port"@en ;
    schema:description "2-metre target; Death Star's critical weakness"@en .

swobj:HomingBeacon a cga:PhysicalObject ;
    rdfs:label "Imperial homing beacon"@en .

What’s significant here is that even before you have the context graph, you have a very serviceable knowledge graph. The knowledge graph declares what things are, while the context graph indicates how they evolve.

The movie holon holds the whole thing together:

# ═══════════════════════════════════════════════════════════════════════════════
# MOVIE HOLON SPINE
# ═══════════════════════════════════════════════════════════════════════════════

sw:movie-anh a cga:MovieHolon ;
    rdfs:label "Star Wars Episode IV: A New Hope"@en ;
    schema:dateCreated "1977-05-25"^^xsd:date ; schema:director "George Lucas" ;
    cga:hasInteriorGraph   sw:movie-int ;
    cga:hasBoundaryGraph   sw:shacl-shapes ;
    cga:hasProjectionGraph sw:movie-prj ;
    cga:hasContextGraph    sw:movie-ctx ;
    cga:containsHolon
        sw:s001, sw:s002, sw:s003, sw:s004, sw:s005, sw:s006, sw:s007,
        sw:s008, sw:s009, sw:s010, sw:s011, sw:s012, sw:s013, sw:s014,
        sw:s015, sw:s016, sw:s017, sw:s018, sw:s019, sw:s020, sw:s021,
        sw:s022, sw:s023, sw:s024, sw:s025, sw:s026, sw:s027 .

GRAPH sw:movie-int {
    swo:GalacticEmpire schema:adversary swo:RebelAlliance .
    swobj:DeathStarPlans cga:targettedBy swo:GalacticEmpire ; cga:soughtBy swo:RebelAlliance .
    swx:TheForce schema:description "An energy field created by all living things; surrounds and binds the galaxy"@en .
}
GRAPH sw:movie-prj {
    sw:movie-anh
        cga:narrativeFunction "Hero's journey: farm boy becomes saviour; mentor sacrifice; Force awakening"@en ;
        cga:thematicContent "Hope vs tyranny; destiny; found family; power of belief"@en .
}
GRAPH sw:movie-ctx {
    sw:movie-anh prov:wasAttributedTo sw:Analyst ;
        cga:dataSource "Star Wars Episode IV: A New Hope — full transcript"^^xsd:string .
}

Note that the context named graph (sw:movie-ctx) refers back to the movie holon (sw:movie-anh), which is the subject of everything the context graph contains. The holon in turn identifies its interior, boundary, and projection graphs, points back to the context graph itself, and then lists the scene holons for scenes 1 to 27.

A typical scene graph (here for scene #2) follows a similar structure, with the spine of the scenes going through the holons, not the graphs.

# ═══════════════════════════════════════════════════════════════════════════════
# S002 — TANTIVE IV CORRIDOR BATTLE
# ═══════════════════════════════════════════════════════════════════════════════
sw:s002 a cga:SceneHolon ; rdfs:label "S002 — Tantive IV Corridor Battle"@en ;
    cga:sequenceNumber 2 ; cga:primaryLocation swp:TantiveIV ;
    cga:partOf sw:movie-anh ;
    cga:precedingHolon sw:s001 ; cga:followingHolon sw:s003 ;
    cga:hasInteriorGraph sw:s002-int ; cga:hasBoundaryGraph sw:s002-bnd ;
    cga:hasProjectionGraph sw:s002-prj ; cga:hasContextGraph sw:s002-ctx .

GRAPH sw:s002-int {
    swevt:s02a a cga:StateChange ;
        cga:content "Star Destroyer overtakes Tantive IV; reactor hit; main systems failing"@en ;
        cga:actor swv:StarDestroyer ; cga:patient swv:TantiveIV ;
        cga:estimatedTimestamp "00:02:00" .

    swevt:s02u01 a cga:Utterance ;
        cga:speaker swc:C3PO ; cga:addressee swc:R2D2 ;
        cga:content "They shut down the main reactor. We'll be destroyed for sure. This is madness. We're doomed. There'll be no escape for the princess this time."@en ;
        cga:hasEmotionalState swann:Emo_Fear, swann:Emo_Despair ;
        cga:estimatedTimestamp "00:02:15" ;
        cga:inResponseTo swevt:s02a ~ swevt:r02u01 {|
            cga:confidence "0.98"^^xsd:decimal ;
            cga:assessedBy sw:Analyst ;
            cga:rationale "C-3PO directly responds to reactor shutdown announcement"@en
        |} .

    swevt:s02b a cga:Action ;
        cga:actor swc:Stormtroopers ;
        cga:content "Stormtroopers breach airlock; corridor firefight with Rebel soldiers begins"@en ;
        cga:influencedBy swevt:s02a ~ swevt:r02b {|
            cga:confidence "1.0"^^xsd:decimal ; cga:assessedBy sw:Analyst
        |} .

    swevt:s02c a cga:Action ;
        cga:actor swc:PrincessLeia ;
        cga:content "Leia records holographic plea to Obi-Wan Kenobi; loads Death Star plans into R2-D2"@en ;
        cga:consequences swobj:LeiaHologram ;
        cga:hasMotivation swann:Mot_Duty ; cga:hasEmotionalState swann:Emo_Determination ;
        cga:estimatedTimestamp "00:03:10" .

    swevt:s02d a cga:Action ;
        cga:actor swc:DarthVader ;
        cga:content "Vader enters Tantive IV over bodies of Rebel soldiers; demands Death Star plans"@en ;
        cga:estimatedTimestamp "00:04:00" .

    swevt:s02u02 a cga:Utterance ;
        cga:speaker swc:DarthVader ; cga:addressee swc:RebelSoldiers ;
        cga:content "Where are those transmissions you intercepted? What have you done with those plans?"@en ;
        cga:hasMotivation swann:Mot_Intel ; cga:hasEmotionalState swann:Emo_Anger ;
        cga:inResponseTo swevt:s02d ~ swevt:r02u02 {|
            cga:confidence "1.0"^^xsd:decimal ; cga:assessedBy sw:Analyst
        |} .

    swevt:s02e a cga:PhysicalAction ;
        cga:actor swc:Stormtroopers ; cga:patient swc:PrincessLeia ;
        cga:content "Stormtroopers stun and capture Princess Leia"@en ;
        cga:influencedBy swevt:s02u02 ~ swevt:r02e {|
            cga:confidence "1.0"^^xsd:decimal ; cga:assessedBy sw:Analyst
        |} ;
        cga:estimatedTimestamp "00:05:00" .

    swevt:s02u03 a cga:Utterance ;
        cga:speaker swc:PrincessLeia ; cga:addressee swc:DarthVader ;
        cga:content "Darth Vader. Only you could be so bold. The Imperial Senate will not sit still for this."@en ;
        cga:hasEmotionalState swann:Emo_Indignation, swann:Emo_Defiance ;
        cga:estimatedTimestamp "00:05:30" .

    swevt:s02u04 a cga:Utterance ;
        cga:speaker swc:DarthVader ; cga:addressee swc:PrincessLeia ;
        cga:content "You weren't on any mercy mission. You are part of the Rebel Alliance and a traitor. Take her away!"@en ;
        cga:hasMotivation swann:Mot_Intel ;
        cga:inResponseTo swevt:s02u03 ~ swevt:r02u04 {|
            cga:confidence "1.0"^^xsd:decimal ; cga:assessedBy sw:Analyst
        |} .

    swevt:s02u05 a cga:Utterance ;
        cga:speaker swc:Praji ; cga:addressee swc:DarthVader ;
        cga:content "The battle station plans are not aboard this ship; no transmissions were made. An escape pod was jettisoned — no life-forms aboard."@en ;
        cga:estimatedTimestamp "00:07:00" .

    swevt:s02u06 a cga:Utterance ;
        cga:speaker swc:DarthVader ;
        cga:content "She must have hidden the plans in the escape pod. Send a detachment down to retrieve them. There'll be no one to stop us this time."@en ;
        cga:hasMotivation swann:Mot_Intel ;
        cga:inResponseTo swevt:s02u05 ~ swevt:r02u06 {|
            cga:confidence "1.0"^^xsd:decimal ; cga:assessedBy sw:Analyst
        |} .

    swevt:s02f a cga:Action ;
        cga:actor swc:R2D2, swc:C3PO ;
        cga:content "Droids escape in escape pod; jettisoned to Tatooine; no life-form reading prevents interception"@en ;
        cga:influencedBy swevt:s02c ~ swevt:r02f {|
            cga:confidence "0.97"^^xsd:decimal ;
            cga:assessedBy sw:Analyst ;
            cga:rationale "Droids escape pod launch directly follows Leia loading plans into R2"@en
        |} ;
        cga:estimatedTimestamp "00:06:30" .
}
GRAPH sw:s002-bnd {
    swevt:e002a a cga:EntityEntry ; cga:entity swc:C3PO ; cga:isFirstAppearance true ; cga:entryMode "scene-open" .
    swevt:e002b a cga:EntityEntry ; cga:entity swc:R2D2 ; cga:isFirstAppearance true ;
        cga:entryMode "scene-open" ; cga:carriedObject swobj:DeathStarPlans, swobj:LeiaHologram .
    swevt:e002c a cga:EntityEntry ; cga:entity swc:DarthVader ; cga:isFirstAppearance true ; cga:entryMode "dramatic-entrance" .
    swevt:e002d a cga:EntityEntry ; cga:entity swc:PrincessLeia ; cga:isFirstAppearance true .
    swevt:e002e a cga:EntityEntry ; cga:entity swc:Stormtroopers ; cga:isFirstAppearance true ; cga:entryMode "breach" .
    swevt:e002f a cga:EntityEntry ; cga:entity swc:RebelSoldiers ; cga:isFirstAppearance true .
    swevt:e002g a cga:EntityEntry ; cga:entity swc:Praji ; cga:isFirstAppearance true .
    swevt:e002h a cga:EntityEntry ; cga:entity swv:TantiveIV ; cga:isFirstAppearance true .
    swevt:e002i a cga:EntityEntry ; cga:entity swv:StarDestroyer ; cga:isFirstAppearance true .
    swevt:x002a a cga:EntityExit ; cga:entity swc:C3PO ; cga:destinationHolon sw:s003 ; cga:entryMode "escape-pod" .
    swevt:x002b a cga:EntityExit ; cga:entity swc:R2D2 ; cga:destinationHolon sw:s003 ;
        cga:entryMode "escape-pod" ; cga:carriedObject swobj:DeathStarPlans, swobj:LeiaHologram .
    swevt:x002c a cga:EntityExit ; cga:entity swc:PrincessLeia ; cga:entryMode "prisoner — detained aboard Devastator" .
}
GRAPH sw:s002-prj {
    sw:s002 cga:narrativeFunction "Introduces droids, Vader, Leia; MacGuffin transferred to R2; Empire establishes menace"@en ;
        cga:thematicContent "Imperial power; hope through unlikely vessels"@en ;
        cga:establishes swc:C3PO, swc:R2D2, swc:DarthVader, swc:PrincessLeia ;
        cga:prerequisiteFor sw:s003 .
}
GRAPH sw:s002-ctx {
    sw:s002 prov:wasAttributedTo sw:Analyst ;
        rdfs:comment "All causal reifications annotated inline. Context graph carries scene provenance only."@en .
}

The interior graph for the scene contains the dialogue and actions for that scene and, where a trigger can be identified, indicates what each utterance or action was in response to. Where it can be inferred from the utterance text, the graph also records the speaker's emotional state and motivation. You can unwind any given conversation by following the cga:inResponseTo thread in reverse order. This can often be more useful than just following timestamps, particularly since there are four distinct conversations here that overlap.
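The reverse traversal can be sketched in a few lines of Python. This is a minimal illustration, assuming the cga:inResponseTo and cga:influencedBy edges of scene S002 have already been extracted into a plain dictionary; the helper name `unwind` is hypothetical and not part of the model.

```python
# Response and influence edges copied from sw:s002-int above.
# Each key points at the event it responds to or was influenced by.
RESPONDS_TO = {
    "swevt:s02u01": "swevt:s02a",    # cga:inResponseTo
    "swevt:s02b": "swevt:s02a",      # cga:influencedBy
    "swevt:s02u02": "swevt:s02d",    # cga:inResponseTo
    "swevt:s02e": "swevt:s02u02",    # cga:influencedBy
    "swevt:s02u04": "swevt:s02u03",  # cga:inResponseTo
    "swevt:s02u06": "swevt:s02u05",  # cga:inResponseTo
    "swevt:s02f": "swevt:s02c",      # cga:influencedBy
}


def unwind(event, edges):
    """Walk back from an event to the root of its thread (hypothetical helper)."""
    thread = [event]
    seen = {event}
    while thread[-1] in edges:
        parent = edges[thread[-1]]
        if parent in seen:  # guard against accidental cycles in the data
            break
        thread.append(parent)
        seen.add(parent)
    return thread


print(unwind("swevt:s02e", RESPONDS_TO))
```

Starting from Leia's capture (swevt:s02e), the walk passes through Vader's demand (swevt:s02u02) and ends at his boarding of the ship (swevt:s02d), independent of how the timestamps were estimated.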

The boundary graph records when characters cross into and out of the holon. This matters because characters always carry significance in a scene: by tracing when they enter or leave, you can frequently ascertain the significance of the scene itself. This information is in fact used to derive the projection graph, which, from the perspective of the movie, is the abstract, or summary representation, of the scene as an entity within the broader context of the movie itself.
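As a rough sketch of how boundary events can be put to work, the following Python treats the S002 entries and exits as plain data. The entity lists are copied from sw:s002-bnd; the function names are hypothetical.

```python
from collections import defaultdict

# Entities entering S002, in the order listed in sw:s002-bnd.
ENTRIES = [
    "swc:C3PO", "swc:R2D2", "swc:DarthVader", "swc:PrincessLeia",
    "swc:Stormtroopers", "swc:RebelSoldiers", "swc:Praji",
    "swv:TantiveIV", "swv:StarDestroyer",
]

# Entities leaving S002, mapped to cga:destinationHolon (None when not stated).
EXITS = {
    "swc:C3PO": "sw:s003",
    "swc:R2D2": "sw:s003",
    "swc:PrincessLeia": None,
}


def handoffs(exits):
    """Group exiting entities by the holon they move to."""
    by_destination = defaultdict(list)
    for entity, destination in exits.items():
        if destination is not None:
            by_destination[destination].append(entity)
    return dict(by_destination)


def still_present(entries, exits):
    """Entities that entered the scene and have no recorded exit."""
    return [entity for entity in entries if entity not in exits]


print(handoffs(EXITS))
print(still_present(ENTRIES, EXITS))
```

The handoff map shows at a glance that the two droids are what S002 passes forward to S003, which is the same fact the projection graph summarises as the transfer of the plans to R2.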

Additionally, the context graph in this scene simply notes that annotations are stored inline with the utterances rather than being decomposed into the context graph. Both approaches work and are valid.

The linear spine used here — cga:precedingHolon / cga:followingHolon — reflects Star Wars’ sequential narrative structure. This works cleanly for stories and logs where there is a single thread of primary concern. For domains where holons run in parallel — simultaneous scenes, concurrent supply chain processes, multi-team organizational decisions — the linear chain is insufficient.

The natural extension is to treat the cga:containsHolon relationship on the parent holon as a partial order rather than a sequence: scene holons at the same cga:sequenceNumber are understood as concurrent rather than sequential. A cga:synchronisesAt property can then express the points where parallel holons rejoin — the equivalent of a cinematic crosscut resolving to a shared scene. This is left as an open design point in the current model, but worth anticipating if you are considering applying this architecture to anything more parallelized than a linear narrative.
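One way to picture the partial order is to group holons by their sequence number, so that each group is a set of concurrent holons. The sketch below assumes the cga:sequenceNumber values have been read into a dictionary; the `ex:` holon names are invented for illustration, since the Star Wars graph itself is strictly linear.

```python
from collections import defaultdict


def concurrency_batches(sequence_numbers):
    """Group holons that share a sequence number into concurrent batches."""
    batches = defaultdict(list)
    for holon, number in sequence_numbers.items():
        batches[number].append(holon)
    # Batches are ordered by sequence number; members within a batch are unordered,
    # so they are sorted only to make the output stable.
    return [sorted(batches[number]) for number in sorted(batches)]


# Hypothetical holons: two parallel scenes share sequence number 2.
SEQUENCE = {
    "ex:s001": 1,
    "ex:s002a": 2,
    "ex:s002b": 2,
    "ex:s003": 3,
}

print(concurrency_batches(SEQUENCE))
```

In this toy data ex:s003 plays the role a cga:synchronisesAt target would play: it is the first holon that both parallel branches must precede.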

Moreover, timestamps in this graph come in two varieties, and it’s important to distinguish them. Where a source document provides a timestamp directly, cga:timestamp carries that value. Where a timestamp has been estimated — by an LLM or human analyst working from utterance length, scene pacing, or contextual inference — cga:estimatedTimestamp is used instead, and ideally carries a reified confidence annotation:


swevt:s02u01 cga:estimatedTimestamp "00:02:15"
    ~ swevt:r02u01-ts {|
        cga:confidence "0.7"^^xsd:decimal ;
        cga:assessedBy sw:Analyst ;
        cga:rationale "Estimated from approximate utterance length and scene pacing"@en
    |} .

The confidence here is intentionally lower than the causal attributions — the estimation method is genuinely less reliable than direct textual evidence of causation.
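A consumer of the graph will usually want a single rule for choosing between the two properties. The following is a minimal sketch, assuming each event has been flattened into a dictionary and that the reified confidence has been copied into a `confidence` key; both the flattening and the function name are assumptions made for illustration.

```python
def resolve_timestamp(event):
    """Return (value, kind, confidence) for an event's best available timestamp."""
    if "cga:timestamp" in event:
        # Taken directly from the source document, so no confidence is attached.
        return event["cga:timestamp"], "source", None
    if "cga:estimatedTimestamp" in event:
        return event["cga:estimatedTimestamp"], "estimated", event.get("confidence")
    return None, "missing", None


# swevt:s02u01 as annotated above.
estimated = {"cga:estimatedTimestamp": "00:02:15", "confidence": 0.7}
# Hypothetical case in which the source document supplied the time itself.
sourced = {"cga:timestamp": "00:02:15"}

print(resolve_timestamp(estimated))
print(resolve_timestamp(sourced))
```

Keeping the kind alongside the value lets downstream output mark estimated times explicitly rather than presenting them as fact.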

Querying the Graph

There are pros and cons to the holonic approach. On the plus side, this approach creates a very useful separation of concerns; there is a very natural encapsulation with context graphs that can work well in capturing data and keeping it in discrete, manageable, traceable, functional subgraphs. It also handles data log capture especially well.

On the minus side, this approach adds some complexity to the model, especially since working with graph handles often means losing a certain amount of easy transitive closure. This isn’t necessarily a major hardship: the query to retrieve every utterance that Leia makes throughout the movie is not especially onerous:

PREFIX cga:   
PREFIX swc:   
PREFIX swevt: 
PREFIX sw:    
PREFIX rdfs:  
PREFIX xsd:   

SELECT
    ?scene
    ?sceneLabel
    ?timestamp
    ?addresseeLabel
    (GROUP_CONCAT(DISTINCT ?emoSafe ; separator=" · ") AS ?emotions)
    ?content
WHERE {
    GRAPH ?intGraph {
        ?utterance  a              cga:Utterance ;
                    cga:speaker    swc:PrincessLeia ;
                    cga:content    ?content .
        OPTIONAL { ?utterance cga:estimatedTimestamp ?timestamp . }
        OPTIONAL { ?utterance cga:addressee          ?addr . }
        OPTIONAL {
            ?utterance cga:hasEmotionalState ?emo .
            ?emo  a cga:EmotionalState ;
                  rdfs:label ?emoLabel .
        }
        BIND(COALESCE(?emoLabel, "") AS ?emoSafe)
    }
    ?scene cga:hasInteriorGraph ?intGraph ;
           rdfs:label ?sceneLabel .
    OPTIONAL { ?addr a cga:Character ; rdfs:label ?addresseeLabel . }
}
GROUP BY ?scene ?sceneLabel ?timestamp ?addresseeLabel ?content
ORDER BY ?scene ?timestamp

In this case, the query matches Leia's utterances inside each interior graph, then joins back to the owning scene holon through cga:hasInteriorGraph to pick up the scene and its label. It also combines all emotional markers for an utterance into a single field.

Note: ?scene is included in GROUP BY to ensure correct aggregation across multiple interior graphs that may share scene labels; it appears in the SELECT for completeness even if downstream display suppresses the raw IRI.

For the above example, the output then looks like the following:

The above query worked, but the initial run highlighted two things. First, the timestamps were estimates rather than values drawn from the source, which lacked them. Second, the addressee required additional analysis to infer, since it was not obvious even after a first-pass analysis. Leia tended to make snarky comments that were not clearly directed toward any one character, so this fell into a gray zone where the LLM needed additional clarity to answer fully.

Before moving on, it’s worth being precise about what these two query modes actually are, because they serve different purposes and should not be conflated.

The SPARQL query above requires a SPARQL-capable runtime — a triple store, an in-memory quad store like Apache Jena, or a SPARQL-over-file tool like SPARQL Anything. It is deterministic, formally correct, and reproducible. If the graph is valid, the query returns exactly the right rows every time.

What follows is something different: using the TriG document itself, unparsed, as a direct input to an LLM prompt. This works because the normalised, self-describing structure of RDF makes the document largely legible to a language model without a query engine acting as intermediary. The cost is that you give up formal precision in exchange for inferential flexibility: the LLM can answer questions that would require complex multi-hop SPARQL to express, but the answers carry statistical rather than logical certainty.

Neither mode replaces the other. The claim that “not once did I put content into a triple store” applies to the LLM query path — and that is still a meaningful observation, because a large and growing category of useful questions can now be answered against a serialised knowledge graph without a data infrastructure dependency. The SPARQL path remains essential when you need exact, auditable, reproducible answers.

Two Query Modes, Two Guarantees
SPARQL over a triple store: deterministic, auditable, requires infrastructure.
LLM over a TriG document: inferential, flexible, requires only a file and a model.
Both appear in this article. They are complementary, not interchangeable.
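For the LLM path, the only code needed is whatever wraps the serialised graph and the question into one prompt. The sketch below shows one possible shape; the instruction wording and the `<graph>` delimiters are assumptions, not a format prescribed by any model vendor, and no model call is made.

```python
def build_prompt(trig_text, question):
    """Wrap an unparsed TriG document and a question into a single prompt string."""
    return (
        "The following is an RDF dataset in TriG syntax. "
        "Answer using only what it states, and say so when the graph is silent.\n\n"
        "<graph>\n" + trig_text.strip() + "\n</graph>\n\n"
        "Question: " + question.strip()
    )


# A one-triple stand-in for the full context graph.
sample = 'swc:HanSolo a cga:Character ; rdfs:label "Han Solo"@en .'
prompt = build_prompt(sample, "Who is listed as a character?")
print(prompt)
```

The resulting string can be passed to whichever model client is in use; nothing here depends on a triple store or a parser.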

This has been one of the big gating factors in adopting a straight knowledge graph approach: you either needed to query the KG to get tabular content (which loses connections) or you needed complex queries to isolate content within the triple store itself. In the context graph case above, however, you have a self-contained knowledge graph that can still be validated (perhaps in an offline process) then cached or persisted for retrieval, without necessarily putting it into a triple store. You still might want to do that for analytics purposes, mind you, but that becomes a separate problem from using this data with an LLM.

This also helps to resolve another problem. Context graphs are essentially append-only logs. Logs inevitably fill up, and consequently older data needs to be either dropped or archived. If you were to take a metadata snapshot of a contained context graph and persist that with a URL link to the resource in question, then retrieving a specific log file becomes trivial.

This approach works cleanly at the scale of a single media property or a bounded project context, where the total TriG document fits comfortably within an LLM’s context window — currently in the range of 100K to 200K tokens for frontier models, which accommodates roughly 500–800KB of dense RDF. Beyond that threshold, you need a retrieval layer.
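The threshold can be checked before a document is ever sent to a model. The figures above imply roughly four to five bytes of dense RDF per token; the sketch below takes 4.0 as its default on that basis, which is an assumption rather than a measured value, and the reserve for the question and the answer is likewise a placeholder.

```python
def estimated_tokens(num_bytes, bytes_per_token=4.0):
    """Rough token count for a serialised graph of the given size."""
    return int(num_bytes / bytes_per_token)


def fits_in_context(num_bytes, context_tokens=200_000,
                    reserve_tokens=20_000, bytes_per_token=4.0):
    """True if the document plus a reserve for prompt and answer fits the window."""
    needed = estimated_tokens(num_bytes, bytes_per_token) + reserve_tokens
    return needed <= context_tokens


print(fits_in_context(600_000))  # a 600 KB TriG file
print(fits_in_context(800_000))  # an 800 KB TriG file
```

A real deployment would replace the byte heuristic with the tokenizer of the model actually in use, since prefixed names and IRIs tokenize unevenly.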

The natural retrieval unit is the holon itself. Because each scene holon is a self-contained named graph cluster with a known IRI and a projection graph summarizing its narrative function, you can build a lightweight metadata index — essentially a SKOS-style summary graph containing only the projection graphs and entity boundary events from each holon — that fits in context and serves as a routing document. An agentic service can query that index to identify which holon files are relevant, then load only those. This is a form of RDF-native RAG (Retrieval-Augmented Generation), and the holonic structure makes it considerably cleaner than chunking prose: your retrieval boundaries are semantically meaningful rather than arbitrary.
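A routing index of this kind can be prototyped without any RDF tooling. The sketch below scores holons by keyword overlap between a question and each projection summary; the sw:s002 summary is taken from the scene above, while the two `ex:` entries, the stopword list, and the scoring rule are all illustrative assumptions. An agentic service would more plausibly let the LLM read the index directly.

```python
import re

STOPWORDS = {"the", "to", "a", "of", "and", "for", "who", "what"}

INDEX = {
    # From sw:s002-prj above.
    "sw:s002": "Introduces droids, Vader, Leia; MacGuffin transferred to R2; "
               "Empire establishes menace",
    # Hypothetical summaries, invented for this example.
    "ex:s-cantina": "Han Solo and Chewbacca hired for passage to Alderaan",
    "ex:s-trench": "Luke trusts Force; exhaust port hit; Death Star destroyed",
}


def keywords(text):
    """Lowercased word set with common stopwords removed."""
    return set(re.findall(r"[a-z0-9]+", text.lower())) - STOPWORDS


def route(question, index, limit=2):
    """Return up to `limit` holon IRIs whose summaries share words with the question."""
    wanted = keywords(question)
    scored = [(len(wanted & keywords(summary)), holon)
              for holon, summary in index.items()]
    hits = [pair for pair in scored if pair[0] > 0]
    hits.sort(key=lambda pair: (-pair[0], pair[1]))
    return [holon for _, holon in hits[:limit]]


print(route("Who introduces the droids?", INDEX))
print(route("What happened to the Death Star exhaust port?", INDEX))
```

Only the holon files named in the result need to be loaded into the prompt, which keeps the retrieval boundary aligned with the scene boundary.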

For very large deployments — supply chain event logs, continuous organizational decision records — you will almost certainly need a triple store for analytics alongside the document-as-blob approach for LLM consumption. These are not competing architectures; they serve different query patterns.

For instance, you can create separate context logs for each media project in the Star Wars universe and have an agentic service load the relevant files at query time.

The principal downside to this approach is that you become more dependent upon the LLM “filling in the blanks” and potentially hallucinating data. One way to mitigate this is to use the context annotations to mark what is “known” data coming from a curated source, and to flag anything without a clear provenance trail as such in the output. That is already the situation above, where the LLM was asked to fill in the blanks while generating the content from the transcript in the first place (there are no timestamps in the source, for instance, so these had to be estimated).
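The labelling rule can be made concrete with a small sketch. It assumes each statement has been flattened into a dictionary with an optional `source` key for curated provenance and an optional `confidence` key copied from its reifier; those key names and the three labels are assumptions for illustration.

```python
def provenance_label(statement):
    """Classify a statement by how well its origin is documented."""
    if statement.get("source"):
        return "curated"
    if statement.get("confidence") is not None:
        return "inferred (confidence {:.2f})".format(statement["confidence"])
    return "unverified"


STATEMENTS = [
    # Hypothetical source name; the graph above does not say where cast data came from.
    {"text": "Leia is portrayed by Carrie Fisher", "source": "cast list"},
    {"text": "swevt:s02u01 occurs at 00:02:15", "confidence": 0.7},
    {"text": "Leia addresses Vader in swevt:s02u03"},
]

for statement in STATEMENTS:
    print(provenance_label(statement), "|", statement["text"])
```

Attaching the label to each row of the output makes it visible to a reader which answers rest on curated data and which on estimation.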

The other advantage of using the context graph as a contained source is that you can pose queries as LLM prompts that would be difficult or impossible to express in SPARQL. For instance,

Given the attached context graph, identify the decisions made by Darth Vader, 
Obi-Wan Kenobi and Luke Skywalker, and their consequences on the direction of the movie, 
as a table

After a bit of spinning, this generates the following:

This is at the heart of a decision support system: examining the logs to find when decisions were made, what the motivating factors were in making those decisions, and what consequences followed. The data to do this was inherent within the context graph, but writing the equivalent query in SPARQL would have been beyond most people, and the query would probably only be useful once. This is an area where LLMs do well.

Another unexpected consequence of this: retrospective patching. The LLM (Claude here) identified several points where additional content could have been inferred during the original generation but wasn’t; it was only obvious in retrospect that these needed to be there (the same could be said for timestamp estimates and addressee fields). This is a key aspect of learning: looking back at your mistakes and adjusting what is known as a consequence.

This information could be encoded as triples. I specifically asked Claude to surface just these new triples (which it did), then to apply them as a post-facto patch, below:

# ###############################################################################
# ANH HOLONIC CONTEXT GRAPH - DECISION PATCH
#
# Adds to the v2 graph:
#   (A) Additional  a cga:Decision  type on existing Action/StateChange events
#       whose actors are Vader, Obi-Wan, or Luke and that function as pivotal
#       choices in the narrative.
#   (B) Three new cga:Decision events whose corresponding choice was previously
#       expressed only through an Utterance or had no IRI at all.
#   (C) cga:consequences triples on the four existing Luke cga:Decision events
#       that were missing them.
#
# Each assertion in interior graphs carries an inline reifier annotation
# (confidence + rationale) per the Turtle 1.2 / RDF 1.2 pattern in v2.
#
# MERGE STRATEGY: Load this file into the same dataset as v2; graphs are
# additive - no existing triple is modified or retracted.
#
# Authors : Kurt Cagle / Claude Sonnet 4.6 - The Ontologist Newsletter
# Date    : 2026-03-23
# ###############################################################################

@prefix rdf:     .
@prefix rdfs:    .
@prefix owl:     .
@prefix xsd:     .
@prefix prov:    .

@prefix cga:     .
@prefix sw:      .
@prefix swc:     .
@prefix swp:     .
@prefix swv:     .
@prefix swobj:   .
@prefix swevt:   .
@prefix swann:   .

sw:Analyst a prov:Agent ;
    rdfs:label "Holonic Analyst (Cagle/Claude Sonnet 4.6) — decision patch"@en .

# ###############################################################################
# (A) TYPE ELEVATION  - existing IRIs; only new triples added
# ###############################################################################

# ## A1. S002 # Vader boards Tantive IV and demands plans (swevt:s02d) #########
GRAPH sw:s002-int {
    swevt:s02d a cga:Decision ;
        cga:consequences swv:EscapePod ~ swevt:r02d-conseq-a {|
            cga:confidence  "0.99"^^xsd:decimal ;
            cga:assessedBy  sw:Analyst ;
            cga:rationale   "Vader's seizure of the ship directly forces the droid escape-pod launch"@en
        |} ;
        cga:consequences swobj:DeathStarPlans ~ swevt:r02d-conseq-b {|
            cga:confidence  "1.0"^^xsd:decimal ;
            cga:assessedBy  sw:Analyst ;
            cga:rationale   "Plans remain with R2 because Vader fails to retrieve them before pod launches"@en
        |} .
}
GRAPH sw:s002-ctx {
    sw:s002 rdfs:comment
        "Decision patch: swevt:s02d elevated to cga:Decision; two consequences asserted."@en .
}

# ## A2. S010 # Vader Force-chokes Motti (swevt:s10a) #########################
GRAPH sw:s010-int {
    swevt:s10a a cga:Decision ;
        cga:consequences swx:TheForce ~ swevt:r10a-conseq {|
            cga:confidence  "0.95"^^xsd:decimal ;
            cga:assessedBy  sw:Analyst ;
            cga:rationale   "Demonstration cements Force as a real power within the Imperial command structure"@en
        |} .
}
GRAPH sw:s010-ctx {
    sw:s010 rdfs:comment
        "Decision patch: swevt:s10a elevated to cga:Decision; The Force noted as consequence."@en .
}

# ## A3. S009 # Obi-Wan presents Anakin's lightsaber (swevt:s09a) #############
#   Note: cga:consequences swobj:AnakinLightsaber already asserted in v2.
#   We add the withholding-of-truth as a second consequence.
GRAPH sw:s009-int {
    swevt:s09a a cga:Decision ;
        cga:consequences swc:LukeSkywalker ~ swevt:r09a-conseq {|
            cga:confidence  "0.98"^^xsd:decimal ;
            cga:assessedBy  sw:Analyst ;
            cga:rationale   "Presenting the lightsaber and framing Vader as betrayer seeds Luke's Jedi destiny; deliberate omission of Vader's true identity shapes Luke's motivation throughout the trilogy"@en
        |} .
}
GRAPH sw:s009-ctx {
    sw:s009 rdfs:comment
        "Decision patch: swevt:s09a elevated to cga:Decision; Luke's seeded destiny noted as consequence."@en .
}

# ## A4. S016 # Obi-Wan trains Luke with blast shield down (swevt:s16b) ########
GRAPH sw:s016-int {
    swevt:s16b a cga:Decision ;
        cga:consequences swevt:s27d ~ swevt:r16b-conseq {|
            cga:confidence  "0.92"^^xsd:decimal ;
            cga:assessedBy  sw:Analyst ;
            cga:rationale   "The blast-shield training establishes Luke's trust in the Force without visual input, directly enabling the targeting-computer switch-off at Yavin"@en
        |} .
}
GRAPH sw:s016-ctx {
    sw:s016 rdfs:comment
        "Decision patch: swevt:s16b elevated to cga:Decision; causal link to s27d noted."@en .
}

# ## A5. S023 # Obi-Wan allows himself to be struck down (swevt:s23c) ##########
GRAPH sw:s023-int {
    swevt:s23c a cga:Decision ;
        cga:consequences swevt:s27d ~ swevt:r23c-conseq-a {|
            cga:confidence  "0.97"^^xsd:decimal ;
            cga:assessedBy  sw:Analyst ;
            cga:rationale   "Obi-Wan's transcendence enables his disembodied voice to guide Luke at Yavin; sacrifice is instrumentally necessary for the Force-guided torpedo shot"@en
        |} ;
        cga:consequences swevt:s27g ~ swevt:r23c-conseq-b {|
            cga:confidence  "0.95"^^xsd:decimal ;
            cga:assessedBy  sw:Analyst ;
            cga:rationale   "Without Obi-Wan's post-death guidance the Death Star is not destroyed in this engagement"@en
        |} .
}
GRAPH sw:s023-ctx {
    sw:s023 rdfs:comment
        "Decision patch: swevt:s23c elevated to cga:Decision; consequences link to s27d and s27g."@en .
}

# ## A6. S005 # Luke removes R2-D2's restraining bolt (swevt:s05b) #############
GRAPH sw:s005-int {
    swevt:s05b a cga:Decision ;
        cga:consequences swobj:LeiaHologram ~ swevt:r05b-conseq-a {|
            cga:confidence  "1.0"^^xsd:decimal ;
            cga:assessedBy  sw:Analyst ;
            cga:rationale   "Removing the bolt enables the full hologram to play; partial hologram in restrained state only hinted at Obi-Wan's name"@en
        |} ;
        cga:consequences swevt:s05u05 ~ swevt:r05b-conseq-b {|
            cga:confidence  "1.0"^^xsd:decimal ;
            cga:assessedBy  sw:Analyst ;
            cga:rationale   "Full message triggers Luke's 'old Ben Kenobi' recognition — the call to adventure made explicit"@en
        |} .
}
GRAPH sw:s005-ctx {
    sw:s005 rdfs:comment
        "Decision patch: swevt:s05b elevated to cga:Decision; hologram and Luke's recognition as consequences."@en .
}

# ###############################################################################
# (B) NEW DECISION EVENTS - choices previously expressed only as utterances
#     or with no IRI at all
# ###############################################################################

# ## B1. S021 # Vader's decision to confront Obi-Wan alone ####################
#   The choice is surfaced through utterance swevt:s21u03 but was not typed
#   as a cga:Decision. New IRI: swevt:s21dec
GRAPH sw:s021-int {
    swevt:s21dec a cga:Decision ;
        cga:actor             swc:DarthVader ;
        cga:content           "Vader resolves to confront Obi-Wan personally and alone, breaking from the command briefing"@en ;
        cga:estimatedTimestamp "01:18:30" ;
        cga:hasMotivation     swann:Mot_Vengeance ;
        cga:hasEmotionalState swann:Emo_Determination ;
        cga:inResponseTo      swevt:s21u01 ~ swevt:r21dec-trigger {|
            cga:confidence  "0.97"^^xsd:decimal ;
            cga:assessedBy  sw:Analyst ;
            cga:rationale   "Vader's Force tremor perception (s21u01) directly motivates the decision to break and act"@en
        |} ;
        cga:consequences      swevt:s23b ~ swevt:r21dec-conseq-a {|
            cga:confidence  "1.0"^^xsd:decimal ;
            cga:assessedBy  sw:Analyst ;
            cga:rationale   "Vader going unescorted to the hangar is the direct precondition of the duel"@en
        |} ;
        cga:consequences      swevt:s23c ~ swevt:r21dec-conseq-b {|
            cga:confidence  "0.97"^^xsd:decimal ;
            cga:assessedBy  sw:Analyst ;
            cga:rationale   "Vader's personal presence in the hangar, not delegated to stormtroopers, creates the conditions for Obi-Wan's chosen sacrifice"@en
        |} .
}
GRAPH sw:s021-ctx {
    sw:s021 prov:wasAttributedTo sw:Analyst ;
        rdfs:comment "Decision patch: new swevt:s21dec captures Vader's choice to face Obi-Wan alone."@en .
}

# ## B2. S018 # Obi-Wan departs alone to disable the tractor beam #############
#   The choice was expressed in utterance swevt:s18u03 but the departure
#   action itself had no Decision-typed IRI.
GRAPH sw:s018-int {
    swevt:s18dec a cga:Decision ;
        cga:actor             swc:ObiWanKenobi ;
        cga:content           "Obi-Wan resolves to disable the tractor beam alone, explicitly separating his path from Luke's"@en ;
        cga:estimatedTimestamp "01:10:30" ;
        cga:hasMotivation     swann:Mot_Duty ;
        cga:hasEmotionalState swann:Emo_Acceptance ;
        cga:inResponseTo      swevt:s18b ~ swevt:r18dec-trigger {|
            cga:confidence  "1.0"^^xsd:decimal ;
            cga:assessedBy  sw:Analyst ;
            cga:rationale   "R2's discovery of the tractor-beam control locations (s18b) is the immediate prompt for Obi-Wan's decision"@en
        |} ;
        cga:consequences      swevt:s23c ~ swevt:r18dec-conseq-a {|
            cga:confidence  "0.96"^^xsd:decimal ;
            cga:assessedBy  sw:Analyst ;
            cga:rationale   "Obi-Wan's solitary mission path leads him to the hangar corridor where Vader is waiting; the sacrifice is a foreseeable consequence of choosing to go alone"@en
        |} ;
        cga:consequences      swevt:s14d ~ swevt:r18dec-conseq-b {|
            cga:confidence  "1.0"^^xsd:decimal ;
            cga:assessedBy  sw:Analyst ;
            cga:rationale   "Tractor beam disabled → Falcon can depart — proximate consequence of the mission succeeding"@en
        |} .
}
GRAPH sw:s018-ctx {
    sw:s018 prov:wasAttributedTo sw:Analyst ;
        rdfs:comment "Decision patch: new swevt:s18dec captures Obi-Wan's choice to go alone."@en .
}

# ## B3. S027 # Vader enters the trench personally in his TIE Advanced #########
#   No Action IRI existed for this choice in the v2 graph.
GRAPH sw:s027-int {
    swevt:s27vdec a cga:Decision ;
        cga:actor             swc:DarthVader ;
        cga:content           "Vader personally joins the trench battle in his TIE Advanced rather than directing from the Death Star command deck"@en ;
        cga:estimatedTimestamp "01:50:00" ;
        cga:hasMotivation     swann:Mot_Power ;
        cga:hasEmotionalState swann:Emo_Arrogance, swann:Emo_Determination ;
        cga:consequences      swevt:s27e ~ swevt:r27vdec-conseq {|
            cga:confidence  "0.98"^^xsd:decimal ;
            cga:assessedBy  sw:Analyst ;
            cga:rationale   "Vader in the trench is the direct target of Han Solo's return attack; had Vader remained on the Death Star, Han's intervention would not have cleared Luke's run"@en
        |} .
}
GRAPH sw:s027-ctx {
    sw:s027 prov:wasAttributedTo sw:Analyst ;
        rdfs:comment "Decision patch: new swevt:s27vdec captures Vader's choice to enter the trench personally."@en .
}

# ###############################################################################
# (C) CONSEQUENCES ON EXISTING LUKE DECISIONS - all four were missing this
# ###############################################################################

# ## C1. S009 # swevt:s09c # Luke refuses the call ############################
GRAPH sw:s009-int {
    swevt:s09c
        cga:consequences swevt:s11c ~ swevt:r09c-conseq {|
            cga:confidence  "0.99"^^xsd:decimal ;
            cga:assessedBy  sw:Analyst ;
            cga:rationale   "The refusal leaves Luke at the homestead; the destruction of the homestead (s011) is what converts the refusal into the acceptance — s09c is the necessary precondition of s11c"@en
        |} .
}

# ## C2. S011 # swevt:s11c # Luke accepts the call ############################
GRAPH sw:s011-int {
    swevt:s11c
        cga:consequences sw:s012 ~ swevt:r11c-conseq-a {|
            cga:confidence  "1.0"^^xsd:decimal ;
            cga:assessedBy  sw:Analyst ;
            cga:rationale   "Acceptance drives immediate departure toward Mos Eisley; all subsequent Death Star scenes depend on this threshold crossing"@en
        |} ;
        cga:consequences swevt:s14d ~ swevt:r11c-conseq-b {|
            cga:confidence  "1.0"^^xsd:decimal ;
            cga:assessedBy  sw:Analyst ;
            cga:rationale   "Departure from Tatooine aboard the Falcon is the direct downstream event"@en
        |} .
}

# ## C3. S019 # swevt:s19a # Luke devises the disguise plan ###################
GRAPH sw:s019-int {
    swevt:s19a
        cga:consequences swevt:s20a ~ swevt:r19a-conseq-a {|
            cga:confidence  "1.0"^^xsd:decimal ;
            cga:assessedBy  sw:Analyst ;
            cga:rationale   "The disguise plan is directly executed as the detention block infiltration"@en
        |} ;
        cga:consequences swc:PrincessLeia ~ swevt:r19a-conseq-b {|
            cga:confidence  "1.0"^^xsd:decimal ;
            cga:assessedBy  sw:Analyst ;
            cga:rationale   "Leia's rescue is the stated goal and outcome of the plan"@en
        |} ;
        cga:consequences swobj:DeathStarPlans ~ swevt:r19a-conseq-c {|
            cga:confidence  "0.95"^^xsd:decimal ;
            cga:assessedBy  sw:Analyst ;
            cga:rationale   "Leia's rescue reunites the plans-custodian with R2; both reach Yavin, enabling the briefing and battle"@en
        |} .
}

# ## C4. S027 # swevt:s27d # Luke switches off targeting computer #############
GRAPH sw:s027-int {
    swevt:s27d
        cga:consequences swevt:s27f ~ swevt:r27d-conseq-a {|
            cga:confidence  "1.0"^^xsd:decimal ;
            cga:assessedBy  sw:Analyst ;
            cga:rationale   "Switching off the computer is the immediate precondition for the Force-guided torpedo shot (s27f); without it Luke would have fired on targeting data alone and missed"@en
        |} ;
        cga:consequences swevt:s27g ~ swevt:r27d-conseq-b {|
            cga:confidence  "0.99"^^xsd:decimal ;
            cga:assessedBy  sw:Analyst ;
            cga:rationale   "The chain reaction destruction of the Death Star is the terminal consequence of this single decision"@en
        |} .
}

This is where RDF shines as a data format: patching does not involve rewriting the existing context graph, only appending the patch to the document. This retrospective approach also works well with the context-graph-as-document approach: you can clearly trace change provenance, and the graph becomes more self-aware over time, so you can log not only new changes to the base but also more nuanced interpretations that only become clear as the graph grows richer.
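As a rough illustration of why appending is enough, a dataset can be modelled as a plain set of quads and a patch as a second set merged by union. This is only a sketch: the quads are abbreviated from section A3 of the patch, the reifier annotations are left out, and placing the v2 lightsaber assertion in sw:s009-int is an assumption, as is the helper name apply_patch.

```python
# A quad is (subject, predicate, object, graph); a dataset is a set of quads.
Quad = tuple[str, str, str, str]

# One assertion the patch comments say already exists in v2
# (its graph name here is an assumption for illustration).
base: set[Quad] = {
    ("swevt:s09a", "cga:consequences", "swobj:AnakinLightsaber", "sw:s009-int"),
}

# Two quads abbreviated from section A3 of the patch (annotations omitted).
patch: set[Quad] = {
    ("swevt:s09a", "rdf:type", "cga:Decision", "sw:s009-int"),
    ("swevt:s09a", "cga:consequences", "swc:LukeSkywalker", "sw:s009-int"),
}


def apply_patch(dataset: set[Quad], additions: set[Quad]) -> set[Quad]:
    """Return a new dataset: every existing quad is kept, new ones are added."""
    return dataset | additions


merged = apply_patch(base, patch)
for quad in sorted(merged):
    print(quad)
```

Because the merge is a set union, nothing in the base is modified or retracted, and applying the same patch twice leaves the dataset unchanged.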

Applications and Takeaways

As my regular readers may have noted, I tend to write in order to better understand, not just for my audience. This exercise had some surprising consequences, with significant implications for how we approach knowledge graphs and context graphs within the province of LLMs.

Not once in this whole process did I put content into a triple store. Let that one sink in for a bit. I did semantic manipulation for a fairly complex graph without ever needing to parse it, index it, or otherwise deal with a data container more complex than a TriG-formatted text file. Indeed, in most cases in the last few months, I’ve had to create an LLM-based emulator of a SHACL 1.2 validator or similar tool because the technology to do it in a triple store had not yet caught up with the spec. Will bespoke executable code be faster? Absolutely, but the very fact that we can create such a pre-emulator in the first place should tell you where we’re going. The triple store will give you efficiency and performance, but it is no longer a prerequisite for working with semantics. I think this is a huge shift.
The holonic approach works. There is more overhead, but by separating scene graphs holonically, you enforce the concept of holonic boundaries and projections - you see only what you need to see, transit across boundaries creates traces that can be logged and interpreted, and you can cleanly append content without having to go through complex ingest issues. Holons create graphs of graphs and are a critical step in constructing hypergraphs.
Packaging Holons. The process of going from a transcript to a context graph holon creates a natural data package that contains not only the holons but also any additional relevant declarations (such as taxonomic metadata or entity declarations/definitions). These can be persisted as text blobs rather than indexed, because such blobs are a natural substrate for LLMs to work with as queryable serialised documents. The normalised nature of RDF also means that an append-only architecture can prove especially effective, particularly in conjunction with a context “deprecated” flag indicating when specific assertions are no longer in force.
An append-only architecture also means that provenance distinctions (between source-provided data, analyst-inferred data, and LLM-estimated data) should be maintained consistently throughout the model, not just for causal attributions but for all asserted values, including timestamps.
Serialisation as Holons. Any event-tracking function can be expressed in a holonic manner, whether it is the flow of text in a book, music and media, supply chain movements, organisational dynamics and decision making, production systems, legal code making, or financial reporting. I suspect that PROV-O, with its emphasis on process, could be recast in holonic terms almost trivially.
Context Graphs are built on Knowledge Graphs, as the Entity Flow Knowledge Graph illustrates. Such knowledge graphs can be thought of as the initial state (or initial conditions) of the context graph, with each holon, in turn, being an interactive evolution of a particular scene graph that leaves its imprint at the boundaries of the holon in question.
A note on LLM-emulated validation. Throughout this process, SHACL 1.2 validation was performed by an LLM emulator rather than a conformant validator, because tooling for SHACL 1.2 Working Draft features — particularly sh:reifierShape and the new reification syntax — has not yet fully caught up with the specification. This is worth naming honestly.

SHACL validation is formally precise and closed-world: a conformant validator produces a deterministic result. An LLM emulator is statistically approximate and open-world. The confidence annotations in the reifications could be systematically wrong in ways that are invisible in the output, and there is currently no easy way to verify them without running a conformant engine.
This is not a reason to avoid the approach — it’s a reason to treat LLM-emulated validation as a development-time approximation rather than a production guarantee. As conformant SHACL 1.2 tooling matures, the emulated validation step should be replaced with or augmented by a proper validator. The architecture anticipates this: the boundary graph and its shapes are already isolated in their own named graph, making it straightforward to drop in a real validator once the tooling is ready.

There are a few caveats worth bringing up, however:

Holonic Concept Graphs depend on three critical technologies: reification, named graphs, and validation/rule systems. These are beginning to emerge in existing tool systems, but the specifications are still evolving (named graphs, as one example, have been touched only obliquely in RDF 1.2, though it is possible that they may end up being covered by an extension standard to SHACL 1.2). This means that while you can build context graphs as outlined here now, you will likely be doing so through LLM emulators for some time. I hope to help facilitate the changes necessary to make these things happen, first within the RDF community, and hopefully in a way that can also be expressed within LPG ecosystems such as Neo4j and GQL.
Note: I have dissociated myself from the W3C Context Graph Community Group, as I believe that it is being set up by its chair under false pretences to solve a problem that has nothing to do with context graphs, or even graph technology in general.
I see context graphs as being a major substrate for grounding large language models, a stance I’m seeing more and more reflected in the machine learning community as well. It won’t “fix” the problems inherent with LLMs - the malleability and uncertainty are the price you pay for the inherent confabulatory nature of LLMs, but it will significantly mitigate effects when you need to ensure an accurate world model.
I have not touched on Active Inference in depth, though the architecture here is a graph-centric foundation for building an ActInf system. I expect that as we begin to build out something like the holon model, active inference (and the minimisation of surprise as a means of driving the evolution of a system) will factor in more and more heavily.
Finally, I don’t want to get bogged down in OWL inferencing vs. SHACL arguments; you need both. SHACL can tell you about the graph as is, and can even manage a projection graph layer, but OWL is a semantic layer used heavily by business, legal, educational, medical and scientific domains, and the rules of semantic generalisation hold as much today as they did a decade ago.

In Medias Res,

Kurt Cagle
The Ontologist

© 2026 Kurt Cagle and Claude Sonnet 4.6 for The Ontologist Newsletter. This article reflects the state of both ecosystems as of early 2026. Specification and tooling status evolve rapidly; readers should verify the current implementation support for specific features before making an architectural commitment.

Kurt Cagle is a consulting ontologist and the publisher of The Ontologist and The Cagle Report newsletters.

RDF 1.2 vs. Neo4j/OpenCypher

Kurt Cagle — Sun, 22 Mar 2026 00:57:26 GMT

Authors: Kurt Cagle and Claude Sonnet 4.6
Publication: The Ontologist Newsletter
Date: March 2026

As an ontologist and long-time contributor to the W3C, I have a natural bias toward RDF, and of late I’ve taken a special interest in RDF 1.2 as it evolves. However, it is not the only graph game in town, with Neo4j’s OpenCypher specification and GQL (which has Neo4j DNA) both being strong and valid contenders. There are staunch proponents for Neo4j out there as well, and frankly, I think that after more than a decade of sniping, it is time to admit that there are areas where one is superior to the other, and areas in which a hybrid approach involving both kinds of graph technologies makes a great deal of sense. Hence this post.

This article provides a comprehensive architectural comparison of RDF 1.2 (including SHACL 1.2 and the new reification model) with Neo4j and the OpenCypher/GQL property graph ecosystem. Rather than a feature checklist, the analysis examines each system’s philosophical foundations, data modelling patterns, constraint and validation approaches, serialization and ingestion pipelines, reasoning and inference capabilities, performance and scalability characteristics, and practical use case fit. The central argument is that these two architectures are not competing implementations of the same idea but expressions of two distinct epistemological commitments (one to operational graph traversal, one to knowledge representation), and that the right choice between them depends on correctly identifying which commitment matches a given problem domain.

Note: Part of this article was written in conjunction with Claude. I’m not generally a big fan of AI-generated content, but in this case, I needed something that would balance out my own biases to look at the discussion objectively. I have reviewed this material, and claim any mistakes found within as my own.

1. Introduction: Two Epistemologies
2. Data Model Philosophy
3. Data Modelling in Depth
4. Reification: Annotating the Graph
5. Schema and Validation: SHACL 1.2
6. SHACL 1.2 vs. Neo4j Constraint Enforcement
7. Ingestion and Serialization
8. Reasoning and Inference
9. Performance and Scalability
10. Use Case Mapping
11. Synthesis and Conclusion

1. Introduction: Two Epistemologies

Most comparative analyses of RDF and property graph systems begin with features: which query language is more expressive, which system ingests data faster, which tooling ecosystem is richer. These are useful questions. They are not, however, the right starting point.

The deeper question is epistemological: what does each system believe a graph is?

Neo4j’s graph is an operational structure — a network of entities and relationships that an application navigates to answer questions about the current state of a domain. The graph is a data store in the tradition of databases: it holds what you put in it, enforces the invariants you declare, and returns results efficiently. Its meaning is in its topology.

RDF’s graph is a knowledge structure — a set of propositions about the world from which further propositions can be derived. The graph is a reasoning substrate in the tradition of logic: it holds not just what was asserted but what can be concluded, and its schema is itself part of the knowledge it represents. Its meaning is in its semantics.

These are genuinely different things. A database that happens to use graph topology is not the same as a knowledge representation system that happens to be queried. Every technical difference examined in this article — from serialization format pluralism to SHACL’s advisory validation model to OWL’s entailment regimes — follows from this foundational divergence.

RDF 1.2 makes this comparison particularly timely. The new reification syntax (~ name {| |}), the sh:reifierShape constraint mechanism, SHACL 1.2’s node expressions, and the maturing SPARQL 1.2 specification collectively close the most-cited ergonomic gaps between the two systems. At the same time, the ISO GQL standard — with Cypher as its primary antecedent — begins to close the standardization gap in the other direction. Both ecosystems are maturing. The remaining differences are increasingly architectural rather than incidental.

2. Data Model Philosophy

The Open World Assumption

The deepest difference between RDF and Neo4j is epistemic, not technical. RDF is built on the Open World Assumption (OWA): a triple’s absence says nothing about its falsehood. The graph represents what is known, not what exists. Neo4j/OpenCypher operates under the Closed World Assumption (CWA): the graph is what you loaded, queries operate against a closed managed dataset, and absence implies falsehood.

This isn’t a philosophical nicety. It drives almost every downstream difference in schema behavior, query semantics, and validation design.

RDF’s OWA makes it the right choice when federating knowledge across domains — heterogeneous ontologies, cross-organization data merging, open-ended integration. It makes validation genuinely hard, which is precisely why SHACL exists as a separate CWA validation layer bolted onto a fundamentally OWA stack. Neo4j avoids this tension because it never pretended to be an open knowledge representation system.
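The difference between the two assumptions can be sketched as two ways of answering the same membership question over one small graph. This is an illustration only: the function names and the ex:Initech term are hypothetical, and no real query engine is this simple.

```python
from typing import Optional

Triple = tuple[str, str, str]

graph: set[Triple] = {
    ("ex:Alice", "ex:worksFor", "ex:AcmeCorp"),
}


def ask_closed_world(g: set[Triple], t: Triple) -> bool:
    """CWA: anything not stated in the graph is treated as false."""
    return t in g


def ask_open_world(g: set[Triple], t: Triple) -> Optional[bool]:
    """OWA: a stated triple is true; a missing one is unknown (None), not false."""
    return True if t in g else None


# Hypothetical triple that the graph does not contain.
query: Triple = ("ex:Alice", "ex:worksFor", "ex:Initech")

print("closed world:", ask_closed_world(graph, query))
print("open world:  ", ask_open_world(graph, query))
```

Under the closed-world reading the missing triple is simply false; under the open-world reading the graph is silent about it, which is why validation needs a separate layer that deliberately closes the world.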

Query Language Alignment

The OWA/CWA distinction propagates into query language semantics. SPARQL’s OPTIONAL patterns reflect the OWA — the absence of a value is distinct from the presence of a null. Cypher’s pattern matching assumes the graph contains what it needs to match — an absent relationship simply means no match, which is semantically complete under the CWA.

For most application developers, CWA semantics are more intuitive. The OWA becomes valuable precisely when the alternative — asserting the non-existence of every missing relationship — is impossible or counterproductive.

3. Data Modelling in Depth

The Fundamental Unit of Data

Everything follows from this difference.

RDF’s atomic unit is the triple: a statement of the form (subject, predicate, object). A graph is a set of triples. An entity has no independent existence in the store — it is constituted entirely by the triples in which it appears. There is no “node record” for ex:Alice; there are only triples whose subject or object is ex:Alice. Remove all those triples and the entity ceases to exist.

Neo4j’s atomic unit is the node or relationship: first-class records with identity, labels, and property maps. A node exists independently of its relationships. Relationships are first-class records connecting exactly two nodes, with their own property maps.

This means RDF has one modelling concept (the triple) that unifies what Neo4j models as two separate concepts (properties and relationships). Neo4j’s distinction is more ergonomic and maps naturally to object-oriented thinking; RDF’s uniformity makes the model more composable and easier to extend without structural changes.
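A small sketch may make the "entity is constituted by its triples" point concrete. It treats a graph as a set of string triples, with literals written in quotes; the helper names are illustrative assumptions, not part of any RDF library.

```python
Triple = tuple[str, str, str]

graph: set[Triple] = {
    ("ex:Alice", "rdf:type", "foaf:Person"),          # a "type"
    ("ex:Alice", "foaf:name", '"Alice"@en'),          # a "property"
    ("ex:employment1", "ex:employee", "ex:Alice"),    # a "relationship"
    ("ex:employment1", "ex:employer", "ex:AcmeCorp"),
}


def resources(g: set[Triple]) -> set[str]:
    """All subjects and non-literal objects mentioned anywhere in the graph."""
    subjects = {s for s, _, _ in g}
    objects = {o for _, _, o in g if not o.startswith('"')}
    return subjects | objects


def without(g: set[Triple], term: str) -> set[Triple]:
    """Drop every triple in which the term appears as subject or object."""
    return {t for t in g if term not in (t[0], t[2])}


remaining = without(graph, "ex:Alice")
print(sorted(resources(remaining)))
```

There is no separate node record to delete: once the three triples that mention ex:Alice are gone, the entity no longer exists in the graph, and the property and the relationship were removed by exactly the same operation.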

Identity

RDF identity via IRIs:

Every named entity in RDF is identified by an IRI: a globally unique and, ideally, dereferenceable identifier. Two RDF graphs from different sources that use the same IRI for a person are, by the semantics of RDF, talking about the same person. Their triples merge without conflict resolution.

IRI-based identity has practical modelling implications:

Namespace management is a modelling concern — the choice of IRI scheme encodes ownership and scope
IRI stability matters — changing an entity’s IRI is a breaking change
IRI minting policy (hash-based, sequential, UUID, slug) requires explicit design

Blank nodes allow anonymous entities with no global identity, but create significant practical problems: they cannot be referenced across graph boundaries, SPARQL queries over them are awkward, and merging graphs containing blank nodes requires keeping each graph’s blank nodes distinct, typically through skolemization. Good RDF data modelling minimizes blank node use.
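To see why blank nodes complicate merging, consider two graphs that each happen to use the label _:b0 for a different anonymous address. The sketch below is a simplified illustration: the data, the graph_id argument, and the example.org base are assumptions, and real parsers scope blank node labels per document rather than comparing strings.

```python
Triple = tuple[str, str, str]

SKOLEM_BASE = "https://example.org/.well-known/genid/"  # assumed base IRI


def skolemize(g: set[Triple], graph_id: str) -> set[Triple]:
    """Replace each blank node label with an IRI unique to its source graph."""

    def sk(term: str) -> str:
        if term.startswith("_:"):
            return f"<{SKOLEM_BASE}{graph_id}-{term[2:]}>"
        return term

    return {(sk(s), p, sk(o)) for s, p, o in g}


# Hypothetical data: both graphs reuse the local label _:b0.
graph_a: set[Triple] = {
    ("ex:Alice", "ex:address", "_:b0"),
    ("_:b0", "ex:city", '"Seattle"'),
}
graph_b: set[Triple] = {
    ("ex:Bob", "ex:address", "_:b0"),
    ("_:b0", "ex:city", '"Portland"'),
}

naive = graph_a | graph_b                                # labels collide
safe = skolemize(graph_a, "a") | skolemize(graph_b, "b")  # labels kept apart


def cities_of(g: set[Triple], node: str) -> set[str]:
    return {o for s, p, o in g if s == node and p == "ex:city"}


print(cities_of(naive, "_:b0"))
```

In the naive union the single _:b0 node ends up with two cities, silently conflating two people's addresses; after skolemization each address node keeps exactly one.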

Neo4j identity via internal IDs:

Neo4j assigns every node and relationship an internal integer ID automatically. These IDs are not stable application identifiers — they can be reassigned during store compaction. Neo4j 5.x introduced element IDs (string-encoded composite identifiers) as a partial improvement, but the fundamental point remains: application identity requires explicit modelling as a business key property with a uniqueness constraint.

Typing

RDF typing:

Types are asserted via rdf:type triples — themselves just triples:

ex:Alice rdf:type foaf:Person .
ex:Alice rdf:type schema:Employee .

Multi-typing is natural; adding a new type requires one triple; types are IRIs with published definitions; and with RDFS/OWL reasoning, types can be inferred from property usage. rdfs:subClassOf builds arbitrarily deep taxonomies where subtype membership propagates automatically.

Neo4j typing via labels:

Labels are string tags on nodes — flat, with no built-in hierarchy, inheritance, or inference. Multi-labeling is possible, but there is no mechanism by which Neo4j would automatically include Employee nodes in a query for Person nodes. Subtype relationships must be maintained explicitly at write time or via application-layer workarounds.

This is a meaningful constraint for systems that need deep taxonomic typing — product catalogs, biological classifications, regulatory frameworks. RDF handles these natively; Neo4j requires significant workarounds.
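The contrast between inferred types and flat labels can be sketched in a few lines. The subclass axiom linking schema:Employee to foaf:Person and the ex:Bob instance are assumptions made for the example, and the two lookup functions are simplified stand-ins for an RDFS reasoner and a label match respectively.

```python
Triple = tuple[str, str, str]

graph: set[Triple] = {
    # Assumed axiom, for illustration only.
    ("schema:Employee", "rdfs:subClassOf", "foaf:Person"),
    ("ex:Alice", "rdf:type", "schema:Employee"),
    ("ex:Bob", "rdf:type", "foaf:Person"),
}


def superclasses(g: set[Triple], cls: str) -> set[str]:
    """The class itself plus everything reachable via rdfs:subClassOf."""
    seen = {cls}
    stack = [cls]
    while stack:
        current = stack.pop()
        for s, p, o in g:
            if s == current and p == "rdfs:subClassOf" and o not in seen:
                seen.add(o)
                stack.append(o)
    return seen


def instances_inferred(g: set[Triple], cls: str) -> set[str]:
    """Subjects whose asserted type is the class or any of its subclasses."""
    return {s for s, p, o in g if p == "rdf:type" and cls in superclasses(g, o)}


def instances_by_label(g: set[Triple], cls: str) -> set[str]:
    """Flat matching: only subjects explicitly tagged with this exact type."""
    return {s for s, p, o in g if p == "rdf:type" and o == cls}


print(sorted(instances_inferred(graph, "foaf:Person")))
print(sorted(instances_by_label(graph, "foaf:Person")))
```

With the subclass axiom in play, a query for persons picks up the employee automatically; with flat matching, the employee is found only if someone also wrote the second label at load time.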

The Reification Problem: Properties vs. Relationships

Neo4j distinguishes sharply between properties (primitive values stored inline on nodes or relationships) and relationships (typed connections between nodes). Property values must be primitives or primitive lists — they cannot point to another node.

This means that if a property’s value has its own structure (an address with street, city, and postcode, say), it must be modelled as a separate node connected by a relationship. And if you later need to annotate a relationship with metadata that goes beyond primitive values (a link to a source document or another entity, for instance), you must introduce an intermediate node, changing the graph topology. Every query touching that pattern must be updated.

RDF’s uniform triple model avoids this. Adding properties to a structured value requires adding triples; annotating a relationship uses RDF 1.2 reification without changing topology.

N-ary Relationships

Both systems face the challenge of relationships involving more than two participants. RDF’s canonical solution is the n-ary relation pattern — introduce a named node representing the relationship instance:

ex:employment1
    a ex:Employment ;
    ex:employee ex:Alice ;
    ex:employer ex:AcmeCorp ;
    ex:role ex:SeniorEngineer ;
    ex:startDate "2020-03-01"^^xsd:date ;
    ex:salary "120000"^^xsd:decimal .

Neo4j uses the same structural pattern — intermediate nodes — but as an application convention with no formal semantic grounding. RDF’s pattern is formally documented (W3C Note on n-ary relations) and understood by reasoning engines. Neo4j’s intermediate nodes require application-level documentation of their semantics.

Named Graphs and Graph-Level Metadata

TriG and the SPARQL dataset model give RDF a fourth component — the graph name — turning the triple into a quad: (subject, predicate, object, graph). Named graphs enable:

Multi-source integration — conflicting claims from different sources preserved in separate named graphs
Versioning — each snapshot is a named graph with temporal metadata
Access control — named graphs as the unit of permission management
Hypothetical reasoning — “what if” scenarios without polluting the asserted graph

Neo4j has no named graph concept within a database. The unit of graph isolation is the database. For multi-source data or versioning, applications must use property-based tagging (fragile), separate databases with Fabric cross-database queries (complex), or explicit provenance nodes in the graph.
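The multi-source case in the list above can be illustrated by keeping the graph name as the fourth element of each statement. The headcount figures and source names are invented for the example, and claims_by_graph is a hypothetical helper rather than a SPARQL feature.

```python
from collections import defaultdict

# (subject, predicate, object, graph)
Quad = tuple[str, str, str, str]

# Invented, conflicting claims from two hypothetical sources.
dataset: set[Quad] = {
    ("ex:AcmeCorp", "ex:headcount", '"1200"', "ex:source/registry"),
    ("ex:AcmeCorp", "ex:headcount", '"1350"', "ex:source/pressRelease"),
}


def claims_by_graph(ds: set[Quad], subject: str, predicate: str) -> dict[str, set[str]]:
    """Group the values asserted for (subject, predicate) by the graph that asserts them."""
    grouped: dict[str, set[str]] = defaultdict(set)
    for s, p, o, g in ds:
        if s == subject and p == predicate:
            grouped[g].add(o)
    return dict(grouped)


def flatten(ds: set[Quad]) -> set[tuple[str, str, str]]:
    """Drop the graph name, which is what a single unnamed graph would store."""
    return {(s, p, o) for s, p, o, _ in ds}


for graph_name, values in sorted(claims_by_graph(dataset, "ex:AcmeCorp", "ex:headcount").items()):
    print(graph_name, sorted(values))
```

Both claims survive side by side with their source attached; flattening to triples keeps both values but loses any record of who said which.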

Schema Flexibility and Evolution

Both systems are schema-optional by default. RDF’s schema is expressed through OWL ontologies and SHACL shapes applied to the graph rather than required by it. Evolution requires only adding triples; adding new predicates requires no schema migration. The cost: schema violations are silent without explicit SHACL validation.

Neo4j enforces the constraints you declare, transactionally. Adding new properties or labels requires no migration. The cost: validation logic beyond existence, uniqueness, and type must be written as application code or APOC triggers.

Both systems avoid the relational ALTER TABLE problem. RDF has a slight edge in cross-system evolution — because predicates are IRIs with published definitions, consuming applications can adapt to new predicates without coordination with the data producer.

Lists and Ordered Collections

RDF’s formal list model (rdf:List with blank-node linked lists) is semantically correct but awkward to query in SPARQL. Practical RDF modelling often uses explicit position predicates or accepts multi-valued unordered properties.

Neo4j supports native primitive arrays, which is significantly more ergonomic for lists of simple values. Ordered collections of entity nodes require the same linked-list workaround as RDF. Neither system has an elegant solution for ordered entity collections.
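The awkwardness of the linked-list model is easier to see when written out. The sketch below encodes an ordered list as rdf:first / rdf:rest cells in plain Python tuples; the blank node labels (_:b0, _:b1, ...) and the helper names are assumptions made for illustration, not the output of any particular library.

```python
RDF_FIRST, RDF_REST, RDF_NIL = "rdf:first", "rdf:rest", "rdf:nil"

def encode_list(items):
    """Return (head, triples) encoding items as an rdf:List-style linked list."""
    if not items:
        return RDF_NIL, []
    cells = [f"_:b{i}" for i in range(len(items))]  # hypothetical blank node labels
    triples = []
    for i, item in enumerate(items):
        next_cell = cells[i + 1] if i + 1 < len(items) else RDF_NIL
        triples.append((cells[i], RDF_FIRST, item))
        triples.append((cells[i], RDF_REST, next_cell))
    return cells[0], triples

def decode_list(head, triples):
    """Walk the rdf:rest chain from head and collect each rdf:first value."""
    first = {s: o for s, p, o in triples if p == RDF_FIRST}
    rest = {s: o for s, p, o in triples if p == RDF_REST}
    items, cell = [], head
    while cell != RDF_NIL:
        items.append(first[cell])
        cell = rest[cell]
    return items
```

Three items cost six triples, and reading position n means following n links, which is why position predicates are a common practical substitute.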

Multilingual Literals and Datatypes

RDF’s language-tagged strings are first-class — foaf:name "Alice"@en and foaf:name "Алиса"@ru are distinct literals requiring no application-layer handling. RDF’s explicit datatype system (xsd:date, xsd:decimal, etc.) enables correct arithmetic and range queries. RDF 1.2 adds rdf:JSON for embedding structured JSON payloads as typed literals.

Neo4j’s POINT type with native spatial indexes outperforms GeoSPARQL for geospatial-heavy applications. For genuinely multilingual knowledge graphs, RDF’s @lang mechanism is a significant modelling advantage with no Neo4j equivalent.

4. Reification: Annotating the Graph

The Problem

Both RDF and Neo4j need a mechanism for annotating statements — adding metadata (provenance, confidence, temporal scope) to individual assertions rather than to entities or classes of assertion.

Neo4j’s answer has always been relationship properties: (a)-[r:KNOWS {since: 2020, confidence: 0.8}]->(b). This is ergonomic and natively queryable. But it only works for relationships; node properties cannot be annotated. And there is no mechanism to enforce that a particular relationship type carries a valid confidence score, a typed timestamp, and an attributed source as a composite structural requirement.

RDF’s old answer, an rdf:Statement node described by four triples (rdf:type, rdf:subject, rdf:predicate, rdf:object) before any annotation is attached, was verbose, lossy, and poorly supported. RDF 1.2 changes this fundamentally.

RDF 1.2 Triple Terms and Reification

RDF 1.2 introduces triple terms — first-class statement identity with the condensed syntax:

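# The annotated triple (ex:Alice ex:knows ex:Bob) is an illustrative assumption.
# The time component is a placeholder: xsd:dateTime needs a full lexical form.
ex:Alice ex:knows ex:Bob ~ ex:assertion1 {|
    prov:wasAttributedTo ex:Alice ;
    prov:generatedAtTime "2025-03-01T00:00:00Z"^^xsd:dateTime ;
    ex:confidence "0.92"^^xsd:decimal
|} .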

The reifier ex:assertion1 identifies that particular occurrence of the triple. It can be the subject of further triples, enabling recursive annotation. The original triple is preserved; annotations are layered without changing graph topology.

This closes the most-cited ergonomic gap with Neo4j relationship properties — not completely, but substantially. For Neo4j, relationship properties are a flat key-value map with no nested structure and no mechanism to enforce the shape of that map as a composite. For RDF 1.2, sh:reifierShape enforces structural requirements on triple-level metadata:

ex:ProvenanceAnnotationShape
    a sh:NodeShape ;
    sh:property [
        sh:path prov:wasAttributedTo ;
        sh:minCount 1 ;
        sh:class foaf:Agent ;
    ] ;
    sh:property [
        sh:path ex:confidence ;
        sh:minCount 1 ;
        sh:datatype xsd:decimal ;
        sh:minInclusive 0.0 ;
        sh:maxInclusive 1.0 ;
    ] .

Every annotated assertion is required to carry attribution and a well-formed confidence score. Neo4j can store {confidence: 0.92} on a relationship, but there is no native mechanism to enforce that every relationship of a given type carries a valid confidence value as part of a composite provenance record.
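As a rough illustration of what the shape checks, the following plain-Python sketch applies the same two requirements to a dict standing in for the triples attached to a reifier. It is not a SHACL engine: the function name and the dict representation are assumptions, and the sh:class foaf:Agent check is omitted because it would need the type data of the attributed agent.

```python
from decimal import Decimal, InvalidOperation

def validate_annotation(annotation):
    """Return a list of violation messages for one annotation record."""
    violations = []
    # Mirrors sh:minCount 1 on prov:wasAttributedTo.
    if not annotation.get("prov:wasAttributedTo"):
        violations.append("missing prov:wasAttributedTo")
    # Mirrors sh:minCount 1, sh:datatype xsd:decimal, and the 0.0 to 1.0 range.
    raw = annotation.get("ex:confidence")
    if raw is None:
        violations.append("missing ex:confidence")
    else:
        try:
            value = Decimal(str(raw))
        except InvalidOperation:
            violations.append("ex:confidence is not a decimal")
        else:
            if value.is_nan() or not (Decimal("0") <= value <= Decimal("1")):
                violations.append("ex:confidence outside [0, 1]")
    return violations
```

The point of the declarative shape is that this logic lives in the data layer as a shared artifact rather than in each application that writes annotations.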

Composability

RDF 1.2 reification wins on composability: reifiers can themselves be subjects of further triples, enabling recursive annotation and full alignment with provenance frameworks like PROV-O, the Open Annotation (OA) vocabulary, and temporal modeling patterns. Neo4j’s relationship properties are a flat structure with no recursion.

The critical gap remains ecosystem implementation. SPARQL 1.2 support for querying triple terms varies across stores. Jena/Fuseki is ahead; many commercial stores are lagging. Neo4j’s relationship property queries just work everywhere.

5. Schema and Validation: SHACL 1.2

The Architectural Relationship

SHACL is not a schema language in the relational sense. It is a constraint validation framework operating under the Closed World Assumption against a graph that otherwise operates under the Open World Assumption. This is intentional: it allows different validation profiles to be applied to the same underlying graph depending on context, workflow stage, or consuming application.

Shape Targets

Every SHACL shape evaluation begins with target selection. SHACL 1.2 offers:

sh:targetClass — nodes with a given rdf:type
sh:targetNode — specific named individuals
sh:targetSubjectsOf / sh:targetObjectsOf — nodes in a particular predicate role
sh:targetShape — nodes conforming to another shape (SHACL 1.2 addition)
sh:SPARQLTarget — arbitrary SPARQL SELECT as a target expression

sh:targetShape enables compositional, progressive validation: apply stricter shapes only to nodes that already conform to a base shape. This mirrors how data pipelines actually work — ingest with permissive shapes, enrich, validate against tighter production shapes before publication.
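The staged pattern can be sketched without any SHACL tooling. In the plain-Python example below, the two predicate functions stand in for a permissive base shape and a stricter production shape; their specific rules (an identifier, a name, a non-negative integer age) are hypothetical and chosen only to show the control flow.

```python
def conforms_base(node):
    """Permissive ingest shape: the node only needs an identifier."""
    return bool(node.get("id"))

def conforms_production(node):
    """Stricter shape: a name and a non-negative integer age are required."""
    age = node.get("age")
    return bool(node.get("name")) and isinstance(age, int) and age >= 0

def progressive_validate(nodes):
    """Apply the strict shape only to nodes that already pass the base shape."""
    targets = [n for n in nodes if conforms_base(n)]
    failures = [n["id"] for n in targets if not conforms_production(n)]
    return targets, failures
```

Nodes that fail the base shape are never reported by the strict stage, which keeps production reports focused on data that was expected to be complete.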

Constraint Components

SHACL 1.2’s constraint vocabulary covers:

Cardinality: sh:minCount, sh:maxCount, and the more expressive sh:qualifiedValueShape with sh:qualifiedMinCount/sh:qualifiedMaxCount — “at least two values of this property must conform to this particular sub-shape.”

Value type and range: sh:datatype, sh:nodeKind, sh:class, sh:minInclusive/sh:maxExclusive and their variants.

String: sh:pattern (regex), sh:minLength/sh:maxLength, sh:languageIn, sh:uniqueLang.

Logical operators: sh:and, sh:or, sh:not, sh:xone — first-class constraint components enabling boolean shape algebra.

Shape-based: sh:node, sh:property — enabling deep nested validation.

SPARQL-based: sh:sparql with a full SPARQL query as the constraint body. This is the escape hatch: any condition that can be written as a SPARQL SELECT can serve as a validation constraint.

SHACL Rules (SRL)

SHACL 1.2 includes a rules sublanguage that extends validation into inferential closure:

ex:InferDepartmentShape
    a sh:NodeShape ;
    sh:targetClass ex:Employee ;
    sh:rule [
        a sh:TripleRule ;
        sh:subject sh:this ;
        sh:predicate ex:belongsTo ;
        sh:object [ sh:path (ex:worksIn ex:partOf) ] ;
    ] .

SRL shapes generate new triples without invoking a full OWL reasoner — rule-based materialization grounded in the shape graph, more predictable and auditable than DL entailment while covering a substantial fraction of practical inference requirements. The interplay between SRL and validation creates a natural pipeline: run rules to materialize inferred triples, then validate the enriched graph.
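What the rule above materializes can be approximated in a few lines of plain Python. This is a sketch of the effect, not of how a SHACL engine evaluates rules; the function name and the tuple representation of triples are assumptions.

```python
def materialize_belongs_to(triples):
    """Derive ex:belongsTo for every ex:Employee via ex:worksIn / ex:partOf."""
    triples = set(triples)
    employees = {s for s, p, o in triples if p == "rdf:type" and o == "ex:Employee"}
    works_in = {(s, o) for s, p, o in triples if p == "ex:worksIn"}
    part_of = {(s, o) for s, p, o in triples if p == "ex:partOf"}
    inferred = {
        (employee, "ex:belongsTo", org)
        for employee, unit in works_in
        if employee in employees          # the rule only targets ex:Employee
        for sub_unit, org in part_of
        if sub_unit == unit               # follow the two-step path
    }
    return inferred - triples             # only genuinely new triples
```

Running the function first and validating the union of asserted and inferred triples afterwards gives the rules-then-validate pipeline described above.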

Node Expressions

SHACL 1.2 introduces node expressions — computing values during validation rather than just traversing existing structure. A node expression can invoke path evaluation, function application, or set operations, enabling constraints like “validate that the computed union of values from two properties satisfies this shape.” Combined with SRL, this pushes SHACL toward a complete data transformation and validation pipeline language.

The Write-Time Enforcement Gap

SHACL validation is advisory by default. The specification defines a validation process producing a report; it says nothing about what a store should do with that report. Enforcement is application-layer.

Some triple stores close this gap in store-specific ways: GraphDB in strict mode rejects constraint violations transactionally; Stardog has similar capabilities. But this is not a specification guarantee — you’re coupling to an implementation.

Neo4j’s constraints are part of the engine contract. A uniqueness constraint violation is a transaction abort, unconditionally, regardless of which client wrote the data.

The productive framing: SHACL is the right tool when validation semantics need to be richer than what an engine can enforce inline. Neo4j constraints are the right tool when enforcement must be unconditional and transactional. These answer different questions about where in the architecture data quality is governed.

6. SHACL 1.2 vs. Neo4j Constraint Enforcement

What Neo4j Enforces Natively

Neo4j’s native constraint vocabulary is deliberately narrow and transactionally guaranteed:

// Uniqueness
CREATE CONSTRAINT FOR (p:Person) REQUIRE p.email IS UNIQUE;

// Existence
CREATE CONSTRAINT FOR (p:Person) REQUIRE p.name IS NOT NULL;

// Node key (composite)
CREATE CONSTRAINT FOR (p:Person) REQUIRE (p.firstName, p.lastName) IS NODE KEY;

// Property type (Neo4j 5.x)
CREATE CONSTRAINT FOR (p:Person) REQUIRE p.age IS :: INTEGER;

// Relationship property existence
CREATE CONSTRAINT FOR ()-[r:KNOWS]-() REQUIRE r.since IS NOT NULL;

Every constraint is enforced transactionally. This is Neo4j’s core operational strength.

APOC Triggers: The Escape Hatch

APOC provides a trigger mechanism enabling procedural validation beyond native constraints. But APOC triggers are procedural, not declarative; they’re a plugin rather than part of the core engine; they don’t produce structured validation reports; and their concurrent write behavior is operationally complex.

APOC triggers are best understood as a safety valve. Teams writing complex APOC trigger logic to enforce data quality are often in a situation where a SHACL-based validation layer would have been the right design choice.

The Relationship Property Problem

The deepest structural gap is in relationship property validation. Neo4j can enforce that a relationship property exists and has the right type. It cannot enforce:

That certain relationship properties are conditionally required based on node properties
That relationship property values fall within a computed range derived from connected node data
That the structure of relationship properties constitutes a valid composite

None of the following have Neo4j equivalents:

# At least one WORKS_AT must be a current employer
ex:PersonShape sh:property [
    sh:path ex:worksAt ;
    sh:qualifiedValueShape [
        sh:property [ sh:path ex:isCurrent ; sh:hasValue true ] ;
    ] ;
    sh:qualifiedMinCount 1 ;
] .

# Salary must not exceed department budget
ex:EmployeeShape sh:sparql [
    sh:message "Salary exceeds department budget" ;
    sh:select """
        SELECT $this WHERE {
            $this ex:salary ?sal ;
                  ex:inDepartment/ex:budget ?budget .
            FILTER (?sal > ?budget)
        }
    """ ;
] .

The Comparative Framing

Neo4j’s constraint model answers: “Is this write operation consistent with the database schema?” — unconditionally, at the engine level, before any data is committed.

SHACL 1.2 answers: “Does this graph, as a whole, conform to the structural and semantic expectations of this application domain?” — with a queryable, structured, versioned report that is itself a knowledge artifact.

7. Ingestion and Serialization

Serialization Formats

RDF has always been format-plural by design. The data model is abstract; serialization is a separate concern.

RDF 1.2 serialization landscape:

Turtle 1.2 — human-readable, concise, full reification syntax; the primary authoring format
TriG 1.2 — named graph support, Turtle superset
N-Triples / N-Quads — streaming-friendly, line-oriented, trivial to parse; verbose, no prefix support
JSON-LD 1.1 — JSON ecosystem integration; complex compaction/expansion pipeline
RDF/XML — legacy interoperability; genuinely painful to write and read
HDT — binary compressed; read-only; niche tooling
RDF Thrift — binary fast; very limited ecosystem

For RDF 1.2 specifically, Turtle 1.2 and TriG 1.2 carry the new reification syntax. JSON-LD’s treatment of triple terms is still settling.

Neo4j’s serialization landscape:

Cypher — primary import/export language; human-readable, executable
CSV — bulk ingest workhorse via LOAD CSV and neo4j-admin import
GraphML / GEXF — via APOC; primarily for graph visualization tool interop
JSON — via APOC import/export
Arrow (binary) — high-throughput bulk ingest via Arrow Flight (Neo4j 5.x)

Neo4j’s serialization story is operationally simpler. The formats are familiar to the broadest population of data engineers. The tradeoff: none carry semantic metadata — column names are strings, not IRIs; types are inferred; relationships between entities in different files require application-level join logic.

Bulk Ingestion

neo4j-admin import bypasses the transaction engine entirely, writing directly to the store format:

:ID,name:string,age:int,:LABEL
1,Alice,32,Person

:START_ID,:END_ID,:TYPE,since:int
1,2,KNOWS,2020
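Files in this shape are usually generated rather than written by hand. The sketch below builds a node file and a relationship file with Python’s standard csv module. The second person (Bob, 41) is hypothetical data added so the relationship has both endpoints, and the lowercase type suffixes in the headers are an assumption about the importer’s expected spelling.

```python
import csv
import io

def write_csv(header, rows):
    """Serialize a header and rows to CSV text with Unix line endings."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()

# Node file: an ID column, typed properties, and a label column.
nodes_csv = write_csv(
    [":ID", "name", "age:int", ":LABEL"],
    [[1, "Alice", 32, "Person"], [2, "Bob", 41, "Person"]],
)

# Relationship file: start and end IDs refer to the node file's :ID column.
rels_csv = write_csv(
    [":START_ID", ":END_ID", ":TYPE", "since:int"],
    [[1, 2, "KNOWS", 2020]],
)
```

In a real pipeline the strings would be written to disk and passed to the bulk importer; that step is left out because the exact command line depends on the Neo4j version.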

RDF bulk ingest is store-specific. SPARQL Update is not designed for bulk throughput. Each store has its own mechanism — Fuseki’s tdb2.tdbloader, GraphDB’s REST API, Virtuoso’s ld_dir() / rdf_loader_run().

The pattern holds across hardware configurations: Neo4j’s bulk loader is consistently faster than most RDF stores’ equivalents, partly due to the simpler storage model and partly due to sustained performance engineering investment.

Streaming and Event-Driven Ingestion

Neo4j has production-grade Kafka integration via the Neo4j Connector for Kafka — source connector (graph change events to Kafka topics) and sink connector (Kafka topics to graph writes via Cypher or merge patterns), with dead letter queue handling.

Streaming RDF ingest is underserved. Options include community-maintained RDF-star Kafka connectors, JSON-LD over Kafka with a processor in the consumer pipeline, RDF Patch (supported by Jena and a few others), and custom SPARQL Update streams. None match the maturity of Neo4j’s Kafka offering.

Transformation and ETL

Neo4j ecosystem: Apache Hop (Neo4j plugin), Spark connector (neo4j-connector-apache-spark), dbt community adapters, Fivetran/Airbyte connectors, and @neo4j/graphql for auto-generated GraphQL APIs. All are mature commercial integrations.

RDF ecosystem: RML (RDF Mapping Language) with YARRRML syntax, SPARQL-Generate, Apache Camel RDF component, Ontop (Virtual Knowledge Graph — mapping SPARQL directly to relational SQL without materializing triples), and SPARQL-Anything (querying non-RDF sources as if they were RDF graphs). RML + YARRRML is arguably more powerful than Neo4j’s Hop integration for complex multi-source transformations, but tooling maturity and documentation lag significantly.

Client Library Ecosystems

Neo4j: Official drivers for JavaScript/Node.js, Python, Java, Go, and .NET, all sharing a consistent Bolt protocol implementation with connection pooling and retry logic. Spring Data Neo4j for OGM. @neo4j/graphql as a first-class supported path.

RDF: Apache Jena and Eclipse RDF4J are comprehensive on the JVM. Python’s rdflib is the de facto standard for in-memory work, with pySHACL for validation and SPARQLWrapper for remote endpoints. JavaScript is fragmented — rdfjs community interfaces, N3.js for parsing, Comunica for federated query. Rust’s Oxigraph is fast and growing.

The RDF library landscape is more fragmented and less consistent than Neo4j’s. The developer experience requires assembling a stack rather than installing a single driver.

JSON-LD as a Bridge

JSON-LD 1.1 lets you layer semantic meaning onto an existing JSON document via @context. The result is simultaneously valid JSON (processable by any JSON consumer) and valid RDF (ingestable by any triple store). For REST APIs, schema.org markup, Verifiable Credentials, and structured data in HTML, JSON-LD is the actual deployment format through which the RDF semantic web stack reaches the web developer ecosystem. Neo4j has no equivalent interoperability story for web-native structured data.
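The dual reading is easy to demonstrate. The sketch below parses a small document as ordinary JSON and then applies a deliberately naive mapping from context keys to IRIs. It is not the JSON-LD expansion algorithm: it ignores nesting, typed values, and @type coercion, and the example IRIs under example.org are assumptions.

```python
import json

doc = json.loads("""
{
  "@context": {
    "name": "http://xmlns.com/foaf/0.1/name",
    "knows": "http://xmlns.com/foaf/0.1/knows"
  },
  "@id": "http://example.org/alice",
  "name": "Alice",
  "knows": "http://example.org/bob"
}
""")

def to_triples(doc):
    """Map context-defined keys to IRIs and emit (subject, predicate, object)."""
    context = doc.get("@context", {})
    subject = doc["@id"]
    return [
        (subject, context[key], value)
        for key, value in doc.items()
        if not key.startswith("@") and key in context
    ]

# A JSON consumer reads doc["name"]; an RDF consumer reads the triples.
triples = to_triples(doc)
```

A conforming processor would do considerably more, but the principle is the same: the @context is the only thing separating a plain JSON payload from a graph.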

The GQL Horizon

ISO GQL (2024) defines a standard property graph data model alongside the query language. As GQL adoption matures, Cypher-compatible queries will be portable across GQL-compliant stores, narrowing the portability advantage that SPARQL has held as a W3C standard. The W3C’s RDF-to-property-graph mapping work will further reduce impedance mismatch at the serialization boundary.

8. Reasoning and Inference

The Epistemic Foundation

RDF was designed from the outset as a knowledge representation substrate, one in which a graph can be used to derive knowledge beyond what was explicitly stated. Neo4j was designed as an operational graph database: it stores and retrieves what you put in it. Inference is not part of its contract.

The OWL Entailment Stack

RDF’s reasoning capabilities come from OWL (Web Ontology Language), which provides a formal Description Logic (DL) foundation through progressively expressive sublanguages:

RDFS (baseline): rdfs:subClassOf and rdfs:subPropertyOf transitivity; rdfs:domain and rdfs:range type inference. Simple but immediately useful — loading an RDFS ontology into a reasoner-enabled store means automatic type inference from property usage.

OWL 2 EL: Tractable polynomial-time subset targeting biomedical ontologies. Class intersection, existential restrictions, property chains (owl:propertyChainAxiom), transitive properties, role hierarchy. The basis of large biomedical ontologies like SNOMED CT and the Gene Ontology.

OWL 2 RL: Rule-friendly subset implementable in forward-chaining rule systems. Universal restrictions, inverse properties, symmetric/asymmetric/irreflexive properties, disjoint classes and properties. The profile most triple stores implement natively.

OWL 2 DL: Full Description Logic, SROIQ(D). Number restrictions, enumerated classes, complex role inclusion. N2EXPTIME-complete in the general case; full DL reasoners (HermiT, Pellet, FaCT++) are external tools.

What Inference Actually Does

Given an ontology declaring ex:Manager rdfs:subClassOf ex:Employee, ex:manages owl:inverseOf ex:managedBy, and ex:manages rdf:type owl:TransitiveProperty, plus instance data in which Alice is a Manager who manages Bob, a reasoner entails, without being told, that Alice is also an Employee (and a Person, if ex:Employee is in turn declared a subclass of ex:Person); that Bob is managed by Alice; that transitivity closes the management chain; and, where a property chain axiom is declared, that worksIn relationships propagate across the graph.

None of these triples exist in the original data. All are logically entailed. A SPARQL query against the reasoned graph retrieves them as if explicitly asserted. Neo4j returns only what was written.
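A toy forward-chaining loop shows how such triples come into existence. The sketch below handles only three rule families (subclass, inverse, transitive) over plain Python tuples. The sample data is an assumption: Alice is taken to be a Manager who manages Bob, Bob is taken to manage a hypothetical Carol, and ex:Employee is taken to be a subclass of ex:Person.

```python
def forward_chain(triples, subclass_of, inverse_of, transitive):
    """Apply three rule families until no new triples appear (a fixpoint)."""
    facts = set(triples)
    while True:
        new = set()
        for s, p, o in facts:
            if p == "rdf:type" and o in subclass_of:
                new.add((s, "rdf:type", subclass_of[o]))   # subclass rule
            if p in inverse_of:
                new.add((o, inverse_of[p], s))             # inverse rule
            if p in transitive:
                for s2, p2, o2 in facts:
                    if p2 == p and s2 == o:
                        new.add((s, p, o2))                # transitive rule
        if new <= facts:
            return facts
        facts |= new

asserted = {
    ("ex:Alice", "rdf:type", "ex:Manager"),
    ("ex:Alice", "ex:manages", "ex:Bob"),
    ("ex:Bob", "ex:manages", "ex:Carol"),
}
entailed = forward_chain(
    asserted,
    subclass_of={"ex:Manager": "ex:Employee", "ex:Employee": "ex:Person"},
    inverse_of={"ex:manages": "ex:managedBy"},
    transitive={"ex:manages"},
)
```

Production reasoners use indexed rule engines rather than nested loops, but the fixpoint idea is the same.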

Inference Modes

Materialization (forward chaining): The reasoner runs at load time, computes all entailed triples, and writes them alongside asserted triples. Subsequent SPARQL queries execute against the fully materialized graph with no reasoning overhead at query time. GraphDB’s inference is particularly well-integrated — maintaining a separate inferred layer, tracking provenance of inferred triples, handling incremental materialization on update.

Query rewriting (backward chaining): No additional triples are stored. The SPARQL engine rewrites incoming queries to account for ontological entailments. No storage overhead; ontology changes take effect immediately. Ontop is the canonical example — rewriting SPARQL into SQL incorporating OWL axioms into the rewriting.

SHACL Rules as lightweight inference: Lighter than OWL reasoning, more predictable, useful for the substantial fraction of practical applications that don’t need full DL entailment.
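The difference between the first two modes can be shown on a single subclass axiom. In the sketch below, materialize stores inferred rdf:type triples up front, while rewritten_query leaves the data untouched and widens the query to include subclasses. The class hierarchy and individuals are hypothetical, and both functions are simplified stand-ins for what a store does internally.

```python
SUBCLASS_OF = {"ex:Manager": "ex:Employee"}  # child -> parent, illustrative
ASSERTED = {
    ("ex:Alice", "rdf:type", "ex:Manager"),
    ("ex:Bob", "rdf:type", "ex:Employee"),
}

def instances(facts, cls):
    """Plain lookup: individuals explicitly typed as cls in facts."""
    return {s for s, p, o in facts if p == "rdf:type" and o == cls}

def materialize(facts):
    """Forward chaining: store every inferred rdf:type triple up front."""
    facts = set(facts)
    changed = True
    while changed:
        inferred = {(s, p, SUBCLASS_OF[o]) for s, p, o in facts
                    if p == "rdf:type" and o in SUBCLASS_OF}
        changed = not inferred <= facts
        facts |= inferred
    return facts

def rewritten_query(facts, cls):
    """Backward chaining: expand the query to cls and all of its subclasses."""
    classes, frontier = {cls}, [cls]
    while frontier:
        current = frontier.pop()
        for child, parent in SUBCLASS_OF.items():
            if parent == current and child not in classes:
                classes.add(child)
                frontier.append(child)
    return set().union(*(instances(facts, c) for c in classes))
```

Both routes return the same answer; they differ in where the cost lands: storage and load time for materialization, query time for rewriting.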

Consistency Checking

A full OWL DL reasoner doesn’t just derive new facts; it checks whether a combination of axioms and data is logically coherent. If ex:VegetarianDish owl:disjointWith ex:MeatDish and an individual is asserted to be both, the reasoner reports the knowledge base as inconsistent. This is foundational to biomedical ontology development (disjointness axioms encode real biological exclusivity) and regulatory knowledge graphs (definitional constraints must be formally verified).

Neo4j’s constraint system detects data integrity violations but cannot detect logical inconsistency. The difference: “this value is missing” versus “these stated facts cannot simultaneously be true.”
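The simplest form of such a check, direct disjointness between asserted types, can be sketched in plain Python. A real DL reasoner also catches clashes that only appear after inference; this example covers just the explicit case, and the individual names (ex:dish1, ex:dish2) are hypothetical.

```python
def find_disjointness_clashes(type_assertions, disjoint_pairs):
    """Return (individual, class_a, class_b) for each violated disjointness axiom."""
    types = {}
    for individual, cls in type_assertions:
        types.setdefault(individual, set()).add(cls)
    clashes = []
    for individual, classes in sorted(types.items()):
        for a, b in disjoint_pairs:
            if a in classes and b in classes:
                clashes.append((individual, a, b))
    return clashes
```

Note that nothing here is missing or malformed at the level of individual values; the clash exists only in the combination of facts.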

Graph Algorithms vs. Logical Inference

Neo4j compensates for its absence of logical inference with the Graph Data Science (GDS) library, which provides algorithms such as PageRank, Louvain community detection, shortest path, centrality, and node similarity.

These are analytically powerful but categorically different from logical entailment. PageRank doesn’t tell you what type something is. Louvain community detection doesn’t enforce rdfs:subClassOf hierarchies. The two families of capability answer different questions.

The Practical Limits of OWL Reasoning

Full OWL DL reasoning does not scale to arbitrarily large graphs. The computational complexity is fundamental:

RDFS entailment: billions of triples (materialization)
OWL EL classification: millions of concepts (SNOMED-scale)
OWL RL materialization: hundreds of millions of triples
Full OWL DL consistency check: hundreds of thousands of individuals

Production knowledge graph deployments that require both scale and reasoning typically stratify the problem: offline DL reasoning over the ontology and curated instance data; continuous RL/RDFS rules via the store’s native engine; SPARQL for complex traversal; SHACL Rules for domain-specific derivation.

Federated Reasoning

SPARQL’s SERVICE keyword enables queries spanning multiple SPARQL endpoints simultaneously. Combined with shared ontologies, this allows entailments to be drawn from facts distributed across organizational boundaries — the architectural promise of the Semantic Web. For domain-specific federation within a single enterprise or standards community, it works and has no property graph equivalent.

9. Performance and Scalability

Storage Architecture

Neo4j’s native graph storage:

Neo4j uses a custom binary store format with fixed-size records and direct pointer encoding. Node records are 15 bytes; relationship records are 34 bytes containing pointers to the start node, the end node, and the previous and next relationships in each node’s chain (a doubly-linked list). Because records are fixed-size, a record’s location can be computed directly from its ID, so relationship traversal is O(1) per hop: following a relationship is a read at a known offset, with no index lookup or join. This is index-free adjacency.
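The arithmetic behind fixed-size records is short enough to write down. The sketch below uses the record sizes quoted above; the function name is an assumption, and real store files add headers and paging that are ignored here.

```python
NODE_RECORD_BYTES = 15          # figure quoted in the text
RELATIONSHIP_RECORD_BYTES = 34  # figure quoted in the text

def record_offset(record_id, record_size):
    """Byte offset of a fixed-size record: no index lookup, just arithmetic."""
    return record_id * record_size

# Locating relationship 1000 is a single multiplication, however large the store.
offset = record_offset(1000, RELATIONSHIP_RECORD_BYTES)
```

With variable-size records the same lookup would need an index from ID to position, which is the cost that fixed-size records avoid.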

RDF triple store architectures:

Index-based stores (Jena TDB2, Oxigraph) maintain multiple sorted indexes — SPO, POS, OSP, plus quad variants. Every SPARQL query is decomposed into index scans and joins. Traversal is not O(1) per hop — each step requires an index lookup.

Column-oriented stores (Stardog, some GraphDB configurations) enable vectorized execution and better compression — excellent for analytical SPARQL at the cost of higher per-triple access overhead for simple lookups.

Virtuoso is a hybrid — a relational database storing RDF as rows in a specialized triples table with covering indexes, benefiting from decades of relational query optimizer engineering.

The Index-Free Adjacency Question

Neo4j’s index-free adjacency provides O(1) hop cost, but the number of relationships per node still matters. A node with 10,000 relationships has a chain of 10,000 records; traversing to find a specific neighbor requires scanning up to 10,000 relationship records without type indexing. Neo4j’s traversal advantage is strongest for low-to-moderate degree nodes in sparse graphs.

For graphs with power-law degree distributions, index-based RDF stores can outperform pointer-chasing at hub nodes — a lookup on SPO for a specific subject returns exactly its triples regardless of degree, in time proportional to the result set size.
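The hub-node effect can be sketched by comparing a linear walk over a relationship chain with a binary search over a sorted triple index. Both structures below are simplified plain-Python stand-ins, and the hub with 10,000 relationships and its single EMPLOYS neighbor are hypothetical data.

```python
from bisect import bisect_left

def chain_scan(chain, rel_type, target):
    """Walk a relationship chain; return how many records were visited, or None."""
    for visited, (current_type, neighbor) in enumerate(chain, start=1):
        if current_type == rel_type and neighbor == target:
            return visited
    return None

def index_lookup(sorted_spo, subject, predicate):
    """Binary-search a sorted (s, p, o) list and return the matching objects."""
    i = bisect_left(sorted_spo, (subject, predicate, ""))
    objects = []
    while i < len(sorted_spo) and sorted_spo[i][:2] == (subject, predicate):
        objects.append(sorted_spo[i][2])
        i += 1
    return objects

# A hub with 9,999 FOLLOWS relationships and one EMPLOYS relationship at the end.
chain = [("FOLLOWS", f"n{i}") for i in range(9999)] + [("EMPLOYS", "target")]
spo = sorted([("hub", t, n) for t, n in chain])

visited = chain_scan(chain, "EMPLOYS", "target")
found = index_lookup(spo, "hub", "EMPLOYS")
```

The scan touches every record in the worst case, while the index lookup takes a logarithmic number of comparisons plus the size of the result.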

Query Performance by Workload Class

Short traversal (1-3 hops from a known start): Neo4j wins clearly. Index-free adjacency plus Cypher pattern matching is optimized for exactly this case.

Complex pattern matching: RDF stores with good SPARQL optimizers (Virtuoso, Stardog, GraphDB) are competitive or superior. A complex SPARQL query with multiple triple patterns, FILTER conditions, and aggregations is a multi-way join problem — exactly what column-oriented stores are designed for.

Full-text search: Neo4j has native full-text indexes via Lucene integration, available directly in Cypher as first-class functionality. RDF stores handle FTS unevenly — FILTER(regex(?name, "^Alice")) is not index-backed in standard SPARQL. This is a real operational gap.

Aggregation and analytics: SPARQL with a column-oriented store is competitive with Cypher for aggregation-heavy queries. Neo4j’s GDS projected graph handles graph-specific analytics as native algorithms on an optimized in-memory structure — substantially faster for those specific workloads.

Scalability Models

Neo4j: Primary scaling model is vertical scaling with read replicas using Raft consensus. Single primary handles all writes; read replicas receive transaction log streaming. Neo4j Fabric provides manual application-managed sharding for horizontal scale. Single-node Neo4j reaches approximately 50 billion elements in reported production deployments.

RDF stores: More diverse scaling approaches. Virtuoso and GraphDB reach tens to hundreds of billions of triples on a single node. Distributed RDF options include Amazon Neptune, Stardog cluster mode, and Blazegraph (which powers the Wikidata Query Service at approximately 13 billion triples serving ~100M queries/month on a cluster).

The Fundamental Performance Tradeoff

Neo4j optimizes for traversal locality — starting at a node and expanding outward. The graph feels like a network you can walk.

RDF stores optimize for pattern matching at scale — finding all instances of a complex multi-predicate pattern across a large graph, especially with inference. Better served by index-based architectures with mature query optimizers.

For applications whose primary pattern is “start here, find what’s connected” — Neo4j. For applications whose primary pattern is “find all X where these conditions hold, inferred from this ontology” — RDF.

10. Use Case Mapping

The Decision Axes

Five dimensions drive the architecture decision:

Identity scope — Are entities identified within a single system or across organizational boundaries? Local identity favors Neo4j’s ergonomics; global identity requires IRIs.

Schema stability — Is the domain well-understood and stable, or evolving and open? Stable schemas favor Neo4j’s constraint model; open-world domains favor RDF’s schema-optional flexibility.

Query pattern — Is the primary access pattern local traversal from known seed nodes, or global pattern matching across the full graph?

Semantic depth — Does the system need to derive knowledge it wasn’t told, align with external vocabularies, or reason about its own structure?

Operational context — Who builds and operates the system? Data engineers and application developers favor Neo4j’s ergonomics; ontologists and knowledge engineers favor RDF’s expressiveness.

Strongly Neo4j: Operational Graph Applications

Fraud detection and financial crime networks: The canonical property graph use case — fundamentally a local traversal problem. Given a transaction or account, expand outward through relationship chains to find rings, mules, shared identities, or velocity patterns. Stable closed-world schema; GDS graph algorithms directly applicable; OWA actively unhelpful. Verdict: Neo4j clearly.

Recommendation engines: Collaborative filtering and content-based recommendation are graph traversal problems. The schema is stable (User, Product, Category, Purchase, Rating); real-time recommendation requires sub-100ms latency; GDS similarity algorithms and Node2Vec embeddings are purpose-built. Verdict: Neo4j clearly.

Access control and permissions graphs: Role-based and attribute-based access control — traversal-dominant, low-latency, high-frequency queries on a stable schema. The graph schema is well-defined and closed. Verdict: Neo4j clearly.

Network and IT topology: Impact analysis (“if this switch fails, what services are affected?”), path finding, dependency traversal: traversal queries on a closed-world operational schema. Write performance and constraint enforcement matter. Verdict: Neo4j clearly.

Supply chain and logistics: Supplier relationships, logistics routes, inventory positions — path-finding and impact analysis with GDS shortest path and centrality algorithms directly applicable. Verdict: Neo4j clearly.

Strongly RDF: Knowledge-Intensive Applications

Biomedical and clinical knowledge graphs: Deep ontological hierarchies (SNOMED CT ~350,000 concepts); cross-vocabulary alignment (ICD-10, SNOMED, RxNorm, MeSH, LOINC); OWL EL reasoning as core requirement; provenance-critical data requiring triple-level annotation via sh:reifierShape; open-world integration from multiple source systems. The NCBO BioPortal, the European Bioinformatics Institute’s linked data platform, and multiple pharma knowledge graph platforms are all RDF-based. Verdict: RDF/OWL clearly.

Regulatory compliance and legal knowledge graphs: Hierarchical cross-referential open-world frameworks; semantic alignment across jurisdictions via skos:closeMatch; derived obligations from regulatory text and entity characteristics; named graphs tracking regulatory versions. SHACL 1.2’s constraint vocabulary maps naturally to regulatory requirement structures. Verdict: RDF/OWL clearly.

Enterprise data catalogs and metadata management: DCAT, PROV-O, Dublin Core, Schema.org are all RDF vocabularies. Open-world: new datasets added continuously, schema evolves, relationships recorded without migration. SKOS for controlled vocabulary; owl:equivalentProperty for field name alignment. W3C standards were designed for this problem. Verdict: RDF clearly.

Scientific publishing and research knowledge graphs: Open-world by nature; ORCID provides IRI-based author identity; Semantic Scholar, OpenAlex, and COVID-19 Knowledge Graph are all RDF-based; Schema.org for publications; PROV-O for fact provenance; named graphs by data source. Verdict: RDF clearly.

Cultural heritage and library knowledge graphs: Multilingual metadata via @lang literals; FRBR/RDA bibliographic hierarchies; cross-institutional identity via VIAF and ISNI; long-term preservation semantics via PREMIS. The Library of Congress, Europeana, and BBC programmes ontology are all RDF-based. Verdict: RDF clearly.

Environmental and geospatial knowledge graphs: GeoSPARQL for spatial RDF; W3C SSN/SOSA for sensor data; QUDT for units. Multi-source government and scientific data, inherently open-world. The European Environment Agency and US Geological Survey linked data initiatives are RDF-based. Verdict: RDF for integration; Neo4j competitive for pure spatial query performance.

The Ambiguous Middle

Enterprise knowledge management: Simple data model (people, teams, documents, topics) favors Neo4j ergonomics; semantic interoperability and taxonomy management favor RDF. The right answer depends on whether interoperability or development velocity is load-bearing.

Customer 360 / Master Data Management: RDF’s strengths (IRI-based global identity, owl:sameAs reconciliation, named graphs for source tracking, SHACL for data quality governance) map to the integration challenge; Neo4j’s strengths (fast traversal, constraint enforcement, GDS similarity-based duplicate detection) map to the operational serving layer. Production MDM systems increasingly use a hybrid. Verdict: Genuinely hybrid; or choose based on dominant challenge.

Drug discovery and materials science: High semantic requirements (ChEBI, PubChem, Materials Ontology); integration across databases and research groups; OWL EL for classification; SHACL for experimental data quality; AND high-performance traversal for molecular interaction networks and synthesis path finding. The emerging pattern in pharma is RDF as the knowledge layer with a property graph projection for algorithm execution. Verdict: Hybrid architecture.

Digital twins: RDF for the ontological backbone and static structural model (BRICK, DTDL-compatible); property graph for operational state (rapidly changing, transactionally consistent); graph traversal for impact analysis and spatial containment. This is effectively what Azure Digital Twins implements. Verdict: Hybrid; RDF for structural ontology, property graph for operational state.

The Hybrid Architecture Pattern

Enough use cases land in “hybrid” territory that it deserves explicit treatment. The canonical pattern has three layers:

1. Knowledge layer (RDF): Entities with global IRI identities; ontologies defining the type system; named graphs tracking provenance; SHACL validating data quality; OWL rules materializing inferred relationships. The source of truth for semantic meaning.

2. Materialization pipeline: A transformation layer projecting the RDF knowledge graph into a property graph representation. IRI-identified entities become Neo4j nodes with IRI-as-property; ontological type hierarchies collapse to explicit labels; inferred relationships are materialized as explicit edges. Runs on a schedule (batch) or in response to graph change events (streaming).

3. Operational layer (Neo4j): The traversal-optimized query surface. Sub-100ms latency for applications; GDS algorithms; GraphQL APIs. A read-optimized projection of the knowledge graph.

This architecture accepts operational complexity in exchange for both semantic depth and query performance. It’s appropriate when requirements genuinely span both domains — which is more common than either vendor’s marketing suggests.
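The materialization step in layer 2 can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a production pipeline: triples are plain Python tuples, literals are wrapped as `("lit", value)`, and the function name `project_to_property_graph` is hypothetical. It covers the IRI-as-property and label-collapsing rules only; materializing inferred edges would need a reasoner upstream.

```python
RDF_TYPE = "rdf:type"
SUBCLASS_OF = "rdfs:subClassOf"

def project_to_property_graph(triples):
    """Project (s, p, o) triples into property-graph nodes and edges.

    Objects are either IRIs (plain str) or literals wrapped as ("lit", value).
    """
    # Direct superclass links taken from the ontology triples.
    supers = {}
    for s, p, o in triples:
        if p == SUBCLASS_OF:
            supers.setdefault(s, set()).add(o)

    def ancestors(cls):
        seen, stack = set(), [cls]
        while stack:
            for parent in supers.get(stack.pop(), ()):
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return seen

    nodes, edges = {}, []

    def node(iri):
        return nodes.setdefault(iri, {"iri": iri, "labels": set(), "props": {}})

    for s, p, o in triples:
        if p == SUBCLASS_OF:
            continue  # the hierarchy is flattened into labels, not copied as data
        if p == RDF_TYPE:
            # Collapse the type hierarchy into explicit labels.
            node(s)["labels"] |= {o} | ancestors(o)
        elif isinstance(o, tuple):
            node(s)["props"][p] = o[1]  # literal becomes a node property
        else:
            node(s)
            node(o)
            edges.append((s, p, o))  # IRI-valued triple becomes an edge
    return nodes, edges

# Hypothetical knowledge-layer content.
triples = [
    ("ex:Compound", SUBCLASS_OF, "ex:ChemicalEntity"),
    ("ex:Drug", SUBCLASS_OF, "ex:Compound"),
    ("ex:aspirin", RDF_TYPE, "ex:Drug"),
    ("ex:aspirin", "ex:name", ("lit", "Aspirin")),
    ("ex:aspirin", "ex:inhibits", "ex:cox1"),
    ("ex:cox1", RDF_TYPE, "ex:Protein"),
]
nodes, edges = project_to_property_graph(triples)
```

The node for `ex:aspirin` ends up carrying all three labels from its class chain, which is what lets the operational layer answer "all chemical entities" with a label scan instead of a reasoner call.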

11. Synthesis and Conclusion

Two Definitions of a Graph

Neo4j’s graph is an operational structure — a network of entities and relationships navigated by applications to answer questions about the current state of a domain. Its meaning is in its topology.

RDF’s graph is a knowledge structure — a set of propositions from which further propositions can be derived. Its meaning is in its semantics.
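The contrast can be made concrete with a toy sketch. The first function answers a question purely from stored topology; the second derives propositions that were never written down. Both are simplified stand-ins (plain dictionaries and tuples, two RDFS-style rules, hypothetical names), not how Neo4j or an RDF reasoner is actually implemented.

```python
from collections import deque

def shortest_path(adjacency, start, goal):
    """Topology: breadth-first search over explicitly stored edges."""
    queue, previous = deque([start]), {start: None}
    while queue:
        current = queue.popleft()
        if current == goal:
            path = []
            while current is not None:
                path.append(current)
                current = previous[current]
            return path[::-1]
        for neighbour in adjacency.get(current, ()):
            if neighbour not in previous:
                previous[neighbour] = current
                queue.append(neighbour)
    return None

def infer(triples):
    """Semantics: apply two RDFS-style rules until nothing new appears.

    rule 1: (A subClassOf B), (B subClassOf C)  =>  (A subClassOf C)
    rule 2: (x type A), (A subClassOf B)        =>  (x type B)
    Returns only the derived triples.
    """
    known = set(triples)
    while True:
        sub = {(s, o) for s, p, o in known if p == "subClassOf"}
        new = {(a, "subClassOf", d) for a, b in sub for c, d in sub if b == c}
        new |= {(x, "type", b) for x, p, a in known if p == "type"
                for a2, b in sub if a == a2}
        if new <= known:
            return known - set(triples)
        known |= new

# Operational question: how is alice connected to carol?
adjacency = {"alice": ["bob"], "bob": ["carol"], "carol": []}
path = shortest_path(adjacency, "alice", "carol")

# Knowledge question: what else is true, given what was asserted?
facts = [
    ("Drug", "subClassOf", "Compound"),
    ("Compound", "subClassOf", "ChemicalEntity"),
    ("aspirin", "type", "Drug"),
]
derived = infer(facts)
```

The path exists only because each hop was stored. The derived triples exist because the axioms entail them, even though nobody asserted that aspirin is a chemical entity.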

What RDF 1.2 Changes

Reification is genuinely improved. The old rdf:Statement model was verbose enough to be practically unusable for routine annotation. The condensed syntax and sh:reifierShape make triple-level provenance tractable, closing the most-cited ergonomic gap with Neo4j relationship properties — substantially if not completely.

The ecosystem is still catching up. Reification syntax support varies across stores and serialization formats. SPARQL 1.2 support for querying triple terms is uneven. The specification is ahead of the tooling — a familiar RDF story.

The fundamental model is unchanged. OWA is still OWA. IRIs are still the identity model. SHACL is still advisory by default. RDF 1.2 refines the execution of the model; it doesn’t change the model’s philosophical commitments.

What GQL Changes

ISO GQL (2024) begins to close the standardization gap in the other direction. As GQL adoption matures, Cypher-compatible queries will be portable across GQL-compliant stores — removing portability as an exclusive RDF advantage. Both ecosystems will have standards-based query languages and serialization formats; remaining differences will be genuinely architectural rather than partly incidental to one ecosystem’s proprietary history.

The Convergence Question

Are the two ecosystems converging? At the edges, yes; at the core, no.

At the edges: RDF’s ergonomics are improving (better tooling, more concise syntax, better client libraries). Neo4j’s analytical capabilities are deepening (GDS, graph embeddings, vector search). JSON-LD is making RDF accessible to web developers who would never write Turtle. GQL is making Cypher portable. Hybrid architectures are normalizing.

At the core: RDF’s OWA, IRI-based global identity, and inference model will not be adopted by Neo4j. Neo4j’s transactional constraint enforcement and index-free adjacency traversal will not be replicated in the SPARQL stack. The convergence is happening at the integration layer — not at the model layer.

Decision Framework

Four questions, asked in sequence:

1. Does your data have meaningful existence outside your application? If entities are identified by global IRIs, published in open vocabularies, or shared across organizational boundaries — start with RDF. If your graph exists to serve one application’s requirements — Neo4j is the simpler choice.

2. Does your system need to know things it wasn’t told? If queries need results derived from ontological axioms — RDF/OWL is required. If queries return only what was explicitly written — inference is not a requirement.

3. What is the dominant query pattern? Local traversal from known seed nodes at sub-100ms latency — Neo4j. Global pattern matching, complex multi-predicate joins, federated queries — SPARQL. Both — hybrid.

4. What is the team’s expertise and the timeline? RDF done badly delivers worse outcomes than Neo4j done well. The theoretical power of the RDF stack is only realized when used competently. Honest self-assessment matters.
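The four questions can be read as a small decision procedure. The sketch below is one possible encoding, with assumptions stated plainly: the field names, the `recommend` function, and the rule that question 4 adds a caveat rather than overriding the choice are illustrative interpretations, not part of the framework itself.

```python
from dataclasses import dataclass

@dataclass
class Requirements:
    shared_beyond_one_app: bool  # Q1: global IRIs, open vocabularies, cross-org sharing
    needs_inference: bool        # Q2: results derived from ontological axioms
    local_traversal: bool        # Q3: seed-node traversal at sub-100ms latency
    global_patterns: bool        # Q3: multi-predicate joins, federated queries
    team_knows_rdf: bool         # Q4: honest self-assessment of expertise

def recommend(req):
    """Return (choice, reasons) by asking the four questions in order."""
    reasons = []
    if req.shared_beyond_one_app:
        reasons.append("Q1: entities have meaning outside one application")
    if req.needs_inference:
        reasons.append("Q2: queries need facts derived from axioms")
    if req.global_patterns:
        reasons.append("Q3: global pattern matching or federation dominates")
    leans_rdf = bool(reasons)
    if req.local_traversal:
        reasons.append("Q3: low-latency local traversal dominates")

    if leans_rdf and req.local_traversal:
        choice = "hybrid"
    elif leans_rdf:
        choice = "RDF"
    else:
        choice = "Neo4j"

    # Assumption: Q4 flags a risk instead of changing the recommendation.
    if choice != "Neo4j" and not req.team_knows_rdf:
        reasons.append("Q4: budget for RDF expertise before committing")
    return choice, reasons

# Hypothetical profiles loosely based on the use cases discussed earlier.
fraud = Requirements(False, False, True, False, False)
biomedical = Requirements(True, True, False, True, True)
drug_discovery = Requirements(True, True, True, True, False)
```

Calling `recommend(drug_discovery)` yields a hybrid recommendation with a question 4 caveat attached, which mirrors the article's point that competence, not theoretical power, determines outcomes.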

The Longer Arc

Neo4j and the property graph model represent the maturation of graph databases as operational infrastructure — fast enough, tooling rich enough, developer experience good enough for mainstream application backends. That maturation happened through the 2010s and is now largely complete.

RDF and the semantic web stack represent a thirty-year attempt to build a global, distributed, machine-readable knowledge graph spanning the entire web. That original vision has not been realized at web scale. What has emerged is a collection of domain-specific knowledge graphs — biomedical, regulatory, scientific, cultural — that use the RDF stack for exactly the reasons this article has described.

RDF 1.2 and SHACL 1.2 represent the maturation of that stack for a more modest but more achievable mission: not a web-scale universal knowledge graph, but a principled foundation for knowledge-intensive applications where semantic depth, long-term stability, and cross-domain integration are genuine requirements.

Both ecosystems are maturing. Both have found their domain. The interesting question for the next decade is not which one wins — they serve different needs and will coexist — but how the integration layer between them develops. If the W3C RDF-to-property-graph mapping specification matures, if GQL adoption broadens, and if tooling for hybrid architectures improves, the friction of using both together will decrease. The choice between them will increasingly be about which layer of a system you’re designing, not which database vendor you’re committing to.

Final Assessment

The graph is not the architecture. The epistemology is the architecture.

RDF 1.2 / SHACL 1.2 is the right choice when the primary value of the graph is in its meaning — what things are, how they relate semantically, what can be derived from what is known. The graph is a knowledge asset whose schema evolves with understanding of the domain, whose queries are global, and whose longevity requirements exceed any single application’s lifecycle.

Neo4j / OpenCypher is the right choice when the primary value of the graph is in its connections — finding paths, expanding neighborhoods, detecting rings and clusters. The graph is operational infrastructure whose schema is known, whose queries are local, and whose performance requirements are stringent.

The sophistication in choosing between them lies not in mastering their technical details — though that matters — but in correctly characterizing the problem you’re solving and matching it to the architecture whose assumptions align with your domain’s reality.

A fraud detection system whose architects chose RDF because “knowledge graphs are the future” will be slower, harder to build, and harder to operate than it needed to be. A biomedical knowledge graph whose architects chose Neo4j because “it’s easier” will lack the inference, provenance, and cross-vocabulary alignment the domain requires and will eventually be rebuilt.

Choose the epistemology that matches your domain’s truth.

In Media Res,

Kurt Cagle
The Ontologist

© 2026 Kurt Cagle for The Ontologist Newsletter. This article reflects the state of both ecosystems as of early 2026. Specification and tooling status evolve rapidly; readers should verify the current implementation support for specific features before making an architectural commitment.

Kurt Cagle is a consulting ontologist and the publisher of The Ontologist and The Cagle Report newsletters. He has worked with numerous Fortune 50 companies and with US and European governmental entities on ontology and semantics since the 1990s.
