xmlhack: Schematron: An Interview with Rick Jelliffe

Rick Jelliffe is the developer of the Schematron, a schema language that takes a very different approach from every other XML schema language proposed so far. We got Rick to take a few minutes to explain more about Schematron's power and how you can apply it to problems that aren't easily solved with the other schema approaches.

Q. Schematron really comes from a different angle than the rest of the schema proposals out there. What inspired such a different approach?

It became clear when writing my book The XML & SGML Cookbook: Recipes for Structured Documents, especially the central pages on patterns (which are pretty novel), that DTDs merely provided an "assembler language" to represent them. Even if you make parameter entities into first class objects and call them archetypes, you still are stuck with regular grammars at heart.

When I started my book I wanted to produce something much more like what Liam Quin has independently and subsequently done, but I found that many interesting patterns are not easy to express clearly using parameter entities. I was pretty happy with the book despite that - did you see the review in Leonardo at MIT? The reviewer said he couldn't put it down!

Anyway, I tried lots of different approaches. The "path model" and the "axis model" were two which basically act to allow more powerful right-hand-sides of the BNF production, as it were. They are comparable to Dave Raggett's "assertion grammars", which work by allowing patterns on the left-hand-side of a production.

I wrote a little note about using XSL as an implementation for validation that was well-received. So I guess that Schematron combines path models and assertion grammars, specified using XPaths, implemented through XSL.

Q. What parts of Schematron do you think are innovative?

Schematron takes an innovative approach in a lot of ways.

Schematron rejects the idea that the result of validation is a binary valid/invalid. The purpose of a schema is to make various assertions that should constrain a document; to report on the presence or absence of patterns. So the result of validation may be a complex set of values. Various backends should make use of that set of information, each in their way.
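
As a minimal sketch of validation that reports more than valid/invalid, assuming Schematron's report element alongside assert (the table, cols and row names are only illustrative), a single rule can both state a constraint and note the presence of a pattern:

    <rule context="table">
        <!-- assert: the message is given when the test is false -->
        <assert test="@cols">A table should have a cols attribute.</assert>
        <!-- report: the message is given when the test is true -->
        <report test="not(row)">This table contains no rows.</report>
    </rule>

A backend is then free to treat the two kinds of message differently, for example as errors and as notes.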

Schematron puts natural language descriptions on an equal footing to machine-useable expressions. Diagnosis is just as important as prescription. In a way, the human description is more important than the machine-useable description: the code gives a model for what structures are checked, the text gives the intent. (THETIS has a similar idea.)

Rejecting SGML's powerful insight that regular grammars can be used to describe many useful document constraints, it uses tree-patterns: XPath paths and expressions. In fact, it takes the pattern as its primary abstraction: one is not validating elements or attributes per se but patterns: the relationships between occurrences of information items. (Assertion Grammars has a similar idea.)

Schematron also allows constraints to be checked in parallel. A grammar-approach means that a document must be debugged in branch order: an invalid element in a content model probably prevents us from validating later elements. (François Chahuneau gives interesting rationales for this kind of approach in Beyond the SGML DTD.)

A side-effect of validating in parallel is that we can turn on and off groups of patterns. If I want to validate only tables, I can do so; this allows a division of labour working on a document. It also allows incomplete documents. Some early SGML tools did not allow you to save an invalid document: this was a real pain if the minimum document allowed by a DTD was still fairly complex. Dynamic schemas in which certain patterns are active at various stages in a work flow, or where certain patterns belong to a particular historical variant of a document are also possible. This is pretty organic rather than authoritarian.
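
A rough sketch of such grouping, assuming rules are collected into named pattern elements inside a schema element (the pattern names here are hypothetical, and how a tool lets the user switch a pattern on or off is left to the implementation):

    <schema>
        <!-- Checked only when the user asks for table validation -->
        <pattern name="Tables">
            <rule context="table">
                <assert test="@cols">A table should have a cols attribute.</assert>
            </rule>
        </pattern>
        <!-- An independent group: it can be checked even if Tables is off -->
        <pattern name="Sizes">
            <rule context="*[@size]">
                <assert test="@unit">An element with a size attribute should have a unit attribute.</assert>
            </rule>
        </pattern>
    </schema>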

Schematron tries to cover an area which definitional schemas miss: the usage schema. Take the W3C Web Accessibility Initiative (WAI) guidelines for HTML as an example: they are best-practise guidelines to make sure your HTML pages are not creating disabilities for people with various limited capabilities, simple things like making sure that images have alt attributes for blind readers. Is WAI a schema? Yes: one could even make a DTD for most of it, a stricter form of HTML. Does WAI define a separate language to HTML, in the sense of changing the way any element should be interpreted or allowing instances to be created that are not 100% valid HTML? No, not by the intent of its creators: it is a profile. Schematron provides a way for declaring and testing these kinds of usage schemas.
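
The alt attribute guideline mentioned above could be sketched as a usage rule like the following. This is an illustration only, not an official WAI schema, and the message wording is an assumption:

    <rule context="img">
        <!-- The page stays valid HTML either way; this checks usage only -->
        <assert test="@alt">An img element should have an alt attribute giving a text alternative.</assert>
    </rule>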

Schematron tries to be as easily implementable as possible, and implementable on top of XSL in particular. Small is Beautiful. (One could build other Schematron-like tools on top of another schema or query language, such as Strudel or OmniMark, for instance.) I chose XSL because it allowed the most direct access to XPaths. In a way, I wanted to hide XSL as much as possible and expose as much of XPath as possible. If XPath grows, Schematron inherits that change. And, of course, it makes development documentation and specification very easy: users can look at the XPath spec and implementers can look at the XSL spec! There is a very short learning curve required to start using it.

Because we can use XPath's id() and key() functions, we not only can validate trees but graphs. And, using document(), not only local graphs but documents made from bits all over the web to create strongly-typed graphs and links! (I think ISO HyTime also allowed something like this.)
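
A minimal sketch of a cross-document check using document(), assuming a hypothetical chapterRef element whose href attribute names another XML document (none of these names come from a real vocabulary, and a processor may treat an unretrievable document as an error rather than as an empty result):

    <rule context="chapterRef[@href]">
        <!-- Fetch the referenced document and look at its top-level element -->
        <assert test="name(document(@href)/*) = 'chapter'">A chapterRef should point to a document whose top-level element is chapter.</assert>
    </rule>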

The XPath syntax is very terse, making it as convenient to use as the regular expressions in DTD content models. I think it is important, for a language pitched at usage rather than definition, for users to be able to email each other with one-line examples or suggestions, which is what people do with content models and with BNF productions. And, of course, it is important for download times. I like the idea of having an XML framework and special purpose little language inside: XPaths and URIs are good examples of this too I suppose.

Finally, it turns out to be very simple to make a backend, a Schematron implementation, that performs automated markup based on patterns. Perhaps to generate RDF or XLink indexes. The creator of the schema only needs to know the source-document structure; they don't need to learn XSL or RDF/XLink in order to use the tool. I have a proof-of-concept demo of this online at the site now.

Q. While Schematron implementation seems largely focused on XSL, you've made a point of providing Omnimark code as well. How tied to XSL do you find Schematron to be? Or is it just coincidence that XSL is the first tool really bonded to XPath?

I prototyped in OmniMark because I love OmniMark and I am comfortable with it; also, getting used to writing an XSL script that generates an XSL script is disconcerting at first. You have to use two different prefixes for the same namespace, like xsl and axsl, which does not provide much visual information to help reading. In any case, I think it is good practise to prototype in a different language to the implementation language, especially for a small project like Schematron: it focuses you on user-requirements. The OmniMark prototype is merely the front-end -- you still need XSL.
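
To illustrate the two-prefix arrangement, here is a heavily simplified sketch of a stylesheet that turns rules into a validating stylesheet. It is not the actual Schematron implementation; it assumes XSLT 1.0's xsl:namespace-alias, and the alias namespace URI is an arbitrary choice:

    <xsl:stylesheet version="1.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
        xmlns:axsl="http://www.w3.org/1999/XSL/TransformAlias">

      <!-- Elements written as axsl:... are output as xsl:... -->
      <xsl:namespace-alias stylesheet-prefix="axsl" result-prefix="xsl"/>

      <!-- The schema as a whole becomes one generated stylesheet -->
      <xsl:template match="/">
        <axsl:stylesheet version="1.0">
          <xsl:apply-templates select="//rule"/>
        </axsl:stylesheet>
      </xsl:template>

      <!-- Each rule becomes a template matching the rule's context -->
      <xsl:template match="rule">
        <axsl:template match="{@context}">
          <xsl:apply-templates select="assert"/>
        </axsl:template>
      </xsl:template>

      <!-- Each assert becomes a check that emits its text when the test fails -->
      <xsl:template match="assert">
        <axsl:if test="not({@test})">
          <axsl:message><xsl:value-of select="normalize-space(.)"/></axsl:message>
        </axsl:if>
      </xsl:template>
    </xsl:stylesheet>

The xsl elements run when the schema is compiled, while the axsl elements are merely copied into the output, where they run later against the document being validated. A real implementation would also need to handle patterns, reports and traversal of the whole document.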

Q. What kind of patterns can Schematron do that a grammar-based schema cannot do?

Newbies to XML always ask "How can I constrain that if the user specifies one attribute (e.g., size) the user also should specify another (e.g. unit)?" We can do that:

    <rule context="*[@size]">
        <assert test="@unit"
        >An element with a size attribute should have a unit attribute</assert>
    </rule>

CALS tables are another example: they have an attribute which provides the column count. With a Schematron schema, you can check that the number is correct. Something like:

    <rule context="row">
        <assert test="count(entry) &lt;= parent::table/attribute::cols"
        >The number of columns should not exceed the value of the cols
        attribute on the element table.</assert>
    </rule>

That is an interesting example, because it shows we can give a diagnostic message about "columns" even though the markup just has rows and cells.

Another good example just happened this week -- a teacher wants students to mark up newspaper stories in TEI and with special elements "who", "where", "what", "when" etc. These elements can appear anywhere in a document, but there must be at least one of each in every document. You cannot use a simple DTD for that. In Schematron, though:

    <rule context="/*">
        <assert test="//who"
        >A document should have at least one who element</assert>
        <assert test="//where"
        >A document should have at least one where element</assert>
    </rule>

Finally, suppose I have an attribute "sausageRef" that must point to an element "Sausage" or an element "Wuerst". A Schematron schema lets me traverse an IDREF (or anything declarable as a key) and check the type (or anything surrounding it) at the other end.

    <rule context="*[@sausageRef]">
        <assert test="(name(id(@sausageRef)) = 'Sausage') or
                      (name(id(@sausageRef)) = 'Wuerst')"
        >A sausageRef attribute should point to some kind of sausage.</assert>
    </rule>

I have a suspicion that all Schematron constraints can be specified by a set of transformation/grammar pairs, applied in parallel. This would be even more powerful than the Schematron; but I think it would not be a schema system that can be used by people in general. XML was successful because it was small; I think that is a good lesson even for schema languages. In the Schematron case, we get power from being a nice small layer on a nice small path language and a nice small expression language (i.e., XPath).

XPaths are just fantastic -- they made this difficult thing I have been working on so long into an almost trivial exercise. James Clark and the others should feel proud!

Q. So are Schematron schemas replacements for DTDs? Are they competitive with XML Schemas?

You can see that, judged by most of the criteria above, XML Schemas are merely DTDs on steroids: I guess that makes Schematron schemas DTDs on acid -- documents have all these crazy patterns and the poor user must try to make sense of them.

The XML Schema Working Group at W3C is trying hard to make sure they fit in with the SQL/Java/DTD worlds. The only innovation they are allowing themselves is in the areas of extensibility and refinement. But even with their high-level baby, they know that sometimes there are structures that require a different approach. So Schematron can be useful even with XML Schemas or DTDs.

Perhaps the area of competition is more informal documents, where you want just certain bits to be valid. And, as I said, XML Schema does not currently address the issue of usage schemas. I have a set of slides comparing DTDs and tree-patterns called From Grammars to The Schematron which analyzes the modeling capabilities of both.

Q. What's in Schematron's future?

I have no idea what will happen. We had an ecstatic response from many people who first tried it: amazing. People seem to really like the idea and find it pretty easy. Also, they seem to think it is fairly hacker-sized: you can make a useful Schematron implementation in a weekend for some particular purpose. In the first week of release, David Carlisle contributed a viewer for it, and Miloslav Nic contributed a set of tutorials.

Q. How can developers integrate Schematron with their applications?

A Schematron validator/debugger can be built easily on top of any system that has XSL available. So it would be great if editing software started to make it available. The user could be given a dialog box in which they can tick which patterns they want to validate for (or against!) at that time; they may tick "tables" and then they will only get the table-related messages. And validation can be done as a thread, so that when a pattern is found incomplete, the GUI flags it in some way, similar to the way Word does background spell-checking now. Perhaps we could say that an XML Schema is more suited for creating wizards, while a Schematron schema may be more useful for creating friendly debuggers: I wonder.

We still need a standard way for documents to locate schemas in general and Schematron schemas in particular. Schemas using tree-patterns seem a really neat way to schematize some important classes of structures.

Q. Why the name?

Well, at least it is better than "The Pink Schematron" which is what it was originally! An antidote to the current diarrhea of acronyms, contractions and slick, grandiose names. Anyone who has seen the movie Barbarella may get a chuckle from it, but of course it is a far from frivolous language.