Rick
Jelliffe is the developer of the Schematron,
a schema language
that takes a very different approach from every
other XML schema language proposed so far. We got
Rick
to take a few minutes to explain more about
Schematron's power and how you can apply it to
problems that
aren't easily solved with the other schema
approaches.
Q. Schematron really comes from a
different angle than the rest of the
schema proposals out there. What inspired such a
different approach?
It became clear when writing my book The
XML & SGML Cookbook: Recipes for
Structured Documents, especially the
central pages on patterns (which are
pretty novel), that DTDs merely provided an
"assembler language" to
represent them. Even if you make parameter
entities into first class
objects and call them archetypes, you still are
stuck with regular grammars
at heart.
When I started my book I wanted to produce
something much more like what
Liam Quin has independently and subsequently done,
but I found that many
interesting patterns are not easy to express
using parameter entities. I
was pretty happy with the book despite that - did
you see the review
in
Leonardo at MIT? The reviewer said he
couldn't put it down!
Anyway, I tried lots of different approaches. The
"path
model" and the "axis
model" were two which basically act to allow
more powerful right-hand-sides
of the BNF production, as it were. They are
comparable to Dave Raggett's
"assertion
grammars" which work by
allowing patterns on the
left-hand-side of a production.
I wrote a little note about using XSL as an
implementation for validation
that was well-received. So I guess that Schematron
combines path models and
assertion grammars, specified using XPaths,
implemented through XSL.
Q. What parts of Schematron do you
think are innovative?
Schematron takes an innovative approach in a lot
of ways.
Schematron rejects the idea that the result of
validation is a binary valid/invalid. The purpose
of a schema is
to make various assertions that should constrain a
document; to report on the presence or absence of
patterns. So the result of validation may be a
complex set of values. Various backends should
make use of
that set of information, each in their way.
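As a rough illustration of a non-binary result, a single rule might carry both an assert (a message produced when an expected pattern is absent) and a report (a message produced when a pattern is present). The chapter, title, and section element names here are hypothetical, chosen only for the sketch:

```xml
<!-- Sketch only: element names are invented for illustration. -->
<rule context="chapter">
<!-- assert: the message appears when the test is false -->
<assert test="title"
>A chapter should have a title.</assert>
<!-- report: the message appears when the test is true -->
<report test="count(section) > 10"
>This chapter has more than ten sections.</report>
</rule>
```

A backend could treat the two kinds of message differently, for example by showing failed assertions as errors and reports as notes.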
Schematron puts natural language descriptions on
an equal footing to machine-useable expressions.
Diagnosis
is just as important as prescription. In a way,
the human description is more important
than the
machine-useable description: the code gives a
model for what structures are checked, the text
gives the
intent. (THETIS
has a
similar idea.)
Rejecting SGML's powerful insight that regular
grammars can be used to describe many useful
document
constraints, it uses tree-patterns: XPath paths
and expressions. In fact, it takes the pattern as
its primary
abstraction: one is not validating elements or
attributes per se but patterns: the relationships
between
occurrences of information items. (Assertion
Grammars has a similar idea.)
Schematron also allows constraints to be checked
in parallel. A grammar-approach means that a
document
must be debugged in branch order: an invalid
element in a content model probably prevents us
from
validating
later elements. (François Chahuneau gives
interesting rationales for this kind of approach
in Beyond
the SGML DTD.)
A side-effect of validating in parallel is that we
can turn on and off groups of patterns. If I want
to validate
only tables, I can do so; this allows a division
of labour working on a document. It also allows
incomplete
documents. Some early SGML tools did not allow you
to save an invalid document: this was a real pain
if
the
minimum document allowed by a DTD was still fairly
complex. Dynamic schemas in which certain patterns
are active at various stages in a work flow, or
where certain patterns belong to a particular
historical variant
of a document are also possible. This is pretty
organic rather than authoritarian.
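One way to picture switching groups of patterns on and off is the phase mechanism of the later ISO Schematron standard. This is a sketch under that assumption: the syntax may differ from the Schematron release discussed here, and the pattern ids and element names are made up for the example.

```xml
<!-- Sketch only: uses ISO Schematron phases; ids and element
     names are hypothetical. -->
<schema xmlns="http://purl.oclc.org/dsdl/schematron">
  <!-- While drafting, check only the table constraints -->
  <phase id="tables-only">
    <active pattern="tables"/>
  </phase>
  <!-- Before publication, check everything -->
  <phase id="final-draft">
    <active pattern="tables"/>
    <active pattern="metadata"/>
  </phase>

  <pattern id="tables">
    <rule context="table">
      <assert test="@cols">A table should declare its column count.</assert>
    </rule>
  </pattern>

  <pattern id="metadata">
    <rule context="/*">
      <assert test="title">A finished document should have a title.</assert>
    </rule>
  </pattern>
</schema>
```

Under this reading, an incomplete document can pass the "tables-only" phase long before it would pass "final-draft".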
Schematron tries to cover an area which
definitional schemas miss: the
usage
schema. Take the W3C Web
Accessibility
Initiative (WAI) guidelines for HTML as an
example: they are best-practise guidelines to make
sure
your
HTML pages are not creating disabilities for
people with various limited capabilities, simple
things like
making
sure that images have alt attributes for blind
readers. Is WAI a schema? Yes: one could even make
a DTD
for most of it, a stricter form of HTML. Does WAI
define a separate language to HTML, in the sense
of
changing the way any element should be
interpreted or allowing instances to be created
that are not 100%
valid HTML? No, not by the intent of its
creators: it is a profile. Schematron provides a
way for declaring
and testing these kinds of usage schemas.
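The alt-attribute guideline mentioned above could be sketched as a usage rule like the following. It assumes lower-case img elements with no namespace, and it covers only this one guideline rather than the WAI guidelines as a whole:

```xml
<!-- Sketch only: assumes un-namespaced, lower-case HTML markup. -->
<rule context="img">
<assert test="@alt"
>An image should have an alt attribute so that readers who cannot
see it still get a description.</assert>
</rule>
```

The document remains ordinary HTML; the rule only checks how the language is being used.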
Schematron tries to be as easily implementable as
possible, and implementable on top of XSL in
particular.
Small is Beautiful. (One could build other
Schematron-like systems on top of another schema or query
language, such
as Strudel or OmniMark, for
instance.) I chose XSL because it
allowed the most direct access to XPaths. In a
way, I wanted to hide XSL as much as possible and
expose
as much of XPath as possible. If XPath grows,
Schematron inherits that change. And, of course,
it makes
development documentation and specification very
easy: users can look at the XPath spec and
implementers
can look at the XSL spec! There is a very short
learning curve required to start using it.
Because we can use XPath's id() function and XSLT's key()
function, we can validate not only trees but
graphs. And, using
document(), not only local graphs but documents
made from bits all over the web to create
strongly-typed
graphs and links! (I think ISO HyTime also
allowed something like this.)
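A minimal sketch of a cross-document check, assuming an XSLT-based implementation (document() is an XSLT function) and hypothetical partRef and part elements:

```xml
<!-- Sketch only: partRef, href and part are invented names.
     document(@href) loads the document that the attribute names. -->
<rule context="partRef[@href]">
<assert test="document(@href)/part"
>A partRef should point to a document whose root element is part.</assert>
</rule>
```

The test succeeds only when the referenced document can be loaded and its root element has the expected name, so the link is checked at both ends.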
The XPath syntax is very terse, making it as
convenient to use as the regular expressions in
DTD content
models. I think it is important, for a language
pitched at usage rather than definition, for users
to be able to
email each other with one-line examples or
suggestions, which is what people do with content
models and
with BNF productions. And, of course, it is
important for download times. I like the idea of
having an XML
framework and special purpose little language
inside: XPaths and URIs are good examples of this
too I
suppose.
Finally, it turns out to be very simple to make a
backend, a Schematron implementation, that
performs
automated markup based on patterns. Perhaps to
generate RDF or XLink indexes. The creator of the
schema only needs to know the source-document
structure; they don't need to learn XSL or
RDF/XLink in
order to use the tool. I have a proof-of-concept
demo of this online at the site now.
Q. While Schematron implementation
seems largely focused on XSL, you've made a point
of
providing Omnimark code as well. How tied to XSL
do you find Schematron to be? Or is it just
coincidence that XSL is the first tool really
bonded to XPath?
I prototyped in OmniMark
because I love OmniMark and I am
comfortable with
it; also, getting used to writing an XSL script
that generates an XSL
script is disconcerting at first. You have to use
two different prefixes for
the same namespace, like xsl and axsl, which does
not provide much visual
information to help reading. In any case, I think
it is good practise to
prototype in a different language to the
implementation language, especially
for a small project like Schematron: it focuses you
on
user-requirements. The OmniMark prototype is
merely the front-end -- you still need XSL.
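To give a feel for an XSL script that generates an XSL script, here is a minimal sketch using XSLT's namespace-alias feature. It is not the actual Schematron implementation: the alias namespace URI is an arbitrary placeholder, and the generated stylesheet handles only rule and assert.

```xml
<!-- Sketch only: a tiny meta-stylesheet, not the real Schematron code. -->
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:axsl="http://www.w3.org/1999/XSL/TransformAlias">

  <!-- axsl:* elements written here come out as xsl:* in the result -->
  <xsl:namespace-alias stylesheet-prefix="axsl" result-prefix="xsl"/>

  <!-- Wrap the generated templates in a stylesheet element -->
  <xsl:template match="/">
    <axsl:stylesheet version="1.0">
      <xsl:apply-templates select="//rule"/>
    </axsl:stylesheet>
  </xsl:template>

  <!-- Each Schematron rule becomes one template in the validator -->
  <xsl:template match="rule">
    <axsl:template match="{@context}">
      <xsl:apply-templates select="assert"/>
      <axsl:apply-templates/>
    </axsl:template>
  </xsl:template>

  <!-- Each assert becomes a test that emits its text on failure -->
  <xsl:template match="assert">
    <axsl:if test="not({@test})">
      <axsl:message><xsl:value-of select="."/></axsl:message>
    </axsl:if>
  </xsl:template>
</xsl:stylesheet>
```

Reading it shows the difficulty described above: xsl elements run now, axsl elements run later, and the two look almost identical on the page.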
Q. What kind of patterns can Schematron do
that a grammar-based schema
cannot do?
Newbies to XML always ask "How can I constrain
that if the user specifies
one attribute (e.g., size) the user also should
specify another (e.g.
unit)?" We can do that:
<rule context="*[@size]">
<assert test="@unit"
>An element with a size attribute should
have a unit
attribute</assert>
</rule>
CALS tables are another example: they have an
attribute which provides the
column count. With a Schematron schema, you can
check that the number is
correct. Something like:
<rule context="row">
<assert test="count(entry) &lt;=
parent::table/attribute::cols"
>The number of columns should not exceed
the value of the cols
attribute on the element
table.</assert>
</rule>
That is an interesting example, because
it shows we can give a diagnostic
message about "columns" even though the markup
just has rows and cells.
Another good example just happened this week -- a
teacher wants students to
mark up newspaper stories in TEI and with special
elements "who", "where",
"what", "when" etc. These elements can appear
anywhere in a document, but
there must be at least one of each in every
document. You cannot use a
simple DTD for that. In Schematron,
though:
<rule context="/*">
<assert test="//who"
>A document should have at least one who
element</assert>
<assert test="//where"
>A document should have at least one where
element</assert>
</rule>
Finally, suppose I have an attribute "sausageRef" that
must point to an element
"Sausage" or an element "Wuerst". A Schematron
schema lets me traverse an
IDREF (or anything declarable as a key) and check
the type (or anything
surrounding it) at the other end.
<rule context="*[@sausageRef]">
<assert test="(name(id(@sausageRef)) =
'Sausage') or
(name(id(@sausageRef)) = 'Wuerst')"
>A sausageRef attribute should point to some
kind of sausage.</assert>
</rule>
I have a suspicion that all Schematron constraints
can be
specified by a set of transformation/grammar
pairs, applied in parallel.
This would be even more powerful than the
Schematron; but I think it would
not be a schema system that can be used by general
people. XML was
successful because it was small; I think that is a
good lesson even for
schema languages. In the Schematron case, we get
power from being a nice
small layer on a nice small path language and a
nice small expression
language (i.e., XPath).
XPaths are just fantastic -- they made this
difficult thing I have been
working on so long into an almost trivial
exercise. James Clark and the
others should feel proud!
Q. So are Schematron schemas replacements
for DTDs? Are they competitive with XML
Schemas?
You can see that, judged by most of the criteria
above, XML Schemas are merely DTDs on steroids: I
guess
that makes Schematron schemas DTDs on
acid--documents have all these crazy patterns and
the poor user
must try to make sense of them.
The XML Schema Working Group at W3C is trying hard
to make sure they fit in with the SQL/Java/DTD
worlds. The only innovation they are allowing
themselves is in the areas of extensibility and
refinement. But
even with their high-level baby, they know that
sometimes there are structures that require a
different
approach. So Schematron can be useful even with
XML Schemas or DTDs.
Perhaps the area of competition is
more informal documents, where you want just
certain bits to
be valid. And, as I said, XML Schema does not
currently address the issue of usage schemas. I
have a set
of slides comparing DTDs and tree-patterns called
From Grammars to The
Schematron which analyzes the modeling
capabilities of both.
Q. What's in Schematron's future?
I have no idea what will happen. We had an
ecstatic response from many people who first tried
it: amazing.
People seem to really like the idea and find it
pretty easy. Also, they seem to think it is
fairly hacker-sized:
you can make a useful schematron implementation in
a weekend for some particular purpose. In the
first
week of release David Carlisle contributed a
viewer for it, and Miloslav Nic contributed a set
of tutorials.
Q. How can developers integrate Schematron
with their applications?
A Schematron validator/debugger can be built
easily on top of any system that has XSL
available. So it
would be great if editing software started to make
it available. The user could be given a dialog box
in which
they can tick which patterns they want to validate
for (or against!) at that time; they may tick
"tables" and
then
they will only get the table-related messages.
And validation can be done as a thread, so that
when a pattern
is found incomplete, the GUI flags it in some way,
similar to the way Word does background
spell-checking
now. Perhaps we could say that an XML Schema is
more suited for creating wizards, while a
Schematron
schema may be more useful for creating friendly
debuggers: I wonder.
We still need a standard way for documents to
locate schemas in general and Schematron schemas
in
particular. Schemas using tree-patterns seem a
really neat way to schematize some important
classes of
structures.
Q. Why the name?
Well, at least it is better than "The Pink
Schematron" which is what it was
originally! An antidote to the current diarrhea
of acronyms,
contractions and slick, grandiose names. Anyone
who has seen the movie Barbarella may get a
chuckle
from it, but of course it is a far from frivolous
language.