Community
XML-DEV thread watch: the role of class and type in XML
08:03, 24 Dec 2002 UTC | Uche Ogbuji

Discussion of the role of data typing and other artefacts from programming languages and classical DBMS never seems to end on XML-DEV. Simon St.Laurent fired up the debate again with a twist by linking Uche Ogbuji's article XML class warfare. The article made whimsical reference to the debate as a "class war" between "bohemians" and "gentry". A long and very interesting sequence of threads ensued.

There were a few posts concurring with the technical points in the article, including this one by Walter Perry:

I would actually go beyond your point:

"Certainly, if you want your data to outlast your code, and to be more
portable to unforeseen, future uses, you would do well to lower your own
level of class consciousness. Strong data typing in XML tends to pigeonhole
data to specific tools, environments and situations. This often raises the
total cost of managing that data."

It is not just over time, but right now, between utterly dissimilar systems
whose only nexus is the internetwork, that communication is possible only by
instantiating a common syntax into locally idiosyncratic semantics at each
end of the conversation.

This is not the first example of the bohemians perfecting as art the very
tools which the science of the gentry lacked, but would require to achieve
ubiquity. In the astronomy, engineering, and chemistry of the Renaissance, key
pieces are supplied by alchemists, painters, dyers or poisoners. Consider
e.g. Palladio's sketches of ancient Roman construction, instantiated so
differently (and in the instantiation demanding entirely new mechanical
invention) in the local understandings of Rome, London and Washington, DC.
What, if anything, would have been built had he produced engineering field
office quality blueprints?

Joe English pointed out how some of the problems with schema dependence correspond to similar problems in DTD systems.

The way I see it, the problem is not XML annotated with data
types _per se_; rather it's the assumption by schema designers
that those data types will be available to all processes.
If the designer isn't careful, this assumption can easily
lead to document types that can only be processed by WXS-aware
tools.

Jonathan has asked (repeatedly) for concrete examples of how
typed XML causes interop problems.  I don't have any such
examples since I haven't made that _particular_ mistake;
but in the general theme of overreliance on schema information
I've goofed many times.

For instance: I used to be a big fan of keying processing
off of #FIXED attributes in the DTD.  This worked really
well in the SGML world, but with XML it limits you to using
DTD-aware processors (and making sure they can find the DTD,
even when disconnected from the Web, et cetera.)  This
led to so many headaches that I now use different techniques
to do architectural forms.  Lesson learned.

It seems to me that the "International Purchase Order" schema
in section 4 of the W3C XML Schema Primer [1] comes close to the
edge of that slippery slope.  While _most_ of it can be processed
by WXS-oblivious tools, there are some tasks that can't be done
(or can't be done easily) without a type-annotated PSVI and full
schema information.  For instance: write a program that extracts
all of the comments from a purchase order (see the schema fragment
in section 4.6). Now you could do this with an XSL transform that
extracted all the 'ipo:shipComment' and 'ipo:customerComment'
elements (since those are the only two elements defined to
be of that type), but that's fragile; if the schema is extended
to include other comment types, the transform will silently
break.  Or similarly, find all the Vehicles in a document conforming
to the schema in section 4.7.  Or just about any of the tasks
in section 1.9 of the W3C XML Query Use Cases document [2] --
because of the way the schema is designed, XQuery is probably
the *only* tool that can perform these tasks.

But the key issue is: if you want your data to outlast your code,
don't encode it in a way that's too tightly bound to any particular
process.   Peeking at data types in the PSVI to make processing
easier is not necessarily a problem; schema designs that *require* 
processes to do so are.  This is just one instance of the general theme.
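Joe English's fragility point can be sketched in a few lines. The document, element names, and namespace below are invented for illustration (they are not the actual ipo schema from the Primer); the hard-coded name test quietly ignores any comment element added by a later schema extension:

```python
import xml.etree.ElementTree as ET

# Invented instance loosely modeled on a purchase order; the namespace
# and element names are illustrative only.
doc = ET.fromstring("""\
<ipo:purchaseOrder xmlns:ipo="http://example.com/ipo">
  <ipo:shipComment>Ship by air</ipo:shipComment>
  <ipo:customerComment>Gift wrap, please</ipo:customerComment>
  <ipo:vendorComment>Added by a later schema extension</ipo:vendorComment>
</ipo:purchaseOrder>""")

NS = "{http://example.com/ipo}"

# Name-based extraction hard-codes the two elements the original schema
# declared with the comment type.  The later vendorComment extension is
# silently missed: no error is raised, the results are just incomplete.
comments = [e.text for e in doc.iter()
            if e.tag in (NS + "shipComment", NS + "customerComment")]
print(comments)
```

The type-aware alternative, matching on the elements' declared type rather than their names, would catch the extension element as well, but only in a WXS-aware processor with the schema in hand, which is exactly the dependence under discussion.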

Of course, the provocative idea of class warfare, even raised tongue-in-cheek, ensured rapid controversy. Jonathan Robie requested further illustration of the problems associated with data typing:

I still don't understand this point. Could someone please illustrate with 
an example that uses several kinds of software processing data that uses 
datatypes, and showing how the presence of datatypes in that data prevents 
it from being used except in "specific tools, environments, and situations"?

John Cowan responded:

It isn't the mere presence of datatype meta-information, of course.  It's the
implicit requirement not to use the information in ways that contravene the
datatype.

Within the back and forth of the discussion that followed, Eric van der Vlist pointed out his article Can XML Be The Same After W3C XML Schema?.

Perry also pursued the matter with an example, which is best read in its entirety along with the follow-ups by Robie and others.

Rick Jelliffe offered a somewhat different perspective on the debate. Rich Salz offered an example illustrating the difficulty of providing truly portable data types. This led to an entire subthread discussing such portability issues, and some of the problems with WXS data types.

Norm Walsh said he doesn't see the problems brought about by strong data typing:

Taking the specific case of XQuery/XPath2/XSLT2, I'm not sure I see
the problem. Given

<baz>
<foo n="1">Network Drive</foo>
<bar moo="0902">01803</bar>
</baz>

I might write a template that matches those elements in a purely
lexical way.

  <xsl:template match="baz">...</xsl:template>

I might also write a template that matches them based on some data
type (forgive the pseudo-XSL, the standards are still fluid as you
pointed out):

  <xsl:template match="*[type() = my:AddressType]">...</xsl:template>

The latter case seems to be exactly what Walter Perry described in an
earlier message on this thread: a particular view of the data in a local
context (I wrote the query that way because I expect to interpret the
darned thing as an address).

Imposing my view on the data for my query doesn't seem to do any harm.

Or are you concerned that I'm going to slurp up the XML, interpret it
according to my local context, shove it into some database somewhere
with those interpretations and thereafter be unable to view it with a
different local context?

Some people are going to do that, I suppose, to bend XML to the will
of their databases. But I'm not going to do that (that would be
stupid, IMHO, but I'm not trying to build a system that processes a
zillion purchase orders a minute, either). I haven't perceived anyone
threatening to force me to do that. Am I insufficiently paranoid?

Ogbuji then explained that Walsh's example in itself doesn't illustrate the problem.

I think you may have missed the point, because as far as I see it, you're 
using data types in a very modular fashion: i.e. at the precise point in 
processing where it is immediately useful.

I think that no one objects to such use of data types.

The problem I bring up is that in their very tight coupling to text-based XML 
processing specs, that WXSDT end up pretty much imposing implicit data typing 
even when it is not needed, and when it can hamper the processing.  In order 
to use these new data typing "wizards" (as Jonathan calls them, seemingly 
deadpan), you have to build these data types into the schema or the instance, 
which means they now affect all XPath, XSLT and XQuery operations on them.  
This, I think is where the brittleness emerges.

As I've said many times I have no problem with data typing qua data types.  I 
do object to

* The bias towards static types

* The lack of modularity in W3C efforts to incorporate data typing into XML 
technologies

Walsh pursued:

| I think that no one objects to such use of data types.

Who or what is preventing you from using them that way? (I'm really
not trying to be argumentative, I think I'm a bohemian myself, and I
sometimes think the gentry go a little bit off the rails, but I don't
lose sleep over it because I don't see how I'm being threatened. As I
said before, maybe I'm insufficiently paranoid.)

| The problem I bring up is that in their very tight coupling to text-based XML
| processing specs, that WXSDT end up pretty much imposing implicit data typing
| even when it is not needed, and when it can hamper the processing.

Where is the tight coupling? Schema import into a stylesheet or XML
Query will bind them together, but I think that's an instance of
modular use. That doesn't bind my documents to any particular schema
(except perhaps when I run a particular query, naturally).

| In order 
| to use these new data typing "wizards" (as Jonathan calls them, seemingly 
| deadpan), you have to build these data types into the schema or the instance,

Building data types into the schema doesn't seem harmful. That's the
point of a schema, is it not?

I'm not sure what you mean by building the data types into the
instance. If you mean using xsi:type, then I agree completely that
it's brittle. And wrong. And I'll quickly discard any tool that does
it.

| which means they now affect all XPath, XSLT and XQuery operations on them.  
| This, I think is where the brittleness emerges.

Sometimes I write stylesheets that are entirely data type agnostic,
but not really very often. I don't see how building data typing into a
particular stylesheet or query is harmful.

| * The lack of modularity in W3C efforts to incorporate data typing into XML 
| technologies

Do you mean because they're tied more-or-less exclusively to WXS? Or
do you mean something else?
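For readers who haven't met it, xsi:type is how a type gets built into the instance itself: the document asserts its own type annotation, which only WXS-aware processors will honor. A minimal, hypothetical fragment (the element name is invented for illustration):

```xml
<!-- The instance, not the schema, claims the type here; a processor
     that ignores xsi:type sees only character data. -->
<quantity xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
          xmlns:xs="http://www.w3.org/2001/XMLSchema"
          xsi:type="xs:positiveInteger">10</quantity>
```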

Ogbuji responded that he didn't think schema import was optional in practice, and that it was an all-or-nothing proposal in any case. Walsh followed up with some expression of his distaste for xsi:type, and a restatement of his doubt that imposed data types are a problem in the new XPath draft. David Carlisle countered that schema import can often be triggered without the knowledge of the XPath user. Separately, Robie disputed that schema import is forced upon anyone:

>1) From my last reading of XPath 2.0, schema "import" was not optional if the
>document had a PSVI.  If this has changed, this is a big step forward.

It has always been optional, whether or not the document has been schema 
validated. (Of course, XQuery operates on the Data Model, not the PSVI).

>2) Even if schema import is optional, it is all or nothing.  More likely, I
>want to use type information in, say, one template, and not across the board
>for all values.

This has never been true - XQuery has always allowed you to import just the 
schemas for which you want type information.

> > Building data types into the schema doesn't seem harmful. That's the
> > point of a schema, is it not?
>
>My point is that it ensures tight coupling.

I can do queries on data without importing the schemas into a query, and 
the built-in data types in instances are available whether or not I import 
schemas into a query.

Ogbuji expressed his skepticism of this disavowal:

I don't give a fig about XQuery.  I was talking about XPath.

http://www.w3.org/TR/xpath20/#id-type-conversion

"XPath is a strongly typed language with a type system based on [XML Schema]. 
The built-in types of XPath include the built-in atomic types of [XML Schema] 
(such as xs:integer and xs:string), and the following special derived types: 
fn:dayTimeDuration and fn:yearMonthDuration (described in [XQuery 1.0 and 
XPath 2.0 Functions and Operators]). Additional type definitions may be 
imported from the language environment via the in-scope schema definitions."

This does not sound optional to me.

And I am happy to substitute "XPath 2.0 Data Model" for "PSVI" if you insist 
on that quiddity.  Just don't expect me to always remember that.

Then there is section 2.3:

"  1. The document can be parsed using an XML parser.
   2. The parsed document can be validated against one or more schemas. This 
process, which is described in [XML Schema], results in an abstract 
information structure called the Post-Schema Validation Infoset (PSVI). If a 
document has no associated schema, it can be validated against a permissive 
default schema that accepts any well-formed document.
   3. The PSVI can be transformed into the Data Model by a process described 
in [XQuery 1.0 and XPath 2.0 Data Model]."

I think I see the outs that you folks claim here.  This sequence is listed as 
an "example" (note that the only other example is described as "synthesized 
directly from a relational database").  Then you refer to XSD rules for schema 
processing.  XSD makes following schemaLocation optional.

I'm not impressed by all this wriggling.  In practice, every XSD processor I 
have used follows schemaLocation.  And by the placement and emphasis of this 
supposed example in a normative section of the spec, you pretty much mandate 
its implementation in this way when an XML document is being parsed.  Even if 
I were a strong typing advocate, I would find this stuff intolerable from the 
POV of interoperability.

You could avoid such misinterpretation, and minimize interop problems by the 
simple expedient of moving the PSVIish nonsense (note the "ish") to a 
different spec.  This is what I've been arguing for.

Mike Champion started a subthread claiming that the preponderance of data types in recent working drafts shows the edge of a slippery slope toward far more obvious design problems. Simon St.Laurent amplified this commentary, adding a wish that people who want strong typing had seized on ASN.1 rather than XML.

Len Bullard put the debate in perspective of how successful tools are implemented. He expanded on this theme a little later in response to Jeff Lowery's comment that the debate reflects "...a power struggle here, although ... not between data-heads and doc-heads, nor between pedants and free-thinkers..., [but] fundamentally [an] issue of representation and perceived disenfranchisement of the majority."

And speaking of majorities, Simon St.Laurent observed a clear bias among XML-DEV participants:

Maybe most of the gentry and royalty have left the list - talk of
revolution and guillotines probably doesn't help - but xml-dev seems to
have gone over pretty thoroughly to the position Uche described as
bohemian.  

There are certainly document folks around, but there are also lots of
people who work with data and even think in terms of data, but who value
XML's labeled but untyped textual foundations.

It makes me feel more optimistic than I have in a while, though I also
worry that this is simply a place where bohemians congregate.  

Where do we go from here?

Tim Bray kicked off a particularly long sub-thread with worries about the increasing importance of data model over syntax in XML specifications:

I agree with those who feel that some applications of XML work just fine 
without any requirement from ancillary typing machinery, and that such 
machinery shouldn't be compulsory, and XPath/XQuery would be immensely 
better if the basic and schema-dependent parts were cleanly separated.

There's really not too much paranoia-fuel in that complex of issues 
though.  What really scares me is the recurring theme that we ought to 
re-frame XML as a data model and treat the syntax as just one 
serialization.  That makes me seriously paranoid - if somebody promises 
me XML, I want a stream of unicode characters with angle-brackets, not 
some fragile opaque binary kludge which is advertised as having infoset 
semantics.

Follow-ups went on to debate Infoset versus serialized XML, binary XML formats, and the like.

  
xmlhack: developer news from the XML community

