xmlhack: Incremental XML Parsing and Validation in a Text Editor

On 10 December 2003 at XML 2003 in Philadelphia, James Clark presented the ideas and implementation behind his nXML XML editing mode for GNU Emacs.

Clark pointed out that editors for XML can be classified as either text editors or structure editors. Many well-known XML editors are actually the latter, in which the document is always well-formed (and maybe even schema-valid) by virtue of restrictions on user interaction. In developing nXML, Clark wanted people to be able to do everything a plain text editor, and Emacs in particular, allows. This means that the document will pass through varying levels of well-formedness and validity as the user works. The goal is to give the user as many cues as possible about well-formedness and validity without interfering with basic text editing. This is much like the argument that Rick Jelliffe has been making for a while, and which has informed the development of Rick's commercial venture, the Topologi XML editor. Clark has now provided for effective text-driven editing of XML in an open source tool.

The key technical challenge in developing nXML was building a system for incremental parsing and validation of the document. As the document is edited, information about the changes is maintained so that parsing and validation need not be repeated over the entire document.

In the nXML implementation, validation is performed only during idle time, i.e. in between user commands (such as keystrokes). As validation proceeds from top to bottom, the validation state is stored periodically so that processing can resume from the last known state, accounting for any changes in the document itself. Clark mentioned that the low-level algorithm involves searching backwards from a validation state marker for certain tokens. He said that because of the relative simplicity of XML's syntax (as opposed to full SGML), the actual algorithm was less complex than expected.
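
The checkpointing idea can be sketched in a few lines of Python. This is not nXML's algorithm, which is written in Emacs Lisp and, according to the talk, finds its restart point by searching backwards for certain tokens. The sketch below is a toy that checks only tag nesting, one line at a time; the class name, the `interval` parameter, and the regular expression are all assumptions made for illustration.

```python
import re

# Hypothetical, simplified tag matcher: start tags, end tags, empty-element tags.
# It ignores comments, CDATA sections, and tags split across lines.
TAG = re.compile(r"<(/?)([A-Za-z_][\w.:-]*)[^<>]*?(/?)>")


class CheckpointedChecker:
    """Toy nesting checker that saves its state every `interval` lines."""

    def __init__(self, lines, interval=2):
        self.lines = list(lines)
        self.interval = interval
        # checkpoints[i] is the stack of open elements just before line i.
        self.checkpoints = {0: ()}
        self.errors = {}  # line index -> message

    def edit(self, index, new_text):
        """Replace one line and discard state that the change may invalidate."""
        self.lines[index] = new_text
        self.checkpoints = {i: s for i, s in self.checkpoints.items() if i <= index}

    def validate(self):
        """Resume from the latest surviving checkpoint; return lines scanned."""
        start = max(self.checkpoints)
        stack = list(self.checkpoints[start])
        self.errors = {i: m for i, m in self.errors.items() if i < start}
        for i in range(start, len(self.lines)):
            if i % self.interval == 0:
                self.checkpoints[i] = tuple(stack)
            for closing, name, empty in TAG.findall(self.lines[i]):
                if empty:
                    continue  # <br/> opens and closes at once
                if not closing:
                    stack.append(name)
                elif stack and stack[-1] == name:
                    stack.pop()
                else:
                    self.errors[i] = f"unexpected </{name}>"
        return len(self.lines) - start


doc = ["<doc>", "<p>one</p>", "<p>two</p>", "<p>three</p>", "</doc>"]
checker = CheckpointedChecker(doc)
print(checker.validate())        # 5: the first pass scans every line
checker.edit(3, "<p>three</q>")
print(checker.validate())        # 3: resumes at line index 2, not at the top
print(checker.errors[3])         # unexpected </q>
```

After the edit, only the lines from the last surviving checkpoint onward are scanned again. A smaller `interval` means less rescanning per edit at the cost of storing more saved states.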

Clark gave a live demonstration editing a modestly sized OWL document using nXML in Emacs/GTK. He showed, for example, the validation error you get if you try to add an rdf:about attribute as well as rdf:ID to an RDF/XML element. Validation errors are marked by red underlining. He also loaded a very large OWL document (about 1 million lines and 35MB). The buffer loaded in 10 seconds or so, with syntax highlighting in place. A status message reported the percent completion of the validation, and it seemed that it would take 3-4 minutes to validate the whole thing if left idle. Clark showed that if he started interacting with the editor, the validation would pause. Validation errors were, however, reported within the partially validated region. Overall, the performance of basic text editing operations was not noticeably degraded.
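
The behaviour seen in the demonstration (validation that advances while the editor is idle, pauses on user input, and still reports errors for the part already checked) can be approximated with a generator that does its work in small chunks. This is a minimal sketch under stated assumptions, not a description of how nXML or Emacs schedules the work: `validate_in_chunks`, `run_while_idle`, `check_line`, and `input_pending` are hypothetical names, and the per-line check stands in for real schema validation.

```python
def validate_in_chunks(lines, check_line, chunk_size=1000):
    """Check `lines` a chunk at a time, yielding (percent_done, errors_so_far)."""
    errors = []
    total = len(lines)
    for start in range(0, total, chunk_size):
        end = min(start + chunk_size, total)
        for i in range(start, end):
            message = check_line(lines[i])  # hypothetical per-line check
            if message:
                errors.append((i, message))
        yield 100 * end // total, list(errors)


def run_while_idle(work, input_pending):
    """Advance `work` chunk by chunk; stop once the user has input waiting."""
    last = None
    for last in work:
        if input_pending():
            break  # the generator stays suspended and can be resumed later
    return last


lines = ["ok"] * 5 + ["bad"] + ["ok"] * 4
work = validate_in_chunks(lines, lambda s: "bad line" if s == "bad" else None,
                          chunk_size=4)

answers = iter([False, True])  # the user starts typing after the second chunk
print(run_while_idle(work, lambda: next(answers)))  # (80, [(5, 'bad line')])
print(run_while_idle(work, lambda: False))          # (100, [(5, 'bad line')])
```

Because the generator keeps its position between calls, the second call picks up where the first one stopped, and the error found in the partially validated region is available before the pass completes.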

There was some debate at the end of the talk as to whether such processing was possible for SGML. An audience member claimed that he had been part of a successful effort to write such a processor on Macintosh, but Clark and other audience members insisted that SGML's complexity probably made this impossible for the general case.

Re: Incremental XML Parsing and Validation in a Text Editor (Eric Promislow - 18:30, 19 Dec 2003)

> An audience member claimed that he had been part of a successful effort to write such a processor on Macintosh, but Clark and other audience members insisted that SGML's complexity probably made this impossible for the general case.

That member might have been an employee at Exoterica in the late-80s (which became OmniMark, then Stilo) who worked on the revolutionary CheckMark editor. I didn't work on the product, as I joined Exoterica when the company moved from building editors to a language.

CheckMark was built on Mac Pluses, and never migrated to Windows, nor really made it to the era of machines with more than 640K of RAM and a > 10MB disk. So I don't know how it would have handled large files, but it certainly had no problem handling all those SGML tag omissions and short references. SGML didn't brook ambiguity, although sometimes you had to look ahead to determine the underlying markup structure. And CheckMark also used state checkpointing throughout the document to minimize reparsing time (and the reparsing was, naturally, done during idle time).

The CheckMark editor was also revolutionary at the time as it let users add markup as plain text (as opposed to insisting on dialog boxes for managing markup).

To be fair to James' contention that SGML was too complex to be handled by an incremental-parsing editor, the Exoterica parser didn't support the LINK, RANK, or CONCUR options of the SGML standard. But there's a good reason that most of the people reading this paragraph who got on the angle-bracket bandwagon in the last five years or so would understand tag omission, but are probably baffled by those three all-caps words.

Another interesting point on James' work is that he wrote the XML parser in Emacs Lisp. But Exoterica's first parser was also written in a stripped-down Lisp-like language. History's just repeating itself.