Incremental XML Parsing and Validation in a Text Editor
02:47, 15 Dec 2003 UTC | Uche Ogbuji

On 10 December 2003 at XML 2003 in Philadelphia, James Clark presented the ideas and implementation behind his nXML XML editing mode for GNU Emacs.

Clark pointed out that XML editors can be classified as text editors or structure editors. Many well-known XML editors are actually the latter, in which the document is always well-formed (and maybe even schema-valid) by virtue of restrictions on user interaction. In developing nXML, Clark wanted people to be able to do everything that a plain text editor, and Emacs in particular, allows. This means that the document will pass through varying levels of well-formedness and validity as the user works. The goal is to give the user as many cues as possible about well-formedness and validity without interfering with basic text editing. This is much like the argument that Rick Jelliffe has been making for a while, and which has informed the development of Rick's commercial venture, the Topologi XML editor. Clark has now provided for effective text-driven editing of XML in an open source tool.

The key technical challenge in developing nXML was building a system for incremental parsing and validation of the document. As the document is edited, information about the changes is maintained so that parsing and validation need not be repeated over the entire document.

In the nXML implementation, validation is performed only during idle time, i.e. in between user commands (such as keystrokes). As validation proceeds from top to bottom, the validation state is stored periodically so that processing can resume from the last known state, accounting for any changes in the document itself. Clark mentioned that the low-level algorithm involves searching backwards from a validation state marker for certain tokens. He said that because of the relative simplicity of XML's syntax (as opposed to full SGML), the actual algorithm was less complex than expected.
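
The checkpoint-and-resume idea described above can be illustrated with a small sketch. This is not nXML's code or Clark's algorithm: it is a toy Python model that assumes a line-based buffer, checks only tag balance rather than schema validity, and assumes no tag spans a line break. The class name `IncrementalChecker` and its methods `edit` and `step` are hypothetical names chosen for illustration.

```python
import re

# Toy tokenizer: matches start, end, and empty-element tags. Assumes no tag
# spans a line break and ignores comments, CDATA, and processing instructions.
TAG = re.compile(r"<(/?)([A-Za-z_][\w.:-]*)[^<>]*?(/?)>")


class IncrementalChecker:
    """Hypothetical checkpointing tag-balance checker (not nXML's code)."""

    def __init__(self, lines, interval=2):
        self.lines = list(lines)
        self.interval = interval     # store a checkpoint every `interval` lines
        self.checkpoints = {0: ()}   # line index -> open-element stack before that line
        self.pos = 0                 # next line to check
        self.stack = []              # currently open elements
        self.errors = {}             # line index -> message

    def edit(self, index, text):
        """Replace one line and discard state that the edit may have invalidated."""
        self.lines[index] = text
        # A checkpoint at or before `index` only depends on unchanged text.
        self.checkpoints = {i: s for i, s in self.checkpoints.items() if i <= index}
        if self.pos > index:
            # Rewind to the last checkpoint that is still trustworthy.
            self.pos = max(self.checkpoints)
            self.stack = list(self.checkpoints[self.pos])
            self.errors = {i: m for i, m in self.errors.items() if i < self.pos}

    def step(self, budget):
        """Check up to `budget` lines (one idle-time slice); return True when done."""
        end = min(self.pos + budget, len(self.lines))
        while self.pos < end:
            if self.pos % self.interval == 0:
                self.checkpoints[self.pos] = tuple(self.stack)
            for closing, name, empty in TAG.findall(self.lines[self.pos]):
                if closing:
                    if self.stack and self.stack[-1] == name:
                        self.stack.pop()
                    else:
                        self.errors[self.pos] = f"unexpected </{name}>"
                elif not empty:
                    self.stack.append(name)
            self.pos += 1
        return self.pos == len(self.lines)


checker = IncrementalChecker(["<doc>", "<a>", "</a>", "<b>", "</b>", "</doc>"])
checker.step(3)            # first idle slice: lines 0-2
checker.step(10)           # second slice finishes the buffer
checker.edit(3, "<c>")     # user edits line 3; checking rewinds to line 2
resumed_at = checker.pos
checker.step(10)
print(resumed_at, sorted(checker.errors))  # prints: 2 [4, 5]
```

In this sketch an edit costs only the work from the nearest earlier checkpoint onward, and `step` can be called with a small budget between user commands so that checking never blocks editing. A real implementation would store richer state than a tag stack and would have to cope with tags, comments, and other constructs that cross line boundaries.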

Clark gave a live demonstration, editing a modestly sized OWL document using nXML in Emacs/GTK. He showed, for example, the validation error you get if you try to add an rdf:about attribute as well as rdf:ID to an RDF/XML element. Validation errors are marked by red underlining. He also loaded a very large OWL document (about 1 million lines and 35MB). The buffer loaded in 10 seconds or so, with syntax highlighting in place. A status message showed the percentage of validation completed, and it seemed that it would take 3-4 minutes to validate the whole thing if left idle. Clark showed that if he started interacting with the editor, validation would pause. Validation errors were, however, reported within the partially validated region. Overall, the performance of basic text editing operations was not noticeably degraded.

There was some debate at the end of the talk as to whether such processing was possible for SGML. An audience member claimed that he had been part of a successful effort to write such a processor on Macintosh, but Clark and other audience members insisted that SGML's complexity probably made this impossible for the general case.

xmlhack: developer news from the XML community