xmlhack: doclifter: Convert man pages to DocBook XML

Eric Raymond has released version 1.0.0 of doclifter, a Python 2.2 utility that converts man pages and other troff/nroff/groff documents to DocBook XML and SGML.

Both a source tarball and an RPM package are available. A NEWS file provides a record of changes.

The man page for the utility has more details about its features:

converts man, mandoc, ms, me, and TkMan page sources
parses command and C function synopses and converts them into DocBook markup (using the cmdsynopsis, arg, replaceable, etc. elements)
recognizes 'stereotyped' patterns of markup and content (such as the use of italics in a FILES section to mark filenames) and 'lifts' them into DocBook markup
recognizes things such as URLs, email addresses, man page references, and C program listings, and lifts them into DocBook markup
maintains a record of semantic 'hints' that it picks up from analyzing source documents (especially from parsing command and function synopses), and provides a means to edit, add to, and save that record
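
To make the idea of 'lifting' concrete, here is a minimal sketch in modern Python of how URLs, email addresses, and man page references might be recognized in plain text and wrapped in DocBook elements. It is not taken from doclifter's source: the names LIFT and lift and the regular expressions themselves are assumptions made for illustration, and the real tool handles far more cases.

```python
import re
from xml.sax.saxutils import escape, quoteattr

# Hypothetical patterns for illustration only; not doclifter's own.
LIFT = re.compile(
    r"(?P<url>(?:https?|ftp)://[^\s<>\"]*[^\s<>\".,;:!?)])"   # URL, minus trailing punctuation
    r"|(?P<email>[\w.+-]+@[\w-]+(?:\.[\w-]+)+)"               # email address
    r"|(?P<page>\b[A-Za-z_][\w.+-]*)\((?P<vol>\d[a-z]?)\)"    # man reference such as ls(1)
)

def lift(text):
    """Wrap recognized patterns in DocBook markup and escape everything else."""
    out = []
    pos = 0
    for m in LIFT.finditer(text):
        out.append(escape(text[pos:m.start()]))
        if m.group("url"):
            url = m.group("url")
            out.append("<ulink url=%s>%s</ulink>" % (quoteattr(url), escape(url)))
        elif m.group("email"):
            out.append("<email>%s</email>" % escape(m.group("email")))
        else:
            out.append(
                "<citerefentry><refentrytitle>%s</refentrytitle>"
                "<manvolnum>%s</manvolnum></citerefentry>"
                % (escape(m.group("page")), m.group("vol"))
            )
        pos = m.end()
    out.append(escape(text[pos:]))
    return "".join(out)

if __name__ == "__main__":
    print(lift("See ls(1), mail esr@example.org, or visit http://example.org/doc."))
```

The element names (ulink, email, citerefentry, refentrytitle, manvolnum) are standard DocBook 4 elements; the addresses in the usage line are placeholders.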

Raymond writes:

"doclifter does not do a perfect job, only about 90% of one; the last 10% has to be applied by a human recognizing patterns too subtle for a computer. But doclifter will almost always produce translations that are good enough to be usable before hand-hacking."

The NEWS file in the distribution says that doclifter was tested on all 5548 man pages in a full Red Hat 7.3 workstation install, and that only 5 percent of the converted files required any post-conversion manual correction. A TODO file in the distribution provides a list of man pages that it is currently not able to convert perfectly, and, for each man page, lists the reason why it fails.

(It seems that around 65 percent of the conversion failures are due to markup errors in the 'roff source for the pages, 20 percent or so are due to the presence of parenthetical comments in synopses -- which aren't supported in DocBook synopses -- and the remaining 15 percent or so are due to current deficiencies in doclifter.)
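
For a rough sense of scale, the percentages can be turned into page counts. The sketch below rests on an assumption the sources do not state outright: that the approximate failure breakdown applies to the same 5 percent of pages that needed manual correction. The variable names are illustrative.

```python
total_pages = 5548                       # man pages tested, per the NEWS file
needing_fixes = total_pages * 5 / 100    # 5 percent of the test set

# Assumption: the approximate breakdown of failure causes applies to that same set.
breakdown = {
    "roff markup errors": 65,
    "parenthetical comments in synopses": 20,
    "doclifter deficiencies": 15,
}
estimates = {cause: round(needing_fixes * pct / 100) for cause, pct in breakdown.items()}

if __name__ == "__main__":
    print(round(needing_fixes), estimates)
```

Under that assumption, roughly 277 pages needed attention, of which only about 42 would trace back to doclifter itself.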

Note that there is also a bug in part of the implementation that doclifter uses for dealing with ISO character entities: In some XML instances, it generates internal DTD subsets that include entity declarations which reference the SGML versions of the ISO character-entity sets instead of the XML versions.

A workaround for the bug is simply to delete the ISO character entity declarations from generated XML documents. The declarations are actually redundant at best, because both the DocBook XML and SGML DTDs already reference the appropriate sets.
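
The workaround can be scripted. The following is a minimal sketch in modern Python, assuming the offending declarations are parameter entities whose names begin with "ISO" (for example ISOnum), declared and then referenced inside the internal DTD subset; the exact shape of doclifter's output is not shown in the sources, so check it against a real generated file first. The function name strip_iso_entities is hypothetical.

```python
import re

# Internal DTD subset: everything between "[" and "]>" in the DOCTYPE declaration.
DOCTYPE = re.compile(r"(<!DOCTYPE[^\[>]*\[)(.*?)(\]\s*>)", re.DOTALL)
# Assumed shape: <!ENTITY % ISOxxx PUBLIC "..."> followed by a %ISOxxx; reference.
ISO_DECL = re.compile(r"<!ENTITY\s+%\s+ISO\w+\s[^>]*>\s*")
ISO_REF = re.compile(r"%ISO\w+;\s*")

def strip_iso_entities(xml_text):
    """Remove ISO parameter-entity declarations and references from the internal subset."""
    def clean(match):
        subset = ISO_DECL.sub("", match.group(2))
        subset = ISO_REF.sub("", subset)
        return match.group(1) + subset + match.group(3)
    # Only the first DOCTYPE is touched; the document body is left alone.
    return DOCTYPE.sub(clean, xml_text, count=1)

if __name__ == "__main__":
    import sys
    sys.stdout.write(strip_iso_entities(sys.stdin.read()))
```

Run as a filter, it reads a generated XML document on standard input and writes the cleaned document to standard output. Other declarations in the internal subset are preserved, and a document with no internal subset passes through unchanged.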

Re: doclifter: Convert man pages to DocBook XML (Aaron Hawley - 21:59, 3 Feb 2003)

Re: doclifter: Convert man pages to DocBook XML (Michael Smith - 06:37, 19 Sep 2002)

Eric Raymond has released version 1.01 of doclifter. It corrects the ISO character entity bug mentioned in this xmlhack news item.

http://www.tuxedo.org/~esr/doclifter/