ftp://java.sun.com/pub/java/avh/html.tar.Z for a prototype of a DTD-driven HTML parser written entirely in Java.
This parser is the prototype of the HTML parser used in the HotJava beta release. Because it is a prototype, expect rough edges: it requires the alpha3 release, it has had very little testing, and it has only been tested on Solaris. I'd like to encourage you to parse your favorite HTML pages, see if they contain any HTML errors, and send me feedback (avh@eng.sun.com).
The parser reads a DTD and uses it to parse an HTML file. It reports errors when it finds them. Programmers can subclass the parser and add their own functionality. The parser is an SGML parser with some fine tuning to make it work for HTML. It doesn't yet support all SGML features.
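To give a feel for the design described above, here is a minimal sketch in modern Java. It is not the API of the parser in the tarball: the names ToyDtd, ToyParser, CollectingParser and ToyParserDemo are hypothetical, and the "DTD" is reduced to a table of which elements may appear inside which, rather than being read from a real DTD file. It only illustrates the idea of a grammar-driven parser that reports errors through a method a subclass can override.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for a DTD: maps each declared element
// to the set of child elements it may directly contain.
class ToyDtd {
    private final Map<String, Set<String>> allowedChildren;

    ToyDtd(Map<String, Set<String>> allowedChildren) {
        this.allowedChildren = allowedChildren;
    }

    boolean declares(String element) {
        return allowedChildren.containsKey(element);
    }

    boolean allows(String parent, String child) {
        return allowedChildren.getOrDefault(parent, Set.of()).contains(child);
    }
}

// Hypothetical parser: scans the tags of a document and checks them
// against the ToyDtd. Text content and attributes are ignored.
class ToyParser {
    private static final Pattern TAG =
        Pattern.compile("<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*>");
    private final ToyDtd dtd;

    ToyParser(ToyDtd dtd) {
        this.dtd = dtd;
    }

    // Hooks that subclasses may override to add their own functionality.
    protected void startTag(String name) {}
    protected void endTag(String name) {}
    protected void error(String message) {
        System.err.println("error: " + message);
    }

    void parse(String html) {
        Deque<String> open = new ArrayDeque<>(); // stack of open elements
        Matcher m = TAG.matcher(html);
        while (m.find()) {
            boolean closing = !m.group(1).isEmpty();
            String name = m.group(2).toLowerCase(Locale.ROOT);
            if (!dtd.declares(name)) {
                error("unknown element <" + name + ">");
            } else if (closing) {
                if (open.isEmpty() || !open.peek().equals(name)) {
                    error("unexpected </" + name + ">");
                } else {
                    open.pop();
                    endTag(name);
                }
            } else {
                if (!open.isEmpty() && !dtd.allows(open.peek(), name)) {
                    error("<" + name + "> not allowed inside <" + open.peek() + ">");
                }
                open.push(name);
                startTag(name);
            }
        }
        // Anything still open at end of input was never closed (innermost first).
        for (String name : open) {
            error("missing </" + name + ">");
        }
    }
}

// Example subclass: collects error messages instead of printing them.
class CollectingParser extends ToyParser {
    final List<String> errors = new ArrayList<>();

    CollectingParser(ToyDtd dtd) {
        super(dtd);
    }

    @Override
    protected void error(String message) {
        errors.add(message);
    }
}

class ToyParserDemo {
    public static void main(String[] args) {
        ToyDtd dtd = new ToyDtd(Map.of(
            "html", Set.of("body"),
            "body", Set.of("p"),
            "p", Set.of()));
        CollectingParser parser = new CollectingParser(dtd);
        parser.parse("<html><p>hi</p></html>"); // <p> directly inside <html>
        parser.errors.forEach(System.out::println);
    }
}
```

In this sketch the subclass changes only how errors are handled; overriding startTag and endTag in the same way would let a subclass build a document tree or collect links. A real SGML parser also has to deal with things this toy ignores, such as omitted tags, entities and attribute declarations.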