del.icio.us and xhtml parsing in java.

To find what del.icio.us knows about any url, use http://del.icio.us/url/(url-md5sum). The md5 is a standard 32-character md5 string with lowercase hex digits. URLs should be properly slash-terminated before generating the md5: for example, http://www.slashdot.org/ rather than http://www.slashdot.org. In some cases people have tagged the url without the trailing slash, so it can be worth looking up both forms.
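As a rough sketch of the lookup described above, the md5 part can be built with the JDK's MessageDigest. The class and method names here (DeliciousUrl, md5Hex, lookupUrl) are made up for illustration, and I'm assuming the url is hashed as UTF-8 bytes, which makes no difference for plain ASCII urls.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper class; the names are illustrative only.
class DeliciousUrl {

    // Returns the 32-character, lowercase hex md5 of the given text.
    public static String md5Hex(String text) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            // Assumption: the url is hashed as UTF-8 bytes.
            byte[] digest = md.digest(text.getBytes(StandardCharsets.UTF_8));
            // %032x pads with leading zeros so the result is always 32 digits.
            return String.format("%032x", new BigInteger(1, digest));
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform is required to provide MD5.
            throw new IllegalStateException(e);
        }
    }

    // Builds the del.icio.us lookup address for a url.
    public static String lookupUrl(String url) {
        return "http://del.icio.us/url/" + md5Hex(url);
    }

    public static void main(String[] args) {
        // Try both forms, since some people tag the url without the slash.
        System.out.println(lookupUrl("http://www.slashdot.org/"));
        System.out.println(lookupUrl("http://www.slashdot.org"));
    }
}
```

The two printed addresses differ because the md5 covers the exact string, trailing slash included.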

The html (xhtml really) served by del.icio.us is really clean, so it occurred to me that the entries I'm interested in could be extracted using the same strategy used to identify microformats. That also started me thinking about using xpath expressions to identify and extract the right xhtml nodes.

I didn't have much luck. xhtml is an interesting beast. When I followed the java xpath api usage examples, the parser complained that the contents of the script tags contained illegal characters. del.icio.us doesn't wrap its script tag contents in CDATA sections, which doesn't seem to be required or common practice anyway.
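The failure is easy to reproduce with the JDK's own xml parser. This is only a sketch: the markup strings are made-up stand-ins, not the actual del.icio.us page, and ScriptParseDemo and parsesAsXml are names I picked for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; the names are illustrative only.
class ScriptParseDemo {

    // Returns true when the markup parses as well-formed xml.
    public static boolean parsesAsXml(String markup) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing them.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(markup)));
            return true;
        } catch (SAXException e) {
            // The parser rejected the document.
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Made-up sample: a bare '<' and '&&' inside a script element.
        String raw = "<html><head><script>if (a < b && c) { go(); }"
                   + "</script></head><body/></html>";
        // The same script wrapped in a CDATA section.
        String wrapped = "<html><head><script><![CDATA[if (a < b && c) { go(); }]]>"
                       + "</script></head><body/></html>";
        System.out.println("raw script parses: " + parsesAsXml(raw));
        System.out.println("CDATA script parses: " + parsesAsXml(wrapped));
    }
}
```

A strict xml parser treats the bare < and & in the script as markup, which is why the first string is rejected and the CDATA-wrapped one is not.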

It turns out that a combination of jTidy and dom4j seems to work reasonably well.


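import java.io.FileInputStream;
import java.io.InputStream;
import java.util.List;
import org.dom4j.Document;
import org.dom4j.Node;
import org.dom4j.io.DOMReader;
import org.w3c.tidy.Tidy;

// The class name is arbitrary; the snippet just needs somewhere to live.
public class DeliciousDivs {
    public static void main(String[] args) throws Exception {
        // Let jTidy clean up the markup and emit it as xml.
        Tidy tidy = new Tidy();
        tidy.setShowWarnings(false);
        tidy.setQuiet(true);
        tidy.setXHTML(true);
        tidy.setXmlOut(true);

        // Parse the saved page into a W3C DOM tree, closing the stream afterwards.
        org.w3c.dom.Document domDocument;
        try (InputStream in = new FileInputStream("target/out.xhtml")) {
            domDocument = tidy.parseDOM(in, null);
        }

        // Convert the W3C DOM into a dom4j Document so xpath can be used on it.
        Document doc = new DOMReader().read(domDocument);

        // Select every div. With dom4j 1.x selectNodes returns a raw List (unchecked warning).
        List<Node> divs = doc.selectNodes("//div");
        for (Node div : divs) {
            // asXML() prints the markup; toString() only gives an object summary.
            System.out.println(div.asXML());
        }
    }
}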
