How to parse HTML with SAXParser (or other solution)

by tlegras » Sun, 03 Jan 2010 00:22:24 GMT

 Happy new year world :)

I want to parse an HTML page downloaded from a web server and am
having quite a bit of trouble with it.
I am trying with a SAX parser; is there a better solution?

With SAX, I am trying to preprocess the page to make it XML compliant
(replace <br> with <br />), but I still have trouble because of
errors in the page (a couple of mismatched tags, and "&" in attribute
values instead of "&amp;").
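
The preprocessing described above could look roughly like the sketch
below. It is only an illustration: the class and method names are
hypothetical (not from the original post), and the regular expressions
are an assumption about what "fixing" <br> and bare "&" means. It does
not repair mismatched tags, and named HTML entities that XML does not
define (such as &nbsp;) are left alone, so an XML parser may still
reject the result.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library mentioned in this thread.
class HtmlPreprocessSketch {
    // An "&" that does not start a named or numeric entity (&amp;, &#38;, &#x26;).
    private static final Pattern BARE_AMP =
            Pattern.compile("&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);)");

    // A <br> tag that is not already self-closed, in any letter case.
    private static final Pattern OPEN_BR =
            Pattern.compile("<br\\s*>", Pattern.CASE_INSENSITIVE);

    static String preprocess(String html) {
        // Escape bare ampersands first, then close the <br> tags.
        String escaped = BARE_AMP.matcher(html).replaceAll("&amp;");
        return OPEN_BR.matcher(escaped).replaceAll("<br />");
    }

    public static void main(String[] args) {
        System.out.println(preprocess("<a href=\"p?x=1&y=2\">link</a><br>"));
    }
}
```

Text-level fixes like these only cover the patterns they were written
for, which is why they tend to stay fragile on real pages.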

Is there any way to make the SAX parser ignore these errors and keep on
parsing? I tried to use the ErrorHandler interface, but I could not catch

Any help would be welcome.


by Kumar Bibek » Sun, 03 Jan 2010 00:37:06 GMT

 I guess you need to use a dedicated HTML parser. Since HTML pages are
often not well-formed and are not XML compliant, using an XML parser
will not serve your purpose.
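
To illustrate why, here is a minimal sketch using only the standard
JAXP/SAX classes. The class and method names are hypothetical. It
installs an ErrorHandler that swallows every error, yet on the JDK's
bundled parser (an assumption; other parsers may behave differently)
parse() still aborts with an exception on a mismatched tag, because
well-formedness violations are fatal in XML.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class, not from the original thread.
class StrictSaxSketch {
    /** Returns true if the parser reached the end of the input without aborting. */
    static boolean parsesToEnd(String markup) {
        try {
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            DefaultHandler handler = new DefaultHandler() {
                // Deliberately ignore everything the parser reports.
                @Override public void error(SAXParseException e) { }
                @Override public void fatalError(SAXParseException e) { }
            };
            reader.setContentHandler(handler);
            reader.setErrorHandler(handler);
            reader.parse(new InputSource(new StringReader(markup)));
            return true;
        } catch (SAXException e) {
            // The parser gave up even though the handler ignored the error.
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(parsesToEnd("<p><b>text</b></p>"));
        System.out.println(parsesToEnd("<p><b>text</p>"));
    }
}
```

So the ErrorHandler lets you observe the error, but it does not turn a
strict XML parser into a tolerant HTML parser.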

Search for a third-party library.

Thanks and Regards,
Kumar Bibek


by tlegras » Sun, 03 Jan 2010 02:37:01 GMT

 OK, thanks. I am trying NekoHTML and am currently trying to make it
run with the minimal sample code (so using only the provided
xercesMinimal.jar), but I get this exception in my parse() function:

E/AndroidRuntime(  765): Uncaught handler: thread Thread-10 exiting due to uncaught exception
E/AndroidRuntime(  765): java.lang.ExceptionInInitializerError
E/AndroidRuntime(  765):        at org.cyberneko.html.HTMLScanner
E/AndroidRuntime(  765):        at
E/AndroidRuntime(  765):        at
E/AndroidRuntime(  765):        at
E/AndroidRuntime(  765):        at
E/AndroidRuntime(  765): Caused by: java.lang.IllegalStateException: Failed to create XercesBridge instance
E/AndroidRuntime(  765):        at
E/AndroidRuntime(  765):        at

Still investigating; I will give feedback.



by tlegras » Sun, 03 Jan 2010 19:44:40 GMT

 OK, I got it. It seems the problem is that their xercesMinimal.jar
does not work. I tried it in a non-Android Java project and had the
same problem. With the full Xerces jar I can parse my HTML page even
though it has several errors in it. Too bad the full Xerces jar is
1.2 MB :(
It seems like a bug in NekoHTML; I will report it on their mailing list.


by jwei512 » Mon, 04 Jan 2010 14:52:29 GMT

 Another one you could try is HTML Cleaner (http://

I've already made a few applications that reference this library, and
it even supports XPath for querying the HTML source.

If you'd like to see some code snippets, let me know and I can show
you some.

- jwei 


by tlegras » Mon, 04 Jan 2010 16:04:03 GMT

 NekoHTML is now working fine for me, so I probably won't change :)
But thank you for the link, it is a goldmine :) I found that the
documentation lacks such snippets.


