How to parse html with saxparser (or other solution)

by tlegras » Sun, 03 Jan 2010 00:22:24 GMT

 Happy new year world :)

I want to parse an HTML page downloaded from a web server and am having
quite a bit of trouble with it.
I am trying with a SAX parser; is there a better solution?

With SAX I am trying to preprocess the page to make it XML compliant
(replacing <br> with <br />), but I still have trouble because of
errors in the page (a couple of mismatched tags and "&" in attribute
values instead of &amp;).
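The preprocessing described above could be sketched roughly as follows. This is only an illustration: the class name `HtmlPreprocessor`, the method `preprocess`, and the short list of void tags are assumptions, not something from the original page, and regex cleanup of this kind does nothing for mismatched tags.

```java
import java.util.regex.Pattern;

public class HtmlPreprocessor {
    // Elements that HTML usually leaves unclosed. This short list is an
    // assumption for the sketch; extend it for the pages you actually parse.
    private static final Pattern VOID_TAG = Pattern.compile(
            "<(br|hr|img|input|meta|link)\\b([^<>]*?)\\s*/?>",
            Pattern.CASE_INSENSITIVE);

    // An ampersand that does not start a named or numeric character reference.
    private static final Pattern BARE_AMP = Pattern.compile(
            "&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);)");

    /** Self-closes the listed void tags and escapes bare ampersands. */
    public static String preprocess(String html) {
        String closed = VOID_TAG.matcher(html).replaceAll("<$1$2 />");
        return BARE_AMP.matcher(closed).replaceAll("&amp;");
    }

    public static void main(String[] args) {
        // prints: <a href="page?a=1&amp;b=2">x</a><br />
        System.out.println(preprocess("<a href=\"page?a=1&b=2\">x</a><br>"));
    }
}
```

Existing references such as &amp; or &#38; are left alone, so running the method twice should not double-escape anything.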

Is there any way to make the SAX parser ignore these errors and keep on
parsing? I tried to use the ErrorHandler interface, but I could not catch

Any help would be welcome.
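For reference, a minimal sketch of what an ErrorHandler can and cannot do with the standard JAXP SAX parser. The XML specification treats well-formedness violations as fatal errors, so the handler is notified but the parse still stops. The class and method names here (`SaxFatalErrorDemo`, `RecordingHandler`, `tryParse`) are made up for the illustration, and the behaviour shown is that of the desktop JDK parser; the Android parser is assumed, not verified, to behave similarly.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

public class SaxFatalErrorDemo {
    /** Records element names and fatal errors without rethrowing the errors. */
    public static class RecordingHandler extends DefaultHandler {
        public final List<String> elements = new ArrayList<String>();
        public final List<String> fatalErrors = new ArrayList<String>();

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            elements.add(qName);
        }

        @Override
        public void fatalError(SAXParseException e) {
            // Swallowing the error here does not let the parser continue.
            fatalErrors.add(e.getMessage());
        }
    }

    /** Returns true if the whole document was parsed, false if the parser gave up. */
    public static boolean tryParse(String markup, RecordingHandler handler) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        try {
            parser.parse(new InputSource(new StringReader(markup)), handler);
            return true;
        } catch (SAXException e) {
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        RecordingHandler handler = new RecordingHandler();
        boolean finished = tryParse("<p>a<br>b</p>", handler);
        System.out.println("finished=" + finished
                + " elements=" + handler.elements
                + " fatalErrors=" + handler.fatalErrors.size());
    }
}
```

In this sketch the unclosed <br> is reported to fatalError, yet the parse does not finish, which is why an error-tolerant HTML parser is the usual way out.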



by Kumar Bibek » Sun, 03 Jan 2010 00:37:06 GMT

 I guess you need to use a dedicated HTML parser. Since HTML pages are
often not well-formed and not XML compliant, an XML parser will
not serve your purpose.

Search for a third-party library.

Thanks and Regards,
Kumar Bibek



by tlegras » Sun, 03 Jan 2010 02:37:01 GMT

 OK, thanks. I am trying NekoHTML and working on getting it to run, but
with the minimal sample code (so using only the provided
xercesMinimal.jar) I get this exception in my parse() function:

E/AndroidRuntime(  765): Uncaught handler: thread Thread-10 exiting
due to uncaught exception
E/AndroidRuntime(  765): java.lang.ExceptionInInitializerError
E/AndroidRuntime(  765):        at org.cyberneko.html.HTMLScanner
E/AndroidRuntime(  765):        at
E/AndroidRuntime(  765):        at
E/AndroidRuntime(  765):        at
E/AndroidRuntime(  765):        at
E/AndroidRuntime(  765): Caused by: java.lang.IllegalStateException:
Failed to create XercesBridge instance
E/AndroidRuntime(  765):        at
E/AndroidRuntime(  765):        at

Still investigating; I will give feedback.




by tlegras » Sun, 03 Jan 2010 19:44:40 GMT

 OK, I got it. It seems the problem is that their xercesMinimal.jar does
not work. I tried it in a non-Android Java project and had the same
problem. With the full Xerces jar I can parse my HTML page even though it
has several errors in it. Too bad the full Xerces jar is 1.2 MB :(
Seems like a bug in NekoHTML; I will report it on their mailing list.



by jwei512 » Mon, 04 Jan 2010 14:52:29 GMT

 Another one you could try is HTML Cleaner (http://

I've already made a few applications that reference this library, and
it even supports XPath for querying the HTML source.

If you'd like to see some code snippets then let me know and I can
show you some.

- jwei 



by tlegras » Mon, 04 Jan 2010 16:04:03 GMT

 NekoHTML is now working very well for me, so I probably won't change :)
But thank you for the link, it is a goldmine :) I find the
documentation lacks such snippets.

