SAXParser fails on some RSS feeds

by grennis » Sun, 01 Mar 2009 09:53:48 GMT

 I'm using the SAX parser to read some RSS feeds and have found a
problem. Some feeds, for example CNN Money Top Stories, have embedded
some characters in their content, e.g. the copyright symbol. Well,
that's not valid XML and the SAXParser fails with an exception
"invalid token".

The only help I have seen given is to fix the XML at the source and
that's not an option obviously. So, I can think of 2 options and they
both stink: (a) read the content first, scrub it, and then pass it to
the parser. (b) use DOM instead of SAX.

What I *want* to do is make the parser a little more forgiving and
just accept or discard/ignore the bad text. I'm not having any luck with
setErrorHandler. My error handler does not get called.

Can anyone offer some help on this? Thanks



by Tim Bray » Sun, 01 Mar 2009 12:20:07 GMT


In general you can't use a real XML processor, which the Java SAX
stuff is, to read RSS feeds.  Lots and lots of them aren't XML at all.
 Atom 1.0 is better, but lots of feeds aren't Atom.  Once somebody
ports either Jython or JRuby and gets it really running, the problem
is solved because you can use the excellent Feedparser library, which
Just Works on any imaginable feed.  In the interim, you might want to
consider John Cowan's excellent TagSoup, which handles what its name
suggests. Libxml2 also has a "forgiving" parser but I don't know if
there's a Java interface to that. -T


by 3D » Sun, 01 Mar 2009 16:48:30 GMT

 I'm working on the same problem right now.  I'll take a look at
TagSoup.  Otherwise, I was just thinking of scrubbing out the invalid
tokens before sending it to the XML reader.  Please let me know what
you find/decide to do.



by Tim Bray » Sun, 01 Mar 2009 16:53:38 GMT


Scrubbing it will almost certainly not work.  There is some seriously
weird shit in RSS feeds out there.  Not just wonky characters.  The
reason is that most blog authoring systems let you grab arbitrary
claims-to-be-html off the web and drop it into your blog, so it ends
up in your feed, and even with the double-escaping voodoo you see in
RSS, the poison remains.

As an interim step, you could simply take Atom when there's a choice
of feeds, and refuse to process bad RSS.  The proportion of feeds that
have Atom alternatives available is pretty high.  The reason this
works is that one or two of the leading feed-readers decided to use
real persnickety XML parsers for Atom, so the publishing industry has
done the necessary whatevers to make sure they're clean.

The *right* answer is FeedParser, sigh.  -Tim



by grennis » Sun, 01 Mar 2009 23:03:07 GMT

 OK, thanks all. I didn't realize the problem was as pervasive as it
is. I'm presenting a limited set of feeds so I'm hoping the scrub
approach will work.
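
For anyone trying the scrub approach, here is a minimal sketch, assuming "bad text" means characters outside the XML 1.0 Char production (stray control characters and unpaired surrogates). The class and method names (FeedScrubber, isXmlChar, scrub) are made up for illustration. It will not fix a wrong character encoding or malformed markup, so it only covers one narrow class of broken feeds.

```java
final class FeedScrubber {
    /** True if the code point is allowed by the XML 1.0 "Char" production. */
    static boolean isXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Returns a copy of the input with every disallowed code point dropped. */
    static String scrub(String xml) {
        StringBuilder out = new StringBuilder(xml.length());
        for (int i = 0; i < xml.length(); ) {
            int cp = xml.codePointAt(i);
            if (isXmlChar(cp)) {
                out.appendCodePoint(cp);
            }
            // Advance by one or two chars depending on the code point width.
            i += Character.charCount(cp);
        }
        return out.toString();
    }
}
```

The scrubbed string can then be handed to the parser through a StringReader wrapped in an InputSource. Note that the feed has to be decoded into a String with the right charset first, otherwise a symbol like the copyright sign is already mangled before scrubbing starts.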



by StefanK » Sun, 01 Mar 2009 23:32:30 GMT

In my experience, the problem in many cases is the character
encoding used in the feed. If the feed is encoded using ISO-8859-1
(which is what CNN top stories appears to use), and you are
trying to read it using the default UTF-8 encoding, some symbols will
come through as invalid and break the parser. The only viable solution
is to manually detect the encoding before trying to parse and then
construct the input stream given to the parser with the correct
encoding. This is what I ended up doing for BeyondPod on both the
Windows Mobile and Android platforms, and it solved a large set of
parsing issues. Welcome to the bizarre world of RSS parsing.
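
A rough sketch of the detect-then-parse idea described above, using only the standard Java SAX classes. This is not BeyondPod's code. The names FeedEncoding, detect and parse are hypothetical, and the rule "use the encoding from the XML declaration if there is one, otherwise a caller-supplied fallback such as the charset from the HTTP Content-Type header" is an assumption about what "detect" means here. Byte order marks and UTF-16 feeds are not handled.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

final class FeedEncoding {
    // Matches encoding="..." inside an XML declaration at the start of the feed.
    private static final Pattern DECLARED = Pattern.compile(
        "<\\?xml[^>]*encoding\\s*=\\s*[\"']([A-Za-z][A-Za-z0-9._-]*)[\"']");

    /** Picks the declared encoding if present and supported, else the fallback. */
    static Charset detect(byte[] feed, Charset fallback) {
        // The declaration itself is plain ASCII in the encodings covered here,
        // so decoding the first bytes as ISO-8859-1 is enough for sniffing.
        int n = Math.min(feed.length, 200);
        String head = new String(feed, 0, n, StandardCharsets.ISO_8859_1);
        Matcher m = DECLARED.matcher(head);
        if (m.find() && Charset.isSupported(m.group(1))) {
            return Charset.forName(m.group(1));
        }
        return fallback;
    }

    /** Decodes the bytes ourselves and hands the parser a character stream. */
    static void parse(byte[] feed, Charset fallback, DefaultHandler handler)
            throws Exception {
        Charset charset = detect(feed, fallback);
        Reader reader = new InputStreamReader(new ByteArrayInputStream(feed), charset);
        // When an InputSource carries a character stream, the parser reads it
        // directly and does not apply its own byte-level decoding.
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new InputSource(reader), handler);
    }
}
```

Called as FeedEncoding.parse(bytes, StandardCharsets.ISO_8859_1, handler), a feed with no encoding declaration and a raw 0xA9 byte should come through with the copyright symbol intact instead of tripping the parser's UTF-8 default.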




by 3D » Mon, 02 Mar 2009 11:11:55 GMT

 I just wanted to report that I've tried TagSoup and at first glance it
seems to be doing exactly what I want - this is great!  Instead of
using a SAXParserFactory I'm now using the SAXFactoryImpl class in
TagSoup to instantiate a new SAXParser.  I will need to look it over a
bit more but it just parsed through a copyright symbol without any

