SAXParser fails on some RSS feeds

by grennis » Sun, 01 Mar 2009 09:53:48 GMT

I'm using the SAX parser to read some RSS feeds and have found a
problem. Some feeds, for example CNN Money Top Stories, have embedded
some characters in their content, e.g. the copyright symbol. Well,
that's not valid XML and the SAXParser fails with an "invalid token"
exception.

The only help I have seen given is to fix the XML at the source, and
that's obviously not an option. So I can think of two options, and they
both stink: (a) read the content first, scrub it, and then pass it to
the parser; (b) use DOM instead of SAX.

What I *want* to do is make the parser a little more forgiving and
just accept or discard/ignore the bad text. I'm not having any luck with
setErrorHandler. My error handler does not get called.

Can anyone offer some help on this? Thanks



by Tim Bray » Sun, 01 Mar 2009 12:20:07 GMT


In general you can't use a real XML processor, which the Java SAX
stuff is, to read RSS feeds.  Lots and lots of them aren't XML at all.
Atom 1.0 is better, but lots of feeds aren't Atom.  Once somebody
ports either Jython or JRuby and gets it really running, the problem
is solved because you can use the excellent Feedparser library, which
Just Works on any imaginable feed.  In the interim, you might want to
consider John Cowan's excellent TagSoup, which handles what its name
suggests. Libxml2 also has a "forgiving" parser but I don't know if
there's a Java interface to that. -T



by 3D » Sun, 01 Mar 2009 16:48:30 GMT

 I'm working on the same problem right now.  I'll take a look at
TagSoup.  Otherwise, I was just thinking of scrubbing out the invalid
tokens before sending it to the xml reader.  Please let me know what
you find/ decide to do.



by Tim Bray » Sun, 01 Mar 2009 16:53:38 GMT


Scrubbing it will almost certainly not work.  There is some seriously
weird shit in RSS feeds out there.  Not just wonky characters.  The
reason is that most blog authoring systems let you grab arbitrary
claims-to-be-html off the web and drop it into your blog, so it ends
up in your feed, and even with the double-escaping voodoo you see in
RSS, the poison remains.

As an interim step, you could simply take Atom when there's a choice
of feeds, and refuse to process bad RSS.  The proportion of feeds that
have Atom alternatives available is pretty high.  The reason this
works is that one or two of the leading feed-readers decided to use
real persnickety XML parsers for Atom, so the publishing industry has
done the necessary whatevers to make sure they're clean.

The *right* answer is FeedParser, sigh.  -Tim



by grennis » Sun, 01 Mar 2009 23:03:07 GMT

 OK, thanks all. I didn't realize the problem was as pervasive as it
is. I'm presenting a limited set of feeds so I'm hoping the scrub
approach will work.
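
For a limited set of feeds, the scrub approach might look roughly like the sketch below. It assumes the feed has already been decoded into a String; the names FeedScrubber and scrub are made up for illustration. It only drops characters that fall outside the XML 1.0 Char range. Note that the copyright symbol (U+00A9) is itself a legal XML character, so if that is what trips the parser, the cause is more likely a mismatch between the bytes and the encoding used to read them than the character itself.

```java
class FeedScrubber {
    /** True if the code point is allowed by the XML 1.0 Char production. */
    static boolean isXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Returns a copy of raw with every non-XML character removed. */
    static String scrub(String raw) {
        StringBuilder out = new StringBuilder(raw.length());
        for (int i = 0; i < raw.length(); ) {
            int cp = raw.codePointAt(i);
            if (isXmlChar(cp)) {
                out.appendCodePoint(cp);
            }
            // Advance by one or two chars depending on the code point width.
            i += Character.charCount(cp);
        }
        return out.toString();
    }
}
```

The scrubbed string could then be handed to the parser through a StringReader wrapped in an InputSource. This does nothing for unescaped markup or broken entities, so it is only a partial fix.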



by StefanK » Sun, 01 Mar 2009 23:32:30 GMT

In my experience, the problem in many cases is the character
encoding used in the feed. If the feed is encoded as ISO-8859-1
(which is what CNN top stories appears to use) and you try to read it
using the default UTF-8 encoding, some symbols will come through as
invalid and break the parser. The only viable solution is to
manually detect the encoding before parsing and then construct
the input stream given to the parser with the correct encoding. This
is what I ended up doing for BeyondPod on both the Windows Mobile and
Android platforms, and it solved a large set of parsing issues.
Welcome to the bizarre world of RSS parsing.
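
One way to sketch the detect-then-parse idea described above, using only the standard Java library. This is not BeyondPod's code; FeedEncoding, detect and toInputSource are hypothetical names, and the sketch assumes the whole feed has already been read into a byte array. It only looks at the encoding attribute of the XML declaration and otherwise uses a caller-supplied fallback; it does not handle byte order marks, UTF-16 feeds, or a charset from the HTTP Content-Type header.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.xml.sax.InputSource;

class FeedEncoding {
    // Matches encoding="..." inside an XML declaration such as
    // <?xml version="1.0" encoding="ISO-8859-1"?>
    private static final Pattern XML_DECL_ENCODING = Pattern.compile(
            "<\\?xml[^>]*encoding\\s*=\\s*[\"']([A-Za-z0-9][A-Za-z0-9._-]*)[\"']");

    /** Returns the declared charset if present and supported, else fallback. */
    static Charset detect(byte[] feedBytes, Charset fallback) {
        // Decode only the head as ISO-8859-1: every byte maps to a char,
        // so this cannot fail, and the declaration itself is plain ASCII.
        int n = Math.min(feedBytes.length, 200);
        String head = new String(feedBytes, 0, n, StandardCharsets.ISO_8859_1);
        Matcher m = XML_DECL_ENCODING.matcher(head);
        if (m.find() && Charset.isSupported(m.group(1))) {
            return Charset.forName(m.group(1));
        }
        return fallback;
    }

    /** Wraps the bytes in a Reader that decodes with the detected charset. */
    static InputSource toInputSource(byte[] feedBytes, Charset fallback) {
        Charset cs = detect(feedBytes, fallback);
        Reader reader = new InputStreamReader(new ByteArrayInputStream(feedBytes), cs);
        return new InputSource(reader);
    }
}
```

The resulting InputSource can be passed to SAXParser.parse(InputSource, DefaultHandler). Because the parser is given a character stream, it no longer has to guess how to decode the bytes.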




by 3D » Mon, 02 Mar 2009 11:11:55 GMT

 I just wanted to report that I've tried TagSoup and at first glance it
seems to be doing exactly what I want - this is great!  Instead of
using a SAXParserFactory I'm now using the SAXFactoryImpl class in
TagSoup to instantiate a new SAXParser.  I will need to look it over a
bit more but it just parsed through a copyright symbol without any

