feed hassles

After working on chompy.net, it’s clear that dealing with syndicated feeds (especially RSS feeds) is a pain in the ass.

The following are adapted from two emails I sent to Scott about chompy.net. Hope he doesn’t mind.


Scott wrote:

Since I’m a neurotic internet fiddler (as it appears you also are) I thought I should point out that the time stamp on blog entries seems to be out of whack. Maybe it’s applying an unnecessary offset? I can only speak for my own entries, but their stamps appear on Chompy as having been posted 5 hours before they were posted on Junklab. Bizarre.

There is, of course, more to this story. Ready? Really? Okay.

Your B2 RSS 2.0 feed reports dates using the optional Dublin Core (ISO 15836-2003) dc:date field. According to the Dublin Core specification, the dc:date field should use dates formatted according to a profile of ISO 8601 known as [W3CDTF] (http://www.w3.org/TR/NOTE-datetime). W3CDTF allows a given date’s time zone to be expressed in one of two ways:

  1. With times expressed as UTC/GMT, and with the time zone identifier “Z,” indicating UTC.
  2. With times expressed in local time, and with the time zone designated using an offset (“+hh:mm” or “-hh:mm”) indicating the time’s relation to UTC.

To give an example, the date October 8, 2005, 4:36:14 PM (CDT) could be expressed as either of the following, when formatted according to the W3CDTF specification:

  1. 2005-10-08T16:36:14-05:00
  2. 2005-10-08T21:36:14Z
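
As a quick sanity check on the two forms, here is a minimal Python sketch using only the standard library. It assumes CDT is five hours behind UTC; the variable names are mine and purely illustrative.

```python
from datetime import datetime, timedelta, timezone

# Assumption: CDT (US Central Daylight Time) is UTC-5.
CDT = timezone(timedelta(hours=-5), "CDT")

# October 8, 2005, 4:36:14 PM, local Central time.
local = datetime(2005, 10, 8, 16, 36, 14, tzinfo=CDT)

# Form 1: local time plus a numeric offset.
with_offset = local.isoformat()

# Form 2: the same instant converted to UTC, marked with "Z".
as_utc = local.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")

print(with_offset)  # 2005-10-08T16:36:14-05:00
print(as_utc)       # 2005-10-08T21:36:14Z
```

Both strings name the same instant, so a reader that handles W3CDTF properly should sort and display them identically.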

Now, that date happens to be the same date as the last entry in your blog, which appears in your RSS 2.0 feed like so:


It should actually look like this:


I don’t know how B2 works, but fixing this could be as simple as opening up a template file and cramming the string “-05:00” in somewhere (though it would need to become “-06:00” once daylight saving time ends).
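
To make the five-hour shift concrete, here is a rough Python sketch of one plausible way it could happen. The zoneless timestamp is a hypothetical stand-in (the actual feed value is not reproduced above), and the function names are mine, not anything from B2 or chompy.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical zoneless value, standing in for what a feed might emit.
ZONELESS = "2005-10-08T16:36:14"
# Assumption: the author and the reader are both on CDT (UTC-5).
CDT = timezone(timedelta(hours=-5))

def parse_zoneless(stamp):
    """Parse a timestamp that carries no zone information at all."""
    return datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%S")

def misread_as_utc(stamp, reader_zone):
    """What a reader displays if it assumes a zoneless stamp is UTC."""
    as_utc = parse_zoneless(stamp).replace(tzinfo=timezone.utc)
    return as_utc.astimezone(reader_zone)

def with_offset(stamp, zone):
    """The repaired value: same wall-clock time, explicit offset."""
    return parse_zoneless(stamp).replace(tzinfo=zone).isoformat()

shown = misread_as_utc(ZONELESS, CDT)
print(shown.strftime("%H:%M:%S"))   # 11:36:14, five hours early
print(with_offset(ZONELESS, CDT))   # 2005-10-08T16:36:14-05:00
```

A fixed offset is only right for half the year, so a real fix would want the blog software to emit its actual offset rather than a hard-coded string.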

(And this could actually be even more complicated. RSS 2.0 also allows dates to be specified in a [pubDate] field (RSS 2.0 pubDates), which uses a different date format, RFC 822. An RSS 2.0 feed, then, could conceivably include both a [pubDate] field and a [dc:date] field just to piss off people that write RSS aggregators.)
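
For what it’s worth, an aggregator that wants to cope with both fields has to carry two parsers. A minimal sketch using Python’s standard library follows; the helper names are my own, and the sample values are made up to match the example date above, assuming a UTC-5 offset.

```python
from datetime import datetime
from email.utils import parsedate_to_datetime

def parse_dc_date(value):
    """Parse a full W3CDTF date-time as found in [dc:date].

    datetime.fromisoformat() only accepts a trailing "Z" from
    Python 3.11 on, so it is rewritten as an explicit offset first.
    Partial dates such as "2005-10" are not handled by this sketch.
    """
    if value.endswith("Z"):
        value = value[:-1] + "+00:00"
    return datetime.fromisoformat(value)

def parse_pub_date(value):
    """Parse an RFC 822 style date as found in [pubDate]."""
    return parsedate_to_datetime(value)

# Made-up sample values for the same instant in both formats.
dc = parse_dc_date("2005-10-08T21:36:14Z")
pub = parse_pub_date("Sat, 08 Oct 2005 16:36:14 -0500")

print(dc == pub)  # True
```

Once both values are timezone-aware datetimes they compare by instant, so it no longer matters which field a given feed happened to use.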

Even if you were to fix your feed, though, there would still be a few other date-related problems that I need to fix myself, so I took the easy way out and removed the timestamps from the page. The ordering of items on the front page may be slightly off, however, and god knows what might be going on in the chompy RSS and Atom feeds.

Isn’t this fascinating?


(Regarding duplicate entries appearing in the Thunderbird client)

Looks like this is a known issue:


It was opened about a year ago, and the lead developer (Scott MacGregor) closed it after determining the likely cause and checking in a fix, but the problem is still occurring for a lot of users.

(Note: the following paragraph tends to get trashed by the wiki text formatting)

The issue, as Scott understands it, works like this: Thunderbird tracks an entry’s uniqueness by looking at the value of the entry’s [guid] (RSS) or [id] (Atom) tag. (This ID value should always be unique and should never change; this is important.) The problem begins when an entry’s ID contains a character that is reserved in XML, like &. A properly formatted feed will encode such a character; for example, & will become &amp;. When Thunderbird’s XML parser processes that ID, it assumes that such characters aren’t encoded already and double-encodes them, so &amp; becomes &amp;amp;. So when Thunderbird finds an ID like blah&amp;blah&amp;etc, it stores it as blah&amp;amp;blah&amp;amp;etc. That’s wrong, but it would be fine, except that when Thunderbird checks the same feed again, it sees an entry with an ID of blah&amp;blah&amp;etc and finds that it doesn’t match anything it has stored already, because what it has stored isn’t blah&amp;blah&amp;etc, it’s blah&amp;amp;blah&amp;amp;etc. And that’s the bug.
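
Here is a toy Python model of that behavior. It is only a sketch of the bug as described above, not Thunderbird’s actual code, and the function name is invented for illustration.

```python
from xml.sax.saxutils import escape

def buggy_is_new(feed_id, stored_ids):
    """Toy model: compare the ID as read, but store it re-encoded."""
    is_new = feed_id not in stored_ids
    if is_new:
        # The extra escape() here is the double-encoding step.
        stored_ids.add(escape(feed_id))
    return is_new

stored = set()
feed_id = escape("blah&blah&etc")        # blah&amp;blah&amp;etc

first = buggy_is_new(feed_id, stored)    # first fetch of the feed
second = buggy_is_new(feed_id, stored)   # same entry, second fetch

print(first, second)  # True True
print(stored)         # {'blah&amp;amp;blah&amp;amp;etc'}
```

An ID with no reserved characters survives the extra encoding unchanged, which would explain why only some feeds produce duplicates.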

Anyway, that’s supposedly fixed (though maybe not in any of the current stable releases). So why is the bug still open, and why are people still reporting the problem? I don’t actually know, but it’s likely that this bug has multiple causes, so fixing the cause described above isn’t enough. My candidate for the most likely cause is the fact that most software that generates RSS or Atom feeds doesn’t ensure that entry IDs are actually unique and unchanging. An example would be RSS 2.0 feeds that stick the entry’s permalink in the [guid] field, something that the RSS 2.0 spec encourages. This isn’t very well thought out, because permalinks can and do change. And since Thunderbird trusts that the entry IDs in the feeds that it receives will be unique and unchanging, it will create duplicate entries any time that turns out not to be the case. Thunderbird doesn’t really have a better option, because A) that’s what the ID is for, and B) the other information available to it (the title, permalink, date, and content of each entry) definitely isn’t guaranteed to be either unique or unchanging, assuming that those elements are even present.

Or take chompy’s feeds. I recently tweaked the ID-generation algorithm so that each entry ID has a much greater chance of uniqueness, and in so doing, changed every single ID. That, of course, caused Thunderbird to think that every entry in the feed was brand new. To make it worse, I’m probably going to end up tweaking that algorithm again, because I made the same mistake that I just mentioned above — I generate IDs based on the entry permalinks.
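
To illustrate the difference, here is a small Python sketch contrasting an ID derived from the permalink with one minted once and stored alongside the entry. This is not chompy’s actual algorithm; the Entry class, the derived_id helper, and the example URLs are all hypothetical.

```python
import hashlib
import uuid
from dataclasses import dataclass, field

def derived_id(permalink):
    """An ID recomputed from the permalink; it changes when the link does."""
    return hashlib.sha1(permalink.encode("utf-8")).hexdigest()

@dataclass
class Entry:
    permalink: str
    # Minted once when the entry is first created, then stored with it.
    entry_id: str = field(default_factory=lambda: f"urn:uuid:{uuid.uuid4()}")

entry = Entry("http://example.com/2005/10/old-slug")
minted_before = entry.entry_id
derived_before = derived_id(entry.permalink)

# The permalink changes, as permalinks do.
entry.permalink = "http://example.com/2005/10/new-slug"

print(entry.entry_id == minted_before)               # True
print(derived_id(entry.permalink) == derived_before)  # False
```

The catch for an aggregator is that the minted ID has to be persisted somewhere and looked up by some other key, which is its own can of worms.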

[ article last updated 2008-08-05 11:48:46 by cobra libre ]