Monday, May 28

XML::Atom::SimpleFeed and utf-8 text

tl;dr: If you’ve got character weirdness with UTF-8 content and the XML::Atom::SimpleFeed module in Perl, make sure you’re doing something like:

my $feed = XML::Atom::SimpleFeed->new(
  -encoding => 'UTF-8',
  # ...
);

use Encode qw(decode);

my $utf8_content = decode('UTF-8', $content, Encode::FB_CROAK);

$feed->add_entry(
  content   => $utf8_content,
  # ...
);

A fuller explanation follows.

There’s long been a bug with the Atom feed I generate for this site with wrt, where characters outside of the ASCII set (like ✿ ✢ ☆) were showing up mangled in the feed.

I noticed this once in a while, but never investigated too deeply. Since things usually seemed to display correctly in feedreaders (or at least in NewsBlur), I always wrote it off as some sort of vaguely intractable encoding glitch and forgot about it.

Back in April, though, I made some changes to how the feed generation worked, and checking it with feedvalidator.org made it obvious that something was legitimately broken. Sure enough, in the Firefox feed view, decorative glyphs like ✺ were rendering as multiple characters (see addendum below), while the first line of the generated Atom XML looked like so:

<?xml version="1.0" encoding="us-ascii"?>

I’m using a CPAN library called XML::Atom::SimpleFeed for this. Digging around in the docs, I found this bit:

-encoding (omissible, default us-ascii)

So I changed the constructor in my code to look like this:

my $feed = XML::Atom::SimpleFeed->new(
  -encoding => 'UTF-8',
  title     => $self->{title_prefix} . '::' . $self->{title},
  link      => $self->{url_root},
  link      => { rel => 'self', href => $feed_url, },
  icon      => $self->{favicon_url},
  author    => $self->{author},
  id        => $self->{url_root},
  generator => 'App::WRT.pm / XML::Atom::SimpleFeed',
  updated   => App::WRT::Date::iso_date(App::WRT::Date::get_mtime($first_entry_file)),
);

That took care of the declared encoding, but I was still getting mangled characters. In order to fix that, I had to add something like:

use Encode qw(decode encode);

my $utf8_content = decode('UTF-8', $content, Encode::FB_CROAK);
$feed->add_entry(
  title     => $title,
  link      => $entry_url,
  id        => $entry_url,
  content   => $utf8_content,
  updated   => $iso_date,
);

…where I add individual entries to the feed.

As I understand it, when you read a UTF-8-encoded text file into a Perl string without a decoding layer, the string holds one character per byte in the file. That’s all well and good for many purposes. But XML::Atom::SimpleFeed calls Encode::encode($character_set, $string) to produce its output, and encode() expects a string of Unicode code points, not raw bytes. Unless you first call Encode::decode('UTF-8', $string) to map each multi-byte UTF-8 sequence to the correct code point in Perl’s internal string representation, every byte of a multi-byte character gets encoded again as though it were a character of its own, and weirdness results.
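To make the byte-versus-character distinction concrete, here’s a minimal sketch. It isn’t code from wrt: the scratch file and the variable names are made up for illustration, and the :encoding(UTF-8) layer on the filehandle is an alternative way to get decoded text, not the decode() call used in the fix above.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write the three UTF-8 bytes for U+273A to a scratch file (hypothetical input).
my ($out, $path) = tempfile(UNLINK => 1);
binmode $out;
print {$out} "\xE2\x9C\xBA";
close $out or die "close failed: $!";

# Read with no decoding layer: the string holds one character per byte.
open my $raw_fh, '<:raw', $path or die "open failed: $!";
my $raw = do { local $/; <$raw_fh> };
close $raw_fh;

# Read with a decoding layer: the string holds one character per code point.
open my $utf8_fh, '<:encoding(UTF-8)', $path or die "open failed: $!";
my $decoded = do { local $/; <$utf8_fh> };
close $utf8_fh;

printf "raw:     %d characters\n", length $raw;
printf "decoded: %d character, U+%04X\n", length $decoded, ord $decoded;
```

The raw read gives a three-character string and the layered read gives a one-character string, even though both came from the same three bytes on disk.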

I may well be mangling that explanation somewhat. I’ve never really had my head around this class of problems in general, which at this late date should probably be kind of embarrassing for a working programmer, but I also suspect that character encoding remains a confusing topic for almost everyone.

See also:

A comment in the review by Darren Kulp (from 2008, no less) pretty much contains the solution I landed on, but of course I skimmed over it on the first reading and spent a bunch more time getting there.

Addendum: I mentioned that ✺ was rendering as multiple characters. Specifically, it was turning into what vim displays as â<9c>º. You can use xxd to get a hexdump of this:

$ cat original
✺

$ cat re_encoded
âº

$ xxd original
00000000: e29c ba                                  ...

$ xxd re_encoded
00000000: c3a2 c29c c2ba                           ......

So without running Encode::decode() on this input, each of those initial 3 bytes (e2 9c ba) becomes a character in Perl’s internal string representation, and then Encode::encode() says “ok, let’s represent these three characters as UTF-8”, from which you obviously get nonsense (though it’s nonsense in which some of the original bytes are preserved).
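The same bytes can be reproduced in Perl without going through a file. This is a sketch rather than anything from wrt or from XML::Atom::SimpleFeed itself: the variable names are hypothetical, and I’m assuming the module’s output step behaves like a plain Encode::encode('UTF-8', ...) call.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# The three UTF-8 bytes for U+273A, as a raw file read would deliver them.
my $bytes = "\xE2\x9C\xBA";

# Undecoded, Perl sees three separate characters, one per byte.
my @undecoded = map { sprintf 'U+%04X', ord } split //, $bytes;

# Encoding those three characters as UTF-8 yields the re_encoded bytes.
my $re_encoded = encode('UTF-8', $bytes);

# Decoding first yields a single character, which encodes back to the original.
# LEAVE_SRC keeps decode() from modifying $bytes in place when a CHECK is given.
my $chars      = decode('UTF-8', $bytes, Encode::FB_CROAK() | Encode::LEAVE_SRC());
my $round_trip = encode('UTF-8', $chars);

print "undecoded:  @undecoded\n";                           # U+00E2 U+009C U+00BA
print "re-encoded: ", unpack('H*', $re_encoded), "\n";      # c3a2c29cc2ba
print "decoded:    ", sprintf('U+%04X', ord $chars), "\n";  # U+273A
print "round trip: ", unpack('H*', $round_trip), "\n";      # e29cba
```

One thing worth knowing: when decode() is given a CHECK value such as Encode::FB_CROAK without LEAVE_SRC, it may modify the input scalar in place, so use the returned string afterwards and not the original variable.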