Thursday, May 31

Monday, May 28

XML::Atom::SimpleFeed and utf-8 text

tl;dr: If you’ve got character weirdness with UTF-8 content and the XML::Atom::SimpleFeed module in Perl, make sure you’re doing something like:

use Encode qw(decode);

my $feed = XML::Atom::SimpleFeed->new(
  -encoding => 'UTF-8',
  # ...
);

my $utf8_content = decode('UTF-8', $content, Encode::FB_CROAK);

$feed->add_entry(
  content   => $utf8_content,
  # ...
);

A fuller explanation follows.

The Atom feed I generate for this site with wrt has long had a bug where characters outside of the ASCII set (like ✿ ✢ ☆) show up mangled in the feed.

I noticed this once in a while, but never investigated too deeply. Since things usually seemed to display correctly in feedreaders (or at least in NewsBlur), I always wrote it off as some sort of vaguely intractable encoding glitch and forgot about it.

Back in April, though, I made some changes to how the feed generation worked, and checking it with feedvalidator.org made it obvious that something was legitimately broken. Sure enough, in the Firefox feed view, decorative glyphs like ✺ were rendering instead as multiple characters (see addendum below), while the first line of the generated Atom XML looked like so:

<?xml version="1.0" encoding="us-ascii"?>

I’m using a CPAN library called XML::Atom::SimpleFeed for this. Digging around in the docs, I found this bit:

-encoding (omissible, default us-ascii)

So I changed the constructor in my code to look like this:

my $feed = XML::Atom::SimpleFeed->new(
  -encoding => 'UTF-8',
  title     => $self->{title_prefix} . '::' . $self->{title},
  link      => $self->{url_root},
  link      => { rel => 'self', href => $feed_url, },
  icon      => $self->{favicon_url},
  author    => $self->{author},
  id        => $self->{url_root},
  generator => 'App::WRT.pm / XML::Atom::SimpleFeed',
  updated   => App::WRT::Date::iso_date(App::WRT::Date::get_mtime($first_entry_file)),
);

That took care of the declared encoding, but I was still getting mangled characters. In order to fix that, I had to add something like:

use Encode qw(decode encode);

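# Decode the raw bytes into Perl characters, dying on malformed UTF-8.
# With a CHECK argument like FB_CROAK, decode() may overwrite $content
# in place; OR in Encode::LEAVE_SRC if $content is needed afterwards.
my $utf8_content = decode('UTF-8', $content, Encode::FB_CROAK);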
$feed->add_entry(
  title     => $title,
  link      => $entry_url,
  id        => $entry_url,
  content   => $utf8_content,
  updated   => $iso_date,
);

…where I add individual entries to the feed.

As I understand it, when you read a UTF-8-encoded text file into a Perl string without any decoding, the string’s contents will correspond to the bytes in the file. That’s all well and good for many purposes. But unless you explicitly use Encode::decode('UTF-8', $string) to map the multi-byte sequences in UTF-8 to the correct Unicode code points in Perl’s internal string representation, weirdness will result when somebody later calls Encode::encode($character_set, $string) to produce correctly-encoded output, which XML::Atom::SimpleFeed does: each original byte gets treated as a character in its own right and encoded separately.
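Here’s a minimal sketch of the well-behaved round trip, assuming a hard-coded byte string stands in for the contents of a file. The variable names are made up for illustration and aren’t from wrt. Since decode() with a CHECK value can overwrite its input in place, I’ve ORed in Encode::LEAVE_SRC so the original bytes stick around for the comparison:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# The three bytes that make up U+273A in UTF-8, as they'd arrive from a
# file read without any decoding.
my $bytes = "\xe2\x9c\xba";

# Before decoding, Perl sees three separate one-byte characters.
my $byte_length = length $bytes;

# decode() maps those bytes to a single Unicode code point. LEAVE_SRC
# keeps decode() from overwriting $bytes in place.
my $chars = decode('UTF-8', $bytes, Encode::FB_CROAK() | Encode::LEAVE_SRC());
my $char_length = length $chars;

# encode() on the decoded string gets back the original bytes.
my $round_trip = encode('UTF-8', $chars);

printf "bytes: %d, characters: %d, round trip ok: %s\n",
  $byte_length, $char_length, ($round_trip eq $bytes ? 'yes' : 'no');
```

If I have this right, the decoded string has a length of 1 rather than 3, and that’s the form the feed module wants to be handed.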

I may well be mangling that explanation somewhat. I’ve never really had my head around this class of problems in general, which at this late date should probably be kind of embarrassing for a working programmer, but I also suspect that character encoding remains a confusing topic for almost everyone.

See also:

A comment in the review by Darren Kulp (from 2008, no less) pretty much contains the solution I landed on, but of course I skimmed over it on the first reading and spent a bunch more time getting there.

Addendum: I mentioned that ✺ was rendering as multiple characters. Specifically, it was turning into what vim displays as â<9c>º. You can use xxd to get a hexdump of this:

$ cat original
✺

$ cat re_encoded
âº

$ xxd original
00000000: e29c ba                                  ...

$ xxd re_encoded
00000000: c3a2 c29c c2ba                           ......

So without running Encode::decode() on this input, each of those initial 3 bytes (e2 9c ba) becomes a character in Perl’s internal string representation, and then Encode::encode() says “ok, let’s represent these three characters as UTF-8”, from which you obviously get nonsense (though it’s nonsense in which some of the original bytes are preserved).
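The broken path can be reproduced in a few lines. This is only a sketch, assuming the same three bytes as above are written inline instead of being read from a file; the variable names are mine, not anything from the feed code:

```perl
use strict;
use warnings;
use Encode qw(encode);

# The raw UTF-8 bytes for U+273A, never passed through decode().
my $undecoded = "\xe2\x9c\xba";

# Perl treats each byte as its own character (U+00E2, U+009C, U+00BA),
# so encode() emits a two-byte UTF-8 sequence for each one.
my $double_encoded = encode('UTF-8', $undecoded);

# Hex dump of the result, for comparison with the xxd output.
my $hex = unpack 'H*', $double_encoded;
print "$hex\n";
```

That should print c3a2c29cc2ba, the same six bytes xxd shows for re_encoded.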

sunday, may 13

a mother's day lawn and garden report

coming over the hill on 36 where
you catch that first view of boulder,
clouds move over and in the gray and green of
the bowl of the valley, against the mountains
sweeping away to the north and west
and in seconds thick drops hit the glass
turning to heavy rain by the edge of town

windshield wipers all the way up 28th and onto
the road north, low rumbling as i park and lug my
bag into the house, half-deranged from the day's
driving and a dozen of the sadnesses that
middle adulthood scores over and over again
into surfaces like these

the cat and i are watching the water come down
out the screendoor when the thunder picks up
i run outside and yank a tarp off the woodpile
in back to throw across the garden
just as the hail really gets going
the flowers from the apple tree falling fast
in the rain and ice, my shirt soaking

the tarp is probably futile, but i have memories
of more than one vegetable crop shredded by a
spring storm like this one
and i'm not sure what else i can do

which is both a metaphor and not.

Tuesday, May 1

wtfm

I’m thinking, far from the first time, about how few open source projects meet a certain standard of practical openness:

Can a user unfamiliar with the project start with the published source code and the included documentation, and wind up with a working installation?

The answer is “no way” a lot more often than it should be.

I’m all too aware that the full context it takes to build a lot of software is a pretty hard thing to explain in a README. All the same, if you’re working on a project of any size, maybe you ought to ask yourself some of the following:

  • Have I documented, in full, an up-to-date list of every environmental condition that a user will have to manually obtain in order to build, install, or run this code?

  • Is it clear what environments this code is developed in, and where it’s known to run without issue? Am I being brutally honest about this part?

  • Is it clear what version of everything I used?

  • When was the last time I tested the installation procedures in my documentation in a clean environment? Has it been since the last time I changed dependencies or configuration requirements, no matter how apparently trivial?

  • To what extent am I relying on implicit details of some OS, distribution, virtual machine, container, configuration management tool, package, init system, etc., without communicating those details to the user?

  • If a user is unfamiliar with details of a language, package manager, build system, or other tooling, are they pretty much just SOL? If so, is my software of interest to anyone outside of my specific technical community, narrowly defined? Is there any potential that it will need to be supported, for example, by admins or ops people who don’t share my context, for use by some less technical audience?

  • If I personally returned to developing the software after leaving it untouched for two years, how many problems would I be required to solve from a combination of my own patchy memory, search engine queries, and painstaking software archaeology?

  • Are there any required environment variables, config file values, or command-line parameters that are for some reason undocumented?

  • Am I posturing like a link to some automatically extracted API documentation is a useful substitute for a real user manual?

  • Did I just publish a command-line utility with a crappy builtin help system instead of a real man page?
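For a Perl project, at least part of the “what version of everything” question can be answered by a script instead of from memory. What follows is a rough sketch, not a recommendation of any particular tool: the @modules list and the module_version helper are hypothetical, and a real project would list its own dependencies.

```perl
use strict;
use warnings;

# Hypothetical dependency list; substitute whatever the project needs.
my @modules = qw(Encode XML::Atom::SimpleFeed);

# Return a module's version, or undef if it can't be loaded or
# doesn't declare one.
sub module_version {
  my ($module) = @_;
  (my $file = "$module.pm") =~ s{::}{/}g;
  return eval { require $file; $module->VERSION };
}

# First line records the interpreter and OS; the rest, one module each.
my @report = (sprintf('perl %vd on %s', $^V, $^O));
for my $module (@modules) {
  my $version = module_version($module);
  push @report, sprintf('%s %s', $module,
    defined $version ? $version : '(not found or no version)');
}

print "$_\n" for @report;
```

Pasting the output into a README next to the install instructions won’t cover everything on this list, but it at least pins down what the author was actually running.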

I could go on like this for a while, but I suppose the point is clear enough. If it sounds like I’m advocating a stricter standard than I usually manage to live up to myself, well, that’s probably fair. Still, I think we could all probably do better.