How to fix illegal character errors in PHP XML parsing


I’m still plagued by occasional failures in my XML parsing due to illegal characters. Explicitly setting the character encoding reduced the frequency, but they still pop up. I’ve tried a couple of techniques. One is to use iconv() to strip out any characters that are illegal in the character set I’m using, e.g.

$output = iconv("ISO-8859-1", "ISO-8859-1//IGNORE", $input);

This apparently works with more complex Unicode encodings, but at the moment I’m sticking with an 8-bit character encoding. The problem is that every byte value corresponds to a defined character in ISO-8859-1. It took some head-scratching to realize that ISO-8859-1 is not the same as ISO 8859-1! The extra hyphen after ISO denotes an extended version that also assigns the control codes in the ranges 0x00 to 0x1f, 0x7f and 0x80 to 0x9f. This fills up the range of mapped values, so any number between 0 and 255 corresponds to a valid character in ISO-8859-1. iconv() therefore finds nothing to ignore, and the line above does nothing.
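A quick way to check that claim is to push every possible byte value through the same call and compare the result. This is only a sketch, and it assumes PHP’s iconv extension is backed by glibc; other iconv implementations may treat the //IGNORE suffix differently.

```php
<?php
// Build a string containing every possible byte value, 0x00 through 0xFF.
$input = '';
for ($byte = 0; $byte <= 255; $byte++) {
    $input .= chr($byte);
}

// Same call as in the post: convert ISO-8859-1 to itself, ignoring
// anything that cannot be represented in the target character set.
$output = iconv("ISO-8859-1", "ISO-8859-1//IGNORE", $input);

// If every byte is a defined ISO-8859-1 character, nothing is dropped
// and the output is byte-for-byte identical to the input.
$unchanged = ($output === $input);
echo $unchanged ? "no bytes removed\n" : "some bytes removed\n";
```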

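So, in theory that will fix Unicode encodings, but I need something that will handle the characters that are valid in ISO-8859-1 but aren’t allowed by the XML spec. These are the control characters in the range 0x00 to 0x1f, apart from tab (0x09), line feed (0x0a) and carriage return (0x0d), which XML 1.0 does permit. 0x7f is technically legal in XML 1.0 but discouraged, so I strip it too. To remove these you can run a regular expression substitution that looks something like this:

s/[\x00-\x08\x0B\x0C\x0E-\x1F\x7F]//g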
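In PHP itself the same idea can be sketched with preg_replace(). The function name strip_xml_control_chars() is made up for this example, and the code assumes the text is in a single-byte encoding such as ISO-8859-1. It deliberately keeps tab, line feed and carriage return, since XML 1.0 allows those three.

```php
<?php
// Hypothetical helper: remove the control characters that XML 1.0 forbids
// (0x00-0x08, 0x0B, 0x0C, 0x0E-0x1F) plus DEL (0x7F), while keeping
// tab (0x09), line feed (0x0A) and carriage return (0x0D).
// Assumes $text is in a single-byte encoding such as ISO-8859-1.
function strip_xml_control_chars(string $text): string
{
    return preg_replace('/[\x00-\x08\x0B\x0C\x0E-\x1F\x7F]/', '', $text);
}

// Sample input with a NUL, a BEL, a unit separator and a DEL mixed in.
$dirty = "a\x00b\x07c\tkept\nline\x1Fend\x7F";
$clean = strip_xml_control_chars($dirty);
```

Here $clean keeps the tab and the newline but loses the other four control bytes.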

I had a large file on disk that I wanted to change, so I used sed and its control character class shorthand. Note that this class also matches tabs and carriage returns, which XML does allow, so those get stripped as well:

sed 's/[[:cntrl:]]//g' messages.xml > messages.xml.fixed
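If sed isn’t available, roughly the same clean-up can be done from PHP without loading the whole file into memory. The sketch below is an assumption-laden alternative, not what I actually ran: strip_control_chars_from_file() is a hypothetical name, the 64 KB chunk size is arbitrary, and unlike the sed line it keeps tabs and carriage returns. Reading in chunks is safe here because each removed character is a single byte, so nothing can be split across a chunk boundary.

```php
<?php
// Hypothetical helper: copy $source to $destination in chunks, removing
// control bytes other than tab, line feed and carriage return on the way.
// Assumes a single-byte encoding such as ISO-8859-1.
function strip_control_chars_from_file(string $source, string $destination): void
{
    $in = fopen($source, 'rb');
    $out = fopen($destination, 'wb');
    if ($in === false || $out === false) {
        throw new RuntimeException("Could not open $source or $destination");
    }

    while (!feof($in)) {
        $chunk = fread($in, 65536); // arbitrary chunk size
        if ($chunk === false) {
            break;
        }
        fwrite($out, preg_replace('/[\x00-\x08\x0B\x0C\x0E-\x1F\x7F]/', '', $chunk));
    }

    fclose($in);
    fclose($out);
}
```

Calling strip_control_chars_from_file('messages.xml', 'messages.xml.fixed') would then mirror the sed command above.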

This solved the illegal character error I was hitting. Now I’m hitting "XML error: EntityRef: expecting ';' at line 451837", and inspection of the text hasn’t helped me figure out what’s wrong yet. At least I’ve got a lot further through the file.
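For what it’s worth, this message is commonly triggered by an & that doesn’t start a well-formed entity reference (a bare "AT&T", say). I can’t say that is the cause here, but a rough sketch for hunting such lines might look like the following. find_bare_ampersands() is a hypothetical helper, and it will also flag ampersands inside CDATA sections, where they are actually legal.

```php
<?php
// Hypothetical helper: return the 1-based numbers of lines containing an
// '&' that is not followed by something shaped like an entity reference
// (&name;) or a character reference (&#123; or &#x7B;).
function find_bare_ampersands(string $xml): array
{
    $pattern = '/&(?![A-Za-z_:][A-Za-z0-9_.:-]*;|#[0-9]+;|#x[0-9A-Fa-f]+;)/';
    $hits = [];
    foreach (explode("\n", $xml) as $index => $line) {
        if (preg_match($pattern, $line) === 1) {
            $hits[] = $index + 1;
        }
    }
    return $hits;
}

// Only the second line contains an unescaped ampersand.
$sample = "<a>fish &amp; chips</a>\n<b>AT&T</b>\n<c>&#169; &#xA9;</c>";
$suspects = find_bare_ampersands($sample);
```

For a file as large as mine it would be kinder on memory to read it line by line with fgets() and apply the same pattern to each line.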
