Illegal characters in PHP XML parsing

If you hit the error "Invalid character" while using PHP’s built-in XML parser, and you don’t see the usual "<" or "&" characters in the input, you might be running into the same control code problems I’ve been hitting. I’d always assumed, and most sites state, that you can put anything within a CDATA block apart from the closing "]]>" sequence. I’m wrapping the bodies of email messages in XML, within CDATA sections, but I was still seeing parser failures like these. I also tried using various escaping methods instead, like htmlspecialchars(), but still hit the failure.
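
For reference, here is a minimal sketch of the kind of CDATA wrapping described above, assuming the message body is already in a PHP string. Inside a CDATA section "<" and "&" are allowed as-is; the sequence that needs care is "]]>", which would end the section early, so the usual trick is to split it across two adjacent sections. The name wrap_in_cdata() is made up for illustration. Note that this does nothing about control characters, which are still illegal inside CDATA.

```php
<?php
// Hypothetical helper (name is an assumption, not from the original code):
// wrap arbitrary text in a CDATA section.
function wrap_in_cdata(string $text): string
{
    // "]]>" would terminate the section early, so end the section after "]]"
    // and start a new one before ">".
    $safe = str_replace(']]>', ']]]]><![CDATA[>', $text);
    return '<![CDATA[' . $safe . ']]>';
}

// "<" and "&" pass through untouched; only "]]>" is split.
echo wrap_in_cdata('a < b && c ]]> d'), "\n";
```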

Digging into it was tricky, since the parser doesn’t give you the actual character value it’s choking on. In one case I tracked it down to "\x99", which looks like the Windows-1252 (Microsoft) encoding of the trademark character. That got me wondering exactly what character set was being used, so I tried specifying ISO-8859-1 explicitly when I created the parser, but still hit the same error.
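
One way to narrow down the offending value is to ask the parser where it stopped and then dump the bytes around that position. The sketch below uses xml_get_current_byte_index() and xml_get_error_code(), which are part of PHP’s XML extension; the helper name locate_xml_error() and the sample string are made up. The reported offset may be approximate on some builds, so treat it as a hint and look at the neighbouring bytes too.

```php
<?php
// Hypothetical helper: parse $xml and, on failure, report where the parser stopped.
// Returns null when the document parses cleanly.
function locate_xml_error(string $xml): ?array
{
    $parser = xml_parser_create();
    if (xml_parse($parser, $xml, true)) {
        return null;
    }
    $offset = xml_get_current_byte_index($parser);
    // Take a few bytes either side of the reported offset and show them as hex,
    // so that non-printable values such as 0x99 are visible.
    $start = max(0, $offset - 4);
    return [
        'message' => xml_error_string(xml_get_error_code($parser)),
        'line'    => xml_get_current_line_number($parser),
        'byte'    => $offset,
        'nearby'  => bin2hex(substr($xml, $start, 9)),
    ];
}

// Made-up sample: a stray 0x99 byte and no encoding declared.
print_r(locate_xml_error("<msg>Acme\x99 widgets</msg>"));
```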

Then I realized I was cutting some corners by skipping the opening <?xml ... ?> declaration for all of the strings I was creating. That’s where you can specify the character set for the file, and sure enough prefixing each string with
<?xml version="1.0" encoding="ISO-8859-1"?>
got me past that first error. I thought I was home free, but my test logs show that it failed again overnight after going through 1300 more emails. I’ll have to dig into that further and see what the issue was there.
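
A minimal sketch of that prefixing step, assuming the XML is built up as a plain PHP string. The helper name with_latin1_declaration() and the sample input are made up. As far as I can tell from the PHP manual, on PHP 5 and later the encoding argument to xml_parser_create() only sets the output encoding and the input encoding is detected from the document itself, which may be why passing ISO-8859-1 to the parser made no difference while the declaration did.

```php
<?php
// Hypothetical helper: prepend an XML declaration naming the encoding,
// unless the string already starts with a declaration.
function with_latin1_declaration(string $xml): string
{
    if (strncmp($xml, '<?xml ', 6) === 0) {
        return $xml;
    }
    return '<?xml version="1.0" encoding="ISO-8859-1"?>' . "\n" . $xml;
}

// Made-up sample containing the 0x99 byte mentioned above.
$doc = with_latin1_declaration("<msg>Acme\x99 widgets</msg>");

$parser = xml_parser_create();
echo xml_parse($parser, $doc, true) ? "parsed\n" : "failed\n";
```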

It does seem like a design flaw that the parser dies on unrecognized characters, rather than shrugging its shoulders and carrying on. It may well be outside of the spec to have control characters that aren’t legal in the current character set, but it seems both possible and helpful to have a mode that either ignores or demotes those characters when they’re found, rather than throwing up its hands and refusing to parse any further. It has the same smell of enforcing elegance at the expense of utility that infuriated me with bondage and discipline languages like Pascal.
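
In the absence of such a mode, one workaround is to drop the offending characters before the text ever reaches the parser. The sketch below removes the ASCII control characters that XML 1.0 forbids outright (everything below 0x20 except tab, line feed and carriage return). It assumes ISO-8859-1 or UTF-8 input, where those byte values always mean control characters; the function name is made up. It won’t help with bytes like "\x99", which need a declared (or converted) encoding instead.

```php
<?php
// Hypothetical helper: strip the control characters that XML 1.0 does not allow.
// Assumes ISO-8859-1 or UTF-8 input, where bytes 0x00-0x1F are always C0 controls.
function strip_illegal_xml_controls(string $text): string
{
    // Tab (0x09), LF (0x0A) and CR (0x0D) are legal, so they are left alone.
    return preg_replace('/[\x00-\x08\x0B\x0C\x0E-\x1F]/', '', $text);
}

// Made-up sample: vertical tab, NUL and unit separator mixed into normal text.
$dirty = "line one\x0B\x00\nline two\ttabbed\x1F";
var_dump(strip_illegal_xml_controls($dirty) === "line one\nline two\ttabbed"); // bool(true)
```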

3 responses

  1. Hey,
    Did you ever find a solution to your problem?
    I’m choking on the same error, and am kinda close to going crazy over that “illegal char” problem…
    /Tue

  2. I’m still seeing this error occasionally, though adding the explicit character set at the start made almost all the problems go away.
    I did also find this thread discussing the correct way to handle illegal characters:
    http://www.stylusstudio.com/xmldev/200108/post40600.html
    Their reasoning makes sense for ASCII control characters, but I seem to hit this mostly with extended character set values. I agree, it’s annoying to have neither a way to sanitize the input before processing nor a way to make the parser ignore bad values.
    One option I may look at is binary-izing my text data, but that loses the readability advantage of XML.
