If you hit the error "Invalid character" while using PHP’s built-in XML parser, and you don’t see the usual "<" or "&" characters in the input, you might be running into the same control code problems I’ve been hitting. I’d always assumed, and most sites state, that you can put anything within a CDATA section, including the < and & that need escaping everywhere else (the one exception being the closing "]]>" sequence). I’m wrapping the bodies of email messages in XML, within CDATA sections, but I was still seeing parser failures like these. I also tried escaping the text instead, with functions like htmlspecialchars(), but still hit the failure.
Digging into it was tricky, since the parser doesn’t tell you the actual character value it’s choking on. In one case I tracked it down to "\x99", which is the trademark symbol in Microsoft’s Windows-1252 code page. That got me wondering exactly what character set was being used, so I tried specifying ISO-8859-1 explicitly when I created the parser, but still hit the same error.
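One way to narrow down the offending bytes is to ask the parser where it stopped and dump the input around that point. This is only a rough sketch: describe_xml_error() is a hypothetical helper name, the sample string is made up, and the byte index the parser reports can be approximate, so it prints a small window in hex rather than a single byte.

```php
<?php
// Hypothetical helper (not part of PHP): returns null if $xml parses,
// otherwise the parser's message plus a hex dump of the bytes near the failure.
function describe_xml_error(string $xml): ?string
{
    $parser = xml_parser_create();
    if (xml_parse($parser, $xml, true) === 1) {
        return null;
    }

    $message = xml_error_string(xml_get_error_code($parser));
    $offset  = xml_get_current_byte_index($parser);

    // The reported index is not always exact, so show 16 bytes around it.
    $start  = max(0, $offset - 8);
    $window = substr($xml, $start, 16);

    return sprintf('%s near byte %d: %s', $message, $offset, bin2hex($window));
}

// Made-up sample: a raw 0x99 byte inside a CDATA section, no XML declaration.
$error = describe_xml_error("<message><![CDATA[Acme\x99 widgets]]></message>");
echo $error ?? 'parsed cleanly', "\n";
```

Scanning the hex dump for bytes of 80 and above, or below 20, is usually enough to spot the suspect.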
Then I realized I was cutting some corners by skipping the opening <?xml ?> declaration on all of the strings I was creating. That’s where you can specify the character set for the document, and sure enough prefixing it with
<?xml version="1.0" encoding="ISO-8859-1"?>
got me past that first error. I thought I was home free, but my test logs show that it failed again overnight after getting through 1300 more emails. I’ll have to dig into that further and see what the issue was there.
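For reference, here is roughly what the wrapping looks like with the declaration in place. It is a sketch rather than my exact code: wrap_body_as_xml() and the <message> element are made-up names, and it assumes the body really is ISO-8859-1 or close to it. It also splits any "]]>" in the body across two CDATA sections, since that sequence would otherwise end the section early.

```php
<?php
// Hypothetical helper: wraps a raw email body in a minimal XML document
// that declares its encoding up front.
function wrap_body_as_xml(string $body): string
{
    // "]]>" cannot appear inside CDATA, so close the section after "]]"
    // and reopen a new one before ">".
    $safe = str_replace(']]>', ']]]]><![CDATA[>', $body);

    return '<?xml version="1.0" encoding="ISO-8859-1"?>' . "\n"
        . '<message><![CDATA[' . $safe . ']]></message>';
}

// Made-up sample body containing the 0x99 byte.
$xml = wrap_body_as_xml("Acme\x99 widgets");

$parser = xml_parser_create();
$ok = xml_parse($parser, $xml, true);
echo $ok === 1 ? "parsed\n" : xml_error_string(xml_get_error_code($parser)) . "\n";
```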
It does seem like a design flaw that the parser dies on unrecognized characters, rather than shrugging its shoulders and carrying on. It may well be outside of the spec to have control characters that aren’t legal in the current character set, but it seems both possible and helpful to have a mode that either ignores or demotes those characters when they’re found, rather than throwing up its hands and refusing to parse any further. It has the same smell of enforcing elegance at the expense of utility that infuriated me with bondage and discipline languages like Pascal.
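In the absence of such a mode, the "ignore" behavior can be approximated by cleaning the text before it reaches the parser. The sketch below assumes the troublemakers are the low control codes that XML 1.0 forbids outright (everything below 0x20 except tab, line feed and carriage return); I have not confirmed that this is what caused the overnight failure, and strip_illegal_xml_chars() is a made-up name.

```php
<?php
// Hypothetical helper: drops the control bytes that XML 1.0 never allows,
// keeping tab (0x09), line feed (0x0A) and carriage return (0x0D).
function strip_illegal_xml_chars(string $text): string
{
    // No /u modifier, so this works byte by byte. These byte values never
    // occur inside a multi-byte UTF-8 sequence, so UTF-8 text is safe too.
    $clean = preg_replace('/[\x00-\x08\x0B\x0C\x0E-\x1F]/', '', $text);

    // preg_replace() returns null on failure; fall back to the input.
    return $clean ?? $text;
}

// Made-up sample: a NUL and a vertical tab mixed into ordinary text.
echo bin2hex(strip_illegal_xml_chars("a\x00b\x0Bc\td\n")), "\n";
```

Swapping the empty replacement string for something like "?" would give the "demote" behavior instead, at the cost of leaving visible placeholders in the text.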