How you can parse XML with PHP

I love XML, not because it’s an inherently beautiful format (it’s inelegant in a lot of ways: why do we have both attributes and character data?) but because for once we have a sensible and widely supported standard in the computing world. The payoff shows when you want to parse an XML file in PHP: support is built in by default, powered by the Expat library. For small files you can use SimpleXML, which builds an object tree from the whole document, but I need to parse large amounts of XML and didn’t want to keep all of that information in memory. Instead I’m hooking directly into the Expat event interface, which calls back to your code whenever a tag or a run of character data is encountered, and leaves it to you to retain and assemble whatever information you want to extract.
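
For comparison, here is roughly what the small-file route looks like. This is only a sketch: the sample document and its element names (library, book, title) are made up for illustration, and it assumes the SimpleXML extension is enabled, which it normally is.

```php
<?php
// Hypothetical sample document, inlined so the sketch is self-contained.
$xml = '<library>'
     . '<book id="1"><title>Dune</title></book>'
     . '<book id="2"><title>Emma</title></book>'
     . '</library>';

// simplexml_load_string() parses the whole document into memory at once,
// which is why this approach only suits small inputs.
$library = simplexml_load_string($xml);
if ($library === false) {
    die("could not parse XML input");
}

$titles = array();
foreach ($library->book as $book) {
    // Child elements are read as properties, attributes with array syntax.
    $titles[(string) $book['id']] = (string) $book->title;
}

print_r($titles);
```

The convenience is obvious, but every element ends up in memory before you can look at any of them, which is exactly what the event interface avoids.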

I’ve included the code below, and here’s a zip file of the example code together with a test XML file. It’s an expanded version of the example from the PHP manual, with the addition of character data handling and the storage of some data during the parsing. It takes the input XML file and outputs an indented version of all tags, showing any character data associated with each tag.

<?php
$file = "example.xml";

// Parser state, keyed per parser so several parsers could run side by side.
$depth = array();
$currenttagname = array();
$currenttagvalue = array();

// Returns an array key for a parser. Parsers are resources before PHP 8 and
// XMLParser objects from PHP 8 on, and an object cannot be used as an array key.
function parserKey($parser)
{
    return is_object($parser) ? spl_object_id($parser) : (int) $parser;
}

function onStartElement($parser, $name, $attrs)
{
    global $depth;
    global $currenttagname;
    global $currenttagvalue;

    $key = parserKey($parser);

    // first tag seen by this parser: set up its state
    if (!isset($depth[$key])) {
        $depth[$key] = 0;
        $currenttagname[$key] = array();
        $currenttagvalue[$key] = array();
    }

    for ($i = 0; $i < $depth[$key]; $i++) {
        echo "  ";
    }
    echo "$name\n";
    $depth[$key]++;

    $currentdepth = $depth[$key];

    $currenttagname[$key][$currentdepth] = $name;
    $currenttagvalue[$key][$currentdepth] = ""; // filled in by onCharacterData
}

function onEndElement($parser, $name)
{
    global $depth;
    global $currenttagname;
    global $currenttagvalue;

    $key = parserKey($parser);
    $currentdepth = $depth[$key];

    $storedname = $currenttagname[$key][$currentdepth];
    $storedvalue = $currenttagvalue[$key][$currentdepth];

    for ($i = 0; $i < $depth[$key]; $i++) {
        echo "  ";
    }
    echo $storedname;
    if ($storedvalue !== "")
        echo " = " . htmlspecialchars($storedvalue); // output goes into an HTML page
    echo "\n";

    $depth[$key]--;
}

function onCharacterData($parser, $data)
{
    global $depth;
    global $currenttagvalue;

    $key = parserKey($parser);

    if (!isset($depth[$key]) || $depth[$key] == 0)
        return; // ignore character data outside of tags

    // ignore new lines
    $data = str_replace("\n", "", $data);
    $data = str_replace("\r", "", $data);

    $currentdepth = $depth[$key];

    // the handler can fire several times for one tag, so append
    $currenttagvalue[$key][$currentdepth] .= $data;
}

$xml_parser = xml_parser_create();
xml_set_element_handler($xml_parser, "onStartElement", "onEndElement");
xml_set_character_data_handler($xml_parser, "onCharacterData");
if (!($fp = fopen($file, "r"))) {
    die("could not open XML input");
}
?>
<html>
<head><title>PHP XML Parsing Example</title></head>
<body><pre>
<?php

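// Loop on feof() rather than on fread()'s return value, so the parser still
// gets its final call when the file size is an exact multiple of 4096.
while (!feof($fp)) {
    $data = fread($fp, 4096);
    if ($data === false) {
        die("could not read XML input");
    }
    if (!xml_parse($xml_parser, $data, feof($fp))) {
        die(sprintf("XML error: %s at line %d",
                    xml_error_string(xml_get_error_code($xml_parser)),
                    xml_get_current_line_number($xml_parser)));
    }
}
fclose($fp);
xml_parser_free($xml_parser);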
?>
</pre></body>
</html>
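
The script above keeps its state in globals and echoes as it goes. If you would rather get the data back as a value, the same event-driven approach can be wrapped in a function. The following is a sketch under a few assumptions, not part of the original example: the function name collectElementText and the sample XML are made up, case folding is switched off so tag names keep their original case, and the input is fed in 16-byte chunks purely to imitate reading a large file with fread().

```php
<?php
// Hypothetical helper: returns a list of array(path, text) pairs,
// one for every element that directly contains non-blank text.
function collectElementText($xml)
{
    $stack = array();  // names of the currently open elements, innermost last
    $text  = array();  // text gathered so far for each open element
    $found = array();  // results

    $parser = xml_parser_create();
    // Keep tag names as written instead of folding them to upper case.
    xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);

    xml_set_element_handler(
        $parser,
        function ($p, $name, $attrs) use (&$stack, &$text) {
            $stack[] = $name;
            $text[]  = "";
        },
        function ($p, $name) use (&$stack, &$text, &$found) {
            $value = trim(array_pop($text));
            $path  = implode("/", $stack);
            array_pop($stack);
            if ($value !== "") {
                $found[] = array($path, $value);
            }
        }
    );

    xml_set_character_data_handler(
        $parser,
        function ($p, $data) use (&$text) {
            // May be called several times per element, so append.
            if (count($text) > 0) {
                $text[count($text) - 1] .= $data;
            }
        }
    );

    // Feed the parser piece by piece; only the last call is marked final.
    $chunks = str_split($xml, 16);
    $last   = count($chunks) - 1;
    foreach ($chunks as $i => $chunk) {
        if (!xml_parse($parser, $chunk, $i === $last)) {
            throw new RuntimeException(sprintf(
                "XML error: %s at line %d",
                xml_error_string(xml_get_error_code($parser)),
                xml_get_current_line_number($parser)
            ));
        }
    }

    return $found;
}

// Hypothetical sample input.
$sample = '<order><id>42</id><item><name>Widget</name></item></order>';
print_r(collectElementText($sample));
```

Closures with by-reference captures stand in for the globals, so nothing needs to be keyed by parser, and memory use still depends on how deeply the tags nest and how much you choose to keep rather than on the size of the file.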
