The real problem is that what XML::Simple primarily tries to do is take XML and represent it as a Perl data structure.
As you'll no doubt be aware from perldata, the two key data structures you have available are the hash and the array.
- Arrays are ordered lists of scalars.
- Hashes are unordered key-value pairs.
And XML doesn't really do either. It has elements which:
- are not uniquely named (which means hashes don't "fit").
- ... but are 'ordered' within the file.
- may have attributes (which you could insert into a hash).
- may have content (but might not, and could be a unary, self-closing tag).
- may have children (to any depth).
And these things don't map directly to the available Perl data structures. At a simplistic level, a nested hash of hashes might fit, but it can't cope with elements with duplicated names. Nor can you easily differentiate between attributes and child nodes.
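To make the duplicate-name problem concrete, here is a minimal sketch in core Perl, with no XML module involved. The `@elements` list is a hypothetical stand-in for three sibling elements in document order; it is not how any parser stores them.

```perl
use strict;
use warnings;

# Hypothetical stand-in for three sibling elements, in document order.
my @elements = (
    [ another_child => 'first' ],
    [ another_child => 'second' ],
    [ sibling       => 'third' ],
);

# Keyed by element name: the second 'another_child' overwrites the first.
my %by_name;
$by_name{ $_->[0] } = $_->[1] for @elements;

print scalar( keys %by_name ), " keys from ", scalar(@elements), " elements\n";
print "another_child => $by_name{another_child}\n";    # another_child => second
```

The array keeps every element and its order but gives up lookup by name; the hash gives lookup by name but silently drops one of the duplicates.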
So XML::Simple tries to guess based on the XML content, takes 'hints' from the various option settings, and then, when you try to output the content, it (tries to) apply the same process in reverse.
As a result, for anything other than the most simple XML, it becomes unwieldy at best, or loses data at worst.
Consider:
    <xml>
      <parent>
        <child att="some_att">content</child>
      </parent>
      <another_node>
        <another_child some_att="a value" />
        <another_child different_att="different_value">more content</another_child>
      </another_node>
    </xml>
This, when parsed through XML::Simple, gives you:
    $VAR1 = {
      'parent' => {
        'child' => {
          'att' => 'some_att',
          'content' => 'content'
        }
      },
      'another_node' => {
        'another_child' => [
          {
            'some_att' => 'a value'
          },
          {
            'different_att' => 'different_value',
            'content' => 'more content'
          }
        ]
      }
    };
Note that under parent you now have just anonymous hashes, but under another_node you have an array of anonymous hashes.
So in order to access the content of child:
    my $child = $xml->{parent}->{child}->{content};
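Note how you've got a 'child' node with a 'content' node beneath it. That key isn't there because the XML has an element or attribute called content; it's just the name XML::Simple gives to the element's text.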
But to access the content beneath the second another_child element (the first one is empty):
    my $another_child = $xml->{another_node}->{another_child}->[1]->{content};
Note how, because there are multiple <another_child> elements, the XML has been parsed into an array, where it wasn't with a single one. (If you did have an element called content beneath it, then you end up with something else yet.) You can change this by using ForceArray, but then you end up with a hash of arrays of hashes of arrays of hashes of arrays, although it is at least consistent in its handling of child elements. Edit: Note, following discussion - this is a bad default, rather than a flaw with XML::Simple.
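The shape difference described above can be checked directly. This is a small sketch that assumes XML::Simple is installed from CPAN and feeds it the same sample XML as a string, with default options.

```perl
use strict;
use warnings;
use XML::Simple;    # assumed installed from CPAN

my $xml = XMLin(<<'XML');
<xml>
  <parent>
    <child att="some_att">content</child>
  </parent>
  <another_node>
    <another_child some_att="a value" />
    <another_child different_att="different_value">more content</another_child>
  </another_node>
</xml>
XML

# With default options, the container type depends on how many
# same-named siblings happened to be in the input.
print "child:         ", ref( $xml->{parent}{child} ), "\n";                 # HASH
print "another_child: ", ref( $xml->{another_node}{another_child} ), "\n";   # ARRAY
```

Code that dereferences one of these as a hash will break as soon as the input gains a second sibling of the same name.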
You should set:
    ForceArray => 1, KeyAttr => [], ForceContent => 1
If you apply this to the XML as above, you get instead:
    $VAR1 = {
      'another_node' => [
        {
          'another_child' => [
            {
              'some_att' => 'a value'
            },
            {
              'different_att' => 'different_value',
              'content' => 'more content'
            }
          ]
        }
      ],
      'parent' => [
        {
          'child' => [
            {
              'att' => 'some_att',
              'content' => 'content'
            }
          ]
        }
      ]
    };
This will give you consistency, because single-node elements will no longer be handled differently from multi-node ones.
But you still:
- Have a 5 reference deep tree to get at a value.
E.g.:
    print $xml->{parent}->[0]->{child}->[0]->{content};
You still have content and child hash elements treated as if they were attributes, and because hashes are unordered, you simply cannot reconstruct the input. So basically, you have to parse it, then run it through Dumper to figure out where you need to look.
But with an xpath query, you get at that node with:
    findnodes("/xml/parent/child");
What you don't get in XML::Simple that you do in XML::Twig (and I presume XML::LibXML, but I know it less well):
- xpath support. xpath is an XML way of expressing a path to a node. So you can 'find' a node in the above with get_xpath('//child'). You can even use attributes in the xpath, like get_xpath('//another_child[@different_att]'), which will select exactly the one you wanted. (You can iterate on matches too.)
- cut and paste to move elements around.
- parsefile_inplace to allow you to modify XML with an in-place edit.
- pretty_print options, to format XML.
- twig_handlers and purge, which allow you to process really big XML without having to load it all in memory.
- simplify if you really must make it backwards compatible with XML::Simple.
- the code is generally way simpler than trying to follow daisy chains of references to hashes and arrays, that can never be done consistently because of the fundamental differences in structure.
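As a rough illustration of the xpath and cut/paste items in the list above, here is a sketch against the same sample XML. It assumes XML::Twig is installed from CPAN; where the child element gets moved to is an arbitrary choice for the demo.

```perl
use strict;
use warnings;
use XML::Twig;    # assumed installed from CPAN

my $twig = XML::Twig->new->parse(<<'XML');
<xml>
  <parent>
    <child att="some_att">content</child>
  </parent>
  <another_node>
    <another_child some_att="a value" />
    <another_child different_att="different_value">more content</another_child>
  </another_node>
</xml>
XML

# Iterate over every match, in document order.
for my $elt ( $twig->get_xpath('//another_child') ) {
    my $atts = $elt->atts;    # hash ref of this element's attributes
    print join( ', ', map {"$_=$atts->{$_}"} sort keys %$atts ), "\n";
}

# Select by the presence of an attribute. The second argument is an offset
# into the match list, so 0 returns the first matching element.
my $match = $twig->get_xpath( '//another_child[@different_att]', 0 );
print $match->text, "\n";    # more content

# Move <child> from <parent> to the end of <another_node>.
my $child = $twig->get_xpath( '//child', 0 );
$child->cut;
$child->paste( last_child => $twig->root->first_child('another_node') );
print $child->parent->tag, "\n";    # another_node
```

Note that nothing here depends on how many siblings an element has; the same calls work for one match or many.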
It's also widely available - easy to download from CPAN, and distributed as an installable package on many operating systems. (Sadly it's not a default install. Yet.)
See: XML::Twig quick reference
For the sake of comparison:
    use XML::Simple;
    use Data::Dumper;
    # \*DATA passes the DATA filehandle by reference.
    my $xml = XMLin( \*DATA, ForceArray => 1, KeyAttr => [], ForceContent => 1 );
    print Dumper $xml;
    print $xml->{parent}->[0]->{child}->[0]->{content};
Vs.
    use XML::Twig;
    # \*DATA passes the DATA filehandle by reference.
    my $twig = XML::Twig->parse( \*DATA );
    print $twig->get_xpath( '/xml/parent/child', 0 )->text;
    print $twig->root->first_child('parent')->first_child_text('child');
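For completeness, the twig_handlers and purge style mentioned earlier might look something like the sketch below. It assumes XML::Twig is installed from CPAN and parses a small in-memory string only to stay self-contained; for a genuinely large document you would more likely call parsefile on a file name. The `@seen` array is just a hypothetical place to collect results.

```perl
use strict;
use warnings;
use XML::Twig;    # assumed installed from CPAN

my @seen;    # hypothetical collector for this demo

my $twig = XML::Twig->new(
    twig_handlers => {
        # Called each time an <another_child> element has been fully parsed.
        another_child => sub {
            my ( $t, $elt ) = @_;
            push @seen, $elt->text;
            $t->purge;    # discard everything that has been completely parsed so far
        },
    },
);

$twig->parse(<<'XML');
<xml>
  <parent>
    <child att="some_att">content</child>
  </parent>
  <another_node>
    <another_child some_att="a value" />
    <another_child different_att="different_value">more content</another_child>
  </another_node>
</xml>
XML

print scalar(@seen), " another_child elements handled\n";    # 2 another_child elements handled
```

Because each element is processed and then purged as it arrives, memory use stays roughly tied to the size of one element rather than the whole document.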