Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Powershell: Faster version of xml.parentNode.RemoveChild

I'm trying to drop nodes from a very large XLF file (which is basically an XML file) via PowerShell. The (simplified) structure is always the following:

<?xml version="1.0" encoding="utf-8"?>
<xliff>
  <file>
    <body>
      <group>
        <trans-unit>
          <source>asd</source>
          <target>asd</target>
        </trans-unit>
        <trans-unit>
          <source> </source>
          <target> </target>
        </trans-unit>
        <trans-unit>
          <source>asd</source>
          <target>asdf</target>
        </trans-unit>
      </group>
    </body>
  </file>
</xliff>

Now I want to remove all trans-unit nodes in this file where source and target are equal.

Here is what I have so far:

Match function:

function Match{
    param(
        $sourceNode,$targetNode
    )
    #do this because empty string as xml value is of type xmlElement and fails to compare
    if ($sourceNode.innerText -eq " ") {
        $source = $sourceNode.innerText
    }
    else {
        $source = $sourceNode
    }
    if ($targetNode.innerText -eq " ") {
        $target = $targetNode.innerText
    }
    else {
        $target = $targetNode
    }
    return $source -eq $target
}

Code to remove nodes:

$xml = [xml]((Get-Content $xmlPath -Encoding UTF8).Replace("trans-unit", "transunit"))
$xml.xliff.file.body.group.transunit | ForEach-Object {
    if (Match $_.source $_.target) {
        $_.parentNode.RemoveChild($_) | Out-Null
    }
}
$xml = [xml]($xml.OuterXml.Replace("transunit", "trans-unit"))
$xml.Save($outPath)

This works, but unfortunately it is very slow as the file has roughly 300 000 nodes. It is important that the nodes keep their attributes while saving to further process the file later.

A faster approach which I was not able to finish is the following:

$xml = [xml]([System.IO.File]::ReadAllText($xmlPath).Replace("trans-unit", "transunit"))
$filteredNodes = $xml.xliff.file.body.group.transunit | Where-Object {
    !(Match $_.source $_.target)
}

???

$xml = [xml]($xml.OuterXml.Replace("transunit", "trans-unit"))
$xml.Save($outPath)

This gives me a list containing all XmlNodes where target and source differ, but unfortunately I was not able to pass this list back into the XML document.

Is there a faster way to remove those matching nodes from the file?

Question from: https://stackoverflow.com/questions/65904307/powershell-faster-version-of-xml-parentnode-removechild


1 Answer


There are a couple of things wrong with your approach.

  1. NEVER use [System.IO.File]::ReadAllText($xmlPath) or Get-Content $xmlPath to read an XML file. This is wrong because it kills the file encoding auto-detection that is built into XML, and it's wasteful because the XML parser is perfectly able to read the file on its own - reading it into a PowerShell variable first serves no purpose.

    Always load XML files with the parser directly:

    $doc = New-Object xml
    $doc.Load($xmlPath)
    
  2. You should use XPath to select candidates for deletion:

    $same = $doc.selectNodes('//trans-unit[source = target]')
    

    These are easy to iterate over and remove:

    foreach ($n in $same) {
        # RemoveChild returns the removed node; discard it so it is not written to the output
        $null = $n.ParentNode.RemoveChild($n)
    }
    

    This is about as fast as it gets when processing the file with .NET's XmlDocument.

  3. Don't call string.Replace() on XML source code. Just don't.
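
Putting points 1 and 2 together, a minimal end-to-end sketch might look like the following. The function name Remove-EqualTransUnit and its parameters are made up for illustration. It assumes, as in the simplified sample from the question, that the elements are in no XML namespace; a real XLIFF file that declares a default namespace would need an XmlNamespaceManager and a prefixed XPath instead. Copying the matches into an array before removing anything is a precaution, so the loop does not depend on how the node list behaves while the document changes.

```powershell
# Hypothetical helper: removes every trans-unit whose source equals its target.
function Remove-EqualTransUnit {
    param(
        [string]$InPath,   # full path to the input file
        [string]$OutPath   # full path to the output file
    )

    # Let the XML parser read the file itself (keeps encoding detection intact)
    $doc = New-Object xml
    $doc.Load($InPath)

    # @() copies the matching nodes into an array before the document is modified
    $same = @($doc.SelectNodes('//trans-unit[source = target]'))

    foreach ($n in $same) {
        # Discard the return value (the removed node)
        $null = $n.ParentNode.RemoveChild($n)
    }

    # Remaining nodes keep their attributes
    $doc.Save($OutPath)

    # Emit the number of removed nodes
    $same.Count
}
```

It could be called as Remove-EqualTransUnit -InPath $xmlPath -OutPath $outPath, with both variables holding full paths.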


You can use XSLT to strip nodes from a document. There's a good chance that this performs better:

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:strip-space elements="*" />
  <xsl:output method="xml" indent="yes" encoding="utf-8" />

  <xsl:template match="node() | @*">
      <xsl:copy>
          <xsl:apply-templates select="node() | @*" />
      </xsl:copy>
  </xsl:template>
  
  <xsl:template match="trans-unit[source = target]" />
</xsl:stylesheet>

Usage in PowerShell goes something like this:

using namespace System.Xml

$xsl = [Xsl.XslCompiledTransform]::new()
$xsl.Load('C:\path\to\above.xsl')

$xmlIn = [XmlTextReader]::new('C:\path\to\input.xml')
# Pass the stylesheet's output settings so that <xsl:output> (indent, encoding) is honored
$xmlOut = [XmlWriter]::Create('C:\path\to\output.xml', $xsl.OutputSettings)
$xsl.Transform($xmlIn, $xmlOut)

$xmlIn.Close()
$xmlIn.Dispose()

$xmlOut.Close()
$xmlOut.Dispose()

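A caveat with using the .NET objects directly is that you need to supply full paths. Relative paths are resolved against the process's working directory, which is usually not PowerShell's current location, so they tend to point at the wrong place. You can use (Join-Path (Get-Location) 'filename.xml') to create a full path where needed.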
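
As an alternative to Join-Path, a small helper along these lines can turn either a relative or an already absolute path into a full one before handing it to .NET. The name Resolve-FullPath is hypothetical. The sketch relies on the session's GetUnresolvedProviderPathFromPSPath method, which also accepts paths that do not exist yet, so it should work for the output file as well.

```powershell
# Hypothetical helper: resolve a path against PowerShell's current location.
function Resolve-FullPath {
    param([string]$Path)

    # Unlike Resolve-Path, this does not require the target to exist
    $ExecutionContext.SessionState.Path.GetUnresolvedProviderPathFromPSPath($Path)
}
```

For example, $doc.Load((Resolve-FullPath 'input.xml')) would load input.xml from the directory PowerShell is currently in.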

