I have a data structure like the following. The input file is quite large, so I am trying to find an efficient method.
<?xml version='1.0' encoding='UTF-8'?>
<corpus name="corpus">
<recording audio="audio.wav" name="first audio">
<segment name="1" start="0" end="2">
<orth>some text 1</orth>
</segment>
<segment name="2" start="2" end="4">
<orth>some text 2</orth>
</segment>
<segment name="3" start="4" end="6">
<orth>some text 3</orth>
</segment>
</recording>
</corpus>
Given an input file containing a list of segment names, such as
1
3
it should remove the segments that have those names. For example, given 1 and 3, the segments named 1 and 3 are removed:
<?xml version='1.0' encoding='UTF-8'?>
<corpus name="corpus">
<recording audio="audio.wav" name="first audio">
<segment name="2" start="2" end="4">
<orth>some text 2</orth>
</segment>
</recording>
</corpus>
The code I have so far:
from lxml import etree

with open("g.xml", "r") as xml_file:
    xml_data = xml_file.read()

with open('del_names.txt', 'r') as file:
    list_of_names = file.read().split("\n")

new_xml = xml_data
for each_name in list_of_names:
    print(each_name)
    tree = etree.XML(new_xml.encode())
    find_segments = tree.xpath("*//segment[@name='{}']".format(each_name))
    for each_segment in find_segments:
        each_segment.getparent().remove(each_segment)
    new_xml = str(etree.tostring(tree, pretty_print=True, xml_declaration=True), encoding="utf-8")
print(new_xml)
The problem is that the code has been running for 2 hours now and has not output a single line. I am not sure how to do this efficiently.
How do I accomplish this? I also think having 2 might be unnecessary; is that correct?
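For reference, here is a minimal sketch of the single-pass behaviour I am after, assuming the whole tree fits in memory. It parses the XML once, keeps the names in a set, and removes matching segments in one walk instead of re-parsing and re-serializing for every name. It uses the standard library's xml.etree.ElementTree rather than lxml so that it runs without extra dependencies; remove_segments and filter_file are illustrative names I made up, not part of either library.

```python
import xml.etree.ElementTree as ET


def remove_segments(root, names):
    """Remove, in place, every <segment> whose name attribute is in `names`."""
    names = set(names)  # set membership is a constant-time lookup
    removed = 0
    # ElementTree elements have no parent pointer, so visit every element
    # and inspect its direct children. Both lists are copies because the
    # tree is mutated while looping.
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag == "segment" and child.get("name") in names:
                parent.remove(child)
                removed += 1
    return removed


def filter_file(xml_path, names_path, out_path):
    """Hypothetical wrapper: parse once, filter once, write once."""
    with open(names_path, encoding="utf-8") as fh:
        # One name per line; blank lines are ignored.
        names = {line.strip() for line in fh if line.strip()}
    tree = ET.parse(xml_path)
    remove_segments(tree.getroot(), names)
    tree.write(out_path, encoding="UTF-8", xml_declaration=True)


# Small in-memory check on the sample data from above.
sample = """<corpus name="corpus">
<recording audio="audio.wav" name="first audio">
<segment name="1" start="0" end="2">
<orth>some text 1</orth>
</segment>
<segment name="2" start="2" end="4">
<orth>some text 2</orth>
</segment>
<segment name="3" start="4" end="6">
<orth>some text 3</orth>
</segment>
</recording>
</corpus>"""

root = ET.fromstring(sample)
removed = remove_segments(root, ["1", "3"])
remaining = [seg.get("name") for seg in root.iter("segment")]
```

With the real files this would be called as filter_file("g.xml", "del_names.txt", "out.xml"), where out.xml is an assumed output name. I have not measured how it behaves on the large input.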
question from:
https://stackoverflow.com/questions/65876287/how-to-delete-parts-of-xml-data-and-write-it-to-a-new-file-with-python