Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
1.3k views
in Technique[技术] by (71.8m points)

python - Parsing xml to pandas data frame throws memory error

I am trying to put 1. a parent attribute 2. a child attribute and 3. a grandchild text into a data frame. I am able to get the child attribute and the grandchild text to print out on the screen, but I cannot get them to go into a data frame. I get a memory error from pandas.

Here is intro stuff

import requests
from lxml import etree, objectify
r = requests.get('https://api.stuff.us/place/getData?   security_key=key&period=minutes&startTime=2013-05-01T00:00&endTime=2013-05-01T23:59&sort=channel') #edited for privacy
root = etree.fromstring(r.text)
xml_new = etree.tostring(root, pretty_print=True)
print xml_new[300:900] #gives xml output to show structure
<startTime>2013-05-01 00:00:00</startTime>
<endTime>2013-05-01 23:59:00</endTime>
<summaryPeriod>minutes</summaryPeriod>
<data>
  <channel channel="97925" name="blah"> 
    <Time Time="2013-05-01 00:00:00">
      <value>258</value>
    </Time>
    <Time Time="2013-05-01 00:01:00">
      <value>259</value>
    </Time>
    <Time Time="2013-05-01 00:02:00">
      <value>258</value>
    </Time>
    <Time Time="2013-05-01 00:03:00">
      <value>257</value>
    </Time>

This shows how I am parsing to get the child attribute and grandchild to print.

for df in root.xpath('//channel/Time'):
    ## Iterate over attributes of channel/Time
    for attrib in df.attrib:
            print '@' + attrib + '=' + df.attrib[attrib]
    ## value is a child of time, and iterate
    subfields = df.getchildren()
    for subfield in subfields:
            print 'subfield=' + subfield.text

It yields a very long print out with the information as requested:

...
@Time=2013-05-01 23:01:00
value=100
@Time=2013-05-01 23:02:00
value=101
@Time=2013-05-01 23:03:00
value=99
@Time=2013-05-01 23:04:00
value=101
...

However, when I try to put it into a data frame, I get a memory error. I tried with both of them an also with just trying to get the child attribute into a data frame.

data = []
for df in root.xpath('//channel/Time'):
    ## Iterate over attributes of channel/Time
    for attrib in df.attrib:
        el_data = {}
        el_data[attrib] = df.attrib[attrib]
    data.append(el_data)
from pandas import *
perf = DataFrame(data)
perf

---------------------------------------------------------------------------
MemoryError                               Traceback (most recent call last)
<ipython-input-6-08c8c74f7192> in <module>()
      1 from pandas import *
----> 2 perf = DataFrame(data)
      3 perf

/Users/user/Library/Enthought/Canopy_32bit/User/lib/python2.7/site-    packages/pandas/core/frame.pyc in __init__(self, data, index, columns, dtype, copy)
    417 
    418                 if isinstance(data[0], (list, tuple, collections.Mapping, Series)):
--> 419                     arrays, columns = _to_arrays(data, columns, dtype=dtype)
    420                     columns = _ensure_index(columns)
    421 

/Users/user/Library/Enthought/Canopy_32bit/User/lib/python2.7/site-packages/pandas/core/frame.pyc in _to_arrays(data, columns, coerce_float, dtype)
   5457         return _list_of_dict_to_arrays(data, columns,
   5458                                        coerce_float=coerce_float,
-> 5459                                        dtype=dtype)
   5460     elif isinstance(data[0], Series):
   5461         return _list_of_series_to_arrays(data, columns,

/Users/user/Library/Enthought/Canopy_32bit/User/lib/python2.7/site-    packages/pandas/core/frame.pyc in _list_of_dict_to_arrays(data, columns, coerce_float, dtype)
   5521             for d in data]
   5522 
-> 5523     content = list(lib.dicts_to_array(data, list(columns)).T)
   5524     return _convert_object_array(content, columns, dtype=dtype,
   5525                                  coerce_float=coerce_float)

/Users/user/Library/Enthought/Canopy_32bit/User/lib/python2.7/site-packages/pandas/lib.so in pandas.lib.dicts_to_array (pandas/lib.c:7657)()

MemoryError: 

I have 12960 values of "value" in my xml file. I assume that these memory errors are telling me something about the values in the file not meeting what is expected, but that doesn't match with a memory error, and I could not figure it out from other SO questions regarding memory errors or from the pandas documentation.

An attempt to get the data types yields no information. Maybe there are no types? Perhaps because they are elements in an element tree. (I tried to print .pyval, but it only told me there was no attribute.) el_data is of type "dict"

print(objectify.dump(root))[700:1000] #print a subset of types
name = 'zone'
            Time = None [_Element]
              * Time = '2013-05-01 00:00:00'
                value = '258' [_Element]
            Time = None [_Element]
              * Time = '2013-05-01 00:01:00'
                value = '259' [_Element]
type(el_data)
dict

I built this code based on the book Python for Data Analysis and other examples found on SO for parsing XML. I am still new to python.

Running Python 2.7.2 on Mac OS 10.7.5

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Answer

0 votes
by (71.8m points)

Answer based on help from Jeff and JoeKington. The data needed to be put into lists separately before being pushed into the dataframe. The memory error was being caused by the multiple "elements" which were not able to be put into a data frame. Instead, each element dict needs to be put into a list which can go into a data frame.

This works:

dTime=[]
dvalue=[]
for df in root.xpath('//channel/Time'):
    ## Iterate over attributes of channel
    for attrib in df.attrib:
    dTime.append(df.attrib[attrib])
    ## value is a child of time, and iterate
    subfields = df.getchildren()
    for subfield in subfields:
    dvalue.append(subfield.text)
pef=DataFrame({'Time':dTime,'values':dvalue})

pef

&ltclass 'pandas.core.frame.DataFrame'&gt
Int64Index: 12960 entries, 0 to 12959
Data columns (total 2 columns):
Time     12960  non-null values
value    12960  non-null values
dtypes: object(2) 

pef[:5]

    Time                    value
0    2013-05-01 00:00:00    258
1    2013-05-01 00:01:00    259
2    2013-05-01 00:02:00    258
3    2013-05-01 00:03:00    257
4    2013-05-01 00:04:00    257

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...