I am new to parsing XML in R. I am trying to parse XML into a workable data frame. I have tried some XPath functions from the XML package but cannot seem to arrive at the correct answer.
Here is my XML:
<ResidentialProperty>
<Listing>
<StreetAddress>
<StreetNumber>11111</StreetNumber>
<StreetName>111th</StreetName>
<StreetSuffix>Avenue Ct</StreetSuffix>
<StateOrProvince>WA</StateOrProvince>
</StreetAddress>
<MLSInformation>
<ListingStatus Status="Active"/>
<StatusChangeDate>2015-07-05T23:48:53.410</StatusChangeDate>
</MLSInformation>
<GeographicData>
<Latitude>11.111111</Latitude>
<Longitude>-111.111111</Longitude>
<County>Pierce</County>
</GeographicData>
<SchoolData>
<SchoolDistrict>Puyallup</SchoolDistrict>
</SchoolData>
<View>Territorial</View>
</Listing>
<YearBuilt>1997</YearBuilt>
<InteriorFeatures>Bath Off Master,Dbl Pane/Storm Windw</InteriorFeatures>
<Occupant>
<Name>Vacant</Name>
</Occupant>
<WaterFront/>
<Roof>Composition</Roof>
<Exterior>Brick,Cement Planked,Wood,Wood Products</
</ResidentialProperty>
When I run:
ResidentialProperty <- xmlToDataFrame(nodes=getNodeSet(doc,"//ResidentialProperty"))
The values of the child nodes within the parent node is compressed to:
11111111thAvenue CtWA2015-07-05T23:48:53.41011.111111-111.111111PiercePuyallupTerritorial
If I move down one node to , the same thing happens:
11111111thAvenue CtWA
The values of the child nodes are all pasted together.
I also tried a brute force method which worked somewhat:
StreetAddress <- xmlToDataFrame(nodes=getNodeSet(doc,"//StreetAddress"))
MLSInformation <- xmlToDataFrame(nodes=getNodeSet(doc,"//MLSInformation"))
GeographicData <- xmlToDataFrame(nodes=getNodeSet(doc,"//GeographicData"))
SchoolData <- xmlToDataFrame(nodes=getNodeSet(doc,"//SchoolData"))
YearBuilt <- xmlToDataFrame(nodes=getNodeSet(doc,"//YearBuilt"))
InteriorFeatures <- xmlToDataFrame(nodes=getNodeSet(doc,"//InteriorFeatures"))
Occupant <- xmlToDataFrame(nodes=getNodeSet(doc,"//Occupant"))
Roof <- xmlToDataFrame(nodes=getNodeSet(doc,"//Roof"))
Exterior <- xmlToDataFrame(nodes=getNodeSet(doc,"//Exterior"))
df <- cbind(StreetAddress, MLSInformation, GeographicData, SchoolData, YearBuilt, InteriorFeatures, Occupant, Roof, Exterior)
but some of the column names were not as expected:
> colnames(df)
[1] "StreetNumber" "StreetName" "StreetSuffix" "StateOrProvince" "ListingStatus"
[6] "StatusChangeDate" "Latitude" "Longitude" "County" "SchoolDistrict"
[11] "text" "text" "Name" "text" "text"
colnames[11,12,14,15]
should be "YearBuilt", "InteriorFeatures", "Roof", and "Exterior"
respectively. (Side note - why does this happen?)
I am trying to find a way to sort each atomic value into an appropriate column of a data frame with the column names being the names of the nodes, even within nested children nodes. Also, my data may change over time, so I'm looking for a dynamic function to conform to the data, producing expected results if possible.
I imagine this is a somewhat common XML schema (with layers of nested children) so I am surprised to not find much info on the topic, though I may simply using the wrong jargon in my searches. It my guess that there is a simple answer. Do you have any suggestions?
See Question&Answers more detail:
os