Update: 21 February 2016
col_fixer has been updated to include a vec2col argument that lets you flatten a list column into either a single string or a set of columns.
In the data.frame you've downloaded, I see several different column types. There are normal columns comprising vectors of the same type. There are list columns where the items may be NULL or may themselves be a flat vector. There are list columns where there are data.frames as the list elements. There are list columns that contain a data.frame of the same number of rows as the main data.frame.
Here's a sample dataset that recreates those conditions:
mydf <- data.frame(id = 1:3, type = c("A", "A", "B"),
                   facility = I(list(c("x", "y"), NULL, "x")),
                   address = I(list(data.frame(v1 = 1, v2 = 2, v4 = 3),
                                    data.frame(v1 = 1:2, v2 = 3:4, v3 = 5),
                                    data.frame(v1 = 1, v2 = NA, v3 = 3))))
mydf$person <- data.frame(name = c("AA", "BB", "CC"), age = c(20, 32, 23),
                          preference = c(TRUE, FALSE, TRUE))
The str of this sample data.frame looks like:
str(mydf)
## 'data.frame': 3 obs. of 5 variables:
## $ id : int 1 2 3
## $ type : Factor w/ 2 levels "A","B": 1 1 2
## $ facility:List of 3
## ..$ : chr "x" "y"
## ..$ : NULL
## ..$ : chr "x"
## ..- attr(*, "class")= chr "AsIs"
## $ address :List of 3
## ..$ :'data.frame': 1 obs. of 3 variables:
## .. ..$ v1: num 1
## .. ..$ v2: num 2
## .. ..$ v4: num 3
## ..$ :'data.frame': 2 obs. of 3 variables:
## .. ..$ v1: int 1 2
## .. ..$ v2: int 3 4
## .. ..$ v3: num 5 5
## ..$ :'data.frame': 1 obs. of 3 variables:
## .. ..$ v1: num 1
## .. ..$ v2: logi NA
## .. ..$ v3: num 3
## ..- attr(*, "class")= chr "AsIs"
## $ person :'data.frame': 3 obs. of 3 variables:
## ..$ name : Factor w/ 3 levels "AA","BB","CC": 1 2 3
## ..$ age : num 20 32 23
## ..$ preference: logi TRUE FALSE TRUE
## NULL
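If you want to check programmatically which of these kinds each column is, something like the following rough sketch might help. Note that col_kind and the toy data are my own illustrative names, not part of the original data or of any package, and the labels are only as fine-grained as the cases described above.

```r
# Hypothetical helper (an illustration only): label each column by the
# kind of structure it holds.
col_kind <- function(x) {
  if (is.data.frame(x)) {
    # Must be tested first, because a data.frame is also a list
    "nested data.frame"
  } else if (is.list(x)) {
    non_null <- Filter(Negate(is.null), x)
    if (length(non_null) > 0 &&
        all(vapply(non_null, is.data.frame, logical(1L)))) {
      "list of data.frames"
    } else {
      "list of vectors"
    }
  } else {
    "atomic vector"
  }
}

# Small toy data with one column of each kind
toy <- data.frame(id = 1:2)
toy$tags <- I(list(c("x", "y"), NULL))
toy$addr <- I(list(data.frame(v1 = 1), data.frame(v1 = 2:3)))
toy$person <- data.frame(name = c("AA", "BB"))

kinds <- vapply(toy, col_kind, character(1L))
kinds
```

The order of the checks matters here: is.list() is also TRUE for a data.frame, so the data.frame test has to come first.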
One way you can "flatten" this is to "fix" the list columns. There are three fixes.
- flatten (from "jsonlite") will take care of columns like the "person" column.
- Columns like the "facility" column can be fixed using toString, which converts each element to a single comma-separated string, or they can be spread into multiple columns.
- Columns where there are data.frames, some with multiple rows, first need to be flattened into a single row (by transforming to a "wide" format) and then need to be bound together as a single data.table. (I'm using "data.table" for reshaping and for binding the rows together.)
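To make the second fix concrete, here is a small sketch of the two options applied to a bare list shaped like the "facility" column. The object names (facility, as_string, as_columns) are just for illustration.

```r
library(data.table)

facility <- list(c("x", "y"), NULL, "x")

# Option 1: collapse each element into one comma-separated string.
# toString(NULL) is "", so empty elements become empty strings.
as_string <- vapply(facility, toString, character(1L))

# Option 2: spread the elements across columns. transpose() pads the
# shorter elements with NA, and as.data.table() names the columns V1, V2.
as_columns <- as.data.table(transpose(facility))

as_string
as_columns
```

Which option is preferable depends on what you plan to do next: the string form keeps one column per variable, while the column form keeps each value separately addressable.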
We can take care of the second and third points with a function like the following:
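col_fixer <- function(x, vec2col = FALSE) {
  if (!is.list(x[[1]])) {
    # List column of plain vectors (the first element decides the branch)
    if (isTRUE(vec2col)) {
      # Spread each element across columns V1, V2, ..., padding with NA
      as.data.table(data.table::transpose(x))
    } else {
      # Collapse each element into one comma-separated string
      vapply(x, toString, character(1L))
    }
  } else {
    # List column of data.frames: stack them, number the rows within each
    # element, then cast to one wide row per element
    temp <- rbindlist(x, use.names = TRUE, fill = TRUE, idcol = TRUE)
    temp[, .time := sequence(.N), by = .id]
    value_vars <- setdiff(names(temp), c(".id", ".time"))
    dcast(temp, .id ~ .time, value.var = value_vars)[, .id := NULL]
  }
}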
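If the data.frame branch is hard to follow, it may help to run its steps one at a time on a tiny input. This is only a walk-through sketch; address, long and wide are illustrative names, and the input is deliberately simpler than the sample data.

```r
library(data.table)

address <- list(data.frame(v1 = 1, v2 = 2),
                data.frame(v1 = c(1, 2), v2 = c(3, 4)))

# Step 1: stack the data.frames; the .id column records which list
# element each row came from.
long <- rbindlist(address, use.names = TRUE, fill = TRUE, idcol = TRUE)

# Step 2: number the rows within each element so that each row can become
# its own set of columns.
long[, .time := sequence(.N), by = .id]

# Step 3: cast to wide, giving one row per element and one column per
# (variable, row number) pair, named like v1_1, v1_2, ...
wide <- dcast(long, .id ~ .time, value.var = c("v1", "v2"))
wide
```

Elements with fewer rows than the longest one end up with NA in the columns for the rows they lack.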
We'll integrate that and the flatten function in another function that would do most of the processing.
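Flattener <- function(indf, vec2col = FALSE) {
  require(data.table)
  require(jsonlite)
  # Expand nested data.frame columns (like "person") into regular columns
  indf <- flatten(indf)
  # Fix the remaining list columns and bind the results side by side
  listcolumns <- sapply(indf, is.list)
  newcols <- do.call(cbind, lapply(indf[listcolumns], col_fixer, vec2col))
  # Drop the original list columns and attach their flattened replacements
  indf[listcolumns] <- list(NULL)
  cbind(indf, newcols)
}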
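As an aside, the flatten step used inside Flattener can be tried on its own. The sketch below uses a made-up two-row data.frame (nested and flat are illustrative names) and calls the function with its namespace, which I'd suggest as a precaution because other packages also export a function called flatten.

```r
library(jsonlite)

# A data.frame with a nested data.frame column, like "person" above
nested <- data.frame(id = 1:2)
nested$person <- data.frame(name = c("AA", "BB"), age = c(20, 32))

# flatten() promotes the inner columns to top-level columns and prefixes
# them with the name of the column they came from.
flat <- jsonlite::flatten(nested)
names(flat)
```

With that aside out of the way, back to Flattener itself.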
Running the function gives us:
Flattener(mydf)
##   id type person.name person.age person.preference facility address.v1_1
## 1  1    A          AA         20              TRUE     x, y            1
## 2  2    A          BB         32             FALSE                     1
## 3  3    B          CC         23              TRUE        x            1
##   address.v1_2 address.v2_1 address.v2_2 address.v4_1 address.v4_2 address.v3_1
## 1           NA            2           NA            3           NA           NA
## 2            2            3            4           NA           NA            5
## 3           NA           NA           NA           NA           NA            3
##   address.v3_2
## 1           NA
## 2            5
## 3           NA
Or, with the vectors going into separate columns:
Flattener(mydf, TRUE)
##   id type person.name person.age person.preference facility.V1 facility.V2
## 1  1    A          AA         20              TRUE           x           y
## 2  2    A          BB         32             FALSE        <NA>        <NA>
## 3  3    B          CC         23              TRUE           x        <NA>
##   address.v1_1 address.v1_2 address.v2_1 address.v2_2 address.v4_1 address.v4_2
## 1            1           NA            2           NA            3           NA
## 2            1            2            3            4           NA           NA
## 3            1           NA           NA           NA           NA           NA
##   address.v3_1 address.v3_2
## 1           NA           NA
## 2            5            5
## 3            3           NA
Here's the str:
str(Flattener(mydf))
## 'data.frame': 3 obs. of 14 variables:
## $ id : int 1 2 3
## $ type : Factor w/ 2 levels "A","B": 1 1 2
## $ person.name : Factor w/ 3 levels "AA","BB","CC": 1 2 3
## $ person.age : num 20 32 23
## $ person.preference: logi TRUE FALSE TRUE
## $ facility : chr "x, y" "" "x"
## $ address.v1_1 : num 1 1 1
## $ address.v1_2 : num NA 2 NA
## $ address.v2_1 : num 2 3 NA
## $ address.v2_2 : num NA 4 NA
## $ address.v4_1 : num 3 NA NA
## $ address.v4_2 : num NA NA NA
## $ address.v3_1 : num NA 5 3
## $ address.v3_2 : num NA 5 NA
## NULL
On your "providers" object, this runs quickly and consistently, with median times of roughly 100 to 140 ms against about 2.3 s for flattenList:
library(microbenchmark)
out <- microbenchmark(Flattener(providers), Flattener(providers, TRUE), flattenList(jsonRList))
out
# Unit: milliseconds
#                        expr        min         lq      mean    median        uq       max neval
#        Flattener(providers)  104.18939  126.59295  157.3744  138.4185  174.5222  308.5218   100
#  Flattener(providers, TRUE)   67.56471   86.37789  109.8921   96.3534  121.4443  301.4856   100
#      flattenList(jsonRList) 1780.44981 2065.50533 2485.1924 2269.4496 2694.1487 4397.4793   100
library(ggplot2)
qplot(y = time, data = out, colour = expr) ## Via @TylerRinker