dataframe - Porting set operations from R's data frames to data tables: How to identify duplicated rows?

Question

Welcome To Ask or Share your Answers For Others

dataframe - Porting set operations from R's data frames to data tables: How to identify duplicated rows?

asked Oct 24, 2021 in Technique[技术] by 深蓝 (71.8m points)

dataframe - Porting set operations from R's data frames to data tables: How to identify duplicated rows?

[Update 1: As Matthew Dowle noted, I'm using data.table version 1.6.7 on R-Forge, not CRAN. You won't see the same behavior with an earlier version of data.table.]

As background: I am porting some little utility functions to do set operations on rows of a data frame or pairs of data frames (i.e. each row is an element in a set), e.g. unique - to create a set from a list, union, intersection, set difference, etc. These mimic Matlab's intersect(...,'rows'), setdiff(...,'rows'), etc., which don't appear to have counterparts in R (R's set operations are limited to vectors and lists, but not rows of matrices or data frames). Examples of these little functions are below. If this functionality for data frames already exists in some package or base R, I'm open to suggestions.

I have been migrating these to data tables and one necessary step in the current approach is to find duplicated rows. When duplicated() is executed an error is returned stating that data tables must have keys. This is an unfortunate roadblock - other than setting keys, which isn't a universal solution and adds to computational costs, is there some other way to find duplicated objects?

Here is a reproducible example:

library(data.table)
set.seed(0)
x   <- as.data.table(matrix(sample(2, 100, replace = TRUE), ncol = 4))
y   <- as.data.table(matrix(sample(2, 100, replace = TRUE), ncol = 4))

res3    <- dt_intersect(x,y)

Yielding this error message:

Error in duplicated.data.table(z_rbind) : data table must have keys

The code works as-is for data frames, though I've named each function with the pattern dt_operation.

Is there some way to get around this issue? Setting keys only works for integers, which is a constraint I can't assume for the input data. So, perhaps I'm missing a clever way to use data tables?

Example set operation functions, where the elements of the sets are rows of data:

dt_unique       <- function(x){
    return(unique(x))
}

dt_union        <- function(x,y){
    z_rbind     <- rbind(x,y)
    z_unique    <- dt_unique(z_rbind)
    return(z_unique)
}

dt_intersect    <- function(x,y){
    zx          <- dt_unique(x)
    zy          <- dt_unique(y)

    z_rbind     <- rbind(zy,zx)
    ixDupe      <- which(duplicated(z_rbind))
    z           <- z_rbind[ixDupe,]
    return(z)
}

dt_setdiff      <- function(x,y){
    zx          <- dt_unique(x)
    zy          <- dt_unique(y)

    z_rbind     <- rbind(zy,zx)
    ixRangeX    <- (nrow(zy) + 1):nrow(z_rbind)
    ixNotDupe   <- which(!duplicated(z_rbind))
    ixDiff      <- intersect(ixNotDupe, ixRangeX)
    diffX       <- z_rbind[ixDiff,]
    return(diffX)
}

Note 1: One intended use for these helper functions is to find rows where key values in x are not among the key values in y. This way, I can find where NAs may appear when calculating x[y] or y[x]. Although this usage allows for setting of keys for the z_rbind object, I'd prefer not to constrain myself to just this use case.

Note 2: For related posts, here is a post on running unique on data frames, with excellent results for running it with the updated data.table package. And this is an earlier post on running unique on data tables.

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙；凝视深渊过久,深渊将回以凝视…

1 Answer

深蓝 · Answer 1 · 2021-10-23T19:18:19+0000

duplicated.data.table needs the same fix unique.data.table got [EDIT: Now done in v1.7.2]. Please raise another bug report: bug.report(package="data.table"). For the benefit of others watching, you're already using v1.6.7 from R-Forge, not 1.6.6 on CRAN.

But, on Note 1, there's a 'not join' idiom :

x[-x[y,which=TRUE]]

See also FR#1384 (New 'not' and 'whichna' arguments?) to make that easier for users, and that links to the keys that don't match thread which goes into more detail.

Update. Now in v1.8.3, not-join has been implemented.

DT[-DT["a",which=TRUE,nomatch=0],...]   # old idiom
DT[!"a",...]                            # same result, now preferred.

Categories

dataframe - Porting set operations from R's data frames to data tables: How to identify duplicated rows?

dataframe - Porting set operations from R's data frames to data tables: How to identify duplicated rows?

Please log in or register to add a comment.

Please log in or register to answer this question.

1 Answer

Please log in or register to add a comment.

Just Browsing Browsing

Most popular tags