Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share

data.table - Find rows with duplicate values across a small set of columns in R

Let's say that I have a data.table with an id column and integer values in four other columns. How can I efficiently find the rows where at least two of the values in those four columns are the same?

library(data.table)
fooTbl = data.table(id = c('a', 'b'), ind1=c(1,2), ind2=c(3,4), ind3=c(2,3), ind4=c(2,1))
fooTbl
#    id ind1 ind2 ind3 ind4
# 1:  a    1    3    2    2
# 2:  b    2    4    3    1

I have two solutions already. The first is much faster than the second, but the first requires hard-coding all of the combinations and checking equality for each of them. This seems undesirable and difficult to maintain as the number of columns increases:

fooTbl[, uniq := (ind1 != ind2 & ind1 != ind3 & ind1 != ind4 & ind2 != ind3 & ind2 != ind4 & ind3 != ind4)]
fooTbl
#    id ind1 ind2 ind3 ind4  uniq
# 1:  a    1    3    2    2 FALSE
# 2:  b    2    4    3    1  TRUE

The second is to melt the table into long form and count the distinct values per id. This one is more maintainable (no hard-coding of all of the combinations) but is much slower:

fooTbl[, uniq := NULL]
fooTbl
#    id ind1 ind2 ind3 ind4
# 1:  a    1    3    2    2
# 2:  b    2    4    3    1
fooTbl = melt(fooTbl, measure.vars=c('ind1', 'ind2', 'ind3', 'ind4'))
fooTbl
#    id variable value
# 1:  a     ind1     1
# 2:  b     ind1     2
# 3:  a     ind2     3
# 4:  b     ind2     4
# 5:  a     ind3     2
# 6:  b     ind3     3
# 7:  a     ind4     2
# 8:  b     ind4     1
fooTbl[, N := length(unique(value)), by=id]
fooTbl[, uniq := N == 4][, N := NULL]
fooTbl
#    id variable value  uniq
# 1:  a     ind1     1 FALSE
# 2:  b     ind1     2  TRUE
# 3:  a     ind2     3 FALSE
# 4:  b     ind2     4  TRUE
# 5:  a     ind3     2 FALSE
# 6:  b     ind3     3  TRUE
# 7:  a     ind4     2 FALSE
# 8:  b     ind4     1  TRUE
fooTbl = dcast(fooTbl, id + uniq ~ variable, value.var='value')
fooTbl
#   id  uniq ind1 ind2 ind3 ind4
# 1  a FALSE    1    3    2    2
# 2  b  TRUE    2    4    3    1
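
The long-form steps above can probably be condensed. The following is only a sketch, assuming id is unique per row: it counts the distinct values per id with uniqueN and joins the flag back onto the wide table, which avoids the final dcast. The names wide, long and flags are made up for illustration. It still groups by id, so it is not expected to remove the cost of the by operation.

```r
library(data.table)

# Hypothetical copy of the example table, kept in wide form throughout.
wide <- data.table(id = c('a', 'b'), ind1 = c(1, 2), ind2 = c(3, 4),
                   ind3 = c(2, 3), ind4 = c(2, 1))
ind <- paste0("ind", 1:4)

# Long form: one row per (id, column) pair.
long <- melt(wide, id.vars = "id", measure.vars = ind)

# One flag per id: TRUE when all measured values are distinct.
flags <- long[, .(uniq = uniqueN(value) == length(ind)), by = id]

# Update join: copy the flag back onto the wide table by id.
wide[flags, uniq := i.uniq, on = "id"]
```

Comparing against length(ind) rather than a literal 4 keeps the check in step with the column list.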

Is there a way that I can get the speed of the first (wide) solution without hard coding all of the combinations of checks?

My actual table has a manageable number of rows (about 3M), but that is large enough to feel the cost of the by operation in the second solution.

1 Answer

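This assumes that id is a unique key for each row, because the variants that group by id would otherwise pool the values of several rows into one check: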

> (ind <- paste0("ind",1:4))
[1] "ind1" "ind2" "ind3" "ind4"
> fooTbl[,u := length(ind) == length(unique(unlist(.SD))),by="id", .SDcols = ind]

or

> fooTbl[,u := !any(duplicated(unlist(.SD))),by="id", .SDcols = ind]

or without by:

> fooTbl[, u := apply(.SD,1,function(x) !any(duplicated(x))), .SDcols = ind]
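
Another variant without by, sketched here under the assumption that the columns contain no NA values (a comparison with NA yields NA rather than TRUE or FALSE): generate the column pairs with combn and combine the pairwise comparisons with Reduce. This keeps the vectorized form of the hard-coded check from the question without typing out each pair. The names pairs and differs are made up for illustration.

```r
library(data.table)

fooTbl <- data.table(id = c('a', 'b'), ind1 = c(1, 2), ind2 = c(3, 4),
                     ind3 = c(2, 3), ind4 = c(2, 1))
ind <- paste0("ind", 1:4)

# All unordered pairs of column names, as a list of length-2 character vectors.
pairs <- combn(ind, 2, simplify = FALSE)

# One logical vector per pair: TRUE where the two columns differ in that row.
differs <- lapply(pairs, function(p) fooTbl[[p[1]]] != fooTbl[[p[2]]])

# A row has all-distinct values only if every pairwise comparison is TRUE.
fooTbl[, u := Reduce(`&`, differs)]
```

The number of comparisons is choose(length(ind), 2), so this suits a small set of columns; each comparison runs over whole columns at once rather than row by row.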

Each of these gives the same result:

> fooTbl
   id ind1 ind2 ind3 ind4     u
1:  a    1    3    2    2 FALSE
2:  b    2    4    3    1  TRUE
