Let's say I have a data.table with an id column and integer values in four other columns. How can I efficiently find the rows where at least two of those four values are the same?
fooTbl = data.table(id = c('a', 'b'), ind1=c(1,2), ind2=c(3,4), ind3=c(2,3), ind4=c(2,1))
fooTbl
# id ind1 ind2 ind3 ind4
# 1: a 1 3 2 2
# 2: b 2 4 3 1
I have two solutions already. The first is much faster than the second, but the first requires hard-coding all of the combinations and checking equality for each of them. This seems undesirable and difficult to maintain as the number of columns increases:
fooTbl[, uniq := (ind1 != ind2 & ind1 != ind3 & ind1 != ind4 & ind2 != ind3 & ind2 != ind4 & ind3 != ind4)]
fooTbl
# id ind1 ind2 ind3 ind4 uniq
# 1: a 1 3 2 2 FALSE
# 2: b 2 4 3 1 TRUE
The second is to melt the table and operate on its long form. This one is more maintainable (no hard-coding of all of the combinations) but is much slower:
fooTbl[, uniq := NULL]
fooTbl
# id ind1 ind2 ind3 ind4
# 1: a 1 3 2 2
# 2: b 2 4 3 1
fooTbl = melt(fooTbl, measure=c('ind1', 'ind2', 'ind3', 'ind4'))
fooTbl
# id variable value
# 1: a ind1 1
# 2: b ind1 2
# 3: a ind2 3
# 4: b ind2 4
# 5: a ind3 2
# 6: b ind3 3
# 7: a ind4 2
# 8: b ind4 1
fooTbl[, N := length(unique(value)), by=id]
fooTbl[, uniq := N == 4][, N := NULL]
fooTbl
# id variable value uniq
# 1: a ind1 1 FALSE
# 2: b ind1 2 TRUE
# 3: a ind2 3 FALSE
# 4: b ind2 4 TRUE
# 5: a ind3 2 FALSE
# 6: b ind3 3 TRUE
# 7: a ind4 2 FALSE
# 8: b ind4 1 TRUE
fooTbl = dcast(fooTbl, id + uniq ~ variable, value.var='value')
fooTbl
# id uniq ind1 ind2 ind3 ind4
# 1 a FALSE 1 3 2 2
# 2 b TRUE 2 4 3 1
Is there a way that I can get the speed of the first (wide) solution without hard coding all of the combinations of checks?
N for my actual table is manageable (~ 3M) but large enough to feel the weight of the by operation in the second solution.
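One direction that might keep the wide, column-wise comparisons while avoiding the hand-written list is to generate the column pairs with `combn` and fold the resulting logical vectors together with `Reduce`. This is only a sketch under assumptions: the names `indCols`, `colPairs`, and `neq` are illustrative, the `'^ind'` pattern assumes the value columns all share that prefix, and it has not been timed on a table of ~3M rows. The number of comparisons still grows quadratically with the number of columns; only the typing is automated.

```r
library(data.table)

fooTbl = data.table(id = c('a', 'b'), ind1 = c(1, 2), ind2 = c(3, 4),
                    ind3 = c(2, 3), ind4 = c(2, 1))

# Columns to compare (assumed to share the 'ind' prefix; adjust as needed).
indCols = grep('^ind', names(fooTbl), value = TRUE)

# Every unordered pair of column names, one pair per list element.
colPairs = combn(indCols, 2, simplify = FALSE)

# One vectorised inequality test per pair, each spanning all rows.
neq = lapply(colPairs, function(p) fooTbl[[p[1]]] != fooTbl[[p[2]]])

# A row is "unique" only if every pairwise test is TRUE.
fooTbl[, uniq := Reduce(`&`, neq)]
fooTbl
# id ind1 ind2 ind3 ind4 uniq
# 1: a 1 3 2 2 FALSE
# 2: b 2 4 3 1 TRUE
```

For the four columns above this builds the same six comparisons as the hard-coded version, so the result should match it row for row.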