if I understand correctly, duplicated()
function for data.table
returns a logical vector which doesn't contain first occurrence of duplicated record. What is the best way to mark this first occurrence as well? In case of base::duplicated()
, I solved this simply by disjunction with reversed order function: myDups <- (duplicated(x) | duplicated(x, fromLast=TRUE))
- but in data.table::duplicated()
, fromLast=TRUE
is not included (I don't know why)...
P.S. ok, here's a primitive example
myDT <- fread(
"id,fB,fC
1, b1,c1
2, b2,c2
3, b1,c1
4, b3,c3
5, b1,c1
")
setkeyv(myDT, c('fB', 'fC'))
myDT[, fD:=duplicated(myDT)]
rows 1, 3 and 5 are all duplicates but only 3 and 5 will be included in duplicated
while I need to mark all of them.
UPD. important notice: the answer I've accepted below works only for keyed table. If you want to find duplicate records considering all columns, you have to setkey
all these columns explicitly. So far I use the following workaround specifically for this case:
dups1 <- duplicated(myDT);
dups2 <- duplicated(myDT, fromLast=T);
dups <- dups1 | dups2;
See Question&Answers more detail:
os 与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…