I have a relatively large data set (1,750,000 rows, 5 columns) containing records with unique ID values (first column), each described by four criteria (the other 4 columns). A small example:
# example
library(data.table)
dt <- data.table(id=c("a1","b3","c7","d5","e3","f4","g2","h1","i9","j6"),
s1=c("a","b","c","l","l","v","v","v",NA,NA),
s2=c("d","d","e","k","k","o","o","o",NA,NA),
s3=c("f","g","f","n","n","s","r","u","w","z"),
s4=c("h","i","j","m","m","t","t","t",NA,NA))
which looks like this:
id s1 s2 s3 s4
1: a1 a d f h
2: b3 b d g i
3: c7 c e f j
4: d5 l k n m
5: e3 l k n m
6: f4 v o s t
7: g2 v o r t
8: h1 v o u t
9: i9 <NA> <NA> w <NA>
10: j6 <NA> <NA> z <NA>
My ultimate goal is to find all records that share a value in any of the description columns (disregarding NAs), and group them under a new ID, so that I can easily identify duplicated records. These new IDs are constructed by concatenating the IDs of the grouped rows.
Things get messier because records with duplicated descriptions can be linked both directly and indirectly (see below). Therefore, I am currently doing this operation in two steps.
STEP 1 - Constructing duplicated IDs based on direct duplicates
# grouping ids with duplicated info in any of the columns
#sorry, I could not find a way to search for duplicates using multiple columns simultaneously...
dt[!is.na(s1), ids1 := paste(id, collapse="|"), by = list(s1)]
dt[!is.na(s2), ids2 := paste(id, collapse="|"), by = list(s2)]
dt[!is.na(s3), ids3 := paste(id, collapse="|"), by = list(s3)]
dt[!is.na(s4), ids4 := paste(id, collapse="|"), by = list(s4)]
# getting a unique duplicated ID for each row
dt$new.id <- apply(dt[,.(ids1,ids2,ids3,ids4)], 1, paste, collapse="|")
dt$new.id <- apply(dt[,"new.id",drop=FALSE], 1, function(x) paste(unique(strsplit(x,"\|")[[1]]),collapse="|"))
This operation results in the following table, with the unique duplicated ID defined as "new.id":
id s1 s2 s3 s4 ids1 ids2 ids3 ids4 new.id
1: a1 a d f h a1 a1|b3 a1|c7 a1 a1|b3|c7
2: b3 b d g i b3 a1|b3 b3 b3 b3|a1
3: c7 c e f j c7 c7 a1|c7 c7 c7|a1
4: d5 l k n m d5|e3 d5|e3 d5|e3 d5|e3 d5|e3
5: e3 l k n m d5|e3 d5|e3 d5|e3 d5|e3 d5|e3
6: f4 v o s t f4|g2|h1 f4|g2|h1 f4 f4|g2|h1 f4|g2|h1
7: g2 v o r t f4|g2|h1 f4|g2|h1 g2 f4|g2|h1 f4|g2|h1
8: h1 v o u t f4|g2|h1 f4|g2|h1 h1 f4|g2|h1 f4|g2|h1
9: i9 <NA> <NA> w <NA> <NA> <NA> <NA> <NA> NA
10: j6 <NA> <NA> z <NA> <NA> <NA> <NA> <NA> NA
Note that records "b3" and "c7" are duplicated indirectly through "a1" (b3 shares s2 with a1, and c7 shares s3 with a1, but b3 and c7 share nothing directly; all other examples are direct duplicates that should remain the same). That is why we need the next step.
STEP 2 - Updating the duplicated IDs based on indirect duplicates
#filtering the relevant columns for the indirect search
dt = dt[,.(id,new.id)]
#creating the patterns to be used by grepl() for the look-up for each row
dt[, patt := paste0("^", id, "\\||", "\\|", id, "\\||", "\\|", id, "$")] #paste0() is vectorized, so no 'by' is needed
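#(illustration: the pattern built for id "a1" is the regex ^a1\||\|a1\||\|a1$,
# i.e. it matches "a1" at the start, middle, or end of a pipe-separated string)
#grepl("^a1\\||\\|a1\\||\\|a1$", c("a1|b3|c7", "b3|a1", "d5|e3"))
#[1]  TRUE  TRUE FALSE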
#Converting new.id to a factor and setting it as the data.table 'key' (hoping to speed up the look-up)
dt$new.id = as.factor(dt$new.id)
setkeyv(dt, c("new.id"))
#Performing the look-up loop for each row
library(stringr)
for(i in 1:nrow(dt)) {
  pat = dt$patt[i]            # retrieving the search pattern for this row
  tmp = dt[new.id %like% pat] # matching the pattern against all new.id values via grepl()
  if(nrow(tmp) > 1) {
    # keeping the largest group, i.e. the new.id with the most concatenated ids
    x = which.max(str_count(tmp$new.id, "\\|"))
    dt$new.id[i] = as.character(tmp$new.id[x])
  }
}
#filtering the final columns
dt = dt[,.(id,new.id)]
The final table looks like:
id new.id
1: a1 a1|b3|c7
2: b3 a1|b3|c7
3: c7 a1|b3|c7
4: d5 d5|e3
5: e3 d5|e3
6: f4 f4|g2|h1
7: g2 f4|g2|h1
8: h1 f4|g2|h1
9: i9 NA
10: j6 NA
Note that now the first three records ("a1","b3","c7") are grouped under a broader duplicated ID, which contains both the direct and the indirect duplicates.
Everything is working out fine, but my code is horrendously slow: it took two entire days to run half of the data set (~800,000 rows). I could parallelize the loop across different cores, but it would still take hours. And I am almost sure data.table's functionality could be used in a better way, maybe using 'set' inside the loop. I spent hours today trying to implement the same code in idiomatic data.table, but I am new to its syntax and am really having a hard time here. Any suggestions on how I could optimize this code?
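For what it is worth, this is the kind of thing I was trying with 'set' (only a sketch; it assumes new.id is converted back to a plain character column so that the assigned value matches the column type):
#same loop as above, but assigning by reference with set()
dt[, new.id := as.character(new.id)] # set() requires matching column types
for(i in seq_len(nrow(dt))) {
  tmp = dt[new.id %like% dt$patt[i]]
  if(nrow(tmp) > 1) {
    x = which.max(str_count(tmp$new.id, "\\|"))
    set(dt, i = i, j = "new.id", value = tmp$new.id[x])
  }
}
This avoids the overhead of the assignment itself, but the per-row pattern scan remains, so I doubt it is enough on its own.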
Note: The slowest part of the code is the loop, and inside the loop the most inefficient step is the grepl() of the patterns against the data.table. Setting a 'key' on the data.table seems like it should speed up the process, but it did not change the time spent on grepl() in my case.
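To confirm where the time goes, a single pattern look-up can be timed in isolation (a sketch, assuming the microbenchmark package is installed):
library(microbenchmark)
#times one grepl()-based scan, i.e. one iteration of the loop body
microbenchmark(dt[new.id %like% dt$patt[1]], times = 100)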