r - Fast way to group variables based on direct and indirect similarities in multiple columns

Question

Welcome To Ask or Share your Answers For Others

r - Fast way to group variables based on direct and indirect similarities in multiple columns

asked Oct 24, 2021 in Technique[技术] by 深蓝 (71.8m points)

r - Fast way to group variables based on direct and indirect similarities in multiple columns

I have a relatively large data set (1,750,000 lines, 5 columns) which contains records with unique ID values (first column), described by four criteria (4 other columns). A small example would be:

# example
library(data.table)
dt <- data.table(id=c("a1","b3","c7","d5","e3","f4","g2","h1","i9","j6"), 
                 s1=c("a","b","c","l","l","v","v","v",NA,NA), 
                 s2=c("d","d","e","k","k","o","o","o",NA,NA),
                 s3=c("f","g","f","n","n","s","r","u","w","z"),
                 s4=c("h","i","j","m","m","t","t","t",NA,NA))

which looks like this:

   id   s1   s2 s3   s4
 1: a1    a    d  f    h
 2: b3    b    d  g    i
 3: c7    c    e  f    j
 4: d5    l    k  n    m
 5: e3    l    k  n    m
 6: f4    v    o  s    t
 7: g2    v    o  r    t
 8: h1    v    o  u    t
 9: i9 <NA> <NA>  w <NA>
10: j6 <NA> <NA>  z <NA>

My ultimate goal is to find all records with the same character on any description columns (disregarding NAs), and group them under a new ID, so that I can easily identify duplicated records. These IDs are constructed by concatenating the IDs of each row.

Things got messier because I can find those records with duplicated descriptions directly and indirectly. Therefore, I am currently doing this operation in two steps.

STEP 1 - Constructing duplicated IDs based on direct duplicates

# grouping ids with duplicated info in any of the columns
#sorry, I could not find search for duplicates using multiple columns simultaneously...
dt[!is.na(dt$s1),ids1:= paste(id,collapse="|"), by = list(s1)]
dt[!is.na(dt$s1),ids2:= paste(id,collapse="|"), by = list(s2)]
dt[!is.na(dt$s1),ids3:= paste(id,collapse="|"), by = list(s3)]
dt[!is.na(dt$s1),ids4:= paste(id,collapse="|"), by = list(s4)]

# getting a unique duplicated ID for each row
dt$new.id <- apply(dt[,.(ids1,ids2,ids3,ids4)], 1, paste, collapse="|")
dt$new.id <- apply(dt[,"new.id",drop=FALSE], 1, function(x) paste(unique(strsplit(x,"\|")[[1]]),collapse="|"))

This operation results in the following, with the unique duplicated ID define as "new.id":

   id   s1   s2 s3   s4     ids1     ids2  ids3     ids4   new.id
 1: a1    a    d  f    h       a1    a1|b3 a1|c7       a1 a1|b3|c7
 2: b3    b    d  g    i       b3    a1|b3    b3       b3    b3|a1
 3: c7    c    e  f    j       c7       c7 a1|c7       c7    c7|a1
 4: d5    l    k  n    m    d5|e3    d5|e3 d5|e3    d5|e3    d5|e3
 5: e3    l    k  n    m    d5|e3    d5|e3 d5|e3    d5|e3    d5|e3
 6: f4    v    o  s    t f4|g2|h1 f4|g2|h1    f4 f4|g2|h1 f4|g2|h1
 7: g2    v    o  r    t f4|g2|h1 f4|g2|h1    g2 f4|g2|h1 f4|g2|h1
 8: h1    v    o  u    t f4|g2|h1 f4|g2|h1    h1 f4|g2|h1 f4|g2|h1
 9: i9 <NA> <NA>  w <NA>     <NA>     <NA>  <NA>     <NA>       NA
10: j6 <NA> <NA>  z <NA>     <NA>     <NA>  <NA>     <NA>       NA

Note that records "b3" and "c7" are duplicated indirectly through "a1" (all other examples are direct duplicates that should remain the same). That is why we need the next step.

STEP 2 - Updating the duplicated IDs based on indirect duplicates

#filtering the relevant columns for the indirect search
dt = dt[,.(id,new.id)]

#creating the patterns to be used by grepl() for the look-up for each row
dt[,patt:= .(paste(paste("^",id,"\||",sep=""),paste("\|",id,"\||",sep=""),paste("\|",id,"$",sep=""),collapse = "" ,sep="")), by = list(id)]

#Transforming the ID vector into factor and setting it as a 'key' to the data.table (speed up the processing)
dt$new.id = as.factor(dt$new.id)
setkeyv(dt, c("new.id"))

#Performing the loop using sapply
library(stringr)
for(i in 1:nrow(dt)) {
  pat = dt$patt[i] # retrieving the research pattern
  tmp = dt[new.id %like% pat] # searching the pattern using grepl()
  if(dim(tmp)[1]>1) {
    x = which.max(str_count(tmp$new.id, "\|"))
    dt$new.id[i] = as.character(tmp$new.id[x])
  }
}

#filtering the final columns 
dt = dt[,.(id,new.id)]

The final table looks like:

   id   new.id
 1: a1 a1|b3|c7
 2: b3 a1|b3|c7
 3: c7 a1|b3|c7
 4: d5    d5|e3
 5: e3    d5|e3
 6: f4 f4|g2|h1
 7: g2 f4|g2|h1
 8: h1 f4|g2|h1
 9: i9       NA
10: j6       NA

Note that now the first three records ("a1","b3","c7") are grouped under a broader duplicated ID, which contains both direct and indirect records.

Everything is working out fine, but my code is horrendously slow. It took 2 entire days to run half of the data set (~800,0000). I could parallelize the loop into different cores, but it would still take hours. And I am almost sure that I could use data.table functionalities in a better way, maybe using using 'set' inside the loop. I spent hours today trying to implement the same codes using data.table, but I am new to its syntax and I am really having a hard time here. Any suggestions on how I could optimize this code?

Note: The slowest part of the code is the loop and inside the loop the most inefficient step is the grepl() of the patterns inside the data.table. It seems that setting a 'key' to the data.table can speed up the process, but I did not changed the time it took to do the grepl() in my case.

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙；凝视深渊过久,深渊将回以凝视…

1 Answer

深蓝 · Answer 1 · 2021-10-23T17:52:27+0000

You may approach this as a network problem. Here I use functions from the igraph package. The basic steps:

meltthe data to long format.
Use graph_from_data_frame to create a graph, where 'id' and 'value' columns are treated as an edge list.
Use components to get connected components of the graph, i.e. which 'id' are connected via their criteria, directly or indirectly.
Select the membership element to get "the cluster id to which each vertex belongs".
Join membership to original data.
Concatenate 'id' grouped by cluster membership.

library(igraph)

# melt data to long format, remove NA values
d <- melt(dt, id.vars = "id", na.rm = TRUE)

# convert to graph
g <- graph_from_data_frame(d[ , .(id, value)])

# get components and their named membership id 
mem <- components(g)$membership

# add membership id to original data
dt[.(names(mem)), on = .(id), mem := mem] 

# for groups of length one, set 'mem' to NA
dt[dt[, .I[.N == 1], by = mem]$V1, mem := NA]

If desired, concatenate 'id' by 'mem' column (for non-NA 'mem') (IMHO this just makes further data manipulation more difficult ;) ). Anyway, here we go:

dt[!is.na(mem), id2 := paste(id, collapse = "|"), by = mem]

#     id   s1   s2 s3   s4  mem      id2
#  1: a1    a    d  f    h    1 a1|b3|c7
#  2: b3    b    d  g    i    1 a1|b3|c7
#  3: c7    c    e  f    j    1 a1|b3|c7
#  4: d5    l    k  l    m    2    d5|e3
#  5: e3    l    k  l    m    2    d5|e3
#  6: f4    o    o  s    o    3 f4|g2|h1
#  7: g2    o    o  r    o    3 f4|g2|h1
#  8: h1    o    o  u    o    3 f4|g2|h1
#  9: i9 <NA> <NA>  w <NA>   NA     <NA>
# 10: j6 <NA> <NA>  z <NA>   NA     <NA>

A basic plot of the graph in this small example, just to illustrate the connected components:

plot(g, edge.arrow.size = 0.5, edge.arrow.width = 0.8, vertex.label.cex = 2, edge.curved = FALSE)

Categories

r - Fast way to group variables based on direct and indirect similarities in multiple columns

r - Fast way to group variables based on direct and indirect similarities in multiple columns

STEP 1 - Constructing duplicated IDs based on direct duplicates

STEP 2 - Updating the duplicated IDs based on indirect duplicates

Please log in or register to add a comment.

Please log in or register to answer this question.

1 Answer

Please log in or register to add a comment.

Just Browsing Browsing

Most popular tags