I've tried to code an answer to this based on similar questions, however I'm struggling to get any answers working exactly for my use case.
I have 2 genetic datasets (looking at positions of mutations in the genome).
These look like:
#df1:
Chromosome Min Max Group
1 500 1000 1
1 400 1900 2
2 300 400 3
3 2000 2100 4
#df2
Gene Chromosome Position
Gene1 1 600
Gene1 1 650
Gene2 1 1700
Gene3 2 350
Gene4 2 355
Gene5 2 450
Gene6 3 2050
I am looking to find which rows in df2
belong to which Group
in df1
- so I am coding to ask if a row in df2
has a matching Chromosome
number AND the Position
column value is inside the range between the Max
and Min
columns of df1
, then if it is assign that row the same Group
number as is in df1
. I want this to also duplicate rows in df2
if they appear in multiple Groups.
Expected output of the example is:
Gene Chromosome Position Group
Gene1 1 600 1
Gene1 1 650 1
Gene1 1 600 2
Gene1 1 650 2
Gene2 1 1700 2
Gene3 2 350 3
Gene4 2 355 3
Gene5 2 450 NA
Gene6 3 2050 4
#Gene1 enters both groups 1 and 2 as their Chromosome and Position fits in both those groups of df1
They key thing I'm trying to get that I am missing is that some groups in df1
have overlap in their min-max range, but I still want each individual group to be kept and just Genes/rows of df2
to be duplicated as their position might match mulitple groups with overlap.
Currently I am trying to code this with:
df1$ID <- seq.int(nrow(df1))
df2$ID<- seq.int(nrow(df2))
df2[df1, Group := i.ID, on = .(Chromosome, Position > Min, Position < Max ) ]
For my example this outputs:
Gene Chromosome Position ID Group
Gene1 1 600 1 2
Gene1 1 650 2 2
Gene2 1 1700 3 2
Gene3 2 350 4 3
Gene4 2 355 5 3
Gene5 2 450 6 NA
Gene6 3 2050 7 4
So in this case because Groups 1 and 2 have an overlap in their min-max range, Group 1 has been lost from the output result. Is there another way I can code to avoid this and match rows but maintain all Groups
despite any range overlaps?
I have tried other ways to code this with a similar question (How to perform join over date ranges using data.table?) using foverlaps()
but this also does not work as expected.
Input data:
df1 <-
structure(list(Chromosome = c(1L, 1L, 2L, 3L), Min = c(500L,
400L, 300L, 2000L), Max = c(1000L, 1900L, 400L, 2100L), Group = 1:4,
ID = 1:4), row.names = c(NA, -4L), class = c("data.table",
"data.frame"))
df2 <-
structure(list(Gene = c("Gene1", "Gene1", "Gene2", "Gene3", "Gene4",
"Gene5", "Gene6"), Chromosome = c(1L, 1L, 1L, 2L, 2L, 2L, 3L),
Position = c(600L, 650L, 1700L, 350L, 355L, 450L, 2050L)), row.names = c(NA,
-7L), class = c("data.table", "data.frame"))
question from:
https://stackoverflow.com/questions/65921081/how-to-group-rows-based-on-conditions-set-by-another-dataset