Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
233 views
in Technique[技术] by (71.8m points)

string - Deleting reversed duplicates with R

I have a data frame in R that contains the gene ids of paralogous genes in Arabidopsis, looking something like this:

gene_x    gene_y
AT1       AT2
AT3       AT4
AT1       AT2
AT1       AT3
AT2       AT1

with the 'ATx' corresponding to the gene names.

Now, for downstream analysis, I would want to continue only with the unique pairs. Some pairs are just simple duplicates and can be removed easily upon using the duplicated() function. However, the fifth row in the artificial data frame above is also a duplicate, but in reversed order, and which will not be picked up by the duplicated(), nor by the unique() function.

Any ideas in how to remove these rows?

Question&Answers:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Answer

0 votes
by (71.8m points)
mydf <- read.table(text="gene_x    gene_y
AT1       AT2
AT3       AT4
AT1       AT2
AT1       AT3
AT2       AT1", header=TRUE, stringsAsFactors=FALSE)

Here's one strategy using apply, sort, paste, and duplicated:

mydf[!duplicated(apply(mydf,1,function(x) paste(sort(x),collapse=''))),]
  gene_x gene_y
1    AT1    AT2
2    AT3    AT4
4    AT1    AT3

And here's a slightly different solution:

mydf[!duplicated(lapply(as.data.frame(t(mydf), stringsAsFactors=FALSE), sort)),]
  gene_x gene_y
1    AT1    AT2
2    AT3    AT4
4    AT1    AT3

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...