
r - How to use data.table to efficiently calculate allele frequencies (proportions) by group across multiple columns (loci)

I have a data.table of allele identities (rows are individuals, columns are loci), grouped by a separate column. I want to calculate allele frequencies (proportions) for each locus efficiently, by group. An example data table:

    library(data.table)
    DT <- data.table(Loc1 = rep(c("G","T"), each = 5),
                     Loc2 = c("C","A"),   # recycled to 10 rows
                     Loc3 = c("C","G","G","G","C","G","G","G","G","G"),
                     Group = c(rep("G1",3), rep("G2",4), rep("G3",3)))
    # blank out two random cells per locus; the positions are random,
    # so the printout below is one possible realization
    for (i in 1:3)
        set(DT, sample(10, 2), i, NA)
    > DT
        Loc1 Loc2 Loc3 Group
     1:    G   NA    C    G1
     2:    G    A    G    G1
     3:    G    C    G    G1
     4:   NA   NA   NA    G2
     5:    G    C   NA    G2
     6:    T    A    G    G2
     7:    T    C    G    G2
     8:    T    A    G    G3
     9:    T    C    G    G3
    10:   NA    A    G    G3

The problem I have is that when I do calculations by group, only the alleles actually present in that group are recognized, so I'm struggling to write code that can tell me, e.g., the proportion of G's at locus 1 in all three groups. A simple example, calculating a count (not yet a proportion) for the first allele at each locus:

    > fun1 <- function(x) sum(na.omit(x == unique(na.omit(x))[1]))
    > DT[, lapply(.SD, fun1), by = Group, .SDcols = 1:3]
       Group Loc1 Loc2 Loc3
    1:    G1    3    1    1
    2:    G2    1    2    2
    3:    G3    2    2    3

For G1 the result for Loc1 is 3 G's, but for G3 it is 2 T's, because T happens to be the first allele encountered in that group; I want the count of G's in both cases. So the key problem is that the allele identities are determined within each group rather than over the whole column. I tried building a separate table of the allele identities I want to use in calculations, but can't figure out how to pass it into fun1 so that the right reference allele is used for each column in the lapply above (one possible Map-based fix is sketched after the table). Allele identities table:

    > fun2 <- function(x) sort(na.omit(unique(x)))
    > allele.id <- DT[, lapply(.SD, fun2), .SDcols = 1:3]
    > allele.id
       Loc1 Loc2 Loc3
    1:    G    A    C
    2:    T    C    G
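
A sketch of the kind of call I think I need, pairing each column of .SD with its reference allele from the first row of allele.id via Map (I'm not sure this is the idiomatic way):

    # count the globally determined first allele of each locus, per group
    ref <- as.list(allele.id[1])   # first row: list(Loc1 = "G", Loc2 = "A", Loc3 = "C")
    DT[, Map(function(x, a) sum(x == a, na.rm = TRUE), .SD, ref),
       by = Group, .SDcols = 1:3]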

1 Answer


It's probably wise to transform your data.table into long format first; that will make further calculations (and visualisations with ggplot2, for example) much easier. With the melt function of data.table (which works the same as the melt function of the reshape2 package) you can transform from wide to long format:

DT2 <- melt(DT, id = "Group", variable.name = "loci")

If you want to remove the NA values during the melt operation, you can add na.rm = TRUE to the above call (na.rm = FALSE is the default).
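
For example (DT2_nona is just an illustrative name for the NA-free variant):

DT2_nona <- melt(DT, id = "Group", variable.name = "loci", na.rm = TRUE)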

Then you can make count and proportion variables as follows:

DT2 <- DT2[, .N, by = .(Group, loci, value)][, prop := N/sum(N), by = .(Group, loci)]

which gives the following result:

> DT2
    Group loci value N      prop
 1:    G1 Loc1     G 3 1.0000000
 2:    G2 Loc1    NA 1 0.2500000
 3:    G2 Loc1     G 1 0.2500000
 4:    G2 Loc1     T 2 0.5000000
 5:    G3 Loc1     T 2 0.6666667
 6:    G3 Loc1    NA 1 0.3333333
 7:    G1 Loc2    NA 1 0.3333333
 8:    G1 Loc2     A 1 0.3333333
 9:    G1 Loc2     C 1 0.3333333
10:    G2 Loc2    NA 1 0.2500000
11:    G2 Loc2     C 2 0.5000000
12:    G2 Loc2     A 1 0.2500000
13:    G3 Loc2     A 2 0.6666667
14:    G3 Loc2     C 1 0.3333333
15:    G1 Loc3     C 1 0.3333333
16:    G1 Loc3     G 2 0.6666667
17:    G2 Loc3    NA 2 0.5000000
18:    G2 Loc3     G 2 0.5000000
19:    G3 Loc3     G 3 1.0000000

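This long table also answers the original question directly: the proportion of G alleles at Loc1 in each group is just a filter (a group with no G's at that locus, like G3 here, simply returns no row):

DT2[loci == "Loc1" & value == "G", .(Group, prop)]
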
If you want it back in wide format, you can use dcast on multiple value variables:

DT3 <- dcast(DT2, Group + loci ~ value, value.var = c("N", "prop"), fill = 0)

which results in:

> DT3
   Group loci N_A N_C N_G N_T N_NA    prop_A    prop_C    prop_G    prop_T   prop_NA
1:    G1 Loc1   0   0   3   0    0 0.0000000 0.0000000 1.0000000 0.0000000 0.0000000
2:    G1 Loc2   1   1   0   0    1 0.3333333 0.3333333 0.0000000 0.0000000 0.3333333
3:    G1 Loc3   0   1   2   0    0 0.0000000 0.3333333 0.6666667 0.0000000 0.0000000
4:    G2 Loc1   0   0   1   2    1 0.0000000 0.0000000 0.2500000 0.5000000 0.2500000
5:    G2 Loc2   1   2   0   0    1 0.2500000 0.5000000 0.0000000 0.0000000 0.2500000
6:    G2 Loc3   0   0   2   0    2 0.0000000 0.0000000 0.5000000 0.0000000 0.5000000
7:    G3 Loc1   0   0   0   2    1 0.0000000 0.0000000 0.0000000 0.6666667 0.3333333
8:    G3 Loc2   2   1   0   0    0 0.6666667 0.3333333 0.0000000 0.0000000 0.0000000
9:    G3 Loc3   0   0   3   0    0 0.0000000 0.0000000 1.0000000 0.0000000 0.0000000
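
As an aside, allele frequencies are often computed over non-missing calls only. If that's what you need, one possible sketch (DT2b is just an illustrative name) is to melt with na.rm = TRUE before counting, so that NAs never enter the denominator:

DT2b <- melt(DT, id = "Group", variable.name = "loci", na.rm = TRUE)[
  , .N, by = .(Group, loci, value)][
  , prop := N / sum(N), by = .(Group, loci)]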

Another, more straightforward approach is to use melt and dcast in one call (a simplified version of the first part of @Frank's answer):

DT2 <- dcast(melt(DT, id="Group"), Group + variable ~ value)

which gives:

> DT2
   Group variable A C G T NA
1:    G1     Loc1 0 0 3 0  0
2:    G1     Loc2 1 1 0 0  1
3:    G1     Loc3 0 1 2 0  0
4:    G2     Loc1 0 0 1 2  1
5:    G2     Loc2 1 2 0 0  1
6:    G2     Loc3 0 0 2 0  2
7:    G3     Loc1 0 0 0 2  1
8:    G3     Loc2 2 1 0 0  0
9:    G3     Loc3 0 0 3 0  0

Because dcast falls back to length as the aggregation function when none is supplied, you automatically get the counts for each of the values.
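
If you also want proportions in this wide layout, one possible follow-up (a sketch; cols is just a helper name) is to divide each count column by its row total:

# turn the wide counts into row-wise proportions
cols <- setdiff(names(DT2), c("Group", "variable"))
DT2[, (cols) := {
  tot <- Reduce(`+`, .SD)   # total alleles scored in each Group x locus row
  lapply(.SD, `/`, tot)     # each count divided by that total
}, .SDcols = cols]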


Used data:

DT <- structure(list(Loc1 = c("G", "G", "G", NA, "G", "T", "T", "T", "T", NA), 
                     Loc2 = c(NA, "A", "C", NA, "C", "A", "C", "A", "C", "A"), 
                     Loc3 = c("C", "G", "G", NA, NA, "G", "G", "G", "G", "G"), 
                     Group = c("G1", "G1", "G1", "G2", "G2", "G2", "G2", "G3", "G3", "G3")), 
                .Names = c("Loc1", "Loc2", "Loc3", "Group"), row.names = c(NA, -10L), class = c("data.table", "data.frame"))
