I am working with a data frame 'df' that has millions of rows and four columns (Chromosome, Position, Allele1, Allele2). I want to concatenate the values in these columns into a single character vector 'cc'. This is my first attempt:
myfunc = function(CHR) {
  # keep only the rows for the chromosome of interest
  chr = subset(df, Chromosome == CHR)
  cc = data.frame(No = seq_len(nrow(chr)), pos_al1_al2 = NA)
  for (i in seq_len(nrow(chr))) {
    cc$pos_al1_al2[i] = paste(CHR, chr$Position[i], ".", chr$Allele1[i], chr$Allele2[i])
  }
  cc$pos_al1_al2 # drop the column 'No' and return the character vector
}
# Run my code
myfunc(7)
where CHR is the number of the chromosome of interest that I pass to the function (e.g., 1, 2, 3, ..., or 22). Of course, CHR must be between 1 and 22, matching the values in the Chromosome column of 'df'.
My idea is this: I first create a data frame called cc with the same number of rows as the subset of 'df' for the chosen chromosome. I then fill a column of cc called pos_al1_al2, where each row holds the concatenated string built by the paste() call in the function.
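To make the intended result concrete, here is a minimal sketch on a tiny made-up data frame. The values below are invented purely for illustration and are not my real data; only the column names match 'df'. It shows the string that one iteration of the loop builds for a single row.

```r
# Toy data, invented for illustration only (the real 'df' has millions of rows)
df <- data.frame(
  Chromosome = c(7, 7, 8),
  Position   = c(101, 202, 303),
  Allele1    = c("A", "C", "G"),
  Allele2    = c("G", "T", "A"),
  stringsAsFactors = FALSE
)

# Rows belonging to chromosome 7
chr <- df[df$Chromosome == 7, ]

# The string built for the first row; paste() joins its arguments
# with a single space by default
first <- paste(7, chr$Position[1], ".", chr$Allele1[1], chr$Allele2[1])
print(first)
# [1] "7 101 . A G"
```

So for each row of the chosen chromosome I want one string of this form, collected into one vector.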
The computation is very slow. I guess this comes from the for loop, but I have no idea how to optimize my function.
Any help is appreciated! Thanks in advance.