Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

r - Is there a way to 'merge' two columns, where the values of the new column are the name of the original column that had a specific value, group-wise?

I have a data frame (I'll call it 'df') with a fair number of variables (numeric, logical and character) representing an experiment in which different cell types were moved from one medium to another, and the activity of the cells was quantified at specific times.

The first and second columns hold the name of the 'source' medium and the name of the medium the cells were moved to, respectively; the third column is the time at which the activity was quantified, the fourth is the cell type, and the fifth is the measured activity. This is where it gets funny.

I have two main questions. The first is whether there is an 'R-esque' way to do what I did to obtain the sixth column, which contains the increase/decrease (in percent) of the value in 'Activity' relative to the value in the previous row, computed per group (each group is a combination of Cell.Type and Pre.Medium, with rows ordered by Time). That is why its value is NA every time Time is zero.

Assuming this is my data frame (I've simplified it to make my question clearer):

df <- structure(list(Pre.Medium = c("Medium1", "Medium1", "Medium1", 
"Medium2", "Medium2", "Medium2", "Medium1", "Medium1", "Medium1", 
"Medium2", "Medium2", "Medium2"), Pos.Medium = c("Medium2", "Medium2", 
"Medium2", "Medium1", "Medium1", "Medium1", "Medium2", "Medium2", 
"Medium2", "Medium1", "Medium1", "Medium1"), Time = c(0, 2, 4, 
0, 2, 4, 0, 2, 4, 0, 2, 4), Cell.Type = c("Cell_A", "Cell_A", 
"Cell_A", "Cell_A", "Cell_A", "Cell_A", "Cell_B", "Cell_B", "Cell_B", 
"Cell_B", "Cell_B", "Cell_B"), Activity = c(0.5, 1, 2, 2, 1, 
0.5, 0.2, 0.8, 0.2, 0.2, 0.2, 0.4), Percent.Increase = c(NA, 
100, 100, NA, -50, -50, NA, 300, -75, NA, 0, 100), Primary.Increase = c(NA, 
TRUE, FALSE, NA, TRUE, FALSE, NA, TRUE, FALSE, NA, FALSE, FALSE
), Secondary.Increase = c(NA, FALSE, FALSE, NA, FALSE, FALSE, 
NA, FALSE, FALSE, NA, FALSE, TRUE)), class = c("spec_tbl_df", 
"tbl_df", "tbl", "data.frame"), row.names = c(NA, -12L), problems = structure(list(
    row = 1L, col = NA_character_, expected = "8 columns", actual = "9 columns", 
    file = "'new 2'"), row.names = c(NA, -1L), class = c("tbl_df", 
"tbl", "data.frame")), spec = structure(list(cols = list(Pre.Medium = structure(list(), class = c("collector_character", 
"collector")), Pos.Medium = structure(list(), class = c("collector_character", 
"collector")), Time = structure(list(), class = c("collector_double", 
"collector")), Cell.Type = structure(list(), class = c("collector_character", 
"collector")), Activity = structure(list(), class = c("collector_double", 
"collector")), Percent.Increase = structure(list(), class = c("collector_double", 
"collector")), Primary.Increase = structure(list(), class = c("collector_logical", 
"collector")), Secondary.Increase = structure(list(), class = c("collector_logical", 
"collector"))), default = structure(list(), class = c("collector_guess", 
"collector")), skip = 1), class = "col_spec"))
### Pre.Med Pos.Med Time  Cell.Type Activity  Percent.Increase  Primary.Increase Secondary.Increase
### Medium1 Medium2   0    Cell_A    0.5           NA           NA                NA 
### Medium1 Medium2   2    Cell_A    1             100          TRUE              FALSE
### Medium1 Medium2   4    Cell_A    2             100          FALSE             FALSE
### Medium2 Medium1   0    Cell_A    2             NA           NA                NA
### Medium2 Medium1   2    Cell_A    1            -50           TRUE              FALSE
### Medium2 Medium1   4    Cell_A    0.5          -50           FALSE             FALSE
### Medium1 Medium2   0    Cell_B    0.2           NA           NA                NA
### Medium1 Medium2   2    Cell_B    0.8           300          TRUE              FALSE
### Medium1 Medium2   4    Cell_B    0.2          -75           FALSE             FALSE
### Medium2 Medium1   0    Cell_B    0.2           NA           NA                NA
### Medium2 Medium1   2    Cell_B    0.2           0            FALSE             FALSE
### Medium2 Medium1   4    Cell_B    0.4           100          FALSE             TRUE

I did it using the group_by and mutate functions, with lag to calculate the increase/decrease relative to the previous row and the row before that. Is there a better way to do this?

In my specific case lag was enough, but what if I had more than three time measurements in each 'group' and needed to reach much further back to calculate it? With my approach, at some point I would have had to use something like lag(lag(lag(lag(lag((Activity / lag(Activity)) - 1) * 100)))) and so on.

The other thing is something I have not been able to figure out at all: turning my 'wide' dataset into a long one by collapsing the columns 'Primary.Increase' and 'Secondary.Increase' into a single column named 'Increase.Type'. For each group (combination of Cell.Type and Pre.Medium), its value should be the name of the column (either Primary.Increase or Secondary.Increase) in which at least one member of the group was TRUE.

It should look something like this:

df <- structure(list(Pre.Med = c("Medium1", "Medium1", "Medium1", "Medium2", 
"Medium2", "Medium2", "Medium1", "Medium1", "Medium1", "Medium2", 
"Medium2", "Medium2"), Pos.Med = c("Medium2", "Medium2", "Medium2", 
"Medium1", "Medium1", "Medium1", "Medium2", "Medium2", "Medium2", 
"Medium1", "Medium1", "Medium1"), Time = c(0, 2, 4, 0, 2, 4, 
0, 2, 4, 0, 2, 4), Cell.Type = c("Cell_A", "Cell_A", "Cell_A", 
"Cell_A", "Cell_A", "Cell_A", "Cell_B", "Cell_B", "Cell_B", "Cell_B", 
"Cell_B", "Cell_B"), Activity = c(0.5, 1, 2, 2, 1, 0.5, 0.2, 
0.8, 0.2, 0.2, 0.2, 0.4), Percent.Inc = c(NA, 100, 100, NA, -50, 
-50, NA, 300, -75, NA, 0, 100), Increase.Type = c("Primary.Increase", 
"Primary.Increase", "Primary.Increase", "Primary.Increase", "Primary.Increase", 
"Primary.Increase", "Primary.Increase", "Primary.Increase", "Primary.Increase", 
"Secondary.Increase", "Secondary.Increase", "Secondary.Increase"
)), class = c("spec_tbl_df", "tbl_df", "tbl", "data.frame"), row.names = c(NA, 
-12L), spec = structure(list(cols = list(Pre.Med = structure(list(), class = c("collector_character", 
"collector")), Pos.Med = structure(list(), class = c("collector_character", 
"collector")), Time = structure(list(), class = c("collector_double", 
"collector")), Cell.Type = structure(list(), class = c("collector_character", 
"collector")), Activity = structure(list(), class = c("collector_double", 
"collector")), Percent.Inc = structure(list(), class = c("collector_double", 
"collector")), Increase.Type = structure(list(), class = c("collector_character", 
"collector"))), default = structure(list(), class = c("collector_guess", 
"collector")), skip = 1), class = "col_spec"))
### Pre.Med Pos.Med Time  Cell.Type Activity    Percent.Inc Increase.Type 
### Medium1 Medium2   0    Cell_A    0.5           NA         Primary.Increase
### Medium1 Medium2   2    Cell_A    1             100        Primary.Increase
### Medium1 Medium2   4    Cell_A    2             100        Primary.Increase
### Medium2 Medium1   0    Cell_A    2             NA         Primary.Increase
### Medium2 Medium1   2    Cell_A    1            -50         Primary.Increase
### Medium2 Medium1   4    Cell_A    0.5          -50         Primary.Increase
### Medium1 Medium2   0    Cell_B    0.2           NA         Primary.Increase
### Medium1 Medium2   2    Cell_B    0.8           300        Primary.Increase
### Medium1 Medium2   4    Cell_B    0.2          -75         Primary.Increase
### Medium2 Medium1   0    Cell_B    0.2           NA         Secondary.Increase
### Medium2 Medium1   2    Cell_B    0.2           0          Secondary.Increase     
### Medium2 Medium1   4    Cell_B    0.4           100        Secondary.Increase             

Is there a way to do this in the first place? I'd assume so, but so far I haven't been able to do it :/ I'm a biology undergraduate and relatively new to R. I love what you can do with it, but I'm still a long way from being good at it.

Any help is greatly appreciated.

Asked by John Sandman (translated from Stack Overflow)

1 Answer

I'm not sure I understand the first question. If you do something like:

library(dplyr)

df %>%
  group_by(Cell.Type, Pre.Medium, Pos.Medium) %>%
  arrange(Time, .by_group = TRUE) %>% # remove if Time is always ascending
  mutate(Percent.Increase = ((Activity / lag(Activity)) - 1) * 100)

the calculation of Percent.Increase is vectorized, so it doesn't matter how many rows Activity has in each group (see also my last explanation below).
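To illustrate why nested lag() calls are never needed, here is a minimal sketch on made-up data with five time points per group. The `toy` tibble and the column names Percent.Increase.2 and Percent.From.Start are assumptions for illustration only and are not part of the original question. It relies on the `n` argument of dplyr's lag() to look further back, and on first() to compare against the start of each group.

```r
library(dplyr)

# Hypothetical data: two groups with five time points each.
toy <- tibble(
  Cell.Type = rep(c("Cell_A", "Cell_B"), each = 5),
  Time      = rep(c(0, 2, 4, 6, 8), times = 2),
  Activity  = c(1, 2, 4, 2, 1, 10, 5, 5, 10, 20)
)

res <- toy %>%
  group_by(Cell.Type) %>%
  arrange(Time, .by_group = TRUE) %>%
  mutate(
    # Change relative to the previous row; a single lag() covers every row.
    Percent.Increase = (Activity / lag(Activity) - 1) * 100,
    # Change relative to the row two steps back, using the n argument.
    Percent.Increase.2 = (Activity / lag(Activity, n = 2) - 1) * 100,
    # Change relative to the first time point of the group.
    Percent.From.Start = (Activity / first(Activity) - 1) * 100
  ) %>%
  ungroup()

print(res)
```

Because lag() shifts the whole group vector at once, the first row of each group (and the first two rows for n = 2) comes out as NA, whatever the number of time points.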

For the second question, if I understand correctly, you can do it like this:

df %>%
  group_by(Cell.Type, Pre.Medium, Pos.Medium) %>%
  mutate(Increase.Type = if (any(Secondary.Increase, na.rm = TRUE)) "Secondary.Increase" else "Primary.Increase") %>%
  select(-(Primary.Increase:Secondary.Increase))
# A tibble: 12 x 7
# Groups:   Cell.Type, Pre.Medium, Pos.Medium [4]
   Pre.Medium Pos.Medium  Time Cell.Type Activity Percent.Increase Increase.Type     
   <chr>      <chr>      <dbl> <chr>        <dbl>            <dbl> <chr>             
 1 Medium1    Medium2        0 Cell_A         0.5               NA Primary.Increase  
 2 Medium1    Medium2        2 Cell_A         1                100 Primary.Increase  
 3 Medium1    Medium2        4 Cell_A         2                100 Primary.Increase  
 4 Medium2    Medium1        0 Cell_A         2                 NA Primary.Increase  
 5 Medium2    Medium1        2 Cell_A         1                -50 Primary.Increase  
 6 Medium2    Medium1        4 Cell_A         0.5              -50 Primary.Increase  
 7 Medium1    Medium2        0 Cell_B         0.2               NA Primary.Increase  
 8 Medium1    Medium2        2 Cell_B         0.8              300 Primary.Increase  
 9 Medium1    Medium2        4 Cell_B         0.2              -75 Primary.Increase  
10 Medium2    Medium1        0 Cell_B         0.2               NA Secondary.Increase
11 Medium2    Medium1        2 Cell_B         0.2                0 Secondary.Increase
12 Medium2    Medium1        4 Cell_B         0.4              100 Secondary.Increase

The expression inside mutate sees all the values of the group, so any(Secondary.Increase, na.rm = TRUE) receives all of the group's elements at once, and if we return only a single value, it is recycled to fit the group size.
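The recycling behaviour can be seen on a small made-up example. The `toy` tibble, the `Group` column and the helper column Any.Secondary are assumptions for illustration. This sketch is also a variation on the approach above rather than the same code: it checks Primary.Increase first and returns NA when neither column has a TRUE in the group, whereas the original falls back to "Primary.Increase".

```r
library(dplyr)

# Hypothetical data: two groups, each with two logical flag columns.
toy <- tibble(
  Group              = c("g1", "g1", "g1", "g2", "g2", "g2"),
  Primary.Increase   = c(NA, TRUE, FALSE, NA, FALSE, FALSE),
  Secondary.Increase = c(NA, FALSE, FALSE, NA, FALSE, TRUE)
)

res <- toy %>%
  group_by(Group) %>%
  mutate(
    # any() collapses the group's flags to one TRUE/FALSE, recycled per row.
    Any.Secondary = any(Secondary.Increase, na.rm = TRUE),
    # case_when() receives scalar conditions and yields one label per group,
    # which mutate() then recycles to the group size.
    Increase.Type = case_when(
      any(Primary.Increase, na.rm = TRUE)   ~ "Primary.Increase",
      any(Secondary.Increase, na.rm = TRUE) ~ "Secondary.Increase",
      TRUE                                  ~ NA_character_
    )
  ) %>%
  ungroup()

print(res)
```

On the data posted in the question no group has TRUE values in both columns, so this variant and the if/else version should give the same labels there.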

