Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
106 views
in Technique[技术] by (71.8m points)

r - Add new rows from categorical values

This is my dataframe

year   state    sex   age   population  deaths
-----------------------------------------------
1970   Cal      Male    12     134         45
1970   Cal      Female  12     100         20
1980   Pen      Male    13     200         10
1980   Pen      Female  13     150         50

What I want to do is add the number of deaths and population for each year, and state based on the column sex. And create a new row called Total, with the sum of these values.

I would like to have this:

For each year, example (1970,1980,1990...2050) I have a column called age from 0 to 100. With the number of deaths and population for Male and Female. I want to add the number of population and death for each year,state and age. (keeping the value of age)

I mean I want a third row with the add of Femlaes and Males (Total) and keep the age and year.

    year   state    sex   age   population     deaths
-----------------------------------------------
1970       Cal      Male       12     134         45
1970       Cal      Female     12     100         20
1970       Cal      Total      12     234         65
1980       Pen      Male       13     200         10
1980       Pen      Female     13     150         50
1980       Pen      Total      13     350         60

I tried with this line

df_1 <- setDT(df)[,rbind(.SD,c(.(sex = "Total"),colSums(.SD[,-1]))),state,.SDcols = c("sex",    "population", "deaths", "year", "age")]

But is not giving me the right result. How can I do this?

question from:https://stackoverflow.com/questions/66051850/add-new-rows-from-categorical-values

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Answer

0 votes
by (71.8m points)

1. adorn_totals

library(purrr)
library(janitor)
library(tidyr)
library(dplyr)


df %>% 
  dplyr::select(sex, everything()) %>%
  dplyr::group_split(year, state, age) %>% 
  purrr::map_dfr(janitor::adorn_totals) %>%
  dplyr::mutate(across(c(year, state, age), ~ ifelse(sex == "Total", NA, .))) %>% 
  tidyr::fill(everything(), .direction = "down") %>% 
  dplyr::select(year, state, everything())

FYI, across is from the dplyr library.

Output

year state    sex age population deaths
 1970   Cal   Male  12        134     45
 1970   Cal Female  12        100     20
 1970   Cal  Total  12        234     65
 1980   Pen   Male  13        200     10
 1980   Pen Female  13        150     50
 1980   Pen  Total  13        350     60

How it works

  1. The first select statement is to accommodate the behavior of janitor::adorn_totals, which puts the "Total" label in the first column. So this puts sex as the first column.
  2. group_split splits the dataframe into a list of dataframes by same year, state, and age.
  3. purrr::map_dfr will map janitor::adorn_totals over each dataframe in the list, which essentially is creating subtotals, and then row binds back into one dataframe, hence the r in dfr.
  4. The remainder of the pipes are to format the data how you requested. So that the year, state, and age columns will have the same value in the "Total" row. The last select reorders everything back to the original order.

2. Pivot

Alternatively, since janitor::adorn_totals is a bit buggy you can do this using some pivots to create the exact same output:

library(tidyr)
library(dplyr)

df %>% 
  tidyr::pivot_wider(id_cols = c(year, state, age),
                     names_from = sex, 
                     values_from = population:deaths) %>%
  dplyr::mutate(population_Total = rowSums(across(starts_with("population"))),
                deaths_Total = rowSums(across(starts_with("deaths")))) %>% 
  tidyr::pivot_longer(cols = contains("_"), 
                      names_to = c(".value", "sex"),
                      names_pattern = c("(.*)_(.*)$"))

FYI, across, contains, and starts_with are exported or from the dplyr library.


How it works

  1. pivot_wider re-orients the data so there is one row per year, state, and age. New columns are created appending sex to the end of the pivoted columns, which are population and deaths in this case. This creates columns population_Male, population_Female, deaths_Male, and so on.
  2. We then sum across rows using tidyselect syntax to sum all the rows starting with "population" and likewise for "deaths", storing them in appropriately named columns appended with "_Total".
  3. The final pivot maintains the year, state, and age as the ID columns. It pivots all columns with an underscore: population_Male, population_Female, population_Total, etc. Using regular expressions the column names are parsed by the underscore: population_Total is parsed into the strings population and Total. The former becomes a column name storing values, the latter is a value stored in a column called "sex".

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...