Consider the following Pandas data frame.
import pandas as pd
df = pd.DataFrame({"val": [1, 2, 3, 10, 20, 30, 40],
                   "group_id": ["ones", "ones", "ones", "tens", "tens", "tens", "tens"],
                   "condition": ["sum", "sum", "sum", "mean", "mean", "mean", "mean"]})
I'd like to aggregate the data in df["val"] by grouping on group_id, then applying a different aggregation function to each group. To determine which aggregation function to use, I'd like to set up a condition that references another column in df, namely condition.
Specifically, I'd like to take the sum of all elements in val for the "ones" group and the mean of all elements in the "tens" group. (But I don't need to pull the name of the aggregation function from condition. The condition column could be anything, as long as every group has the same condition, e.g. all "ones" rows correspond to "sum". So maybe the condition column is redundant?)
I'd like to get the following result.
df_aggregated = pd.DataFrame({"group_id": ["ones", "tens"],
                              "val_aggregated": [6, 25]})
There's a clean way to do this with R and dplyr:
df <- tibble(val = c(1, 2, 3, 10, 20, 30, 40),
             group_id = c("ones", "ones", "ones", "tens", "tens", "tens", "tens"),
             condition = c("sum", "sum", "sum", "mean", "mean", "mean", "mean"))
df_aggregated <- df %>%
  group_by(group_id) %>%
  summarise(val_aggregated = case_when(condition == "sum" ~ sum(val),
                                       condition == "mean" ~ mean(val),
                                       TRUE ~ NA_real_)) %>%
  distinct()
But I can't seem to find a good way to do this kind of aggregation in Pandas. Maybe the solution involves NumPy's select() function? Or would the Pandas-idiomatic way be to just loop over the grouped data structure?
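For reference, here's a rough sketch of the "loop over the grouped data structure" idea, written as a groupby-apply. It assumes (as above) that every row in a group carries the same condition, so the first value of each group's condition column can pick the aggregation function. The helper name aggregate_group is my own, not an established API:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"val": [1, 2, 3, 10, 20, 30, 40],
                   "group_id": ["ones", "ones", "ones", "tens", "tens", "tens", "tens"],
                   "condition": ["sum", "sum", "sum", "mean", "mean", "mean", "mean"]})

def aggregate_group(g):
    # Pick the aggregation function from the group's (assumed uniform) condition.
    func = {"sum": np.sum, "mean": np.mean}[g["condition"].iloc[0]]
    return func(g["val"])

df_aggregated = (df.groupby("group_id")
                   .apply(aggregate_group)
                   .reset_index(name="val_aggregated"))
```

This gives one row per group_id, but I'm not sure whether an apply like this is considered idiomatic or whether a vectorized approach (e.g. computing both aggregations and selecting between them) would be preferred.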
Many thanks for any help you can provide!
question from:
https://stackoverflow.com/questions/65864398/conditionally-aggregate-grouped-data-frame-with-different-functions-depending-on