Consider the following Pandas data frame.
import pandas as pd
df = pd.DataFrame({"val": [1, 2, 3, 10, 20, 30, 40],
                   "group_id": ["ones", "ones", "ones", "tens", "tens", "tens", "tens"],
                   "condition": ["sum", "sum", "sum", "mean", "mean", "mean", "mean"]})
I'd like to aggregate the data in df["val"] by grouping on group_id, then applying a different aggregation function to each group. To determine which aggregation function to use, I'd like to set up a condition that references another column in df, namely condition.
Specifically, I'd like to take the sum of all elements in val for the "ones" group and the mean of all elements in the "tens" group. (But I don't need to pull the name of the aggregation function from condition. The condition column could be anything, as long as every group has the same condition, e.g. all "ones" rows correspond to "sum". So maybe the condition column is redundant?)
I'd like to get the following result.
df_aggregated = pd.DataFrame({"group_id": ["ones", "tens"],
                              "val_aggregated": [6, 25]})
There's a clean way to do this with R and dplyr:
df <- tibble(val = c(1, 2, 3, 10, 20, 30, 40),
             group_id = c("ones", "ones", "ones", "tens", "tens", "tens", "tens"),
             condition = c("sum", "sum", "sum", "mean", "mean", "mean", "mean"))
df_aggregated <- df %>%
  group_by(group_id) %>%
  summarise(val_aggregated = case_when(condition == "sum" ~ sum(val),
                                       condition == "mean" ~ mean(val),
                                       TRUE ~ NA_real_)) %>%
  distinct()
But I can't seem to find a good way to do this kind of aggregation in Pandas. Maybe the solution involves NumPy's select() function? Or would the Pandas-idiomatic way be to just loop over the grouped data structure?
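For reference, here's a rough sketch of the "loop over the grouped data structure" idea, written as a groupby-apply. It assumes (as above) that every row in a group carries the same condition, so the first value of each group's condition column can pick the aggregation function. The helper name aggregate_group is my own, not an established API:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"val": [1, 2, 3, 10, 20, 30, 40],
                   "group_id": ["ones", "ones", "ones", "tens", "tens", "tens", "tens"],
                   "condition": ["sum", "sum", "sum", "mean", "mean", "mean", "mean"]})

def aggregate_group(g):
    # Pick the aggregation function from the group's (assumed uniform) condition.
    func = {"sum": np.sum, "mean": np.mean}[g["condition"].iloc[0]]
    return func(g["val"])

df_aggregated = (df.groupby("group_id")
                   .apply(aggregate_group)
                   .reset_index(name="val_aggregated"))
```

This gives one row per group_id, but I'm not sure whether an apply like this is considered idiomatic or whether a vectorized approach (e.g. computing both aggregations and selecting between them) would be preferred.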
Many thanks for any help you can provide!
question from:
https://stackoverflow.com/questions/65864398/conditionally-aggregate-grouped-data-frame-with-different-functions-depending-on