Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
688 views
in Technique[技术] by (71.8m points)

python - Pandas groupby category, rating, get top value from each category?

First question on SO, very new to pandas and still a little shaky on the terminology: I'm trying to figure out the proper syntax/sequence of operations on a dataframe to be able to group by column B, find the max (or min) corresponding value for each group in column C, and retrieve the corresponding value for that in column A.

Suppose this is my dataframe:

name     type      votes     
bob       dog        10
pete      cat         8
fluffy    dog         5
max       cat         9

Using df.groupby('type').votes.agg('max') returns:

dog     10
cat      9

So far, so good. However, I'd like to figure out how to return this:

dog    10    bob
cat     9    max 

I've gotten as far as df.groupby(['type', 'votes']).name.agg('max'), though that returns

dog   5    fluffy
      10   bob
cat   8    pete
      9    max

... which is fine for this pretend dataframe, but doesn't quite help when working with a much larger one.

Thanks very much!

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Answer

0 votes
by (71.8m points)

If df has an index with no duplicate values, then you can use idxmax to return the index of the maximum row for each group. Then use df.loc to select the entire row:

In [322]: df.loc[df.groupby('type').votes.agg('idxmax')]
Out[322]: 
  name type  votes
3  max  cat      9
0  bob  dog     10

If df.index has duplicate values, i.e. is not a unique index, then make the index unique first:

df = df.reset_index()

then use idxmax:

result = df.loc[df.groupby('type').votes.agg('idxmax')]

If you really need to, you can return df to its original state:

df = df.set_index(['index'], drop=True)

but in general life is much better with a unique index.


Here is an example showing what goes wrong when df does not have a unique index. Suppose the index is AABB:

import pandas as pd
df = pd.DataFrame({'name': ['bob', 'pete', 'fluffy', 'max'],
                   'type': ['dog', 'cat', 'dog', 'cat'],
                   'votes': [10, 8, 5, 9]}, 
                  index=list('AABB'))
print(df)
#      name type  votes
# A     bob  dog     10
# A    pete  cat      8
# B  fluffy  dog      5
# B     max  cat      9

idxmax returns the index values A and B:

print(df.groupby('type').votes.agg('idxmax'))
type
cat    B
dog    A
Name: votes, dtype: object

But A and B do not uniquely specify the desired rows. df.loc[...] returns all rows whose index value is A or B:

print(df.loc[df.groupby('type').votes.agg('idxmax')])
#      name type  votes
# B  fluffy  dog      5
# B     max  cat      9
# A     bob  dog     10
# A    pete  cat      8

In contrast, if we reset the index:

df = df.reset_index()
#   index    name type  votes
# 0     A     bob  dog     10
# 1     A    pete  cat      8
# 2     B  fluffy  dog      5
# 3     B     max  cat      9

then df.loc can be used to select the desired rows:

print(df.groupby('type').votes.agg('idxmax'))
# type
# cat    3
# dog    0
# Name: votes, dtype: int64

print(df.loc[df.groupby('type').votes.agg('idxmax')])
#   index name type  votes
# 3     B  max  cat      9
# 0     A  bob  dog     10

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...