Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

python - Split a dataframe into chunks where each chunk has no common non-zero element with the other chunks

I have a fairly large, very sparse dataframe (about 2000x2000, though not necessarily square) that looks something like this:

      col1  col2  col3  col4
row1     0     0     1     0
row2     1     1     0     0
row3     0     1     0     1
row4     0     0     0     1

You can recreate this with this line:

import pandas as pd
df = pd.DataFrame([[0, 0, 1, 0], [1, 1, 0, 0], [0, 1, 0, 1], [0, 0, 0, 1]], columns=["col1", "col2", "col3", "col4"], index=["row1", "row2", "row3", "row4"])

In this case, row2 and row3 share a non-zero element in col2, and row3 and row4 share one in col4, so all three belong to one group (row2, row3, row4). row1 has no non-zero column in common with any other row, so it forms its own group.

What I would like is a reasonably efficient way to get all of these groups of rows that are independent of each other.

The only strategy I've come up with is to loop over all rows, find the rows each one shares a non-zero column with, and keep looping until every combination has been tied together, but that seems really inefficient.

Does anyone have a better way of generating these distinct groups?

Question from: https://stackoverflow.com/questions/65847279/split-a-dataframe-into-chunks-where-each-chunk-has-no-common-non-zero-element-wi

Fight the dragon too long and you become the dragon yourself; gaze too long into the abyss and the abyss gazes back…

1 Answer

If you can install and use networkx, you can treat the rows and columns as nodes of a graph and read the groups off its connected components:

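# reshape to long format, keeping only the cells equal to 1 (use df.ne(0) for other non-zero values)
# dropna() guards against stack() versions that keep NaN
s = df.where(df.eq(1)).stack().dropna()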
print(s.head())
# row1  col3    1.0
# row2  col1    1.0
#       col2    1.0
# row3  col2    1.0
#       col4    1.0
# dtype: float64

import networkx as nx
# create the graph and add one edge per (row, col) pair that holds a 1
G = nx.Graph()
G.add_edges_from(set(s.index))

# connected_components collects all linked nodes into one group
print(list(nx.connected_components(G)))
#[{'row4', 'row2', 'row3', 'col1', 'col4', 'col2'}, <-- rows and cols together
# {'col3', 'row1'}]

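# map every node to its group number, then keep only the rows, in df.index order
# (a row with no 1 at all never enters the graph; .reindex(df.index) avoids a KeyError for it)
gr = pd.Series({node: group for group, nodes in enumerate(nx.connected_components(G))
                            for node in nodes})[df.index]
# this could also be stored as a column with df['gr'] = gr

print(gr)
# row1    1
# row2    0
# row3    0
# row4    0
# dtype: int64
# (group numbers are arbitrary labels and may come out in a different order)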
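
Since the goal is to split the dataframe into chunks, the group labels can then drive a groupby. The following is only a sketch: `gr` is hard-coded from the printed output above so that it runs without networkx, and the `chunks` dictionary and the step that drops all-zero columns are illustrative choices, not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame(
    [[0, 0, 1, 0], [1, 1, 0, 0], [0, 1, 0, 1], [0, 0, 0, 1]],
    columns=["col1", "col2", "col3", "col4"],
    index=["row1", "row2", "row3", "row4"],
)

# Assumption: group labels copied from the printed output above,
# hard-coded here so the sketch does not depend on networkx.
gr = pd.Series([1, 0, 0, 0], index=df.index)

# One sub-frame per group label; columns that are all zero
# within a chunk are dropped from that chunk.
chunks = {}
for label, chunk in df.groupby(gr):
    chunks[int(label)] = chunk.loc[:, chunk.ne(0).any()]

for label, chunk in chunks.items():
    print(label, list(chunk.index), list(chunk.columns))
```

Each value in `chunks` is an ordinary dataframe, so the pieces can be processed independently of one another.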

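If row labels and column labels can overlap (for example, both are default integer indexes), a row and a column with the same label would collapse into a single graph node. In that case, add a merge step so that the edges link pairs of rows instead of row-column pairs: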

# reshape to long format, keeping only the cells equal to 1, then self-merge
df_ = df.where(df.eq(1)).stack().dropna().reset_index()
df_ = df_.merge(df_, on=['level_1']) # <-- merge on cols: pairs up rows that share a column
print(df_.head())
#   level_0_x level_1  0_x level_0_y  0_y
# 0      row1    col3  1.0      row1  1.0
# 1      row2    col1  1.0      row2  1.0
# 2      row2    col2  1.0      row2  1.0
# 3      row2    col2  1.0      row3  1.0
# 4      row3    col2  1.0      row2  1.0

import networkx as nx
# create the graph and add one edge per pair of rows that share a column
G = nx.Graph()
G.add_edges_from(df_[['level_0_x','level_0_y']].to_numpy()) #<-- pairs of rows sharing the same col
print(list(nx.connected_components(G)))
#[{'row1'}, {'row3', 'row2', 'row4'}] <-- only row names here

# map every row to its group number
gr = pd.Series({row: group for group, rows in enumerate(nx.connected_components(G))
                           for row in rows})
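
If installing networkx is not an option, the same grouping can probably be obtained with a small union-find (disjoint-set) structure in plain Python. This is a different technique from the one in the answer, shown only as a sketch: the function name `group_rows` and its input format (a dict mapping each row label to the column labels where that row is non-zero) are assumptions made for illustration. Rows and columns are tracked in separate dicts, so overlapping labels are not a problem here.

```python
def group_rows(nonzero_cols):
    """Group row labels that are linked through shared non-zero columns.

    nonzero_cols (hypothetical input format): dict mapping each row label
    to an iterable of the column labels where that row is non-zero.
    """
    parent = {row: row for row in nonzero_cols}

    def find(x):
        # Walk up to the root, shortening the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    first_row_for_col = {}
    for row, cols in nonzero_cols.items():
        for col in cols:
            if col in first_row_for_col:
                # Column already seen: merge this row's group with
                # the group of the first row that used the column.
                parent[find(row)] = find(first_row_for_col[col])
            else:
                first_row_for_col[col] = row

    groups = {}
    for row in nonzero_cols:
        groups.setdefault(find(row), []).append(row)
    return list(groups.values())


example = {
    "row1": ["col3"],
    "row2": ["col1", "col2"],
    "row3": ["col2", "col4"],
    "row4": ["col4"],
}
print(group_rows(example))
# [['row1'], ['row2', 'row3', 'row4']]
```

Each non-zero cell is visited once, so the work grows with the number of non-zero cells rather than with the full size of the dataframe. A row with no non-zero cell simply ends up in a group of its own.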
