Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

python - measuring the distance between rows of a dataframe

I have a dataframe with 472 rows and 32 columns that looks like this:

2   3   0   4   2   0   0   5   2   3   3   3   2   0   5   5   3   3   3   2   2   0   2   5   3   3   3   2   2   2   0   5
2   3   0   4   2   0   0   5   2   3   3   3   2   0   5   5   3   3   3   2   2   0   2   5   3   3   3   2   2   2   0   5
2   3   0   4   2   0   0   5   2   3   3   3   2   0   5   5   3   3   3   2   2   0   2   5   3   3   3   2   2   2   0   5

Here, every row represents the 32 teeth of one person, and each number from 0 to 5 represents a different tooth category. I want to measure the distance between any two rows using different distance metrics (e.g. Manhattan, Euclidean, Minkowski). The smaller the distance, the more likely it is that the two rows belong to the same person.

* If I apply one-hot encoding before computing these metrics, there will be more than 32 columns for every row, which would be useless for me.

* I also found cdist and pdist, but these functions seem to give me element-wise distance results, whereas I want a single result for any two rows.

Am I attempting something nonsensical, or what should I do to compute these distances?

question from:https://stackoverflow.com/questions/65939890/measuring-the-distance-between-rows-of-a-dataframe

1 Answer

The distance calculation function you seem to be looking for is the following:

https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise_distances.html

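You can set the metric parameter to any of the metrics supported by scipy.spatial.distance.pdist (for example 'cityblock' for Manhattan, 'euclidean', or 'minkowski'). If you don't pass a metric, Euclidean distance is used.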

Example of how it would work:

a = [[1,2,3,4,5,6,7,8,10]]
b = [[2,4,1,3,4,5,6,7,8]]
c = [[4,2,1,54,7,85,89,1,2]]

from sklearn.metrics import pairwise_distances

pairwise_distances(a,b)

The output would be:

array([[4.24264069]])

Similarly, the output for

pairwise_distances(a,c)

would be:

array([[124.87994234]])

Hence, c is farther from a than b is.
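To try the metrics named in the question, you can pass the metric argument explicitly. The snippet below is only a sketch that reuses the a and b lists from above; the variable names manhattan, euclidean and minkowski are mine, and p=3 is an arbitrary choice for illustration.

```python
from sklearn.metrics import pairwise_distances

a = [[1, 2, 3, 4, 5, 6, 7, 8, 10]]
b = [[2, 4, 1, 3, 4, 5, 6, 7, 8]]

# Manhattan (L1): sum of the absolute differences
manhattan = pairwise_distances(a, b, metric="manhattan")[0, 0]

# Euclidean (L2): this is what you get when no metric is given
euclidean = pairwise_distances(a, b, metric="euclidean")[0, 0]

# Minkowski with p=3; extra keyword arguments are forwarded to the metric
minkowski = pairwise_distances(a, b, metric="minkowski", p=3)[0, 0]

print(manhattan, euclidean, minkowski)
```

Each call returns a 1x1 array, so [0, 0] pulls out the single number for the pair.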

You can use this logic in your problem. In your case, the following code snippet would do the trick:

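import pandas as pd
import numpy as np
from sklearn.metrics import pairwise_distances

df = pd.read_csv('your_file.csv')
for i, row in df.iterrows():
    # pairwise_distances expects 2-D input, so reshape each row to (1, n_columns)
    row = np.array(row).reshape(1, -1)
    for j, other_row in df.iterrows():
        other_row = np.array(other_row).reshape(1, -1)
        # The result is a 1x1 array; [0, 0] extracts the single distance value
        distance = pairwise_distances(row, other_row)[0, 0]
        print("Distance between row {} and {} : {}".format(i, j, distance))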
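The nested loop makes one call per pair of rows, which works but gets slow with 472 rows. As a rough sketch, assuming every column in the dataframe is numeric, you could pass the whole frame in one call and get back a square matrix in which entry [i, j] is the single distance between row i and row j. The small dataframe here is made-up sample data, not your file, and dist and dist_df are just names I picked.

```python
import pandas as pd
from sklearn.metrics import pairwise_distances

# Made-up sample data standing in for the real 472 x 32 dataframe
df = pd.DataFrame([
    [2, 3, 0, 4],
    [2, 3, 0, 4],
    [5, 3, 1, 0],
])

# One call computes every row-to-row distance; the result has shape (n_rows, n_rows)
dist = pairwise_distances(df.to_numpy(), metric="manhattan")

# Wrap the matrix in a DataFrame so rows and columns carry the original index labels
dist_df = pd.DataFrame(dist, index=df.index, columns=df.index)
print(dist_df)
```

With this layout, dist_df.loc[i, j] gives the distance between rows i and j, and the diagonal is zero because each row is compared with itself.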
