To estimate how similar different samples are (Similarity Measurement), the usual approach is to compute a "distance" (Distance) between them.
1. Euclidean Distance
The Euclidean distance is the easiest distance measure to understand; it comes directly from the formula for the distance between two points in Euclidean space: d(x, y) = sqrt(Σ|xi - yi|²).
import numpy as np

vector1 = np.array([1, 2, 3])
vector2 = np.array([4, 5, 6])
# square root of the sum of squared coordinate differences
distance = np.sqrt(np.sum((vector1 - vector2) ** 2))
print(distance)  # 5.196...
2. Manhattan Distance
Also known as the city block distance (City Block distance): the sum of the absolute differences of the coordinates, d(x, y) = Σ|xi - yi|.
import numpy as np

vector1 = np.array([1, 2, 3])
vector2 = np.array([4, 5, 6])
# sum of absolute coordinate differences
distance = np.sum(np.abs(vector1 - vector2))
print(distance)  # 9
3. Chebyshev Distance
The distance between two points is defined as the maximum of the absolute differences of their coordinates: d(x, y) = max|xi - yi|.
import numpy as np

vector1 = np.array([1, 2, 3])
vector2 = np.array([4, 7, 5])
# maximum absolute coordinate difference
distance = np.abs(vector1 - vector2).max()
print(distance)  # 5
4. Minkowski Distance
The Minkowski distance is not a single distance but a whole family of distances: d(x, y) = (Σ|xi - yi|^p)^(1/p).
Setting p = 1 yields the Manhattan distance, p = 2 the Euclidean distance, and p → ∞ the Chebyshev distance; see the sketch below.
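A minimal sketch of the general formula; the helper function minkowski here is illustrative, not part of NumPy (scipy.spatial.distance.minkowski computes the same thing):
import numpy as np

def minkowski(u, v, p):
    # p-th root of the sum of p-th powers of the absolute differences
    return np.sum(np.abs(u - v) ** p) ** (1.0 / p)

vector1 = np.array([1, 2, 3])
vector2 = np.array([4, 5, 6])
print(minkowski(vector1, vector2, 1))  # 9.0, the Manhattan distance
print(minkowski(vector1, vector2, 2))  # 5.196..., the Euclidean distance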
5. Standardized Euclidean Distance
First standardize every component so that it has mean 0 and variance 1, then take the ordinary Euclidean distance:
standardized value = (original value - component mean) / component standard deviation
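A minimal sketch on a made-up sample matrix; it is equivalent to scipy.spatial.distance.seuclidean(u, v, V) with V set to the per-component variances:
import numpy as np

# made-up data: one sample per row, one component per column
X = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0],
              [7.0, 1.0, 2.0]])
Xs = (X - X.mean(axis=0)) / X.std(axis=0)  # standardize each component
# ordinary Euclidean distance between the first two standardized samples
distance = np.sqrt(np.sum((Xs[0] - Xs[1]) ** 2))
print(distance)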
6. Mahalanobis Distance
Measures the degree of difference between two random vectors drawn from the same distribution with covariance matrix Σ: d(x, y) = sqrt((x - y)^T Σ⁻¹ (x - y)).
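A minimal sketch, estimating the covariance matrix from a small made-up sample set:
import numpy as np

# made-up observations: one sample per row, one variable per column
X = np.array([[3.0, 4.0], [5.0, 6.0], [2.0, 2.0], [8.0, 4.0]])
VI = np.linalg.inv(np.cov(X.T))  # inverse of the covariance matrix
d = X[0] - X[1]
distance = np.sqrt(d @ VI @ d)   # sqrt((x - y)^T Σ^(-1) (x - y))
print(distance)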
7. Cosine Similarity (Cosine)
Measures the difference in direction between two vectors: cos(θ) = (x · y) / (||x|| ||y||).
The cosine takes values in [-1, 1]. The larger the cosine, the smaller the angle between the two vectors; the smaller the cosine, the larger the angle. It reaches its maximum of 1 when the two vectors point in the same direction, and its minimum of -1 when they point in exactly opposite directions.
import numpy as np
vector1 = np.array([1, 2, 3])
vector2 = np.array([4, 5, 6])
# dot product divided by the product of the two norms
cosV12 = np.dot(vector1, vector2) / (np.linalg.norm(vector1) * np.linalg.norm(vector2))
print(cosV12)  # 0.9746...
8. Hamming Distance
The Hamming distance between two equal-length strings s1 and s2 is the minimum number of substitutions needed to turn one string into the other.
import numpy as np
v1 = np.array([1, 1, 0, 1, 0, 1, 0, 0, 1])
v2 = np.array([0, 1, 1, 0, 0, 0, 1, 1, 1])
# count the positions at which the two vectors differ
distance = np.sum(v1 != v2)
print(distance)  # 6
9. Jaccard Similarity Coefficient & Jaccard Distance
Jaccard similarity coefficient: the proportion of the union of two sets A and B taken up by their intersection, J(A, B) = |A ∩ B| / |A ∪ B|.
The Jaccard similarity coefficient is one measure of how similar two sets are.
Jaccard distance: measures how distinguishable two sets are by the proportion of differing elements among all elements, i.e., 1 - J(A, B).
import numpy as np
import scipy.spatial.distance as dist

matV = np.array([[1, 1, 0, 1, 0, 1, 0, 0, 1],
                 [0, 1, 1, 0, 0, 0, 1, 1, 1]])
# 'jaccard' treats each row as a boolean vector
distance = dist.pdist(matV, 'jaccard')
print(distance)  # [0.75]
10. Correlation Coefficient and Correlation Distance
The correlation coefficient measures the degree of linear dependence between random variables X and Y and takes values in [-1, 1].
The larger the absolute value of the correlation coefficient, the more strongly X and Y are correlated.
When X and Y are perfectly linearly related, the correlation coefficient is 1 (positive linear correlation) or -1 (negative linear correlation). The correlation distance is defined as 1 minus the correlation coefficient.
import numpy as np
import scipy.spatial.distance as dist

matV = np.array([[1, 1, 0, 1, 0, 1, 0, 0, 1],
                 [0, 1, 1, 0, 0, 0, 1, 1, 1]])
# 'correlation' returns 1 minus the Pearson correlation of the two rows
distance = dist.pdist(matV, 'correlation')
print(distance)
11. Information Entropy
Information entropy measures how disordered or spread out a distribution is: H(X) = -Σ pi * log2(pi).
The more spread out (i.e., the more uniform) the distribution, the larger the entropy; the more ordered (i.e., the more concentrated) the distribution, the smaller the entropy. A small sketch follows.
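A minimal sketch of the Shannon formula; the helper entropy below is illustrative, not a library function:
import numpy as np

def entropy(p):
    # H = -sum(p_i * log2(p_i)); zero probabilities contribute nothing
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

print(entropy([0.25, 0.25, 0.25, 0.25]))  # uniform: 2.0, the maximum for 4 outcomes
print(entropy([0.97, 0.01, 0.01, 0.01]))  # concentrated: about 0.24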
12. Other distances in scipy.spatial.distance
All of the measures above, and several more, are available through scipy.spatial.distance.pdist(X, 'metric') or as the pairwise functions listed below:
| Function | Description |
| --- | --- |
| braycurtis(u, v) | Computes the Bray-Curtis distance between two 1-D arrays. |
| canberra(u, v) | Computes the Canberra distance between two 1-D arrays. |
| chebyshev(u, v) | Computes the Chebyshev distance. |
| cityblock(u, v) | Computes the City Block (Manhattan) distance. |
| correlation(u, v) | Computes the correlation distance between two 1-D arrays. |
| cosine(u, v) | Computes the Cosine distance between 1-D arrays. |
| dice(u, v) | Computes the Dice dissimilarity between two boolean 1-D arrays. |
| euclidean(u, v) | Computes the Euclidean distance between two 1-D arrays. |
| hamming(u, v) | Computes the Hamming distance between two 1-D arrays. |
| jaccard(u, v) | Computes the Jaccard-Needham dissimilarity between two boolean 1-D arrays. |
| kulsinski(u, v) | Computes the Kulsinski dissimilarity between two boolean 1-D arrays. |
| mahalanobis(u, v, VI) | Computes the Mahalanobis distance between two 1-D arrays. |
| matching(u, v) | Computes the Matching dissimilarity between two boolean 1-D arrays. |
| minkowski(u, v, p) | Computes the Minkowski distance between two 1-D arrays. |
| rogerstanimoto(u, v) | Computes the Rogers-Tanimoto dissimilarity between two boolean 1-D arrays. |
| russellrao(u, v) | Computes the Russell-Rao dissimilarity between two boolean 1-D arrays. |
| seuclidean(u, v, V) | Returns the standardized Euclidean distance between two 1-D arrays. |
| sokalmichener(u, v) | Computes the Sokal-Michener dissimilarity between two boolean 1-D arrays. |
| sokalsneath(u, v) | Computes the Sokal-Sneath dissimilarity between two boolean 1-D arrays. |
| sqeuclidean(u, v) | Computes the squared Euclidean distance between two 1-D arrays. |
| wminkowski(u, v, p, w) | Computes the weighted Minkowski distance between two 1-D arrays. |
| yule(u, v) | Computes the Yule dissimilarity between two boolean 1-D arrays. |
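A short usage sketch calling a few of these pairwise functions directly (the vectors are the same made-up examples used earlier in this post):
import numpy as np
from scipy.spatial import distance

u = np.array([1, 2, 3])
v = np.array([4, 5, 6])
print(distance.euclidean(u, v))  # 5.196...
print(distance.cityblock(u, v))  # 9
print(distance.chebyshev(u, v))  # 3
print(distance.minkowski(u, v, 3))  # p = 3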