Data Mining: Learning NumPy

Author: yaunye | Source: Internet | 2023-09-13 10:54

What is NumPy

NumPy is an open-source numerical computing extension for Python. It stores and processes large matrices (data of any number of dimensions) far more efficiently than Python's own nested list structure, which can also be used to represent a matrix.

The ndarray data type

NumPy provides an N-dimensional array type, the ndarray, which describes a collection of "items" of the same type.

import numpy as np

score = np.array([[80, 89, 86, 67, 79],
                  [78, 97, 89, 67, 81],
                  [90, 94, 78, 67, 74],
                  [91, 91, 90, 67, 69],
                  [76, 87, 75, 67, 86],
                  [70, 79, 84, 67, 84],
                  [94, 92, 93, 67, 64],
                  [86, 85, 83, 67, 80]])
print(score, type(score))  # type(score) is <class 'numpy.ndarray'>

Speed comparison: ndarray vs. native Python list

import random
import time

import numpy as np

# Build a large list of random floats (100 million elements; lower the count if memory is tight)
python_list = []
for i in range(100000000):
    python_list.append(random.random())
ndarray_list = np.array(python_list)
len(ndarray_list)

# Sum the native Python list
t1 = time.time()
a = sum(python_list)
t2 = time.time()
d1 = t2 - t1
print(d1)  # 0.7309620380401611

# Sum the ndarray
t3 = time.time()
b = np.sum(ndarray_list)
t4 = time.time()
d2 = t4 - t3
print(d2)  # 0.12980318069458008

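Advantages of NumPy:

    1) Storage layout
        ndarray - elements of one type - less general - data stored contiguously
        list - elements of mixed types - very general - stores references to objects scattered across the heap
    2) Parallelized computation
        ndarray supports vectorized operations
    3) Underlying language
        Written in C, which lets it release the GIL

1. Memory layout

2. ndarray supports parallelized (vectorized) computation

3. NumPy's core is written in C and internally lifts the restriction of the GIL (the global interpreter lock, under which only one thread actually runs at a time)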
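As a small illustration of the vectorization point above, the sketch below computes the same element-wise product with a Python loop and with a single ndarray expression. The sample values and variable names are made up for the example.

```python
import numpy as np

prices = [1.5, 2.0, 3.25, 4.0]   # hypothetical sample data
quantities = [2, 3, 4, 5]

# Pure Python: an explicit loop over element pairs
totals_list = [p * q for p, q in zip(prices, quantities)]

# NumPy: one vectorized expression; the loop runs in compiled code
totals_array = np.array(prices) * np.array(quantities)

print(totals_list)
print(totals_array)
# The ndarray keeps its elements in one contiguous block of memory
print(totals_array.flags["C_CONTIGUOUS"])
```

Both versions give the same numbers; the difference is where the loop runs, which is what the timing comparison earlier in the article measures.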

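Attributes of N-dimensional arrays: ndarray attributes (shape + dtype)

ndarray shape

import numpy as np

# The shape is a tuple. (2, 3): two numbers mean 2 dimensions, here 2 rows and 3 columns
a = np.array([[1, 2, 3], [4, 5, 6]])

# (4,): a 1-D array needs one number, the element count; the trailing comma makes it a tuple
b = np.array([1, 2, 3, 4])

# (2, 2, 3): the outermost level holds 2 two-dimensional arrays, each of those nests
# 2 one-dimensional arrays, and each one-dimensional array has 3 elements
c = np.array([[[1, 2, 3], [4, 5, 6]], [[1, 2, 3], [4, 5, 6]]])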

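How should the shape of an array be understood?

A two-dimensional array is really a one-dimensional array that nests several one-dimensional arrays.

A three-dimensional array is really a one-dimensional array that nests several two-dimensional arrays.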
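The nesting can be checked directly. The sketch below reuses the arrays a, b and c from the shape example and prints their ndim, shape and size; indexing the outer level of the 3-D array yields a 2-D array.

```python
import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6]])
b = np.array([1, 2, 3, 4])
c = np.array([[[1, 2, 3], [4, 5, 6]], [[1, 2, 3], [4, 5, 6]]])

# ndim is the number of dimensions, shape the length along each one,
# size the total number of elements
for name, arr in (("a", a), ("b", b), ("c", c)):
    print(name, arr.ndim, arr.shape, arr.size)

# Taking one element of the outer level of c gives a 2-D array
first_block = c[0]
print(first_block.shape)
```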

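ndarray dtype

If no type is given when an ndarray is created:
integers default to int64 (platform dependent: Windows builds of NumPy before 2.0 default to int32)
floating-point numbers default to float64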

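Specifying the type when creating an array

import numpy as np

# Specify the type at creation (1): pass a NumPy type object
t = np.array([1.1, 2.2, 3.3], dtype=np.float32)
# Specify the type at creation (2): pass the type name as a string
tt = np.array([1.1, 2.2, 3.3], dtype="float32")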
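To see what the dtype choice changes, the following sketch compares the per-element and total storage of the default float64 array with a float32 one. The variable names t64 and t32 are only illustrative.

```python
import numpy as np

t64 = np.array([1.1, 2.2, 3.3])                    # default float type
t32 = np.array([1.1, 2.2, 3.3], dtype=np.float32)  # explicit 32-bit floats

# itemsize is bytes per element, nbytes is bytes for the whole array
print(t64.dtype, t64.itemsize, t64.nbytes)
print(t32.dtype, t32.itemsize, t32.nbytes)

# The type object and the string name describe the same dtype
print(t32.dtype == np.dtype("float32"))
```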

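Basic operations

Ways to create arrays (4 kinds)

1) Arrays of zeros and ones
    np.zeros(shape)
    np.ones(shape)
2) From an existing array
    np.array() np.copy() deep copy
    np.asarray() shallow copy (an ndarray input is returned as-is)
3) Arrays over a fixed range
    np.linspace(0, 10, 100)
        [0, 10], evenly spaced

    np.arange(a, b, c)
        range(a, b, c)
            [a, b), c is the step
4) Random arrays
    Distribution - histogram
    1) Uniform distribution
        every bin is equally likely
    2) Normal distribution
        σ: spread, degree of fluctuation, concentration, stability, dispersion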

1. Arrays of zeros and ones

import numpy as np

# 1. Create arrays filled with 0 and with 1
t = np.zeros(shape=(3, 4), dtype="float32")
tt = np.ones(shape=[2, 3], dtype=np.int32)

2. Creating from an existing array

import numpy as np

# Method 1: np.array()
score = np.array([[80, 89, 86, 67, 79], [94, 92, 93, 67, 64], [86, 85, 83, 67, 80]])
# Method 2: np.copy()
ttt = np.copy(score)
# Method 3: np.asarray()
tttt = np.asarray(ttt)

Difference:

np.array() and np.copy(): deep copy, the result owns its own data
np.asarray(): shallow copy, an ndarray input is returned as-is, so the result shares memory with it
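The difference shows up once the source array is modified afterwards. A minimal sketch, with a shortened score array and illustrative variable names:

```python
import numpy as np

score = np.array([[80, 89, 86], [94, 92, 93]])

copied = np.array(score)     # deep copy: new data buffer
cloned = np.copy(score)      # deep copy: new data buffer
aliased = np.asarray(score)  # no copy: the very same ndarray object

score[0, 0] = 0              # modify the original in place

# The deep copies keep the old value, the alias sees the change
print(copied[0, 0], cloned[0, 0], aliased[0, 0])
print(aliased is score, np.shares_memory(copied, score))
```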

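3. Arrays over a fixed range

            np.linspace(0, 10, 100)
                [0, 10]: closed on both ends, returns 100 evenly spaced numbers

            np.arange(a, b, c)
                [a, b): closed on the left, open on the right, an array with step c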
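A short sketch of the two interval conventions, using small arguments so the results are easy to read (the argument values are chosen for the example only):

```python
import numpy as np

# 5 evenly spaced points on [0, 10]; both endpoints are included
evenly_spaced = np.linspace(0, 10, 5)

# Step of 2 on [0, 10); the stop value 10 is excluded
stepped = np.arange(0, 10, 2)

print(evenly_spaced)  # [ 0.   2.5  5.   7.5 10. ]
print(stepped)        # [0 2 4 6 8]
```

With linspace the third argument is a count, with arange it is a step.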

4. Random arrays (distribution - histogram)

            1) Uniform distribution
                every bin is equally likely
            2) Normal distribution
                σ: spread, degree of fluctuation, concentration, stability, dispersion

1. Uniform distribution: every value is equally likely

import numpy as np
import matplotlib.pyplot as plt

# Uniform distribution:
data1 = np.random.uniform(low=-1, high=1, size=1000000)
# 1. Create the figure
plt.figure(figsize=(8, 6), dpi=100)
# 2. Draw the histogram
plt.hist(data1, 1000)
# 3. Show the plot
plt.show()

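2. Normal distribution

Variance is the measure used in probability theory and statistics for how dispersed a random variable or a data set is. In probability theory, variance measures how far a random variable deviates from its mathematical expectation (its mean). In statistics, the variance of a sample is the mean of the squared differences between each sample value and the mean of all sample values. The standard deviation is the square root of the variance; the smaller it is, the more concentrated the data.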
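The definition can be followed step by step in code. The sketch below computes the variance as the mean of squared deviations and compares it with np.var and np.std, which use the same definition by default. The height values are made up for the example.

```python
import numpy as np

heights = np.array([1.60, 1.70, 1.75, 1.80, 1.90])  # hypothetical sample values

mean = heights.mean()
# Variance: mean of the squared differences from the mean
variance = ((heights - mean) ** 2).mean()
# Standard deviation: square root of the variance
std = np.sqrt(variance)

print(mean, variance, std)
# NumPy's built-ins agree with the manual computation
print(np.isclose(variance, np.var(heights)), np.isclose(std, np.std(heights)))
```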

Demo:

import numpy as np
import matplotlib.pyplot as plt

# Normal distribution: mean 1.75, standard deviation 0.1
data2 = np.random.normal(loc=1.75, scale=0.1, size=1000000)
# 1. Create the figure
plt.figure(figsize=(20, 8), dpi=80)
# 2. Draw the histogram
plt.hist(data2, 1000)
# 3. Show the plot
plt.show()

Array indexing and slicing

Demo:

import numpy as np


def slice_index():
    # Modify a 1-D array
    arr = np.array([12, 32, 31])
    arr[0] = 2
    print(arr)

    # Modify a 2-D array
    arr2 = np.array([[12, 2], [43, 3]])
    arr2[0, 0] = 22  # changes [12, 2] to [22, 2]
    print(arr2)

    # Modify a 3-D array.
    # Three opening brackets mean a 3-D array: it holds 2 two-dimensional arrays,
    # each with 2 one-dimensional arrays of 3 numbers, hence shape (2, 2, 3)
    arr3 = np.array([[[1, 2, 3], [4, 5, 6]], [[12, 3, 34], [5, 6, 7]]])
    arr3[1, 0, 2] = 22  # changes [12, 3, 34] to [12, 3, 22]
    print(arr3)
    print(arr3[1, 1, :2])  # [5 6], the first 2 elements


if __name__ == '__main__':
    # Slicing and indexing
    slice_index()

Changing shape

ndarray.reshape(shape) returns a new ndarray and leaves the original array's shape unchanged. Only the shape differs; the order of the elements is kept, so this is not a transpose. ndarray.reshape(-1, 2) infers the dimension given as -1 automatically.

ndarray.resize(shape) returns nothing and modifies the original ndarray in place; the element order is likewise kept
ndarray.T transposes: rows become columns and columns become rows

Demo:

import numpy as np


def np_change():
    arr3 = np.array([[1, 2, 3], [4, 5, 6]])  # shape (2, 3)

    # Method 1: reshape returns a new ndarray and does not change the original's shape.
    # Only the shape changes; the element order is kept:
    # [[1 2]
    #  [3 4]
    #  [5 6]]
    arr4 = arr3.reshape((3, 2))
    print(arr3.shape)  # (2, 3)
    print(arr4.shape)  # (3, 2)

    # Method 2: resize returns None and modifies the original ndarray in place:
    # [[1 2 3 4 5 6]]
    arr3.resize((1, 6))
    print(arr3)  # shape is now (1, 6)

    # Method 3: T transposes, turning rows into columns and columns into rows:
    # [[1 3 5]
    #  [2 4 6]]
    print(arr4.T)


if __name__ == '__main__':
    # Change shape
    np_change()
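The automatic form reshape(-1, 2) is not exercised in the demo, so here is a minimal sketch of it. It also shows a detail worth knowing: for a contiguous array, reshape gives a view on the same data, so writing through the result changes the source array too.

```python
import numpy as np

arr = np.arange(6)         # [0 1 2 3 4 5]
auto = arr.reshape(-1, 2)  # -1 is inferred as 6 / 2 = 3 rows

print(auto.shape)
# The reshaped array shares its data with arr
print(np.shares_memory(arr, auto))

auto[0, 0] = 99            # write through the view
print(arr[0])              # arr sees the change
```

Call .copy() on the result when an independent array is needed.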

Changing the type

ndarray.astype(type)
Serializing an ndarray for local storage: ndarray.tobytes() (the older name ndarray.tostring() has been deprecated since NumPy 1.19)

import numpy as np


def type_change():
    # Type change 1: astype('float32')
    arr3 = np.array([[1, 2, 3], [4, 5, 6]])  # shape (2, 3)
    arr4 = arr3.astype("float32")  # convert int to float
    print(arr3.dtype)  # int32 in the original run (platform dependent, often int64)
    print(arr4.dtype)  # float32

    # Type change 2: serialize with tobytes(), the replacement for tostring()
    arr5 = arr3.tobytes()  # raw bytes, starting with \x01\x00\x00\x00
    print(arr5)


if __name__ == '__main__':
    # Change type
    type_change()

Removing duplicates from an array

np.unique() and set

import numpy as np


def deduplicate():
    """Remove duplicates from an ndarray."""
    temp = np.array([[1, 2, 3, 4], [3, 4, 5, 6]])
    # Method 1: np.unique() returns the sorted unique values as a new 1-D array
    unique_values = np.unique(temp)
    print('Deduplicated with unique:', unique_values)  # [1 2 3 4 5 6]

    temp2 = np.array([[1, 2, 3, 4], [3, 4, 5, 6]])
    # Method 2: set() needs a 1-D sequence, so flatten() the array first
    unique_set = set(temp2.flatten())
    print('Deduplicated with set after flattening:', unique_set)


if __name__ == '__main__':
    # Remove duplicates from an ndarray
    deduplicate()

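Summary:

ndarray operations (logical operations + statistical operations + operations between arrays)

1. Logical operations

        Boolean indexing
        General-purpose test functions
            np.all(boolean array)
                returns False if any element is False; returns True only if all are True
            np.any()
                returns True if any element is True; returns False only if all are False

np.where (ternary operator)
np.where(condition, value where True, value where False)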

Boolean indexing

import numpy as np


def demo():
    """Logical operations."""
    temp = np.array([[1, 2, 3, 4], [3, 4, 5, 6]])
    # Mark each element of temp as True if it is greater than 5, otherwise False
    print(temp > 5)
    # Select the values greater than or equal to 5
    print(temp[temp >= 5])  # [5 6]
    # Select the values greater than or equal to 5 and set them all to 100
    temp[temp >= 5] = 100
    print(temp)


if __name__ == '__main__':
    # Logical operations: boolean indexing
    demo()

General-purpose test functions

    np.all(boolean array)
        returns False if any element is False; returns True only if all are True
    np.any()
        returns True if any element is True; returns False only if all are False

import numpy as np


def demo():
    """General-purpose test functions."""
    temp = np.array([[1, 2, 3, 4], [3, 4, 5, 6]])
    # np.all(): False if any element is False, True only if all are True
    print(np.all(temp > 5))  # False
    print(np.all(temp < 15))  # True
    # np.any(): True if any element is True, False only if all are False
    print(np.any(temp > 5))  # True


if __name__ == '__main__':
    # Logical operations: general-purpose test functions
    demo()

Ternary operator

np.where(condition, value where True, value where False)

import numpy as np


def demo():
    """Ternary operator."""
    temp = np.array([[1, 2, 3, 4], [3, 4, 5, 6]])
    # np.where(condition, value where True, value where False)
    # Elements greater than 4 become 100, all others become -100:
    # [[-100 -100 -100 -100]
    #  [-100 -100  100  100]]
    print(np.where(temp > 4, 100, -100))


if __name__ == '__main__':
    # Logical operations: ternary operator
    demo()

Combined with logical and / or / not:

import numpy as np


def demo():
    """Ternary operator combined with logical and / or / not."""
    temp = np.array([[1, 2, 3, 4], [3, 4, 5, 6]])
    # np.logical_and(), np.logical_or() and np.logical_not() perform and, or and not
    print(np.logical_and(temp > 2, temp < 4))  # and
    print(np.logical_or(temp > 2, temp < 3))  # or
    # where combined with or: every element satisfies the condition
    # [[1 1 1 1]
    #  [1 1 1 1]]
    print(np.where(np.logical_or(temp > 2, temp < 3), 1, 0))
    # where combined with and: only the elements equal to 3 satisfy it
    # [[0 0 1 0]
    #  [1 0 0 0]]
    print(np.where(np.logical_and(temp > 2, temp < 4), 1, 0))


if __name__ == '__main__':
    # Logical operations: ternary operator
    demo()

2. Statistical operations

        Statistical functions
            min, max, mean, median, var, std
            np.function_name, e.g. np.max(arr)
            ndarray.method_name, e.g. arr.max()  # pass axis= to reduce along one axis; median exists only as np.median
        Position of the maximum and minimum
            np.argmax(temp, axis=)
            np.argmin(temp, axis=)

Statistical functions: specify the axis

import numpy as np


def demo():
    """Statistical operations."""
    temp = np.array([[1, 2, 3, 4], [3, 4, 5, 6], [5, 6, 7, 8]])
    print(temp.max(axis=0))  # [5 6 7 8], compared column by column
    print(temp.max(axis=1))  # [4 6 8], compared row by row
    print(np.argmax(temp, axis=1))  # [3 3 3], position of the maximum in each row
    print(np.argmin(temp, axis=1))  # [0 0 0], position of the minimum in each row


if __name__ == '__main__':
    # Statistical operations
    demo()
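The demo covers max, argmax and argmin. The remaining functions from the list follow the same axis convention, as this sketch on the same temp array suggests:

```python
import numpy as np

temp = np.array([[1, 2, 3, 4], [3, 4, 5, 6], [5, 6, 7, 8]])

col_min = temp.min(axis=0)            # minimum of each column
row_mean = temp.mean(axis=1)          # mean of each row
row_median = np.median(temp, axis=1)  # median is a function, not an ndarray method

print(col_min)
print(row_mean)
print(row_median)
# Without axis, the statistic is computed over all elements
print(temp.var(), temp.std())
```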

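3. Operations between arrays

1. Array and scalar operations
2. Array and array operations
3. Broadcasting
4. Matrix operations
    1 What is a matrix
        matrix: a two-dimensional array
        matrix & two-dimensional array
        Two ways to store a matrix
            1) ndarray, a two-dimensional array
                matrix multiplication:
                    np.matmul
                    np.dot
            2) the matrix data structure
    2 Matrix multiplication
        Shapes
            (m, n) * (n, l) = (m, l)
        Rule
            A (2, 3) B(3, 2)
            A * B = (2, 2)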

1. Array and scalar operations

import numpy as np


def demo():
    """Array and scalar operations."""
    temp = np.array([[1, 2, 3, 4], [3, 4, 5, 6], [5, 6, 7, 8]])
    print(temp + 10)
    print(temp * 10)


if __name__ == '__main__':
    # Array and scalar operations
    demo()

2. Array and array operations (the shapes must satisfy the broadcasting rules)

import numpy as np


def demo():
    """Array and array operations."""
    arr1 = np.array([[1, 2, 3, 2, 1, 4], [5, 6, 1, 2, 3, 1]])  # 2 rows, 6 columns
    arr2 = np.array([[1, 2, 3, 4], [3, 4, 5, 6]])  # 2 rows, 4 columns
    arr3 = np.array([[1, 2, 3, 2, 1, 4], [5, 6, 1, 2, 3, 1]])  # 2 rows, 6 columns
    arr4 = [2]
    # print(arr1 + arr2) fails: operands could not be broadcast together with shapes (2,6) (2,4)
    print(arr1 + arr3)
    print(arr1 + arr4)


if __name__ == '__main__':
    # Array and array operations
    demo()
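The broadcasting condition can be stated briefly: shapes are compared from the last dimension backwards, and two dimensions are compatible when they are equal or one of them is 1. A minimal sketch with illustrative arrays:

```python
import numpy as np

arr1 = np.arange(12).reshape(2, 6)        # shape (2, 6)
row = np.array([10, 20, 30, 40, 50, 60])  # shape (6,), treated as (1, 6)
col = np.array([[100], [200]])            # shape (2, 1)

row_sum = arr1 + row  # last dimensions match: 6 == 6
col_sum = arr1 + col  # a dimension of size 1 is stretched to 6

print(row_sum.shape, col_sum.shape)

try:
    arr1 + np.ones((2, 4))  # 6 vs 4: neither equal nor 1
except ValueError as exc:
    print("broadcast failed:", exc)
```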

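Matrix operations

1 What is a matrix
    matrix: a two-dimensional array
    matrix & two-dimensional array: a matrix is always stored in the computer as a two-dimensional array, but not every two-dimensional array is a matrix.
    Two ways to store a matrix
        1) ndarray, a two-dimensional array
            matrix multiplication:
                np.matmul
                np.dot
        2) the matrix data structure
2 Matrix multiplication
    Shapes
        (m, n) * (n, l) = (m, l)
    Rule
        A (2, 3) B(3, 2)
        A * B = (2, 2)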

1. What is a matrix

import numpy as np


def demo():
    """Ways to store a matrix."""
    # Option 1: store the matrix as an ndarray
    data = np.array([[80, 86], [82, 80], [85, 78], [90, 90],
                     [86, 82], [82, 90], [78, 80], [92, 94]])
    print(type(data))  # <class 'numpy.ndarray'>

    # Option 2: store the matrix as a matrix object
    # (np.asmatrix replaces np.mat, which is no longer available in NumPy 2.0)
    data_mat = np.asmatrix([[80, 86], [82, 80], [85, 78], [90, 90],
                            [86, 82], [82, 90], [78, 80], [92, 94]])
    print(type(data_mat))  # <class 'numpy.matrix'>


if __name__ == '__main__':
    # Storing a matrix
    demo()

2. Matrix multiplication

       Shapes
             (m, n) * (n, l) = (m, l)
       Rule
             A (2, 3) B(3, 2)
             A * B = (2, 2)

import numpy as np


def demo():
    """Matrix multiplication API."""
    # Option 1: np.matmul()
    data = np.array([[80, 86], [82, 80], [78, 80], [92, 94]])  # (4, 2)
    weight = np.array([[0.5], [0.5]])  # (2, 1)
    print(np.matmul(data, weight))  # (4, 1)

    # Option 2: np.dot()
    data_mat = np.asmatrix([[80, 86], [82, 80], [78, 80], [92, 94]])
    print(np.dot(data_mat, weight))  # (4, 1)

    # Extra: the @ operator multiplies ndarrays as matrices directly
    print(data @ weight)


if __name__ == '__main__':
    # Matrix multiplication API
    demo()
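The A (2, 3) times B (3, 2) case from the shape rule can be worked through with small numbers. The matrices below are made up for the example; each entry of the result is the dot product of a row of A with a column of B.

```python
import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6]])  # shape (2, 3)
B = np.array([[1, 0],
              [0, 1],
              [1, 1]])     # shape (3, 2)

C = A @ B                  # shape (2, 2)

# C[0, 0] computed by hand: row 0 of A times column 0 of B
c00 = sum(A[0, k] * B[k, 0] for k in range(3))

print(C)
print(C.shape, c00 == C[0, 0])
# For 2-D arrays, @, np.dot and np.matmul give the same result
print(np.array_equal(C, np.dot(A, B)), np.array_equal(C, np.matmul(A, B)))
```

Note that for plain ndarrays the * operator multiplies element by element; matrix multiplication needs @, np.matmul or np.dot.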

Merging and splitting

Merging

Splitting

IO operations and data processing

Data preparation: test.csv

id,value1,value2,value3
1,123,1.4,23
2,110,,18
3,,2.1,19

Demo:

import numpy as np


def demo():
    """Read the CSV file."""
    data = np.genfromtxt(r"F:\linear\test.csv", delimiter=",")
    # Strings and missing values are recorded as nan (not a number):
    # [[  nan   nan   nan   nan]
    #  [  1.  123.    1.4  23. ]
    #  [  2.  110.    nan  18. ]
    #  [  3.    nan   2.1  19. ]]
    print(data)


if __name__ == '__main__':
    # Read the file
    demo()
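The header row turns into a row of nan because its text cannot be parsed as numbers. One possible way around it, sketched here as an assumption and not as the article's own method, is the skip_header parameter of np.genfromtxt. The CSV content is held in an in-memory string so the example runs without the file.

```python
import io

import numpy as np

# Same content as test.csv, kept in memory for the sketch
csv_text = "id,value1,value2,value3\n1,123,1.4,23\n2,110,,18\n3,,2.1,19\n"

# skip_header=1 drops the first line before parsing
data = np.genfromtxt(io.StringIO(csv_text), delimiter=",", skip_header=1)

print(data.shape)
# Only the two genuinely missing cells remain as nan
print(np.isnan(data).sum())
```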

Handling missing values

    1. Delete the samples that contain missing values
    2. Replace / impute
            compute the mean of each column and fill the gaps with it

import numpy as np


def fill_nan_by_column_mean():
    """Handle missing values by filling with the column mean."""
    t = np.genfromtxt(r"F:\linear\test.csv", delimiter=",")
    for i in range(t.shape[1]):  # go column by column; shape[1] is the number of columns
        # Count the nan values (nan is the only value not equal to itself)
        nan_num = np.count_nonzero(t[:, i][t[:, i] != t[:, i]])
        if nan_num > 0:
            now_col = t[:, i]
            # Sum of the non-nan values
            now_col_not_nan = now_col[~np.isnan(now_col)].sum()
            # Sum / count
            now_col_mean = now_col_not_nan / (t.shape[0] - nan_num)
            # Fill the nan positions of now_col
            now_col[np.isnan(now_col)] = now_col_mean
            # Assign back to t, updating the current column
            t[:, i] = now_col
    print(t)
    return t


if __name__ == '__main__':
    # Handle missing values: fill with the mean
    fill_nan_by_column_mean()
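Both strategies from the list can also be written without an explicit loop. The sketch below is an alternative formulation, not the article's code: it assumes the data rows of test.csv are already in an array (header excluded) and uses np.nanmean, which ignores nan when averaging.

```python
import numpy as np

nan = np.nan
t = np.array([[1.0, 123.0, 1.4, 23.0],
              [2.0, 110.0, nan, 18.0],
              [3.0, nan, 2.1, 19.0]])

# Strategy 1: keep only the rows that contain no nan
complete_rows = t[~np.isnan(t).any(axis=1)]

# Strategy 2: replace each nan with the mean of the non-nan values in its column
col_means = np.nanmean(t, axis=0)
filled = np.where(np.isnan(t), col_means, t)

print(complete_rows)
print(filled)
```

Deleting rows is simplest but discards two of the three samples here, which is why imputation is often preferred for small data sets.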