Common pandas data-processing function demos: creation, row/column operations, inspection, and file operations

Author: 浅笑二度 | Source: Internet | 2023-10-13 14:01

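pandas is a powerful data analysis tool for Python. Most of the code in this article comes from 10 Minutes to pandas; I re-ran and modified the examples. They cover most everyday operations, although pandas' more advanced features can get the same work done with far less effort. The original is here:
http://pandas.pydata.org/pandas-docs/stable/10min.html
For people new to pandas, table operations such as merge and join are often the biggest headache. The official guide below illustrates how these table operations work with diagrams:
http://pandas.pydata.org/pandas-docs/stable/merging.html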
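
As a small illustration of the merge and join operations mentioned above, here is a minimal sketch. The two tables and their column names (`key`, `lval`, `rval`) are hypothetical and are not taken from the official guide.

```python
import pandas as pd

# Two small hypothetical tables that share a "key" column.
left = pd.DataFrame({"key": ["a", "b", "c"], "lval": [1, 2, 3]})
right = pd.DataFrame({"key": ["b", "c", "d"], "rval": [20, 30, 40]})

# merge matches rows on column values; how= decides which keys survive.
inner = pd.merge(left, right, on="key", how="inner")  # only keys in both tables
outer = pd.merge(left, right, on="key", how="outer")  # keys from either table

# join matches on the index instead, and is a left join by default.
joined = left.set_index("key").join(right.set_index("key"))

print(inner)
print(outer)
print(joined)
```

Rows without a partner in the other table are filled with NaN, which is why `rval` for key `a` is missing in `joined`.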

Creating a DataFrame

DataFrame and Series are the two main data structures in pandas. They are the carriers and the foundation of all data processing.

import numpy as np
import pandas as pd

def create():

    # Create a Series; np.nan marks a missing value
    s = pd.Series([1, 3, 5, np.nan, 6, 8])
    print(s)

    # Create a DataFrame with a datetime index and labeled columns
    dates = pd.date_range('20130101', periods=6)
    df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
    print(df)

    # Create a DataFrame by passing a dict of objects that can be converted to Series-like columns
    df2 = pd.DataFrame({'A': 1.,
                        'B': pd.Timestamp('20130102'),
                        'C': pd.Series(1, index=list(range(4)), dtype='float32'),
                        'D': np.array([3] * 4, dtype='int32'),
                        'E': pd.Categorical(["test", "train", "test", "train"]),
                        'F': 'foo'})
    print(df2)
    # Each column has its own dtype
    print(df2.dtypes)
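
To make the relationship between the two structures concrete, the following sketch builds a DataFrame out of a Series and pulls a column back out. The column names `A` and `B` are only illustrative.

```python
import numpy as np
import pandas as pd

# A Series is a one-dimensional labeled array; NaN marks a missing value.
s = pd.Series([1, 3, 5, np.nan, 6, 8], name="A")

# A DataFrame is a table of columns that all share one index.
df = pd.DataFrame({"A": s, "B": s * 2})

col = df["A"]                    # selecting one column gives back a Series
print(type(col).__name__)        # Series
print(df.index.equals(s.index))  # True: the frame reuses the Series index
print(s.dtype)                   # float64, because NaN forces a float dtype
print(s.count(), len(s))         # 5 6: count() skips NaN, len() does not
```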

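Inspecting DataFrame attributes

After generating data or loading it from a file, the first step is to check that it matches what we need: the number of rows and columns, basic statistics for each column, and so on. This information helps us understand the characteristics of the data and verify that it is correct: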

def see():

    dates = pd.date_range('20130101', periods=6)
    df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
    print(df)

    # See the top and bottom rows of the frame
    print(df.head(2))
    print(df.tail(1))

    # Display the index, the columns, the underlying NumPy data,
    # and the number of rows and columns
    print(df.index)
    print(df.columns)
    print(df.values)
    print(df.shape[0])
    print(df.shape[1])

    # describe() shows a quick statistical summary of the data
    print(df.describe())

    # Transpose the data
    print(df.T)

    # Sort by labels: axis=0 sorts by row labels, axis=1 by column labels;
    # ascending=True gives ascending order, False gives descending order
    print(df.sort_index(axis=0, ascending=False))

    # Sort by the values in column B
    print(df.sort_values(by='B'))

    # Count how often each distinct value occurs in column A
    print(df['A'].value_counts())

    # Check the dtypes, then convert selected columns to another type
    print(df.dtypes)
    df[['A', 'B']] = df[['A', 'B']].astype(float)
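
Because the frame above is random, counting and sorting results are hard to check by eye. The sketch below repeats those steps on a small hypothetical table (the `city` and `score` columns are made up for illustration) so that the outcome can be verified.

```python
import pandas as pd

# Hypothetical data, chosen so the results are easy to verify.
df = pd.DataFrame(
    {"city": ["x", "y", "x", "z", "x"], "score": ["3", "1", "2", "5", "4"]},
    index=list("abcde"),
)

# shape is (number of rows, number of columns).
n_rows, n_cols = df.shape

# value_counts tallies how often each distinct value occurs in a column.
counts = df["city"].value_counts()

# "score" holds strings here, so convert it before sorting numerically.
df["score"] = df["score"].astype(int)
by_score = df.sort_values(by="score", ascending=False)

# sort_index orders by labels: axis=0 uses row labels, axis=1 column labels.
by_label = df.sort_index(axis=0, ascending=False)

print(counts)
print(by_score)
print(by_label)
```

Converting the type first matters: sorting the string column would order the values lexicographically rather than numerically.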

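Selecting data

Once we know the basic shape of the data, we usually need to trim it. In many cases we do not need all of it, so we have to learn how to pick out the rows and columns we are interested in. The following examples show how to do that: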

def selection():

    dates = pd.date_range('20130101', periods=6)
    df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
    print(df)

    # Selecting a single column, which yields a Series, equivalent to df.A
    print(df['A'])
    print(df.A)

    # Selecting via [], which slices the rows
    print(df[0:3])
    print(df['20130102':'20130104'])

    # Selection by label

    # Getting a cross section using a label
    print(df.loc[dates[0]])

    # Selecting on multiple axes by label
    print(df.loc[:, ['A', 'B']])

    # Label slicing: both endpoints are included
    print(df.loc['20130102':'20130104', ['A', 'B']])

    # Getting a scalar value
    print(df.loc[dates[0], 'A'])
    print(df.at[dates[0], 'A'])


    # Selection by position

    # Select via the position of the passed integer
    print(df.iloc[3])

    # By integer slices, acting like NumPy/Python slicing
    print(df.iloc[3:5, 0:2])

    # By lists of integer positions, similar to the NumPy/Python style
    print(df.iloc[[1, 2, 4], [0, 2]])

    # Slicing rows explicitly
    print(df.iloc[1:3, :])

    # Getting a value explicitly
    print(df.iloc[1, 1])
    print(df.iat[1, 1])


    # Boolean indexing

    # Using a single column's values to select rows
    print(df[df.A > 0])

    # Using the isin() method for filtering
    df2 = df.copy()
    df2['E'] = ['one', 'one', 'two', 'three', 'four', 'three']
    print(df2[df2['E'].isin(['two', 'four'])])

    # A where operation for getting
    print(df[df > 0])

    # A where operation with setting; use an all-numeric copy,
    # because comparing the string column E with 0 would raise an error
    df3 = df.copy()
    df3[df3 > 0] = -df3

    # Setting
    # Setting a new column automatically aligns the data by the index
    s1 = pd.Series([1, 2, 3, 4, 5, 6], index=pd.date_range('20130102', periods=6))
    df['F'] = s1
    print(df)

    # Setting values by label and by position
    df.at[dates[0], 'A'] = 0
    df.iat[0, 1] = 0
    print(df)

    # Setting by assigning a NumPy array
    df.loc[:, 'D'] = np.array([5] * len(df))
    print(df)
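
Two behaviours in the selection code are easy to miss: label slices include both endpoints while position slices exclude the stop, and assigning a Series aligns on the index rather than on position. A minimal sketch with deterministic values (instead of the random data used above) shows both:

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=6)
df = pd.DataFrame(np.arange(24.0).reshape(6, 4), index=dates, columns=list("ABCD"))

# Label slices include both endpoints: 3 rows (Jan 2, 3 and 4).
by_label = df.loc["20130102":"20130104", ["A", "B"]]

# Position slices exclude the stop position: 2 rows (positions 1 and 2).
by_pos = df.iloc[1:3, 0:2]

# Assigning a Series aligns on the index, not on position.
# s1 starts one day later than df, so the first row gets NaN
# and the Series value for 2013-01-07 is dropped.
s1 = pd.Series([1, 2, 3, 4, 5, 6], index=pd.date_range("20130102", periods=6))
df["F"] = s1

print(by_label.shape)     # (3, 2)
print(by_pos.shape)       # (2, 2)
print(df["F"].tolist())   # [nan, 1.0, 2.0, 3.0, 4.0, 5.0]
```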

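File operations

Often we do not generate the data ourselves but read it from files, and data files come from all kinds of sources. The example below shows how to load and save data. pandas provides reader and writer APIs for many formats such as txt and csv, which is enough for most data analysis needs. libsvm files are not read by pandas itself; they need a separate loader such as scikit-learn's load_svmlight_file.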

def file():
    ts = pd.Series(np.random.randn(1000), index=pd.date_range('1/1/2000', periods=1000))
    df = pd.DataFrame(np.random.randn(1000, 4), index=ts.index,
                      columns=['A', 'B', 'C', 'D'])
    # Write the file first so that it exists, then read it back
    df.to_csv('foo.csv')
    print(pd.read_csv('foo.csv'))
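
One detail worth knowing when saving and reloading: `to_csv` writes the index as the first column, and a plain `read_csv` reads it back as an ordinary column. The sketch below, which writes to a temporary directory as an assumption so that nothing is left on disk, shows how `index_col` and `parse_dates` restore the datetime index.

```python
import os
import tempfile

import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.arange(12.0).reshape(3, 4),
    index=pd.date_range("2000-01-01", periods=3),
    columns=["A", "B", "C", "D"],
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "foo.csv")
    df.to_csv(path)  # the index is written as the first, unnamed column

    # Without options, the old index comes back as a regular column.
    naive = pd.read_csv(path)

    # index_col=0 restores the index; parse_dates=True parses it as dates.
    restored = pd.read_csv(path, index_col=0, parse_dates=True)

print(naive.columns.tolist())  # ['Unnamed: 0', 'A', 'B', 'C', 'D']
print(type(restored.index).__name__)
```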
