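This project is about using pandas to analyze and summarize data, with a focus on the groupby function for grouped statistics.
The main points covered in this project are summarized below: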
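Saving files to a specific folder
Data analysis produces new data and charts along the way. To keep the generated files in one directory, create a new folder for them. The code is as follows: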
import os

data_path = '.'
output_path = './output'

# Create the folder if it does not exist yet
if not os.path.exists(output_path):
    os.makedirs(output_path)
This code means: if the directory output_path does not exist, it is created. Here output_path = './output', a folder inside the current directory.
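The same check-then-create pattern can also be written with the exist_ok parameter of os.makedirs, which removes the need for the explicit existence test. A minimal sketch, assuming a hypothetical helper named ensure_dir and a temporary base directory so the example leaves nothing behind:

```python
import os
import tempfile


def ensure_dir(path):
    """Create path (and any missing parents) if it does not exist yet."""
    # exist_ok=True turns the call into a no-op when the directory is already there.
    os.makedirs(path, exist_ok=True)
    return path


# Run inside a throwaway directory (an assumption made only for this sketch).
with tempfile.TemporaryDirectory() as base:
    output_path = ensure_dir(os.path.join(base, "output"))
    created = os.path.isdir(output_path)
    # A second call on the same path must not raise.
    ensure_dir(output_path)
    still_there = os.path.isdir(output_path)

print(created, still_there)
```

Either form works for this project; the explicit os.path.exists check makes the intent more visible, while exist_ok=True is shorter.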
Using groupby flexibly
groupby is a pandas method for computing statistics over groups of rows. Its basic idea is as follows:
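Split step: split the DataFrame into several groups according to the specified key.
Apply step: apply a function to each group, usually an aggregation, transformation, or filtering function.
The basic steps are as follows: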
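The split and apply steps can be sketched by hand on a toy table and compared with what groupby does in a single call. The column names category and value and all the numbers below are made up for illustration and do not come from the coffee data:

```python
import pandas as pd

# Made-up toy data, not taken from coffee_menu.csv.
df = pd.DataFrame({
    "category": ["A", "A", "B", "B", "B"],
    "value": [10, 20, 1, 2, 3],
})

# Split: iterating over a groupby object yields (key, sub-DataFrame) pairs.
groups = {key: group for key, group in df.groupby("category")}

# Apply: run a function on each group by hand.
manual_means = {key: group["value"].mean() for key, group in groups.items()}

# In one call, pandas splits, applies, and combines the per-group
# results into a single Series indexed by the group key.
means = df.groupby("category")["value"].mean()

print(manual_means)
print(means)
```

The manual dictionary and the Series hold the same per-group means; the one-line form is what the rest of this project uses.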
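A simple example: we want to look at the menu of a coffee shop, whose drinks fall into categories such as Classic Espresso Drinks, Frappuccino® Blended Coffee, and Frappuccino® Light Blended Coffee. We first count the drinks in each category and then compute the average calories per category. For example, the Coffee category contains four drinks, each with its own calorie value, and we want the average calorie value for Coffee. This is where the groupby method is useful. The code is as follows: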
import pandas as pd

data_path = './coffee_menu.csv'
df = pd.read_csv(data_path, sep=',', encoding='utf-8')
# print(df.head())

# Average calories per beverage category
Beverage_calories_mean = df.groupby("Beverage_category")["Calories"].mean()
# Number of drinks per beverage category
Beverage_calories_total = df.groupby("Beverage_category")["Calories"].count()

print(Beverage_calories_total)
print(Beverage_calories_mean)
The output is:
Beverage_category
Classic Espresso Drinks              58
Coffee                                4
Frappuccino® Blended Coffee          36
Frappuccino® Blended Crème           13
Frappuccino® Light Blended Coffee    12
Shaken Iced Beverages                18
Signature Espresso Drinks            40
Smoothies                             9
Tazo® Tea Drinks                     52
Name: Calories, dtype: int64

Beverage_category
Classic Espresso Drinks              140.172414
Coffee                                 4.250000
Frappuccino® Blended Coffee          276.944444
Frappuccino® Blended Crème           233.076923
Frappuccino® Light Blended Coffee    162.500000
Shaken Iced Beverages                114.444444
Signature Espresso Drinks            250.000000
Smoothies                            282.222222
Tazo® Tea Drinks                     177.307692
Name: Calories, dtype: float64
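The count and the mean can also be computed in a single pass with agg, which returns one table with a column per statistic. A minimal sketch; the rows below are made up and only reuse the column names Beverage_category and Calories from the example above:

```python
import pandas as pd

# Made-up rows that only mimic the column names used above.
df = pd.DataFrame({
    "Beverage_category": ["Coffee", "Coffee", "Smoothies", "Smoothies", "Smoothies"],
    "Calories": [3, 5, 270, 290, 280],
})

# One groupby, two aggregations: the result is a DataFrame
# with a "count" column and a "mean" column.
summary = df.groupby("Beverage_category")["Calories"].agg(["count", "mean"])

print(summary)
```

Keeping both statistics in one DataFrame can be convenient when they are saved or plotted together, although two separate Series, as in the example above, work just as well.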
With the key steps clear, we also need to save the resulting data and charts. The complete code is as follows:
import os

import matplotlib.pyplot as plt
import pandas as pd

"""
Analysis: number of drinks and average calories per beverage category
"""

data_path = './coffee_menu.csv'
output_path = './output_002'

# Create the output folder if it does not exist yet
if not os.path.exists(output_path):
    os.makedirs(output_path)


def collect_data():
    # Load the menu data
    df = pd.read_csv(data_path, sep=',', encoding='utf-8')
    return df


def analyse_data(df):
    # Count of drinks and mean calories per category
    beverage_categories_total = df.groupby("Beverage_category")["Calories"].count()
    beverage_categories_calories_mean = df.groupby("Beverage_category")["Calories"].mean()
    return beverage_categories_total, beverage_categories_calories_mean


def save_and_show_data(beverage_categories_total, beverage_categories_calories_mean):
    # Save the data
    beverage_categories_total.to_csv(os.path.join(output_path, 'total_categories.csv'))
    beverage_categories_calories_mean.to_csv(os.path.join(output_path, 'mean_calories.csv'))

    # Save the charts
    beverage_categories_total.plot(kind='bar')
    plt.title("total_beverage_categories")
    plt.tight_layout()
    plt.savefig(os.path.join(output_path, 'total_berage_categories.png'))
    plt.show()

    beverage_categories_calories_mean.plot(kind='bar')
    plt.title("mean_beverage_calories")
    plt.tight_layout()
    plt.savefig(os.path.join(output_path, 'mean_bervage_calories.png'))
    plt.show()


def main():
    df = collect_data()
    beverage_categories_total, beverage_categories_calories_mean = analyse_data(df)
    save_and_show_data(beverage_categories_total, beverage_categories_calories_mean)


if __name__ == '__main__':
    main()
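To check what the saved CSV files contain, a grouped result can be written out and read back. A minimal sketch, assuming made-up mean values and a temporary directory in place of the real output folder:

```python
import os
import tempfile

import pandas as pd

# Made-up per-category means standing in for the real groupby result.
mean_calories = pd.Series(
    [4.25, 282.5],
    index=pd.Index(["Coffee", "Smoothies"], name="Beverage_category"),
    name="Calories",
)

with tempfile.TemporaryDirectory() as output_path:
    csv_path = os.path.join(output_path, "mean_calories.csv")
    # header=True writes the index name and the Series name as the first row.
    mean_calories.to_csv(csv_path, header=True)
    # index_col=0 restores the category labels as the index.
    loaded = pd.read_csv(csv_path, index_col=0)

print(loaded)
```

Reading the file back returns a DataFrame with a single Calories column, indexed by category.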
The output tables are:
Project summary:
(End)