Building a Decision Tree Model Pipeline in Spark, with Two Validation Methods (Complete Edition)

Author: starry-night--_848 | Source: Internet | 2023-09-18 17:27


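Contents
I. Data Preprocessing
1. Load the data
2. Read the CSV file with SparkSession
3. Clean the data
4. Feature processing
4.1 StringIndexer
4.2 OneHotEncoder
4.3 VectorAssembler
II. Modeling
Decision tree classifier: DecisionTreeClassifier
III. Evaluation (ROC Curve)
IV. Packaging (ML Pipeline)
Step 1. Create the transformers and the estimator used in the pipeline
Step 2. Create the Pipeline instance
Step 3. Process the data and train the model with the Pipeline
Step 4. Predict with the PipelineModel
Step 5. Save and load the PipelineModel
Step 6. Use the loaded model
V. Selecting the Best Model by Validation
5.1 Create a TrainValidationSplit instance
5.2 Cross-validation
VI. Improvement: Random Forest (RF Algorithm)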


I. Data Preprocessing

1. Load the data

# Import packages
import os
import time
from pyspark.sql import SparkSession

# Create the SparkSession and run the Spark program in local mode
spark = SparkSession \
    .builder \
    .appName("PySpark_ML_Pipeline") \
    .master("local[4]") \
    .getOrCreate()

print(spark)
print(spark.sparkContext)

2. Read the CSV file with SparkSession

help(spark.read.csv)

# Read the dataset (tab-separated, with a header row)
raw_df = spark.read.csv('./datas/train.tsv', header=True, sep='\t',
                        inferSchema=True)

# Show the number of rows
print(raw_df.count())  # ==> 7395

raw_df.printSchema()

# There are too many fields, so display only a few of them
raw_df.select('url', 'alchemy_category', 'alchemy_category_score',
              'label').show(10)

3. Clean the data

# Define a function that converts the missing-value marker '?' to '0'
def replace_question_func(x):
    return '0' if x == '?' else x

# Register the function as a UDF
from pyspark.sql.functions import udf
replace_question = udf(replace_question_func)

# col() turns a column name into a Column of the DataFrame
from pyspark.sql.functions import col

# Apply the UDF to every column from index 4 onward and cast the result to double
df = raw_df.select(
    ['url', 'alchemy_category'] +
    [replace_question(col(column)).cast('double').alias(column)
     for column in raw_df.columns[4:]])

df.printSchema()

df.select('url', 'alchemy_category', 'alchemy_category_score',
          'label').show(10)
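
The effect of the cleaning step on a single column can be tried without Spark. The sketch below reuses replace_question_func from the text; the sample values are made up for illustration, and float() stands in for Spark's .cast('double').

```python
def replace_question_func(x):
    """Return '0' for the missing-value marker '?', otherwise the value unchanged."""
    return '0' if x == '?' else x

# Hypothetical raw string values, as one numeric column might look before cleaning
raw_values = ['0.789131', '?', '2.055556', '?']

# Replace the marker first, then convert to float (the analogue of .cast('double'))
cleaned = [float(replace_question_func(v)) for v in raw_values]
print(cleaned)
```

Replacing a missing value with 0 is the simple choice the article makes; it keeps every row but cannot distinguish a real 0 from a missing value.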


# Split the dataset into a training set and a test set
train_df, test_df = df.randomSplit([0.7, 0.3])

print(train_df.cache().count())
print(test_df.cache().count())
"""
5216
2179
"""

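4. Feature processing

1. Convert the alchemy_category categorical feature
   - First feature transformer, StringIndexer: converts string categories into numeric indices.
   - Second feature transformer, OneHotEncoder: converts a numeric category index into a Vector with one slot per category.
2. Combine the features
   - Third feature transformer, VectorAssembler: merges multiple feature columns into a single vector column.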

4.1 StringIndexer

Documentation: http://spark.apache.org/docs/2.2.0/ml-features.html#stringindexer

# Import the module
from pyspark.ml.feature import StringIndexer
help(StringIndexer)

# Create the StringIndexer instance
"""
Parameters:
    inputCol  -> name of the column to convert
    outputCol -> name of the converted column
"""
categoryIndexer = StringIndexer(inputCol='alchemy_category',
                                outputCol='alchemy_category_index')
print(type(categoryIndexer))  # pyspark.ml.feature.StringIndexer

Call the fit method of StringIndexer to obtain a Transformer (a StringIndexerModel).

categoryTransformer = categoryIndexer.fit(df)
print(type(categoryTransformer))

# Use the categoryTransformer to transform all of train_df
df1 = categoryTransformer.transform(train_df)

df1.select('alchemy_category', 'alchemy_category_index').show(10)
"""
+------------------+----------------------+
|  alchemy_category|alchemy_category_index|
+------------------+----------------------+
|                 ?|                   0.0|
|arts_entertainment|                   2.0|
|                 ?|                   0.0|
|          business|                   3.0|
|arts_entertainment|                   2.0|
|                 ?|                   0.0|
|                 ?|                   0.0|
|        recreation|                   1.0|
|          business|                   3.0|
|arts_entertainment|                   2.0|
+------------------+----------------------+
only showing top 10 rows
"""

df1.printSchema()  # inspect the schema
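
By default StringIndexer assigns indices by descending label frequency, so the most frequent label gets 0.0 (which is why '?' maps to 0.0 in the output above). The following is a minimal pure-Python sketch of that idea, not Spark's implementation; fit_string_index and the sample labels are hypothetical, and the alphabetical tie-break is an assumption made only to keep the result deterministic.

```python
from collections import Counter

def fit_string_index(values):
    """Map each distinct label to a float index, most frequent label first."""
    counts = Counter(values)
    # Descending frequency; ties broken alphabetically (assumption, for determinism)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}

# Hypothetical category column
categories = ['?', 'business', '?', 'recreation', '?', 'recreation']

mapping = fit_string_index(categories)          # the "fit" step learns the mapping
indexed = [mapping[c] for c in categories]      # the "transform" step applies it
print(indexed)
```

The split between learning the mapping and applying it mirrors the fit and transform calls in the text.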

4.2 OneHotEncoder

OneHotEncoder converts a numeric category index column into a Vector with one slot per category.

from pyspark.ml.feature import OneHotEncoder

# Create the OneHotEncoder instance
encoder = OneHotEncoder(inputCol='alchemy_category_index',
                        outputCol='alchemy_category_index_vector')
print(type(encoder))  # pyspark.ml.feature.OneHotEncoder

# In Spark 2.x OneHotEncoder is a Transformer, so transform() is called directly.
# In Spark 3.x it is an Estimator: use encoder.fit(df1).transform(df1) instead.
df2 = encoder.transform(df1)
df2.printSchema()

df2.select('alchemy_category', 'alchemy_category_index',
           'alchemy_category_index_vector').show(10)
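
What the encoder produces for one index can be sketched in plain Python. Spark's OneHotEncoder has a dropLast parameter that defaults to True, so the last category is represented by an all-zero vector; the one_hot helper below is a hypothetical illustration of that behaviour using dense lists, whereas Spark returns sparse vectors.

```python
def one_hot(index, num_categories, drop_last=True):
    """Encode a category index as a list with at most a single 1.0."""
    size = num_categories - 1 if drop_last else num_categories
    vector = [0.0] * size
    if index < size:
        vector[int(index)] = 1.0
    return vector

print(one_hot(0.0, 4))                   # first category
print(one_hot(3.0, 4))                   # last category is all zeros when dropped
print(one_hot(3.0, 4, drop_last=False))  # full-width encoding
```

Dropping the last slot avoids a redundant column, since the last category is implied when all other slots are zero.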


4.3 VectorAssembler

Combining the features
Third feature transformer, VectorAssembler: merges multiple feature columns into a single vector column.

from pyspark.ml.feature import VectorAssembler

assembler_inputs = ['alchemy_category_index_vector'] + raw_df.columns[4:-1]
print(assembler_inputs)
"""
['alchemy_category_index_vector', 'alchemy_category_score', 'avglinksize',
 'commonlinkratio_1', 'commonlinkratio_2', 'commonlinkratio_3',
 'commonlinkratio_4', 'compression_ratio', 'embed_ratio', 'framebased',
 'frameTagRatio', 'hasDomainLink', 'linkwordscore', 'news_front_page',
 'non_markup_alphanum_characters', 'numberOfLinks', 'numwords_in_url',
 'parametrizedLinkRatio', 'spelling_errors_ratio']
"""

# Create the VectorAssembler instance, specifying which columns to merge
# and the name of the output column
assembler = VectorAssembler(inputCols=assembler_inputs, outputCol='features')
df3 = assembler.transform(df2)
df3.printSchema()

df3.select('features', 'label').show(5)
"""
+--------------------+-----+
|            features|label|
+--------------------+-----+
|(35,[0,14,15,16,1...|  1.0|
|(35,[2,13,14,15,1...|  1.0|
|(35,[0,14,15,19,2...|  0.0|
|(35,[3,13,14,15,1...|  1.0|
|(35,[2,13,14,15,1...|  0.0|
+--------------------+-----+
only showing top 5 rows
"""

df3.select('features').take(1)
"""
[Row(features=SparseVector(35, {0: 1.0, 14: 2.1446, 15: 0.7969, 16: 0.3945,
 17: 0.332, 18: 0.3203, 19: 0.5022, 22: 0.028, 24: 0.1898, 25: 0.2354,
 26: 1.0, 27: 1.0, 28: 17.0, 30: 10588.0, 31: 256.0, 32: 5.0, 33: 0.3828,
 34: 0.1368}))]
"""
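
The "(35,[0,14,...]" notation above is a sparse vector: its size, the positions of the non-zero entries, and their values. The sketch below shows, in plain Python and with made-up numbers, how a one-hot vector and a few scalar columns could be concatenated and then written in that sparse form. The helpers assemble and to_sparse are hypothetical and only illustrate the idea; they are not Spark's VectorAssembler.

```python
def assemble(*parts):
    """Concatenate scalars and lists into one flat feature list."""
    features = []
    for part in parts:
        if isinstance(part, (list, tuple)):
            features.extend(float(v) for v in part)
        else:
            features.append(float(part))
    return features

def to_sparse(dense):
    """Return (size, indices, values), keeping only the non-zero entries."""
    indices = [i for i, v in enumerate(dense) if v != 0.0]
    values = [dense[i] for i in indices]
    return len(dense), indices, values

category_vector = [1.0, 0.0, 0.0]            # hypothetical one-hot output
dense = assemble(category_vector, 0.0, 2.1446, 0.7969)
print(dense)
print(to_sparse(dense))
```

Zero entries are simply left out of the sparse form, which is why index 13 is missing from the first row shown above.
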
II. Modeling

Decision tree classifier: DecisionTreeClassifier

from pyspark.ml.classification import DecisionTreeClassifier

# Use the decision tree classification algorithm
dtc = DecisionTreeClassifier(featuresCol='features', labelCol='label',
                             impurity='gini', maxDepth=5, maxBins=32)

# Fit the algorithm to the training data
dtc_model = dtc.fit(df3)

# Predict with the model
df4 = dtc_model.transform(df3)
df4.select('label', 'prediction', 'rawPrediction',
           'probability').show(20, truncate=False)

label	prediction	rawPrediction	probability
1.0	1.0	[909.0,1104.0]	[0.45156482861400893,0.5484351713859911]
1.0	1.0	[909.0,1104.0]	[0.45156482861400893,0.5484351713859911]
0.0	0.0	[38.0,1.0]	[0.9743589743589743,0.02564102564102564]
1.0	1.0	[27.0,177.0]	[0.1323529411764706,0.8676470588235294]
0.0	0.0	[95.0,28.0]	[0.7723577235772358,0.22764227642276422]
1.0	1.0	[909.0,1104.0]	[0.45156482861400893,0.5484351713859911]
1.0	1.0	[909.0,1104.0]	[0.45156482861400893,0.5484351713859911]
1.0	0.0	[144.0,95.0]	[0.602510460251046,0.39748953974895396]
0.0	0.0	[363.0,146.0]	[0.7131630648330058,0.2868369351669941]
0.0	0.0	[86.0,23.0]	[0.7889908256880734,0.21100917431192662]
0.0	0.0	[144.0,95.0]	[0.602510460251046,0.39748953974895396]
0.0	0.0	[144.0,95.0]	[0.602510460251046,0.39748953974895396]
0.0	0.0	[43.0,1.0]	[0.9772727272727273,0.022727272727272728]
1.0	1.0	[909.0,1104.0]	[0.45156482861400893,0.5484351713859911]
1.0	1.0	[909.0,1104.0]	[0.45156482861400893,0.5484351713859911]
1.0	1.0	[27.0,177.0]	[0.1323529411764706,0.8676470588235294]
1.0	1.0	[129.0,417.0]	[0.23626373626373626,0.7637362637362637]
1.0	1.0	[909.0,1104.0]	[0.45156482861400893,0.5484351713859911]
0.0	1.0	[909.0,1104.0]	[0.45156482861400893,0.5484351713859911]
1.0	1.0	[909.0,1104.0]	[0.45156482861400893,0.5484351713859911]

only showing top 20 rows

III. Evaluation (ROC Curve)

from pyspark.ml.evaluation import BinaryClassificationEvaluator

# Create the evaluator instance and pass the column names
evaluator = BinaryClassificationEvaluator(labelCol='label',
                                          rawPredictionCol='rawPrediction')

# Compute the metric; metricName defaults to "areaUnderROC"
auc = evaluator.evaluate(df4)
print(auc)
"""
0.6087142511
"""
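
The area under the ROC curve can be read as the probability that a randomly chosen positive row receives a higher score than a randomly chosen negative row, with ties counted as one half. The sketch below computes it that way on a handful of made-up labels and scores; area_under_roc is a hypothetical helper, and Spark's evaluator derives the value from the ROC curve rather than by comparing pairs.

```python
def area_under_roc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(labels, scores) if y == 1.0]
    negatives = [s for y, s in zip(labels, scores) if y == 0.0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # a tie counts as half
    return wins / (len(positives) * len(negatives))

# Hypothetical labels and predicted probabilities of label 1
labels = [1.0, 1.0, 0.0, 1.0, 0.0]
scores = [0.87, 0.55, 0.55, 0.76, 0.03]
auc_value = area_under_roc(labels, scores)
print(auc_value)
```

Note that the value printed in the text is computed on df4, which comes from the training data, so it measures fit on data the model has already seen.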

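Summary of the development workflow above:
1. Extract feature data from the raw data.
2. Apply the algorithm to the feature data to obtain a model.
3. Use the model to predict.
4. Evaluate the model.

Pipeline: behaves like a single "algorithm", that is, a model learner, and is made up of two kinds of stages:
- a. Estimator: a model learner, used through fit()
- b. Transformer: a converter, used through transform()

pipeline = Pipeline(stages=[...])
model = pipeline.fit(...)
model.transform(...)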
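
The relationship between estimators, transformers, and a pipeline can be sketched with a few toy classes. Everything below (Doubler, MeanCenterer, ToyPipeline and the numbers) is hypothetical and only mimics the pattern; it is not how pyspark.ml.Pipeline is implemented.

```python
class Doubler(object):
    """Toy transformer: needs no fitting."""
    def transform(self, data):
        return [2 * x for x in data]

class MeanCenterer(object):
    """Toy estimator: fit() learns the mean and returns a transformer."""
    def fit(self, data):
        return MeanCentererModel(sum(data) / float(len(data)))

class MeanCentererModel(object):
    def __init__(self, mean):
        self.mean = mean
    def transform(self, data):
        return [x - self.mean for x in data]

class ToyPipeline(object):
    def __init__(self, stages):
        self.stages = stages
    def fit(self, data):
        fitted, current = [], data
        for stage in self.stages:
            if hasattr(stage, 'fit'):          # estimator: fit it on the data so far
                stage = stage.fit(current)
            fitted.append(stage)
            current = stage.transform(current)  # feed the output to the next stage
        return ToyPipelineModel(fitted)

class ToyPipelineModel(object):
    def __init__(self, stages):
        self.stages = stages
    def transform(self, data):
        for stage in self.stages:
            data = stage.transform(data)
        return data

model = ToyPipeline(stages=[Doubler(), MeanCenterer()]).fit([1.0, 2.0, 3.0])
print(model.transform([1.0, 2.0, 3.0]))
print(model.transform([10.0]))
```

The fitted model remembers what was learned during fit (here the mean), so new data passes through exactly the same steps as the training data.
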
IV. Packaging (ML Pipeline)

Step 1. Create the transformers and the estimator used in the pipeline

# 1. Import all required modules
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler
from pyspark.ml.classification import DecisionTreeClassifier

# a. StringIndexer
string_indexer = StringIndexer(inputCol='alchemy_category',
                               outputCol='alchemy_category_index')

# b. OneHotEncoder
one_hot_encoder = OneHotEncoder(inputCol='alchemy_category_index',
                                outputCol='alchemy_category_index_vector')

# c. VectorAssembler
assembler_inputs = ['alchemy_category_index_vector'] + raw_df.columns[4:-1]
vector_assembler = VectorAssembler(inputCols=assembler_inputs,
                                   outputCol='features')

# d. DecisionTreeClassifier, the estimator (model learner)
dt = DecisionTreeClassifier(featuresCol='features', labelCol='label',
                            impurity='gini', maxDepth=5, maxBins=32)

Step 2. Create the Pipeline instance

# List the stages in the order the data is processed
pipeline = Pipeline(stages=[string_indexer, one_hot_encoder,
                            vector_assembler, dt])
pipeline.getStages()
"""
[StringIndexer_43e8b50676a58dad4d05,
 OneHotEncoder_4bf2a31a6b4b12aebd78,
 VectorAssembler_4429bf16ed1cc6c14207,
 DecisionTreeClassifier_451682088ef8fcaa79ae]
"""

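Step 3. Process the data and train the model with the Pipeline

# Call fit to run every stage and train the model
pipleline_model = pipeline.fit(train_df)
type(pipleline_model)  # pyspark.ml.pipeline.PipelineModel

# The last stage is the trained decision tree model
pipleline_model.stages[3]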

Step 4. Predict with the PipelineModel

predict_df = pipleline_model.transform(test_df)

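Step 5. Save and load the PipelineModel

# Save the model. save() fails if the path already exists;
# pipleline_model.write().overwrite().save(path) replaces an existing model.
pipleline_model.save('./datas/dtc-model')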

Step 6. Use the loaded model

# Load the model
from pyspark.ml.pipeline import PipelineModel
load_pipeline_model = PipelineModel.load('./datas/dtc-model')
load_pipeline_model.stages[3]

# Predict
load_pipeline_model.transform(test_df) \
    .select('label', 'prediction', 'rawPrediction', 'probability') \
    .show(20, truncate=False)

label	prediction	rawPrediction	probability
0.0	0.0	[361.0,300.0]	[0.546142208774584,0.45385779122541603]
1.0	0.0	[144.0,95.0]	[0.602510460251046,0.39748953974895396]
0.0	1.0	[0.0,8.0]	[0.0,1.0]
1.0	1.0	[129.0,417.0]	[0.23626373626373626,0.7637362637362637]
0.0	0.0	[363.0,146.0]	[0.7131630648330058,0.2868369351669941]
0.0	0.0	[363.0,146.0]	[0.7131630648330058,0.2868369351669941]
1.0	1.0	[909.0,1104.0]	[0.45156482861400893,0.5484351713859911]
1.0	1.0	[129.0,417.0]	[0.23626373626373626,0.7637362637362637]
1.0	1.0	[27.0,177.0]	[0.1323529411764706,0.8676470588235294]
1.0	1.0	[27.0,177.0]	[0.1323529411764706,0.8676470588235294]
1.0	1.0	[27.0,177.0]	[0.1323529411764706,0.8676470588235294]
1.0	1.0	[27.0,177.0]	[0.1323529411764706,0.8676470588235294]
1.0	1.0	[27.0,177.0]	[0.1323529411764706,0.8676470588235294]
1.0	1.0	[909.0,1104.0]	[0.45156482861400893,0.5484351713859911]
0.0	0.0	[363.0,146.0]	[0.7131630648330058,0.2868369351669941]
1.0	1.0	[909.0,1104.0]	[0.45156482861400893,0.5484351713859911]
1.0	1.0	[909.0,1104.0]	[0.45156482861400893,0.5484351713859911]
1.0	1.0	[909.0,1104.0]	[0.45156482861400893,0.5484351713859911]
1.0	0.0	[361.0,300.0]	[0.546142208774584,0.45385779122541603]
0.0	0.0	[86.0,23.0]	[0.7889908256880734,0.21100917431192662]

only showing top 20 rows

V. Selecting the Best Model by Validation

5.1 Create a TrainValidationSplit instance

(Select the best model using a single train/validation split.)
Import the modules:

from pyspark.ml.tuning import TrainValidationSplit, ParamGridBuilder

Build a parameter grid for the decision tree classifier

"""
Tune three parameters:
    1. impurity measure (impurity)
    2. maximum depth (maxDepth)
    3. maximum number of bins (maxBins)
"""
param_grid = ParamGridBuilder() \
    .addGrid(dt.impurity, ['gini', 'entropy']) \
    .addGrid(dt.maxDepth, [5, 10, 20]) \
    .addGrid(dt.maxBins, [8, 16, 32]) \
    .build()

print(type(param_grid))
for param in param_grid:
    print(param)

Create a model evaluator for binary classification

binary_class_evaluator = BinaryClassificationEvaluator(
    labelCol='label', rawPredictionCol='rawPrediction')
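
The parameter grid built with ParamGridBuilder above is the Cartesian product of the listed values, so 2 impurity settings, 3 depths, and 3 bin counts give 18 candidate settings. The sketch below reproduces that count with itertools; the dictionary keys are plain strings here, whereas Spark's grid uses Param objects as keys.

```python
from itertools import product

# Same candidate values as in the text, keyed by parameter name
grid = {
    'impurity': ['gini', 'entropy'],
    'maxDepth': [5, 10, 20],
    'maxBins': [8, 16, 32],
}

names = sorted(grid)
combinations = [dict(zip(names, values))
                for values in product(*(grid[name] for name in names))]

print(len(combinations))
```

Every candidate is trained and scored, so the grid size directly multiplies the training cost: one model per candidate for a single train/validation split, and one per candidate per fold for K-fold cross-validation.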

Create the TrainValidationSplit instance

"""
__init__(self, estimator=None, estimatorParamMaps=None, evaluator=None,
         trainRatio=0.75, seed=None)

Parameters:
    estimator: the model learner whose hyperparameters are tuned; here the decision tree dt
    estimatorParamMaps: the parameter combinations to try
    evaluator: the evaluator used to score each model; for binary classification, the area under the ROC curve (AUC)
    trainRatio: the fraction of the data used for training; the rest is used for validation
"""
train_validataion_split = TrainValidationSplit(
    estimator=dt, evaluator=binary_class_evaluator,
    estimatorParamMaps=param_grid, trainRatio=0.8)

type(train_validataion_split)  # pyspark.ml.tuning.TrainValidationSplit

Create a new Pipeline instance

# Use train_validataion_split in place of the original dt
tvs_pipeline = Pipeline(stages=[string_indexer, one_hot_encoder,
                                vector_assembler, train_validataion_split])

# Run data processing and model training with tvs_pipeline (finds the best model)
tvs_pipeline_model = tvs_pipeline.fit(train_df)

best_model = tvs_pipeline_model.stages[3].bestModel
"""
DecisionTreeClassificationModel (uid=DecisionTreeClassifier_451682088ef8fcaa79ae) of depth 20 with 1851 nodes
"""
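
The selection that happens inside fit can be sketched without Spark: split the rows once according to trainRatio, train one model per candidate on the training part, score each on the validation part, and refit the winner on all rows. The code below is a minimal sketch under stated assumptions: the toy nearest-neighbour model, the accuracy metric, the data, and every function name are hypothetical stand-ins for the decision tree and AUC used in the text.

```python
import random

def fit_knn(train, params):
    """Toy model: predict the majority label among the k nearest training points."""
    k = params['k']
    def predict(x):
        nearest = sorted(train, key=lambda row: abs(row[0] - x))[:k]
        votes = sum(label for _, label in nearest)
        return 1 if votes * 2 > len(nearest) else 0
    return predict

def accuracy(model, rows):
    return sum(1 for x, y in rows if model(x) == y) / float(len(rows))

def train_validation_select(rows, candidates, fit, score, train_ratio=0.8, seed=42):
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    train, validation = shuffled[:cut], shuffled[cut:]
    # Train one model per candidate and score it on the held-out part
    scored = [(score(fit(train, params), validation), params) for params in candidates]
    best_score, best_params = max(scored, key=lambda pair: pair[0])
    # Refit the best setting on all rows
    return fit(rows, best_params), best_params, best_score

rows = [(i / 20.0, 1 if i >= 10 else 0) for i in range(20)]
candidates = [{'k': 1}, {'k': 3}, {'k': 5}]
final_model, best_params, best_score = train_validation_select(
    rows, candidates, fit_knn, accuracy)
print(best_params, best_score)
```

Because only one split is used, the result depends on which rows happen to fall into the validation part; this is cheaper than cross-validation but less stable on small datasets.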

Evaluate the best model

predictions_df = tvs_pipeline_model.transform(test_df)
model_auc = binary_class_evaluator.evaluate(predictions_df)
print(model_auc)
"""
0.649609702764
"""

5.2 Cross-validation

"""
__init__(self, estimator=None, estimatorParamMaps=None, evaluator=None,
         numFolds=3, seed=None)

With K-fold cross-validation and K = 3, the data is split into 3 parts A, B, C:
    1. train on A + B, validate on C
    2. train on B + C, validate on A
    3. train on A + C, validate on B
"""
# Import the module
from pyspark.ml.tuning import CrossValidator

# Create the CrossValidator instance and set its parameters
cross_validator = CrossValidator(estimator=dt,
                                 evaluator=binary_class_evaluator,
                                 estimatorParamMaps=param_grid, numFolds=3)

# Create the Pipeline
cv_pipeline = Pipeline(stages=[string_indexer, one_hot_encoder,
                               vector_assembler, cross_validator])

Train and cross-validate with cv_pipeline

cv_pipeline_model = cv_pipeline.fit(train_df)
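
The way K-fold cross-validation rotates the validation part can be shown in a few lines of plain Python. In this sketch the rows are assigned to folds by position, which is an assumption made to keep the result deterministic (Spark assigns rows to folds randomly); k_fold_splits is a hypothetical helper.

```python
def k_fold_splits(rows, num_folds=3):
    """Yield (train, validation) pairs; each fold is the validation part exactly once."""
    folds = [rows[i::num_folds] for i in range(num_folds)]
    for i in range(num_folds):
        validation = folds[i]
        train = [row for j, fold in enumerate(folds) if j != i for row in fold]
        yield train, validation

rows = list(range(9))
splits = list(k_fold_splits(rows, num_folds=3))
for train, validation in splits:
    print("train=%s validation=%s" % (train, validation))
```

Each parameter setting is scored on all K validation parts and the scores are averaged, which makes the comparison between settings less sensitive to one lucky or unlucky split.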

Inspect the best model

best_model = cv_pipeline_model.stages[3].bestModel
"""
DecisionTreeClassificationModel (uid=DecisionTreeClassifier_451682088ef8fcaa79ae) of depth 10 with 527 nodes
"""

Evaluate the best model on the test set

cv_predictions = cv_pipeline_model.transform(test_df)
cv_model_auc = binary_class_evaluator.evaluate(cv_predictions)
print(cv_model_auc)
VI. Improvement: Random Forest (RF Algorithm)

# Import the random forest classification module
from pyspark.ml.classification import RandomForestClassifier

# Create the RandomForestClassifier instance
rfc = RandomForestClassifier(labelCol='label',
                             featuresCol='features',
                             numTrees=10,
                             featureSubsetStrategy="auto",
                             maxDepth=5,
                             maxBins=32,
                             impurity="gini")

# Create the Pipeline instance
rfc_pipeline = Pipeline(stages=[string_indexer, one_hot_encoder,
                                vector_assembler, rfc])

# Train the model on the training data
rfc_pipeline_model = rfc_pipeline.fit(train_df)

# Predict and evaluate on the test data
rfc_predictions = rfc_pipeline_model.transform(test_df)
rfc_model_auc = binary_class_evaluator.evaluate(rfc_predictions)
print(rfc_model_auc)
"""
0.716242043615
"""
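
A random forest combines many trees, each trained on a different random sample of the rows and features, and lets them decide together. The sketch below shows only the combining step, by averaging per-tree class probabilities for one row. It is a simplified picture under stated assumptions, not Spark's implementation; forest_probability and the three probability vectors are hypothetical.

```python
def forest_probability(tree_probabilities):
    """Average the class-probability vectors produced by the individual trees."""
    num_trees = float(len(tree_probabilities))
    num_classes = len(tree_probabilities[0])
    return [sum(p[c] for p in tree_probabilities) / num_trees
            for c in range(num_classes)]

# Hypothetical [P(label=0), P(label=1)] outputs of three trees for one row
tree_outputs = [[0.45, 0.55], [0.60, 0.40], [0.13, 0.87]]

probability = forest_probability(tree_outputs)
prediction = float(probability.index(max(probability)))
print(probability, prediction)
```

Averaging smooths out the errors of individual trees, which is one plausible reason the forest reaches a higher test AUC here than the single decision trees above.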
