Exploring Machine Learning with MLlib

Author: 手艺人生大姑娘 | Source: Internet | 2023-10-14 16:50

Reply with the keyword "pyspark" in the official account backend to get the GitHub address of this project.

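MLlib is Spark's machine learning library. It provides the following main features.

Utilities: linear algebra, statistics, data handling, and similar tools
Feature engineering: feature extraction, feature transformation, feature selection
Common algorithms: classification, regression, clustering, collaborative filtering, dimensionality reduction
Model optimization: model evaluation, hyperparameter tuning

MLlib consists of two separate packages:

pyspark.mllib contains the RDD-based machine learning API. It is in maintenance mode and no longer receives new features, so it is not recommended for new code.

pyspark.ml contains the DataFrame-based machine learning API. It can be used to build machine learning workflows (Pipelines) and is the recommended choice.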

import findspark

# Set spark_home to the directory where Spark was unpacked, and python_path to the Python interpreter
spark_home = "/Users/liangyun/ProgramFiles/spark-3.0.1-bin-hadoop3.2"
python_path = "/Users/liangyun/anaconda3/bin/python"
findspark.init(spark_home, python_path)

import pyspark
from pyspark.sql import SparkSession
from pyspark.storagelevel import StorageLevel

# Many SparkSQL features are exposed as methods on SparkSession
spark = SparkSession.builder \
    .appName("dbscan") \
    .master("local[4]") \
    .enableHiveSupport() \
    .getOrCreate()

sc = spark.sparkContext

I. Basic Concepts of MLlib

DataFrame: the data format used by MLlib. Its columns can hold feature vectors, labels, and raw text or images.

Transformer: has a transform method. It converts one DataFrame into another by appending one or more columns.

Estimator: has a fit method. It takes a DataFrame as input and, after training, produces a Transformer.

Pipeline: has a setStages method. It chains multiple Transformers and Estimators in order; fitting a Pipeline yields a pipeline model (PipelineModel).
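The fit/transform contract described above can be illustrated without Spark. The following is only a toy sketch: every class name in it (LowerCaser, MeanLengthEstimator, LongTextFlagger, ToyPipeline, ToyPipelineModel) is hypothetical and is not part of pyspark.ml, and a "DataFrame" is simply a list of dict rows.

```python
# Toy, Spark-free stand-ins for the MLlib concepts (hypothetical classes).
# A "DataFrame" here is just a list of dict rows.

class LowerCaser:
    """Transformer: transform() returns a new table with one more column."""
    def transform(self, rows):
        return [{**r, "lower": r["text"].lower()} for r in rows]

class LongTextFlagger:
    """Fitted model: a Transformer that carries the learned threshold."""
    def __init__(self, threshold):
        self.threshold = threshold
    def transform(self, rows):
        return [{**r, "is_long": len(r["lower"]) > self.threshold} for r in rows]

class MeanLengthEstimator:
    """Estimator: fit() learns a statistic and returns a Transformer."""
    def fit(self, rows):
        mean_len = sum(len(r["lower"]) for r in rows) / len(rows)
        return LongTextFlagger(mean_len)

class ToyPipelineModel:
    """Result of fitting a pipeline: only Transformers remain."""
    def __init__(self, stages):
        self.stages = stages
    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows

class ToyPipeline:
    """Runs stages in order; Estimators are fitted, then used as Transformers."""
    def __init__(self, stages):
        self.stages = stages
    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            if hasattr(stage, "fit"):
                stage = stage.fit(rows)
            rows = stage.transform(rows)
            fitted.append(stage)
        return ToyPipelineModel(fitted)

train = [{"text": "Spark"}, {"text": "Hello Spark World"}]
model = ToyPipeline([LowerCaser(), MeanLengthEstimator()]).fit(train)
result = model.transform([{"text": "I love Spark MLlib"}])
print(result)
```

The point of the design is that the fitted pipeline contains no Estimators any more, so it can be applied to new data with a single transform call.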

II. Pipeline Example

Task: use a logistic regression model to predict whether a sentence contains the word "spark".

from pyspark.ml.feature import Tokenizer, HashingTF
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import MulticlassClassificationEvaluator, BinaryClassificationEvaluator
from pyspark.ml import Pipeline, PipelineModel
from pyspark.ml.linalg import Vector
from pyspark.sql import Row

1. Prepare the data

dftrain = spark.createDataFrame([
    (0, "a b c d e spark", 1.0),
    (1, "a c f", 0.0),
    (2, "spark hello world", 1.0),
    (3, "hadoop mapreduce", 0.0),
    (4, "I love spark", 1.0),
    (5, "big data", 0.0)
], ["id", "text", "label"])
dftrain.show()

+---+-----------------+-----+
| id|             text|label|
+---+-----------------+-----+
|  0|  a b c d e spark|  1.0|
|  1|            a c f|  0.0|
|  2|spark hello world|  1.0|
|  3| hadoop mapreduce|  0.0|
|  4|     I love spark|  1.0|
|  5|         big data|  0.0|
+---+-----------------+-----+

2. Define the model

tokenizer = Tokenizer().setInputCol("text").setOutputCol("words")
print(type(tokenizer))

hashingTF = HashingTF().setNumFeatures(100) \
    .setInputCol(tokenizer.getOutputCol()) \
    .setOutputCol("features")
print(type(hashingTF))

lr = LogisticRegression().setLabelCol("label")
#print(lr.explainParams())
lr.setFeaturesCol("features").setMaxIter(10).setRegParam(0.2)
print(type(lr))

pipe = Pipeline().setStages([tokenizer, hashingTF, lr])
print(type(pipe))

3. Train the model

model = pipe.fit(dftrain)
print(type(model))

4. Use the model

dftest = spark.createDataFrame([
    (7, "spark job", 1.0),
    (9, "hello world", 0.0),
    (10, "a b c d e", 0.0),
    (11, "you can you up", 0.0),
    (12, "spark is easy to use.", 1.0)
]).toDF("id", "text", "label")
dftest.show()

dfresult = model.transform(dftest)
dfresult.selectExpr("text", "features", "probability", "prediction").show()

+---+--------------------+-----+
| id|                text|label|
+---+--------------------+-----+
|  7|           spark job|  1.0|
|  9|         hello world|  0.0|
| 10|           a b c d e|  0.0|
| 11|      you can you up|  0.0|
| 12|spark is easy to ...|  1.0|
+---+--------------------+-----+

+--------------------+--------------------+--------------------+----------+
|                text|            features|         probability|prediction|
+--------------------+--------------------+--------------------+----------+
|           spark job|(100,[57,86],[1.0...|[0.30134853865356...|       1.0|
|         hello world|(100,[60,89],[1.0...|[0.20714372651040...|       1.0|
|           a b c d e|(100,[50,65,67,68...|[0.24502686265469...|       1.0|
|      you can you up|(100,[33,38,51],[...|[0.87589306761045...|       0.0|
|spark is easy to ...|(100,[9,21,60,86,...|[0.07662944406376...|       1.0|
+--------------------+--------------------+--------------------+----------+

5. Evaluate the model

dfresult.printSchema()

# The metric requested here is "f1" (weighted F1 score), not accuracy
evaluator = MulticlassClassificationEvaluator().setMetricName("f1") \
    .setPredictionCol("prediction").setLabelCol("label")
#print(evaluator.explainParams())
f1 = evaluator.evaluate(dfresult)
print("\n f1 = {}".format(f1))

f1 = 0.5666666666666667
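The value 0.5667 printed above comes from an evaluator configured with metricName "f1", so it is a weighted F1 score rather than an accuracy. As a sanity check, the sketch below recomputes it in plain Python from the five labels and predictions shown in the test output (the helper name f1_for_class is made up for this example).

```python
# Labels and predictions copied from the test output above.
labels      = [1.0, 0.0, 0.0, 0.0, 1.0]
predictions = [1.0, 1.0, 1.0, 0.0, 1.0]
pairs = list(zip(labels, predictions))

def f1_for_class(c):
    """Per-class F1 from true positives, false positives and false negatives."""
    tp = sum(1 for y, p in pairs if y == c and p == c)
    fp = sum(1 for y, p in pairs if y != c and p == c)
    fn = sum(1 for y, p in pairs if y == c and p != c)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

n = len(labels)
# Weighted F1: each class's F1 weighted by its share of the true labels.
weighted_f1 = sum(labels.count(c) / n * f1_for_class(c) for c in set(labels))
accuracy = sum(1 for y, p in pairs if y == p) / n

print(weighted_f1)  # approximately 0.5667
print(accuracy)     # 0.6
```

Class 0.0 has an F1 of 0.5 and class 1.0 an F1 of 2/3; weighting them by 3/5 and 2/5 gives about 0.5667, while plain accuracy for the same predictions is 0.6.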

6. Save the model

# A trained model can be saved to disk
model.write().overwrite().save("./data/mymodel.model")

# An unfitted pipeline can be saved to disk as well
#pipe.write().overwrite().save("./data/unfit-lr-model")

# Load the model back
model_loaded = PipelineModel.load("./data/mymodel.model")
model_loaded.transform(dftest).select("text", "label", "prediction").show()

+--------------------+-----+----------+
|                text|label|prediction|
+--------------------+-----+----------+
|           spark job|  1.0|       1.0|
|         hello world|  0.0|       1.0|
|           a b c d e|  0.0|       1.0|
|      you can you up|  0.0|       0.0|
|spark is easy to ...|  1.0|       1.0|
+--------------------+-----+----------+

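III. Feature Engineering

Spark's feature processing functionality lives mainly in the pyspark.ml.feature module and includes the following.

Feature extraction: TF-IDF, Word2Vec, CountVectorizer, FeatureHasher
Feature transformation: OneHotEncoder, Normalizer, Imputer (missing-value imputation), StandardScaler, MinMaxScaler, Tokenizer (splits text into words), StopWordsRemover, SQLTransformer, Bucketizer, Interaction (interaction terms), Binarizer (binarization), NGram, ...
Feature selection: VectorSlicer (vector slicing), RFormula, ChiSqSelector (chi-squared test)
LSH transformation: locality-sensitive hashing is widely used for nearest-neighbor search, clustering, and similar algorithms on massive datasets.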

1. CountVectorizer

CountVectorizer extracts term-frequency features from text.

from pyspark.ml.feature import CountVectorizer, CountVectorizerModel

df = spark.createDataFrame([
    (0, ["a", "b", "c"]),
    (1, ["a", "b", "b", "c", "a"])
], ["id", "words"])

cvModel = CountVectorizer() \
    .setInputCol("words") \
    .setOutputCol("features") \
    .setVocabSize(3) \
    .setMinDF(2) \
    .fit(df)

cvModel.transform(df).show()

2. Word2Vec

Word2Vec uses a shallow neural network to learn word vectors that capture semantic similarity between words.

from pyspark.ml.feature import Word2Vec

df_document = spark.createDataFrame([
    ("Hi I heard about Spark".split(" "), ),
    ("I wish Java could use case classes".split(" "), ),
    ("Logistic regression models are neat".split(" "), )
], ["text"])

word2Vec = Word2Vec(vectorSize=3, minCount=0, inputCol="text", outputCol="result")
model = word2Vec.fit(df_document)

df_vector = model.transform(df_document)
for row in df_vector.collect():
    text, vector = row
    print("text: [%s] => \nvector: %s\n" % (", ".join(text), str(vector)))

text: [Hi, I, heard, about, Spark] =>
vector: [-0.03952452838420868,-0.019742850959300996,-0.04259629175066948]

text: [I, wish, Java, could, use, case, classes] =>
vector: [-0.017589610069990158,0.03303118874984128,-0.03793099456067596]

text: [Logistic, regression, models, are, neat] =>
vector: [-0.03930013366043568,0.08479443639516832,-0.025407366454601288]

3. OneHotEncoder

OneHotEncoder converts categorical features (given as category indices) into one-hot encoded vectors.

from pyspark.ml.feature import OneHotEncoder

df = spark.createDataFrame([
    (0.0, 1.0),
    (1.0, 0.0),
    (2.0, 1.0),
    (0.0, 2.0),
    (0.0, 1.0),
    (2.0, 0.0)
], ["categoryIndex1", "categoryIndex2"])

encoder = OneHotEncoder(inputCols=["categoryIndex1", "categoryIndex2"],
                        outputCols=["categoryVec1", "categoryVec2"])
model = encoder.fit(df)
encoded = model.transform(df)
encoded.show()

+--------------+--------------+-------------+-------------+
|categoryIndex1|categoryIndex2| categoryVec1| categoryVec2|
+--------------+--------------+-------------+-------------+
|           0.0|           1.0|(2,[0],[1.0])|(2,[1],[1.0])|
|           1.0|           0.0|(2,[1],[1.0])|(2,[0],[1.0])|
|           2.0|           1.0|    (2,[],[])|(2,[1],[1.0])|
|           0.0|           2.0|(2,[0],[1.0])|    (2,[],[])|
|           0.0|           1.0|(2,[0],[1.0])|(2,[1],[1.0])|
|           2.0|           0.0|    (2,[],[])|(2,[0],[1.0])|
+--------------+--------------+-------------+-------------+
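In the output above, three categories are encoded as vectors of length 2, and category 2.0 becomes the empty sparse vector (2,[],[]). This is the effect of OneHotEncoder's dropLast parameter, which defaults to True and encodes the last category as all zeros. A plain-Python sketch of that rule (the function name one_hot_drop_last is made up for this example):

```python
def one_hot_drop_last(index, num_categories):
    """One-hot encode `index`, dropping the last category (it becomes all zeros)."""
    size = num_categories - 1
    vec = [0.0] * size
    if index < size:
        vec[index] = 1.0
    return vec

# The categoryIndex1 column from the example above.
column = [0.0, 1.0, 2.0, 0.0, 0.0, 2.0]
num_categories = int(max(column)) + 1  # 3 categories: 0, 1, 2
encoded = [one_hot_drop_last(int(v), num_categories) for v in column]

for v, e in zip(column, encoded):
    print(v, e)
```

Dropping one category keeps the encoded columns linearly independent, which matters for linear models that also fit an intercept.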

4. MinMaxScaler normalization

from pyspark.ml.feature import MinMaxScaler
from pyspark.ml.linalg import Vectors

df = spark.createDataFrame([
    (0, Vectors.dense([1.0, 0.1, -1.0]),),
    (1, Vectors.dense([2.0, 1.1, 1.0]),),
    (2, Vectors.dense([3.0, 10.1, 3.0]),)
], ["id", "features"])

scaler = MinMaxScaler(inputCol="features", outputCol="scaledFeatures")

scalerModel = scaler.fit(df)

df_scaled = scalerModel.transform(df)
print("Features scaled to range: [%f, %f]" % (scaler.getMin(), scaler.getMax()))
df_scaled.select("features", "scaledFeatures").show()

Features scaled to range: [0.000000, 1.000000]
+--------------+--------------+
|      features|scaledFeatures|
+--------------+--------------+
|[1.0,0.1,-1.0]|     (3,[],[])|
| [2.0,1.1,1.0]| [0.5,0.1,0.5]|
|[3.0,10.1,3.0]| [1.0,1.0,1.0]|
+--------------+--------------+
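MinMaxScaler rescales each feature column independently to the target range using (x - E_min) / (E_max - E_min) * (max - min) + min, where E_min and E_max are the column's observed minimum and maximum. The plain-Python sketch below applies that formula to the three vectors above; the helper name min_max_scale is made up for this example.

```python
features = [
    [1.0, 0.1, -1.0],
    [2.0, 1.1, 1.0],
    [3.0, 10.1, 3.0],
]

# Per-column statistics, which is what fit() would learn.
columns = list(zip(*features))
col_min = [min(c) for c in columns]
col_max = [max(c) for c in columns]

def min_max_scale(row, lo=0.0, hi=1.0):
    """Rescale one row column by column to the range [lo, hi]."""
    out = []
    for x, mn, mx in zip(row, col_min, col_max):
        if mx == mn:
            # Constant column: mapped to the middle of the target range.
            out.append(0.5 * (hi + lo))
        else:
            out.append((x - mn) / (mx - mn) * (hi - lo) + lo)
    return out

scaled = [min_max_scale(row) for row in features]
for row in scaled:
    print([round(v, 6) for v in row])
```

The first row maps to all zeros, which Spark prints as the sparse vector (3,[],[]).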

5. MaxAbsScaler normalization

from pyspark.ml.feature import MaxAbsScaler
from pyspark.ml.linalg import Vectors

df = spark.createDataFrame([
    (0, Vectors.dense([1.0, 0.1, -8.0]),),
    (1, Vectors.dense([2.0, 1.0, -4.0]),),
    (2, Vectors.dense([4.0, 10.0, 8.0]),)
], ["id", "features"])

scaler = MaxAbsScaler(inputCol="features", outputCol="scaledFeatures")

scalerModel = scaler.fit(df)

df_rescaled = scalerModel.transform(df)
df_rescaled.select("features", "scaledFeatures").show()

+--------------+--------------------+
|      features|      scaledFeatures|
+--------------+--------------------+
|[1.0,0.1,-8.0]|[0.25,0.010000000...|
|[2.0,1.0,-4.0]|      [0.5,0.1,-0.5]|
|[4.0,10.0,8.0]|       [1.0,1.0,1.0]|
+--------------+--------------------+
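MaxAbsScaler divides each feature column by its largest absolute value, so results fall in [-1, 1] and zeros stay zero (sparsity is preserved). A plain-Python sketch of the same computation on the vectors above:

```python
features = [
    [1.0, 0.1, -8.0],
    [2.0, 1.0, -4.0],
    [4.0, 10.0, 8.0],
]

# Largest absolute value per column, which is what fit() would learn.
max_abs = [max(abs(x) for x in col) for col in zip(*features)]

# Divide every entry by its column's maximum absolute value.
# An all-zero column is left as zeros to avoid dividing by zero.
scaled = [[x / m if m else 0.0 for x, m in zip(row, max_abs)] for row in features]

print(max_abs)
for row in scaled:
    print([round(v, 6) for v in row])
```

Unlike min-max scaling, no offset is subtracted, which is why the sign of -8.0 and -4.0 survives in the scaled output.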

6. SQLTransformer

SQLTransformer transforms a DataFrame using SQL syntax, which is equivalent to registering the DataFrame as a table and querying it.

The difference is that it can be used as a Transformer inside a Pipeline.

from pyspark.ml.feature import SQLTransformer

df = spark.createDataFrame([
    (0, 1.0, 3.0),
    (2, 2.0, 5.0)
], ["id", "v1", "v2"])
sqlTrans = SQLTransformer(
    statement="SELECT *, (v1 + v2) AS v3, (v1 * v2) AS v4 FROM __THIS__")
sqlTrans.transform(df).show()

+---+---+---+---+----+
| id| v1| v2| v3|  v4|
+---+---+---+---+----+
|  0|1.0|3.0|4.0| 3.0|
|  2|2.0|5.0|7.0|10.0|
+---+---+---+---+----+

7. Imputer

The Imputer fills in missing values. Missing values can be represented with float("nan").

from pyspark.ml.feature import Imputer

df = spark.createDataFrame([
    (1.0, float("nan")),
    (2.0, float("nan")),
    (float("nan"), 3.0),
    (4.0, 4.0),
    (5.0, 5.0)
], ["a", "b"])

imputer = Imputer(inputCols=["a", "b"], outputCols=["out_a", "out_b"])
model = imputer.fit(df)

model.transform(df).show()

+---+---+-----+-----+
|  a|  b|out_a|out_b|
+---+---+-----+-----+
|1.0|NaN|  1.0|  4.0|
|2.0|NaN|  2.0|  4.0|
|NaN|3.0|  3.0|  3.0|
|4.0|4.0|  4.0|  4.0|
|5.0|5.0|  5.0|  5.0|
+---+---+-----+-----+
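With its default strategy ("mean"), Imputer replaces each missing value with the mean of the non-missing values in the same column. The plain-Python sketch below reproduces the out_a and out_b columns above; the variable names are chosen for this example only.

```python
import math

nan = float("nan")
rows = [(1.0, nan), (2.0, nan), (nan, 3.0), (4.0, 4.0), (5.0, 5.0)]

# Column means computed over the non-missing values only (the "fit" step).
means = []
for col in zip(*rows):
    present = [x for x in col if not math.isnan(x)]
    means.append(sum(present) / len(present))

# Replace every NaN with its column mean (the "transform" step).
filled = [
    tuple(m if math.isnan(x) else x for x, m in zip(row, means))
    for row in rows
]

print(means)   # [3.0, 4.0]
print(filled)
```

Column a has mean (1 + 2 + 4 + 5) / 4 = 3.0 and column b has mean (3 + 4 + 5) / 3 = 4.0, matching the imputed values in the table.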

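IV. Classification Models

MLlib supports the common classification models: logistic regression, softmax regression, decision trees, random forests, gradient-boosted trees, linear support vector machines, naive Bayes, One-vs-Rest, and the multilayer perceptron. Their interfaces are used in much the same way, so only decision trees, random forests, and gradient-boosted trees are demonstrated below. See the official documentation for more examples.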

1. Decision tree

from pyspark.ml import Pipeline
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.feature import StringIndexer, VectorIndexer
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# Load the data
dfdata = spark.read.format("libsvm").load("data/sample_libsvm_data.txt")
(dftrain, dftest) = dfdata.randomSplit([0.7, 0.3])

# Index the label column, mapping each distinct label to an integer index
labelIndexer = StringIndexer(inputCol="label", outputCol="indexedLabel").fit(dfdata)

# Index categorical features; features with more than 4 distinct values are treated as continuous
featureIndexer = VectorIndexer(inputCol="features", outputCol="indexedFeatures", maxCategories=4).fit(dfdata)

# Build a decision tree model
dt = DecisionTreeClassifier(labelCol="indexedLabel", featuresCol="indexedFeatures")

# Build the pipeline
pipeline = Pipeline(stages=[labelIndexer, featureIndexer, dt])

# Train the pipeline
model = pipeline.fit(dftrain)

# Make predictions
dfpredictions = model.transform(dftest)
dfpredictions.select("prediction", "indexedLabel", "features").show(5)

# Evaluate the model error
evaluator = MulticlassClassificationEvaluator(
    labelCol="indexedLabel", predictionCol="prediction", metricName="accuracy")
accuracy = evaluator.evaluate(dfpredictions)
print("Test Error = %g " % (1.0 - accuracy))

treeModel = model.stages[2]
print(treeModel)

+----------+------------+--------------------+
|prediction|indexedLabel|            features|
+----------+------------+--------------------+
|       1.0|         1.0|(692,[98,99,100,1...|
|       1.0|         1.0|(692,[124,125,126...|
|       1.0|         1.0|(692,[124,125,126...|
|       1.0|         1.0|(692,[125,126,127...|
|       1.0|         1.0|(692,[126,127,128...|
+----------+------------+--------------------+
only showing top 5 rows

Test Error = 0.037037
DecisionTreeClassificationModel: uid=DecisionTreeClassifier_5711dbfcd91e, depth=2, numNodes=5, numClasses=2, numFeatures=692

2. Random forest

from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.feature import IndexToString, StringIndexer, VectorIndexer
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# Load the data
dfdata = spark.read.format("libsvm").load("data/sample_libsvm_data.txt")
(dftrain, dftest) = dfdata.randomSplit([0.7, 0.3])

# Index the label column, mapping each distinct label to an integer index
labelIndexer = StringIndexer(inputCol="label", outputCol="indexedLabel").fit(dfdata)

# Index categorical features
featureIndexer = VectorIndexer(inputCol="features", outputCol="indexedFeatures", maxCategories=4).fit(dfdata)

# Use a random forest model
rf = RandomForestClassifier(labelCol="indexedLabel", featuresCol="indexedFeatures", numTrees=10)

# Convert the predicted indices back to the original labels
labelConverter = IndexToString(inputCol="prediction", outputCol="predictedLabel",
                               labels=labelIndexer.labels)

# Build the pipeline
pipeline = Pipeline(stages=[labelIndexer, featureIndexer, rf, labelConverter])

# Train the pipeline
model = pipeline.fit(dftrain)

# Make predictions
dfpredictions = model.transform(dftest)
dfpredictions.select("predictedLabel", "label", "features").show(5)

# Evaluate the model
evaluator = MulticlassClassificationEvaluator(
    labelCol="indexedLabel", predictionCol="prediction", metricName="accuracy")
accuracy = evaluator.evaluate(dfpredictions)
print("Test Error = %g" % (1.0 - accuracy))

rfModel = model.stages[2]
print(rfModel)

+--------------+-----+--------------------+
|predictedLabel|label|            features|
+--------------+-----+--------------------+
|           0.0|  0.0|(692,[122,123,124...|
|           0.0|  0.0|(692,[124,125,126...|
|           0.0|  0.0|(692,[124,125,126...|
|           0.0|  0.0|(692,[124,125,126...|
|           0.0|  0.0|(692,[124,125,126...|
+--------------+-----+--------------------+
only showing top 5 rows

Test Error = 0
RandomForestClassificationModel: uid=RandomForestClassifier_9d8f7dfec86b, numTrees=10, numClasses=2, numFeatures=692

3. Gradient-boosted trees

from pyspark.ml import Pipeline
from pyspark.ml.classification import GBTClassifier
from pyspark.ml.feature import StringIndexer, VectorIndexer
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# Load the data
dfdata = spark.read.format("libsvm").load("data/sample_libsvm_data.txt")
(dftrain, dftest) = dfdata.randomSplit([0.7, 0.3])

# Index the label column, mapping each distinct label to an integer index
labelIndexer = StringIndexer(inputCol="label", outputCol="indexedLabel").fit(dfdata)

# Index categorical features
featureIndexer = VectorIndexer(inputCol="features", outputCol="indexedFeatures", maxCategories=4).fit(dfdata)

# Use a gradient-boosted tree model
gbt = GBTClassifier(labelCol="indexedLabel", featuresCol="indexedFeatures", maxIter=20)

# Build the pipeline
pipeline = Pipeline(stages=[labelIndexer, featureIndexer, gbt])

# Train the pipeline
model = pipeline.fit(dftrain)

# Make predictions
dfpredictions = model.transform(dftest)
dfpredictions.select("prediction", "indexedLabel", "features").show(5)

# Evaluate the model
evaluator = MulticlassClassificationEvaluator(
    labelCol="indexedLabel", predictionCol="prediction", metricName="accuracy")
accuracy = evaluator.evaluate(dfpredictions)
print("Test Error = %g" % (1.0 - accuracy))

gbtModel = model.stages[2]
print(gbtModel)

+----------+------------+--------------------+
|prediction|indexedLabel|            features|
+----------+------------+--------------------+
|       1.0|         1.0|(692,[95,96,97,12...|
|       1.0|         1.0|(692,[98,99,100,1...|
|       1.0|         1.0|(692,[122,123,148...|
|       1.0|         1.0|(692,[124,125,126...|
|       1.0|         1.0|(692,[124,125,126...|
+----------+------------+--------------------+
only showing top 5 rows

Test Error = 0.0689655
GBTClassificationModel: uid = GBTClassifier_e3d7713552b3, numTrees=20, numClasses=2, numFeatures=692

V. Regression Models

MLlib supports the common regression models, such as linear regression, generalized linear regression, decision tree regression, random forest regression, gradient-boosted tree regression, survival regression, and isotonic regression.

Only linear regression and decision tree regression are shown below.

1. Linear regression

from pyspark.ml.regression import LinearRegression

# Load the data
dfdata = spark.read.format("libsvm") \
    .load("data/sample_linear_regression_data.txt")

# Define the model
lr = LinearRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8)

# Train the model
lrModel = lr.fit(dfdata)

# Model parameters
print("Coefficients: %s" % str(lrModel.coefficients))
print("Intercept: %s" % str(lrModel.intercept))

# Evaluate the model on the training data
trainingSummary = lrModel.summary
print("numIterations: %d" % trainingSummary.totalIterations)
print("objectiveHistory: %s" % str(trainingSummary.objectiveHistory))
trainingSummary.residuals.show()
print("RMSE: %f" % trainingSummary.rootMeanSquaredError)
print("r2: %f" % trainingSummary.r2)

Coefficients: [0.0,0.32292516677405936,-0.3438548034562218,1.9156017023458414,0.05288058680386263,0.765962720459771,0.0,-0.15105392669186682,-0.21587930360904642,0.22025369188813426]
Intercept: 0.1598936844239736
numIterations: 7
objectiveHistory: [0.49999999999999994, 0.4967620357443381, 0.4936361664340463, 0.4936351537897608, 0.4936351214177871, 0.49363512062528014, 0.4936351206216114]
+--------------------+
|           residuals|
+--------------------+
|  -9.889232683103197|
|  0.5533794340053554|
|  -5.204019455758823|
| -20.566686715507508|
|    -9.4497405180564|
|  -6.909112502719486|
|  -10.00431602969873|
|   2.062397807050484|
|  3.1117508432954772|
| -15.893608229419382|
|  -5.036284254673026|
|   6.483215876994333|
|  12.429497299109002|
|  -20.32003219007654|
| -2.0049838218725005|
| -17.867901734183793|
|   7.646455887420495|
| -2.2653482182417406|
|-0.10308920436195645|
|  -1.380034070385301|
+--------------------+
only showing top 20 rows

RMSE: 10.189077
r2: 0.022861

2. Decision tree regression

from pyspark.ml import Pipeline
from pyspark.ml.regression import DecisionTreeRegressor
from pyspark.ml.feature import VectorIndexer
from pyspark.ml.evaluation import RegressionEvaluator

# Load the data
dfdata = spark.read.format("libsvm").load("data/sample_libsvm_data.txt")
(dftrain, dftest) = dfdata.randomSplit([0.7, 0.3])

# Index categorical features
featureIndexer = VectorIndexer(inputCol="features", outputCol="indexedFeatures", maxCategories=4).fit(dfdata)

# Use a decision tree model
dt = DecisionTreeRegressor(featuresCol="indexedFeatures")

# Build the pipeline
pipeline = Pipeline(stages=[featureIndexer, dt])

# Train the pipeline
model = pipeline.fit(dftrain)

# Make predictions
dfpredictions = model.transform(dftest)
dfpredictions.select("prediction", "label", "features").show(5)

# Evaluate the model
evaluator = RegressionEvaluator(
    labelCol="label", predictionCol="prediction", metricName="rmse")
rmse = evaluator.evaluate(dfpredictions)
print("Root Mean Squared Error (RMSE) on test data = %g" % rmse)

treeModel = model.stages[1]
print(treeModel)

+----------+-----+--------------------+
|prediction|label|            features|
+----------+-----+--------------------+
|       0.0|  0.0|(692,[123,124,125...|
|       0.0|  0.0|(692,[124,125,126...|
|       0.0|  0.0|(692,[126,127,128...|
|       0.0|  0.0|(692,[126,127,128...|
|       0.0|  0.0|(692,[126,127,128...|
+----------+-----+--------------------+
only showing top 5 rows

Root Mean Squared Error (RMSE) on test data = 0
DecisionTreeRegressionModel: uid=DecisionTreeRegressor_06213a3aaeb0, depth=2, numNodes=5, numFeatures=692

VI. Clustering Models

MLlib supports relatively few clustering models. The main ones are K-means, the Gaussian mixture model (GMM), bisecting K-means, and latent Dirichlet allocation (LDA).

1. K-means clustering

from pyspark.ml.clustering import KMeans
from pyspark.ml.evaluation import ClusteringEvaluator

# Load the data
dfdata = spark.read.format("libsvm").load("data/sample_kmeans_data.txt")

# Train the K-means model
kmeans = KMeans().setK(2).setSeed(1)
model = kmeans.fit(dfdata)

# Make predictions
dfpredictions = model.transform(dfdata)

# Evaluate the model
evaluator = ClusteringEvaluator()
silhouette = evaluator.evaluate(dfpredictions)
print("Silhouette with squared euclidean distance = " + str(silhouette))

# Print the cluster centers
centers = model.clusterCenters()
print("Cluster Centers: ")
for center in centers:
    print(center)

Silhouette with squared euclidean distance = 0.9997530305375207
Cluster Centers:
[9.1 9.1 9.1]
[0.1 0.1 0.1]

2. Gaussian mixture model

from pyspark.ml.clustering import GaussianMixture

dfdata = spark.read.format("libsvm").load("data/sample_kmeans_data.txt")

gmm = GaussianMixture().setK(2).setSeed(538009335)
model = gmm.fit(dfdata)

print("Gaussians shown as a DataFrame: ")
model.gaussiansDF.show(truncate=True)

Gaussians shown as a DataFrame:
+--------------------+--------------------+
|                mean|                 cov|
+--------------------+--------------------+
|[0.10000000000001...|0.006666666666806...|
|[9.09999999999998...|0.006666666666812...|
+--------------------+--------------------+

3. Bisecting k-means

Bisecting k-means is a top-down hierarchical clustering algorithm. All samples start in a single cluster, which is then split repeatedly in two with K-means until the desired number of clusters is reached.

from pyspark.ml.clustering import BisectingKMeans

dfdata = spark.read.format("libsvm").load("data/sample_kmeans_data.txt")

bkm = BisectingKMeans().setK(2).setSeed(1)
model = bkm.fit(dfdata)

# computeCost is deprecated since Spark 3.0; ClusteringEvaluator is the suggested replacement
cost = model.computeCost(dfdata)
print("Within Set Sum of Squared Errors = " + str(cost))

print("Cluster Centers: ")
centers = model.clusterCenters()
for center in centers:
    print(center)

Within Set Sum of Squared Errors = 0.11999999999994547
Cluster Centers:
[0.1 0.1 0.1]
[9.1 9.1 9.1]
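The "Within Set Sum of Squared Errors" reported above is the sum, over all points, of the squared Euclidean distance to the nearest cluster center. The sketch below recomputes it in plain Python. It assumes that data/sample_kmeans_data.txt holds the six three-dimensional points listed in the code, as in the sample file shipped with Spark; the file contents are not shown in this article, so treat them as an assumption.

```python
# Assumed contents of sample_kmeans_data.txt (six 3-D points).
points = [
    [0.0, 0.0, 0.0], [0.1, 0.1, 0.1], [0.2, 0.2, 0.2],
    [9.0, 9.0, 9.0], [9.1, 9.1, 9.1], [9.2, 9.2, 9.2],
]
# Cluster centers as printed above.
centers = [[0.1, 0.1, 0.1], [9.1, 9.1, 9.1]]

def squared_distance(p, c):
    """Squared Euclidean distance between a point and a center."""
    return sum((a - b) ** 2 for a, b in zip(p, c))

# Each point contributes its squared distance to the closest center.
cost = sum(min(squared_distance(p, c) for c in centers) for p in points)
print(cost)  # approximately 0.12
```

Four of the six points lie 0.1 away from their center in each of the three dimensions, contributing 0.03 each, which gives the total of about 0.12.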

VII. Dimensionality Reduction

The only dimensionality reduction model in pyspark.ml is principal component analysis (PCA). It lives in pyspark.ml.feature and is typically used as a feature preprocessing technique.

from pyspark.ml.feature import PCA
from pyspark.ml.linalg import Vectors

data = [(Vectors.sparse(5, [(1, 1.0), (3, 7.0)]),),
        (Vectors.dense([2.0, 0.0, 3.0, 4.0, 5.0]),),
        (Vectors.dense([4.0, 0.0, 0.0, 6.0, 7.0]),)]
dfdata = spark.createDataFrame(data, ["features"])

pca = PCA(k=3, inputCol="features", outputCol="pcaFeatures")
model = pca.fit(dfdata)

dfresult = model.transform(dfdata).select("pcaFeatures")
dfresult.show(truncate=False)

+-----------------------------------------------------------+
|pcaFeatures                                                |
+-----------------------------------------------------------+
|[1.6485728230883807,-4.013282700516296,-5.524543751369388] |
|[-4.645104331781534,-1.1167972663619026,-5.524543751369387]|
|[-6.428880535676489,-5.337951427775355,-5.524543751369389] |
+-----------------------------------------------------------+

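VIII. Model Optimization

Model optimization is also commonly called model selection or hyperparameter tuning.

MLlib supports hyperparameter tuning by grid search. The relevant classes are in the pyspark.ml.tuning module.

Grid search can be used in two modes: with cross-validation, or with a hold-out validation set.

Cross-validation mode uses K-fold cross-validation. The data is randomly split into K roughly equal parts; each part in turn serves as the validation set while the rest is used for training, and hyperparameters are chosen by the average result over the K validation sets. This is computationally expensive but more reliable.

The hold-out method randomly splits the data once into a training set and a validation set and chooses hyperparameters from that single validation result. It is less reliable than cross-validation but cheaper to compute.

For large datasets the hold-out method is usually chosen; for small datasets cross-validation is preferable.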
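The difference between the two splitting schemes can be sketched on row indices alone. This is a simplified illustration and not Spark's implementation (Spark assigns rows to folds at random, so fold sizes are only approximately equal); the function names k_fold_indices and hold_out_indices are made up for this example.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Yield (train, valid) index lists; each fold is the validation set once."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        valid = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, valid

def hold_out_indices(n, train_ratio=0.8, seed=0):
    """Split the indices once into a training part and a validation part."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    cut = int(n * train_ratio)
    return idx[:cut], idx[cut:]

splits = list(k_fold_indices(12, 3))
for train, valid in splits:
    print(len(train), "train rows,", len(valid), "validation rows")

train, valid = hold_out_indices(10, 0.8)
print(len(train), "train rows,", len(valid), "validation rows")
```

With K folds every row is used for validation exactly once, at the price of K training runs per hyperparameter setting; the hold-out split needs only one.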

1. Cross-validation mode

from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.feature import HashingTF, Tokenizer
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

# Prepare the data
dfdata = spark.createDataFrame([
    (0, "a b c d e spark", 1.0),
    (1, "b d", 0.0),
    (2, "spark f g h", 1.0),
    (3, "hadoop mapreduce", 0.0),
    (4, "b spark who", 1.0),
    (5, "g d a y", 0.0),
    (6, "spark fly", 1.0),
    (7, "was mapreduce", 0.0),
    (8, "e spark program", 1.0),
    (9, "a e c l", 0.0),
    (10, "spark compile", 1.0),
    (11, "hadoop software", 0.0)
], ["id", "text", "label"])

# Build the pipeline: tokenizer, hashingTF, lr
tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features")
lr = LogisticRegression(maxIter=10)
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])

# The whole pipeline is now treated as a single Estimator for hyperparameter tuning
# Build the grid: hashingTF.numFeatures has 3 candidate values and lr.regParam has 2
# The grid therefore has 3*2=6 points to search
paramGrid = ParamGridBuilder() \
    .addGrid(hashingTF.numFeatures, [10, 100, 1000]) \
    .addGrid(lr.regParam, [0.1, 0.01]) \
    .build()

# Create a 5-fold cross-validation tuner
crossval = CrossValidator(estimator=pipeline,
                          estimatorParamMaps=paramGrid,
                          evaluator=BinaryClassificationEvaluator(),
                          numFolds=5)

# fit returns the best model found
cvModel = crossval.fit(dfdata)

# Prepare the data to predict on
test = spark.createDataFrame([
    (4, "spark i j k"),
    (5, "l m n"),
    (6, "mapreduce spark"),
    (7, "apache hadoop")
], ["id", "text"])

# Predict with the best model
prediction = cvModel.transform(test)
selected = prediction.select("id", "text", "probability", "prediction")
for row in selected.collect():
    print(row)

Row(id=4, text='spark i j k', probability=DenseVector([0.2661, 0.7339]), prediction=1.0)
Row(id=5, text='l m n', probability=DenseVector([0.9209, 0.0791]), prediction=0.0)
Row(id=6, text='mapreduce spark', probability=DenseVector([0.4429, 0.5571]), prediction=1.0)
Row(id=7, text='apache hadoop', probability=DenseVector([0.8584, 0.1416]), prediction=0.0)
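To make the cost of this search concrete, the sketch below enumerates the same grid in plain Python and counts the model fits involved. The dictionary keys are just labels for this illustration; the count assumes, as CrossValidator does, one fit per grid point per fold plus a final refit of the best setting on the full dataset.

```python
from itertools import product

# The same candidate values as in the ParamGridBuilder above.
grid = {
    "numFeatures": [10, 100, 1000],
    "regParam": [0.1, 0.01],
}

# Cartesian product of all candidate values: one dict per grid point.
param_maps = [dict(zip(grid, combo)) for combo in product(*grid.values())]
for pm in param_maps:
    print(pm)

num_folds = 5
fits_during_search = len(param_maps) * num_folds  # one fit per point per fold
total_fits = fits_during_search + 1               # plus the final refit of the best setting

print(len(param_maps), fits_during_search, total_fits)
```

Six grid points and five folds already mean 30 pipeline fits during the search, which is why the hold-out mode is attractive for large datasets.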

2. Hold-out mode

from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.regression import LinearRegression
from pyspark.ml.tuning import ParamGridBuilder, TrainValidationSplit

# Prepare the data
dfdata = spark.read.format("libsvm") \
    .load("data/sample_linear_regression_data.txt")
dftrain, dftest = dfdata.randomSplit([0.9, 0.1], seed=12345)

lr = LinearRegression(maxIter=10)

# Build the grid that serves as the hyperparameter search space
paramGrid = ParamGridBuilder() \
    .addGrid(lr.regParam, [0.1, 0.01]) \
    .addGrid(lr.fitIntercept, [False, True]) \
    .addGrid(lr.elasticNetParam, [0.0, 0.5, 1.0]) \
    .build()

# Create the hold-out tuner
tvs = TrainValidationSplit(estimator=lr,
                           estimatorParamMaps=paramGrid,
                           evaluator=RegressionEvaluator(),
                           # 80% of the data is used for training, 20% for validation
                           trainRatio=0.8)

# fit returns the model with the best hyperparameters
model = tvs.fit(dftrain)

# Predict with the model
model.transform(dftest) \
    .select("features", "label", "prediction") \
    .show()

+--------------------+--------------------+--------------------+
|            features|               label|          prediction|
+--------------------+--------------------+--------------------+
|(10,[0,1,2,3,4,5,...| -17.026492264209548| -1.6265106840933026|
|(10,[0,1,2,3,4,5,...|  -16.71909683360509|-0.01129960392982...|
|(10,[0,1,2,3,4,5,...| -15.375857723312297|  0.9008270143746643|
|(10,[0,1,2,3,4,5,...| -13.772441561702871|   3.435609049373433|
|(10,[0,1,2,3,4,5,...| -13.039928064104615|  0.3670260850771136|
|(10,[0,1,2,3,4,5,...|   -9.42898793151394|   -3.26399994121536|
|(10,[0,1,2,3,4,5,...|    -9.2679651250406| -0.1762581278405398|
|(10,[0,1,2,3,4,5,...|  -9.173693798406978| -0.2824541263038875|
|(10,[0,1,2,3,4,5,...| -7.1500991588127265|   3.087239142258043|
|(10,[0,1,2,3,4,5,...|  -6.930603551528371| 0.12389571117374062|
|(10,[0,1,2,3,4,5,...|  -6.456944198081549| -0.7275144195427645|
|(10,[0,1,2,3,4,5,...| -3.2843694575334834| -0.9048235164747517|
|(10,[0,1,2,3,4,5,...|   -1.99891354174786|  0.9588887587748192|
|(10,[0,1,2,3,4,5,...| -0.4683784136986876|  0.6261083785799368|
|(10,[0,1,2,3,4,5,...|-0.44652227528840105| 0.19068393875752507|
|(10,[0,1,2,3,4,5,...| 0.10157453780074743| -0.9062122256799047|
|(10,[0,1,2,3,4,5,...|  0.2105613019270259|   1.225604620956131|
|(10,[0,1,2,3,4,5,...|  2.1214592666251364|  0.2854396644518767|
|(10,[0,1,2,3,4,5,...|  2.8497179990245116|  1.3569268250561075|
|(10,[0,1,2,3,4,5,...|   3.980473021620311|  2.5359695420417965|
+--------------------+--------------------+--------------------+
only showing top 20 rows

IX. Utilities

The pyspark.ml.linalg module provides linear algebra vector and matrix objects.

The pyspark.ml.stat module provides statistical functions such as the chi-squared test and correlation analysis.

1. Vectors and matrices

pyspark.ml.linalg supports the DenseVector, SparseVector, DenseMatrix, and SparseMatrix classes.

Vectors and matrices can also be created with the factory methods provided by Matrices and Vectors.

from pyspark.ml.linalg import DenseVector, SparseVector

# Dense vector
dense_vec = DenseVector([1, 0, 0, 2.0, 0])

print("dense_vec: ", dense_vec)
print("dense_vec.numNonzeros: ", dense_vec.numNonzeros())

# Sparse vector
# Arguments: size, indices of the non-zero entries, values of the non-zero entries
sparse_vec = SparseVector(5, [0, 3], [1.0, 2.0])
print("sparse_vec: ", sparse_vec)

dense_vec:  [1.0,0.0,0.0,2.0,0.0]
dense_vec.numNonzeros:  2
sparse_vec:  (5,[0,3],[1.0,2.0])

dense_vec.toArray()

array([1., 0., 0., 2., 0.])

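from pyspark.ml.linalg import DenseMatrix, SparseMatrix

# Dense matrix
# Arguments: number of rows, number of columns, values (in column-major order), whether transposed (default False)
dense_matrix = DenseMatrix(3, 2, [1, 3, 5, 2, 4, 6])

# Sparse matrix (compressed sparse column format)
# Arguments: number of rows, number of columns, column pointers (the position in the
# value array where each column starts, plus a final entry equal to the number of values),
# row indices, non-zero values
sparse_matrix = SparseMatrix(3, 3, [0, 2, 3, 6],
                             [0, 2, 1, 0, 1, 2], [1.0, 2.0, 3.0, 4.0, 5.0, 6.0])

print("sparse_matrix.toArray(): \n", sparse_matrix.toArray())

sparse_matrix.toArray():
 [[1. 0. 4.]
 [0. 3. 5.]
 [2. 0. 6.]]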
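The three arrays passed to SparseMatrix follow the compressed sparse column (CSC) layout: for column j, the entries from position colPtrs[j] up to (but not including) colPtrs[j+1] in the row-index and value arrays belong to that column. The plain-Python sketch below decodes the same arrays into a dense matrix; csc_to_dense is a hypothetical helper written for this example.

```python
def csc_to_dense(num_rows, num_cols, col_ptrs, row_indices, values):
    """Expand a CSC-encoded sparse matrix into a list of dense rows."""
    dense = [[0.0] * num_cols for _ in range(num_rows)]
    for col in range(num_cols):
        # Entries col_ptrs[col] .. col_ptrs[col + 1] - 1 belong to this column.
        for k in range(col_ptrs[col], col_ptrs[col + 1]):
            dense[row_indices[k]][col] = values[k]
    return dense

dense = csc_to_dense(
    3, 3,
    [0, 2, 3, 6],                      # column pointers
    [0, 2, 1, 0, 1, 2],                # row index of each stored value
    [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],    # stored values, column by column
)
for row in dense:
    print(row)
```

Column 0 holds values 1.0 and 2.0 in rows 0 and 2, column 1 holds 3.0 in row 1, and column 2 holds 4.0, 5.0, 6.0 in rows 0 to 2, which matches the array printed above.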

from pyspark.ml.linalg import Vectors, Matrices

# Factory methods
vec = Vectors.zeros(3)
matrix = Matrices.dense(2, 2, [1, 2, 3, 5])

print(matrix)

DenseMatrix([[1., 3.],
             [2., 5.]])

2. Statistics

# Correlation analysis
from pyspark.ml.linalg import Vectors
from pyspark.ml.stat import Correlation

data = [(Vectors.sparse(4, [(0, 1.0), (3, -2.0)]),),
        (Vectors.dense([4.0, 5.0, 0.0, 3.0]),),
        (Vectors.dense([6.0, 7.0, 0.0, 8.0]),),
        (Vectors.sparse(4, [(0, 9.0), (3, 1.0)]),)]
df = spark.createDataFrame(data, ["features"])

r1 = Correlation.corr(df, "features").head()
print("Pearson correlation matrix:\n" + str(r1[0]))

r2 = Correlation.corr(df, "features", "spearman").head()
print("Spearman correlation matrix:\n" + str(r2[0]))

Pearson correlation matrix:
DenseMatrix([[1.        , 0.05564149,        nan, 0.40047142],
             [0.05564149, 1.        ,        nan, 0.91359586],
             [       nan,        nan, 1.        ,        nan],
             [0.40047142, 0.91359586,        nan, 1.        ]])
Spearman correlation matrix:
DenseMatrix([[1.        , 0.10540926,        nan, 0.4       ],
             [0.10540926, 1.        ,        nan, 0.9486833 ],
             [       nan,        nan, 1.        ,        nan],
             [0.4       , 0.9486833 ,        nan, 1.        ]])
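The entries of the Pearson matrix can be checked by hand. The sketch below recomputes the correlation between feature 0 and feature 1 in plain Python, and also shows why every entry involving feature 2 is nan: that column is all zeros, so its variance is zero and the correlation is undefined. The function name pearson is chosen for this example.

```python
import math

# The four feature vectors from the example above, written out densely.
rows = [
    [1.0, 0.0, 0.0, -2.0],
    [4.0, 5.0, 0.0, 3.0],
    [6.0, 7.0, 0.0, 8.0],
    [9.0, 0.0, 0.0, 1.0],
]
cols = list(zip(*rows))

def pearson(x, y):
    """Pearson correlation coefficient; nan when either input is constant."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return float("nan")
    return cov / math.sqrt(vx * vy)

r01 = pearson(cols[0], cols[1])
r03 = pearson(cols[0], cols[3])
r02 = pearson(cols[0], cols[2])
print(r01, r03, r02)
```

For features 0 and 1 the centered products sum to 2 and the centered sums of squares are 34 and 38, giving 2 / sqrt(34 * 38), about 0.0556.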

# Chi-squared test
from pyspark.ml.linalg import Vectors
from pyspark.ml.stat import ChiSquareTest

data = [(0.0, Vectors.dense(0.5, 10.0)),
        (0.0, Vectors.dense(1.5, 20.0)),
        (1.0, Vectors.dense(1.5, 30.0)),
        (0.0, Vectors.dense(3.5, 30.0)),
        (0.0, Vectors.dense(3.5, 40.0)),
        (1.0, Vectors.dense(3.5, 40.0))]
df = spark.createDataFrame(data, ["label", "features"])

r = ChiSquareTest.test(df, "features", "label").head()
print("pValues: " + str(r.pValues))
print("degreesOfFreedom: " + str(r.degreesOfFreedom))
print("statistics: " + str(r.statistics))

pValues: [0.6872892787909721,0.6822703303362126]
degreesOfFreedom: [2, 3]
statistics: [0.75,1.5]
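ChiSquareTest treats each feature as categorical and runs Pearson's chi-squared independence test of that feature against the label. The statistics and degrees of freedom above can be reproduced from the contingency tables in plain Python; the p-values need the chi-squared distribution and are left out of this sketch. The function name chi_square is made up for this example.

```python
from collections import Counter

labels   = [0.0, 0.0, 1.0, 0.0, 0.0, 1.0]
feature0 = [0.5, 1.5, 1.5, 3.5, 3.5, 3.5]
feature1 = [10.0, 20.0, 30.0, 30.0, 40.0, 40.0]

def chi_square(feature, label):
    """Pearson chi-squared statistic and degrees of freedom for two categorical columns."""
    n = len(label)
    observed = Counter(zip(feature, label))   # contingency table counts
    feature_totals = Counter(feature)
    label_totals = Counter(label)
    stat = 0.0
    for f, f_count in feature_totals.items():
        for l, l_count in label_totals.items():
            expected = f_count * l_count / n  # count expected under independence
            stat += (observed[(f, l)] - expected) ** 2 / expected
    dof = (len(feature_totals) - 1) * (len(label_totals) - 1)
    return stat, dof

stat0, dof0 = chi_square(feature0, labels)
stat1, dof1 = chi_square(feature1, labels)
print(stat0, dof0)  # approximately 0.75, 2
print(stat1, dof1)  # approximately 1.5, 3
```

Feature 0 takes three distinct values and the label two, hence (3 - 1) * (2 - 1) = 2 degrees of freedom; feature 1 takes four distinct values, hence 3.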

If this book has helped you and you would like to encourage the author, remember to give the project a star ⭐️ and share it with your friends!