1. Official documentation
The official documentation covers installation for each language binding, the official examples, and more.
XGBoost Documentation (xgboost 1.6.0-dev documentation): https://xgboost.readthedocs.io/en/latest/index.html
2. GitHub
The source on GitHub shows how the code is implemented and provides sample data for download.
GitHub - dmlc/xgboost: Scalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library, for Python, R, Java, Scala, C++ and more. Runs on single machine, Hadoop, Spark, Dask, Flink and DataFlow: https://github.com/dmlc/xgboost
3. Maven repository
When configuring xgboost4j in IDEA, choose the dependency that matches your Scala version.
https://mvnrepository.com/artifact/ml.dmlc/xgboost4j
POM file configuration:
See the Spark distributed training example in the official GitHub repository:
https://github.com/dmlc/xgboost/blob/master/jvm-packages/xgboost4j-example/src/main/scala/ml/dmlc/xgboost4j/scala/example/spark/SparkMLlibPipeline.scala
Alternatively, see the example walked through in the official tutorial. It is the same example as the one above; the tutorial explains it, while the link above is the code itself.
XGBoost4J-Spark Tutorial (version 0.9+) (xgboost 1.6.0-dev documentation)
Code examples from other blogs:
[Machine Learning] XGBoost on Spark: A Complete Guide to Distributed Use (小墨鱼's column, CSDN blog)
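Tip 1: For binary classification, do not set "num_class" -> 2; training will fail. Set num_class only for multi-class classification.
Tip 2: To avoid blocking, keep the number of partitions of the training data equal to the requested number of workers, numWorkers. In other words, repartition the training data. Set numWorkers explicitly rather than relying on its default (1 in XGBoost4J-Spark).
Tip 3: Set numWorkers to match the number of executors, and keep executor-cores small.
Tip 4: During training, keep the memory used by each partition within about 1/3 of executor-memory. Besides the memory that holds the training samples, xgboost allocates extra memory for each thread to gain speed, and that memory is not managed by the JVM.
Tip 5: Check the training data. If it contains missing values, XGB4j fails with "training failed".
Tip 6: XGB4j 1.3.0 and later resolve the conflict between Spark's spark.speculation mechanism and the xgboost4j worker logic, but in my experience version 1.2.0 runs more stably than 1.3.1 and 1.5.0.
Tip 7: Set kill_spark_context_on_worker_failure to false. The default is true, which kills the SparkContext whenever a worker fails during training.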
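Several of the tips above can be captured in a small helper so they are not forgotten from job to job. The following is only a minimal sketch in plain Scala, with no Spark dependency: the object name XgbSparkParamsSketch and the helpers buildParams and partitionFitsBudget are hypothetical names introduced here for illustration, and the eta, max_depth and num_round values are simply copied from the examples in this post. The parameter keys themselves are the ones already mentioned in the tips.

```scala
object XgbSparkParamsSketch {

  // Hypothetical helper following tips 1 and 7:
  // "num_class" is added only for multi-class problems, and
  // "kill_spark_context_on_worker_failure" is always switched off.
  def buildParams(numClasses: Int, numWorkers: Int): Map[String, Any] = {
    require(numClasses >= 2, "need at least two classes")
    val common = Map[String, Any](
      "eta" -> 0.1f,
      "max_depth" -> 4,
      "num_round" -> 100,
      "num_workers" -> numWorkers,
      "kill_spark_context_on_worker_failure" -> false
    )
    if (numClasses == 2) common + ("objective" -> "binary:logistic")
    else common + ("objective" -> "multi:softprob") + ("num_class" -> numClasses)
  }

  // Hypothetical helper following tip 4: compares the average partition size
  // (dataset size divided by numWorkers, assuming one partition per worker as in tip 2)
  // with one third of the executor memory. Real partitions may be skewed,
  // so treat the result as a rough check only.
  def partitionFitsBudget(datasetBytes: Long, numWorkers: Int, executorMemoryBytes: Long): Boolean = {
    require(numWorkers > 0, "numWorkers must be positive")
    val bytesPerPartition = datasetBytes / numWorkers
    bytesPerPartition <= executorMemoryBytes / 3
  }

  def main(args: Array[String]): Unit = {
    val numWorkers = 4
    val gib = 1024L * 1024 * 1024
    println(buildParams(numClasses = 2, numWorkers = numWorkers))
    println(buildParams(numClasses = 3, numWorkers = numWorkers))
    println(partitionFitsBudget(8 * gib, numWorkers, 8 * gib))
  }
}
```

In a Spark job, the resulting map would be passed to new XGBoostClassifier(...), and the training DataFrame would be repartitioned with training.repartition(numWorkers) before calling fit, as tip 2 suggests.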
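Note: be careful to distinguish XGBoost's local single-machine mode from XGBoost's Spark distributed mode!
Hands-on Machine Learning Series, Part 5 (Middle): XGBoost + Scala/Spark (a_step_further's blog, CSDN blog)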
import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.feature._
import org.apache.spark.ml.tuning._
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._
import ml.dmlc.xgboost4j.scala.spark.{XGBoostClassifier, XGBoostClassificationModel}

object xgboostExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("xgboost").getOrCreate()

    // 0 - Choose GPU or CPU from the first command-line argument
    val cpuGpu = args(0)
    val (treeMethod, numWorkers) =
      if (cpuGpu == "gpu") ("gpu_hist", 4) else ("auto", 8)

    // 1 - Load the dataset
    val masterPath = "hdfs://gaoToby/ml_data/xgboost_data/"
    val irisPath = masterPath + "iris.data"
    val schema = new StructType(Array(
      StructField("sepal length", DoubleType, true),
      StructField("sepal width", DoubleType, true),
      StructField("petal length", DoubleType, true),
      StructField("petal width", DoubleType, true),
      StructField("class", StringType, true)))
    val rawInput = spark.read.schema(schema).csv(irisPath)

    // Split into training and test datasets
    val Array(training, test) = rawInput.randomSplit(Array(0.8, 0.2), 123)
    training.cache()
    training.show(5, false)

    // 2 - Build the ML pipeline. It has 4 stages:
    //   1. Assemble all features into a single vector column.
    //   2. Convert the string label to an indexed double label.
    //   3. Use XGBoostClassifier to train the classification model.
    //   4. Convert the indexed double label back to the original string label.
    val assembler = new VectorAssembler()
      .setInputCols(Array("sepal length", "sepal width", "petal length", "petal width"))
      .setOutputCol("features")
    val labelIndexer = new StringIndexer()
      .setInputCol("class")
      .setOutputCol("classIndex")
      .fit(training)
    val booster = new XGBoostClassifier(Map(
      "eta" -> 0.1f,
      "max_depth" -> 2,
      "objective" -> "multi:softprob",
      "num_class" -> 3,
      "num_round" -> 100,
      "num_workers" -> numWorkers,
      "tree_method" -> treeMethod))
    booster.setFeaturesCol("features")
    booster.setLabelCol("classIndex")
    val labelConverter = new IndexToString()
      .setInputCol("prediction")
      .setOutputCol("realLabel")
      .setLabels(labelIndexer.labels)
    val pipeline = new Pipeline()
      .setStages(Array(assembler, labelIndexer, booster, labelConverter))

    // 3 - Train the model
    val model = pipeline.fit(training)

    // 4 - Predict
    val prediction = model.transform(test)
    prediction.show(false)

    // 5 - Evaluate the model
    // The evaluator's default metric is f1, so request accuracy explicitly
    // to match the message printed below.
    val evaluator = new MulticlassClassificationEvaluator()
    evaluator.setMetricName("accuracy")
    evaluator.setLabelCol("classIndex")
    evaluator.setPredictionCol("prediction")
    val accuracy = evaluator.evaluate(prediction)
    println("The model accuracy is : " + accuracy)

    // 6 - Tune the model using cross validation
    val paramGrid = new ParamGridBuilder()
      .addGrid(booster.maxDepth, Array(3, 8))
      .addGrid(booster.eta, Array(0.2, 0.6))
      .build()
    val cv = new CrossValidator()
      .setEstimator(pipeline)
      .setEvaluator(evaluator)
      .setEstimatorParamMaps(paramGrid)
      .setNumFolds(3)
    val cvModel = cv.fit(training)
    // Stage index 2 of the pipeline is the XGBoost model
    val bestModel = cvModel.bestModel.asInstanceOf[PipelineModel].stages(2)
      .asInstanceOf[XGBoostClassificationModel]
    println("The params of best XGBoostClassification model : " +
      bestModel.extractParamMap())
    println("The training summary of best XGBoostClassificationModel : " +
      bestModel.summary)

    // 7 - Save the pipeline model
    val pipelineModelPath = "hdfs://gaoToby/model/xgboost/"
    model.write.overwrite().save(pipelineModelPath)

    // 8 - Load the pipeline model
    val model2 = PipelineModel.load(pipelineModelPath)
    model2.transform(test).show(false)

    // 9 - Save as a single-machine model that can be loaded locally.
    // Export the XGBoostClassificationModel as a local XGBoost model,
    // then you can load it back in a local Python environment.
    val nativeModelPath = "/home/gaoToby/xgboost/nativeModel"
    bestModel.nativeBooster.saveModel(nativeModelPath)
  }
}
Results:
(1) Raw training data
(2) Model predictions
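Pipeline stages 2 and 4 in the example above (string label to index, and index back to string label) can be hard to picture from the Spark API alone. The sketch below imitates them with plain Scala collections; it is not Spark's implementation. The object name LabelIndexingSketch, its helpers, the sample labels and the probability vector are all made up for illustration. It assumes StringIndexer's default ordering (most frequent label gets index 0) and, as a further assumption, breaks ties alphabetically.

```scala
object LabelIndexingSketch {

  // Imitates fitting a StringIndexer with its default "frequencyDesc" ordering:
  // the position of a label in the returned array is its index.
  // Ties are broken alphabetically here (an assumption of this sketch).
  def fitLabels(raw: Seq[String]): Array[String] =
    raw.groupBy(identity).toSeq
      .sortBy { case (label, occurrences) => (-occurrences.size, label) }
      .map(_._1)
      .toArray

  // With "multi:softprob" the model yields one probability per class;
  // the predicted class index is the position of the largest probability.
  def argmax(probabilities: Array[Double]): Int =
    probabilities.indices.maxBy(i => probabilities(i))

  def main(args: Array[String]): Unit = {
    val rawLabels = Seq(
      "Iris-setosa", "Iris-virginica", "Iris-virginica",
      "Iris-versicolor", "Iris-virginica", "Iris-setosa")
    val labels = fitLabels(rawLabels)

    // Made-up class probabilities for a single row
    val probabilities = Array(0.1, 0.7, 0.2)
    val predictedIndex = argmax(probabilities)

    // Imitates IndexToString: look the index up in the fitted label array
    println(s"index $predictedIndex -> ${labels(predictedIndex)}")
  }
}
```

This is why the example passes labelIndexer.labels to IndexToString: the label array fitted on the training data is the only place where the mapping between indices and the original strings is recorded.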
import ml.dmlc.xgboost4j.scala.spark.{XGBoostClassificationModel, XGBoostClassifier}
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.sql.SparkSession

/**
 * XGBoost example: binary classification, CPU computation, no Pipeline
 */
object xgboost {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("xgboost").enableHiveSupport().getOrCreate()
    val masterPath = "hdfs://gaoToby/ml_data/xgboost_data/"
    val trainPath = masterPath + "agaricus.txt.train"
    val testPath = masterPath + "agaricus.txt.test"

    // 1 - Load the training and test data
    val trainData = spark.read.format("libsvm").load(trainPath).dropDuplicates()
    trainData.cache()
    trainData.show(5, false)
    val testData = spark.read.format("libsvm").load(testPath)
    testData.cache()
    testData.show(5, false)

    // 2 - Create the model instance
    val xgbClassifier = new XGBoostClassifier(Map(
      "eta" -> 0.1f,
      "missing" -> -999.0,
      "max_depth" -> 4,
      "objective" -> "binary:logistic", // binary classification
      // "num_class" -> 2, // do not set this for binary classification: training fails. Set it only for multi-class.
      "num_round" -> 100,
      "num_workers" -> 4,
      "tree_method" -> "hist", // no GPU available here, so use a CPU method
      "train_test_ratio" -> 0.7))
      .setFeaturesCol("features")
      .setLabelCol("label")
      .setAllowNonZeroForMissing(true) // allow a non-zero value to represent missing

    // 3 - Train the model
    val modelFitted = xgbClassifier.fit(trainData)

    // 4 - Predict: use transform on a DataFrame
    val predicts = modelFitted.transform(testData)
    predicts.show(5, false)

    // To predict a single feature vector, use predict instead:
    // import org.apache.spark.ml.linalg.Vector
    // val features = trainData.head().getAs[Vector]("features")
    // val result: Double = modelFitted.predict(features)
    // println(result)

    // 5 - Evaluate with the binary classification evaluator
    val evaluator = new BinaryClassificationEvaluator()
      .setMetricName("areaUnderROC")
      .setRawPredictionCol("rawPrediction")
      .setLabelCol("label")
    val auc = evaluator.evaluate(predicts)
    // predicts comes from testData, so this is the AUC on the test set
    println(s"xgboost model evaluated on testData, areaUnderROC (AUC): $auc")

    // 6 - Save the model
    val modelPath = "hdfs://gaoToby/model/xgboost/"
    modelFitted.write.overwrite().save(modelPath)

    // 7 - Load the model
    val xgboostModel1 = XGBoostClassificationModel.load(modelPath)
    val testPrediction = xgboostModel1.transform(testData)
    testPrediction.cache()
    testPrediction.show(5, false)
  }
}
Results:
(1) Training data
(2) Model prediction data
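XGBoost parameter reference: XGBoost Parameters (xgboost 1.6.0-dev documentation)
XGBoost parameters explained (lc574260570's blog, CSDN blog)
Paper: XGBoost: A Scalable Tree Boosting System
Hands-on Machine Learning Series, Part 5 (Final): How XGBoost Works (a_step_further's blog, CSDN blog)
Introductory documents on the underlying theory:
Chinese version: https://xgboost.apachecn.org/#/docs/3
English version: Introduction to Boosted Trees (xgboost 1.6.0-dev documentation)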