XGBoost: Installation, Usage, and Understanding the Algorithm

Author: 幸福的FRN | Source: Internet | 2023-06-13 10:12


I. Key XGBoost resources

1. Official documentation

The official documentation covers installation for each language binding, the official examples, and more.

XGBoost Documentation - xgboost 1.6.0-dev documentation
https://xgboost.readthedocs.io/en/latest/index.html

2. GitHub

The GitHub repository holds the source code and downloadable sample data.

GitHub - dmlc/xgboost: Scalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library, for Python, R, Java, Scala, C++ and more. Runs on single machine, Hadoop, Spark, Dask, Flink and DataFlow
https://github.com/dmlc/xgboost

3. Maven repository

When configuring xgboost4j in IntelliJ IDEA, pick the dependency that matches your Scala version.

https://mvnrepository.com/artifact/ml.dmlc/xgboost4j

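POM configuration:

<properties>
    <spark.version.scala>2.12</spark.version.scala>
    <scala.version>2.12.6</scala.version>
    <spark.version>3.0.2</spark.version>
    <xgboost.version>1.2.0</xgboost.version>
</properties>

<dependency>
    <groupId>ml.dmlc</groupId>
    <artifactId>xgboost4j-spark_${spark.version.scala}</artifactId>
    <version>${xgboost.version}</version>
</dependency>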
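For sbt users, roughly the same dependency could be declared as sketched below. This is an illustration, not part of the original setup: it assumes the version numbers in the POM above mean Scala 2.12.6 and XGBoost 1.2.0. The `%%` operator appends the Scala binary suffix (`_2.12`) to the artifact name, which is one way to follow the advice of choosing the artifact by Scala version.

```scala
// build.sbt (sketch; versions are assumptions taken from the POM values above)
scalaVersion := "2.12.6"

// %% resolves to the artifact xgboost4j-spark_2.12 for a 2.12.x Scala version
libraryDependencies += "ml.dmlc" %% "xgboost4j-spark" % "1.2.0"
```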

II. Using XGBoost

See the Spark distributed-training example in the official GitHub repository:

https://github.com/dmlc/xgboost/blob/master/jvm-packages/xgboost4j-example/src/main/scala/ml/dmlc/xgboost4j/scala/example/spark/SparkMLlibPipeline.scala

Or see the example explained in the official tutorial. It is the same example as above; the tutorial walks through it, while the link above is the code itself.

XGBoost4J-Spark Tutorial (version 0.9+) - xgboost 1.6.0-dev documentation

Code examples from other blogs:

[Machine Learning] The Complete Guide to Distributed XGBoost on Spark, 小墨鱼's column on CSDN

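Practical XGBoost tips:

Tip 1: For binary classification, do not set "num_class" -> 2; training will fail. Set num_class only for multi-class classification.

Tip 2: To avoid blocking, keep the number of training-data partitions equal to the number of requested workers (numWorkers), that is, repartition the training data. The author reports a numWorkers default of 32; upstream XGBoost4J-Spark defaults num_workers to 1, so it is safest to set it explicitly.

Tip 3: numWorkers should match the number of executors, and executor-cores should be kept small.

Tip 4: During training, each partition should take up no more than about 1/3 of executor-memory. Besides the memory needed to hold the training samples, xgboost allocates extra memory for each thread to gain speed, and that memory is outside the JVM's control.

Tip 5: Check the training data. If it contains missing values, XGBoost4J fails with "training failed".

Tip 6: XGBoost4J 1.3.0 and later resolve the conflict between Spark's spark.speculation mechanism and the xgboost4j worker logic, but in the author's experience 1.2.0 runs more stably than 1.3.1 and 1.5.0.

Tip 7: Set kill_spark_context_on_worker_failure to false. The default is true, which kills the Spark application (its SparkContext) when a worker fails during training.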
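Several of these tips can be folded into a small helper. The sketch below is plain Scala with no Spark dependency; the object and method names (`XgbSparkTips`, `buildParams`, `partitionFitsBudget`) are hypothetical, and the parameter values are only placeholders borrowed from the examples in this article.

```scala
object XgbSparkTips {
  /** Builds a parameter map for XGBoostClassifier following tips 1, 3 and 7. */
  def buildParams(numClasses: Int, numWorkers: Int): Map[String, Any] = {
    require(numClasses >= 2, "classification needs at least two classes")
    val base: Map[String, Any] = Map(
      "eta" -> 0.1f,
      "max_depth" -> 4,
      "num_round" -> 100,
      "num_workers" -> numWorkers, // tip 3: match the number of executors
      "kill_spark_context_on_worker_failure" -> false // tip 7
    )
    // Tip 1: only multi-class objectives get num_class
    if (numClasses == 2) base + ("objective" -> "binary:logistic")
    else base + ("objective" -> "multi:softprob") + ("num_class" -> numClasses)
  }

  /** Tip 4: a partition should use at most one third of executor memory. */
  def partitionFitsBudget(partitionBytes: Long, executorMemoryBytes: Long): Boolean =
    partitionBytes * 3 <= executorMemoryBytes
}
```

In a Spark job, the map would be passed to `new XGBoostClassifier(...)`, and the training DataFrame would be repartitioned to the same worker count first, for example `training.repartition(numWorkers)` (tip 2). How to handle missing values (tip 5) depends on the data and is left out here.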

Take care to distinguish XGBoost's local single-machine mode from its distributed Spark mode!
Getting Started with Machine Learning series, Part 5 (second installment): XGBoost + Scala/Spark, a_step_further's blog on CSDN

Code example 1: multi-class classification + pipeline

import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.feature._
import org.apache.spark.ml.tuning._
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._
import ml.dmlc.xgboost4j.scala.spark.{XGBoostClassifier, XGBoostClassificationModel}

object xgboostExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("xgboost").getOrCreate()
    val cpuGpu = args(0)

    // 0 - Choose GPU or CPU training
    val (treeMethod, numWorkers) =
      if (cpuGpu == "gpu") ("gpu_hist", 4) else ("auto", 8)

    // 1 - Load the dataset
    val masterPath = "hdfs://gaoToby/ml_data/xgboost_data/"
    val irisPath = masterPath + "iris.data"
    val schema = new StructType(Array(
      StructField("sepal length", DoubleType, true),
      StructField("sepal width", DoubleType, true),
      StructField("petal length", DoubleType, true),
      StructField("petal width", DoubleType, true),
      StructField("class", StringType, true)))
    val rawInput = spark.read.schema(schema).csv(irisPath)

    // Split into training and test datasets
    val Array(training, test) = rawInput.randomSplit(Array(0.8, 0.2), 123)
    training.cache()
    training.show(5, false)

    // 2 - Build the ML pipeline; it has 4 stages:
    //   1. Assemble all features into a single vector column.
    //   2. Convert the string label to an indexed double label.
    //   3. Train a classification model with XGBoostClassifier.
    //   4. Convert the indexed double label back to the original string label.
    val assembler = new VectorAssembler()
      .setInputCols(Array("sepal length", "sepal width", "petal length", "petal width"))
      .setOutputCol("features")
    val labelIndexer = new StringIndexer()
      .setInputCol("class")
      .setOutputCol("classIndex")
      .fit(training)
    val booster = new XGBoostClassifier(Map(
      "eta" -> 0.1f,
      "max_depth" -> 2,
      "objective" -> "multi:softprob",
      "num_class" -> 3,
      "num_round" -> 100,
      "num_workers" -> numWorkers,
      "tree_method" -> treeMethod))
    booster.setFeaturesCol("features")
    booster.setLabelCol("classIndex")
    val labelConverter = new IndexToString()
      .setInputCol("prediction")
      .setOutputCol("realLabel")
      .setLabels(labelIndexer.labels)
    val pipeline = new Pipeline()
      .setStages(Array(assembler, labelIndexer, booster, labelConverter))

    // 3 - Train the model
    val model = pipeline.fit(training)

    // 4 - Predict
    val prediction = model.transform(test)
    prediction.show(false)

    // 5 - Evaluate
    val evaluator = new MulticlassClassificationEvaluator()
    evaluator.setLabelCol("classIndex")
    evaluator.setPredictionCol("prediction")
    // The evaluator's default metric is f1, so request accuracy explicitly
    evaluator.setMetricName("accuracy")
    val accuracy = evaluator.evaluate(prediction)
    println("The model accuracy is : " + accuracy)

    // 6 - Tune the model using cross validation
    val paramGrid = new ParamGridBuilder()
      .addGrid(booster.maxDepth, Array(3, 8))
      .addGrid(booster.eta, Array(0.2, 0.6))
      .build()
    val cv = new CrossValidator()
      .setEstimator(pipeline)
      .setEvaluator(evaluator)
      .setEstimatorParamMaps(paramGrid)
      .setNumFolds(3)
    val cvModel = cv.fit(training)
    // Stage 2 of the pipeline is the XGBoost model
    val bestModel = cvModel.bestModel.asInstanceOf[PipelineModel].stages(2)
      .asInstanceOf[XGBoostClassificationModel]
    println("The params of best XGBoostClassification model : " +
      bestModel.extractParamMap())
    println("The training summary of best XGBoostClassificationModel : " +
      bestModel.summary)

    // 7 - Save the pipeline model
    val pipelineModelPath = "hdfs://gaoToby/model/xgboost/"
    model.write.overwrite().save(pipelineModelPath)

    // 8 - Load the pipeline model
    val model2 = PipelineModel.load(pipelineModelPath)
    model2.transform(test).show(false)

    // 9 - Export the XGBoostClassificationModel as a local (single-machine)
    // XGBoost model, which can then be loaded back locally, e.g. in Python.
    val nativeModelPath = "/home/gaoToby/xgboost/nativeModel"
    bestModel.nativeBooster.saveModel(nativeModelPath)
  }
}

Results:

(1) Raw training data

(2) Model predictions
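Step 9 of example 1 exports a native model so that it can be used in local single-machine mode. As a rough sketch of that local side, the snippet below loads the file with the plain xgboost4j Scala API (no Spark). It assumes the model file from step 9 exists at that path on the local machine; the feature values and the `mostLikelyClass` helper are made up for illustration.

```scala
import ml.dmlc.xgboost4j.scala.{DMatrix, XGBoost}

object LocalPredict {
  /** Index of the largest probability, i.e. the predicted class index. */
  def mostLikelyClass(probs: Array[Float]): Int = probs.indexOf(probs.max)

  def main(args: Array[String]): Unit = {
    // Path written by bestModel.nativeBooster.saveModel(...) in example 1
    val booster = XGBoost.loadModel("/home/gaoToby/xgboost/nativeModel")

    // One iris row (hypothetical values): sepal length, sepal width,
    // petal length, petal width
    val row = Array(5.1f, 3.5f, 1.4f, 0.2f)
    val dmat = new DMatrix(row, 1, 4, Float.NaN) // 1 row, 4 columns, NaN = missing

    // With multi:softprob, each output row holds one probability per class
    val probs: Array[Array[Float]] = booster.predict(dmat)
    println(probs.head.mkString(", "))
    println("class index: " + mostLikelyClass(probs.head))
  }
}
```

The class index printed here is the StringIndexer index from the pipeline, not the original string label; mapping it back requires the label order used during training.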

Code example 2: binary classification, trained directly without a pipeline

import ml.dmlc.xgboost4j.scala.spark.{XGBoostClassificationModel, XGBoostClassifier}
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.sql.SparkSession

/**
 * XGBoost example: binary classification, CPU training, no Pipeline.
 */
object xgboost {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("xgboost").enableHiveSupport().getOrCreate()
    val masterPath = "hdfs://gaoToby/ml_data/xgboost_data/"
    val trainPath = masterPath + "agaricus.txt.train"
    val testPath = masterPath + "agaricus.txt.test"

    // 1 - Load the training and test data
    val trainData = spark.read.format("libsvm").load(trainPath).dropDuplicates()
    trainData.cache()
    trainData.show(5, false)
    val testData = spark.read.format("libsvm").load(testPath)
    testData.cache()
    testData.show(5, false)

    // 2 - Create the model instance
    val xgbClassifier = new XGBoostClassifier(Map(
      "eta" -> 0.1f,
      "missing" -> -999.0,
      "max_depth" -> 4,
      "objective" -> "binary:logistic", // binary classification
      // "num_class" -> 2, // do not set this for binary classification: training
      //                   // fails. Set it only for multi-class classification.
      "num_round" -> 100,
      "num_workers" -> 4,
      "tree_method" -> "hist", // no GPU available, so use the CPU method
      "train_test_ratio" -> 0.7))
      .setFeaturesCol("features")
      .setLabelCol("label")
      .setAllowNonZeroForMissing(true) // allow a non-zero value to mark missing data

    // 3 - Train the model
    val modelFitted = xgbClassifier.fit(trainData)

    // 4 - Predict: use transform on a DataFrame
    val predicts = modelFitted.transform(testData)
    predicts.show(5, false)

    // To predict a single feature vector, use predict instead:
    // import org.apache.spark.ml.linalg.Vector
    // val features = trainData.head().getAs[Vector]("features")
    // val result: Double = modelFitted.predict(features)
    // println(result)

    // 5 - Evaluate with the binary classification evaluator
    val evaluator = new BinaryClassificationEvaluator()
      .setMetricName("areaUnderROC")
      .setRawPredictionCol("rawPrediction")
      .setLabelCol("label")
    val auc = evaluator.evaluate(predicts)
    println(s"xgboost model areaUnderROC (AUC) on testData: $auc")

    // 6 - Save the model
    val modelPath = "hdfs://gaoToby/model/xgboost/"
    modelFitted.write.overwrite().save(modelPath)

    // 7 - Load the model
    val xgboostModel1 = XGBoostClassificationModel.load(modelPath)
    val testPrediction = xgboostModel1.transform(testData)
    testPrediction.cache()
    testPrediction.show(5, false)
  }
}

Results:

1. Training data

2. Model prediction data

III. Understanding how XGBoost works

Parameter reference: XGBoost Parameters - xgboost 1.6.0-dev documentation

XGBoost parameter explanations, lc574260570's blog on CSDN

Paper: XGBoost: A Scalable Tree Boosting System

Getting Started with Machine Learning series, Part 5 (final installment): XGBoost principles, a_step_further's blog on CSDN

Introductions to the underlying theory:

Chinese: https://xgboost.apachecn.org/#/docs/3

English: Introduction to Boosted Trees - xgboost 1.6.0-dev documentation
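As a small taste of what the paper and the boosted-trees introduction derive, the sketch below computes the two closed-form quantities at the heart of tree construction: the optimal leaf weight w* = -G / (H + lambda) and the gain of a split, 1/2 [G_L^2/(H_L + lambda) + G_R^2/(H_R + lambda) - (G_L + G_R)^2/(H_L + H_R + lambda)] - gamma, where G and H are the sums of first and second order gradients of the samples in a node. It is an illustration only, not XGBoost's implementation; the object and method names are hypothetical and the input numbers are invented.

```scala
object TreeMath {
  /** Optimal leaf weight: w* = -G / (H + lambda). */
  def leafWeight(g: Double, h: Double, lambda: Double): Double =
    -g / (h + lambda)

  /** Structure score of a single node: G^2 / (H + lambda). */
  private def score(g: Double, h: Double, lambda: Double): Double =
    g * g / (h + lambda)

  /** Gain from splitting a node into a left and a right child. */
  def splitGain(gL: Double, hL: Double, gR: Double, hR: Double,
                lambda: Double, gamma: Double): Double =
    0.5 * (score(gL, hL, lambda) + score(gR, hR, lambda) -
      score(gL + gR, hL + hR, lambda)) - gamma

  def main(args: Array[String]): Unit = {
    println(leafWeight(4.0, 3.0, 1.0)) // prints -1.0
    println(splitGain(3.0, 2.0, -6.0, 3.0, lambda = 1.0, gamma = 0.25)) // prints 5.0
  }
}
```

Here `lambda` and `gamma` play the same roles as the parameters of those names in the XGBoost parameter reference: a larger `lambda` shrinks leaf weights, and a split is only worthwhile when its gain stays positive after subtracting `gamma`.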
