Mahout Random Forest Source Code Analysis (2-2)

Author: WYZ的小舟于SZ | Source: Internet | 2023-09-13 14:34

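Mahout version: 0.7, Hadoop version: 1.0.4, JDK: 1.7.0_25 64-bit.

Today we reach the main Mapper work of BuildForest. As mentioned earlier, all of BuildForest's main work happens in the Mapper, and there is no reducer. This post covers that Mapper, Step1Mapper. First, here is my imitation of its code: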

package mahout.fansy.partial;

import java.io.IOException;
import java.util.List;
import java.util.Random;

import mahout.fansy.utils.read.ReadText;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.mahout.classifier.df.Bagging;
import org.apache.mahout.classifier.df.builder.DecisionTreeBuilder;
import org.apache.mahout.classifier.df.data.Data;
import org.apache.mahout.classifier.df.data.DataConverter;
import org.apache.mahout.classifier.df.data.Dataset;
import org.apache.mahout.classifier.df.data.Instance;
import org.apache.mahout.classifier.df.mapreduce.Builder;
import org.apache.mahout.classifier.df.mapreduce.MapredOutput;
import org.apache.mahout.classifier.df.mapreduce.partial.TreeID;
import org.apache.mahout.classifier.df.node.Node;
import org.apache.mahout.common.RandomUtils;

import com.google.common.base.Preconditions;
import com.google.common.collect.Lists;

/**
 * A standalone imitation of Step1Mapper
 * @author fansy
 */
public class Step1MapperFollow {
  private DataConverter converter;
  private Random rng;
  private int nbTrees;
  private int firstTreeId;
  private int partition;
  private final List<Instance> instances = Lists.newArrayList();
  private Configuration conf;
  private Path datasetPath;
  private Path input;
//  private Path output;
  private List<Text> values;
  private Dataset dataset;
  private DecisionTreeBuilder treeBuilder;
  private int m; // number of attributes to select at each node (0 means use the default)
  
  public static void main(String[] args) throws IOException, InterruptedException{
  Step1MapperFollow s1m=new Step1MapperFollow();
  s1m.init();
  s1m.setup();
  s1m.map();
  s1m.cleanup();
  }
  /*
   * This function must be called first when running this class
   */
  private void init() throws IOException{
  datasetPath=new Path("hdfs://ubuntu:9000/user/breiman/glass.info");
  input=new Path("hdfs://ubuntu:9000/user/breiman/input/glass.data");
//  output=new Path("hdfs://ubuntu:9000/user/breiman/output-forest");
  
 
  treeBuilder = new DecisionTreeBuilder();
  treeBuilder.setM(m);
  treeBuilder.setComplemented(true);
  conf=new Configuration();
  conf.set("mapred.job.tracker", "ubuntu:9001");
  // load the dataset into memory
  DistributedCache.addCacheFile(datasetPath.toUri(), conf);
  dataset=Dataset.load(conf,datasetPath);
  values=getData();
  }
private List<Text> getData() throws IOException {

return ReadText.readText(input, conf);
}
/*
 * Imitation of the setup function
 */
public void setup() throws IOException{

//configure(Builder.getRandomSeed(conf), conf.getInt("mapred.task.partition", -1),
//      Builder.getNumMaps(conf), Builder.getNbTrees(conf));
// conf.getInt("mapred.task.partition", -1) can simply be set to 0
// the other arguments follow the commented-out call above
configure(Builder.getRandomSeed(conf), 0,
      1, 10);
}


/*
 * Imitation of the map function
 */
protected void map() throws IOException {
//List values =ReadText.readText(input, conf);
for(Text value:values){
String[] v=value.toString().split(",");
if(v[10].equals("2")){
//System.out.println(v[10]);
}
instances.add(converter.convert(value.toString()));
}
  }

/*
 * Imitation of the cleanup function
 */
 protected void cleanup() throws IOException, InterruptedException {
    // prepare the data
    
    Data data = new Data(dataset, instances);
    Bagging bagging = new Bagging(treeBuilder, data);
    
    TreeID key = new TreeID();
    
    for (int treeId = 0; treeId < nbTrees; treeId++) {
      Node tree = bagging.build(rng);
      
      key.set(partition, firstTreeId + treeId);
      
  //    if (!isNoOutput()) {
        MapredOutput emOut = new MapredOutput(tree);
        System.out.println("key:"+key+"***value:"+emOut);
   //     context.write(key, emOut);
  //    }
    }
 }

protected void configure(Long seed, int partition, int numMapTasks, int numTrees) throws IOException {
    converter = new DataConverter(dataset);
    // prepare random-number generator
    if (seed == null) {
      rng = RandomUtils.getRandom();
    } else {
      rng = RandomUtils.getRandom(seed);
    }
    
    // mapper's partition
    Preconditions.checkArgument(partition >= 0, "Wrong partition ID");
    this.partition = partition;
    
    // compute number of trees to build
    nbTrees = nbTrees(numMapTasks, numTrees, partition);
    
    // compute first tree id
    firstTreeId = 0;
    for (int p = 0; p < partition; p++) {
      firstTreeId += nbTrees(numMapTasks, numTrees, p);
    }
  }

 public static int nbTrees(int numMaps, int numTrees, int partition) {
    int nbTrees = numTrees / numMaps;
    if (partition == 0) {
      nbTrees += numTrees - nbTrees * numMaps;
    }
    return nbTrees;
 }
}

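(1) setup function

Strictly speaking, this function should also include everything in init. What gets set here is mainly: the Random seed, nbTrees (the number of decision trees), the path of the dataset and the path of the data. The data is read into the values collection, the dataset descriptor is read into the dataset variable, a treeBuilder is created and its properties are set, and a converter is created.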
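To make the nbTrees and firstTreeId bookkeeping concrete, here is a minimal sketch of how the trees are divided among map partitions. The nbTrees arithmetic is copied from the listing; the class name, the firstTreeId helper and the values 3 and 10 are only illustrative assumptions.

```java
class TreeAllocationSketch {
  // Same arithmetic as nbTrees in the listing: partition 0 absorbs the remainder.
  static int nbTrees(int numMaps, int numTrees, int partition) {
    int nbTrees = numTrees / numMaps;
    if (partition == 0) {
      nbTrees += numTrees - nbTrees * numMaps;
    }
    return nbTrees;
  }

  // The first tree id of a partition is the number of trees built by all earlier partitions.
  static int firstTreeId(int numMaps, int numTrees, int partition) {
    int first = 0;
    for (int p = 0; p < partition; p++) {
      first += nbTrees(numMaps, numTrees, p);
    }
    return first;
  }

  public static void main(String[] args) {
    int numMaps = 3;   // hypothetical number of map tasks
    int numTrees = 10; // hypothetical forest size
    for (int p = 0; p < numMaps; p++) {
      System.out.println("partition " + p + ": " + nbTrees(numMaps, numTrees, p)
          + " trees, first id " + firstTreeId(numMaps, numTrees, p));
    }
  }
}
```

With the arguments used in this post (one map task, 10 trees, partition 0), the single partition builds all 10 trees starting from id 0.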

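(2) map function

The map function iterates over every input line, uses the converter to convert it, and adds the result to instances. Look at the instances variable first. It is declared as List<Instance>, so it is a list. The Instance class has a single field, a Vector, plus a few methods, so an Instance really just stores a Vector; I am not sure why the extra Instance wrapper is needed instead of using Vector directly. Next is DataConverter. It has two fields: a Pattern used to split the input string, and a dataset that the convert method uses to look up the relevant values. Its one important method is convert, which turns a string into a Vector (an Instance, to be precise). Before going through that method, take a look at the dataset first; the fields that matter below are ignored, attributes and values.

Suppose the input string is [1,1.52101,13.64,4.49,1.10,71.78,0.06,8.75,0.00,0.00,1]. convert first splits the string on commas into an array, then skips the array positions listed in ignored, and then handles each remaining value according to attributes. A numerical value is added to the vector directly. A categorical value is looked up in the corresponding values array and its index is added to the vector; the string "3", for example, is replaced by the index of "3" in that array. For the line above the label is "1", so the value added to the vector is 2. Stepping through in a debugger shows the vector after this line of input is added.

There the string 1 (it must be treated as a string, not a number) has indeed been converted to 2, and since values 7 and 8 are 0, they are not displayed.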
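The conversion steps can be sketched in plain Java as follows. This is not Mahout's DataConverter: the class, the Kind enum and the category order {"3", "2", "1"} are assumptions, with the order chosen only so that the label "1" maps to index 2 as observed in this post.

```java
import java.util.Arrays;

class ConvertSketch {
  enum Kind { IGNORED, NUMERICAL, CATEGORICAL }

  // categories[col] holds the known values of a categorical column, null otherwise.
  static double[] convert(String line, Kind[] kinds, String[][] categories) {
    String[] tokens = line.trim().split(",");
    if (tokens.length != kinds.length) {
      throw new IllegalArgumentException("expected " + kinds.length + " tokens");
    }
    int kept = 0;
    for (Kind kind : kinds) {
      if (kind != Kind.IGNORED) {
        kept++;
      }
    }
    double[] vector = new double[kept];
    int out = 0;
    for (int col = 0; col < tokens.length; col++) {
      String token = tokens[col].trim();
      switch (kinds[col]) {
        case IGNORED:
          break; // skipped entirely, takes no slot in the vector
        case NUMERICAL:
          vector[out++] = Double.parseDouble(token);
          break;
        case CATEGORICAL: {
          // store the index of the string in the known values, not the string itself
          int index = Arrays.asList(categories[col]).indexOf(token);
          if (index < 0) {
            throw new IllegalArgumentException("unknown category: " + token);
          }
          vector[out++] = index;
          break;
        }
      }
    }
    return vector;
  }

  public static void main(String[] args) {
    Kind[] kinds = new Kind[11];
    Arrays.fill(kinds, Kind.NUMERICAL);
    kinds[0] = Kind.IGNORED;       // assumed: the first column is an id
    kinds[10] = Kind.CATEGORICAL;  // assumed: the last column is the label
    String[][] categories = new String[11][];
    categories[10] = new String[] {"3", "2", "1"}; // hypothetical order
    String line = "1,1.52101,13.64,4.49,1.10,71.78,0.06,8.75,0.00,0.00,1";
    System.out.println(Arrays.toString(convert(line, kinds, categories)));
  }
}
```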

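(3) cleanup function

cleanup starts by creating a few variables (Data, Bagging, TreeID), then loops calling build to grow each tree and prints it. Each tree is represented by the Node class, so the key piece here is the build function.

Bagging.build takes a Random and returns a Node. That Node is the tree: further Nodes can be attached below it on the left and right. Here is the code of the function: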

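Arrays.fill(sampled, false);
Data bag = data.bagging(rng, sampled);
return treeBuilder.build(rng, bag);

The code first applies bagging(rng, sampled) to the Data, then passes the resulting data to treeBuilder's build function. Taking these one at a time, what does data.bagging do?

public Data bagging(Random rng, boolean[] sampled) {
  int datasize = size();
  List<Instance> bag = Lists.newArrayListWithCapacity(datasize);

  for (int i = 0; i < datasize; i++) {
    int index = rng.nextInt(datasize);
    bag.add(instances.get(index));
    sampled[index] = true;
  }

  return new Data(dataset, bag);
}

instances is the list of original data. On each iteration bag receives one instance picked at random from instances, and sampled is updated (it records which indices of instances have been picked). The bag is then returned, so it is bound to contain duplicates.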
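As a standalone illustration of the sampling step, the following sketch draws a bootstrap sample with replacement from any list and records which indices were hit. The class name and the sample data are hypothetical; only java.util is used.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Random;

class BaggingSketch {
  // Draws instances.size() items with replacement and marks every index that was drawn.
  static <T> List<T> bagging(List<T> instances, Random rng, boolean[] sampled) {
    int n = instances.size();
    List<T> bag = new ArrayList<T>(n);
    for (int i = 0; i < n; i++) {
      int index = rng.nextInt(n);
      bag.add(instances.get(index));
      sampled[index] = true;
    }
    return bag;
  }

  public static void main(String[] args) {
    List<String> data = Arrays.asList("a", "b", "c", "d", "e");
    boolean[] sampled = new boolean[data.size()];
    List<String> bag = bagging(data, new Random(42L), sampled);
    System.out.println("bag     = " + bag);
    System.out.println("sampled = " + Arrays.toString(sampled));
  }
}
```

The indices left false in sampled are the instances that never made it into the bag, which is why the flag array is passed in alongside the generator.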



Next comes the treeBuilder.build method. Two classes implement it, DecisionTreeBuilder and DefaultTreeBuilder; the one called here is DecisionTreeBuilder's build.
It begins with the following code:

if (selected == null) {
  selected = new boolean[data.getDataset().nbAttributes()];
  selected[data.getDataset().getLabelId()] = true; // never select the label
}
if (m == 0) {
  // set default m
  double e = data.getDataset().nbAttributes() - 1;
  if (data.getDataset().isNumerical(data.getDataset().getLabelId())) {
    // regression
    m = (int) Math.ceil(e / 3.0);
  } else {
    // classification
    m = (int) Math.ceil(Math.sqrt(e));
  }
}

This sets selected to true for the label and leaves it false for every other attribute. Then m is set. Since m was not set earlier and this is a classification problem, m becomes the square root of the number of attributes excluding the label, rounded up. m is the number of attributes that are randomly chosen below.
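The default-m rule is easy to check in isolation. The sketch below repeats the arithmetic from the snippet; the class and method names are made up, and the attribute count of 10 (9 features plus the label) is an assumption about the glass dataset after the ignored column is dropped.

```java
class DefaultMSketch {
  // nbAttributes counts the label, so the label is subtracted before applying the rule.
  static int defaultM(int nbAttributes, boolean regression) {
    double e = nbAttributes - 1;
    if (regression) {
      return (int) Math.ceil(e / 3.0);
    }
    return (int) Math.ceil(Math.sqrt(e));
  }

  public static void main(String[] args) {
    System.out.println("classification, 10 attributes: m = " + defaultM(10, false));
    System.out.println("regression, 10 attributes:     m = " + defaultM(10, true));
  }
}
```

Under that assumption the classification case gives m = 3, which is consistent with the three attributes returned by randomAttributes in the debug run.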

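The code that follows checks the boolean data.getDataset().isNumerical(data.getDataset().getLabelId()) to decide between the regression path and the classification path. The label here is clearly not numerical, so we enter the classification code.
It starts with two checks:

if (isIdentical(data)) {
  return new Leaf(data.majorityLabel(rng));
}
if (data.identicalLabel()) {
  return new Leaf(data.getDataset().getLabel(data.get(0)));
}

The first check tests whether all of data is identical. The second tests whether every instance in data carries the same label, in which case a Leaf with that label is returned. The data passed in does contain duplicates, but it is not all identical and it does not have a single label, so execution continues.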


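int[] attributes = randomAttributes(rng, selected, m);

This line randomly picks m attributes and returns them in attributes; in this debug run the result was [8,2,6]. The next check, if (attributes == null || attributes.length == 0), is skipped here. Then, at if (igSplit == null), the assignment for a classification problem is igSplit = new OptIgSplit();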
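One simple way to pick m distinct attributes among those not yet selected is sketched below. It is not Mahout's randomAttributes: it shuffles the candidate indices with Collections.shuffle, which is an assumed substitute technique, and the class name is hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Random;

class RandomAttributesSketch {
  // Returns up to m distinct attribute indices whose selected flag is false.
  static int[] randomAttributes(Random rng, boolean[] selected, int m) {
    List<Integer> candidates = new ArrayList<Integer>();
    for (int attr = 0; attr < selected.length; attr++) {
      if (!selected[attr]) {
        candidates.add(attr);
      }
    }
    Collections.shuffle(candidates, rng);
    int count = Math.min(m, candidates.size());
    int[] result = new int[count];
    for (int i = 0; i < count; i++) {
      result[i] = candidates.get(i);
    }
    return result;
  }

  public static void main(String[] args) {
    boolean[] selected = new boolean[10];
    selected[9] = true; // assumed: index 9 is the label and must never be chosen
    System.out.println(Arrays.toString(randomAttributes(new Random(1L), selected, 3)));
  }
}
```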

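The code goes on:

Split best = null;
for (int attr : attributes) {
  Split split = igSplit.computeSplit(data, attr);
  if (best == null || best.getIg() < split.getIg()) {
    best = split;
  }
}

First look at the Split class, which has three fields: int attr, double ig, double split. Now look at the computeSplit function (the one in OptIgSplit):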


public Split computeSplit(Data data, int attr) {
  if (data.getDataset().isNumerical(attr)) {
    return numericalSplit(data, attr);
  } else {
    return categoricalSplit(data, attr);
  }
}

Yet another function to step into; here is numericalSplit:


Split numericalSplit(Data data, int attr) {
  double[] values = sortedValues(data, attr);

  initCounts(data, values);
  computeFrequencies(data, attr, values);

  int size = data.size();
  double hy = entropy(countAll, size);
  double invDataSize = 1.0 / size;

  int best = -1;
  double bestIg = -1.0;

  // try each possible split value
  for (int index = 0; index < values.length; index++) {
    double ig = hy;

    // instance with attribute value < values[index]
    size = DataUtils.sum(countLess);
    ig -= size * invDataSize * entropy(countLess, size);

    // instance with attribute value >= values[index]
    size = DataUtils.sum(countAll);
    ig -= size * invDataSize * entropy(countAll, size);

    if (ig > bestIg) {
      bestIg = ig;
      best = index;
    }

    DataUtils.add(countLess, counts[index]);
    DataUtils.dec(countAll, counts[index]);
  }

  if (best == -1) {
    throw new IllegalStateException("no best split found !");
  }
  return new Split(attr, bestIg, values[best]);
}
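The idea behind numericalSplit, trying each distinct value as a threshold and keeping the one with the highest information gain, can be sketched without any Mahout classes. The version below recounts the labels for every threshold, which is simpler but slower than the incremental countLess/countAll updates in the listing. All names and the toy data are assumptions.

```java
import java.util.Arrays;

class NumericSplitSketch {
  // Shannon entropy (base 2) of a label histogram holding total instances.
  static double entropy(int[] counts, int total) {
    if (total == 0) {
      return 0.0;
    }
    double h = 0.0;
    for (int c : counts) {
      if (c > 0) {
        double p = (double) c / total;
        h -= p * Math.log(p) / Math.log(2.0);
      }
    }
    return h;
  }

  // Returns {best threshold, best information gain} for the split "value < threshold".
  static double[] bestSplit(double[] values, int[] labels, int nbLabels) {
    int n = values.length;
    int[] all = new int[nbLabels];
    for (int label : labels) {
      all[label]++;
    }
    double hy = entropy(all, n);

    double[] sorted = values.clone();
    Arrays.sort(sorted);

    double bestIg = -1.0;
    double bestValue = Double.NaN;
    for (int k = 0; k < sorted.length; k++) {
      if (k > 0 && sorted[k] == sorted[k - 1]) {
        continue; // each distinct value is tried once
      }
      double threshold = sorted[k];
      int[] less = new int[nbLabels];
      int[] rest = new int[nbLabels];
      int nLess = 0;
      for (int i = 0; i < n; i++) {
        if (values[i] < threshold) {
          less[labels[i]]++;
          nLess++;
        } else {
          rest[labels[i]]++;
        }
      }
      double ig = hy
          - (double) nLess / n * entropy(less, nLess)
          - (double) (n - nLess) / n * entropy(rest, n - nLess);
      if (ig > bestIg) {
        bestIg = ig;
        bestValue = threshold;
      }
    }
    return new double[] {bestValue, bestIg};
  }

  public static void main(String[] args) {
    double[] values = {1.0, 2.0, 3.0, 4.0}; // toy attribute values
    int[] labels = {0, 0, 1, 1};            // toy class labels
    System.out.println(Arrays.toString(bestSplit(values, labels, 2)));
  }
}
```

On the toy data the threshold 3.0 separates the two classes completely, so its gain equals the full entropy of the labels.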

Phew, that is long. I will look at it again tonight...




Share, grow, enjoy
When reposting, please credit the blog: http://blog.csdn.net/fansy1990
