A Simple Hadoop Word Count Example
Word count is the most basic example in learning Hadoop. It is simple and practical, and every beginner should master it: this small case walks through Hadoop's basic execution model and the development workflow of a MapReduce program.
Add the relevant JAR files from the Hadoop installation directory:
(hdfs (required), common (required), mapreduce (required))
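If you build with Maven instead of copying JARs by hand, the aggregate hadoop-client artifact pulls in all three. A minimal sketch; the version number is an assumption, match it to the Hadoop release your cluster runs:

<!-- assumption: Hadoop 2.7.3; use your cluster's version -->
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>2.7.3</version>
</dependency>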
Add the configuration files:
core-site.xml; hdfs-site.xml; …
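These files tell the client where the cluster lives; in practice you should copy them from your cluster's configuration directory rather than write them from scratch. For reference, a minimal core-site.xml sketch, assuming a single-node cluster with the NameNode at localhost:9000 (both values are assumptions):

<configuration>
    <!-- assumption: NameNode at localhost:9000; substitute your cluster's address -->
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
    </property>
</configuration>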
Write the Mapper:
package cn.guyouda.hadoop.mapreduce.wordcount;

import java.io.IOException;

import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

/**
 * @author Youda
 * Mapper takes four type parameters:
 * KEYIN: input key; by default, the byte offset of the current line in the input file
 * VALUEIN: input value; the text of the current line
 * KEYOUT: the key type sent to the Reducer
 * VALUEOUT: the value type sent to the Reducer
 *
 * These types cross the network, so they must be serializable. Java's
 * built-in serialization carries redundant metadata, which is wasteful
 * for large-scale transfers, so Hadoop wraps Long and String in its own
 * types: LongWritable and Text.
 */
public class WordCountMapper extends Mapper<LongWritable, Text, Text, LongWritable> {

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Split the line into words; StringUtils.split treats every
        // character of the second argument as a separator
        String text = value.toString();
        String[] words = StringUtils.split(text, " ,.");
        for (String word : words) {
            // Emit (word, 1) for each occurrence
            context.write(new Text(word), new LongWritable(1));
        }
    }
}
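For an input line such as "hello world hello", this map call emits (hello, 1), (world, 1), (hello, 1); the framework then groups the pairs by key before handing them to the reducer.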
Write the Reducer:
package cn.guyouda.hadoop.mapreduce.wordcount;

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

/**
 * @author Youda
 */
public class WordCountReducer extends Reducer<Text, LongWritable, Text, LongWritable> {

    @Override
    protected void reduce(Text key, Iterable<LongWritable> values, Context context)
            throws IOException, InterruptedException {
        // Sum all the 1s emitted by the mappers for this word
        long count = 0L;
        for (LongWritable num : values) {
            count += num.get();
        }
        context.write(key, new LongWritable(count));
    }
}
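Continuing the example above, the shuffle delivers ("hello", [1, 1]) to reduce, which sums the list and writes ("hello", 2).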
Write the driver program:
package cn.guyouda.hadoop.mapreduce.wordcount;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/**
 * @author Youda
 * The driver wires the job together: it names the Mapper and Reducer
 * classes, points the job at the data to process, and sets the output
 * path for the results.
 */
public class WordCountRunner {

    public static void main(String[] args)
            throws IOException, ClassNotFoundException, InterruptedException {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);
        job.setJarByClass(WordCountRunner.class);

        // Specify the Mapper and Reducer classes
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);

        // Specify the Reducer's output types
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);

        // Specify the Mapper's output types
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(LongWritable.class);

        // Specify the input and output directories
        FileInputFormat.setInputPaths(job, new Path("/wordcount/srcdata/"));
        FileOutputFormat.setOutputPath(job, new Path("/wordcount/output/"));

        // Submit the job; the boolean controls whether progress is printed
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
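Package the three classes into a JAR (the commands below assume the name Count.jar) before submitting. From Eclipse, use Export > JAR file; from the command line, a sketch assuming your compiled .class files sit under bin/:
jar cf Count.jar -C bin .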
Create the input directory in HDFS and upload the file(s) whose words you want to count:
hadoop fs -mkdir /wordcount
hadoop fs -mkdir /wordcount/srcdata
hadoop fs -put XXXX.txt /wordcount/srcdata
Note: HDFS must be running (start-dfs.sh) before you create directories or upload files. Do not create the output directory yourself; if it already exists, the job will fail at startup.
Run the program:
hadoop jar Count.jar cn.guyouda.hadoop.mapreduce.wordcount.WordCountRunner
When the job finishes, the result files are written to the output directory.
Note: YARN must be running (start-yarn.sh) before you submit the job.
Display the results:
hadoop fs -cat /wordcount/output/part-r-00000
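Each line of the result file holds a word, a tab, and its count. For the sample line used earlier, the output would look like:
hello	2
world	1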