Hadoop之NLineInputFormat

作者：寒落落 | 来源：互联网 | 2023-10-10 14:24

上图是InputFormat的派生子类关系图，这篇主要讲解FileInputDormat的实现类——>NLineInputFormat由于InputFormat是一个抽象类，

上图是InputFormat的派生子类关系图，这篇主要讲解FileInputDormat的实现类——>NLineInputFormat

由于InputFormat是一个抽象类，不同的实现类，分片机制不同，如下图：

NLineInputFormat源码：

public class NLineInputFormat extends FileInputFormat { public static final String LINES_PER_MAP = "mapreduce.input.lineinputformat.linespermap"; public RecordReader createRecordReader( InputSplit genericSplit, TaskAttemptContext context) throws IOException { context.setStatus(genericSplit.toString()); return new LineRecordReader(); } /** * Logically splits the set of input files for the job, splits N lines * of the input as one split. * * @see FileInputFormat#getSplits(JobContext) */ public List getSplits(JobContext job) throws IOException { List splits = new ArrayList(); int numLinesPerSplit = getNumLinesPerSplit(job); for (FileStatus status : listStatus(job)) { splits.addAll(getSplitsForFile(status, job.getConfiguration(), numLinesPerSplit)); } return splits; } public static List getSplitsForFile(FileStatus status, Configuration conf, int numLinesPerSplit) throws IOException { List splits = new ArrayList (); Path fileName = status.getPath(); if (status.isDirectory()) { throw new IOException("Not a file: " + fileName); } FileSystem fs = fileName.getFileSystem(conf); LineReader lr = null; try { FSDataInputStream in = fs.open(fileName); lr = new LineReader(in, conf); Text line = new Text(); int numLines = 0; long begin = 0; long length = 0; int num = -1; while ((num = lr.readLine(line)) > 0) { numLines++; length += num; if (numLines == numLinesPerSplit) { splits.add(createFileSplit(fileName, begin, length)); begin += length; length = 0; numLines = 0; } } if (numLines != 0) { splits.add(createFileSplit(fileName, begin, length)); } } finally { if (lr != null) { lr.close(); } } return splits; } /** * NLineInputFormat uses LineRecordReader, which always reads * (and consumes) at least one character out of its upper split * boundary. So to make sure that each mapper gets N lines, we * move back the upper split limits of each split * by one character here. * @param fileName Path of file * @param begin the position of the first byte in the file to process * @param length number of bytes in InputSplit * @return FileSplit */ protected static FileSplit createFileSplit(Path fileName, long begin, long length) { return (begin == 0) ? new FileSplit(fileName, begin, length - 1, new String[] {}) : new FileSplit(fileName, begin - 1, length, new String[] {}); } /** * Set the number of lines per split * @param job the job to modify * @param numLines the number of lines per split */ public static void setNumLinesPerSplit(Job job, int numLines) { job.getConfiguration().setInt(LINES_PER_MAP, numLines); } /** * Get the number of lines per split * @param job the job * @return the number of lines per split */ public static int getNumLinesPerSplit(JobContext job) { return job.getConfiguration().getInt(LINES_PER_MAP, 1); } }

发下他重写getSplit()方法，它的切片机制不同与FileInputFormat，不再按Block 块去划分，而是按NlineInputFormat指定的行数N来划分。即输入文件的总行数/N=切片数，如果不整除，切片数=商+1

例子：

输入四行数据。

使用案例：

1．需求

对每个单词进行个数统计，要求根据每个输入文件的行数来规定输出多少个切片。此案例要求每三行放入一个切片中。

代码实现：

package com.c21.mapreduce.nline; import java.io.IOException; import org.apache.hadoop.io.LongWritable; import org.apache.hadoop.io.Text; import org.apache.hadoop.mapreduce.Mapper; public class NLineMapper extends Mapper{ private Text k = new Text(); private LongWritable v = new LongWritable(1); @Override protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException { // 1 获取一行 String line = value.toString(); // 2 切割 String[] splited = line.split(" "); // 3 循环写出 for (int i = 0; i k.set(splited[i]); context.write(k, v); } } }

（2）编写Reducer类

package com.c21.mapreduce.nline; import java.io.IOException; import org.apache.hadoop.io.LongWritable; import org.apache.hadoop.io.Text; import org.apache.hadoop.mapreduce.Reducer; public class NLineReducer extends Reducer{ LongWritable v = new LongWritable(); @Override protected void reduce(Text key, Iterable values, Context context) throws IOException, InterruptedException { long sum = 0l; // 1 汇总 for (LongWritable value : values) { sum += value.get(); } v.set(sum); // 2 输出 context.write(key, v); } }

（3）编写Driver类

package com.c21.mapreduce.nline; import java.io.IOException; import java.net.URISyntaxException; import org.apache.hadoop.conf.Configuration; import org.apache.hadoop.fs.Path; import org.apache.hadoop.io.LongWritable; import org.apache.hadoop.io.Text; import org.apache.hadoop.mapreduce.Job; import org.apache.hadoop.mapreduce.lib.input.FileInputFormat; import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat; import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat; public class NLineDriver { public static void main(String[] args) throws IOException, URISyntaxException, ClassNotFoundException, InterruptedException { // 输入输出路径需要根据自己电脑上实际的输入输出路径设置 args = new String[] { "e:/input/inputword", "e:/output1" }; // 1 获取job对象 Configuration cOnfiguration= new Configuration(); Job job = Job.getInstance(configuration); // 7设置每个切片InputSplit中划分三条记录 NLineInputFormat.setNumLinesPerSplit(job, 3); // 8使用NLineInputFormat处理记录数 job.setInputFormatClass(NLineInputFormat.class); // 2设置jar包位置，关联mapper和reducer job.setJarByClass(NLineDriver.class); job.setMapperClass(NLineMapper.class); job.setReducerClass(NLineReducer.class); // 3设置map输出kv类型 job.setMapOutputKeyClass(Text.class); job.setMapOutputValueClass(LongWritable.class); // 4设置最终输出kv类型 job.setOutputKeyClass(Text.class); job.setOutputValueClass(LongWritable.class); // 5设置输入输出数据路径 FileInputFormat.setInputPaths(job, new Path(args[0])); FileOutputFormat.setOutputPath(job, new Path(args[1])); // 6提交job job.waitForCompletion(true); } }

结果如图：

推荐阅读

request
Web动态服务器Python基本实现

Web动态服务器Python基本实现 ... [详细]

蜡笔小新 2024-11-21 08:01:30
python
OBS Studio自动化实践：利用脚本批量生成录制场景

本文探讨了如何利用OBS Studio进行高效录屏，并通过脚本实现场景的自动生成。适合对自动化办公感兴趣的读者。 ... [详细]

蜡笔小新 2024-11-21 10:44:53
regex
web: _show -> _info 造轮子编程

问题场景用Java进行web开发过程当中，当遇到很多很多个字段的实体时，最苦恼的莫过于编辑字段的查看和修改界面，发现2个页面存在很多重复信息，能不能写一遍？有没有轮子用都不如自己造。解决方式笔者根据自 ... [详细]

蜡笔小新 2024-11-21 10:21:24
request
使用Service Locator模式实现高效的服务命名访问

本文探讨了如何通过Service Locator模式来简化和优化在B/S架构中的服务命名访问，特别是对于需要频繁访问的服务，如JNDI和XMLNS。该模式通过缓存机制减少了重复查找的成本，并提供了对多种服务的统一访问接口。 ... [详细]

蜡笔小新 2024-11-20 19:26:30
ip
linux网络子系统分析（二）—— 协议栈分层框架的建立

目录一、综述二、INET的初始化2.1INET接口注册2.2抽象实体的建立2.3代码细节分析2.3.1socket参数三、其他协议3.1PF_PACKET3.2P ... [详细]

蜡笔小新 2024-11-20 15:21:14
request
软通动力Java开发工程师笔试题解析

本文档详细介绍了软通动力Java开发工程师职位的笔试题目，涵盖了Java基础、集合框架、JDBC、JSP等内容，并提供了详细的答案解析。 ... [详细]

蜡笔小新 2024-11-20 13:34:48
version
fleaframedb使用之JPA封装介绍

flea,frame,db,使用,之 ... [详细]

蜡笔小新 2024-11-20 12:00:16
bit
开发技巧: Effective Java第三版——优先选用Collection而非Stream作为方法返回类型

在Effective Java第三版中，建议在方法返回类型中优先考虑使用Collection而非Stream，以提高代码的灵活性和兼容性。 ... [详细]

蜡笔小新 2024-11-19 15:31:16
string
IC卡操作功能实现

本文介绍了如何通过C#语言调用动态链接库（DLL）中的函数来实现IC卡的基本操作，包括初始化设备、设置密码模式、获取设备状态等，并详细展示了将TextBox中的数据写入IC卡的具体实现方法。 ... [详细]

蜡笔小新 2024-11-21 11:02:19
version
spring boot使用jetty无法启动

spring boot使用jetty无法启动 ... [详细]

蜡笔小新 2024-11-21 10:15:52
request
设置Shadowsocks公共代理的关键步骤

本文详细介绍了如何正确设置Shadowsocks公共代理，包括调整超时设置、检查系统限制、防止滥用及遵守DMCA法规等关键步骤。 ... [详细]

蜡笔小新 2024-11-20 20:41:33
bit
使用QT构建基础串口辅助工具

本文详细介绍了如何利用QT框架创建一个简易的串口助手应用程序，包括项目的建立、界面设计与编程实现、运行测试以及最终的应用程序打包。 ... [详细]

蜡笔小新 2024-11-20 00:47:00
ip
43.Word Break（看字符串是否由词典中的单词组成）

Level： Medium题目描述：Givenanon-emptystringsandadictionarywordDictcontainingalistofnon-emptyw ... [详细]

蜡笔小新 2024-11-19 20:43:23
ip
C# 中创建和执行存储过程的方法

本文详细介绍了如何使用 C# 创建和调用 SQL Server 存储过程，包括连接数据库、定义命令类型、设置参数等步骤。 ... [详细]

蜡笔小新 2024-11-19 19:55:59
less
Java代码保护与混淆：ProGuard详解

在Java开发中，保护代码安全是一个重要的课题。由于Java字节码容易被反编译，因此使用代码混淆工具如ProGuard变得尤为重要。本文将详细介绍如何使用ProGuard进行代码混淆，以及其基本原理和常见问题。 ... [详细]

蜡笔小新 2024-11-18 16:46:17

寒落落

这个家伙很懒，什么也没留下！

Tags | 热门标签

RankList | 热门文章