import java.io.IOException;
import java.util.List;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
public class TableCopy extends Configured implements Tool{
static class CopyMapper extends TableMapper<ImmutableBytesWritable, Put> {
@Override
protected void map(ImmutableBytesWritable key, Result value,
Context context) throws IOException, InterruptedException {
// copy the scanned Result into a Put keyed by the same row key
Put p = new Put(key.get()); // note: the row key must come from key.get()
// load every KeyValue of the Result into the Put
for (KeyValue kv : value.raw())
p.add(kv);
// write the row key and the Put to the job output
context.write(key, p);
}
}
public static Job createSubmittableJob(Configuration conf, String[] args)throws IOException{
String jobName = args[0];
String srcTable = args[1];
String dstTable = args[2];
Scan sc = new Scan();
sc.setCaching(2000);
sc.setCacheBlocks(false);
Job job = new Job(conf,jobName);
job.setJarByClass(TableCopy.class);
job.setNumReduceTasks(0);
TableMapReduceUtil.initTableMapperJob(srcTable, sc, CopyMapper.class, ImmutableBytesWritable.class, Put.class, job);
TableMapReduceUtil.initTableReducerJob(dstTable, null, job);
return job;
}
@Override
public int run(String[] args)throws Exception{
Configuration conf = HBaseConfiguration.create();
Job job = createSubmittableJob(conf, args);
return job.waitForCompletion(true)? 0 : 1;
}
public static void main(String[] args){
System.out.println("job:" + args[0] + ", copy src table " + args[1] + " to dest table " + args[2]);
try {
TableCopy tc = new TableCopy();
System.exit(tc.run(args));
} catch (Exception e) {
e.printStackTrace();
}
}
}
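Because TableCopy implements Tool, it could also be launched through ToolRunner so that standard Hadoop generic options (-D, -conf, and so on) are handled automatically. A minimal alternative main() is sketched below; it is only one possible variant, not part of the original example:
public static void main(String[] args) throws Exception {
  int exitCode = org.apache.hadoop.util.ToolRunner.run(HBaseConfiguration.create(), new TableCopy(), args);
  System.exit(exitCode);
}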
7.2. HBase MapReduce Examples
7.2.1. HBase MapReduce Read Example
The following is an example of using HBase as a MapReduce source in a read-only manner. Specifically, there is a Mapper instance but no Reducer, and nothing is being emitted from the Mapper. The job would be defined as follows...
Configuration config = HBaseConfiguration.create();
Job job = new Job(config, "ExampleRead");
job.setJarByClass(MyReadJob.class); // class that contains mapper
Scan scan = new Scan();
scan.setCaching(500); // 1 is the default in Scan, which will be bad for MapReduce jobs
scan.setCacheBlocks(false); // don't set to true for MR jobs
// set other scan attrs
...
TableMapReduceUtil.initTableMapperJob(
tableName, // input HBase table name
scan, // Scan instance to control CF and attribute selection
MyMapper.class, // mapper
null, // mapper output key
null, // mapper output value
job);
job.setOutputFormatClass(NullOutputFormat.class); // because we aren't emitting anything from mapper
boolean b = job.waitForCompletion(true);
if (!b) {
throw new IOException("error with job!");
}
...and the mapper instance would extend TableMapper...
public static class MyMapper extends TableMapper<Text, Text> {
public void map(ImmutableBytesWritable row, Result value, Context context) throws InterruptedException, IOException {
// process data for the row from the Result instance.
}
}
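For illustration only, the placeholder map() body above could, for example, inspect one column and bump a job counter; the column family "cf" and qualifier "attr1" are hypothetical and simply mirror the later summary examples:
public void map(ImmutableBytesWritable row, Result value, Context context) throws InterruptedException, IOException {
  byte[] b = value.getValue(Bytes.toBytes("cf"), Bytes.toBytes("attr1"));
  if (b != null) {
    // nothing is written to the job output; a counter is enough for a read-only job
    context.getCounter("example", "rowsWithAttr1").increment(1);
  }
}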
7.2.2. HBase MapReduce Read/Write Example
The following is an example of using HBase both as a source and as a sink with MapReduce. This example will simply copy data from one table to another.
Configuration config = HBaseConfiguration.create();
Job job = new Job(config,"ExampleReadWrite");
job.setJarByClass(MyReadWriteJob.class); // class that contains mapper
Scan scan = new Scan();
scan.setCaching(500); // 1 is the default in Scan, which will be bad for MapReduce jobs
scan.setCacheBlocks(false); // don't set to true for MR jobs
// set other scan attrs
TableMapReduceUtil.initTableMapperJob(
sourceTable, // input table
scan, // Scan instance to control CF and attribute selection
MyMapper.class, // mapper class
null, // mapper output key
null, // mapper output value
job);
TableMapReduceUtil.initTableReducerJob(
targetTable, // output table
null, // reducer class
job);
job.setNumReduceTasks(0);
boolean b = job.waitForCompletion(true);
if (!b) {
throw new IOException("error with job!");
}
An explanation is required of what TableMapReduceUtil is doing, especially with the reducer. TableOutputFormat is being used as the outputFormat class, and several parameters are being set on the config (e.g., TableOutputFormat.OUTPUT_TABLE), as well as setting the reducer output key to ImmutableBytesWritable and the reducer output value to Writable. These could be set by the programmer on the job and conf, but TableMapReduceUtil tries to make things easier.
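For reference, the following is roughly the manual equivalent of what initTableReducerJob arranges; it is only a sketch, not the complete implementation (which also handles details such as dependency jars and the ZooKeeper quorum). Here targetTable is the output table name, and imports of TableOutputFormat and org.apache.hadoop.io.Writable are assumed:
job.setOutputFormatClass(TableOutputFormat.class);
job.getConfiguration().set(TableOutputFormat.OUTPUT_TABLE, targetTable);
job.setOutputKeyClass(ImmutableBytesWritable.class);
job.setOutputValueClass(Writable.class);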
The following is the example mapper, which will create a Put matching the input Result and emit it. Note: this is what the CopyTable utility does.
public static class MyMapper extends TableMapper<ImmutableBytesWritable, Put> {
public void map(ImmutableBytesWritable row, Result value, Context context) throws IOException, InterruptedException {
// this example is just copying the data from the source table...
context.write(row, resultToPut(row,value));
}
private static Put resultToPut(ImmutableBytesWritable key, Result result) throws IOException {
Put put = new Put(key.get());
for (KeyValue kv : result.raw()) {
put.add(kv);
}
return put;
}
}
There isn't actually a reducer step, so TableOutputFormat takes care of sending the Put to the target table.
This is just an example; developers could choose not to use TableOutputFormat and connect to the target table themselves.
7.2.3. HBase MapReduce Read/Write Example With Multi-Table Output
TODO: example for MultiTableOutputFormat.
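Until that example is written, here is a rough sketch of how MultiTableOutputFormat might be wired up; the job name, mapper class, and the target table names "table1" and "table2" are made up for illustration. With this output format, the key of each record emitted from the mapper names the table the accompanying Put should be applied to:
Job job = new Job(config, "ExampleMultiTableWrite");
job.setJarByClass(MyMultiTableJob.class); // hypothetical class containing the mapper
Scan scan = new Scan();
scan.setCaching(500);
scan.setCacheBlocks(false);
TableMapReduceUtil.initTableMapperJob(
  sourceTable,                  // input table
  scan,                         // Scan instance
  MyMultiTableMapper.class,     // mapper defined below
  ImmutableBytesWritable.class, // output key: name of the destination table
  Put.class,                    // output value: the Put to apply there
  job);
job.setOutputFormatClass(MultiTableOutputFormat.class);
job.setNumReduceTasks(0);

public static class MyMultiTableMapper extends TableMapper<ImmutableBytesWritable, Put> {
  private static final ImmutableBytesWritable TABLE1 = new ImmutableBytesWritable(Bytes.toBytes("table1"));
  private static final ImmutableBytesWritable TABLE2 = new ImmutableBytesWritable(Bytes.toBytes("table2"));

  public void map(ImmutableBytesWritable row, Result value, Context context) throws IOException, InterruptedException {
    Put put = new Put(row.get());
    for (KeyValue kv : value.raw()) {
      put.add(kv);
    }
    // the output key selects which table each Put is written to
    context.write(TABLE1, put);
    context.write(TABLE2, put);
  }
}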
7.2.4. HBase MapReduce Summary to HBase Example
The following example uses HBase as a MapReduce source and sink with a summarization step. This example will count the number of distinct instances of a value in a table and write those summarized counts in another table.
Configuration config = HBaseConfiguration.create();
Job job = new Job(config,"ExampleSummary");
job.setJarByClass(MySummaryJob.class); // class that contains mapper and reducer
Scan scan = new Scan();
scan.setCaching(500); // 1 is the default in Scan, which will be bad for MapReduce jobs
scan.setCacheBlocks(false); // don't set to true for MR jobs
// set other scan attrs
TableMapReduceUtil.initTableMapperJob(
sourceTable, // input table
scan, // Scan instance to control CF and attribute selection
MyMapper.class, // mapper class
Text.class, // mapper output key
IntWritable.class, // mapper output value
job);
TableMapReduceUtil.initTableReducerJob(
targetTable, // output table
MyTableReducer.class, // reducer class
job);
job.setNumReduceTasks(1); // at least one, adjust as required
boolean b = job.waitForCompletion(true);
if (!b) {
throw new IOException("error with job!");
}
In this example mapper a column with a String value is chosen as the value to summarize upon. This value is used as the key to emit from the mapper, and an IntWritable represents an instance counter.
public static class MyMapper extends TableMapper<Text, IntWritable> {
private final IntWritable ONE = new IntWritable(1);
private Text text = new Text();
public void map(ImmutableBytesWritable row, Result value, Context context) throws IOException, InterruptedException {
String val = new String(value.getValue(Bytes.toBytes("cf"), Bytes.toBytes("attr1")));
text.set(val); // we can only emit Writables...
context.write(text, ONE);
}
}
In the reducer, the "ones" are counted (just like any other MR example that does this), and then a Put is emitted.
public static class MyTableReducer extends TableReducer<Text, IntWritable, ImmutableBytesWritable> {
public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
int i = 0;
for (IntWritable val : values) {
i += val.get();
}
Put put = new Put(Bytes.toBytes(key.toString()));
put.add(Bytes.toBytes("cf"), Bytes.toBytes("count"), Bytes.toBytes(i));
context.write(null, put);
}
}
7.2.5. HBase MapReduce Summary to File Example
This is very similar to the summary example above, with the exception that HBase is used as the MapReduce source but HDFS is the sink. The differences are in the job setup and in the reducer; the mapper remains the same.
Configuration config = HBaseConfiguration.create();
Job job = new Job(config,"ExampleSummaryToFile");
job.setJarByClass(MySummaryFileJob.class); // class that contains mapper and reducer
Scan scan = new Scan();
scan.setCaching(500); // 1 is the default in Scan, which will be bad for MapReduce jobs
scan.setCacheBlocks(false); // don't set to true for MR jobs
// set other scan attrs
TableMapReduceUtil.initTableMapperJob(
sourceTable, // input table
scan, // Scan instance to control CF and attribute selection
MyMapper.class, // mapper class
Text.class, // mapper output key
IntWritable.class, // mapper output value
job);
job.setReducerClass(MyReducer.class); // reducer class
job.setNumReduceTasks(1); // at least one, adjust as required
FileOutputFormat.setOutputPath(job, new Path("/tmp/mr/mySummaryFile")); // adjust directories as required
boolean b = job.waitForCompletion(true);
if (!b) {
throw new IOException("error with job!");
}
As stated above, the previous Mapper can run unchanged with this example. As for the Reducer, it is a "generic" Reducer instead of extending TableReducer and emitting Puts.
public static class MyReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
int i = 0;
for (IntWritable val : values) {
i += val.get();
}
context.write(key, new IntWritable(i));
}
}
7.2.6. HBase MapReduce Summary to HBase Without Reducer
It is also possible to perform summaries without a reducer, effectively using HBase itself in place of the reducer. An HBase target table would need to exist for the job summary. The HTable method incrementColumnValue would be used to atomically increment values. From a performance perspective, it might make sense to keep a Map of values with their pending increments for each map-task, and make one update per key during the cleanup method of the mapper. However, your mileage may vary depending on the number of rows to be processed and the number of unique keys.
In the end, the summary results are in HBase.
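A possible mapper for this approach is sketched below; the target table name "summaryTable", the summarized column "cf:attr1", and the counter column "cf:count" are hypothetical, chosen to mirror the earlier summary example (imports of java.util.HashMap, java.util.Map, and org.apache.hadoop.hbase.client.HTable are assumed):
public static class MyIncrementingMapper extends TableMapper<Text, IntWritable> {
  private Map<String, Long> counts = new HashMap<String, Long>();
  private HTable summaryTable;

  @Override
  protected void setup(Context context) throws IOException {
    summaryTable = new HTable(context.getConfiguration(), "summaryTable"); // hypothetical target table
  }

  public void map(ImmutableBytesWritable row, Result value, Context context) throws IOException, InterruptedException {
    String val = new String(value.getValue(Bytes.toBytes("cf"), Bytes.toBytes("attr1")));
    Long current = counts.get(val);
    counts.put(val, current == null ? 1L : current + 1L);
    // nothing is emitted from map(); the updates happen in cleanup()
  }

  @Override
  protected void cleanup(Context context) throws IOException {
    // one atomic increment per distinct key instead of one per scanned row
    for (Map.Entry<String, Long> entry : counts.entrySet()) {
      summaryTable.incrementColumnValue(Bytes.toBytes(entry.getKey()), Bytes.toBytes("cf"), Bytes.toBytes("count"), entry.getValue());
    }
    summaryTable.close();
  }
}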
7.2.7. HBase MapReduce Summary to RDBMS
Sometimes it is more appropriate to generate summaries to an RDBMS. For these cases, it is possible to generate summaries directly to an RDBMS via a custom reducer. The setup method can connect to an RDBMS (the connection information can be passed via custom parameters in the context) and the cleanup method can close the connection.
It is critical to understand that the number of reducers for the job affects the summarization implementation, and you'll have to design this into your reducer: specifically, whether it is designed to run as a singleton (one reducer) or with multiple reducers. Neither is right or wrong; it depends on your use-case. Recognize that the more reducers that are assigned to the job, the more simultaneous connections to the RDBMS will be created; this will scale, but only to a point.
public static class MyRdbmsReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
private Connection c = null;
public void setup(Context context) {
// create DB connection...
}
public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
// do summarization
// in this example the keys are Text, but this is just an example
}
public void cleanup(Context context) {
// close db connection
}
}
In the end, the summary results are written to your RDBMS table/s.
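As an illustration only, the skeleton above might be completed with plain JDBC along the following lines; the configuration key "summary.jdbc.url", the table "summary", and its columns are hypothetical:
public static class MyRdbmsReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  private java.sql.Connection c = null;

  @Override
  protected void setup(Context context) throws IOException {
    try {
      // "summary.jdbc.url" is a hypothetical custom parameter set on the job configuration
      c = java.sql.DriverManager.getConnection(context.getConfiguration().get("summary.jdbc.url"));
    } catch (java.sql.SQLException e) {
      throw new IOException("could not open RDBMS connection", e);
    }
  }

  public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException {
    int sum = 0;
    for (IntWritable val : values) {
      sum += val.get();
    }
    try {
      java.sql.PreparedStatement ps = c.prepareStatement("INSERT INTO summary (summary_key, summary_count) VALUES (?, ?)");
      ps.setString(1, key.toString());
      ps.setInt(2, sum);
      ps.executeUpdate();
      ps.close();
    } catch (java.sql.SQLException e) {
      throw new IOException("could not write summary row", e);
    }
  }

  @Override
  protected void cleanup(Context context) throws IOException {
    try {
      if (c != null) c.close();
    } catch (java.sql.SQLException e) {
      throw new IOException("could not close RDBMS connection", e);
    }
  }
}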