How do I solve a deduplication problem in MapReduce?

Author: Alistar1991_281 | Source: Internet | 2023-07-25 20:25

john 89
tom 100
mary 100
mary 200
tom 20
———–
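I've just started learning MapReduce and am practising with it. I've been working on the data above for a long time and still can't get it right. I want to deduplicate on the first column, and the distinct count should be 3.
If the MapReduce job runs correctly, the content of the part-00000 file should be:
3
How should this MapReduce job be written?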

15 solutions

#1

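In map, use the first column as the key; the value doesn't matter.
In the reducer class, initialize a counter.
Increment the counter by one on each call to the reduce method.
In the reducer's cleanup method, write the counter out, and that's it.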
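
The flow described above can be tried out without a cluster. The following is only a plain-Java sketch that imitates the three phases in memory; it does not use the Hadoop API, and the class and method names (`DistinctCountSketch`, `mapKey`, `shuffle`, `countDistinct`) are made up for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class DistinctCountSketch {

    // "map": emit the first whitespace-separated column as the key.
    public static String mapKey(String line) {
        return line.trim().split("\\s+")[0];
    }

    // "shuffle": group emitted keys; the stored value is only how many records shared the key.
    public static Map<String, Integer> shuffle(List<String> lines) {
        Map<String, Integer> groups = new TreeMap<>();
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                continue; // ignore blank lines
            }
            groups.merge(mapKey(line), 1, Integer::sum);
        }
        return groups;
    }

    // "reduce" + "cleanup": one increment per grouped key, total returned at the end.
    public static long countDistinct(List<String> lines) {
        long counter = 0;
        for (String key : shuffle(lines).keySet()) {
            counter++; // stands in for one reduce() call
        }
        return counter; // stands in for the single write in cleanup()
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList(
                "john 89", "tom 100", "mary 100", "mary 200", "tom 20");
        System.out.println(countDistinct(input)); // prints 3
    }
}
```

The counter never looks at the values, which is why the map output value can be anything.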

#2

I see how to write the map now, but how exactly should the reduce be written?

#3

Just use a single Map: define a global HashSet in the Mapper, add each key to it in the map method, and write out the result in the cleanup method.

#4

Following to learn.

#5

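Quoting reply #3 by wulinshishen:

Just use a single Map: define a global HashSet in the Mapper, add each key to it in the map method, and write out the result in the cleanup method.

A map-only job cannot solve this problem.
What happens if the same key shows up in different map tasks?

You have to go through reduce, which receives the keys already grouped, to get a deduplicated result.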
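
To make the objection concrete, here is a small plain-Java sketch (not Hadoop code). It assumes, purely for illustration, that the five sample lines are divided between two map tasks; the split boundaries and the names `MapOnlyPitfall`, `localDistinct`, `mapOnlyTotal` and `groupedTotal` are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class MapOnlyPitfall {

    // What a single map task could report from cleanup(): distinct keys in its own split only.
    public static int localDistinct(List<String> split) {
        Set<String> seen = new HashSet<>();
        for (String line : split) {
            seen.add(line.trim().split("\\s+")[0]);
        }
        return seen.size();
    }

    // Adding up the per-task results, which is all a map-only job has to work with.
    public static int mapOnlyTotal(List<List<String>> splits) {
        int total = 0;
        for (List<String> split : splits) {
            total += localDistinct(split);
        }
        return total;
    }

    // Bringing every key together first, which is what the shuffle before reduce does.
    public static int groupedTotal(List<List<String>> splits) {
        Set<String> all = new HashSet<>();
        for (List<String> split : splits) {
            for (String line : split) {
                all.add(line.trim().split("\\s+")[0]);
            }
        }
        return all.size();
    }

    public static void main(String[] args) {
        List<List<String>> splits = Arrays.asList(
                Arrays.asList("john 89", "tom 100", "mary 100"), // hypothetical split 1
                Arrays.asList("mary 200", "tom 20"));            // hypothetical split 2
        System.out.println(mapOnlyTotal(splits)); // prints 5: mary and tom are counted twice
        System.out.println(groupedTotal(splits)); // prints 3
    }
}
```

Each map task's HashSet is correct for its own split, but no task can know that `mary` and `tom` were also seen elsewhere.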

#6

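Quoting reply #5 by tntzbzc:

A map-only job cannot solve this problem.
What happens if the same key shows up in different map tasks?

You have to go through reduce, which receives the keys already grouped, to get a deduplicated result.

Yes, you're right. I hadn't thought it through that carefully. Thanks for the correction.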

#7

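Quoting reply #4 by tjytad1982:

Following to learn.

        public static class Map extends Mapper<LongWritable, Text, Text, Text> {
                @Override
                public void map(LongWritable key, Text value, Context context)
                                throws IOException, InterruptedException {
                        String line = value.toString();
                        try {
                                String[] lineSplit = line.split("\t");
                                // Emit the first column as the key; the value is unused.
                                context.write(new Text(lineSplit[0]), new Text(""));
                        } catch (java.lang.ArrayIndexOutOfBoundsException e) {
                                // Count lines that could not be parsed and skip them.
                                context.getCounter(Counter.LINESKIP).increment(1);
                                return;
                        }
                }
        }

        public static class Reduce extends Reducer<Text, Text, Text, Text> {
                private Set<String> count = new HashSet<String>();

                @Override
                public void reduce(Text key, Iterable<Text> values, Context context)
                                throws IOException, InterruptedException {
                        for (Text value : values) {
                                count.add(value.toString());
                        }
                        context.write(key, new Text(""));
                }
        }
-------------------------
This problem has had me stuck for two weeks, and there is very little learning material on it. This is how I wrote my map and reduce, but once the data gets a bit larger it runs out of memory, so I think my approach is wrong.
You said "You have to go through reduce, which receives the keys already grouped, to get a deduplicated result". How exactly would the map and reduce be written for that?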

#8

Correction to my reply #7: the MapReduce code I posted there was wrong. Please go by this version instead.

        public static class Map extends Mapper<LongWritable, Text, Text, Text> {
                @Override
                public void map(LongWritable key, Text value, Context context)
                                throws IOException, InterruptedException {
                        String line = value.toString();
                        try {
                                String[] lineSplit = line.split("\t");
                                context.write(new Text(lineSplit[0]), new Text(""));
                                // Also send every first-column value to the single key "uniq".
                                context.write(new Text("uniq"), new Text(lineSplit[0]));
                        } catch (java.lang.ArrayIndexOutOfBoundsException e) {
                                // Count lines that could not be parsed and skip them.
                                context.getCounter(Counter.LINESKIP).increment(1);
                                return;
                        }
                }
        }

        public static class Reduce extends Reducer<Text, Text, Text, Text> {
                // Every value seen so far is kept in this in-memory set.
                private Set<String> count = new HashSet<String>();

                @Override
                public void reduce(Text key, Iterable<Text> values, Context context)
                                throws IOException, InterruptedException {
                        for (Text value : values) {
                                count.add(value.toString());
                        }
                        context.write(new Text("uniq"), new Text(count.size() + ""));
                }
        }

#9

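Quoting reply #1 by tntzbzc:

In map, use the first column as the key; the value doesn't matter.
In the reducer class, initialize a counter.
Increment the counter by one on each call to the reduce method.
In the reducer's cleanup method, write the counter out, and that's it.

Thanks. One question: I see how to write the map you describe, but how should the reduce be written?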

#10

I'll write up a complete example for you a bit later.

#11

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class wzl189_distinct {

	public static class MyMapper extends
			Mapper<Object, Text, Text, NullWritable> {

		// Reused for every record instead of allocating a new Text each time.
		Text outKey = new Text();

		@Override
		public void map(Object key, Text value, Context context)
				throws IOException, InterruptedException {
			// Each input line is expected to hold two columns separated by one space.
			String[] tmp = value.toString().split(" ");
			if (tmp.length != 2)
				return; // skip malformed lines
			// Emit the first column as the key; the value carries no information.
			outKey.set(tmp[0]);
			context.write(outKey, NullWritable.get());
		}
	}

	public static class MyReducer extends
			Reducer<Text, NullWritable, LongWritable, NullWritable> {

		// Number of distinct keys seen by this reducer.
		long myCount = 0L;

		@Override
		public void reduce(Text key, Iterable<NullWritable> values,
				Context context) throws IOException, InterruptedException {
			// reduce() is called once per distinct key, so counting the calls is enough.
			++myCount;
		}

		@Override
		public void cleanup(Context context) throws IOException,
				InterruptedException {
			// Write the total once, after all keys have been processed.
			context.write(new LongWritable(myCount), NullWritable.get());
		}
	}

	public static void main(String[] args) throws Exception {
		Configuration conf = new Configuration();
		if (args.length != 2) {
			System.err.println("Usage: wzl189_distinct <in> <out>");
			System.exit(2);
		}

		// The original passed -Xmx twice; only one heap limit is kept here.
		conf.set("mapred.child.java.opts", "-Xmx1024m");

		Job job = Job.getInstance(conf, "wzl189_distinct");
		// A single reducer, so that one counter sees every distinct key.
		job.setNumReduceTasks(1);
		job.setInputFormatClass(TextInputFormat.class);
		job.setJarByClass(wzl189_distinct.class);
		job.setMapperClass(MyMapper.class);

		job.setMapOutputKeyClass(Text.class);
		job.setMapOutputValueClass(NullWritable.class);

		job.setReducerClass(MyReducer.class);
		// The reducer writes a LongWritable key, so declare that as the output key class.
		job.setOutputKeyClass(LongWritable.class);
		job.setOutputValueClass(NullWritable.class);

		FileInputFormat.addInputPath(job, new Path(args[0]));
		FileOutputFormat.setOutputPath(job, new Path(args[1]));
		System.exit(job.waitForCompletion(true) ? 0 : 1);
	}
}

In the reduce phase a single counter is all you need.

#12

In reply to #11 by tntzbzc:

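Thank you so much, you really know a lot. I had been at this for two weeks with nothing to show for it. I'd like to ask one last question.
Suppose the first column is a name and the second column is a class (never mind for now whether this requirement makes sense):
john 100
john 100
mary 100
mary 200
tom 200

I want to produce the following result, that is, the number of distinct students in each class:
100 2
200 2

How would this MapReduce job be written? I hope an expert can answer one more time. Many thanks.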

#13

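Have map emit a key made of class + separator + name.
Override the grouping comparator to get a secondary sort; if the number of reducers is > 1 you also need to override the partitioner.
Modify reduce slightly: add a variable holding the name, compare the current name with the previous one, and if they differ, increment the counter.

I won't post the code. OP, think it over a bit; a simple MR job like this isn't hard to solve.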
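
For readers who want to see the idea in motion, here is a plain-Java sketch that only simulates those steps in memory. It is not the Hadoop job itself: in a real job the sorting and grouping would be done by the framework through a custom grouping comparator and partitioner. The tab separator and the names `DistinctPerClassSketch` and `countDistinctNames` are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class DistinctPerClassSketch {

    public static Map<String, Integer> countDistinctNames(List<String> lines) {
        // "map": build composite keys of the form class + separator + name.
        List<String> keys = new ArrayList<>();
        for (String line : lines) {
            String[] cols = line.trim().split("\\s+");
            if (cols.length != 2) {
                continue; // skip malformed lines
            }
            keys.add(cols[1] + "\t" + cols[0]);
        }

        // "shuffle": sorting puts equal classes together, and equal names next to each other.
        Collections.sort(keys);

        // "reduce": walk the sorted keys and count a name only when it differs from the previous one.
        Map<String, Integer> result = new LinkedHashMap<>();
        String prevClass = null;
        String prevName = null;
        for (String key : keys) {
            String[] parts = key.split("\t", 2);
            String cls = parts[0];
            String name = parts[1];
            if (!cls.equals(prevClass) || !name.equals(prevName)) {
                result.merge(cls, 1, Integer::sum);
            }
            prevClass = cls;
            prevName = name;
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList(
                "john 100", "john 100", "mary 100", "mary 200", "tom 200");
        for (Map.Entry<String, Integer> e : countDistinctNames(input).entrySet()) {
            System.out.println(e.getKey() + " " + e.getValue()); // prints "100 2" then "200 2"
        }
    }
}
```

Because only the previous name is remembered, memory use stays constant no matter how many names a class has, which is the point of sorting by the composite key first.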

#14

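A question: what if there are 5 columns of data and I want to deduplicate on three of them? Columns are separated by \t and rows by \n. I don't know how to write the map, that is, how to pull out the three columns.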
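
One possible way to build such a key, offered as a sketch rather than a definitive answer: split the line on tabs, pick the wanted columns, and join them into a single string that the map method could then emit as its output key. The helper name `compositeKey` and the column indexes 0, 2 and 4 used below are hypothetical; the question does not say which three columns are meant.

```java
public class CompositeKeySketch {

    // Joins the selected tab-separated columns into one key; returns null for short lines.
    public static String compositeKey(String line, int... columns) {
        // The -1 limit keeps trailing empty columns instead of dropping them.
        String[] fields = line.split("\t", -1);
        StringBuilder key = new StringBuilder();
        for (int i = 0; i < columns.length; i++) {
            if (columns[i] < 0 || columns[i] >= fields.length) {
                return null; // malformed line: the caller should skip it
            }
            if (i > 0) {
                key.append('\t');
            }
            key.append(fields[columns[i]]);
        }
        return key.toString();
    }

    public static void main(String[] args) {
        // Two rows that differ only in columns 1 and 3 produce the same key.
        String a = compositeKey("x\t1\ty\t2\tz", 0, 2, 4);
        String b = compositeKey("x\t9\ty\t8\tz", 0, 2, 4);
        System.out.println(a.equals(b)); // prints true
    }
}
```

Rows that agree on the chosen columns then land on the same key and are grouped together before reduce, just as in the single-column case.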

#15

Nice, thanks a lot.
