
[Big Data Development] Flink DataSet Transformations 2 (Part 9)



01

Transform: Reduce

Merges the elements of a dataset into a single element by combining them pairwise; it can be applied to the whole dataset.


Merging the data

val dataSet = env.fromElements(1,2,3,4)
val results = dataSet.reduce((x,y) => (x+y))


//Java
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.common.functions.ReduceFunction;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.util.Collector;


public class ReduceJavaDemo {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        DataSet<String> source = env.fromElements(
                "spark hbase java",
                "java spark hive",
                "java hbase hbase"
        );

        // Anonymous-class version:
        // DataSet<Tuple2<String, Integer>> result = source
        //         .flatMap(new FlatMapFunction<String, Tuple2<String, Integer>>() {
        //             @Override
        //             public void flatMap(String line, Collector<Tuple2<String, Integer>> collector) throws Exception {
        //                 for (String word : line.toUpperCase().split(" ")) {
        //                     collector.collect(new Tuple2<>(word, 1));
        //                 }
        //             }
        //         })
        //         .groupBy("f0")
        //         .reduce(new ReduceFunction<Tuple2<String, Integer>>() {
        //             @Override
        //             public Tuple2<String, Integer> reduce(Tuple2<String, Integer> t1, Tuple2<String, Integer> t2) throws Exception {
        //                 return new Tuple2<>(t1.f0, t1.f1 + t2.f1);
        //             }
        //         });

        // Lambda version; returns(...) restores the tuple type lost to erasure.
        DataSet<Tuple2<String, Integer>> result = source
                .flatMap((String line, Collector<Tuple2<String, Integer>> collector) -> {
                    for (String word : line.toUpperCase().split(" ")) {
                        collector.collect(new Tuple2<>(word, 1));
                    }
                }).returns(Types.TUPLE(Types.STRING, Types.INT))
                .groupBy("f0")
                .reduce((x, y) -> new Tuple2<>(x.f0, x.f1 + y.f1));
        result.print();
    }
}

//Scala
import org.apache.flink.api.scala.{ExecutionEnvironment, _}
import org.apache.flink.util.Collector


object ReduceScalaDemo {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment

    val source = env.fromElements(
      "spark hbase java",
      "java spark hive",
      "java hbase hbase"
    )

    source
      .flatMap((line: String, collector: Collector[(String, Int)]) => {
        line.toUpperCase.split(" ").foreach(word => collector.collect((word, 1)))
      })
      .groupBy("_1")
      .reduce((x, y) => (x._1, x._2 + y._2))
      .print()
  }
}




02

Transform: ReduceGroup

Merges a group of elements into one or more elements; it can be applied to the whole dataset. This is a slight optimization over a plain reduce.


Optimized merging

val dataSet = env.fromElements(1,2,3,4)
val results = dataSet.reduceGroup(in => in.reduce((x, y) => x + y))


//Java
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.common.functions.GroupReduceFunction;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.util.Collector;


public class ReduceGroupJavaDemo {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        DataSet<String> source = env.fromElements(
                "spark hbase java",
                "java spark hive",
                "java hbase hbase"
        );

        // Anonymous-class version:
        // DataSet<Tuple2<String, Integer>> result = source
        //         .flatMap(new FlatMapFunction<String, Tuple2<String, Integer>>() {
        //             @Override
        //             public void flatMap(String line, Collector<Tuple2<String, Integer>> collector) throws Exception {
        //                 for (String word : line.toUpperCase().split(" ")) {
        //                     collector.collect(new Tuple2<>(word, 1));
        //                 }
        //             }
        //         })
        //         .groupBy("f0")
        //         .reduceGroup(new GroupReduceFunction<Tuple2<String, Integer>, Tuple2<String, Integer>>() {
        //             @Override
        //             public void reduce(Iterable<Tuple2<String, Integer>> iterable, Collector<Tuple2<String, Integer>> collector) throws Exception {
        //                 String word = null;
        //                 int count = 0;
        //                 for (Tuple2<String, Integer> tuple : iterable) {
        //                     word = tuple.f0;
        //                     count += tuple.f1;
        //                 }
        //                 collector.collect(new Tuple2<>(word, count));
        //             }
        //         });

        // Lambda version; each group's records arrive as an Iterable.
        DataSet<Tuple2<String, Integer>> result = source
                .flatMap((String line, Collector<Tuple2<String, Integer>> collector) -> {
                    for (String word : line.toUpperCase().split(" ")) {
                        collector.collect(new Tuple2<>(word, 1));
                    }
                }).returns(Types.TUPLE(Types.STRING, Types.INT))
                .groupBy("f0")
                .reduceGroup((Iterable<Tuple2<String, Integer>> iterable, Collector<Tuple2<String, Integer>> collector) -> {
                    String word = null;
                    int count = 0;
                    for (Tuple2<String, Integer> tuple : iterable) {
                        word = tuple.f0;
                        count += tuple.f1;
                    }
                    collector.collect(new Tuple2<>(word, count));
                }).returns(Types.TUPLE(Types.STRING, Types.INT));
        result.print();
    }
}
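
For comparison with the other sections, here is a minimal Scala sketch of the same word count using reduceGroup (the object name ReduceGroupScalaDemo is chosen here for symmetry with the other demos):

//Scala
import org.apache.flink.api.scala.{ExecutionEnvironment, _}
import org.apache.flink.util.Collector


object ReduceGroupScalaDemo {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment

    val source = env.fromElements(
      "spark hbase java",
      "java spark hive",
      "java hbase hbase"
    )

    source
      .flatMap((line: String, collector: Collector[(String, Int)]) => {
        line.toUpperCase.split(" ").foreach(word => collector.collect((word, 1)))
      })
      .groupBy("_1")
      // The grouped variant receives all records of one key as an Iterator.
      .reduceGroup((iterator: Iterator[(String, Int)], collector: Collector[(String, Int)]) => {
        collector.collect(iterator.reduce((x, y) => (x._1, x._2 + y._2)))
      })
      .print()
  }
}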



03

Transform: CombineGroup

With CombineGroup we can first pre-aggregate on each machine, and then use ReduceGroup to merge the CombineGroup output of every machine. This way ReduceGroup has far less data to consolidate, which speeds up the computation.


Optimized merging

.groupBy("_1")
.combineGroup((words , out : Collector[(String , Int)]) =>{
out.collect(words reduce((x,y)=> (х. _1,x. _2+y. 2))
})
.groupBy("_1")
.reduceGroup(x => x.reduce((x,y)=> (x.1,x._2+y._2))


//Java
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.common.functions.GroupCombineFunction;
import org.apache.flink.api.common.functions.GroupReduceFunction;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.util.Collector;


public class CombineGroupJavaDemo {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        DataSet<String> source = env.fromElements(
                "spark hbase java",
                "java spark hive",
                "java hbase hbase"
        );

        // Anonymous-class version:
        // DataSet<Tuple2<String, Integer>> result = source
        //         .flatMap(new FlatMapFunction<String, Tuple2<String, Integer>>() {
        //             @Override
        //             public void flatMap(String line, Collector<Tuple2<String, Integer>> collector) throws Exception {
        //                 for (String word : line.toUpperCase().split(" ")) {
        //                     collector.collect(new Tuple2<>(word, 1));
        //                 }
        //             }
        //         })
        //         // Pre-aggregate locally on each machine.
        //         .groupBy("f0")
        //         .combineGroup(new GroupCombineFunction<Tuple2<String, Integer>, Tuple2<String, Integer>>() {
        //             @Override
        //             public void combine(Iterable<Tuple2<String, Integer>> iterable, Collector<Tuple2<String, Integer>> collector) throws Exception {
        //                 String word = null;
        //                 int count = 0;
        //                 for (Tuple2<String, Integer> tuple : iterable) {
        //                     word = tuple.f0;
        //                     count += tuple.f1;
        //                 }
        //                 collector.collect(new Tuple2<>(word, count));
        //             }
        //         })
        //         .groupBy("f0")
        //         .reduceGroup(new GroupReduceFunction<Tuple2<String, Integer>, Tuple2<String, Integer>>() {
        //             @Override
        //             public void reduce(Iterable<Tuple2<String, Integer>> iterable, Collector<Tuple2<String, Integer>> collector) throws Exception {
        //                 String word = null;
        //                 int count = 0;
        //                 for (Tuple2<String, Integer> tuple : iterable) {
        //                     word = tuple.f0;
        //                     count += tuple.f1;
        //                 }
        //                 collector.collect(new Tuple2<>(word, count));
        //             }
        //         });

        // Lambda version:
        DataSet<Tuple2<String, Integer>> result = source
                .flatMap((String line, Collector<Tuple2<String, Integer>> collector) -> {
                    for (String word : line.toUpperCase().split(" ")) {
                        collector.collect(new Tuple2<>(word, 1));
                    }
                }).returns(Types.TUPLE(Types.STRING, Types.INT))
                // Pre-aggregate locally on each machine.
                .groupBy("f0")
                .combineGroup((Iterable<Tuple2<String, Integer>> iterable, Collector<Tuple2<String, Integer>> collector) -> {
                    String word = null;
                    int count = 0;
                    for (Tuple2<String, Integer> tuple : iterable) {
                        word = tuple.f0;
                        count += tuple.f1;
                    }
                    collector.collect(new Tuple2<>(word, count));
                }).returns(Types.TUPLE(Types.STRING, Types.INT))
                // Final aggregation across machines.
                .groupBy("f0")
                .reduceGroup((Iterable<Tuple2<String, Integer>> iterable, Collector<Tuple2<String, Integer>> collector) -> {
                    String word = null;
                    int count = 0;
                    for (Tuple2<String, Integer> tuple : iterable) {
                        word = tuple.f0;
                        count += tuple.f1;
                    }
                    collector.collect(new Tuple2<>(word, count));
                }).returns(Types.TUPLE(Types.STRING, Types.INT));

        result.print();
    }
}

//Scala
import org.apache.flink.api.scala.{ExecutionEnvironment,_}
import org.apache.flink.util.Collector


object CombineGroupScalaDemo {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment

    val source = env.fromElements(
      "spark hbase java",
      "java spark hive",
      "java hbase hbase"
    )

    source
      .flatMap((line: String, collector: Collector[(String, Int)]) => {
        line.toUpperCase.split(" ").foreach(word => collector.collect((word, 1)))
      })
      // Pre-aggregate locally on each machine.
      .groupBy("_1")
      .combineGroup((iterator, combineCollector: Collector[(String, Int)]) => {
        combineCollector.collect(iterator.reduce((x, y) => (x._1, x._2 + y._2)))
      })
      // Final aggregation across machines.
      .groupBy("_1")
      .reduceGroup((iterator, collector: Collector[(String, Int)]) => {
        collector.collect(iterator.reduce((x, y) => (x._1, x._2 + y._2)))
      }).print()
  }
}



04

Transform: Aggregate

Merges a group of element values into a single value via an Aggregate Function; it can be applied to the whole DataSet.


Taking the maximum of the aggregated values

.group("_1").aggregate(Aggregations.SUM,1)
.group("_1").aggregate(Aggregations.SUM,1).max(1)
.group("_1").aggregate(Aggregations.SUM,1).maxby(1)


//Java
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.aggregation.Aggregations;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.util.Collector;


public class AggregateJavaDemo {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        DataSet<String> source = env.fromElements(
                "spark hbase java",
                "java spark hive",
                "java hbase hbase"
        );

        DataSet<Tuple2<String, Integer>> result = source
                .flatMap(new FlatMapFunction<String, Tuple2<String, Integer>>() {
                    @Override
                    public void flatMap(String s, Collector<Tuple2<String, Integer>> collector) throws Exception {
                        for (String word : s.toUpperCase().split(" ")) {
                            collector.collect(new Tuple2<>(word, 1));
                        }
                    }
                })
                .groupBy("f0")
                .aggregate(Aggregations.SUM, 1);
        result.print();
    }
}
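
For comparison, a minimal Scala sketch of the same Aggregate word count (the object name AggregateScalaDemo is chosen here for symmetry; note that Aggregations is imported from the Java API package even in Scala jobs):

//Scala
import org.apache.flink.api.java.aggregation.Aggregations
import org.apache.flink.api.scala.{ExecutionEnvironment, _}
import org.apache.flink.util.Collector


object AggregateScalaDemo {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment

    val source = env.fromElements(
      "spark hbase java",
      "java spark hive",
      "java hbase hbase"
    )

    source
      .flatMap((line: String, collector: Collector[(String, Int)]) => {
        line.toUpperCase.split(" ").foreach(word => collector.collect((word, 1)))
      })
      .groupBy("_1")
      // Sum field 1 (the count) per word.
      .aggregate(Aggregations.SUM, 1)
      .print()
  }
}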
