Aggregation-style computation: Redis version
Non-aggregation computation: HBase version, using Phoenix to query the maximum offset
Standalone Redis supports transactions, but Redis Cluster does not
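For reference, a transaction on standalone Redis can also be opened directly on a Jedis connection, without a pipeline. The sketch below is an illustrative assumption (not part of the original code), reusing the same host, port and password as the code that follows; JedisTransactionTest below queues the same commands through a pipeline instead, with the same semantics.

import redis.clients.jedis.{Jedis, Transaction}

object SimpleJedisTransaction {
  def main(args: Array[String]): Unit = {
    val jedis = new Jedis("linux03", 6379)
    jedis.auth("123456")
    //MULTI: the commands queued on the Transaction object are applied atomically on EXEC
    val tx: Transaction = jedis.multi()
    try {
      tx.hincrBy("wc", "aaa", 18)
      tx.hset("dot", "bbb", "19")
      tx.exec()
    } catch {
      case e: Exception =>
        e.printStackTrace()
        tx.discard() //abandon the queued commands
    } finally {
      jedis.close()
    }
  }
}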
object JedisTransactionTest {
  def main(args: Array[String]): Unit = {
    //connect to Redis
    val jedis = new Jedis("linux03", 6379)
    jedis.auth("123456")
    //open a jedis pipeline (the queued commands are sent as one batch)
    var pipeline: Pipeline = null
    try {
      pipeline = jedis.pipelined()
      //start a transaction spanning multiple operations
      pipeline.multi()
      //HINCRBY adds the given increment to a field of a hash
      pipeline.hincrBy("wc", "aaa", 18)
      //val i = 1 / 0
      //HSET sets field "bbb" of the hash "dot" to "19":
      //if the key does not exist, a new hash is created before the HSET;
      //if the field already exists, its old value is overwritten
      pipeline.hset("dot", "bbb", "19")
      //commit the transaction, then flush the pipeline
      pipeline.exec()
      pipeline.sync()
    } catch {
      case e: Exception => {
        e.printStackTrace()
        //roll back the transaction
        pipeline.discard()
      }
    } finally {
      pipeline.close()
      jedis.close()
    }
  }
}
//Jedis connection pool
object JedisConnectionPool {
  val config = new JedisPoolConfig()
  config.setMaxTotal(5)
  config.setMaxIdle(2)
  config.setTestOnBorrow(true)
  val pool = new JedisPool(config, "linux03", 6379, 10000, "123456")
  def getConnection(): Jedis = {
    pool.getResource
  }
}
//query the history offsets saved in Redis
object OffsetUtils {
  def queryHistoryOffsetFromRedis(appName: String, groupId: String): Map[TopicPartition, Long] = {
    val historyOffset = new mutable.HashMap[TopicPartition, Long]()
    val jedis = JedisConnectionPool.getConnection()
    val res: util.Map[String, String] = jedis.hgetAll(appName + "_" + groupId)
    import scala.collection.JavaConverters._
    //each field is stored as "topic_partition", its value is the committed offset
    for (t <- res.asScala) {
      val fields = t._1.split("_")
      historyOffset.put(new TopicPartition(fields(0), fields(1).toInt), t._2.toLong)
    }
    jedis.close()
    historyOffset.toMap
  }
}
//Exactly-once semantics: every record is processed, and its result is applied exactly once
object KafkaWordCountToRedis {
  def main(args: Array[String]): Unit = {
    val appName = this.getClass.getSimpleName
    val groupId = "g008"
    val conf = new SparkConf().setAppName(appName).setMaster("local[*]")
    val ssc = new StreamingContext(conf, Seconds(5))
    ssc.sparkContext.setLogLevel("WARN")
    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "linux03:9092,linux04:9092,linux05:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> groupId,
      "auto.offset.reset" -> "earliest",
      "enable.auto.commit" -> (false: java.lang.Boolean) //do not auto-commit offsets
    )
    val topics = Array("wordcount18")
    val historyOffset: Map[TopicPartition, Long] = OffsetUtils.queryHistoryOffsetFromRedis(appName, groupId)
    //direct mode: the Spark Streaming tasks that read the data (the Kafka consumers) connect straight to the leader partitions of the topic
    val lines: InputDStream[ConsumerRecord[String, String]] = KafkaUtils.createDirectStream(
      ssc,
      LocationStrategies.PreferConsistent, //location strategy
      ConsumerStrategies.Subscribe[String, String](topics, kafkaParams, historyOffset) //consumer strategy
    )
    //operate on the first DStream returned by createDirectStream, because only this first-hand DStream carries the offsets
    lines.foreachRDD(rdd => {
      //only if new data was written to Kafka during this batch interval
      if (!rdd.isEmpty()) {
        //get the offset ranges of this batch
        val offsetRanges: Array[OffsetRange] = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
        val lines: RDD[String] = rdd.map(_.value())
        val reduced: RDD[(String, Int)] = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
        //collect the aggregated result to the Driver and write it to Redis from there
        val res: Array[(String, Int)] = reduced.collect()
        val jedis = JedisConnectionPool.getConnection()
        var pipeline: Pipeline = null
        try {
          pipeline = jedis.pipelined()
          pipeline.multi()
          //to achieve exactly-once, the result and the offsets are written to Redis within the same transaction
          res.foreach(t => {
            pipeline.hincrBy("dot", t._1, t._2)
          })
          offsetRanges.foreach(o => {
            val topic = o.topic
            val partition = o.partition
            val untilOffset = o.untilOffset
            pipeline.hset(appName + "_" + groupId, topic + "_" + partition, untilOffset.toString)
          })
          pipeline.exec()
          pipeline.sync()
        } catch {
          case e: Exception => {
            e.printStackTrace()
            pipeline.discard()
          }
        } finally {
          pipeline.close()
          jedis.close()
        }
      }
    })
    ssc.start()
    ssc.awaitTermination()
  }
}
HBase only supports row-level transactions.
The data in Kafka already carries a unique id, which can be used as the HBase rowkey.
The last cells of each row record the offset, so on restart there is no need to re-read the data: it is enough to look up the largest offset per partition within a group.
HBase, however, does not support group-by queries.
Solutions:
1. A custom coprocessor
2. Using Phoenix to run the SQL query
If the consumer re-reads some records, the rowkey is still unique, so the re-written rows simply overwrite the earlier ones (the writes are idempotent).
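The write path below assumes an HBase table t_orders with a "data" column family for the payload and an "offset" column family for the bookkeeping columns that the Phoenix view maps. A minimal one-off setup sketch, assuming the HBase 2.x client API (the table and column-family names come from the code in this section; everything else is an assumption):

import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ColumnFamilyDescriptorBuilder, ConnectionFactory, TableDescriptorBuilder}

object CreateOrdersTable {
  def main(args: Array[String]): Unit = {
    val conf = HBaseConfiguration.create()
    conf.set("hbase.zookeeper.quorum", "linux03,linux04,linux05")
    val connection = ConnectionFactory.createConnection(conf)
    val admin = connection.getAdmin
    //one column family for the order data, one for the saved offsets
    val tableDesc = TableDescriptorBuilder.newBuilder(TableName.valueOf("t_orders"))
      .setColumnFamily(ColumnFamilyDescriptorBuilder.of("data"))
      .setColumnFamily(ColumnFamilyDescriptorBuilder.of("offset"))
      .build()
    admin.createTable(tableDesc)
    admin.close()
    connection.close()
  }
}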
//another OffsetUtils method: every time the program starts, query the history offsets from HBase (via Phoenix)
def queryHistoryOffsetFromHbase(appid: String, groupid: String): Map[TopicPartition, Long] = {
  val offsets = new mutable.HashMap[TopicPartition, Long]()
  val connection = DriverManager.getConnection("jdbc:phoenix:linux03,linux04,linux05:2181")
  //group by and take max, i.e. the latest (largest) offset of every partition
  val ps = connection.prepareStatement(
    "select \"topic_partition\", max(\"offset\") from \"t_orders\" where \"appid_groupid\" = ? group by \"topic_partition\"")
  ps.setString(1, appid + "_" + groupid)
  //run the query
  val rs: ResultSet = ps.executeQuery()
  while (rs.next()) {
    val topicAndPartition = rs.getString(1)
    val fields = topicAndPartition.split("_")
    val topic = fields(0)
    val partition = fields(1).toInt
    val offset = rs.getLong(2)
    offsets.put(new TopicPartition(topic, partition), offset)
  }
  rs.close()
  ps.close()
  connection.close()
  offsets.toMap
}
/**
* https://www.jianshu.com/p/f1340eaa3e06
*
* spark.task.maxFailures
* yarn.resourcemanager.am.max-attempts
* spark.speculation
*
* create view "t_orders" (pk VARCHAR PRIMARY KEY, "offset"."appid_groupid" VARCHAR, "offset"."topic_partition" VARCHAR, "offset"."offset" UNSIGNED_LONG);
* select max("offset") from "t_orders" where "appid_groupid" = 'g1' group by "topic_partition";
*
*/
object KafkaToHbase {
  def main(args: Array[String]): Unit = {
    //example arguments: true a1 g1 ta,tb
    val Array(isLocal, appName, groupId, allTopics) = args
    val conf = new SparkConf()
      .setAppName(appName)
    if (isLocal.toBoolean) {
      conf.setMaster("local[*]")
    }
    val sc = new SparkContext(conf)
    sc.setLogLevel("WARN")
    val ssc: StreamingContext = new StreamingContext(sc, Milliseconds(5000))
    val topics = allTopics.split(",")
    //parameters for integrating Spark Streaming with Kafka
    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "linux03:9092,linux04:9092,linux05:9092",
      "key.deserializer" -> classOf[StringDeserializer].getName,
      "value.deserializer" -> "org.apache.kafka.common.serialization.StringDeserializer",
      "group.id" -> groupId,
      "auto.offset.reset" -> "earliest", //if no offsets are recorded, start from the beginning; otherwise continue from the recorded offsets
      "enable.auto.commit" -> (false: java.lang.Boolean) //the consumer does not auto-commit offsets
    )
    //query the history offsets [the offsets of the last successful write to the database]
    val historyOffsets: Map[TopicPartition, Long] = OffsetUtils.queryHistoryOffsetFromHbase(appName, groupId)
    //integrating with Kafka requires the spark-streaming-kafka dependency
    //createDirectStream is more efficient: it uses Kafka's low-level consumer API and the consumers connect directly to the leader partitions
    //in direct mode the number of RDD partitions equals the number of Kafka partitions [one-to-one]
    val kafkaDStream: InputDStream[ConsumerRecord[String, String]] = KafkaUtils.createDirectStream[String, String](
      ssc,
      LocationStrategies.PreferConsistent, //schedule the tasks onto the nodes where Kafka lives
      ConsumerStrategies.Subscribe[String, String](topics, kafkaParams, historyOffsets) //subscription rule: continue reading from the history offsets
    )
    // var offsetRanges: Array[OffsetRange] = null
    //
    // val ds2 = kafkaDStream.map(rdd => {
    //   //get the offsets of the KafkaRDD (on the Driver)
    //   offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
    //   rdd.value()
    // })
    kafkaDStream.foreachRDD(rdd => {
      if (!rdd.isEmpty()) {
        //get the offsets of the KafkaRDD (on the Driver)
        val offsetRanges: Array[OffsetRange] = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
        //get the data out of the KafkaRDD
        val lines: RDD[String] = rdd.map(_.value())
        val orderRDD: RDD[Order] = lines.map(line => {
          var order: Order = null
          try {
            order = JSON.parseObject(line, classOf[Order])
          } catch {
            case e: JSONException => {
              //TODO
            }
          }
          order
        })
        //filter out malformed records
        val filtered: RDD[Order] = orderRDD.filter(_ != null)
        //trigger an action
        filtered.foreachPartition(iter => {
          if (iter.nonEmpty) {
            //save the data of this RDD partition to HBase
            //create an HBase connection
            val connection: Connection = HBaseUtil.getConnection("linux03,linux04,linux05", 2181)
            val htable = connection.getTable(TableName.valueOf("t_orders"))
            //an ArrayList with a fixed capacity, used to batch the Puts
            val puts = new util.ArrayList[Put](100)
            //the partition id of the current task indexes into the offsetRanges array, giving this partition's offset range
            val offsetRange: OffsetRange = offsetRanges(TaskContext.get.partitionId()) //read the offsets on the Executor side
            //iterate over every record in the partition
            iter.foreach(order => {
              //take the fields we need
              val oid = order.oid
              val money = order.money
              //wrap them into a Put
              val put = new Put(Bytes.toBytes(oid))
              //"data" column family for the payload, "offset" column family for the offsets
              put.addColumn(Bytes.toBytes("data"), Bytes.toBytes("money"), Bytes.toBytes(money))
              //put.addColumn(Bytes.toBytes("data"), Bytes.toBytes("province"), Bytes.toBytes(province))
              //for the last record of the partition, also store the offset into the "offset" column family
              if (!iter.hasNext) {
                put.addColumn(Bytes.toBytes("offset"), Bytes.toBytes("appid_groupid"), Bytes.toBytes(appName + "_" + groupId))
                put.addColumn(Bytes.toBytes("offset"), Bytes.toBytes("topic_partition"), Bytes.toBytes(offsetRange.topic + "_" + offsetRange.partition))
                put.addColumn(Bytes.toBytes("offset"), Bytes.toBytes("offset"), Bytes.toBytes(offsetRange.untilOffset))
              }
              //add the Put to the batch
              puts.add(put)
              //flush the batch once it reaches the threshold
              if (puts.size() == 100) {
                htable.put(puts)
                puts.clear() //empty the batch
              }
            })
            //write whatever is left over from an incomplete batch
            htable.put(puts)
            //release the resources
            htable.close()
            connection.close()
          }
        })
      }
    })
    ssc.start()
    ssc.awaitTermination()
  }
}
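KafkaToHbase also relies on an Order class and an HBaseUtil helper that are not shown in this section. The sketch below is only a guess at their shape, inferred from how they are used above (fastjson needs a bean-style class with a no-arg constructor and getters/setters, and the helper presumably just wraps ConnectionFactory); the field and parameter names are assumptions.

import scala.beans.BeanProperty
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.{Connection, ConnectionFactory}

//hypothetical bean matching JSON.parseObject(line, classOf[Order])
class Order extends Serializable {
  @BeanProperty var oid: String = _
  @BeanProperty var money: Double = _
}

//hypothetical helper behind HBaseUtil.getConnection(zkQuorum, zkPort)
object HBaseUtil {
  def getConnection(zkQuorum: String, zkPort: Int): Connection = {
    val conf = HBaseConfiguration.create()
    conf.set("hbase.zookeeper.quorum", zkQuorum)
    conf.set("hbase.zookeeper.property.clientPort", zkPort.toString)
    ConnectionFactory.createConnection(conf)
  }
}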