1. Preface
I originally planned to write up Flink deployment options (standalone, YARN, Kubernetes), but that document lives on my company's intranet and cannot be taken outside. So I will skip deployment for now; if you have deployment-related questions, feel free to leave a comment or message me. Let's go straight to Flink-Connector-Kafka.
2. Overview
Flink provides a Kafka connector for reading from and writing to Kafka topics. The Flink Kafka consumer integrates with Flink's checkpointing mechanism to provide exactly-once semantics for Flink's own processing (note: this is not an end-to-end guarantee). To achieve this, Flink does not rely on Kafka's consumer-group offset tracking; instead, it tracks and checkpoints the offsets internally.
3. FlinkKafkaConsumer
Flink's Kafka consumer is called FlinkKafkaConsumer08 (or FlinkKafkaConsumer09 for Kafka 0.9.0.x, etc., or simply FlinkKafkaConsumer for Kafka >= 1.0.0). It can consume one or more topics. The constructor takes at least three arguments: the topic(s) to read, a DeserializationSchema, and a Properties object. The code below shows it directly (note: zookeeper.connect is only required for Kafka 0.8).
Note: of course, any configuration option supported by the Kafka consumer can also be set in these properties.
Properties properties = new Properties();
properties.setProperty("bootstrap.servers", "localhost:9092");
// only required for Kafka 0.8
properties.setProperty("zookeeper.connect", "localhost:2181");
properties.setProperty("group.id", "test");

DataStream<String> stream = env
    .addSource(new FlinkKafkaConsumer08<>("topic", new SimpleStringSchema(), properties));
The consumer reads binary byte messages from Kafka, which have to be deserialized into Flink Java/Scala objects. The DeserializationSchema interface allows the user to specify such a schema; its deserialize(byte[] message) method is called for every Kafka message to perform the deserialization.
The full deserialization interface:
package org.apache.flink.api.common.serialization;
import org.apache.flink.annotation.Public;
import org.apache.flink.api.java.typeutils.ResultTypeQueryable;
import java.io.IOException;
import java.io.Serializable;
/**
* The deserialization schema describes how to turn the byte messages delivered by certain
* data sources (for example Apache Kafka) into data types (Java/Scala objects) that are
* processed by Flink.
*
* In addition, the DeserializationSchema describes the produced type ({@link #getProducedType()}),
* which lets Flink create internal serializers and structures to handle the type.
*
* Note: In most cases, one should start from {@link AbstractDeserializationSchema}, which
* takes care of producing the return type information automatically.
*
* A DeserializationSchema must be {@link Serializable} because its instances are often part of
* an operator or transformation function.
*
* @param <T> The type created by the deserialization schema.
*/
@Public
public interface DeserializationSchema<T> extends Serializable, ResultTypeQueryable<T> {
/**
* Deserializes the byte message.
*
* @param message The message, as a byte array.
*
* @return The deserialized message as an object (null if the message cannot be deserialized).
*/
T deserialize(byte[] message) throws IOException;
/**
* Method to decide whether the element signals the end of the stream. If
* true is returned the element won't be emitted.
*
* @param nextElement The element to test for the end-of-stream signal.
* @return True, if the element signals end of stream, false otherwise.
*/
boolean isEndOfStream(T nextElement);
}
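As the Javadoc note above mentions, in most cases a custom schema can start from AbstractDeserializationSchema, which takes care of producing the return type information automatically. A minimal sketch, assuming a hypothetical MyEvent POJO with a fromBytes factory method (both are placeholders, not part of Flink or the original post):

import java.io.IOException;

import org.apache.flink.api.common.serialization.AbstractDeserializationSchema;

// Only deserialize(byte[]) needs to be implemented; the produced type information
// and isEndOfStream() are handled by the abstract base class.
public class MyEventDeserializationSchema extends AbstractDeserializationSchema<MyEvent> {

    private static final long serialVersionUID = 1L;

    @Override
    public MyEvent deserialize(byte[] message) throws IOException {
        // parse the raw bytes in whatever format the topic uses (JSON, CSV, ...)
        return MyEvent.fromBytes(message);
    }
}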
4. A side note on serialization and deserialization in Flink
flink-core ships with a very simple string serialization/deserialization schema, SimpleStringSchema (UTF-8 by default).
The official demo also includes a KafkaEventSchema class that implements serialization/deserialization for a custom KafkaEvent POJO. In short: if the source data has no fixed format, SimpleStringSchema is all you need; otherwise, implement a POJO for your fixed data format together with a matching schema. The sources of both classes are listed below, SimpleStringSchema first, then KafkaEventSchema.
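Whichever schema you choose, it is passed to the consumer constructor in exactly the same way as SimpleStringSchema was in the earlier example. A small sketch reusing the official demo's KafkaEventSchema (topic name and connection properties are placeholders):

Properties properties = new Properties();
properties.setProperty("bootstrap.servers", "localhost:9092");
properties.setProperty("group.id", "test");

// the stream now carries KafkaEvent objects instead of raw strings
DataStream<KafkaEvent> events = env
    .addSource(new FlinkKafkaConsumer08<>("topic", new KafkaEventSchema(), properties));

The full sources of the two schema classes follow.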
/*
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
* The ASF licenses this file to You under the Apache License, Version 2.0
* (the "License"); you may not use this file except in compliance with
* the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
package org.apache.flink.api.common.serialization;
import org.apache.flink.annotation.PublicEvolving;
import org.apache.flink.api.common.typeinfo.BasicTypeInfo;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import static org.apache.flink.util.Preconditions.checkNotNull;
/**
* Very simple serialization schema for strings.
*
* By default, the serializer uses "UTF-8" for string/byte conversion.
*/
@PublicEvolving
public class SimpleStringSchema implements DeserializationSchema<String>, SerializationSchema<String> {
private static final long serialVersionUID = 1L;
/** The charset to use to convert between strings and bytes.
* The field is transient because we serialize a different delegate object instead */
private transient Charset charset;
/**
* Creates a new SimpleStringSchema that uses "UTF-8" as the encoding.
*/
public SimpleStringSchema() {
this(StandardCharsets.UTF_8);
}
/**
* Creates a new SimpleStringSchema that uses the given charset to convert between strings and bytes.
*
* @param charset The charset to use to convert between strings and bytes.
*/
public SimpleStringSchema(Charset charset) {
this.charset = checkNotNull(charset);
}
/**
* Gets the charset used by this schema for serialization.
* @return The charset used by this schema for serialization.
*/
public Charset getCharset() {
return charset;
}
// ------------------------------------------------------------------------
//  Kafka Serialization
// ------------------------------------------------------------------------
@Override
public String deserialize(byte[] message) {
return new String(message, charset);
}
@Override
public boolean isEndOfStream(String nextElement) {
return false;
}
@Override
public byte[] serialize(String element) {
return element.getBytes(charset);
}
@Override
public TypeInformation<String> getProducedType() {
return BasicTypeInfo.STRING_TYPE_INFO;
}
// ------------------------------------------------------------------------
//  Java Serialization
// ------------------------------------------------------------------------
private void writeObject (ObjectOutputStream out) throws IOException {
out.defaultWriteObject();
out.writeUTF(charset.name());
}
private void readObject(java.io.ObjectInputStream in) throws IOException, ClassNotFoundException {
in.defaultReadObject();
String charsetName = in.readUTF();
this.charset = Charset.forName(charsetName);
}
}
/*
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
* The ASF licenses this file to You under the Apache License, Version 2.0
* (the "License"); you may not use this file except in compliance with
* the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
package org.apache.flink.streaming.examples.kafka;
import org.apache.flink.api.common.serialization.DeserializationSchema;
import org.apache.flink.api.common.serialization.SerializationSchema;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import java.io.IOException;
/**
* The serialization schema for the {@link KafkaEvent} type. This class defines how to transform a
* Kafka record's bytes to a {@link KafkaEvent}, and vice-versa.
*/
public class KafkaEventSchema implements DeserializationSchema<KafkaEvent>, SerializationSchema<KafkaEvent> {
private static final long serialVersionUID = 6154188370181669758L;
@Override
public byte[] serialize(KafkaEvent event) {
return event.toString().getBytes();
}
@Override
public KafkaEvent deserialize(byte[] message) throws IOException {
return KafkaEvent.fromString(new String(message));
}
@Override
public boolean isEndOfStream(KafkaEvent nextElement) {
return false;
}
@Override
public TypeInformation<KafkaEvent> getProducedType() {
return TypeInformation.of(KafkaEvent.class);
}
}

5. Kafka Consumers Start Position Configuration
Not much needs to be said here; the demo makes it clear at a glance:

final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

FlinkKafkaConsumer08<String> myConsumer = new FlinkKafkaConsumer08<>(...);
myConsumer.setStartFromEarliest();     // start from the earliest record possible
myConsumer.setStartFromLatest();       // start from the latest record
myConsumer.setStartFromTimestamp(...); // start from specified epoch timestamp (milliseconds)
myConsumer.setStartFromGroupOffsets(); // the default behaviour

DataStream<String> stream = env.addSource(myConsumer);
6. Kafka Consumers and Fault Tolerance
With Flink checkpointing enabled, the checkpoints store the Kafka offsets together with the state of the other operations in a consistent way. If the job fails, Flink restores the most recent checkpoint and, based on the Kafka offsets recorded in it, resumes consuming from Kafka after restart.

final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

env.enableCheckpointing(5000);
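If more control is needed, the checkpoint mode and pacing can be configured as well. A brief sketch, with illustrative values that are not from the original post:

import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

// illustrative values: checkpoint every 5 seconds in exactly-once mode,
// leaving at least 500 ms between the end of one checkpoint and the start of the next
env.enableCheckpointing(5000, CheckpointingMode.EXACTLY_ONCE);
env.getCheckpointConfig().setMinPauseBetweenCheckpoints(500);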
7. Kafka Consumers Topic and Partition Discovery
Partition discovery is disabled by default. To enable it, set flink.partition-discovery.interval-millis to a non-negative value in the properties config; it specifies the discovery interval in milliseconds.
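A minimal sketch of turning discovery on (the 10000 ms interval is just an example value):

Properties properties = new Properties();
properties.setProperty("bootstrap.servers", "localhost:9092");
properties.setProperty("group.id", "test");
// check for newly created partitions every 10 seconds
properties.setProperty("flink.partition-discovery.interval-millis", "10000");

FlinkKafkaConsumer08<String> myConsumer =
    new FlinkKafkaConsumer08<>("topic", new SimpleStringSchema(), properties);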
8. Kafka Consumers Offset Committing Behaviour Configuration
The Flink Kafka Consumer allows configuring how offsets are committed back to Kafka (or to ZooKeeper for 0.8). Offsets committed this way are not used for fault-tolerance guarantees; they only expose the consumer's progress for monitoring. The committing behaviour works as follows: with checkpointing disabled, offset committing relies on the Kafka consumer's automatic commit (enable.auto.commit, or auto.commit.enable for Kafka 0.8, together with auto.commit.interval.ms); with checkpointing enabled, offsets are committed as part of the checkpoint, controlled by setCommitOffsetsOnCheckpoints(boolean), which is enabled by default.
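For illustration, a minimal sketch of setting this flag explicitly on the consumer (the consumer construction mirrors the earlier examples):

FlinkKafkaConsumer08<String> myConsumer =
    new FlinkKafkaConsumer08<>("topic", new SimpleStringSchema(), properties);

// commit the checkpointed offsets back to Kafka (ZooKeeper for 0.8) so that
// external monitoring can see the consumer's progress; true is the default
myConsumer.setCommitOffsetsOnCheckpoints(true);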
When offsets are committed on checkpoints, the consumer's automatic commit is forcibly disabled. This can be seen in the Flink source:

// make sure that auto commit is disabled when our offset commit mode is ON_CHECKPOINTS;
// this overwrites whatever setting the user configured in the properties
if (offsetCommitMode == OffsetCommitMode.ON_CHECKPOINTS || offsetCommitMode == OffsetCommitMode.DISABLED) {
    properties.setProperty(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");
}

9. The mapping between Flink subtasks and Kafka partitions (an important point)
This point is particularly important and is not explained on the official website. Let's look at the assignment algorithm first:
/**
* @param partition the Kafka partition
* @param numParallelSubtasks total number of parallel subtasks
*
* @return index of the target subtask that the Kafka partition should be assigned to.
*/
public static int assign(KafkaTopicPartition partition, int numParallelSubtasks) {
int startIndex = ((partition.getTopic().hashCode() * 31) & 0x7FFFFFFF) % numParallelSubtasks;
// here, the assumption is that the id of Kafka partitions are always ascending
// starting from 0, and therefore can be used directly as the offset clockwise from the start index
return (startIndex + partition.getPartition()) % numParallelSubtasks;
}

The algorithm is actually quite simple. Take a topic with three partitions and a Flink Kafka consumer parallelism of 3 as an example (with this topic's hash the start index happens to be 1): partition 0 maps to subtask 1, partition 1 maps to subtask 2, and partition 2 maps to subtask 0.

We know that both partitions and subtasks are numbered from 0, so in my opinion the algorithm could just as well be:

return (partition.getPartition()) % numParallelSubtasks;

The best approach is to make the Flink Kafka consumer parallelism equal to the number of Kafka partitions: that keeps consumption efficient without wasting resources.

10. Summary
That covers the basics of using the Flink-connector-kafka consumer. Questions and feedback are welcome.