Contents
- 1. Key Generation
- 1.1 SimpleKeyGenerator
- 1.2 ComplexKeyGenerator
- 1.3 NonPartitionedKeyGenerator
- 1.4 CustomKeyGenerator
- 1.5 TimestampBasedKeyGenerator
- 2. Concurrency Control
1. Key Generation
Hudi provides several key generators. The configuration common to all of them is:

| Config | Purpose |
|---|---|
| hoodie.datasource.write.recordkey.field | The record key field(s) of the data; required |
| hoodie.datasource.write.partitionpath.field | The partition field(s) of the data; required |
| hoodie.datasource.write.keygenerator.class | Fully qualified class name of the key generator; required |
| hoodie.datasource.write.partitionpath.urlencode | Defaults to false; if true, the partition path is URL-encoded |
| hoodie.datasource.write.hive_style_partitioning | Defaults to false, in which case the partition is named only partition_field_value; if true, it is named partition_field_name=partition_field_value |
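As a sketch, these options are typically passed together when writing a DataFrame through the Hudi datasource; the record key field `uuid` and partition field `dt` below are illustrative assumptions, not defaults:

```python
# Illustrative Hudi write options (field names "uuid" and "dt" are
# assumptions chosen for the example, not required by Hudi itself).
hudi_options = {
    "hoodie.datasource.write.recordkey.field": "uuid",
    "hoodie.datasource.write.partitionpath.field": "dt",
    "hoodie.datasource.write.keygenerator.class":
        "org.apache.hudi.keygen.SimpleKeyGenerator",
    "hoodie.datasource.write.partitionpath.urlencode": "false",
    "hoodie.datasource.write.hive_style_partitioning": "true",
}
# Typically used as: df.write.format("hudi").options(**hudi_options)...
```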
1.1 SimpleKeyGenerator
Converts a single column to a string and uses it as the partition path.
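The behavior can be sketched in plain Python (the helper name is hypothetical); with hive-style partitioning enabled, the path gains the field-name prefix:

```python
def simple_partition_path(field_name, value, hive_style=False):
    # SimpleKeyGenerator-style: stringify one column value; with
    # hive_style_partitioning enabled the path becomes "name=value".
    s = str(value)
    return f"{field_name}={s}" if hive_style else s

print(simple_partition_path("dt", "2020-01-06"))        # 2020-01-06
print(simple_partition_path("dt", "2020-01-06", True))  # dt=2020-01-06
```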
1.2 ComplexKeyGenerator
Both recordkey and partitionpath can take one or more fields as the key, with multiple fields separated by commas. For example: "hoodie.datasource.write.recordkey.field" : "col1,col3"
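A minimal sketch of composing a record key from several fields (the helper is hypothetical; composite keys render as field:value pairs joined by commas):

```python
def complex_record_key(row, key_fields):
    # Join each configured key field as "field:value", comma-separated.
    return ",".join(f"{f}:{row[f]}" for f in key_fields)

row = {"col1": "value1", "col2": "x", "col3": "value3"}
print(complex_record_key(row, ["col1", "col3"]))  # col1:value1,col3:value3
```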
1.3 NonPartitionedKeyGenerator
If the table is not partitioned, use NonPartitionedKeyGenerator, which generates an empty ("") partition.
1.4 CustomKeyGenerator
Allows SimpleKeyGenerator, ComplexKeyGenerator, and TimestampBasedKeyGenerator to be used at the same time:
hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.CustomKeyGenerator
- Specify the record key, in either SimpleKeyGenerator or ComplexKeyGenerator form:
hoodie.datasource.write.recordkey.field=col1,col3
The generated record key has the format col1:value1,col3:value3
- Specify the partition path in the format "field1:PartitionKeyType1,field2:PartitionKeyType2,…"; the allowed PartitionKeyType values are simple and timestamp:
hoodie.datasource.write.partitionpath.field=col2:simple,col4:timestamp
The partition path created on HDFS is value2/value4
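The combined behavior can be sketched as follows (the helper is hypothetical; treating the timestamp field as epoch milliseconds and using the output format "yyyyMMdd" are assumptions made for the example):

```python
from datetime import datetime, timezone

def custom_partition_path(row, spec):
    # spec looks like "col2:simple,col4:timestamp": simple fields are
    # stringified, timestamp fields are formatted from epoch millis (UTC).
    parts = []
    for item in spec.split(","):
        field, ptype = item.split(":")
        if ptype == "simple":
            parts.append(str(row[field]))
        elif ptype == "timestamp":
            ts = datetime.fromtimestamp(row[field] / 1000, tz=timezone.utc)
            parts.append(ts.strftime("%Y%m%d"))
    return "/".join(parts)

row = {"col2": "value2", "col4": 1578283932000}
print(custom_partition_path(row, "col2:simple,col4:timestamp"))  # value2/20200106
```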
1.5 TimestampBasedKeyGenerator
This key generator is used for the partition field. It requires the following configuration:

| Config | Purpose |
|---|---|
| hoodie.deltastreamer.keygen.timebased.timestamp.type | One of UNIX_TIMESTAMP, DATE_STRING, MIXED, EPOCHMILLISECONDS, SCALAR |
| hoodie.deltastreamer.keygen.timebased.output.dateformat | Output date format |
| hoodie.deltastreamer.keygen.timebased.timezone | Timezone of the date format |
| hoodie.deltastreamer.keygen.timebased.input.dateformat | Input date format |
Some usage examples follow.
Timestamp is GMT
| Config | Value |
|---|---|
| hoodie.deltastreamer.keygen.timebased.timestamp.type | "EPOCHMILLISECONDS" |
| hoodie.deltastreamer.keygen.timebased.output.dateformat | "yyyy-MM-dd hh" |
| hoodie.deltastreamer.keygen.timebased.timezone | "GMT+8:00" |

Input field value: "1578283932000L"; generated partition path: "2020-01-06 12"
If the input field value is null, the generated partition path is "1970-01-01 08"
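The example can be reproduced in plain Python, mapping the Java format "yyyy-MM-dd hh" to strftime "%Y-%m-%d %I" (12-hour clock) and treating a null input as epoch 0:

```python
from datetime import datetime, timezone, timedelta

TZ = timezone(timedelta(hours=8))  # "GMT+8:00"

def epoch_millis_partition(millis):
    # A null input falls back to epoch 0, matching the example above.
    ts = datetime.fromtimestamp((millis or 0) / 1000, tz=TZ)
    return ts.strftime("%Y-%m-%d %I")  # "yyyy-MM-dd hh" (12-hour clock)

print(epoch_millis_partition(1578283932000))  # 2020-01-06 12
print(epoch_millis_partition(None))           # 1970-01-01 08
```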
Timestamp is DATE_STRING
| Config | Value |
|---|---|
| hoodie.deltastreamer.keygen.timebased.timestamp.type | "DATE_STRING" |
| hoodie.deltastreamer.keygen.timebased.output.dateformat | "yyyy-MM-dd hh" |
| hoodie.deltastreamer.keygen.timebased.timezone | "GMT+8:00" |
| hoodie.deltastreamer.keygen.timebased.input.dateformat | "yyyy-MM-dd hh:mm:ss" |

Input field value: "2020-01-06 12:12:12"; generated partition path: "2020-01-06 12"
If the input field value is null, the generated partition path is "1970-01-01 12:00:00"
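A plain-Python rendering of the DATE_STRING conversion (the strftime/strptime patterns are the assumed Python equivalents of the Java formats above):

```python
from datetime import datetime

def date_string_partition(value):
    # Parse "yyyy-MM-dd hh:mm:ss", emit "yyyy-MM-dd hh" (12-hour clock).
    ts = datetime.strptime(value, "%Y-%m-%d %H:%M:%S")
    return ts.strftime("%Y-%m-%d %I")

print(date_string_partition("2020-01-06 12:12:12"))  # 2020-01-06 12
```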
Scalar examples
| Config | Value |
|---|---|
| hoodie.deltastreamer.keygen.timebased.timestamp.type | "SCALAR" |
| hoodie.deltastreamer.keygen.timebased.output.dateformat | "yyyy-MM-dd hh" |
| hoodie.deltastreamer.keygen.timebased.timezone | "GMT" |
| hoodie.deltastreamer.keygen.timebased.timestamp.scalar.time.unit | "days" |

Input field value: "20000L"; generated partition path: "2024-10-04 12"
If the input field value is null, the generated partition path is "1970-01-02 12"
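In the SCALAR case the value is interpreted in the configured time unit (days here) and added to the epoch, which can be sketched as:

```python
from datetime import datetime, timedelta, timezone

def scalar_days_partition(days):
    # Interpret the scalar as a day count added to the epoch, in GMT.
    ts = datetime(1970, 1, 1, tzinfo=timezone.utc) + timedelta(days=days)
    return ts.strftime("%Y-%m-%d %I")  # "yyyy-MM-dd hh" (12-hour clock)

print(scalar_days_partition(20000))  # 2024-10-04 12
```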
2. Concurrency Control
Supported approaches:
- MVCC: Hudi's table services, such as compaction and clean, use MVCC to provide snapshot isolation between writers and readers. This supports a single writer with concurrent readers.
- OPTIMISTIC CONCURRENCY (experimental): enables concurrent writers, and requires ZooKeeper or the Hive Metastore to acquire locks. For example, if write_A writes file1 and file2 while write_B writes file3 and file4, both writes succeed; if write_A writes file1 and file2 while write_B writes file2 and file3, only one of the writes succeeds and the other fails.
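A hedged sketch of the configuration usually involved in enabling optimistic concurrency control with a ZooKeeper-based lock; the ZooKeeper coordinates and lock key below are assumptions to adapt to your deployment:

```properties
hoodie.write.concurrency.mode=optimistic_concurrency_control
hoodie.cleaner.policy.failed.writes=LAZY
hoodie.write.lock.provider=org.apache.hudi.client.transaction.lock.ZookeeperBasedLockProvider
# ZooKeeper coordinates below are illustrative
hoodie.write.lock.zookeeper.url=localhost
hoodie.write.lock.zookeeper.port=2181
hoodie.write.lock.zookeeper.lock_key=my_table
hoodie.write.lock.zookeeper.base_path=/hudi/locks
```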
Multi Writer Guarantees
- upsert: the table will not contain duplicate records
- insert: the table may contain duplicates even with dedup enabled
- bulk_insert: the table may contain duplicates even with dedup enabled
- incremental pull: data consumption and checkpoints may arrive out of order