Author: 心语忆录_288 | Source: Internet | 2023-08-07 18:44
Configuring snappy compression on a CDH5 cluster. The steps are as follows:
1. Comparing the three commonly used codecs, gzip, lzo, and snappy:
Codec    Compressed/original size    Compression speed    Decompression speed
GZIP     13.4%                       21 MB/s              118 MB/s
LZO      20.5%                       135 MB/s             410 MB/s
Snappy   22.2%                       172 MB/s             409 MB/s
Snappy has the best overall balance. We also tried LZO, but it frequently crashed some of our older machines.
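Figures like those in the table come from measuring the compressed/original size ratio and throughput over sample data. A minimal sketch of that methodology: Python's standard library has no snappy codec, so gzip and bz2 stand in here, and absolute numbers depend heavily on hardware and input data.

```python
# Sketch of how compression ratio/speed figures are measured.
# gzip and bz2 stand in for snappy (not in the Python stdlib);
# numbers vary with hardware and input, so none are hard-coded here.
import bz2
import gzip
import time

def benchmark(compress, decompress, data):
    """Return (compressed/original ratio, compression speed in MB/s)."""
    start = time.perf_counter()
    packed = compress(data)
    elapsed = time.perf_counter() - start
    assert decompress(packed) == data  # round-trip sanity check
    ratio = len(packed) / len(data)
    mb_per_s = len(data) / (1 << 20) / max(elapsed, 1e-9)
    return ratio, mb_per_s

if __name__ == "__main__":
    sample = b"the quick brown fox jumps over the lazy dog\n" * 20000
    for name, c, d in [("gzip", gzip.compress, gzip.decompress),
                       ("bz2", bz2.compress, bz2.decompress)]:
        ratio, speed = benchmark(c, d, sample)
        print(f"{name}: ratio={ratio:.1%}, speed={speed:.0f} MB/s")
```

The same harness, pointed at a real snappy binding and representative cluster data, is how the tradeoff above (snappy trading a few points of ratio for much higher throughput) shows up in practice.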
2. Configure the compression properties in HDFS's core-site.xml
<property>
  <name>io.compression.codecs</name>
  <value>org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.BZip2Codec,org.apache.hadoop.io.compress.DeflateCodec,org.apache.hadoop.io.compress.SnappyCodec,org.apache.hadoop.io.compress.Lz4Codec</value>
</property>
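SnappyCodec depends on the native snappy library being loadable on every node. Before going further it is worth verifying that, with a command run on a cluster node (not runnable outside the cluster):

```shell
# Lists each native library Hadoop can load; snappy should show "true"
# with a path to libsnappy.so on every node.
hadoop checknative -a
```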
3. Configure the compression properties in MapReduce's mapred-site.xml
<property>
  <name>mapreduce.output.fileoutputformat.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapreduce.output.fileoutputformat.compress.type</name>
  <value>BLOCK</value>
</property>
<property>
  <name>mapreduce.output.fileoutputformat.compress.codec</name>
  <value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>
<property>
  <name>mapreduce.map.output.compress.codec</name>
  <value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>
<property>
  <name>mapreduce.map.output.compress</name>
  <value>true</value>
</property>
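These site-wide defaults can also be overridden per job with `-D` generic options. A sketch, assuming the stock CDH examples jar and hypothetical input/output paths, to be run on the cluster:

```shell
# Per-job override of the mapred-site.xml defaults above.
# Jar path matches the CDH parcel layout; HDFS paths are hypothetical.
hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar \
  wordcount \
  -Dmapreduce.map.output.compress=true \
  -Dmapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec \
  /user/demo/input /user/demo/output
```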
4. Configure the compression properties in Hive's hive-site.xml
<property>
  <name>hive.enforce.bucketing</name>
  <value>true</value>
</property>
<property>
  <name>hive.exec.compress.output</name>
  <value>true</value>
</property>
<property>
  <name>io.compression.codecs</name>
  <value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>
<property>
  <name>hive.auto.convert.join</name>
  <value>false</value>
</property>
<property>
  <name>hive.support.concurrency</name>
  <value>false</value>
</property>
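The same switches can be toggled per session instead of cluster-wide. A sketch with a hypothetical table, showing that files written by INSERT come out snappy-compressed once hive.exec.compress.output is on:

```sql
-- Per-session equivalents of the hive-site.xml settings above
SET hive.exec.compress.output=true;
SET mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;

-- Hypothetical table: the files this INSERT writes are snappy-compressed
CREATE TABLE logs_compact (line STRING);
INSERT OVERWRITE TABLE logs_compact SELECT line FROM logs;
```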
5. Configure Spark's compression settings
spark-env.sh
export JAVA_HOME=/usr/java/jdk1.7.0_67-cloudera
export SPARK_MASTER_IP=10.130.2.20
export SPARK_MASTER_PORT=7077
export SPARK_WORKER_CORES=48
export SPARK_WORKER_INSTANCES=1
export SPARK_WORKER_MEMORY=37g
export HADOOP_HOME=/opt/cloudera/parcels/CDH/lib/hadoop
export JAVA_LIBRARY_PATH=$JAVA_LIBRARY_PATH:$HADOOP_HOME/lib/native
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$HADOOP_HOME/lib/native
export SPARK_LIBRARY_PATH=$SPARK_LIBRARY_PATH:$HADOOP_HOME/lib/native
export SPARK_CLASSPATH=$SPARK_CLASSPATH:$HADOOP_HOME/lib/snappy-java-1.0.4.1.jar
spark-defaults.conf
spark.local.dir /diskb/sparktmp,/diskc/sparktmp,/diskd/sparktmp,/diske/sparktmp,/diskf/sparktmp,/diskg/sparktmp
spark.io.compression.codec snappy
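Note that spark.io.compression.codec governs Spark's internal data (shuffle outputs, spills, broadcasts), and it can also be set per application rather than in spark-defaults.conf. A sketch, with a hypothetical application class and jar:

```shell
# Per-application override of the codec set in spark-defaults.conf.
# Class name and jar are hypothetical placeholders.
spark-submit \
  --conf spark.io.compression.codec=snappy \
  --class com.example.MyApp \
  myapp.jar
```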
Summary:
With the configuration above, MapReduce, Hive, and Spark jobs across the cluster all compress data with snappy, greatly reducing I/O and improving performance.