Spark性能优化coalesce(n)

作者：大街上 | 来源：互联网 | 2023-08-25 21:21

有时用Spark运行Job的时候，输出可能会出现一些空或者小内容。这时重新将输出的Partition进行重新调整，可以减少RDD中Patition的数目。两种方式： 1. coa

有时用Spark 运行Job 的时候，输出可能会出现一些空或者小内容。这时重新将输出的Partition 进行重新调整，可以减少RDD中Patition的数目。
两种方式：
1. coalesce(numPartitions:Int, shuffle:Boolean = false)
2. repartition(numPartitions:Int)

，repartition只是coalesce 简化版

/**
* Return a new RDD that has exactly numPartitions partitions.
*
* Can increase or decrease the level of parallelism in this RDD. Internally, this uses
* a shuffle to redistribute data.
*
* If you are decreasing the number of partitions in this RDD, consider using `coalesce`,
* which can avoid performing a shuffle.
如果你想减少patitions 用coalesce
*/

def repartition(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = withScope {
coalesce(numPartitions, shuffle = true)
}

/**
* Return a new RDD that is reduced into `numPartitions` partitions.
*
* This results in a narrow dependency, e.g. if you go from 1000 partitions
* to 100 partitions, there will not be a shuffle, instead each of the 100
* new partitions will claim 10 of the current partitions.
*
* However, if you're doing a drastic coalesce, e.g. to numPartitiOns= 1,
* this may result in your computation taking place on fewer nodes than
* you like (e.g. one node in the case of numPartitiOns= 1). To avoid this,
* you can pass shuffle = true. This will add a shuffle step, but means the
* current upstream partitions will be executed in parallel (per whatever
* the current partitioning is).
*
* Note: With shuffle = true, you can actually coalesce to a larger number
* of partitions. This is useful if you have a small number of partitions,
* say 100, potentially with a few partitions being abnormally large. Calling
* coalesce(1000, shuffle = true) will result in 1000 partitions with the
* data distributed using a hash partitioner.
shuffle为收缩默认采用 partitionCoalescer进行分区
*/
def coalesce(numPartitions: Int, shuffle: Boolean = false,
partitionCoalescer: Option[PartitionCoalescer] = Option.empty)
(implicit ord: Ordering[T] = null)
: RDD[T] = withScope {
require(numPartitions > 0, s"Number of partitions ($numPartitions) must be positive.")
if (shuffle) {
/** Distributes elements evenly across output partitions, starting from a random partition. */
//index 所在分区， items
val distributePartition = (index: Int, items: Iterator[T]) => {
var position = (new Random(index)).nextInt(numPartitions)
items.map { t =>
// Note that the hash code of the key will just be the key itself. The HashPartitioner
// will mod it with the number of total partitions.
position = position + 1
(position, t)
}
} : Iterator[(Int, T)]
// include a shuffle step so that our upstream tasks are still distributed
new CoalescedRDD(
//如果扩展 newShuffledRDD 这个核心操作distributePartition看对现有分区进行重新划分看上面函数distributePartition 。
new ShuffledRDD[Int, T, T](mapPartitionsWithIndex(distributePartition),
new HashPartitioner(numPartitions)),
numPartitions,
partitionCoalescer).values
} else {
//shuffle为false
new CoalescedRDD(this, numPartitions, partitionCoalescer)
}
}

推荐阅读

range
技术日志：使用 Ruby 爬虫抓取拉勾网职位数据并生成词云分析报告

技术日志：使用 Ruby 爬虫抓取拉勾网职位数据并生成词云分析报告 ... [详细]

蜡笔小新 2024-11-07 14:33:19
range
在范围[0..n-1]中产生m个不同的随机数 - Generating m distinct random numbers in the range [0..n-1]

Ihavetwomethodsofgeneratingmdistinctrandomnumbersintherange[0..n-1]我有两种方法在范围[0.n-1]中生 ... [详细]

蜡笔小新 2024-11-13 09:49:14
js
SoundPool

如果应用程序经常播放密集、急促而又短暂的音效（如游戏音效）那么使用MediaPlayer显得有些不太适合了。因为MediaPlayer存在如下缺点：1)延时时间较长，且资源占用率高 ... [详细]

蜡笔小新 2024-11-13 16:47:19
php
Java 编程错误：对象无法转换为 long 类型

本文介绍了在 Java 编程中遇到的一个常见错误：对象无法转换为 long 类型，并提供了详细的解决方案。 ... [详细]

蜡笔小新 2024-11-13 10:57:24
php
从0到1搭建大数据平台

从0到1搭建大数据平台 ... [详细]

蜡笔小新 2024-11-12 15:26:03
usb
杜甫《喜晴》的两种英译比较

本文对比了杜甫《喜晴》的两种英文翻译版本：a. Pleased with Sunny Weather 和 b. Rejoicing in Clearing Weather。a 版由 alexcwlin 翻译并经 Adam Lam 编辑，b 版则由哈佛大学的宇文所安教授 (Prof. Stephen Owen) 翻译。 ... [详细]

蜡笔小新 2024-11-12 15:02:28
js
秒建一个后台管理系统？用这5个开源免费的Java项目就够了

秒建一个后台管理系统？用这5个开源免费的Java项目就够了 ... [详细]

蜡笔小新 2024-11-12 03:21:33
int
如何使用 `org.eclipse.rdf4j.query.impl.MapBindingSet.getValue()` 方法及其代码示例详解

如何使用 `org.eclipse.rdf4j.query.impl.MapBindingSet.getValue()` 方法及其代码示例详解 ... [详细]

蜡笔小新 2024-11-11 02:42:52
int
Spring框架中枚举参数的正确使用方法与技巧

本文详细阐述了在Spring Boot框架中正确使用枚举参数的方法与技巧，旨在帮助开发者更高效地掌握和应用枚举类型的数据传递，适合对Spring Boot感兴趣的读者深入学习。 ... [详细]

蜡笔小新 2024-11-09 20:34:17
int
使用 ListView 浏览安卓系统中的回收站文件

使用 ListView 浏览安卓系统中的回收站文件 ... [详细]

蜡笔小新 2024-11-09 16:34:55
js
Python 伦理黑客技术：深入探讨后门攻击（第三部分）

在《Python 伦理黑客技术：深入探讨后门攻击（第三部分）》中，作者详细分析了后门攻击中的Socket问题。由于TCP协议基于流，难以确定消息批次的结束点，这给后门攻击的实现带来了挑战。为了解决这一问题，文章提出了一系列有效的技术方案，包括使用特定的分隔符和长度前缀，以确保数据包的准确传输和解析。这些方法不仅提高了攻击的隐蔽性和可靠性，还为安全研究人员提供了宝贵的参考。 ... [详细]

蜡笔小新 2024-11-09 16:33:02
int
FFMpeg学习进阶：音频处理基础理论与重采样技术详解

在Android平台中，播放音频的采样率通常固定为44.1kHz，而录音的采样率则固定为8kHz。为了确保音频设备的正常工作，底层驱动必须预先设定这些固定的采样率。当上层应用提供的采样率与这些预设值不匹配时，需要通过重采样（resample）技术来调整采样率，以保证音频数据的正确处理和传输。本文将详细探讨FFMpeg在音频处理中的基础理论及重采样技术的应用。 ... [详细]

蜡笔小新 2024-11-09 13:46:55
php
使用Maven JAR插件将单个或多个文件及其依赖项合并为一个可引用的JAR包

本文介绍了如何利用Maven中的maven-assembly-plugin插件将单个或多个Java文件及其依赖项打包成一个可引用的JAR文件。首先，需要创建一个新的Maven项目，并将待打包的Java文件复制到该项目中。通过配置maven-assembly-plugin，可以实现将所有文件及其依赖项合并为一个独立的JAR包，方便在其他项目中引用和使用。此外，该方法还支持自定义装配描述符，以满足不同场景下的需求。 ... [详细]

蜡笔小新 2024-11-09 01:59:29
php
第二章：Kafka基础入门与核心概念解析

本章节主要介绍了Kafka的基本概念及其核心特性。Kafka是一种分布式消息发布和订阅系统，以其卓越的性能和高吞吐量而著称。最初，Kafka被设计用于LinkedIn的活动流和运营数据处理，旨在高效地管理和传输大规模的数据流。这些数据主要包括用户活动记录、系统日志和其他实时信息。通过深入解析Kafka的设计原理和应用场景，读者将能够更好地理解其在现代大数据架构中的重要地位。 ... [详细]

蜡笔小新 2024-11-06 11:10:03
js
使用JavaScript生成Java兼容的UUID代码实现与优化技巧

本文介绍了UUID（通用唯一标识符）的概念及其在JavaScript中生成Java兼容UUID的代码实现与优化技巧。UUID是一个128位的唯一标识符，广泛应用于分布式系统中以确保唯一性。文章详细探讨了如何利用JavaScript生成符合Java标准的UUID，并提供了多种优化方法，以提高生成效率和兼容性。 ... [详细]

蜡笔小新 2024-11-05 18:19:54

大街上

这个家伙很懒，什么也没留下！

Tags | 热门标签

RankList | 热门文章