
YARN: maximum parallel Map task count


The following is mentioned in the Hadoop Definitive Guide:


"What qualifies as a small job? By default one that has less than 10 mappers, only one reducer, and the input size is less than the size of one HDFS block. "

But how does it count the number of mappers in a job before executing it on YARN? In MR1 the number of mappers depends on the number of input splits. Does the same apply to YARN as well? In YARN, containers are flexible. So is there any way to compute the maximum number of map tasks that can run on a given cluster in parallel (some kind of tight upper bound, since that would give me a rough idea of how much data I can process in parallel)?


2 Answers

#1


But how does it count the number of mappers in a job before executing it on YARN? In MR1 the number of mappers depends on the number of input splits. Does the same apply to YARN as well?


Yes, in YARN as well, if you are using a MapReduce-based framework, the number of mappers depends on the input splits.
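For jobs built on FileInputFormat, the split size (and hence the mapper count) follows from the HDFS block size and the min/max split settings. Here is a minimal Python sketch of that calculation (the logic mirrors FileInputFormat.computeSplitSize; the file sizes are made-up example values, and the small "slop" Hadoop allows on the last split is ignored):

import math

def compute_split_size(block_size, min_size=1, max_size=2**63 - 1):
    # Mirrors FileInputFormat: splitSize = max(minSize, min(maxSize, blockSize))
    return max(min_size, min(max_size, block_size))

def estimate_mappers(file_sizes, block_size=128 * 1024**2):
    # One map task per input split; each splittable file yields one split
    # per split_size chunk.
    split_size = compute_split_size(block_size)
    return sum(math.ceil(size / split_size) for size in file_sizes)

# Example: 1 GB + 200 MB + 50 MB input with 128 MB blocks
# -> 8 + 2 + 1 = 11 mappers.
print(estimate_mappers([1024**3, 200 * 1024**2, 50 * 1024**2]))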


In YARN, containers are flexible. So is there any way to compute the maximum number of map tasks that can run on a given cluster in parallel (some kind of tight upper bound, since that would give me a rough idea of how much data I can process in parallel)?


The number of map tasks that can run in parallel on a YARN cluster depends on how many containers can be launched and run in parallel on the cluster. This ultimately depends on how you configure MapReduce on the cluster, which is explained clearly in this guide from Cloudera.
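As a rough upper bound along those lines, you can divide each node's container resources by a map task's container request and multiply by the node count. A minimal sketch, assuming a homogeneous cluster (all numbers are illustrative, not recommendations):

def max_parallel_maps(node_mem_mb, node_vcores, map_mem_mb, map_vcores, num_nodes):
    # Containers per node are capped by whichever resource runs out first.
    by_memory = node_mem_mb // map_mem_mb
    by_cpu = node_vcores // map_vcores
    return min(by_memory, by_cpu) * num_nodes

# Example: 10 nodes, each with 96 GB and 24 vcores for YARN containers,
# and map tasks requesting 2 GB / 1 vcore -> min(48, 24) * 10 = 240.
print(max_parallel_maps(96 * 1024, 24, 2048, 1, 10))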


#2


mapreduce.job.maps = min(yarn.nodemanager.resource.memory-mb / mapreduce.map.memory.mb, yarn.nodemanager.resource.cpu-vcores / mapreduce.map.cpu.vcores, number of physical drives × workload factor) × number of worker nodes

mapreduce.job.reduces = min(yarn.nodemanager.resource.memory-mb / mapreduce.reduce.memory.mb, yarn.nodemanager.resource.cpu-vcores / mapreduce.reduce.cpu.vcores, number of physical drives × workload factor) × number of worker nodes

The workload factor can be set to 2.0 for most workloads. Consider a higher setting for CPU-bound workloads.
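Plugging illustrative numbers into the two formulas above (10 worker nodes, 96 GB and 24 vcores per NodeManager, 12 data disks, a workload factor of 2.0, 2 GB map containers, and 4 GB reduce containers; all values are assumptions for the example):

node_mem_mb, node_vcores, num_nodes = 96 * 1024, 24, 10
drives, workload_factor = 12, 2.0

maps_per_node = min(node_mem_mb / 2048,          # 48 by memory
                    node_vcores / 1,             # 24 by vcores
                    drives * workload_factor)    # 24 by spindles
reduces_per_node = min(node_mem_mb / 4096,       # 24 by memory
                       node_vcores / 1,          # 24 by vcores
                       drives * workload_factor) # 24 by spindles

print(maps_per_node * num_nodes)     # mapreduce.job.maps    -> 240.0
print(reduces_per_node * num_nodes)  # mapreduce.job.reduces -> 240.0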


yarn.nodemanager.resource.memory-mb (memory available on a node for containers) = total system memory – reserved memory (e.g., 10–20% of memory for Linux and its daemon services) – memory allocated to other daemons such as the HDFS DataNode (default 1024 MB), the NodeManager, the RegionServer, etc. – resources for task buffers, such as the HDFS sort I/O buffer
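A worked example of that budget, with assumed round numbers (the 15% OS reservation and the 2 GB for other daemons and buffers are illustrative, not recommendations):

total_mb = 128 * 1024                  # 128 GB of physical RAM
os_reserved_mb = total_mb * 15 // 100  # ~15% for Linux and daemon services
datanode_mb = 1024                     # HDFS DataNode (default)
nodemanager_mb = 1024                  # YARN NodeManager (assumed)
other_mb = 2048                        # RegionServer, task/sort buffers, etc. (assumed)

yarn_nm_memory_mb = (total_mb - os_reserved_mb - datanode_mb
                     - nodemanager_mb - other_mb)
print(yarn_nm_memory_mb)               # -> 107316 MB (~105 GB) for containers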

Hadoop is a disk I/O-centric platform by design. The number of independent physical drives (“spindles”) dedicated to DataNode use limits how much concurrent processing a node can sustain. As a result, the number of vcores allocated to the NodeManager should be the lesser of either:


[(total vcores) – (number of vcores reserved for non-YARN use)] or [2 × (number of physical disks used for DataNode storage)]

So

yarn.nodemanager.resource.cpu-vcores = min((total vcores) – (number of vcores reserved for non-YARN use), 2 × (number of physical disks used for DataNode storage))

Available vcores on a node for containers = total number of vcores – vcores for the operating system (to estimate vcore demand, consider the number of concurrent processes or tasks each service runs as an initial guide; for the OS we take 2) – YARN NodeManager (default is 1) – HDFS DataNode (default is 1).
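For example, on a node with 32 physical vcores and 12 disks dedicated to DataNode storage (illustrative numbers):

total_vcores = 32
reserved = 2 + 1 + 1          # OS (2) + NodeManager (1) + DataNode (1)
datanode_disks = 12

yarn_nm_cpu_vcores = min(total_vcores - reserved, 2 * datanode_disks)
print(yarn_nm_cpu_vcores)     # -> min(28, 24) = 24 vcores for containers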

Note:

mapreduce.map.memory.mb is the combination of mapreduce.map.java.opts.max.heap plus some headroom (a safety margin).

The settings for mapreduce.[map | reduce].java.opts.max.heap specify the default memory allotted to mapper and reducer heap size, respectively. The mapreduce.[map | reduce].memory.mb settings specify the memory allotted to their containers, and the value assigned should allow overhead beyond the task heap size. Cloudera recommends applying a factor of 1.2 to the mapreduce.[map | reduce].java.opts.max.heap setting. The optimal value depends on the actual tasks. Cloudera also recommends setting mapreduce.map.memory.mb to 1–2 GB and setting mapreduce.reduce.memory.mb to twice the mapper value. The ApplicationMaster heap size is 1 GB by default and can be increased if your jobs contain many concurrent tasks.
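A short sketch of that sizing rule (the 1.2 factor, the 1–2 GB mapper starting point, and the 2x reducer rule come from the Cloudera guidance above; the concrete numbers are illustrative):

import math

map_heap_mb = 1024                               # mapreduce.map.java.opts.max.heap
map_container_mb = math.ceil(map_heap_mb * 1.2)  # mapreduce.map.memory.mb = heap * 1.2

reduce_container_mb = 2 * map_container_mb       # reducer container = 2x the mapper's
reduce_heap_mb = reduce_container_mb * 10 // 12  # heap = container / 1.2

print(map_container_mb, reduce_container_mb, reduce_heap_mb)
# -> 1229 2458 2048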



References:

  • http://www.cloudera.com/content/cloudera/en/documentation/core/latest/topics/cdh_ig_yarn_tuning.html
  • http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.0.6.0/bk_installing_manually_book/content/rpm-chap1-11.html
