热门标签 | HotTags
当前位置:  开发笔记 > 编程语言 > 正文

大数据基础_大数据基础问答之一

篇首语:本文由编程笔记#小编为大家整理,主要介绍了大数据基础问答-之一相关的知识,希望对你有一定的参考价值。WhatisHado

篇首语:本文由编程笔记#小编为大家整理,主要介绍了大数据基础问答-之一相关的知识,希望对你有一定的参考价值。



What is Hadoop?

==========

Hadoop is an open source, Java-based programming framework that supports the processing and storage of extremely large data sets in a distributed computing environment. It is part of the Apache project sponsored by the Apache Software Foundation.

 

What technology inspired the invention of Hadoop?

===========

Google paper:

  • MapReduce
  • Google File System

At architecture level, what are the most key components of Hadoop?

===========

HDFS – Hadoop Distributed File System, which is capable of storing data across thousands of commodity servers to achieve high bandwidth between nodes.

MapReduce - provides the programming model used to tackle large distributed data processing -- mapping data and reducing it to a result.

Yarn – Hadoop Yet Another Resource Negotiator which provides resource management and scheduling for user applications.

 

What is HDFS?

===========

HDFS provides the scalable, fault-tolerant, cost-efficient storage for big data. HDFS is a Java-based file system that provides scalable and reliable data storage, and it was designed to span large clusters of commodity servers. HDFS has demonstrated production scalability of up to 200 PB of storage and a single cluster of 4500 servers, supporting close to a billion files and blocks.

 

Describe the HDFS architecture, please.

===========

HDFS has a master/slave architecture.

An HDFS cluster consists of a single NameNode, a master server that manages the file system namespace and regulates access to files by clients.

In addition, there are a number of DataNodes, usually one per node in the cluster, which manage storage attached to the nodes that they run on.

HDFS exposes a file system namespace and allows user data to be stored in files.

Internally, a file is split into one or more blocks and these blocks are stored in a set of DataNodes.

  • The NameNode executes file system namespace operations like opening, closing, and renaming files and directories. It also determines the mapping of blocks to DataNodes.
  • The DataNodes are responsible for serving read and write requests from the file system’s clients. The DataNodes also perform block creation, deletion, and replication upon instruction from the NameNode.

image

 

 

How is a file stored in HDFS?

===========

A file is separated into blocks of at least 64MB. These blocks will be stored in DataNodes.

Block placement strategy:

  • One replica on local node.
  • Second replica on a remote rack.
  • Third replica on the same remote rack.
  • Additional replicas are randomly placed.

 

A data file of 1 TB requires how much storage and network traffic to store in HDFS?

===========

1 TB file requires:

  • 3 TB storage
  • 3 TB network traffic

image

 

How is filesystem metadata stored in Hadoop?

===========

This filesystem metadata is stored in two different constructs: the fsimage and the edit log.

The fsimage is a file that represents a point-in-time snapshot of the filesystem’s metadata.

Rather than writing a new fsimage every time the namespace is modified, the NameNode instead records the modifying operation in the edit log for durability.

 

This way, if the NameNode crashes, it can restore its state by first loading the fsimage then replaying all the operations (also called edits or transactions) in the edit log to catch up to the most recent state of the namesystem. The edit log comprises a series of files, called edit log segments, that together represent all the namesystem modifications made since the creation of the fsimage.

image

 

image

 

What is ZooKeeper and how is that fit into hadoop?

===========

ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services.

image

 

image

 

What is Hadoop MapReduce?

===========

Hadoop MapReduce is a software framework for easily writing applications which process vast amounts of data (multi-terabyte data-sets) in-parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.

 

A MapReduce job usually splits the input data-set into independent chunks which are processed by the map tasks in a completely parallel manner. The framework sorts the outputs of the maps, which are then input to the reduce tasks. Typically both the input and the output of the job are stored in a file-system. The framework takes care of scheduling tasks, monitoring them and re-executes the failed tasks.

 

Typically the compute nodes and the storage nodes are the same, that is, the MapReduce framework and the Hadoop Distributed File System (see HDFS Architecture Guide) are running on the same set of nodes. This configuration allows the framework to effectively schedule tasks on the nodes where data is already present, resulting in very high aggregate bandwidth across the cluster.

 

The MapReduce framework consists of a single master JobTracker and one slave TaskTracker per cluster-node. The master is responsible for scheduling the jobs\' component tasks on the slaves, monitoring them and re-executing the failed tasks. The slaves execute the tasks as directed by the master.

 

Minimally, applications specify the input/output locations and supply map and reduce functions via implementations of appropriate interfaces and/or abstract-classes. These, and other job parameters, comprise the job configuration. The Hadoop job client then submits the job (jar/executable etc.) and configuration to the JobTracker which then assumes the responsibility of distributing the software/configuration to the slaves, scheduling tasks and monitoring them, providing status and diagnostic information to the job-client.

image

 

image

 

What is Yarn?

===========

The fundamental idea of YARN is to split up the functionalities of resource management and job scheduling/monitoring into separate daemons. The idea is to have a global ResourceManager (RM) and per-application ApplicationMaster (AM). An application is either a single job or a DAG of jobs.

The ResourceManager and the NodeManager form the data-computation framework. The ResourceManager is the ultimate authority that arbitrates resources among all the applications in the system. The NodeManager is the per-machine framework agent who is responsible for containers, monitoring their resource usage (cpu, memory, disk, network) and reporting the same to the ResourceManager/Scheduler.

The per-application ApplicationMaster is, in effect, a framework specific library and is tasked with negotiating resources from the ResourceManager and working with the NodeManager(s) to execute and monitor the tasks.

image

 

image

 

Put everything above together?

==============

image



推荐阅读
  • 采用IKE方式建立IPsec安全隧道
    一、【组网和实验环境】按如上的接口ip先作配置,再作ipsec的相关配置,配置文本见文章最后本文实验采用的交换机是H3C模拟器,下载地址如 ... [详细]
  • 实用正则表达式有哪些
    小编给大家分享一下实用正则表达式有哪些,相信大部分人都还不怎么了解,因此分享这篇文章给大家参考一下,希望大家阅读完这篇文章后大有收获,下 ... [详细]
  • 深入解析SpringMVC核心组件:DispatcherServlet的工作原理
    本文详细探讨了SpringMVC的核心组件——DispatcherServlet的运作机制,旨在帮助有一定Java和Spring基础的开发人员理解HTTP请求是如何被映射到Controller并执行的。文章将解答以下问题:1. HTTP请求如何映射到Controller;2. Controller是如何被执行的。 ... [详细]
  • 深入解析Spring启动过程
    本文详细介绍了Spring框架的启动流程,帮助开发者理解其内部机制。通过具体示例和代码片段,解释了Bean定义、工厂类、读取器以及条件评估等关键概念,使读者能够更全面地掌握Spring的初始化过程。 ... [详细]
  • yikesnews第11期:微软Office两个0day和一个提权0day
    点击阅读原文可点击链接根据法国大选被黑客干扰,发送了带漏洞的文档Trumps_Attack_on_Syria_English.docx而此漏洞与ESET&FireEy ... [详细]
  • This post discusses an issue encountered while using the @name annotation in documentation generation, specifically regarding nested class processing and unexpected output. ... [详细]
  • ElasticSearch 集群监控与优化
    本文详细介绍了如何有效地监控 ElasticSearch 集群,涵盖了关键性能指标、集群健康状况、统计信息以及内存和垃圾回收的监控方法。 ... [详细]
  • 并发编程 12—— 任务取消与关闭 之 shutdownNow 的局限性
    Java并发编程实践目录并发编程01——ThreadLocal并发编程02——ConcurrentHashMap并发编程03——阻塞队列和生产者-消费者模式并发编程04——闭锁Co ... [详细]
  • 深入解析Java枚举及其高级特性
    本文详细介绍了Java枚举的概念、语法、使用规则和应用场景,并探讨了其在实际编程中的高级应用。所有相关内容已收录于GitHub仓库[JavaLearningmanual](https://github.com/Ziphtracks/JavaLearningmanual),欢迎Star并持续关注。 ... [详细]
  • 我有一个SpringRestController,它处理API调用的版本1。继承在SpringRestControllerpackagerest.v1;RestCon ... [详细]
  • 历经三十年的开发,Mathematica 已成为技术计算领域的标杆,为全球的技术创新者、教育工作者、学生及其他用户提供了一个领先的计算平台。最新版本 Mathematica 12.3.1 增加了多项核心语言、数学计算、可视化和图形处理的新功能。 ... [详细]
  • 本文详细探讨了如何通过分析单个或多个线程在瓶颈情况下的表现,来了解处理器资源的消耗。无论是单进程还是多进程环境,监控关键指标如线程数量、占用时间及调度优先级等,有助于揭示潜在的性能问题。 ... [详细]
  • 使用Powershell Studio快速构建GUI应用程序
    本文介绍了如何利用Powershell Studio创建功能强大的可视化界面。相较于传统的开发工具,Powershell Studio提供了更为简便和高效的开发体验,尤其适合需要快速构建图形用户界面(GUI)的场景。 ... [详细]
  • SDN网络拓扑发现机制解析
    本文深入探讨了SDN(软件定义网络)中拓扑发现的原理与实现方法,重点介绍了LLDP协议在OpenFlow环境中的应用,并讨论了非OpenFlow设备存在时的链路发现策略。 ... [详细]
  • 探讨ChatGPT在法律和版权方面的潜在风险及影响,分析其作为内容创造工具的合法性和合规性。 ... [详细]
author-avatar
无限制空间689
这个家伙很懒,什么也没留下!
PHP1.CN | 中国最专业的PHP中文社区 | DevBox开发工具箱 | json解析格式化 |PHP资讯 | PHP教程 | 数据库技术 | 服务器技术 | 前端开发技术 | PHP框架 | 开发工具 | 在线工具
Copyright © 1998 - 2020 PHP1.CN. All Rights Reserved | 京公网安备 11010802041100号 | 京ICP备19059560号-4 | PHP1.CN 第一PHP社区 版权所有