Big Data Basics: Q&A, Part 1

Author: 无限制空间689 | Source: Internet | 2023-10-12 15:35

Preface: This article was compiled by the editors of 编程笔记 and covers big data basics Q&A, part 1. We hope you find it a useful reference.

What is Hadoop?

==========

Hadoop is an open-source, Java-based programming framework that supports the processing and storage of extremely large data sets in a distributed computing environment. It is an Apache project sponsored by the Apache Software Foundation.

What technology inspired the invention of Hadoop?

===========

Two Google papers:

MapReduce
Google File System

At the architecture level, what are the key components of Hadoop?

===========

HDFS – Hadoop Distributed File System, which is capable of storing data across thousands of commodity servers to achieve high bandwidth between nodes.

MapReduce – provides the programming model used to tackle large distributed data processing: mapping data and reducing it to a result.

YARN – Yet Another Resource Negotiator, which provides resource management and scheduling for user applications.

What is HDFS?

===========

HDFS is a Java-based file system that provides scalable, fault-tolerant, cost-efficient storage for big data, and it was designed to span large clusters of commodity servers. HDFS has demonstrated production scalability of up to 200 PB of storage and a single cluster of 4500 servers, supporting close to a billion files and blocks.

Describe the HDFS architecture, please.

===========

HDFS has a master/slave architecture.

An HDFS cluster consists of a single NameNode, a master server that manages the file system namespace and regulates access to files by clients.

In addition, there are a number of DataNodes, usually one per node in the cluster, which manage storage attached to the nodes that they run on.

HDFS exposes a file system namespace and allows user data to be stored in files.

Internally, a file is split into one or more blocks and these blocks are stored in a set of DataNodes.

The NameNode executes file system namespace operations like opening, closing, and renaming files and directories. It also determines the mapping of blocks to DataNodes.

The DataNodes are responsible for serving read and write requests from the file system’s clients. The DataNodes also perform block creation, deletion, and replication upon instruction from the NameNode.
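The split between metadata (NameNode) and data (DataNodes) can be illustrated with a toy in-memory model. This is only a sketch: the class names mirror the HDFS roles, but every field, method, path, and block id below is invented for illustration and is not the HDFS API.

```python
from dataclasses import dataclass, field


@dataclass
class DataNode:
    """Stores block contents keyed by block id (a stand-in for local disks)."""
    name: str
    blocks: dict = field(default_factory=dict)


@dataclass
class NameNode:
    """Holds metadata only: file -> block ids, and block id -> DataNode names."""
    file_blocks: dict = field(default_factory=dict)
    block_locations: dict = field(default_factory=dict)

    def get_block_locations(self, path):
        # A namespace lookup; no file data passes through the NameNode.
        return [(b, self.block_locations[b]) for b in self.file_blocks[path]]


def read_file(namenode, datanodes, path):
    """Client read path: ask the NameNode where blocks live, then fetch them."""
    data = b""
    for block_id, holders in namenode.get_block_locations(path):
        data += datanodes[holders[0]].blocks[block_id]  # read the first replica
    return data


# Hypothetical two-node cluster holding one two-block file.
dn1, dn2 = DataNode("dn1"), DataNode("dn2")
dn1.blocks["blk_1"] = b"hello "
dn2.blocks["blk_1"] = b"hello "
dn2.blocks["blk_2"] = b"hdfs"
nn = NameNode(
    file_blocks={"/demo.txt": ["blk_1", "blk_2"]},
    block_locations={"blk_1": ["dn1", "dn2"], "blk_2": ["dn2"]},
)
content = read_file(nn, {"dn1": dn1, "dn2": dn2}, "/demo.txt")
```

The point of the sketch is that the client contacts the NameNode only for locations and then reads the bytes directly from DataNodes.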

How is a file stored in HDFS?

===========

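A file is split into fixed-size blocks. The block size is configurable (the default is 64 MB in Hadoop 1.x and 128 MB in Hadoop 2.x and later); only the last block of a file may be smaller. These blocks are stored on DataNodes.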

Block placement strategy:

First replica on the local node (the writer's node, if the writer runs on a DataNode).
Second replica on a node in a remote rack.
Third replica on a different node in the same remote rack.
Additional replicas are placed randomly.
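The splitting and placement rules above can be sketched in a few lines. This is a simplified illustration, not the HDFS placement code: the function names, the `topology` dictionary, and the node names are hypothetical, and the sketch assumes the writer runs on a DataNode and that at least one remote rack has two or more nodes.

```python
import random

MB = 1024 ** 2


def split_into_blocks(file_size, block_size):
    """Return the byte length of each block; only the last one may be short."""
    full, rest = divmod(file_size, block_size)
    return [block_size] * full + ([rest] if rest else [])


def place_replicas(writer, topology, replication=3, rng=random):
    """Pick DataNodes for one block. `topology` maps rack name -> node names."""
    rack_of = {node: rack for rack, nodes in topology.items() for node in nodes}
    chosen = [writer]  # first replica: the writer's own node
    if replication >= 2:
        # second replica: some node in a rack other than the writer's
        remote_rack = rng.choice(sorted(r for r in topology if r != rack_of[writer]))
        second = rng.choice(topology[remote_rack])
        chosen.append(second)
        if replication >= 3:
            # third replica: a different node in that same remote rack
            chosen.append(rng.choice([n for n in topology[remote_rack] if n != second]))
    # any further replicas: random nodes not used yet
    remaining = [n for n in rack_of if n not in chosen]
    chosen += rng.sample(remaining, max(0, replication - len(chosen)))
    return chosen


# Hypothetical three-rack cluster and a 300 MB file with 128 MB blocks.
topology = {"rack1": ["a1", "a2"], "rack2": ["b1", "b2"], "rack3": ["c1", "c2"]}
sizes = split_into_blocks(300 * MB, 128 * MB)
replicas = place_replicas("a1", topology, rng=random.Random(0))
```

Here `sizes` holds two full 128 MB blocks and one 44 MB tail block, and `replicas` lists the three nodes chosen for a single block.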

A data file of 1 TB requires how much storage and network traffic to store in HDFS?

===========

With the default replication factor of 3, a 1 TB file requires:

3 TB storage
3 TB network traffic
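The arithmetic behind these figures can be written out as a small calculation. This is a back-of-the-envelope sketch: the function name and parameters are hypothetical, the 128 MB block size is an assumed default, and the `writer_on_datanode` flag encodes the assumption that a replica written to the client's own node does not cross the network.

```python
TB = 1024 ** 4
MB = 1024 ** 2


def hdfs_cost(file_size, replication=3, block_size=128 * MB, writer_on_datanode=False):
    """Estimate block count, raw storage, and network bytes for one file."""
    blocks = -(-file_size // block_size)  # integer ceiling division
    # Assumption: a replica stored on the writer's own node generates no network traffic.
    copies_over_network = replication - 1 if writer_on_datanode else replication
    return {
        "blocks": blocks,
        "block_replicas": blocks * replication,
        "storage_bytes": file_size * replication,
        "network_bytes": file_size * copies_over_network,
    }


cost = hdfs_cost(1 * TB)
cost_local_writer = hdfs_cost(1 * TB, writer_on_datanode=True)
```

Under these assumptions, a client outside the cluster produces 3 TB of storage and 3 TB of network traffic, matching the answer above, while a writer co-located with a DataNode would send roughly 2 TB over the network.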

How is filesystem metadata stored in Hadoop?

===========

The NameNode stores filesystem metadata in two different constructs: the fsimage and the edit log.

The fsimage is a file that represents a point-in-time snapshot of the filesystem’s metadata.

Rather than writing a new fsimage every time the namespace is modified, the NameNode instead records the modifying operation in the edit log for durability.

This way, if the NameNode crashes, it can restore its state by first loading the fsimage then replaying all the operations (also called edits or transactions) in the edit log to catch up to the most recent state of the namesystem. The edit log comprises a series of files, called edit log segments, that together represent all the namesystem modifications made since the creation of the fsimage.
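The recovery procedure described above (load the snapshot, then replay the edits) can be sketched as follows. The data layout is invented for illustration: the real fsimage and edit log are binary formats, and the operation names, tuple shapes, and transaction ids here are assumptions.

```python
def apply_edit(namespace, edit):
    """Apply one logged operation to the in-memory set of paths."""
    op, *args = edit
    if op in ("mkdir", "create"):
        namespace.add(args[0])
    elif op == "delete":
        namespace.discard(args[0])
    elif op == "rename":
        src, dst = args
        namespace.discard(src)
        namespace.add(dst)
    else:
        raise ValueError(f"unknown operation: {op}")


def recover(fsimage, edit_segments):
    """Rebuild the namespace: load the snapshot, then replay newer edits in order."""
    namespace = set(fsimage["paths"])
    last_txid = fsimage["last_txid"]
    for segment in edit_segments:
        for txid, edit in segment:
            if txid <= last_txid:
                continue  # already reflected in the snapshot
            apply_edit(namespace, edit)
            last_txid = txid
    return namespace, last_txid


# Hypothetical snapshot taken at transaction 2, followed by two edit log segments.
fsimage = {"last_txid": 2, "paths": {"/", "/logs"}}
segments = [
    [(3, ("create", "/logs/a.txt")), (4, ("rename", "/logs/a.txt", "/logs/b.txt"))],
    [(5, ("mkdir", "/tmp")), (6, ("delete", "/tmp"))],
]
namespace, last_txid = recover(fsimage, segments)
```

Appending one small record per change is much cheaper than rewriting the whole snapshot, which is why the edit log exists alongside the fsimage.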

What is ZooKeeper and how does it fit into Hadoop?

===========

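ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services. Within the Hadoop ecosystem, it is used for coordination tasks such as leader election: for example, automatic NameNode failover in HDFS high availability and ResourceManager high availability in YARN both rely on ZooKeeper, as does HBase.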

What is Hadoop MapReduce?

===========

Hadoop MapReduce is a software framework for easily writing applications which process vast amounts of data (multi-terabyte data-sets) in-parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.

A MapReduce job usually splits the input data-set into independent chunks which are processed by the map tasks in a completely parallel manner. The framework sorts the outputs of the maps, which are then input to the reduce tasks. Typically both the input and the output of the job are stored in a file-system. The framework takes care of scheduling tasks, monitoring them, and re-executing the failed tasks.
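The map, sort, and reduce phases described above can be imitated in a single process with the classic word-count example. This is a local sketch of the programming model only, not the Hadoop API: `run_job`, `map_words`, and `reduce_counts` are hypothetical names, and a real job would run the map and reduce tasks in parallel across the cluster.

```python
from itertools import groupby
from operator import itemgetter


def map_words(line):
    """Map function: emit (word, 1) for every word in one input record."""
    for word in line.lower().split():
        yield word, 1


def reduce_counts(word, counts):
    """Reduce function: combine all values that share a key."""
    return word, sum(counts)


def run_job(chunks, mapper, reducer):
    """Run map over every chunk, sort by key, then reduce each key group."""
    # Map phase: each chunk is independent and could run on a different node.
    intermediate = [pair for chunk in chunks for record in chunk for pair in mapper(record)]
    # Sort phase: bring equal keys together, as the framework does before reducing.
    intermediate.sort(key=itemgetter(0))
    # Reduce phase: one reducer call per distinct key.
    return [
        reducer(key, [value for _, value in group])
        for key, group in groupby(intermediate, key=itemgetter(0))
    ]


chunks = [["big data big"], ["data hadoop"]]
result = run_job(chunks, map_words, reduce_counts)
# result == [("big", 2), ("data", 2), ("hadoop", 1)]
```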

Typically the compute nodes and the storage nodes are the same, that is, the MapReduce framework and the Hadoop Distributed File System (see HDFS Architecture Guide) are running on the same set of nodes. This configuration allows the framework to effectively schedule tasks on the nodes where data is already present, resulting in very high aggregate bandwidth across the cluster.

In Hadoop 1.x (before YARN), the MapReduce framework consists of a single master JobTracker and one slave TaskTracker per cluster-node. The master is responsible for scheduling the jobs' component tasks on the slaves, monitoring them, and re-executing the failed tasks. The slaves execute the tasks as directed by the master.

Minimally, applications specify the input/output locations and supply map and reduce functions via implementations of appropriate interfaces and/or abstract-classes. These, and other job parameters, comprise the job configuration. The Hadoop job client then submits the job (jar/executable etc.) and configuration to the JobTracker which then assumes the responsibility of distributing the software/configuration to the slaves, scheduling tasks and monitoring them, providing status and diagnostic information to the job-client.

What is YARN?

===========

The fundamental idea of YARN is to split up the functionalities of resource management and job scheduling/monitoring into separate daemons. The idea is to have a global ResourceManager (RM) and per-application ApplicationMaster (AM). An application is either a single job or a DAG of jobs.

The ResourceManager and the NodeManager form the data-computation framework. The ResourceManager is the ultimate authority that arbitrates resources among all the applications in the system. The NodeManager is the per-machine framework agent that is responsible for containers, monitoring their resource usage (CPU, memory, disk, network) and reporting the same to the ResourceManager/Scheduler.

The per-application ApplicationMaster is, in effect, a framework specific library and is tasked with negotiating resources from the ResourceManager and working with the NodeManager(s) to execute and monitor the tasks.
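The division of labor among the three roles can be sketched with a toy model. The class names follow the YARN roles, but everything else is an assumption made for illustration: the methods, the memory-only resource model, and the "most free memory first" rule are invented here, whereas real YARN schedulers work with queues and richer resource descriptions.

```python
from dataclasses import dataclass


@dataclass
class NodeManager:
    """Per-machine agent; here it only tracks free memory for containers."""
    name: str
    free_mb: int


class ResourceManager:
    """Global arbiter: decides which node hosts each requested container."""

    def __init__(self, nodes):
        self.nodes = nodes

    def allocate(self, app_id, memory_mb):
        # Illustrative rule only: pick the node with the most free memory.
        node = max(self.nodes, key=lambda n: n.free_mb)
        if node.free_mb < memory_mb:
            return None  # no node can host this container right now
        node.free_mb -= memory_mb
        return (app_id, node.name, memory_mb)


class ApplicationMaster:
    """Per-application negotiator: requests one container per task."""

    def __init__(self, app_id, rm):
        self.app_id = app_id
        self.rm = rm

    def request_containers(self, task_count, memory_mb):
        containers = []
        for _ in range(task_count):
            container = self.rm.allocate(self.app_id, memory_mb)
            if container is None:
                break  # the cluster is full; a real AM would wait and retry
            containers.append(container)
        return containers


# Hypothetical two-node cluster; the application wants four 2048 MB containers.
nodes = [NodeManager("nm1", 4096), NodeManager("nm2", 2048)]
rm = ResourceManager(nodes)
am = ApplicationMaster("app_1", rm)
containers = am.request_containers(task_count=4, memory_mb=2048)
```

In this run only three of the four requests fit, which shows why the ApplicationMaster negotiates with the ResourceManager rather than assuming capacity.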

Put everything above together?

==============
