I. NameNode HA
1. core-site.xml
For MRv1:
  <property>
    <name>fs.default.name</name>
    <value>hdfs://mycluster</value>
  </property>
For YARN:
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://mycluster</value>
  </property>
ZooKeeper ensemble for the failover controllers:
  <property>
    <name>ha.zookeeper.quorum</name>
    <value>zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181</value>
  </property>
2. hdfs-site.xml
  <property>
    <name>dfs.nameservices</name>
    <value>mycluster</value>
  </property>
  <property>
    <name>dfs.ha.namenodes.mycluster</name>
    <value>nn1,nn2</value>
  </property>
  <property>
    <name>dfs.namenode.rpc-address.mycluster.nn1</name>
    <value>machine1.example.com:8020</value>
  </property>
  <property>
    <name>dfs.namenode.rpc-address.mycluster.nn2</name>
    <value>machine2.example.com:8020</value>
  </property>
  <property>
    <name>dfs.namenode.http-address.mycluster.nn1</name>
    <value>machine1.example.com:50070</value>
  </property>
  <property>
    <name>dfs.namenode.http-address.mycluster.nn2</name>
    <value>machine2.example.com:50070</value>
  </property>
  <property>
    <name>dfs.namenode.shared.edits.dir</name>
    <value>qjournal://node1.example.com:8485;node2.example.com:8485;node3.example.com:8485/mycluster</value>
  </property>
=== JournalNode ===
  <property>
    <name>dfs.journalnode.edits.dir</name>
    <value>/data/1/dfs/jn</value>
  </property>
=== Client Failover Configuration ===
  <property>
    <name>dfs.client.failover.proxy.provider.mycluster</name>
    <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
  </property>
=== Fencing Configuration ===
  <property>
    <name>dfs.ha.fencing.methods</name>
    <value>sshfence</value>
  </property>
  <property>
    <name>dfs.ha.fencing.ssh.private-key-files</name>
    <value>/home/exampleuser/.ssh/id_rsa</value>
  </property>
=== Automatic Failover ===
  <property>
    <name>dfs.ha.automatic-failover.enabled</name>
    <value>true</value>
  </property>
3. On one of the NameNode hosts, run the command that initializes the HA state in ZooKeeper:
hdfs zkfc -formatZK
4. Format the NameNode (the format command itself is run in step 6 below)
5. Install and start the JournalNodes (they must be started before the NameNode):
sudo yum install hadoop-hdfs-journalnode
sudo service hadoop-hdfs-journalnode start
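Since the JournalNodes must be up before the NameNode is formatted, it can help to confirm that each JournalNode's RPC port is reachable first. A minimal sketch, assuming bash and coreutils `timeout`; the hostnames and port 8485 come from dfs.namenode.shared.edits.dir above, and `port_open` is a hypothetical helper, not part of Hadoop:

```shell
# port_open HOST PORT -> succeeds if a TCP connection can be opened within 2s.
# Uses bash's /dev/tcp redirection; assumes bash and coreutils `timeout`.
port_open() {
  timeout 2 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null
}

# Check every JournalNode listed in dfs.namenode.shared.edits.dir
# before running the format step.
for jn in node1.example.com node2.example.com node3.example.com; do
  if port_open "$jn" 8485; then
    echo "$jn:8485 reachable"
  else
    echo "$jn:8485 NOT reachable"
  fi
done
```

If any JournalNode reports NOT reachable, fix that before formatting; otherwise the format step will fail when it tries to write to the shared edits directory.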
6. Start the NameNodes
(1) Format and start the primary NameNode:
sudo -u hdfs hadoop namenode -format
sudo service hadoop-hdfs-namenode start
(2) Bootstrap and start the standby NameNode:
sudo -u hdfs hdfs namenode -bootstrapStandby
sudo service hadoop-hdfs-namenode start
7. Configure automatic failover: install and run the ZKFC on each NameNode host
sudo yum install hadoop-hdfs-zkfc
sudo service hadoop-hdfs-zkfc start
8. Verify automatic failover
Kill the active NameNode's process:
kill -9
Then observe whether the standby takes over.
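The kill test can be made less eyeball-driven by polling both NameNode states. A hedged sketch, assuming the stock `hdfs haadmin -getServiceState` subcommand and the nn1/nn2 IDs configured above; `report_active` is just an illustrative helper:

```shell
# report_active reads "id state" lines on stdin and prints the active id(s).
report_active() {
  awk '$2 == "active" { print $1 }'
}

# On a live cluster, feed it the real states (hdfs haadmin is the stock CLI):
#   for nn in nn1 nn2; do
#     echo "$nn $(sudo -u hdfs hdfs haadmin -getServiceState $nn)"
#   done | report_active
# Offline demonstration with canned states:
printf 'nn1 standby\nnn2 active\n' | report_active   # prints: nn2
```

Run it once before and once after killing the active NameNode; the printed id should change if automatic failover worked.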
II. JobTracker HA
1. Install the HA JobTracker package on both nodes:
sudo yum install hadoop-0.20-mapreduce-jobtrackerha
2. If you want automatic failover, also install the ZKFC on both HA JobTracker nodes:
sudo yum install hadoop-0.20-mapreduce-zkfc
3. Configure the HA JobTracker (mapred-site.xml)
  <property>
    <name>mapred.job.tracker</name>
    <value>logicaljt</value>
  </property>
  <property>
    <name>mapred.jobtrackers.logicaljt</name>
    <value>jt1,jt2</value>
    <description>Comma-separated list of JobTracker IDs.</description>
  </property>
  <property>
    <name>mapred.jobtracker.rpc-address.logicaljt.jt1</name>
    <value>myjt1.myco.com:8021</value>
  </property>
  <property>
    <name>mapred.jobtracker.rpc-address.logicaljt.jt2</name>
    <value>myjt2.myco.com:8022</value>
  </property>
  <property>
    <name>mapred.job.tracker.http.address.logicaljt.jt1</name>
    <value>0.0.0.0:50030</value>
  </property>
  <property>
    <name>mapred.job.tracker.http.address.logicaljt.jt2</name>
    <value>0.0.0.0:50031</value>
  </property>
  <property>
    <name>mapred.ha.jobtracker.rpc-address.logicaljt.jt1</name>
    <value>myjt1.myco.com:8023</value>
  </property>
  <property>
    <name>mapred.ha.jobtracker.rpc-address.logicaljt.jt2</name>
    <value>myjt2.myco.com:8024</value>
  </property>
  <property>
    <name>mapred.ha.jobtracker.http-redirect-address.logicaljt.jt1</name>
    <value>myjt1.myco.com:50030</value>
  </property>
  <property>
    <name>mapred.ha.jobtracker.http-redirect-address.logicaljt.jt2</name>
    <value>myjt2.myco.com:50031</value>
  </property>
  <property>
    <name>mapred.jobtracker.restart.recover</name>
    <value>true</value>
  </property>
  <property>
    <name>mapred.job.tracker.persist.jobstatus.active</name>
    <value>true</value>
  </property>
  <property>
    <name>mapred.job.tracker.persist.jobstatus.hours</name>
    <value>1</value>
  </property>
  <property>
    <name>mapred.job.tracker.persist.jobstatus.dir</name>
    <value>/jobtracker/jobsInfo</value>
  </property>
  <property>
    <name>mapred.client.failover.proxy.provider.logicaljt</name>
    <value>org.apache.hadoop.mapred.ConfiguredFailoverProxyProvider</value>
  </property>
  <property>
    <name>mapred.client.failover.max.attempts</name>
    <value>15</value>
  </property>
  <property>
    <name>mapred.client.failover.sleep.base.millis</name>
    <value>500</value>
  </property>
  <property>
    <name>mapred.client.failover.sleep.max.millis</name>
    <value>1500</value>
  </property>
  <property>
    <name>mapred.client.failover.connection.retries</name>
    <value>0</value>
  </property>
  <property>
    <name>mapred.client.failover.connection.retries.on.timeouts</name>
    <value>0</value>
  </property>
  <property>
    <name>mapred.ha.fencing.methods</name>
    <value>shell(/bin/true)</value>
  </property>
4. Start the HA JobTracker on both HA JobTracker nodes:
sudo service hadoop-0.20-mapreduce-jobtrackerha start
If automatic failover is not configured, both JobTrackers come up in standby state.
A JobTracker's state can be queried with:
sudo -u mapred hadoop mrhaadmin -getServiceState
where the argument is one of the JobTracker IDs from mapred.jobtrackers.logicaljt, e.g. jt1 or jt2.
To transition one JobTracker to active:
sudo -u mapred hadoop mrhaadmin -transitionToActive
then confirm the state:
sudo -u mapred hadoop mrhaadmin -getServiceState
5. Verify failover (manual failover):
sudo -u mapred hadoop mrhaadmin -failover
For example, to fail over from the failed active jt1 to jt2, after which jt2 becomes active:
sudo -u mapred hadoop mrhaadmin -failover jt1 jt2
If the failover succeeds, jt2's state becomes active, which can be checked with:
sudo -u mapred hadoop mrhaadmin -getServiceState jt2
6. Configure automatic failover
(1) Install and configure a ZooKeeper ensemble (the one already used for HDFS HA can be shared).
(2) Apply the same manual-failover configuration shown in step 3 above (mapred-site.xml) on both nodes.
(3) Configure the failover-controller parameters in mapred-site.xml:
  <property>
    <name>mapred.ha.automatic-failover.enabled</name>
    <value>true</value>
  </property>
  <property>
    <name>mapred.ha.zkfc.port</name>
    <value>8018</value>
  </property>
(4) Initialize the HA state in ZooKeeper (run this on just one of the JobTracker hosts; the ZooKeeper ensemble must already be running):
sudo service hadoop-0.20-mapreduce-zkfc init
OR
sudo -u mapred hadoop mrzkfc -formatZK
(5) Start automatic failover
On each JobTracker node, start the ZKFC and the JobTracker:
sudo service hadoop-0.20-mapreduce-zkfc start
sudo service hadoop-0.20-mapreduce-jobtrackerha start
(6) Verify automatic failover
First find which JobTracker is active:
sudo -u mapred hadoop mrhaadmin -getServiceState
Then kill the corresponding JVM process:
kill -9
Finally, check whether the active state has moved to the other node.
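The kill-and-check sequence above can be scripted end to end. A hedged sketch, assuming `mrhaadmin -getServiceState` reports exactly `active` or `standby` for each of the jt1/jt2 IDs configured earlier; `find_active` and `failover_happened` are illustrative helpers, not part of Hadoop:

```shell
# find_active reads "id state" lines on stdin and prints the first active id.
find_active() {
  awk '$2 == "active" { print $1; exit }'
}

# failover_happened BEFORE AFTER -> succeeds only when both ids are non-empty
# and different, i.e. the active role actually moved.
failover_happened() {
  [ -n "$1" ] && [ -n "$2" ] && [ "$1" != "$2" ]
}

# On a live cluster the states would come from the real CLI:
#   for jt in jt1 jt2; do
#     echo "$jt $(sudo -u mapred hadoop mrhaadmin -getServiceState $jt)"
#   done | find_active
# Offline demonstration with canned before/after states:
before=$(printf 'jt1 active\njt2 standby\n' | find_active)   # jt1
after=$(printf 'jt1 standby\njt2 active\n' | find_active)    # jt2
failover_happened "$before" "$after" && echo "failover verified"
```

Capture `before` prior to the kill -9, wait a few seconds for the ZKFC election, then capture `after` and compare.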