Author: 步履乘风 | Source: Internet | 2023-07-24 17:10
Last year I hand-built a small eight-node test cluster on Amazon: two NameNodes, one YARN node, and five DataNodes. About two months had passed since I last used it, and it had always worked fine before. Today, after starting the cluster, simply listing files on HDFS produced the following error:
ubuntu@ip-172-31-9-9:~$ hadoop fs -ls /
16/07/18 06:52:48 INFO retry.RetryInvocationHandler: Exception while invoking getFileInfo of class ClientNamenodeProtocolTranslatorPB over ip-172-31-10-20/172.31.10.20:9000 after 1 fail over attempts. Trying to fail over after sleeping for 901ms.
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby
at org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.checkOperation(StandbyState.java:87)
at org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.checkOperation(NameNode.java:1774)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkOperation(FSNamesystem.java:1313)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getFileInfo(FSNamesystem.java:3856)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getFileInfo(NameNodeRpcServer.java:1006)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getFileInfo(ClientNamenodeProtocolServerSideTranslatorPB.java:843)
at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2043)
at org.apache.hadoop.ipc.Client.call(Client.java:1476)
at org.apache.hadoop.ipc.Client.call(Client.java:1407)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
at com.sun.proxy.$Proxy9.getFileInfo(Unknown Source)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:771)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
at com.sun.proxy.$Proxy10.getFileInfo(Unknown Source)
at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:2116)
at org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1305)
at org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1301)
at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1301)
at org.apache.hadoop.fs.Globber.getFileStatus(Globber.java:57)
at org.apache.hadoop.fs.Globber.glob(Globber.java:265)
at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:1655)
at org.apache.hadoop.fs.shell.PathData.expandAsGlob(PathData.java:326)
at org.apache.hadoop.fs.shell.Command.expandArgument(Command.java:235)
at org.apache.hadoop.fs.shell.Command.expandArguments(Command.java:218)
at org.apache.hadoop.fs.shell.Command.processRawArguments(Command.java:201)
at org.apache.hadoop.fs.shell.Command.run(Command.java:165)
at org.apache.hadoop.fs.FsShell.run(FsShell.java:287)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
at org.apache.hadoop.fs.FsShell.main(FsShell.java:340)
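The decisive line is buried deep in that trace. A quick way to spot it in a captured client log is a grep; this is just a sketch, and detect_standby_error is a made-up helper name, not a Hadoop tool:

```shell
# Sketch: scan captured Hadoop client output for the both-standby symptom.
# detect_standby_error is a hypothetical helper, not part of Hadoop.
detect_standby_error() {
  # Reads log text on stdin; prints a one-word diagnosis.
  if grep -q "Operation category READ is not supported in state standby"; then
    echo "standby-error"
  else
    echo "ok"
  fi
}

# Example: capture client stderr, then scan it, e.g.
#   hadoop fs -ls / 2>client.log; detect_standby_error <client.log
printf '%s\n' "StandbyException: Operation category READ is not supported in state standby" \
  | detect_standby_error
```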
The line that stands out in the log is "Operation category READ is not supported in state standby": a NameNode in standby state does not serve reads. Normally one NameNode is active and the other standby, so this suggested that both NameNodes were stuck in standby. Opening the NameNode web UI in a browser at http://###.###.###.###:50070 confirmed that both NameNodes were indeed in standby state.
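Instead of the web UI, the same check can be done from the command line with hdfs haadmin -getServiceState nn1 (and nn2), where nn1/nn2 are the logical NameNode IDs from dfs.ha.namenodes.* in hdfs-site.xml. A small sketch of turning the two reported states into a diagnosis; diagnose_ha is a made-up helper:

```shell
# Sketch: classify the HA pair from the two service states, as reported by
#   hdfs haadmin -getServiceState <namenode-id>
# diagnose_ha is a hypothetical helper, not part of Hadoop.
diagnose_ha() {
  # $1 = state of the first NameNode, $2 = state of the second
  case "$1/$2" in
    active/standby|standby/active) echo "healthy" ;;
    standby/standby)               echo "both-standby" ;;  # the situation in this post
    active/active)                 echo "split-brain" ;;   # worse: two active writers
    *)                             echo "unknown" ;;
  esac
}

diagnose_ha standby standby   # the state observed here
```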
It is odd that this happened at all: the cluster configuration had not been changed, and both Hadoop and ZooKeeper started up normally. The root cause is still unknown; to be investigated further.
I followed the blog post "hadoop HA启动时 两个namenode节点都是standby,解决办法" (both NameNodes standby when starting Hadoop HA, and how to fix it), which suggests:
- First, confirm that your Hadoop cluster works without HA; otherwise you will be searching for the error in the wrong place.
- If it does, remember that HA depends on ZooKeeper, so check whether the problem is a ZooKeeper misconfiguration.
- If both are fine, run sbin/hadoop-daemon.sh start zkfc from the Hadoop install directory to start the ZooKeeper-based election, then run bin/hdfs haadmin -transitionToActive nn2, where nn2 is one of your NameNode IDs. (On a freshly configured HA cluster, the cause may be that the ZKFC (DFSZKFailoverController) state was never formatted, so the automatic NameNode failover mechanism was never enabled.)
- In hadoop-env.sh you only need to set JAVA_HOME, nothing else; HADOOP_HOME and PATH belong in /etc/profile.
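For reference, the automatic failover the steps above rely on hinges on two standard Hadoop 2.x properties; a minimal sketch, with placeholder ZooKeeper hostnames:

```xml
<!-- hdfs-site.xml: let the ZKFC daemons drive NameNode failover -->
<property>
  <name>dfs.ha.automatic-failover.enabled</name>
  <value>true</value>
</property>

<!-- core-site.xml: the ZooKeeper ensemble the ZKFCs coordinate through -->
<property>
  <name>ha.zookeeper.quorum</name>
  <value>zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181</value>
</property>
```

With these in place, the ZKFC daemon on each NameNode host runs the election; if the ZooKeeper state was never initialized, running hdfs zkfc -formatZK once (on one NameNode host) creates the znode the ZKFCs need.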
My cluster had previously run fine and ZooKeeper was healthy, so I started testing from the third step:
ubuntu@ip-172-31-9-9:~$ hadoop/hadoop-2.7.1/sbin/hadoop-daemon.sh start zkfc
starting zkfc, logging to /home/ubuntu/hadoop/hadoop-2.7.1/logs/hadoop-ubuntu-zkfc-ip-172-31-9-9.out
ubuntu@ip-172-31-9-9:~$ hadoop/hadoop-2.7.1/bin/hdfs haadmin -transitionToActive nn2
Automatic failover is enabled for NameNode at ip-172-31-9-9/172.31.9.9:9000
Refusing to manually manage HA state, since it may cause
a split-brain scenario or other incorrect state.
If you are very sure you know what you are doing, please
specify the --forcemanual flag.
ubuntu@ip-172-31-9-9:~$ hadoop/hadoop-2.7.1/bin/hdfs haadmin -transitionToActive --forcemanual nn2
You have specified the --forcemanual flag. This flag is dangerous, as it can induce a split-brain scenario that WILL CORRUPT your HDFS namespace, possibly irrecoverably.
It is recommended not to use this flag, but instead to shut down the cluster and disable automatic failover if you prefer to manually manage your HA state.
You may abort safely by answering 'n' or hitting ^C now.
Are you sure you want to continue? (Y or N) y
16/07/18 07:05:34 WARN ha.HAAdmin: Proceeding with manual HA state management even though
automatic failover is enabled for NameNode at ip-172-31-9-9/172.31.9.9:9000
16/07/18 07:05:35 WARN ha.HAAdmin: Proceeding with manual HA state management even though
automatic failover is enabled for NameNode at ip-172-31-10-20/172.31.10.20:9000
ubuntu@ip-172-31-9-9:~$ hadoop fs -ls /
Found 8 items
drwxr-xr-x - ubuntu supergroup 0 2016-01-10 13:57 /hadooptest
That solved the problem.