HDFS打开一个文件,需要在客户端调用DistributedFileSystem.open(Path f, int bufferSize),其实现为:
public FSDataInputStream open(Path f, int bufferSize) throws IOException { return new DFSClient.DFSDataInputStream( dfs.open(getPathName(f), bufferSize, verifyChecksum, statistics)); } |
其中dfs为DistributedFileSystem的成员变量DFSClient,其open函数被调用,其中创建一个DFSInputStream(src, buffersize, verifyChecksum)并返回。
在DFSInputStream的构造函数中,openInfo函数被调用,其主要从namenode中得到要打开的文件所对应的blocks的信息,实现如下:
synchronized void openInfo() throws IOException { LocatedBlocks newInfo = callGetBlockLocations(namenode, src, 0, prefetchSize); this.locatedBlocks = newInfo; this.currentNode = null; } |
private static LocatedBlocks callGetBlockLocations(ClientProtocol namenode, String src, long start, long length) throws IOException { return namenode.getBlockLocations(src, start, length); } |
LocatedBlocks主要包含一个链表的List
上面namenode.getBlockLocations是一个RPC调用,最终调用NameNode类的getBlockLocations函数。
NameNode.getBlockLocations实现如下:
public LocatedBlocks getBlockLocations(String src, long offset, long length) throws IOException { return namesystem.getBlockLocations(getClientMachine(), src, offset, length); } |
namesystem是NameNode一个成员变量,其类型为FSNamesystem,保存的是NameNode的name space树,其中一个重要的成员变量为FSDirectory dir。
FSDirectory和Lucene中的FSDirectory没有任何关系,其主要包括FSImage fsImage,用于读写硬盘上的fsimage文件,FSImage类有成员变量FSEditLog editLog,用于读写硬盘上的edit文件,这两个文件的关系在上一篇文章中已经解释过。
FSDirectory还有一个重要的成员变量INodeDirectoryWithQuota rootDir,INodeDirectoryWithQuota的父类为INodeDirectory,实现如下:
public class INodeDirectory extends INode { …… private List …… } |
由此可见INodeDirectory本身是一个INode,其中包含一个链表的INode,此链表中,如果仍为文件夹,则是类型INodeDirectory,如果是文件,则是类型INodeFile,INodeFile中有成员变量BlockInfo blocks[],是此文件包含的block的信息。显然这是一棵树形的结构。
FSNamesystem.getBlockLocations函数如下:
public LocatedBlocks getBlockLocations(String src, long offset, long length, boolean doAccessTime) throws IOException { final LocatedBlocks ret = getBlockLocationsInternal(src, dir.getFileINode(src), offset, length, Integer.MAX_VALUE, doAccessTime); return ret; } |
dir.getFileINode(src)通过路径名从文件系统树中找到INodeFile,其中保存的是要打开的文件的INode的信息。
getBlockLocationsInternal的实现如下:
private synchronized LocatedBlocks getBlockLocationsInternal(String src, INodeFile inode, long offset, long length, int nrBlocksToReturn, boolean doAccessTime) throws IOException { //得到此文件的block信息 Block[] blocks = inode.getBlocks(); List //计算从offset开始,长度为length所涉及的blocks int curBlk = 0; long curPos = 0, blkSize = 0; int nrBlocks = (blocks[0].getNumBytes() == 0) ? 0 : blocks.length; for (curBlk = 0; curBlk blkSize = blocks[curBlk].getNumBytes(); if (curPos + blkSize > offset) { //当offset在curPos和curPos + blkSize之间的时候,curBlk指向offset所在的block break; } curPos += blkSize; } long endOff = offset + length; //循环,依次遍历从curBlk开始的每个block,直到当前位置curPos越过endOff do { int numNodes = blocksMap.numNodes(blocks[curBlk]); int numCorruptNodes = countNodes(blocks[curBlk]).corruptReplicas(); int numCorruptReplicas = corruptReplicas.numCorruptReplicas(blocks[curBlk]); boolean blockCorrupt = (numCorruptNodes == numNodes); int numMachineSet = blockCorrupt ? numNodes : (numNodes - numCorruptNodes); //依次找到此block所对应的datanode,将其中没有损坏的放入machineSet中 DatanodeDescriptor[] machineSet = new DatanodeDescriptor[numMachineSet]; if (numMachineSet > 0) { numNodes = 0; for(Iterator blocksMap.nodeIterator(blocks[curBlk]); it.hasNext();) { DatanodeDescriptor dn = it.next(); boolean replicaCorrupt = corruptReplicas.isReplicaCorrupt(blocks[curBlk], dn); if (blockCorrupt || (!blockCorrupt && !replicaCorrupt)) machineSet[numNodes++] = dn; } } //使用此machineSet和当前的block构造一个LocatedBlock results.add(new LocatedBlock(blocks[curBlk], machineSet, curPos, blockCorrupt)); curPos += blocks[curBlk].getNumBytes(); curBlk++; } while (curPos && curBlk && results.size() //使用此LocatedBlock链表构造一个LocatedBlocks对象返回 return inode.createLocatedBlocks(results); } |
通过RPC调用,在NameNode得到的LocatedBlocks对象,作为成员变量构造DFSInputStream对象,最后包装为FSDataInputStream返回给用户。
文件读取的时候,客户端利用文件打开的时候得到的FSDataInputStream.read(long position, byte[] buffer, int offset, int length)函数进行文件读操作。
FSDataInputStream会调用其封装的DFSInputStream的read(long position, byte[] buffer, int offset, int length)函数,实现如下:
public int read(long position, byte[] buffer, int offset, int length) throws IOException { long filelen = getFileLength(); int realLen = length; if ((position + length) > filelen) { realLen = (int)(filelen - position); } //首先得到包含从offset到offset + length内容的block列表 //比如对于64M一个block的文件系统来说,欲读取从100M开始,长度为128M的数据,则block列表包括第2,3,4块block List int remaining = realLen; //对每一个block,从中读取内容 //对于上面的例子,对于第2块block,读取从36M开始,读取长度28M,对于第3块,读取整一块64M,对于第4块,读取从0开始,长度为36M,共128M数据 for (LocatedBlock blk : blockRange) { long targetStart = position - blk.getStartOffset(); long bytesToRead = Math.min(remaining, blk.getBlockSize() - targetStart); fetchBlockByteRange(blk, targetStart, targetStart + bytesToRead - 1, buffer, offset); remaining -= bytesToRead; position += bytesToRead; offset += bytesToRead; } assert remaining == 0 : "Wrong number of bytes read."; if (stats != null) { stats.incrementBytesRead(realLen); } return realLen; } |
其中getBlockRange函数如下:
private synchronized List long length) throws IOException { List //首先从缓存的locatedBlocks中查找offset所在的block在缓存链表中的位置 int blockIdx &#61; locatedBlocks.findBlock(offset); if (blockIdx <0) { // block is not cached blockIdx &#61; LocatedBlocks.getInsertIndex(blockIdx); } long remaining &#61; length; long curOff &#61; offset; while(remaining > 0) { LocatedBlock blk &#61; null; //按照blockIdx的位置找到block if(blockIdx blk &#61; locatedBlocks.get(blockIdx); //如果block为空&#xff0c;则缓存中没有此block&#xff0c;则直接从NameNode中查找这些block&#xff0c;并加入缓存 if (blk &#61;&#61; null || curOff LocatedBlocks newBlocks; newBlocks &#61; callGetBlockLocations(namenode, src, curOff, remaining); locatedBlocks.insertRange(blockIdx, newBlocks.getLocatedBlocks()); continue; } //如果block找到&#xff0c;则放入结果集 blockRange.add(blk); long bytesRead &#61; blk.getStartOffset() &#43; blk.getBlockSize() - curOff; remaining -&#61; bytesRead; curOff &#43;&#61; bytesRead; //取下一个block blockIdx&#43;&#43;; } return blockRange; } |
其中fetchBlockByteRange实现如下&#xff1a;
private void fetchBlockByteRange(LocatedBlock block, long start, long end, byte[] buf, int offset) throws IOException { Socket dn &#61; null; int numAttempts &#61; block.getLocations().length; //此while循环为读取失败后的重试次数 while (dn &#61;&#61; null && numAttempts-- > 0 ) { //选择一个DataNode来读取数据 DNAddrPair retval &#61; chooseDataNode(block); DatanodeInfo chosenNode &#61; retval.info; InetSocketAddress targetAddr &#61; retval.addr; BlockReader reader &#61; null; try { //创建Socket连接到DataNode dn &#61; socketFactory.createSocket(); dn.connect(targetAddr, socketTimeout); dn.setSoTimeout(socketTimeout); int len &#61; (int) (end - start &#43; 1); //利用建立的Socket链接&#xff0c;生成一个reader负责从DataNode读取数据 reader &#61; BlockReader.newBlockReader(dn, src, block.getBlock().getBlockId(), block.getBlock().getGenerationStamp(), start, len, buffersize, verifyChecksum, clientName); //读取数据 int nread &#61; reader.readAll(buf, offset, len); return; } finally { IOUtils.closeStream(reader); IOUtils.closeSocket(dn); dn &#61; null; } //如果读取失败&#xff0c;则将此DataNode标记为失败节点 addToDeadNodes(chosenNode); } } |
BlockReader.newBlockReader函数实现如下&#xff1a;
public static BlockReader newBlockReader( Socket sock, String file, long blockId, long genStamp, long startOffset, long len, int bufferSize, boolean verifyChecksum, String clientName) throws IOException { //使用Socket建立写入流&#xff0c;向DataNode发送读指令 DataOutputStream out &#61; new DataOutputStream( new BufferedOutputStream(NetUtils.getOutputStream(sock,HdfsConstants.WRITE_TIMEOUT))); out.writeShort( DataTransferProtocol.DATA_TRANSFER_VERSION ); out.write( DataTransferProtocol.OP_READ_BLOCK ); out.writeLong( blockId ); out.writeLong( genStamp ); out.writeLong( startOffset ); out.writeLong( len ); Text.writeString(out, clientName); out.flush(); //使用Socket建立读入流&#xff0c;用于从DataNode读取数据 DataInputStream in &#61; new DataInputStream( new BufferedInputStream(NetUtils.getInputStream(sock), bufferSize)); DataChecksum checksum &#61; DataChecksum.newDataChecksum( in ); long firstChunkOffset &#61; in.readLong(); //生成一个reader&#xff0c;主要包含读入流&#xff0c;用于读取数据 return new BlockReader( file, blockId, in, checksum, verifyChecksum, startOffset, firstChunkOffset, sock ); } |
BlockReader的readAll函数就是用上面生成的DataInputStream读取数据。
在DataNode启动的时候&#xff0c;会调用函数startDataNode&#xff0c;其中与数据读取有关的逻辑如下&#xff1a;
void startDataNode(Configuration conf, AbstractList ) throws IOException { …… // 建立一个ServerSocket&#xff0c;并生成一个DataXceiverServer来监控客户端的链接 ServerSocket ss &#61; (socketWriteTimeout > 0) ? ServerSocketChannel.open().socket() : new ServerSocket(); Server.bind(ss, socAddr, 0); ss.setReceiveBufferSize(DEFAULT_DATA_SOCKET_SIZE); // adjust machine name with the actual port tmpPort &#61; ss.getLocalPort(); selfAddr &#61; new InetSocketAddress(ss.getInetAddress().getHostAddress(), tmpPort); this.dnRegistration.setName(machineName &#43; ":" &#43; tmpPort); this.threadGroup &#61; new ThreadGroup("dataXceiverServer"); this.dataXceiverServer &#61; new Daemon(threadGroup, new DataXceiverServer(ss, conf, this)); this.threadGroup.setDaemon(true); // auto destroy when empty …… } |
DataXceiverServer.run()函数如下&#xff1a;
public void run() { while (datanode.shouldRun) { //接受客户端的链接 Socket s &#61; ss.accept(); s.setTcpNoDelay(true); //生成一个线程DataXceiver来对建立的链接提供服务 new Daemon(datanode.threadGroup, new DataXceiver(s, datanode, this)).start(); } try { ss.close(); } catch (IOException ie) { LOG.warn(datanode.dnRegistration &#43; ":DataXceiveServer: " &#43; StringUtils.stringifyException(ie)); } } |
DataXceiver.run()函数如下&#xff1a;
public void run() { DataInputStream in&#61;null; try { //建立一个输入流&#xff0c;读取客户端发送的指令 in &#61; new DataInputStream( new BufferedInputStream(NetUtils.getInputStream(s), SMALL_BUFFER_SIZE)); short version &#61; in.readShort(); boolean local &#61; s.getInetAddress().equals(s.getLocalAddress()); byte op &#61; in.readByte(); // Make sure the xciver count is not exceeded int curXceiverCount &#61; datanode.getXceiverCount(); long startTime &#61; DataNode.now(); switch ( op ) { //读取 case DataTransferProtocol.OP_READ_BLOCK: //真正的读取数据 readBlock( in ); datanode.myMetrics.readBlockOp.inc(DataNode.now() - startTime); if (local) datanode.myMetrics.readsFromLocalClient.inc(); else datanode.myMetrics.readsFromRemoteClient.inc(); break; //写入 case DataTransferProtocol.OP_WRITE_BLOCK: //真正的写入数据 writeBlock( in ); datanode.myMetrics.writeBlockOp.inc(DataNode.now() - startTime); if (local) datanode.myMetrics.writesFromLocalClient.inc(); else datanode.myMetrics.writesFromRemoteClient.inc(); break; //其他的指令 …… } } catch (Throwable t) { LOG.error(datanode.dnRegistration &#43; ":DataXceiver",t); } finally { IOUtils.closeStream(in); IOUtils.closeSocket(s); dataXceiverServer.childSockets.remove(s); } } |
private void readBlock(DataInputStream in) throws IOException { //读取指令 long blockId &#61; in.readLong(); Block block &#61; new Block( blockId, 0 , in.readLong()); long startOffset &#61; in.readLong(); long length &#61; in.readLong(); String clientName &#61; Text.readString(in); //创建一个写入流&#xff0c;用于向客户端写数据 OutputStream baseStream &#61; NetUtils.getOutputStream(s, datanode.socketWriteTimeout); DataOutputStream out &#61; new DataOutputStream( new BufferedOutputStream(baseStream, SMALL_BUFFER_SIZE)); //生成BlockSender用于读取本地的block的数据&#xff0c;并发送给客户端 //BlockSender有一个成员变量InputStream blockIn用于读取本地block的数据 BlockSender blockSender &#61; new BlockSender(block, startOffset, length, true, true, false, datanode, clientTraceFmt); out.writeShort(DataTransferProtocol.OP_STATUS_SUCCESS); // send op status //向客户端写入数据 long read &#61; blockSender.sendBlock(out, baseStream, null); …… } finally { IOUtils.closeStream(out); IOUtils.closeStream(blockSender); } } |
下面解析向hdfs上传一个文件的过程。
上传一个文件到hdfs&#xff0c;一般会调用DistributedFileSystem.create&#xff0c;其实现如下&#xff1a;
public FSDataOutputStream create(Path f, FsPermission permission, boolean overwrite, int bufferSize, short replication, long blockSize, Progressable progress) throws IOException { return new FSDataOutputStream (dfs.create(getPathName(f), permission, overwrite, replication, blockSize, progress, bufferSize), statistics); } |
其最终生成一个FSDataOutputStream用于向新生成的文件中写入数据。其成员变量dfs的类型为DFSClient&#xff0c;DFSClient的create函数如下&#xff1a;
public OutputStream create(String src, FsPermission permission, boolean overwrite, short replication, long blockSize, Progressable progress, int buffersize ) throws IOException { checkOpen(); if (permission &#61;&#61; null) { permission &#61; FsPermission.getDefault(); } FsPermission masked &#61; permission.applyUMask(FsPermission.getUMask(conf)); OutputStream result &#61; new DFSOutputStream(src, masked, overwrite, replication, blockSize, progress, buffersize, conf.getInt("io.bytes.per.checksum", 512)); leasechecker.put(src, result); return result; } |
其中构造了一个DFSOutputStream&#xff0c;在其构造函数中&#xff0c;同过RPC调用NameNode的create来创建一个文件。
当然&#xff0c;构造函数中还做了一件重要的事情&#xff0c;就是streamer.start()&#xff0c;也即启动了一个pipeline&#xff0c;用于写数据&#xff0c;在写入数据的过程中&#xff0c;我们会仔细分析。
DFSOutputStream(String src, FsPermission masked, boolean overwrite, short replication, long blockSize, Progressable progress, int buffersize, int bytesPerChecksum) throws IOException { this(src, blockSize, progress, bytesPerChecksum); computePacketChunkSize(writePacketSize, bytesPerChecksum); try { namenode.create( src, masked, clientName, overwrite, replication, blockSize); } catch(RemoteException re) { throw re.unwrapRemoteException(AccessControlException.class, QuotaExceededException.class); } streamer.start(); } |
NameNode的create函数调用namesystem.startFile函数&#xff0c;其又调用startFileInternal函数&#xff0c;实现如下&#xff1a;
private synchronized void startFileInternal(String src, PermissionStatus permissions, String holder, String clientMachine, boolean overwrite, boolean append, short replication, long blockSize ) throws IOException { ...... //创建一个新的文件&#xff0c;状态为under construction&#xff0c;没有任何data block与之对应 long genstamp &#61; nextGenerationStamp(); INodeFileUnderConstruction newNode &#61; dir.addFile(src, permissions, replication, blockSize, holder, clientMachine, clientNode, genstamp); ...... } |
下面轮到客户端向新创建的文件中写入数据了&#xff0c;一般会使用FSDataOutputStream的write函数&#xff0c;最终会调用DFSOutputStream的writeChunk函数&#xff1a;
按照hdfs的设计&#xff0c;对block的数据写入使用的是pipeline的方式&#xff0c;也即将数据分成一个个的package&#xff0c;如果需要复制三分&#xff0c;分别写入DataNode 1, 2, 3&#xff0c;则会进行如下的过程&#xff1a;
protected synchronized void writeChunk(byte[] b, int offset, int len, byte[] checksum) throws IOException { //创建一个package&#xff0c;并写入数据 currentPacket &#61; new Packet(packetSize, chunksPerPacket, bytesCurBlock); currentPacket.writeChecksum(checksum, 0, cklen); currentPacket.writeData(b, offset, len); currentPacket.numChunks&#43;&#43;; bytesCurBlock &#43;&#61; len; //如果此package已满&#xff0c;则放入队列中准备发送 if (currentPacket.numChunks &#61;&#61; currentPacket.maxChunks || bytesCurBlock &#61;&#61; blockSize) { ...... dataQueue.addLast(currentPacket); //唤醒等待dataqueue的传输线程&#xff0c;也即DataStreamer dataQueue.notifyAll(); currentPacket &#61; null; ...... } } |
DataStreamer的run函数如下&#xff1a;
public void run() { while (!closed && clientRunning) { Packet one &#61; null; synchronized (dataQueue) { //如果队列中没有package&#xff0c;则等待 while ((!closed && !hasError && clientRunning && dataQueue.size() &#61;&#61; 0) || doSleep) { try { dataQueue.wait(1000); } catch (InterruptedException e) { } doSleep &#61; false; } try { //得到队列中的第一个package one &#61; dataQueue.getFirst(); long offsetInBlock &#61; one.offsetInBlock; //由NameNode分配block&#xff0c;并生成一个写入流指向此block if (blockStream &#61;&#61; null) { nodes &#61; nextBlockOutputStream(src); response &#61; new ResponseProcessor(nodes); response.start(); } ByteBuffer buf &#61; one.getBuffer(); //将package从dataQueue移至ackQueue,等待确认 dataQueue.removeFirst(); dataQueue.notifyAll(); synchronized (ackQueue) { ackQueue.addLast(one); ackQueue.notifyAll(); } //利用生成的写入流将数据写入DataNode中的block blockStream.write(buf.array(), buf.position(), buf.remaining()); if (one.lastPacketInBlock) { blockStream.writeInt(0); //表示此block写入完毕 } blockStream.flush(); } catch (Throwable e) { } } ...... } |
其中重要的一个函数是nextBlockOutputStream&#xff0c;实现如下&#xff1a;
private DatanodeInfo[] nextBlockOutputStream(String client) throws IOException { LocatedBlock lb &#61; null; boolean retry &#61; false; DatanodeInfo[] nodes; int count &#61; conf.getInt("dfs.client.block.write.retries", 3); boolean success; do { ...... //由NameNode为文件分配DataNode和block lb &#61; locateFollowingBlock(startTime); block &#61; lb.getBlock(); nodes &#61; lb.getLocations(); //创建向DataNode的写入流 success &#61; createBlockOutputStream(nodes, clientName, false); ...... } while (retry && --count >&#61; 0); return nodes; } |
locateFollowingBlock中通过RPC调用namenode.addBlock(src, clientName)函数
NameNode的addBlock函数实现如下&#xff1a;
public LocatedBlock addBlock(String src, String clientName) throws IOException { LocatedBlock locatedBlock &#61; namesystem.getAdditionalBlock(src, clientName); return locatedBlock; } |
FSNamesystem的getAdditionalBlock实现如下&#xff1a;
public LocatedBlock getAdditionalBlock(String src, String clientName ) throws IOException { long fileLength, blockSize; int replication; DatanodeDescriptor clientNode &#61; null; Block newBlock &#61; null; ...... //为新的block选择DataNode DatanodeDescriptor targets[] &#61; replicator.chooseTarget(replication, clientNode, null, blockSize); ...... //得到文件路径中所有path的INode&#xff0c;其中最后一个是新添加的文件对的INode&#xff0c;状态为under construction INode[] pathINodes &#61; dir.getExistingPathINodes(src); int inodesLen &#61; pathINodes.length; INodeFileUnderConstruction pendingFile &#61; (INodeFileUnderConstruction) pathINodes[inodesLen - 1]; //为文件分配block, 并设置在那写DataNode上 newBlock &#61; allocateBlock(src, pathINodes); pendingFile.setTargets(targets); ...... return new LocatedBlock(newBlock, targets, fileLength); } |
在分配了DataNode和block以后&#xff0c;createBlockOutputStream开始写入数据。
private boolean createBlockOutputStream(DatanodeInfo[] nodes, String client, boolean recoveryFlag) { //创建一个socket&#xff0c;链接DataNode InetSocketAddress target &#61; NetUtils.createSocketAddr(nodes[0].getName()); s &#61; socketFactory.createSocket(); int timeoutValue &#61; 3000 * nodes.length &#43; socketTimeout; s.connect(target, timeoutValue); s.setSoTimeout(timeoutValue); s.setSendBufferSize(DEFAULT_DATA_SOCKET_SIZE); long writeTimeout &#61; HdfsConstants.WRITE_TIMEOUT_EXTENSION * nodes.length &#43; datanodeWriteTimeout; DataOutputStream out &#61; new DataOutputStream( new BufferedOutputStream(NetUtils.getOutputStream(s, writeTimeout), DataNode.SMALL_BUFFER_SIZE)); blockReplyStream &#61; new DataInputStream(NetUtils.getInputStream(s)); //写入指令 out.writeShort( DataTransferProtocol.DATA_TRANSFER_VERSION ); out.write( DataTransferProtocol.OP_WRITE_BLOCK ); out.writeLong( block.getBlockId() ); out.writeLong( block.getGenerationStamp() ); out.writeInt( nodes.length ); out.writeBoolean( recoveryFlag ); Text.writeString( out, client ); out.writeBoolean(false); out.writeInt( nodes.length - 1 ); //注意&#xff0c;次循环从1开始&#xff0c;而非从0开始。将除了第一个DataNode以外的另外两个DataNode的信息发送给第一个DataNode, 第一个DataNode可以根据此信息将数据写给另两个DataNode for (int i &#61; 1; i nodes[i].write(out); } checksum.writeHeader( out ); out.flush(); firstBadLink &#61; Text.readString(blockReplyStream); if (firstBadLink.length() !&#61; 0) { throw new IOException("Bad connect ack with firstBadLink " &#43; firstBadLink); } blockStream &#61; out; } |
客户端在DataStreamer的run函数中创建了写入流后&#xff0c;调用blockStream.write将数据写入DataNode
DataNode的DataXceiver中&#xff0c;收到指令DataTransferProtocol.OP_WRITE_BLOCK则调用writeBlock函数&#xff1a;
private void writeBlock(DataInputStream in) throws IOException { DatanodeInfo srcDataNode &#61; null; //读入头信息 Block block &#61; new Block(in.readLong(), dataXceiverServer.estimateBlockSize, in.readLong()); int pipelineSize &#61; in.readInt(); // num of datanodes in entire pipeline boolean isRecovery &#61; in.readBoolean(); // is this part of recovery? String client &#61; Text.readString(in); // working on behalf of this client boolean hasSrcDataNode &#61; in.readBoolean(); // is src node info present if (hasSrcDataNode) { srcDataNode &#61; new DatanodeInfo(); srcDataNode.readFields(in); } int numTargets &#61; in.readInt(); if (numTargets <0) { throw new IOException("Mislabelled incoming datastream."); } //读入剩下的DataNode列表&#xff0c;如果当前是第一个DataNode&#xff0c;则此列表中收到的是第二个&#xff0c;第三个DataNode的信息&#xff0c;如果当前是第二个DataNode&#xff0c;则受到的是第三个DataNode的信息 DatanodeInfo targets[] &#61; new DatanodeInfo[numTargets]; for (int i &#61; 0; i DatanodeInfo tmp &#61; new DatanodeInfo(); tmp.readFields(in); targets[i] &#61; tmp; } DataOutputStream mirrorOut &#61; null; // stream to next target DataInputStream mirrorIn &#61; null; // reply from next target DataOutputStream replyOut &#61; null; // stream to prev target Socket mirrorSock &#61; null; // socket to next target BlockReceiver blockReceiver &#61; null; // responsible for data handling String mirrorNode &#61; null; // the name:port of next target String firstBadLink &#61; ""; // first datanode that failed in connection setup try { //生成一个BlockReceiver, 其有成员变量DataInputStream in为从客户端或者上一个DataNode读取数据&#xff0c;还有成员变量DataOutputStream mirrorOut&#xff0c;用于向下一个DataNode写入数据&#xff0c;还有成员变量OutputStream out用于将数据写入本地。 blockReceiver &#61; new BlockReceiver(block, in, s.getRemoteSocketAddress().toString(), s.getLocalSocketAddress().toString(), isRecovery, client, srcDataNode, datanode); // get a connection back to the previous target replyOut &#61; new DataOutputStream( NetUtils.getOutputStream(s, datanode.socketWriteTimeout)); //如果当前不是最后一个DataNode&#xff0c;则同下一个DataNode建立socket连接 if (targets.length > 0) { InetSocketAddress mirrorTarget &#61; null; // Connect to backup machine mirrorNode &#61; targets[0].getName(); mirrorTarget &#61; NetUtils.createSocketAddr(mirrorNode); mirrorSock &#61; datanode.newSocket(); int timeoutValue &#61; numTargets * datanode.socketTimeout; int writeTimeout &#61; datanode.socketWriteTimeout &#43; (HdfsConstants.WRITE_TIMEOUT_EXTENSION * numTargets); mirrorSock.connect(mirrorTarget, timeoutValue); mirrorSock.setSoTimeout(timeoutValue); mirrorSock.setSendBufferSize(DEFAULT_DATA_SOCKET_SIZE); //创建向下一个DataNode写入数据的流 mirrorOut &#61; new DataOutputStream( new BufferedOutputStream( NetUtils.getOutputStream(mirrorSock, writeTimeout), SMALL_BUFFER_SIZE)); mirrorIn &#61; new DataInputStream(NetUtils.getInputStream(mirrorSock)); mirrorOut.writeShort( DataTransferProtocol.DATA_TRANSFER_VERSION ); mirrorOut.write( DataTransferProtocol.OP_WRITE_BLOCK ); mirrorOut.writeLong( block.getBlockId() ); mirrorOut.writeLong( block.getGenerationStamp() ); mirrorOut.writeInt( pipelineSize ); mirrorOut.writeBoolean( isRecovery ); Text.writeString( mirrorOut, client ); mirrorOut.writeBoolean(hasSrcDataNode); if (hasSrcDataNode) { // pass src node information srcDataNode.write(mirrorOut); } mirrorOut.writeInt( targets.length - 1 ); //此出也是从1开始&#xff0c;将除了下一个DataNode的其他DataNode信息发送给下一个DataNode for ( int i &#61; 1; i targets[i].write( mirrorOut ); } blockReceiver.writeChecksumHeader(mirrorOut); mirrorOut.flush(); } //使用BlockReceiver接受block String mirrorAddr &#61; (mirrorSock &#61;&#61; null) ? null : mirrorNode; blockReceiver.receiveBlock(mirrorOut, mirrorIn, replyOut, mirrorAddr, null, targets.length); ...... } finally { // close all opened streams IOUtils.closeStream(mirrorOut); IOUtils.closeStream(mirrorIn); IOUtils.closeStream(replyOut); IOUtils.closeSocket(mirrorSock); IOUtils.closeStream(blockReceiver); } } |
BlockReceiver的receiveBlock函数中&#xff0c;一段重要的逻辑如下&#xff1a;
void receiveBlock( DataOutputStream mirrOut, // output to next datanode DataInputStream mirrIn, // input from next datanode DataOutputStream replyOut, // output to previous datanode String mirrAddr, BlockTransferThrottler throttlerArg, int numTargets) throws IOException { ...... //不断的接受package&#xff0c;直到结束 while (receivePacket() > 0) {} if (mirrorOut !&#61; null) { try { mirrorOut.writeInt(0); // mark the end of the block mirrorOut.flush(); } catch (IOException e) { handleMirrorOutError(e); } } ...... } |
BlockReceiver的receivePacket函数如下&#xff1a;
private int receivePacket() throws IOException { //从客户端或者上一个节点接收一个package int payloadLen &#61; readNextPacket(); buf.mark(); //read the header buf.getInt(); // packet length offsetInBlock &#61; buf.getLong(); // get offset of packet in block long seqno &#61; buf.getLong(); // get seqno boolean lastPacketInBlock &#61; (buf.get() !&#61; 0); int endOfHeader &#61; buf.position(); buf.reset(); setBlockPosition(offsetInBlock); //将package写入下一个DataNode if (mirrorOut !&#61; null) { try { mirrorOut.write(buf.array(), buf.position(), buf.remaining()); mirrorOut.flush(); } catch (IOException e) { handleMirrorOutError(e); } } buf.position(endOfHeader); int len &#61; buf.getInt(); offsetInBlock &#43;&#61; len; int checksumLen &#61; ((len &#43; bytesPerChecksum - 1)/bytesPerChecksum)* checksumSize; int checksumOff &#61; buf.position(); int dataOff &#61; checksumOff &#43; checksumLen; byte pktBuf[] &#61; buf.array(); buf.position(buf.limit()); // move to the end of the data. ...... //将数据写入本地的block out.write(pktBuf, dataOff, len); /// flush entire packet before sending ack flush(); // put in queue for pending acks if (responder !&#61; null) { ((PacketResponder)responder.getRunnable()).enqueue(seqno, lastPacketInBlock); } return payloadLen; } |