
An Analysis of the HDFS Read and Write Paths


1. Opening a File

1.1 The Client

To open a file on HDFS, the client calls DistributedFileSystem.open(Path f, int bufferSize), implemented as:

public FSDataInputStream open(Path f, int bufferSize) throws IOException {
  return new DFSClient.DFSDataInputStream(
        dfs.open(getPathName(f), bufferSize, verifyChecksum, statistics));
}

Here dfs is DistributedFileSystem's member variable of type DFSClient. Its open function is invoked; it creates and returns a DFSInputStream(src, buffersize, verifyChecksum).

The DFSInputStream constructor calls openInfo, which asks the NameNode for the block information of the file being opened:

 

synchronized void openInfo() throws IOException {
  LocatedBlocks newInfo = callGetBlockLocations(namenode, src, 0, prefetchSize);
  this.locatedBlocks = newInfo;
  this.currentNode = null;
}

private static LocatedBlocks callGetBlockLocations(ClientProtocol namenode,
    String src, long start, long length) throws IOException {
    return namenode.getBlockLocations(src, start, length);
}

LocatedBlocks essentially wraps a List&lt;LocatedBlock&gt; blocks, where each LocatedBlock holds:

  • Block b: the block itself
  • long offset: the block's offset within the file
  • DatanodeInfo[] locs: the DataNodes on which the block is stored

namenode.getBlockLocations above is an RPC call, which ultimately reaches the getBlockLocations function of the NameNode class.

1.2 The NameNode

NameNode.getBlockLocations is implemented as follows:

public LocatedBlocks getBlockLocations(String src,
                                       long offset,
                                       long length) throws IOException {
  return namesystem.getBlockLocations(getClientMachine(),
                                      src, offset, length);
}

namesystem is a member variable of NameNode of type FSNamesystem; it holds the NameNode's namespace tree, and one of its important members is FSDirectory dir.

This FSDirectory has nothing to do with the FSDirectory in Lucene. It contains an FSImage fsImage, which reads and writes the fsimage file on disk; FSImage in turn has a member FSEditLog editLog for reading and writing the edits file. The relationship between these two files was explained in the previous article.

FSDirectory also has an important member INodeDirectoryWithQuota rootDir. The parent class of INodeDirectoryWithQuota is INodeDirectory:

public class INodeDirectory extends INode {
  ......
  private List<INode> children;
  ......
}

So an INodeDirectory is itself an INode that contains a list of INodes. An entry in that list is again an INodeDirectory if it is a directory, or an INodeFile if it is a file; INodeFile has a member BlockInfo blocks[], the block information of that file. This is clearly a tree structure.

FSNamesystem.getBlockLocations is as follows:

public LocatedBlocks getBlockLocations(String src, long offset, long length,
    boolean doAccessTime) throws IOException {
  final LocatedBlocks ret = getBlockLocationsInternal(src, dir.getFileINode(src),
      offset, length, Integer.MAX_VALUE, doAccessTime);
  return ret;
}

dir.getFileINode(src) resolves the path name in the file system tree to the INodeFile holding the INode information of the file being opened.

getBlockLocationsInternal is implemented as follows:

 

private synchronized LocatedBlocks getBlockLocationsInternal(String src,
                                                     INodeFile inode,
                                                     long offset,
                                                     long length,
                                                     int nrBlocksToReturn,
                                                     boolean doAccessTime)
                                                     throws IOException {
  // Get the block information of this file
  Block[] blocks = inode.getBlocks();
  List<LocatedBlock> results = new ArrayList<LocatedBlock>(blocks.length);
  // Find the blocks covered by the range starting at offset with the given length
  int curBlk = 0;
  long curPos = 0, blkSize = 0;
  int nrBlocks = (blocks[0].getNumBytes() == 0) ? 0 : blocks.length;
  for (curBlk = 0; curBlk < nrBlocks; curBlk++) {
    blkSize = blocks[curBlk].getNumBytes();
    if (curPos + blkSize > offset) {
      // When offset falls between curPos and curPos + blkSize,
      // curBlk points to the block containing offset
      break;
    }
    curPos += blkSize;
  }
  long endOff = offset + length;
  // Walk the blocks starting at curBlk until curPos passes endOff
  do {
    int numNodes = blocksMap.numNodes(blocks[curBlk]);
    int numCorruptNodes = countNodes(blocks[curBlk]).corruptReplicas();
    int numCorruptReplicas = corruptReplicas.numCorruptReplicas(blocks[curBlk]);
    boolean blockCorrupt = (numCorruptNodes == numNodes);
    int numMachineSet = blockCorrupt ? numNodes :
                          (numNodes - numCorruptNodes);
    // Find the DataNodes holding this block, putting the uncorrupted replicas
    // into machineSet
    DatanodeDescriptor[] machineSet = new DatanodeDescriptor[numMachineSet];
    if (numMachineSet > 0) {
      numNodes = 0;
      for (Iterator<DatanodeDescriptor> it =
          blocksMap.nodeIterator(blocks[curBlk]); it.hasNext();) {
        DatanodeDescriptor dn = it.next();
        boolean replicaCorrupt = corruptReplicas.isReplicaCorrupt(blocks[curBlk], dn);
        if (blockCorrupt || (!blockCorrupt && !replicaCorrupt))
          machineSet[numNodes++] = dn;
      }
    }
    // Build a LocatedBlock from this machineSet and the current block
    results.add(new LocatedBlock(blocks[curBlk], machineSet, curPos,
                blockCorrupt));
    curPos += blocks[curBlk].getNumBytes();
    curBlk++;
  } while (curPos < endOff
        && curBlk < blocks.length
        && results.size() < nrBlocksToReturn);
  // Wrap the list of LocatedBlock objects into a LocatedBlocks and return it
  return inode.createLocatedBlocks(results);
}

1.3 The Client

The LocatedBlocks object obtained from the NameNode over RPC becomes a member of the DFSInputStream, which is finally wrapped in an FSDataInputStream and handed back to the user.

 

2. Reading a File

2.1 The Client

To read the file, the client uses the FSDataInputStream.read(long position, byte[] buffer, int offset, int length) obtained when the file was opened.

FSDataInputStream delegates to the read(long position, byte[] buffer, int offset, int length) of the DFSInputStream it wraps:

 

public int read(long position, byte[] buffer, int offset, int length)
  throws IOException {
  long filelen = getFileLength();
  int realLen = length;
  if ((position + length) > filelen) {
    realLen = (int)(filelen - position);
  }
  // First get the list of blocks covering [position, position + length)
  // e.g. with 64 MB blocks, reading 128 MB starting at offset 100 MB
  // involves blocks 2, 3 and 4
  List<LocatedBlock> blockRange = getBlockRange(position, realLen);
  int remaining = realLen;
  // Read the required range from each block in turn
  // In the example above: from block 2, read 28 MB starting at 36 MB within the
  // block; read block 3 entirely (64 MB); from block 4, read 36 MB starting at 0;
  // 128 MB in total
  for (LocatedBlock blk : blockRange) {
    long targetStart = position - blk.getStartOffset();
    long bytesToRead = Math.min(remaining, blk.getBlockSize() - targetStart);
    fetchBlockByteRange(blk, targetStart,
                        targetStart + bytesToRead - 1, buffer, offset);
    remaining -= bytesToRead;
    position += bytesToRead;
    offset += bytesToRead;
  }
  assert remaining == 0 : "Wrong number of bytes read.";
  if (stats != null) {
    stats.incrementBytesRead(realLen);
  }
  return realLen;
}
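The block arithmetic in the loop above can be checked with a small standalone sketch (hypothetical class and method names, not Hadoop code): given a fixed block size, split a (position, length) read into per-block (index, start-in-block, bytes) segments, mirroring how targetStart and bytesToRead are computed. Note the sketch uses 0-based block indices, while the prose example counts blocks from 1.

```java
import java.util.ArrayList;
import java.util.List;

public class ReadRangeDemo {
    // One (blockIndex, startInBlock, length) segment of a ranged read.
    static long[] segment(long blk, long start, long len) {
        return new long[] { blk, start, len };
    }

    // Split a read of 'length' bytes at 'position' into per-block segments,
    // assuming every block is 'blockSize' bytes (a short last block is ignored
    // in this sketch).
    static List<long[]> splitRead(long position, long length, long blockSize) {
        List<long[]> segs = new ArrayList<long[]>();
        long remaining = length;
        long pos = position;
        while (remaining > 0) {
            long blk = pos / blockSize;                // block containing pos
            long startInBlock = pos - blk * blockSize; // targetStart in the text
            long bytesToRead = Math.min(remaining, blockSize - startInBlock);
            segs.add(segment(blk, startInBlock, bytesToRead));
            remaining -= bytesToRead;
            pos += bytesToRead;
        }
        return segs;
    }

    public static void main(String[] args) {
        long MB = 1024L * 1024;
        // The example from the text: 64 MB blocks, read 128 MB starting at 100 MB.
        for (long[] s : splitRead(100 * MB, 128 * MB, 64 * MB)) {
            System.out.println("block " + s[0] + ": start " + s[1] / MB
                + "M, len " + s[2] / MB + "M");
        }
    }
}
```

This reproduces the example: 28 MB from the second block, all 64 MB of the third, and 36 MB from the fourth.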

getBlockRange is implemented as follows:

 

private synchronized List<LocatedBlock> getBlockRange(long offset,
                                                      long length)
                                                    throws IOException {
  List<LocatedBlock> blockRange = new ArrayList<LocatedBlock>();
  // First look in the cached locatedBlocks for the position of the block
  // containing offset
  int blockIdx = locatedBlocks.findBlock(offset);
  if (blockIdx < 0) { // block is not cached
    blockIdx = LocatedBlocks.getInsertIndex(blockIdx);
  }
  long remaining = length;
  long curOff = offset;
  while (remaining > 0) {
    LocatedBlock blk = null;
    // Fetch the block at position blockIdx
    if (blockIdx < locatedBlocks.locatedBlockCount())
      blk = locatedBlocks.get(blockIdx);
    // If blk is null, the block is not cached: ask the NameNode for the
    // missing blocks and add them to the cache
    if (blk == null || curOff < blk.getStartOffset()) {
      LocatedBlocks newBlocks;
      newBlocks = callGetBlockLocations(namenode, src, curOff, remaining);
      locatedBlocks.insertRange(blockIdx, newBlocks.getLocatedBlocks());
      continue;
    }
    // The block was found: add it to the result list
    blockRange.add(blk);
    long bytesRead = blk.getStartOffset() + blk.getBlockSize() - curOff;
    remaining -= bytesRead;
    curOff += bytesRead;
    // Move on to the next block
    blockIdx++;
  }
  return blockRange;
}
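findBlock returns a negative value when the offset's block is not in the cache, using the same encoding as java.util.Collections.binarySearch: -(insertionPoint + 1), which getInsertIndex then decodes. The sketch below illustrates that convention on toy data (hypothetical names, not Hadoop code); a "hit" means the offset falls inside a block's range, not that it equals a block's start.

```java
public class BlockIndexDemo {
    // Binary search over block start offsets; a hit means
    // starts[mid] <= offset < starts[mid] + blockSize.
    // A miss returns -(insertionPoint + 1), like Collections.binarySearch.
    static int findBlock(long offset, long[] starts, long blockSize) {
        int lo = 0, hi = starts.length - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (offset < starts[mid]) hi = mid - 1;
            else if (offset >= starts[mid] + blockSize) lo = mid + 1;
            else return mid;
        }
        return -(lo + 1); // block not cached: encode the insertion point
    }

    // Decode the insertion point, as LocatedBlocks.getInsertIndex does.
    static int getInsertIndex(int binSearchResult) {
        return -(binSearchResult + 1);
    }

    public static void main(String[] args) {
        long[] starts = { 0, 64, 192 }; // 64 MB blocks; the block at 128 is not cached
        System.out.println(findBlock(100, starts, 64));                 // hit in block 1
        System.out.println(getInsertIndex(findBlock(130, starts, 64))); // miss: insert at 2
    }
}
```

The decoded insertion index is exactly where getBlockRange splices the freshly fetched blocks into the cache via insertRange.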

fetchBlockByteRange is implemented as follows:

 

private void fetchBlockByteRange(LocatedBlock block, long start,
                                 long end, byte[] buf, int offset) throws IOException {
  Socket dn = null;
  int numAttempts = block.getLocations().length;
  // This while loop retries after a failed read
  while (dn == null && numAttempts-- > 0) {
    // Pick a DataNode to read from
    DNAddrPair retval = chooseDataNode(block);
    DatanodeInfo chosenNode = retval.info;
    InetSocketAddress targetAddr = retval.addr;
    BlockReader reader = null;
    try {
      // Open a socket connection to the DataNode
      dn = socketFactory.createSocket();
      dn.connect(targetAddr, socketTimeout);
      dn.setSoTimeout(socketTimeout);
      int len = (int) (end - start + 1);
      // Over the established socket, create a reader responsible for
      // fetching the data from the DataNode
      reader = BlockReader.newBlockReader(dn, src,
                                          block.getBlock().getBlockId(),
                                          block.getBlock().getGenerationStamp(),
                                          start, len, buffersize,
                                          verifyChecksum, clientName);
      // Read the data
      int nread = reader.readAll(buf, offset, len);
      return;
    // (catch blocks for ChecksumException and IOException elided here)
    } finally {
      IOUtils.closeStream(reader);
      IOUtils.closeSocket(dn);
      dn = null;
    }
    // On failure, mark this DataNode as dead
    addToDeadNodes(chosenNode);
  }
}

BlockReader.newBlockReader is implemented as follows:

 

public static BlockReader newBlockReader( Socket sock, String file,
                                   long blockId,
                                   long genStamp,
                                   long startOffset, long len,
                                   int bufferSize, boolean verifyChecksum,
                                   String clientName)
                                   throws IOException {
  // Build an output stream on the socket and send the read request
  // to the DataNode
  DataOutputStream out = new DataOutputStream(
    new BufferedOutputStream(NetUtils.getOutputStream(sock, HdfsConstants.WRITE_TIMEOUT)));
  out.writeShort( DataTransferProtocol.DATA_TRANSFER_VERSION );
  out.write( DataTransferProtocol.OP_READ_BLOCK );
  out.writeLong( blockId );
  out.writeLong( genStamp );
  out.writeLong( startOffset );
  out.writeLong( len );
  Text.writeString(out, clientName);
  out.flush();
  // Build an input stream on the socket for reading the data back
  // from the DataNode
  DataInputStream in = new DataInputStream(
      new BufferedInputStream(NetUtils.getInputStream(sock),
                              bufferSize));
  DataChecksum checksum = DataChecksum.newDataChecksum( in );
  long firstChunkOffset = in.readLong();
  // Create a reader wrapping the input stream, which does the actual reading
  return new BlockReader( file, blockId, in, checksum, verifyChecksum,
                          startOffset, firstChunkOffset, sock );
}
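The request header written above can be exercised in memory, without sockets. The sketch below serializes the same fields with DataOutputStream and parses them back the way the DataNode side does. The version constant here is made up, the OP_READ_BLOCK value matches DataTransferProtocol's 81, and writeUTF stands in for Hadoop's Text.writeString (which actually writes a variable-length-encoded length followed by UTF-8 bytes).

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class ReadBlockHeaderDemo {
    static final int DATA_TRANSFER_VERSION = 14;  // hypothetical version number
    static final byte OP_READ_BLOCK = (byte) 81;  // opcode value in DataTransferProtocol

    // Serialize the request header the client sends in newBlockReader.
    static byte[] writeHeader(long blockId, long genStamp, long startOffset,
                              long len, String clientName) {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bos);
            out.writeShort(DATA_TRANSFER_VERSION);
            out.write(OP_READ_BLOCK);
            out.writeLong(blockId);
            out.writeLong(genStamp);
            out.writeLong(startOffset);
            out.writeLong(len);
            out.writeUTF(clientName); // stand-in for Text.writeString
            out.flush();
            return bos.toByteArray();
        } catch (IOException e) {
            throw new RuntimeException(e); // cannot happen on a byte array
        }
    }

    // Parse it back the way DataXceiver.readBlock does on the DataNode.
    static Object[] readHeader(byte[] bytes) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes));
            short version = in.readShort();
            byte op = in.readByte();
            long blockId = in.readLong();
            long genStamp = in.readLong();
            long startOffset = in.readLong();
            long len = in.readLong();
            String client = in.readUTF();
            return new Object[] { version, op, blockId, genStamp, startOffset, len, client };
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        Object[] f = readHeader(writeHeader(42L, 7L, 0L, 4096L, "DFSClient_demo"));
        System.out.println("op=" + f[1] + " blockId=" + f[2]
            + " len=" + f[5] + " client=" + f[6]);
    }
}
```

Because both sides agree on field order and widths, the DataNode can consume the header with mirror-image readShort/readByte/readLong calls, exactly as readBlock does later in this article.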

BlockReader's readAll function simply reads the data from the DataInputStream created above.

2.2 The DataNode

When a DataNode starts up, startDataNode is called; the part relevant to reading data is:

 

void startDataNode(Configuration conf,
                   AbstractList<File> dataDirs
                   ) throws IOException {
  ......
  // Create a ServerSocket and a DataXceiverServer to listen for client connections
  ServerSocket ss = (socketWriteTimeout > 0) ?
        ServerSocketChannel.open().socket() : new ServerSocket();
  Server.bind(ss, socAddr, 0);
  ss.setReceiveBufferSize(DEFAULT_DATA_SOCKET_SIZE);
  // adjust machine name with the actual port
  tmpPort = ss.getLocalPort();
  selfAddr = new InetSocketAddress(ss.getInetAddress().getHostAddress(),
                                   tmpPort);
  this.dnRegistration.setName(machineName + ":" + tmpPort);
  this.threadGroup = new ThreadGroup("dataXceiverServer");
  this.dataXceiverServer = new Daemon(threadGroup,
      new DataXceiverServer(ss, conf, this));
  this.threadGroup.setDaemon(true); // auto destroy when empty
  ......
}

DataXceiverServer.run() is as follows:

 

public void run() {
  while (datanode.shouldRun) {
      // Accept a client connection
      Socket s = ss.accept();
      s.setTcpNoDelay(true);
      // Spawn a DataXceiver thread to serve the new connection
      new Daemon(datanode.threadGroup,
          new DataXceiver(s, datanode, this)).start();
  }
  try {
    ss.close();
  } catch (IOException ie) {
    LOG.warn(datanode.dnRegistration + ":DataXceiveServer: "
                            + StringUtils.stringifyException(ie));
  }
}
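The structure here is the classic thread-per-connection server: one accept loop, one fresh worker thread per client. Reduced to its skeleton (hypothetical names, no Hadoop types; the "xceiver" merely echoes one opcode byte), it looks like this:

```java
import java.io.IOException;
import java.net.ServerSocket;
import java.net.Socket;

public class MiniXceiverServer {
    // An accept loop that hands each connection to a fresh thread,
    // mirroring DataXceiverServer spawning a DataXceiver per socket.
    static int roundTrip() {
        try (ServerSocket ss = new ServerSocket(0)) { // bind to any free port
            Thread acceptor = new Thread(() -> {
                try {
                    while (true) {
                        Socket s = ss.accept();             // accept a connection
                        new Thread(() -> serve(s)).start(); // one thread per connection
                    }
                } catch (IOException e) { /* server socket closed: leave the loop */ }
            });
            acceptor.setDaemon(true);
            acceptor.start();
            // Client side: send one "opcode" byte and read the reply.
            try (Socket c = new Socket("127.0.0.1", ss.getLocalPort())) {
                c.getOutputStream().write(81); // a pretend OP_READ_BLOCK
                return c.getInputStream().read();
            }
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    static void serve(Socket s) {
        try (Socket sock = s) {
            int op = sock.getInputStream().read(); // read the opcode
            sock.getOutputStream().write(op);      // reply with it
        } catch (IOException ignored) {}
    }

    public static void main(String[] args) {
        System.out.println("echoed opcode: " + roundTrip());
    }
}
```

The real server additionally caps the number of concurrent xceivers and registers each socket so it can be closed on shutdown; those details are omitted here.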

DataXceiver.run() is as follows:

 

public void run() {
  DataInputStream in = null;
  try {
    // Build an input stream to read the opcode sent by the client
    in = new DataInputStream(
        new BufferedInputStream(NetUtils.getInputStream(s),
                                SMALL_BUFFER_SIZE));
    short version = in.readShort();
    boolean local = s.getInetAddress().equals(s.getLocalAddress());
    byte op = in.readByte();
    // Make sure the xceiver count is not exceeded
    int curXceiverCount = datanode.getXceiverCount();
    long startTime = DataNode.now();
    switch ( op ) {
    // read
    case DataTransferProtocol.OP_READ_BLOCK:
      // the actual read
      readBlock( in );
      datanode.myMetrics.readBlockOp.inc(DataNode.now() - startTime);
      if (local)
        datanode.myMetrics.readsFromLocalClient.inc();
      else
        datanode.myMetrics.readsFromRemoteClient.inc();
      break;
    // write
    case DataTransferProtocol.OP_WRITE_BLOCK:
      // the actual write
      writeBlock( in );
      datanode.myMetrics.writeBlockOp.inc(DataNode.now() - startTime);
      if (local)
        datanode.myMetrics.writesFromLocalClient.inc();
      else
        datanode.myMetrics.writesFromRemoteClient.inc();
      break;
    // other opcodes
    ......
    }
  } catch (Throwable t) {
    LOG.error(datanode.dnRegistration + ":DataXceiver", t);
  } finally {
    IOUtils.closeStream(in);
    IOUtils.closeSocket(s);
    dataXceiverServer.childSockets.remove(s);
  }
}

 

private void readBlock(DataInputStream in) throws IOException {
  // Read the request fields
  long blockId = in.readLong();
  Block block = new Block( blockId, 0, in.readLong());
  long startOffset = in.readLong();
  long length = in.readLong();
  String clientName = Text.readString(in);
  // Build an output stream for sending data back to the client
  OutputStream baseStream = NetUtils.getOutputStream(s,
      datanode.socketWriteTimeout);
  DataOutputStream out = new DataOutputStream(
               new BufferedOutputStream(baseStream, SMALL_BUFFER_SIZE));
  // Create a BlockSender to read the local block data and send it to the
  // client; its member InputStream blockIn reads the local block
  BlockSender blockSender = null;
  try {
    blockSender = new BlockSender(block, startOffset, length,
            true, true, false, datanode, clientTraceFmt);
    out.writeShort(DataTransferProtocol.OP_STATUS_SUCCESS); // send op status
    // Stream the data to the client
    long read = blockSender.sendBlock(out, baseStream, null);
    ......
  } finally {
    IOUtils.closeStream(out);
    IOUtils.closeStream(blockSender);
  }
}

3. Writing a File

This section walks through uploading a file to HDFS.

3.1 The Client

Uploading a file to HDFS normally goes through DistributedFileSystem.create, implemented as:

 

  public FSDataOutputStream create(Path f, FsPermission permission,
    boolean overwrite,
    int bufferSize, short replication, long blockSize,
    Progressable progress) throws IOException {
    return new FSDataOutputStream
       (dfs.create(getPathName(f), permission,
                   overwrite, replication, blockSize, progress, bufferSize),
        statistics);
  }

This ultimately produces an FSDataOutputStream for writing data into the newly created file. The member variable dfs is of type DFSClient, whose create function is:

  public OutputStream create(String src,
                             FsPermission permission,
                             boolean overwrite,
                             short replication,
                             long blockSize,
                             Progressable progress,
                             int buffersize
                             ) throws IOException {
    checkOpen();
    if (permission == null) {
      permission = FsPermission.getDefault();
    }
    FsPermission masked = permission.applyUMask(FsPermission.getUMask(conf));
    OutputStream result = new DFSOutputStream(src, masked,
        overwrite, replication, blockSize, progress, buffersize,
        conf.getInt("io.bytes.per.checksum", 512));
    leasechecker.put(src, result);
    return result;
  }

This constructs a DFSOutputStream, whose constructor creates the file through an RPC call to the NameNode's create.
The constructor also does one other important thing: it calls streamer.start(), starting the pipeline that carries the data. We will examine it in detail when we get to writing the data.

  DFSOutputStream(String src, FsPermission masked, boolean overwrite,
      short replication, long blockSize, Progressable progress,
      int buffersize, int bytesPerChecksum) throws IOException {
    this(src, blockSize, progress, bytesPerChecksum);
    computePacketChunkSize(writePacketSize, bytesPerChecksum);
    try {
      namenode.create(
          src, masked, clientName, overwrite, replication, blockSize);
    } catch(RemoteException re) {
      throw re.unwrapRemoteException(AccessControlException.class,
                                     QuotaExceededException.class);
    }
    streamer.start();
  }

 

3.2 The NameNode

NameNode's create calls namesystem.startFile, which in turn calls startFileInternal:

  private synchronized void startFileInternal(String src,
                                              PermissionStatus permissions,
                                              String holder,
                                              String clientMachine,
                                              boolean overwrite,
                                              boolean append,
                                              short replication,
                                              long blockSize
                                              ) throws IOException {
    ......
   // Create a new file in the under-construction state, with no data blocks yet
   long genstamp = nextGenerationStamp();
   INodeFileUnderConstruction newNode = dir.addFile(src, permissions,
      replication, blockSize, holder, clientMachine, clientNode, genstamp);
    ......
  }

 

3.3 The Client

Now the client writes data into the newly created file, normally through FSDataOutputStream's write function, which ultimately calls DFSOutputStream's writeChunk.

By design, HDFS writes block data in a pipeline: the data is split into packets, and if it must be replicated three times, to DataNode 1, 2 and 3, the process goes like this:

  • First, packet 1 is written to DataNode 1
  • Then DataNode 1 forwards packet 1 to DataNode 2, while the client writes packet 2 to DataNode 1
  • Then DataNode 2 forwards packet 1 to DataNode 3, while the client writes packet 3 to DataNode 1 and DataNode 1 forwards packet 2 to DataNode 2
  • The packets move down the line like this until all data has been written and fully replicated
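The benefit of this schedule is easy to quantify. If pushing one packet over one hop costs one time step, the pipeline overlaps the hops: packet p reaches node n at step p + n (counting from 0), so P packets reach all N replicas after P + N - 1 steps, instead of the P * N steps a write-each-packet-to-every-node-in-turn scheme would take. A back-of-the-envelope sketch (not Hadoop code):

```java
public class PipelineDemo {
    // Pipelined: at step t the client pushes packet t to node 0, while each
    // node i forwards the packet it got at step t-1 to node i+1. The last
    // packet reaches the last node at step (P - 1) + (N - 1), so the total
    // is P + N - 1 steps.
    static int pipelinedSteps(int packets, int nodes) {
        return packets + nodes - 1;
    }

    // Naive alternative: write each packet to every node in turn.
    static int sequentialSteps(int packets, int nodes) {
        return packets * nodes;
    }

    public static void main(String[] args) {
        System.out.println(pipelinedSteps(100, 3));  // pipeline: 102 steps
        System.out.println(sequentialSteps(100, 3)); // sequential: 300 steps
    }
}
```

For long streams (P much larger than N) the pipeline is close to N times faster, which is why the replication factor adds almost no latency to a large write.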

  protected synchronized void writeChunk(byte[] b, int offset, int len, byte[] checksum)
                                                        throws IOException {
      // Create a packet and write the data into it
      currentPacket = new Packet(packetSize, chunksPerPacket,
                                   bytesCurBlock);
      currentPacket.writeChecksum(checksum, 0, cklen);
      currentPacket.writeData(b, offset, len);
      currentPacket.numChunks++;
      bytesCurBlock += len;
      // If the packet is full, queue it for sending
      if (currentPacket.numChunks == currentPacket.maxChunks ||
          bytesCurBlock == blockSize) {
          ......
          dataQueue.addLast(currentPacket);
          // Wake up the thread waiting on dataQueue, i.e. the DataStreamer
          dataQueue.notifyAll();
          currentPacket = null;
          ......
      }
  }


DataStreamer's run function is as follows:

  public void run() {
    while (!closed && clientRunning) {
      Packet one = null;
      synchronized (dataQueue) {
        // Wait while the queue holds no packet
        while ((!closed && !hasError && clientRunning
               && dataQueue.size() == 0) || doSleep) {
          try {
            dataQueue.wait(1000);
          } catch (InterruptedException  e) {
          }
          doSleep = false;
        }
        try {
          // Take the first packet in the queue
          one = dataQueue.getFirst();
          long offsetInBlock = one.offsetInBlock;
          // Have the NameNode allocate a block and open an output stream to it
          if (blockStream == null) {
            nodes = nextBlockOutputStream(src);
            response = new ResponseProcessor(nodes);
            response.start();
          }
          ByteBuffer buf = one.getBuffer();
          // Move the packet from dataQueue to ackQueue to await acknowledgment
          dataQueue.removeFirst();
          dataQueue.notifyAll();
          synchronized (ackQueue) {
            ackQueue.addLast(one);
            ackQueue.notifyAll();
          }
          // Write the packet into the DataNode's block through the output stream
          blockStream.write(buf.array(), buf.position(), buf.remaining());
          if (one.lastPacketInBlock) {
            blockStream.writeInt(0); // marks the end of this block
          }
          blockStream.flush();
        } catch (Throwable e) {
        }
      }
      ......
    }
  }
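The wait/notifyAll choreography between writeChunk and DataStreamer is a classic producer-consumer pattern. Stripped of the Hadoop types, it can be sketched like this (hypothetical names; "sending" a packet just records it in a string):

```java
import java.util.LinkedList;

public class StreamerDemo {
    private final LinkedList<Integer> dataQueue = new LinkedList<Integer>();
    private final LinkedList<Integer> ackQueue = new LinkedList<Integer>();
    private volatile boolean closed = false;
    private final StringBuilder sent = new StringBuilder();

    // writeChunk side: queue a full packet and wake the streamer.
    void enqueue(int packet) {
        synchronized (dataQueue) {
            dataQueue.addLast(packet);
            dataQueue.notifyAll();
        }
    }

    // DataStreamer side: wait for a packet, move it to ackQueue, then "send" it.
    void runStreamer() {
        while (true) {
            int one;
            synchronized (dataQueue) {
                while (!closed && dataQueue.isEmpty()) {
                    try { dataQueue.wait(1000); } catch (InterruptedException e) {}
                }
                if (dataQueue.isEmpty()) return; // closed and fully drained
                one = dataQueue.removeFirst();
                synchronized (ackQueue) { ackQueue.addLast(one); }
            }
            sent.append(one).append(' '); // stands in for blockStream.write
        }
    }

    // Drive it: start the streamer, enqueue three packets, shut down, join.
    String demo() {
        try {
            Thread streamer = new Thread(this::runStreamer);
            streamer.start();
            for (int p = 1; p <= 3; p++) enqueue(p);
            synchronized (dataQueue) { closed = true; dataQueue.notifyAll(); }
            streamer.join();
            return sent.toString().trim();
        } catch (InterruptedException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(new StreamerDemo().demo());
    }
}
```

In the real DFSOutputStream a separate ResponseProcessor thread drains ackQueue as acks arrive from the pipeline; on an error the packets still in ackQueue are pushed back onto dataQueue for resending, which is why they are kept until acknowledged rather than discarded on write.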

 

One important function here is nextBlockOutputStream:

  private DatanodeInfo[] nextBlockOutputStream(String client) throws IOException {
    LocatedBlock lb = null;
    boolean retry = false;
    DatanodeInfo[] nodes;
    int count = conf.getInt("dfs.client.block.write.retries", 3);
    boolean success;
    do {
      ......
      // Have the NameNode allocate DataNodes and a block for the file
      lb = locateFollowingBlock(startTime);
      block = lb.getBlock();
      nodes = lb.getLocations();
      // Open the output stream to the DataNodes
      success = createBlockOutputStream(nodes, clientName, false);
      ......
    } while (retry && --count >= 0);
    return nodes;
  }

 

locateFollowingBlock issues the RPC call namenode.addBlock(src, clientName).

 

3.4 The NameNode

NameNode's addBlock is implemented as follows:

  public LocatedBlock addBlock(String src,
                               String clientName) throws IOException {
    LocatedBlock locatedBlock = namesystem.getAdditionalBlock(src, clientName);
    return locatedBlock;
  }

FSNamesystem's getAdditionalBlock is implemented as follows:

  public LocatedBlock getAdditionalBlock(String src,
                                         String clientName
                                         ) throws IOException {
    long fileLength, blockSize;
    int replication;
    DatanodeDescriptor clientNode = null;
    Block newBlock = null;
    ......
    // Choose DataNodes for the new block
    DatanodeDescriptor targets[] = replicator.chooseTarget(replication,
                                                           clientNode,
                                                           null,
                                                           blockSize);
    ......
    // Get the INodes of every path component; the last one is the INode of
    // the newly added file, in the under-construction state
    INode[] pathINodes = dir.getExistingPathINodes(src);
    int inodesLen = pathINodes.length;
    INodeFileUnderConstruction pendingFile  = (INodeFileUnderConstruction)
                                                pathINodes[inodesLen - 1];
    // Allocate a block for the file and record which DataNodes it goes to
    newBlock = allocateBlock(src, pathINodes);
    pendingFile.setTargets(targets);
    ......
    return new LocatedBlock(newBlock, targets, fileLength);
  }

 

3.5 The Client

With the DataNodes and the block allocated, createBlockOutputStream sets up the write.

  private boolean createBlockOutputStream(DatanodeInfo[] nodes, String client,
                  boolean recoveryFlag) {
      // Create a socket and connect to the first DataNode
      InetSocketAddress target = NetUtils.createSocketAddr(nodes[0].getName());
      s = socketFactory.createSocket();
      int timeoutValue = 3000 * nodes.length + socketTimeout;
      s.connect(target, timeoutValue);
      s.setSoTimeout(timeoutValue);
      s.setSendBufferSize(DEFAULT_DATA_SOCKET_SIZE);
      long writeTimeout = HdfsConstants.WRITE_TIMEOUT_EXTENSION * nodes.length +
                          datanodeWriteTimeout;
      DataOutputStream out = new DataOutputStream(
          new BufferedOutputStream(NetUtils.getOutputStream(s, writeTimeout),
                                   DataNode.SMALL_BUFFER_SIZE));
      blockReplyStream = new DataInputStream(NetUtils.getInputStream(s));
      // Send the write request
      out.writeShort( DataTransferProtocol.DATA_TRANSFER_VERSION );
      out.write( DataTransferProtocol.OP_WRITE_BLOCK );
      out.writeLong( block.getBlockId() );
      out.writeLong( block.getGenerationStamp() );
      out.writeInt( nodes.length );
      out.writeBoolean( recoveryFlag );
      Text.writeString( out, client );
      out.writeBoolean(false);
      out.writeInt( nodes.length - 1 );
      // Note that this loop starts at 1, not 0: the information about the
      // other two DataNodes is sent to the first DataNode, which uses it to
      // forward the data to them
      for (int i = 1; i < nodes.length; i++) {
        nodes[i].write(out);
      }
      checksum.writeHeader( out );
      out.flush();
      firstBadLink = Text.readString(blockReplyStream);
      if (firstBadLink.length() != 0) {
        throw new IOException("Bad connect ack with firstBadLink " + firstBadLink);
      }
      blockStream = out;
  }

 

After creating the output stream in DataStreamer's run function, the client calls blockStream.write to push the data to the DataNode.

 

3.6 The DataNode

When the DataNode's DataXceiver receives the opcode DataTransferProtocol.OP_WRITE_BLOCK, it calls writeBlock:

  private void writeBlock(DataInputStream in) throws IOException {
    DatanodeInfo srcDataNode = null;
    // Read the request header
    Block block = new Block(in.readLong(),
        dataXceiverServer.estimateBlockSize, in.readLong());
    int pipelineSize = in.readInt(); // num of datanodes in entire pipeline
    boolean isRecovery = in.readBoolean(); // is this part of recovery?
    String client = Text.readString(in); // working on behalf of this client
    boolean hasSrcDataNode = in.readBoolean(); // is src node info present
    if (hasSrcDataNode) {
      srcDataNode = new DatanodeInfo();
      srcDataNode.readFields(in);
    }
    int numTargets = in.readInt();
    if (numTargets < 0) {
      throw new IOException("Mislabelled incoming datastream.");
    }
    // Read the list of remaining DataNodes: on the first DataNode this holds
    // the second and third DataNodes; on the second, only the third
    DatanodeInfo targets[] = new DatanodeInfo[numTargets];
    for (int i = 0; i < targets.length; i++) {
      DatanodeInfo tmp = new DatanodeInfo();
      tmp.readFields(in);
      targets[i] = tmp;
    }
    DataOutputStream mirrorOut = null;  // stream to next target
    DataInputStream mirrorIn = null;    // reply from next target
    DataOutputStream replyOut = null;   // stream to prev target
    Socket mirrorSock = null;           // socket to next target
    BlockReceiver blockReceiver = null; // responsible for data handling
    String mirrorNode = null;           // the name:port of next target
    String firstBadLink = "";           // first datanode that failed in connection setup
    try {
      // Create a BlockReceiver; its DataInputStream in reads data from the
      // client or the previous DataNode, its DataOutputStream mirrorOut
      // forwards data to the next DataNode, and its OutputStream out writes
      // the data to local storage
      blockReceiver = new BlockReceiver(block, in,
          s.getRemoteSocketAddress().toString(),
          s.getLocalSocketAddress().toString(),
          isRecovery, client, srcDataNode, datanode);
      // get a connection back to the previous target
      replyOut = new DataOutputStream(
                     NetUtils.getOutputStream(s, datanode.socketWriteTimeout));
      // Unless this is the last DataNode, open a socket to the next one
      if (targets.length > 0) {
        InetSocketAddress mirrorTarget = null;
        // Connect to backup machine
        mirrorNode = targets[0].getName();
        mirrorTarget = NetUtils.createSocketAddr(mirrorNode);
        mirrorSock = datanode.newSocket();
        int timeoutValue = numTargets * datanode.socketTimeout;
        int writeTimeout = datanode.socketWriteTimeout +
                             (HdfsConstants.WRITE_TIMEOUT_EXTENSION * numTargets);
        mirrorSock.connect(mirrorTarget, timeoutValue);
        mirrorSock.setSoTimeout(timeoutValue);
        mirrorSock.setSendBufferSize(DEFAULT_DATA_SOCKET_SIZE);
        // Create the output stream to the next DataNode
        mirrorOut = new DataOutputStream(
             new BufferedOutputStream(
                         NetUtils.getOutputStream(mirrorSock, writeTimeout),
                         SMALL_BUFFER_SIZE));
        mirrorIn = new DataInputStream(NetUtils.getInputStream(mirrorSock));
        mirrorOut.writeShort( DataTransferProtocol.DATA_TRANSFER_VERSION );
        mirrorOut.write( DataTransferProtocol.OP_WRITE_BLOCK );
        mirrorOut.writeLong( block.getBlockId() );
        mirrorOut.writeLong( block.getGenerationStamp() );
        mirrorOut.writeInt( pipelineSize );
        mirrorOut.writeBoolean( isRecovery );
        Text.writeString( mirrorOut, client );
        mirrorOut.writeBoolean(hasSrcDataNode);
        if (hasSrcDataNode) { // pass src node information
          srcDataNode.write(mirrorOut);
        }
        mirrorOut.writeInt( targets.length - 1 );
        // Again the loop starts at 1: the DataNodes after the next one are
        // forwarded on to the next DataNode
        for ( int i = 1; i < targets.length; i++ ) {
          targets[i].write( mirrorOut );
        }
        blockReceiver.writeChecksumHeader(mirrorOut);
        mirrorOut.flush();
      }
      // Receive the block with the BlockReceiver
      String mirrorAddr = (mirrorSock == null) ? null : mirrorNode;
      blockReceiver.receiveBlock(mirrorOut, mirrorIn, replyOut,
                                 mirrorAddr, null, targets.length);
      ......
    } finally {
      // close all opened streams
      IOUtils.closeStream(mirrorOut);
      IOUtils.closeStream(mirrorIn);
      IOUtils.closeStream(replyOut);
      IOUtils.closeSocket(mirrorSock);
      IOUtils.closeStream(blockReceiver);
    }
  }

 

A key piece of logic in BlockReceiver's receiveBlock function:

  void receiveBlock(
      DataOutputStream mirrOut, // output to next datanode
      DataInputStream mirrIn,   // input from next datanode
      DataOutputStream replyOut,  // output to previous datanode
      String mirrAddr, BlockTransferThrottler throttlerArg,
      int numTargets) throws IOException {
      ......
      // Keep receiving packets until the end of the block
      while (receivePacket() > 0) {}
      if (mirrorOut != null) {
        try {
          mirrorOut.writeInt(0); // mark the end of the block
          mirrorOut.flush();
        } catch (IOException e) {
          handleMirrorOutError(e);
        }
      }
      ......
  }

 

BlockReceiver's receivePacket is as follows:

  private int receivePacket() throws IOException {
    // Receive one packet from the client or the previous node
    int payloadLen = readNextPacket();
    buf.mark();
    //read the header
    buf.getInt(); // packet length
    offsetInBlock = buf.getLong(); // get offset of packet in block
    long seqno = buf.getLong();    // get seqno
    boolean lastPacketInBlock = (buf.get() != 0);
    int endOfHeader = buf.position();
    buf.reset();
    setBlockPosition(offsetInBlock);
    // Forward the packet to the next DataNode
    if (mirrorOut != null) {
      try {
        mirrorOut.write(buf.array(), buf.position(), buf.remaining());
        mirrorOut.flush();
      } catch (IOException e) {
        handleMirrorOutError(e);
      }
    }
    buf.position(endOfHeader);
    int len = buf.getInt();
    offsetInBlock += len;
    int checksumLen = ((len + bytesPerChecksum - 1)/bytesPerChecksum)*
                                                            checksumSize;
    int checksumOff = buf.position();
    int dataOff = checksumOff + checksumLen;
    byte pktBuf[] = buf.array();
    buf.position(buf.limit()); // move to the end of the data.
    ......
    // Write the data into the local block
    out.write(pktBuf, dataOff, len);
    /// flush entire packet before sending ack
    flush();
    // put in queue for pending acks
    if (responder != null) {
      ((PacketResponder)responder.getRunnable()).enqueue(seqno,
                                      lastPacketInBlock);
    }
    return payloadLen;
  }
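The checksumLen computation above is a ceiling division: one checksumSize-byte checksum per bytesPerChecksum bytes of data, with a partial final chunk still getting a full checksum. With the defaults seen earlier (io.bytes.per.checksum = 512, and a 4-byte CRC32):

```java
public class ChecksumLenDemo {
    // receivePacket's formula: one checksumSize-byte checksum per
    // bytesPerChecksum-byte chunk of data, rounded up.
    static int checksumLen(int len, int bytesPerChecksum, int checksumSize) {
        return ((len + bytesPerChecksum - 1) / bytesPerChecksum) * checksumSize;
    }

    public static void main(String[] args) {
        System.out.println(checksumLen(65536, 512, 4)); // 128 chunks -> 512 bytes
        System.out.println(checksumLen(1000, 512, 4));  //   2 chunks ->   8 bytes
        System.out.println(checksumLen(1, 512, 4));     //   1 chunk  ->   4 bytes
    }
}
```

This is also why the packet layout puts all checksums before the data: the receiver can compute checksumOff and dataOff from len alone, without a separate length field for the checksum region.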







