HBASE-6435

Reading WAL files after a recovery leads to time lost in HDFS timeouts when using dead datanodes


Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.95.2
    • Fix Version/s: 0.95.0
    • Component/s: master, regionserver
    • Labels: None
    • Hadoop Flags: Reviewed
    • Release Note:
      This JIRA adds a hook in the HDFS client to reorder the replica locations for HLog files. The default ordering in HDFS is rack-aware + random. When reading an HLog file, we prefer not to use the replica on the same server as the region server that wrote the HLog: this server is likely to be unavailable, and that would delay the HBase recovery by one minute. This occurs because recovery starts sooner in HBase than in HDFS: 3 minutes by default in HBase vs. 10:30 minutes in HDFS. This will be changed in HDFS-3703. Moreover, when an HDFS file is still open for writing, a read triggers another call to get the file size, leading to another timeout (see HDFS-3704), but also to a wrong file size value (see HDFS-3701 and HBASE-6401). Technically:
      - this hook will no longer be useful once HDFS-3702, HDFS-3705 or HDFS-3706 is available and used in HBase.
      - the hook intercepts the calls to the namenode and reorders the locations it returns, extracting the region server name from the HLog file path. This server is put at the end of the list, ensuring it will be tried only if all the others fail (see the illustrative sketch after this Details section).
      - it has been tested with HDFS 1.0.3 and HDFS 2.0 alpha.
      - it can be deactivated (at master & region server start-up) by setting "hbase.filesystem.reorder.blocks" to false in the HBase configuration.
    • Tags: 0.96notable
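
      A minimal, self-contained sketch of the reordering idea described in the release note above. It is illustrative only, not the actual patch: the WAL path layout ("/hbase/.logs/<host>,<port>,<startcode>/...") and all class and method names below are assumptions made for the example.

      import java.util.ArrayList;
      import java.util.Arrays;
      import java.util.List;

      public class WalLocationReorderSketch {

        /** Extracts the writing region server's hostname from a WAL path, or null. */
        static String serverHostFromWalPath(String path) {
          int i = path.indexOf("/.logs/");
          if (i < 0) {
            return null;
          }
          String dir = path.substring(i + "/.logs/".length());
          int slash = dir.indexOf('/');
          if (slash >= 0) {
            dir = dir.substring(0, slash);
          }
          // The directory name is assumed to be "<host>,<port>,<startcode>".
          int comma = dir.indexOf(',');
          return comma > 0 ? dir.substring(0, comma) : null;
        }

        /** Moves every replica location hosted on likelyDeadHost to the end of the list. */
        static List<String> reorder(List<String> locationHosts, String likelyDeadHost) {
          List<String> preferred = new ArrayList<>();
          List<String> demoted = new ArrayList<>();
          for (String host : locationHosts) {
            (host.equals(likelyDeadHost) ? demoted : preferred).add(host);
          }
          preferred.addAll(demoted); // the suspect replica is tried only if all others fail
          return preferred;
        }

        public static void main(String[] args) {
          String wal = "/hbase/.logs/rs1.example.com,60020,1342343843939/rs1.example.com%2C60020%2C1342343843939.1342343848000";
          String dead = serverHostFromWalPath(wal);
          System.out.println(reorder(
              Arrays.asList("rs1.example.com", "rs2.example.com", "rs3.example.com"), dead));
          // prints [rs2.example.com, rs3.example.com, rs1.example.com]
        }
      }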

    Description

      HBase writes a Write-Ahead Log (WAL) to recover from hardware failures. This log is written on HDFS.
      Through ZooKeeper, HBase is usually informed within 30s that it should start the recovery process.
      This means reading the Write-Ahead Log to replay the edits on the other servers.

      In standard deployments, the HBase processes (regionservers) are deployed on the same boxes as the datanodes.

      It means that when the box stops, we have actually lost one of the replicas of the WAL, as we lost both the regionserver and the datanode.

      As HDFS marks a node as dead only after ~10 minutes, the dead node still appears available when we try to read the blocks to recover. As a consequence, we delay the recovery process by 60 seconds, because the read will usually fail with a socket timeout. If the file is still open for writing, this adds an extra 20s, plus a risk of losing edits if we connect with IPC to the dead DN.
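
      For reference, assuming the default settings of that era, the figures above line up as follows: the namenode only declares a datanode dead after 2 × its heartbeat recheck interval (2 × 5 min) plus 10 × the 3 s heartbeat interval, i.e. about 10 min 30 s, while each read attempt against the still-listed dead datanode only fails after the DFS client's default 60 s socket timeout before the next replica is tried.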

      Possible solutions are:

      • shorter dead-datanode detection by the NN. Requires an NN code change.
      • better dead-datanode management in the DFSClient. Requires a DFS code change.
      • NN customisation to write the WAL files on another DN instead of the local one.
      • reordering the block locations returned by the NN on the client side, so that the blocks on the same DN as the dead RS go to the end of the priority queue. Requires a DFS code change or some kind of workaround.

      The solution retained is the last one. Compared to what was discussed on the mailing list, the proposed patch does not modify the HDFS source code but adds a proxy (a sketch of this idea follows the description), for two reasons:

      • Some HDFS functions managing block ordering are static (MD5MD5CRC32FileChecksum). Implementing the hook in the DFSClient would require either implementing the fix only partially, changing the DFS interface to make these functions non-static, or making the hook itself static. None of these solutions is very clean.
      • Adding a proxy allows all the code to live in HBase, which simplifies dependency management.

      Nevertheless, it would be better to have this in HDFS itself. An HDFS-side fix, however, could only target the latest version, where minimal interface changes (such as making these methods non-static) would be possible.

      Moreover, writing the WAL blocks to a non-local DN would be an even better long-term solution.
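
      To make the proxy idea above concrete, here is a minimal, self-contained sketch. It is deliberately not tied to the real DFSClient or namenode protocol types: the BlockLocator interface and every name below are illustrative assumptions, and the likely-dead host is passed in explicitly where the real hook would derive it from the WAL path as sketched earlier.

      import java.lang.reflect.InvocationHandler;
      import java.lang.reflect.Proxy;
      import java.util.ArrayList;
      import java.util.Arrays;
      import java.util.List;

      public class NamenodeProxySketch {

        /** Illustrative stand-in for the client-to-namenode "locate blocks" call. */
        public interface BlockLocator {
          List<String> getBlockLocations(String path);
        }

        /** Wraps a locator so the locations it returns are reordered after the call. */
        static BlockLocator withReordering(BlockLocator delegate, String likelyDeadHost) {
          InvocationHandler handler = (proxy, method, args) -> {
            Object result = method.invoke(delegate, args);
            // Hook point: only rewrite the answer of the location call; every
            // other method of the interface is passed through untouched.
            if ("getBlockLocations".equals(method.getName()) && result instanceof List) {
              @SuppressWarnings("unchecked")
              List<String> hosts = new ArrayList<>((List<String>) result);
              if (hosts.remove(likelyDeadHost)) {
                hosts.add(likelyDeadHost); // tried only if all other replicas fail
              }
              return hosts;
            }
            return result;
          };
          return (BlockLocator) Proxy.newProxyInstance(
              BlockLocator.class.getClassLoader(),
              new Class<?>[] { BlockLocator.class }, handler);
        }

        public static void main(String[] args) {
          BlockLocator plain = path -> Arrays.asList("rs1", "rs2", "rs3");
          BlockLocator reordering = withReordering(plain, "rs1");
          System.out.println(reordering.getBlockLocations("/hbase/.logs/rs1,60020,1/x"));
          // prints [rs2, rs3, rs1]
        }
      }

      As argued in the description, the value of the proxy is that the reordering lives entirely on the HBase side; no HDFS source change is needed.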

      Attachments

        1. 6435.unfinished.patch (14 kB, Nicolas Liochon)
        2. 6435.v10.patch (34 kB, Nicolas Liochon)
        3. 6435.v10.patch (34 kB, Nicolas Liochon)
        4. 6435.v12.patch (39 kB, Nicolas Liochon)
        5. 6435.v12.patch (39 kB, Nicolas Liochon)
        6. 6435.v12.patch (39 kB, Nicolas Liochon)
        7. 6435.v13.patch (39 kB, Nicolas Liochon)
        8. 6435.v14.patch (39 kB, Nicolas Liochon)
        9. 6435.v2.patch (31 kB, Nicolas Liochon)
        10. 6435.v7.patch (33 kB, Nicolas Liochon)
        11. 6435.v8.patch (34 kB, Nicolas Liochon)
        12. 6435.v9.patch (34 kB, Nicolas Liochon)
        13. 6435.v9.patch (34 kB, Nicolas Liochon)
        14. 6435-v12.txt (39 kB, Ted Yu)
        15. 6535.v11.patch (36 kB, Nicolas Liochon)


          People

            Assignee: Nicolas Liochon (nkeywal)
            Reporter: Nicolas Liochon (nkeywal)
            Votes: 0
            Watchers: 15

            Dates

              Created:
              Updated:
              Resolved:
