HBASE-21544

Backport HBASE-20734 Colocate recovered edits directory with hbase.wal.dir


    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.0.4
    • Component/s: wal
    • Labels: None
    • Hadoop Flags:
      Reviewed
    • Release Note:
      This change moves the recovered.edits files which are created by the WALSplitter from the default filesystem into the WAL filesystem. This better enables the separate filesystem for WAL and HFile deployment model, by avoiding a check which requires that the HFile filesystem provides the hflush capability.
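The effect of the change described in the release note can be sketched as a path-resolution difference. This is a minimal, self-contained illustration, not HBase's actual layout code: the directory structure and root values below are hypothetical stand-ins for `hbase.rootdir` and `hbase.wal.dir`.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

public class RecoveredEditsDirSketch {
    // Hypothetical example roots for a split deployment.
    static final Path HFILE_ROOT = Paths.get("/hbase");      // hbase.rootdir (e.g. WASB/ABFS)
    static final Path WAL_ROOT   = Paths.get("/hbase-wal");  // hbase.wal.dir (hflush-capable)

    /** Before the change: recovered.edits resolved under the HFile root. */
    static Path before(String regionDir) {
        return HFILE_ROOT.resolve(regionDir).resolve("recovered.edits");
    }

    /** After the change: recovered.edits is colocated with hbase.wal.dir. */
    static Path after(String regionDir) {
        return WAL_ROOT.resolve(regionDir).resolve("recovered.edits");
    }

    public static void main(String[] args) {
        System.out.println(before("data/default/t1/r1"));
        System.out.println(after("data/default/t1/r1"));
    }
}
```

Because the WAL filesystem already has to support hflush, writers created under the new location pass the capability check that failed under the HFile root.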

      Description

      Been talking through this with a bunch of folks. Enis Soztutar brought me back from the cliff of despair though.

      Context: running HBase on top of a filesystem that doesn't have hflush for hfiles. In our case, on top of Azure's Hadoop-compatible filesystems (WASB, ABFS).
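The split deployment the description refers to might look like this in hbase-site.xml. `hbase.rootdir` and `hbase.wal.dir` are the real property names; the values are illustrative only:

```xml
<!-- Illustrative values: HFiles on an object store, WALs on an hflush-capable FS -->
<property>
  <name>hbase.rootdir</name>
  <value>wasb://container@account.blob.core.windows.net/hbase</value>
</property>
<property>
  <name>hbase.wal.dir</name>
  <value>hdfs://namenode:8020/hbase-wal</value>
</property>
```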

      When a RS fails and we have an SCP running for it, you'll see log splitting get into an "infinite" loop where the master keeps resubmitting and the RS which takes the action deterministically fails with the following:

      2018-11-26 20:59:18,415 ERROR [RS_LOG_REPLAY_OPS-regionserver/wn2-b831f9:16020-0-Writer-2] wal.FSHLogProvider: The RegionServer write ahead log provider for FileSystem implementations relies on the ability to call hflush for proper operation during component failures, but the current FileSystem does not support doing so. Please check the config value of 'hbase.wal.dir' and ensure it points to a FileSystem mount that has suitable capabilities for output streams.
      2018-11-26 20:59:18,415 WARN  [RS_LOG_REPLAY_OPS-regionserver/wn2-b831f9:16020-0-Writer-2] wal.AbstractProtobufLogWriter: WALTrailer is null. Continuing with default.
      2018-11-26 20:59:18,467 ERROR [RS_LOG_REPLAY_OPS-regionserver/wn2-b831f9:16020-0-Writer-2] wal.WALSplitter: Got while writing log entry to log
      java.io.IOException: cannot get log writer
              at org.apache.hadoop.hbase.wal.FSHLogProvider.createWriter(FSHLogProvider.java:96)
              at org.apache.hadoop.hbase.wal.FSHLogProvider.createWriter(FSHLogProvider.java:61)
              at org.apache.hadoop.hbase.wal.WALFactory.createRecoveredEditsWriter(WALFactory.java:370)
              at org.apache.hadoop.hbase.wal.WALSplitter.createWriter(WALSplitter.java:804)
              at org.apache.hadoop.hbase.wal.WALSplitter$LogRecoveredEditsOutputSink.createWAP(WALSplitter.java:1530)
              at org.apache.hadoop.hbase.wal.WALSplitter$LogRecoveredEditsOutputSink.getWriterAndPath(WALSplitter.java:1501)
              at org.apache.hadoop.hbase.wal.WALSplitter$LogRecoveredEditsOutputSink.appendBuffer(WALSplitter.java:1584)
              at org.apache.hadoop.hbase.wal.WALSplitter$LogRecoveredEditsOutputSink.append(WALSplitter.java:1566)
              at org.apache.hadoop.hbase.wal.WALSplitter$WriterThread.writeBuffer(WALSplitter.java:1090)
              at org.apache.hadoop.hbase.wal.WALSplitter$WriterThread.doRun(WALSplitter.java:1082)
              at org.apache.hadoop.hbase.wal.WALSplitter$WriterThread.run(WALSplitter.java:1052)
      Caused by: org.apache.hadoop.hbase.util.CommonFSUtils$StreamLacksCapabilityException: hflush
              at org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.initOutput(ProtobufLogWriter.java:99)
              at org.apache.hadoop.hbase.regionserver.wal.AbstractProtobufLogWriter.init(AbstractProtobufLogWriter.java:165)
              at org.apache.hadoop.hbase.wal.FSHLogProvider.createWriter(FSHLogProvider.java:77)
              ... 10 more

      This is the sanity check added by HBASE-18784, failing on creating the writer for the recovered.edits file.

      The odd-ball here is that our recovered.edits writer is just a WAL writer class. The WAL writer class assumes it always needs hflush support; however, we don't actually need that for writing out the recovered.edits files. If close() on a recovered.edits file fails, we'll trash any intermediate data in the filesystem and rerun the whole process.

      It's my understanding that this check is overbearing, and that we should skip it when the ProtobufLogWriter is being used for a recovered.edits file.
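The argument above can be sketched as follows. This is a hypothetical, self-contained model, not HBase code: `CapStream` stands in for Hadoop's StreamCapabilities interface, and `initWriter` models the check in ProtobufLogWriter.initOutput (which, in the real code, throws CommonFSUtils.StreamLacksCapabilityException).

```java
import java.io.IOException;

public class HflushCheckSketch {
    /** Stand-in for Hadoop's StreamCapabilities interface (illustrative). */
    interface CapStream {
        boolean hasCapability(String capability);
    }

    /** Models a WASB/ABFS-style output stream that cannot hflush. */
    static class NoHflushStream implements CapStream {
        @Override
        public boolean hasCapability(String capability) {
            return !"hflush".equals(capability);
        }
    }

    /**
     * Models the writer init: a live WAL writer must have hflush, but a
     * recovered.edits writer could safely skip the check, since a failed
     * close just means the split is rerun from scratch.
     */
    static void initWriter(CapStream out, boolean forRecoveredEdits) throws IOException {
        if (!forRecoveredEdits && !out.hasCapability("hflush")) {
            throw new IOException("stream lacks capability: hflush");
        }
        // ... proceed to write the protobuf WAL header, etc.
    }

    public static void main(String[] args) {
        CapStream out = new NoHflushStream();
        try {
            initWriter(out, false); // live WAL on a no-hflush FS: fails
        } catch (IOException e) {
            System.out.println("WAL writer: " + e.getMessage());
        }
        try {
            initWriter(out, true);  // recovered.edits: check skipped
            System.out.println("recovered.edits writer: ok");
        } catch (IOException e) {
            System.out.println("recovered.edits writer: " + e.getMessage());
        }
    }
}
```

Note that the fix as shipped took a different route than relaxing the check: it moved the recovered.edits output onto the WAL filesystem, where hflush is already guaranteed.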

      Zach York, Sean Busbey fyi

        Attachments

        1. HBASE-20734.002.branch-2.0.patch
          63 kB
          Josh Elser
        2. HBASE-20734.001.branch-2.0.patch
          65 kB
          Josh Elser

              People

              • Assignee:
                elserj Josh Elser
              • Reporter:
                elserj Josh Elser
              • Votes:
                0
              • Watchers:
                11
