HBASE-27484

FNFE on StoreFileScanner after a flush followed by a compaction


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.6.0, 3.0.0-alpha-3, 2.5.1, 2.4.15
    • Fix Version/s: 2.6.0, 3.0.0-alpha-4, 2.4.16, 2.5.3
    • Component/s: None
    • Labels: None

    Description

      One of our customers was running SyncTable from a 1.2-based cluster, with the SyncTable map tasks opening scanners against a 2.4-based cluster to compare the two clusters. A few of the map tasks failed with a DoNotRetryIOException caused by a FileNotFoundException that propagated all the way up to the client:

      Error: org.apache.hadoop.hbase.DoNotRetryIOException: org.apache.hadoop.hbase.DoNotRetryIOException: java.io.FileNotFoundException: open s3a://xxxxxxxxx-xxxxxxxx-xxxxxxx/hbase/data/default/xxxxxx/4c53da8c2ab9b7d7a0d6046ef3bb701c/0/e2c58350e4e54c21b0f713a4c271b8c0 at 7225 on s3a://xxxxxxxxx-xxxxxxxx-xxxxxxx/hbase/data/default/xxxxxx/4c53da8c2ab9b7d7a0d6046ef3bb701c/0/e2c58350e4e54c21b0f713a4c271b8c0: com.amazonaws.services.s3.model.AmazonS3Exception: The specified key does not exist. (Service: Amazon S3; Status Code: 404; Error Code: NoSuchKey; Request ID: KBRNC67WZGCS4SCF; S3 Extended Request ID: wWEJm8tlFSj/8g+xFpmD1vgWzT/n7HzBcOFAZ8ayIqKMsDZGN/d2kEhdusPLhMd540h+QAPP1xw=; Proxy: null), S3 Extended Request ID: wWEJm8tlFSj/8g+xFpmD1vgWzT/n7HzBcOFAZ8ayIqKMsDZGN/d2kEhdusPLhMd540h+QAPP1xw=:NoSuchKey
              at org.apache.hadoop.hbase.regionserver.RSRpcServices.scan(RSRpcServices.java:3712)
              at org.apache.hadoop.hbase.shaded.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:45819)
              at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:392)
              at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:140)
              at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:359)
              at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:339)
      Caused by: java.io.FileNotFoundException: open s3a://xxxxxxxxx-xxxxxxxx-xxxxxxx/hbase/data/default/xxxxxx/4c53da8c2ab9b7d7a0d6046ef3bb701c/0/e2c58350e4e54c21b0f713a4c271b8c0 at 7225 on s3a://xxxxxxxxx-xxxxxxxx-xxxxxxx/hbase/data/default/xxxxxx/4c53da8c2ab9b7d7a0d6046ef3bb701c/0/e2c58350e4e54c21b0f713a4c271b8c0: com.amazonaws.services.s3.model.AmazonS3Exception: The specified key does not exist. (Service: Amazon S3; Status Code: 404; Error Code: NoSuchKey; Request ID: KBRNC67WZGCS4SCF; S3 Extended Request ID: wWEJm8tlFSj/8g+xFpmD1vgWzT/n7HzBcOFAZ8ayIqKMsDZGN/d2kEhdusPLhMd540h+QAPP1xw=; Proxy: null), S3 Extended Request ID: wWEJm8tlFSj/8g+xFpmD1vgWzT/n7HzBcOFAZ8ayIqKMsDZGN/d2kEhdusPLhMd540h+QAPP1xw=:NoSuchKey
      ...
              at org.apache.hadoop.hbase.io.hfile.HFileReaderImpl$HFileScannerImpl.seekTo(HFileReaderImpl.java:632)
              at org.apache.hadoop.hbase.regionserver.StoreFileScanner.seekAtOrAfter(StoreFileScanner.java:315)
              at org.apache.hadoop.hbase.regionserver.StoreFileScanner.seek(StoreFileScanner.java:216)
              at org.apache.hadoop.hbase.regionserver.StoreScanner.seekScanners(StoreScanner.java:417)
              at org.apache.hadoop.hbase.regionserver.StoreScanner.reopenAfterFlush(StoreScanner.java:1018)
              at org.apache.hadoop.hbase.regionserver.StoreScanner.next(StoreScanner.java:552)
              at org.apache.hadoop.hbase.regionserver.KeyValueHeap.next(KeyValueHeap.java:155)
              at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.populateResult(HRegion.java:7399)
              at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextInternal(HRegion.java:7567)
              at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextRaw(HRegion.java:7331)
              at org.apache.hadoop.hbase.regionserver.RSRpcServices.scan(RSRpcServices.java:3373) 

      We can see in the RS logs that the above file had recently been created as the outcome of a memstore flush, with a compaction triggered shortly after:

      2022-11-11 22:16:50,322 INFO org.apache.hadoop.hbase.regionserver.DefaultStoreFlusher: Flushed memstore data size=208.15 KB at sequenceid=4949703 (bloomFilter=false), to=s3a://xxxxxxxxx-xxxxxxxx-xxxxxxx/hbase/data/default/xxxxxx/4c53da8c2ab9b7d7a0d6046ef3bb701c/0/e2c58350e4e54c21b0f713a4c271b8c0
      2022-11-11 22:16:50,513 INFO org.apache.hadoop.hbase.regionserver.HStore: Added s3a://xxxxxxxxx-xxxxxxxx-xxxxxxx/hbase/data/default/xxxxxx/4c53da8c2ab9b7d7a0d6046ef3bb701c/0/e2c58350e4e54c21b0f713a4c271b8c0, entries=951, sequenceid=4949703, filesize=26.2 K
      ...
      2022-11-11 22:16:50,791 INFO org.apache.hadoop.hbase.regionserver.HRegion: Starting compaction of 4c53da8c2ab9b7d7a0d6046ef3bb701c/0 in xxxxxx,IT001E90506702\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00,1618275339031.4c53da8c2ab9b7d7a0d6046ef3bb701c.
      2022-11-11 22:16:50,791 INFO org.apache.hadoop.hbase.regionserver.HStore: Starting compaction of [s3a://xxxxxxxxx-xxxxxxxx-xxxxxxx/hbase/data/default/xxxxxx/4c53da8c2ab9b7d7a0d6046ef3bb701c/0/9a865e5e3ba74ab6b9597fe0e295fd8d, s3a://xxxxxxxxx-xxxxxxxx-xxxxxxx/hbase/data/default/xxxxxx/4c53da8c2ab9b7d7a0d6046ef3bb701c/0/adbe118d8c76482d8faed3170437be0b, s3a://xxxxxxxxx-xxxxxxxx-xxxxxxx/hbase/data/default/xxxxxx/4c53da8c2ab9b7d7a0d6046ef3bb701c/0/e2c58350e4e54c21b0f713a4c271b8c0] into tmpdir=s3a://xxxxxxxxx-xxxxxxxx-xxxxxxx/hbase/data/default/xxxxxx/4c53da8c2ab9b7d7a0d6046ef3bb701c/.tmp, totalSize=27.1 M
      ...
      2022-11-11 22:16:54,469 INFO org.apache.hadoop.hbase.regionserver.HStore: Completed compaction of 3 file(s) in 4c53da8c2ab9b7d7a0d6046ef3bb701c/0 of 4c53da8c2ab9b7d7a0d6046ef3bb701c into bc44bfd06a03429fb331ab4dc10dbb8d(size=27.1 M), total size for store is 4.9 G. This selection was in queue for 0sec, and took 3sec to execute.
      

      I believe this is an unlucky scenario where the compaction discharger archived the compacted-away files while the StoreFileScanner was being created but had not yet bumped the refCount on the file reader. We couldn't reproduce this on a real cluster, but I could emulate it in a UT by artificially inducing a delay in the StoreFileScanner creation for scans not related to compactions. One possible fix is to update the reader refCount as soon as we get the files for the StoreFileScanner instances we are creating, as sketched below.
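      To illustrate the ordering idea, here is a minimal sketch. This is not the actual HBase patch: pinReader()/unpin() and the surrounding types are hypothetical placeholders standing in for the real StoreFileReader refCount handling. The point is to pin every reader before any scanner is opened, so the discharger cannot archive a file in the window between selecting the file and opening its scanner.

      import java.util.ArrayList;
      import java.util.List;

      final class ScannerOpenSketch {

        // Hypothetical stand-ins for HStoreFile / StoreFileReader / StoreFileScanner.
        interface StoreFile {
          Reader pinReader(); // placeholder: increments the reader refCount and returns it
        }

        interface Reader {
          FileScanner newScanner(); // placeholder: opens a scanner over the pinned reader
          void unpin();             // placeholder: decrements the reader refCount
        }

        interface FileScanner {
        }

        // Race-prone ordering: build the file list, then open scanners one by one,
        // bumping each refCount only as each scanner is created. The discharger can
        // archive a compacted-away file between list creation and the bump.
        //
        // Safer ordering sketched here: pin every reader first, then open scanners;
        // on failure, release whatever was already pinned.
        static List<FileScanner> openScanners(List<StoreFile> files) {
          List<Reader> pinned = new ArrayList<>();
          try {
            for (StoreFile f : files) {
              pinned.add(f.pinReader()); // refCount > 0 blocks archival of this file
            }
            List<FileScanner> scanners = new ArrayList<>(pinned.size());
            for (Reader r : pinned) {
              scanners.add(r.newScanner());
            }
            return scanners;
          } catch (RuntimeException e) {
            for (Reader r : pinned) {
              r.unpin(); // release pins so the files can still be archived later
            }
            throw e;
          }
        }
      }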


    People

    • Assignee: Wellington Chevreuil (wchevreuil)
    • Reporter: Wellington Chevreuil (wchevreuil)
    • Votes: 0
    • Watchers: 4
