  1. Hadoop HDFS
  2. HDFS-12528

Add an option to not disable short-circuit reads on failures


Details

    • Reviewed
    • Added an option to not disable short-circuit reads on failures, by setting dfs.domain.socket.disable.interval.seconds to 0.

    Description

      We have scenarios where data ingestion makes use of the -appendToFile operation to add new data to existing HDFS files. In these situations, we're frequently running into the problem described below.

      We're using Impala to query the HDFS data with short-circuit reads (SCR) enabled. After each file read, Impala calls "unbuffer" on the HDFS file to reduce the memory footprint. In some cases, though, Impala still keeps the HDFS file handle open for reuse.

      The "unbuffer" call, however, causes the file's current block reader to be closed, which makes the associated ShortCircuitReplica evictable from the ShortCircuitCache. When the cluster is under load, the ShortCircuitReplica can be purged from the cache quickly, which closes the file descriptor to the underlying storage file.

      That means that when Impala re-reads the file, it has to re-open the storage files associated with the ShortCircuitReplicas that were evicted from the cache. If there were no appends to those blocks, the re-open will succeed without problems. If a block was appended to after the ShortCircuitReplica was created, the re-open will fail with the following error:
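      The re-open failure can be illustrated with a toy model. This is only a sketch under a stated assumption: that an append gives the block a new generation stamp, so the meta file name the client still asks for no longer exists on the DataNode. The class ToyDataNode and its methods are hypothetical and are not Hadoop code.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Toy model of a DataNode's on-disk meta files; hypothetical, not Hadoop code. */
final class ToyDataNode {
    // Meta files present on disk, keyed as "blk_<id>_<generationStamp>".
    private final Set<String> metaFiles = new HashSet<>();
    // Current generation stamp of each block.
    private final Map<Long, Long> currentStamp = new HashMap<>();

    void createBlock(long blockId, long genStamp) {
        currentStamp.put(blockId, genStamp);
        metaFiles.add(key(blockId, genStamp));
    }

    /** Assumption: an append replaces the old meta file with one for a new stamp. */
    void append(long blockId) {
        long old = currentStamp.get(blockId);
        metaFiles.remove(key(blockId, old));
        createBlock(blockId, old + 1);
    }

    /** Re-open as a client would after its cached replica was evicted. */
    String reopen(long blockId, long genStampKnownToClient) {
        String k = key(blockId, genStampKnownToClient);
        return metaFiles.contains(k) ? "OK" : "Meta file for " + k + " not found";
    }

    private static String key(long blockId, long genStamp) {
        return "blk_" + blockId + "_" + genStamp;
    }
}
```

      In this model a client that cached the block before the append keeps asking for the old stamp, so every re-open after eviction fails, while a re-open without an intervening append succeeds.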

      Meta file for BP-810388474-172.31.113.69-1499543341726:blk_1074012183_273087 not found
      

      This error is handled as an "unknown response" by the BlockReaderFactory [1], which disables short-circuit reads for 10 minutes [2] for the client.

      These 10 minutes without SCR can have a big performance impact on client operations. In this particular case ("Meta file not found") it would suffice to return null without disabling SCR. This particular block read would fall back to the normal, non-short-circuited path, and other SCR requests would continue to work as expected.
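      The handling suggested in the previous paragraph could look roughly like the sketch below. It is not the BlockReaderFactory code: the enum ScrOutcome, the class ScrResponseHandler, and the idea of matching on the message text are all assumptions made for illustration.

```java
/** Hypothetical outcome of one short-circuit attempt; names are illustrative. */
enum ScrOutcome { USE_SHORT_CIRCUIT, FALL_BACK_THIS_READ, FALL_BACK_AND_DISABLE }

final class ScrResponseHandler {
    /**
     * Decide what to do with a DataNode reply to a short-circuit request.
     * In this toy model a null message stands for success.
     */
    static ScrOutcome handle(String errorMessage) {
        if (errorMessage == null) {
            return ScrOutcome.USE_SHORT_CIRCUIT;
        }
        // Assumed check: treat "Meta file for ... not found" as a per-block
        // problem, so only this read falls back to the normal path.
        if (errorMessage.startsWith("Meta file for ")
                && errorMessage.endsWith(" not found")) {
            return ScrOutcome.FALL_BACK_THIS_READ;
        }
        // Any other unexpected reply keeps the existing behavior:
        // fall back and disable SCR for the client.
        return ScrOutcome.FALL_BACK_AND_DISABLE;
    }
}
```

      The point of the three-way outcome is that a stale block is a local condition, so it does not need to affect other SCR requests from the same client.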

      It would also be useful to be able to control how long SCR is disabled in the "unknown response" case. 10 minutes seems a bit too long, and not being able to change that is a problem.
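      A configurable interval, with 0 meaning "never disable" as in the release note for dfs.domain.socket.disable.interval.seconds, might behave like the following sketch. The class ShortCircuitGate and its methods are hypothetical, and time is passed in explicitly so the behavior is easy to check; this is not how DomainSocketFactory is written.

```java
/** Hypothetical gate that tracks whether SCR is currently allowed. */
final class ShortCircuitGate {
    private final long disableIntervalMs;
    // Time of the last disabling event; -1 means SCR was never disabled.
    private long disabledAtMs = -1;

    ShortCircuitGate(long disableIntervalSeconds) {
        this.disableIntervalMs = disableIntervalSeconds * 1000L;
    }

    /** Record an unknown response; with an interval of 0 this is a no-op. */
    void onUnknownResponse(long nowMs) {
        if (disableIntervalMs > 0) {
            disabledAtMs = nowMs;
        }
    }

    /** SCR is allowed unless a disabling event is still within the interval. */
    boolean isShortCircuitAllowed(long nowMs) {
        if (disabledAtMs < 0) {
            return true;
        }
        return nowMs - disabledAtMs >= disableIntervalMs;
    }
}
```

      With an interval of 600 seconds the gate reproduces the 10-minute behavior described above; with 0, an unknown response only affects the read that triggered it.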

      [1] https://github.com/apache/hadoop/blob/f67237cbe7bc48a1b9088e990800b37529f1db2a/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/client/impl/BlockReaderFactory.java#L646

      [2] https://github.com/apache/hadoop/blob/f67237cbe7bc48a1b9088e990800b37529f1db2a/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/shortcircuit/DomainSocketFactory.java#L97

      Attachments

        1. HDFS-12528.05.patch
          17 kB
          Xiao Chen
        2. HDFS-12528.04.patch
          17 kB
          Xiao Chen
        3. HDFS-12528.03.patch
          17 kB
          Xiao Chen
        4. HDFS-12528.02.patch
          17 kB
          Xiao Chen
        5. HDFS-12528.01.patch
          13 kB
          Xiao Chen
        6. HDFS-12528.000.patch
          5 kB
          John Zhuge

            People

              Assignee: Xiao Chen (xiaochen)
              Reporter: Andre Araujo (asdaraujo)
              Votes: 0
              Watchers: 12
