Hadoop HDFS / HDFS-7151

DFSInputStream method seek works incorrectly on huge HDFS block size


Details

    • Type: Bug
    • Status: Open
    • Priority: Critical
    • Resolution: Unresolved
    • Affects Version/s: 2.3.0, 2.4.0, 2.5.0, 2.4.1, 2.5.1, 3.0.0-alpha1
    • Fix Version/s: None
    • Component/s: datanode, fuse-dfs, hdfs-client
    • Labels: None
    • Environment: dfs.block.size > 2Gb

    Description

      Hadoop works incorrectly with block sizes larger than 2 GB.

      The seek method of the DFSInputStream class uses an int (32-bit signed) internal value for seeking inside the current block. This causes seek errors when the block size exceeds 2 GB.

      Found when using very large Parquet files (10 GB) in Impala on a Cloudera cluster with a 10 GB block size.
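
      A minimal reproduction sketch (the path is hypothetical; assumes the file was written with dfs.blocksize larger than 2 GB, so the offset below falls inside a single block):

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.FSDataInputStream;
      import org.apache.hadoop.fs.FileSystem;
      import org.apache.hadoop.fs.Path;

      public class SeekRepro {
          public static void main(String[] args) throws Exception {
              FileSystem fs = FileSystem.get(new Configuration());
              // Hypothetical file stored in a single block larger than 2 GB.
              try (FSDataInputStream in = fs.open(new Path("/tmp/bigfile.parquet"))) {
                  in.seek(4390830898L); // offset past the 2 GB boundary (from the log below)
                  // On affected versions the read below can return data from a
                  // wrapped 32-bit offset, and the client logs a warning like
                  // "BlockReader failed to seek to 4390830898."
              }
          }
      }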

      Here is some log output:
      W0924 08:27:15.920017 40026 DFSInputStream.java:1397] BlockReader failed to seek to 4390830898. Instead, it seeked to 95863602.
      W0924 08:27:15.921295 40024 DFSInputStream.java:1397] BlockReader failed to seek to 5597521814. Instead, it seeked to 1302554518.

      BlockReader seeks with only 32-bit offsets: both failed seeks land exactly 2^32 bytes (4 GiB) short of the target (4390830898 - 95863602 = 4294967296, and likewise 5597521814 - 1302554518 = 4294967296).
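
      The wraparound is plain 32-bit truncation; casting the first target offset from the log to int yields exactly the position the reader actually landed on (a self-contained demonstration, assuming the reader started at position 0 of the block):

      public class SeekTruncation {
          public static void main(String[] args) {
              long targetPos = 4390830898L;         // requested offset from the first log line
              long pos = 0L;                        // assumed current position in the block
              int diff = (int) (targetPos - pos);   // narrowing cast keeps only the low 32 bits
              System.out.println(diff);             // prints 95863602, matching the log
              System.out.println(targetPos - diff); // prints 4294967296, i.e. 2^32 (4 GiB)
          }
      }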

      The code fragment producing that bug:
      int diff = (int)(targetPos - pos);     // 64-bit delta truncated to 32 bits
      if (diff <= blockReader.available()) {
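
      A minimal sketch of a possible fix is to keep the delta in a 64-bit long (a sketch, not a committed patch; it relies on Java widening the int returned by blockReader.available() to long for the comparison):

      long diff = targetPos - pos;           // keep the full 64-bit delta
      if (diff <= blockReader.available()) { // int result widened to long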

      Similar errors may exist in other parts of HDFS.

    People

      Assignee: Unassigned
      Reporter: Andrew Rewoonenco (A_Rewoonenco)
      Votes: 0
      Watchers: 6

    Time Tracking

      Original Estimate: 48h
      Remaining Estimate: 48h
      Time Spent: Not Specified