HIVE-11583: When PTF is used over large partitions, the result could be corrupted

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 0.14.0, 0.13.1, 0.14.1, 1.0.0, 1.2.0, 1.2.1
    • Fix Version/s: 1.3.0, 2.0.0
    • Component/s: PTF-Windowing
    • Labels:
      None
    • Environment:

      Hadoop 2.6 + Apache Hive built from trunk

      Description

      Dataset:
      Window has 50001 records (2 blocks on disk and 1 block in memory)
      Size of the second block is >32 MB (2 splits)

      Result:
      When the last block is read from disk, only the first split is actually loaded; the second split is missed. The total row count of the result dataset is correct, but some records are missing and others are duplicated.

      Example:

      CREATE TABLE ptf_big_src (
        id INT,
        key STRING,
        grp STRING,
        value STRING
      ) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
      
      LOAD DATA LOCAL INPATH '../../data/files/ptf_3blocks.txt.gz' OVERWRITE INTO TABLE ptf_big_src;
      
      SELECT grp, COUNT(1) cnt FROM ptf_big_src GROUP BY grp ORDER BY cnt desc;
      ---
      -- A	25000
      -- B	20000
      -- C	5001
      ---
      
      CREATE TABLE ptf_big_trg AS SELECT *, row_number() OVER (PARTITION BY key ORDER BY grp) grp_num FROM ptf_big_src;
      
      SELECT grp, COUNT(1) cnt FROM ptf_big_trg GROUP BY grp ORDER BY cnt desc;
      ---
      -- A	34296
      -- B	15704
      -- C	1
      ---
      

      Counts by 'grp' are incorrect!
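
To see which rows are affected rather than only the per-group totals, a check along the following lines may help. It is a sketch that assumes `id` is unique in ptf_big_src, which the report does not state; if that does not hold, the first query will also flag legitimate repeats.

```sql
-- Assumption: id is unique in ptf_big_src (not confirmed by the report).

-- Rows that appear more than once in the PTF output.
SELECT id, COUNT(1) AS n
FROM ptf_big_trg
GROUP BY id
HAVING COUNT(1) > 1;

-- Source rows that are missing from the PTF output.
SELECT s.id
FROM ptf_big_src s
LEFT OUTER JOIN ptf_big_trg t ON s.id = t.id
WHERE t.id IS NULL;
```

With a correct result both queries should return no rows.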

      1. HIVE-11583.patch
        8 kB
        Illya Yalovyy

          Activity

          yalovyyi Illya Yalovyy added a comment -

          I have implemented a qtest for this issue, but it requires a rather big data file. What is the best way to submit this file? It is a gzip file, size = 204 KB. I can attach this file to the ticket.

          ashutoshc Ashutosh Chauhan added a comment -

          +1

          hiveqa Hive QA added a comment -

          Overall: -1 at least one tests failed

          Here are the results of testing the latest attachment:
          https://issues.apache.org/jira/secure/attachment/12755773/HIVE-11583.patch

          ERROR: -1 due to 2 failed/errored test(s), 9412 tests executed
          Failed tests:

          TestParseNegative - did not produce a TEST-*.xml file
          org.apache.hive.hcatalog.api.TestHCatClient.testTableSchemaPropagation
          

          Test results: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/5276/testReport
          Console output: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/5276/console
          Test logs: http://ec2-174-129-184-35.compute-1.amazonaws.com/logs/PreCommit-HIVE-TRUNK-Build-5276/

          Messages:

          Executing org.apache.hive.ptest.execution.PrepPhase
          Executing org.apache.hive.ptest.execution.ExecutionPhase
          Executing org.apache.hive.ptest.execution.ReportingPhase
          Tests exited with: TestsFailedException: 2 tests failed
          

          This message is automatically generated.

          ATTACHMENT ID: 12755773 - PreCommit-HIVE-TRUNK-Build

          ashutoshc Ashutosh Chauhan added a comment -

          Pushed to master. Thanks, Illya Yalovyy

          yalovyyi Illya Yalovyy added a comment -

          Ashutosh Chauhan, I have a qtest for this issue, but it includes a rather big gzip-compressed file. What is the best way to contribute it? Specifically, how do I create a patch that contains this big binary file?

          ashutoshc Ashutosh Chauhan added a comment -

          Spilling is controlled by the config hive.join.cache.size. Perhaps you can set that to a very low value in the q test to trigger spilling, and thus test this without needing large input data.
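
As a rough illustration of this suggestion, a q test could lower the setting before running the windowing query. The value 100 is an arbitrary placeholder, not a number taken from this ticket, and the query simply reuses the tables from the description.

```sql
-- Hypothetical value: small enough that a modest partition spills to disk.
SET hive.join.cache.size=100;

SELECT *, row_number() OVER (PARTITION BY key ORDER BY grp) grp_num
FROM ptf_big_src;
```

This only forces spilling; it does not by itself control how the spilled blocks are split on disk.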

          yalovyyi Illya Yalovyy added a comment -

          Oh... I was thinking about all possible ways to reduce the size of the file. Cache size is only one piece of the puzzle. The important thing is the physical file system blocks, and it seems I cannot control them from within a Hive script.

          ashutoshc Ashutosh Chauhan added a comment -

          Hive q tests use hive-shims-common/src/main/java/org/apache/hadoop/fs/ProxyLocalFileSystem.java. I think you can configure its block size via fs.local.block.size.

          yalovyyi Illya Yalovyy added a comment -

          I tried, and when I set it from the Hive script it had no effect. Is there any way to reconfigure it BEFORE the test?

          sershe Sergey Shelukhin added a comment -

          Should this issue be backported to branch-1? It looks like a bug.

          yalovyyi Illya Yalovyy added a comment -

          Yes. It is quite a critical bug.

          sershe Sergey Shelukhin added a comment -

          Committed to branch-1

          yalovyyi Illya Yalovyy added a comment -

          What about a qtest for this issue? What is the best course of action?

          sershe Sergey Shelukhin added a comment -

          This was committed a while ago... the test can be created in a separate JIRA if needed. I don't have background on this issue; yesterday I bulk-commented on a large list of issues, obtained via a script, whose titles look like bugs and that were committed to master but not to branch-1.

          yalovyyi Illya Yalovyy added a comment -

          In a nutshell, the question is: what is the best way to upload/provide a rather big binary file for the test? Should I just attach it to the ticket?

          sershe Sergey Shelukhin added a comment -

          You could generate it in the test by repeatedly cross joining. Or does the file have to be in a specific form that is not reproducible by the queries?
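
A minimal sketch of the cross-join idea follows. It assumes the standard q-test table `src` (key STRING, value STRING, 500 rows) is available and that cartesian products are allowed in the test's execution mode; the table name ptf_big_gen and the constant 'k1' are hypothetical. It produces one large partition, but makes no claim about matching the on-disk block and split sizes that trigger the bug.

```sql
-- Assumption: src is the 500-row q-test table; ptf_big_gen is a hypothetical name.
CREATE TABLE ptf_big_gen AS
SELECT
  'k1'    AS key,    -- one constant key, so all rows fall into a single window partition
  a.value AS grp,
  b.value AS value
FROM src a CROSS JOIN src b;   -- 500 x 500 = 250000 rows if src has 500 rows

SELECT *, row_number() OVER (PARTITION BY key ORDER BY grp) grp_num
FROM ptf_big_gen;
```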

          yalovyyi Illya Yalovyy added a comment -

          I think it should be possible to generate that data set (the size of the files matters), but I didn't want to make the qtest even slower... I'll think about this approach.


            People

            • Assignee: yalovyyi Illya Yalovyy
            • Reporter: yalovyyi Illya Yalovyy
            • Votes: 0
            • Watchers: 4
