Hive / HIVE-11977

Hive should handle an external avro table with zero length files present

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.14.0, 1.0.0, 1.2.0, 1.1.0, 1.2.1
    • Fix Version/s: 2.0.0
    • Labels:
      None
    • Target Version/s:

      Description

      If a zero-length file is present in the top-level directory housing an external Avro table, all Hive queries on the table fail.

      The issue is that org.apache.hadoop.hive.ql.io.avro.AvroGenericRecordReader creates a new org.apache.avro.file.DataFileReader, and DataFileReader throws an exception when trying to read an empty file (because the empty file lacks the magic number marking it as Avro).

      AvroGenericRecordReader should detect an empty file and handle it gracefully.

      Caused by: java.io.IOException: Not a data file.
      at org.apache.avro.file.DataFileStream.initialize(DataFileStream.java:102)
      at org.apache.avro.file.DataFileReader.<init>(DataFileReader.java:97)
      at org.apache.hadoop.hive.ql.io.avro.AvroGenericRecordReader.<init>(AvroGenericRecordReader.java:81)
      at org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat.getRecordReader(AvroContainerInputFormat.java:51)
      at org.apache.hadoop.hive.ql.io.HiveInputFormat.getRecordReader(HiveInputFormat.java:246)
      ... 25 more
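A minimal sketch of the guard the description calls for, using java.nio in place of Hadoop's FileSystem API; the class and method names here are illustrative, not the actual patch:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class EmptyAvroGuard {
    /**
     * Returns true when the file may be handed to DataFileReader,
     * false when it is zero-length and should yield no records instead.
     */
    static boolean shouldOpen(Path file) throws IOException {
        return Files.size(file) > 0;
    }

    public static void main(String[] args) throws IOException {
        Path empty = Files.createTempFile("part-0000", ".avro");
        System.out.println(shouldOpen(empty));  // newly created temp file is empty: skip it
        Files.write(empty, new byte[] {'O', 'b', 'j', 1});
        System.out.println(shouldOpen(empty));  // non-empty: safe to construct the reader
        Files.deleteIfExists(empty);
    }
}
```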

      1. HIVE-11977.patch
        2 kB
        Aaron Dossett
      2. HIVE-11977.2.patch
        4 kB
        Aaron Dossett

        Issue Links

          Activity

          brocknoland Brock Noland added a comment -

          Aaron Dossett Sorry, I just saw this ping! I moved my mail account and had not yet configured my rules appropriately. This patch looks good! Nice work.

          Sergey Shelukhin - agreed, it'd be great to see this in 1.x.

          sershe Sergey Shelukhin added a comment -

          Should this issue be backported to branch-1? It looks like a bug.

          ashutoshc Ashutosh Chauhan added a comment -

          Pushed to master. Thanks, Aaron!

          ashutoshc Ashutosh Chauhan added a comment -

          +1

          hiveqa Hive QA added a comment -

          Overall: -1 at least one tests failed

          Here are the results of testing the latest attachment:
          https://issues.apache.org/jira/secure/attachment/12765213/HIVE-11977.2.patch

          SUCCESS: +1 due to 1 test(s) being added or modified.

          ERROR: -1 due to 3 failed/errored test(s), 9651 tests executed
          Failed tests:

          org.apache.hive.hcatalog.api.TestHCatClient.testTableSchemaPropagation
          org.apache.hive.hcatalog.hbase.TestPigHBaseStorageHandler.org.apache.hive.hcatalog.hbase.TestPigHBaseStorageHandler
          org.apache.hive.jdbc.TestSSL.testSSLVersion
          

          Test results: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/5551/testReport
          Console output: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/5551/console
          Test logs: http://ec2-174-129-184-35.compute-1.amazonaws.com/logs/PreCommit-HIVE-TRUNK-Build-5551/

          Messages:

          Executing org.apache.hive.ptest.execution.TestCheckPhase
          Executing org.apache.hive.ptest.execution.PrepPhase
          Executing org.apache.hive.ptest.execution.ExecutionPhase
          Executing org.apache.hive.ptest.execution.ReportingPhase
          Tests exited with: TestsFailedException: 3 tests failed
          

          This message is automatically generated.

          ATTACHMENT ID: 12765213 - PreCommit-HIVE-TRUNK-Build

          dossett@gmail.com Aaron Dossett added a comment -

          Thank you all around, Ashutosh Chauhan! Yes, I have tested this on local clusters, and we are planning to deploy it to production next week as well. I will resubmit my patch; thank you for that pointer.

          ashutoshc Ashutosh Chauhan added a comment -

          Alas, it's unfortunate that the Avro project chose to inflict pain on their users instead of making their reader robust. I am fine with having this fix in Hive.
          Btw, have you tested this on a cluster? Also, you need to name your patch as per https://cwiki.apache.org/confluence/display/Hive/Hive+PreCommit+Patch+Testing to trigger a QA run, and then set the status to "Patch Available".

          dossett@gmail.com Aaron Dossett added a comment -

          Thanks, Ashutosh Chauhan, I checked the Avro JIRA based on your suggestion. The Avro project declined that option in AVRO-1530 and suggested clients ignore zero-length files. That also led me to HIVE-7316, which my issue duplicates.

          Brock Noland Your thoughts, since you are on both of the above JIRAs?

          ashutoshc Ashutosh Chauhan added a comment -

          If I understand correctly, the reason you are suggesting this is that the reader should be resilient to such invalid files. If so, I think the better place to skip such files is Avro's native reader itself.
          That way all Avro users get the benefit of this, not just Hive. E.g., if you read that data directly (i.e., outside of Hive), this fix would otherwise be needed again.

          dossett@gmail.com Aaron Dossett added a comment -

          Ashutosh Chauhan Thank you for your response! My thought is that any process generating this data could have failure scenarios that result in zero-length files; this was the case when I initially ran into this issue. A file was opened on HDFS and "held" as a zero-length file before data was written to it, and the writer crashed before any data could be written. The consequence in these cases, that the entire table becomes unreadable (based on my experience), seems disproportionate to the actual problem. Likewise, a process deleting empty files could expose small windows where the table was unusable.

          Would adding a warning and/or an option like hive.exec.orc.skip.corrupt.data be more appropriate than silently ignoring the files? This is my first foray into Hive internals, so perhaps that ORC option is not an exact comparison to this situation, but as a user it seems similar.

          Thank you again for the response and your feedback!
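The option floated above can be sketched as a simple guard: a hypothetical skip-empty-files flag (the flag itself is illustrative, by analogy with hive.exec.orc.skip.corrupt.data, not an actual Hive property) that logs a warning and yields no records rather than failing the query:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.logging.Logger;

public class SkipEmptyFiles {
    private static final Logger LOG = Logger.getLogger(SkipEmptyFiles.class.getName());

    /**
     * Returns true when the caller may construct a DataFileReader for the file.
     * For a zero-length file: logs and returns false when skipping is enabled
     * (the split produces no records), otherwise fails as the reader would today.
     */
    static boolean openOrSkip(Path file, boolean skipEmptyFiles) throws IOException {
        if (Files.size(file) == 0) {
            if (skipEmptyFiles) {
                LOG.warning("Skipping zero-length file: " + file);
                return false;
            }
            throw new IOException("Not a data file: " + file);
        }
        return true;
    }
}
```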

          ashutoshc Ashutosh Chauhan added a comment -

          Thanks for the patch, Aaron Dossett.
          A zero-length file is an invalid Avro file, since Avro's DataFileWriter always writes the MAGIC header carrying the version. That's the reason DataFileReader expects it and throws when it doesn't find one.
          It seems these zero-length files got there because of some faulty generator process. Isn't it better to just not generate those zero-length files? Or, alternatively, to delete these faulty files?
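The MAGIC header mentioned above can be checked directly: per the Avro specification, every container file begins with the four bytes 'O', 'b', 'j', 0x01. A minimal stdlib sketch of that check (the class and method names are illustrative):

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

public class AvroMagicCheck {
    // Per the Avro spec: ASCII "Obj" followed by the format version byte 1.
    private static final byte[] MAGIC = {'O', 'b', 'j', 1};

    /** Returns true only if the file starts with the Avro container magic. */
    static boolean hasAvroMagic(Path file) throws IOException {
        byte[] head = new byte[MAGIC.length];
        try (InputStream in = Files.newInputStream(file)) {
            int n = in.readNBytes(head, 0, head.length);
            return n == MAGIC.length && Arrays.equals(head, MAGIC);
        }
    }
}
```

A zero-length file trivially fails this check, which is why DataFileReader rejects it with "Not a data file."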

          dossett@gmail.com Aaron Dossett added a comment -

          Attached a second patch that includes a unit test and better patch formatting

          hiveqa Hive QA added a comment -

          Overall: -1 no tests executed

          Here are the results of testing the latest attachment:
          https://issues.apache.org/jira/secure/attachment/12764065/HIVE-11977.patch

          Test results: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/5462/testReport
          Console output: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/5462/console
          Test logs: http://ec2-174-129-184-35.compute-1.amazonaws.com/logs/PreCommit-HIVE-TRUNK-Build-5462/

          Messages:

          Executing org.apache.hive.ptest.execution.PrepPhase
          Tests exited with: NonZeroExitCodeException
          Command 'bash /data/hive-ptest/working/scratch/source-prep.sh' failed with exit status 1 and output '+ [[ -n /usr/java/jdk1.7.0_45-cloudera ]]
          + export JAVA_HOME=/usr/java/jdk1.7.0_45-cloudera
          + JAVA_HOME=/usr/java/jdk1.7.0_45-cloudera
          + export PATH=/usr/java/jdk1.7.0_45-cloudera/bin/:/usr/local/apache-maven-3.0.5/bin:/usr/java/jdk1.7.0_45-cloudera/bin:/usr/local/apache-ant-1.9.1/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/home/hiveptest/bin
          + PATH=/usr/java/jdk1.7.0_45-cloudera/bin/:/usr/local/apache-maven-3.0.5/bin:/usr/java/jdk1.7.0_45-cloudera/bin:/usr/local/apache-ant-1.9.1/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/home/hiveptest/bin
          + export 'ANT_OPTS=-Xmx1g -XX:MaxPermSize=256m '
          + ANT_OPTS='-Xmx1g -XX:MaxPermSize=256m '
          + export 'M2_OPTS=-Xmx1g -XX:MaxPermSize=256m -Dhttp.proxyHost=localhost -Dhttp.proxyPort=3128'
          + M2_OPTS='-Xmx1g -XX:MaxPermSize=256m -Dhttp.proxyHost=localhost -Dhttp.proxyPort=3128'
          + cd /data/hive-ptest/working/
          + tee /data/hive-ptest/logs/PreCommit-HIVE-TRUNK-Build-5462/source-prep.txt
          + [[ false == \t\r\u\e ]]
          + mkdir -p maven ivy
          + [[ git = \s\v\n ]]
          + [[ git = \g\i\t ]]
          + [[ -z master ]]
          + [[ -d apache-github-source-source ]]
          + [[ ! -d apache-github-source-source/.git ]]
          + [[ ! -d apache-github-source-source ]]
          + cd apache-github-source-source
          + git fetch origin
          From https://github.com/apache/hive
             dc130f0..1636292  branch-1   -> origin/branch-1
             a5ffa71..6a8d7e4  master     -> origin/master
          + git reset --hard HEAD
          HEAD is now at a5ffa71 HIVE-11724 : WebHcat get jobs to order jobs on time order with latest at top (Kiran Kumar Kolli, reviewed by Hari Subramaniyan)
          + git clean -f -d
          Removing ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/HiveHepPlannerContext.java
          Removing ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/HiveVolcanoPlannerContext.java
          Removing ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveRulesRegistry.java
          + git checkout master
          Already on 'master'
          Your branch is behind 'origin/master' by 3 commits, and can be fast-forwarded.
          + git reset --hard origin/master
          HEAD is now at 6a8d7e4 HIVE-11819 : HiveServer2 catches OOMs on request threads (Sergey Shelukhin, reviewed by Vaibhav Gumashta)
          + git merge --ff-only origin/master
          Already up-to-date.
          + git gc
          + patchCommandPath=/data/hive-ptest/working/scratch/smart-apply-patch.sh
          + patchFilePath=/data/hive-ptest/working/scratch/build.patch
          + [[ -f /data/hive-ptest/working/scratch/build.patch ]]
          + chmod +x /data/hive-ptest/working/scratch/smart-apply-patch.sh
          + /data/hive-ptest/working/scratch/smart-apply-patch.sh /data/hive-ptest/working/scratch/build.patch
          patch: **** malformed patch at line 34: @@ -146,7 +156,7 @@ private boolean pathIsInPartition(Path split, String partitionPath) {
          
          patch: **** malformed patch at line 34: @@ -146,7 +156,7 @@ private boolean pathIsInPartition(Path split, String partitionPath) {
          
          patch: **** malformed patch at line 34: @@ -146,7 +156,7 @@ private boolean pathIsInPartition(Path split, String partitionPath) {
          
          The patch does not appear to apply with p0, p1, or p2
          + exit 1
          '
          

          This message is automatically generated.

          ATTACHMENT ID: 12764065 - PreCommit-HIVE-TRUNK-Build

          dossett@gmail.com Aaron Dossett added a comment -

          Uploading my first take at a fix. I am working on adding appropriate unit / integration tests, but any feedback would be welcome in the meantime.


            People

            • Assignee:
              dossett@gmail.com Aaron Dossett
            • Reporter:
              dossett@gmail.com Aaron Dossett
            • Votes:
              0
            • Watchers:
              6

              Dates

              • Created:
                Updated:
                Resolved:
