HIVE-3025: Fix Hive ARCHIVE command on 0.22 and 0.23

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 0.9.0
    • Fix Version/s: None
    • Component/s: Query Processor
    • Labels: None

      Description

      archive.q and archive_multi.q fail when Hive is run on top of Hadoop 0.22 or 0.23.


          Activity

          Phabricator added a comment -

          cwsteinbach requested code review of "HIVE-3025 [jira] Fix Hive ARCHIVE command on 0.22 and 0.23".
          Reviewers: JIRA, ashutoshc

          HIVE-3025. Fix Hive ARCHIVE command on 0.22 and 0.23

          archive.q and archive_multi.q fail when Hive is run on top of Hadoop 0.22 or 0.23.

          This patch moves the HiveHarFileSystem shim to shims/common and adds testcases
          for archive.q and archive_multi.q that take into account MAPREDUCE-1806 on
          0.22 and 0.23.

          TEST PLAN
          EMPTY

          REVISION DETAIL
          https://reviews.facebook.net/D3195

          AFFECTED FILES
          ql/src/java/org/apache/hadoop/hive/ql/exec/ArchiveUtils.java
          ql/src/test/queries/clientpositive/archive.q
          ql/src/test/queries/clientpositive/archive_mr_1806.q
          ql/src/test/queries/clientpositive/archive_multi.q
          ql/src/test/queries/clientpositive/archive_multi_mr_1806.q
          ql/src/test/results/clientpositive/archive.q.out
          ql/src/test/results/clientpositive/archive_mr_1806.q.out
          ql/src/test/results/clientpositive/archive_multi.q.out
          ql/src/test/results/clientpositive/archive_multi_mr_1806.q.out
          shims/src/common/java/org/apache/hadoop/hive/shims/HiveHarFileSystem.java
          shims/src/0.20/java/org/apache/hadoop/hive/shims/HiveHarFileSystem.java


          To: JIRA, ashutoshc, cwsteinbach

          Phabricator added a comment -

          zhenxiao has commented on the revision "HIVE-3025 [jira] Fix Hive ARCHIVE command on 0.22 and 0.23".

          +1

          Good for me

          REVISION DETAIL
          https://reviews.facebook.net/D3195

          To: JIRA, ashutoshc, cwsteinbach
          Cc: zhenxiao

          Ashutosh Chauhan added a comment -

          Carl Steinbach I left some comments on Phabricator.

          Vikram Dixit K added a comment -

          I am unable to apply the patch on trunk because it reports that the following files are missing:

          ql/src/test/queries/clientpositive/archive_mr_1806.q
          ql/src/test/queries/clientpositive/archive_multi_mr_1806.q
          ql/src/test/results/clientpositive/archive_mr_1806.q.out
          ql/src/test/results/clientpositive/archive_multi_mr_1806.q.out

          Please let me know if I am missing anything. I am trying to apply it on the svn repo.

          Thanks,
          Vikram

          Vikram Dixit K added a comment -

          After digging more into this with @hashutosh's help, we see the following issues:

          1. The hadoop archive command line has changed.
          2. The current set of commands supported by Hive gives the user no way to specify a parent directory for the archive.
          3. The createHadoopArchive API is identical in every shim, which is counter-intuitive.

          The hadoop archive command line changed between 0.20 and 0.20S/1.0/0.23: the later versions require a mandatory -p (parent path) parameter. Since Hive drives these versions with the same command line it uses for 0.20 (i.e. without -p), the command fails. This needs to be fixed in the createHadoopArchive API.

          createHadoopArchive also has the issue that it checks hive.archive.har.parentdir.settable, yet with the current set of commands the user has no way of setting a parent directory for the creation of the archive. So, when that ability is added in the future, we will need to revisit the createHadoopArchive API itself or derive the value from the conf.

          The createHadoopArchive API is the same across all the shims, i.e. Hadoop20Shims.java and HadoopShimsSecure.java have exactly the same implementation, which is counter-intuitive given that the shims are supposed to be specific to particular versions of Hadoop.

          So I propose that, for now, we fix createHadoopArchive in HadoopShimsSecure to adhere to the new command line expected by those versions of Hadoop, and fix the Hadoop20Shims API to ignore the -p parameter since it cannot use it (a minimal sketch of the two invocations follows at the end of this comment).

          Please let me know if I am missing something.
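
          For illustration only, a minimal sketch of the two invocations, assuming the shims drive org.apache.hadoop.tools.HadoopArchives through ToolRunner (the helper class and method names below are hypothetical; this is not the actual Hive shim code):

            import java.util.ArrayList;
            import java.util.List;

            import org.apache.hadoop.conf.Configuration;
            import org.apache.hadoop.tools.HadoopArchives;
            import org.apache.hadoop.util.ToolRunner;

            // Hypothetical helper showing only the command-line difference discussed above.
            public class ArchiveCommandSketch {

              // 0.20-style invocation: no -p option; the source is given as a full path.
              static int archiveOn20(Configuration conf, String archiveName,
                                     String sourceDir, String destDir) throws Exception {
                List<String> args = new ArrayList<String>();
                args.add("-archiveName");
                args.add(archiveName);
                args.add(sourceDir);   // full path of the directory to archive
                args.add(destDir);     // directory where the .har file is written
                return ToolRunner.run(conf, new HadoopArchives(conf),
                                      args.toArray(new String[args.size()]));
              }

              // 0.20S/1.0/0.23-style invocation: -p <parent> is mandatory and the
              // source is expressed relative to that parent.
              static int archiveOnSecure(Configuration conf, String archiveName,
                                         String parentDir, String relativeSource,
                                         String destDir) throws Exception {
                List<String> args = new ArrayList<String>();
                args.add("-archiveName");
                args.add(archiveName);
                args.add("-p");
                args.add(parentDir);        // mandatory parent-path option on newer versions
                args.add(relativeSource);   // source relative to the parent
                args.add(destDir);
                return ToolRunner.run(conf, new HadoopArchives(conf),
                                      args.toArray(new String[args.size()]));
              }
            }

          The only behavioral difference is the mandatory -p <parent> pair on the newer versions; a Hadoop20Shims-style variant would simply never emit it.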

          Ashutosh Chauhan added a comment -

          Vikram Dixit K, I forgot: what was the resolution of your investigation into this?

          Ashutosh Chauhan added a comment -

          It seems HIVE-3338 fixed it for hadoop-1, but there is still some problem with hadoop-2.

          Brock Noland added a comment -

          Because of this, archive_multi.q (and others) fail on 0.23. The error for archive_multi.q is below. It looks to be because HarFileSystem (potentially through makeQualified) adds an extra ':' to the authority, and this extra colon trips up HiveFileFormatUtils.getPartitionDescFromPathRecursively. Compare "cannot find dir = har://pfile-localhost:/" below versus the paths in pathToPartitionInfo, "har://pfile-localhost/". (A small sketch of the mismatch follows the stack trace.)

           [junit] java.io.IOException: cannot find dir = har://pfile-localhost:/home/noland/workspaces/hive-apache/hive/build/ql/test/data/warehouse/tstsrcpart/ds=2008-04-08/data.har/hr=11/000000_0 in pathToPartitionInfo: [har://pfile-localhost/home/noland/workspaces/hive-apache/hive/build/ql/test/data/warehouse/tstsrcpart/ds=2008-04-08/data.har/hr=11, har://pfile-localhost/home/noland/workspaces/hive-apache/hive/build/ql/test/data/warehouse/tstsrcpart/ds=2008-04-08/data.har/hr=12]
              [junit] 	at org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getPartitionDescFromPathRecursively(HiveFileFormatUtils.java:298)
              [junit] 	at org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getPartitionDescFromPathRecursively(HiveFileFormatUtils.java:260)
              [junit] 	at org.apache.hadoop.hive.ql.io.CombineHiveInputFormat$CombineHiveInputSplit.<init>(CombineHiveInputFormat.java:104)
              [junit] 	at org.apache.hadoop.hive.ql.io.CombineHiveInputFormat.getSplits(CombineHiveInputFormat.java:409)
              [junit] 	at org.apache.hadoop.mapreduce.JobSubmitter.writeOldSplits(JobSubmitter.java:480)
              [junit] 	at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:472)
              [junit] 	at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:367)
              [junit] 	at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1218)
              [junit] 	at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1215)
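
          For illustration, a minimal, hypothetical sketch of the mismatch as a plain string lookup (the map stands in for pathToPartitionInfo and the warehouse paths are abbreviated):

            import java.util.HashMap;
            import java.util.Map;

            // Hypothetical reproduction of the lookup failure above: the split path carries
            // an extra ':' after the har authority, while the keys in pathToPartitionInfo
            // were registered without it.
            public class HarAuthoritySketch {
              public static void main(String[] args) {
                Map<String, String> pathToPartitionInfo = new HashMap<String, String>();
                pathToPartitionInfo.put(
                    "har://pfile-localhost/warehouse/tstsrcpart/ds=2008-04-08/data.har/hr=11",
                    "partition hr=11");

                // Directory of the input split, as reported in the IOException above.
                String splitDir =
                    "har://pfile-localhost:/warehouse/tstsrcpart/ds=2008-04-08/data.har/hr=11";

                // The lookup fails because "pfile-localhost:" != "pfile-localhost".
                System.out.println(pathToPartitionInfo.containsKey(splitDir));   // false

                // Dropping the spurious ':' from the authority makes the two forms agree.
                String normalized = splitDir.replaceFirst("^(har://[^/:]+):/", "$1/");
                System.out.println(pathToPartitionInfo.containsKey(normalized)); // true
              }
            }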
          
          Vikram Dixit K added a comment -

          Further work on this JIRA is happening in HIVE-4910.


            People

            • Assignee: Carl Steinbach
            • Reporter: Carl Steinbach
            • Votes: 1
            • Watchers: 7
