Hive
  1. Hive
  2. HIVE-1510

HiveCombineInputFormat should not use prefix matching to find the partitionDesc for a given path

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.7.0
    • Component/s: None
    • Labels:
      None

      Description

      set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;

      drop table combine_3_srcpart_seq_rc;

      create table combine_3_srcpart_seq_rc (key int , value string) partitioned by (ds string, hr string) stored as sequencefile;

      insert overwrite table combine_3_srcpart_seq_rc partition (ds="2010-08-03", hr="00") select * from src;

      alter table combine_3_srcpart_seq_rc set fileformat rcfile;
      insert overwrite table combine_3_srcpart_seq_rc partition (ds="2010-08-03", hr="001") select * from src;

      desc extended combine_3_srcpart_seq_rc partition(ds="2010-08-03", hr="00");
      desc extended combine_3_srcpart_seq_rc partition(ds="2010-08-03", hr="001");

      select * from combine_3_srcpart_seq_rc where ds="2010-08-03" order by key;

      drop table combine_3_srcpart_seq_rc;

      will fail.

      1. hive-1510.1.patch
        36 kB
        He Yongqiang
      2. hive-1510.3.patch
        41 kB
        He Yongqiang
      3. hive-1510.4.patch
        42 kB
        He Yongqiang

        Activity

        Hide
        Namit Jain added a comment -

        Some minor comments
        TestHiveFileFormatUtils:

        1. Use a different PartitionDesc every time instead of partDesc_1 for the partitions.
        2. Spelling mistakes: forth group

        Otherwise, it looks good to me. Ning, can you also OK it , since we spent a lot of time debugging in the past.

        Also, before checking it, can you try the following 4 types of queries (with CombineHiveInputFormat):

        1. hadoop 17 normal query
        2. hadoop 17 sampling query
        3. hadoop 20 normal query
        4. hadoop 20 sampling query

        Show
        Namit Jain added a comment - Some minor comments TestHiveFileFormatUtils: 1. Use a different PartitionDesc every time instead of partDesc_1 for the partitions. 2. Spelling mistakes: forth group Otherwise, it looks good to me. Ning, can you also OK it , since we spent a lot of time debugging in the past. Also, before checking it, can you try the following 4 types of queries (with CombineHiveInputFormat): 1. hadoop 17 normal query 2. hadoop 17 sampling query 3. hadoop 20 normal query 4. hadoop 20 sampling query
        Hide
        Ning Zhang added a comment -

        In HiveFileFormatUtils, removing scheme and authorities from the URI and only retain the path part may cause problem when the URI is har:// rather than hdfs://. This is one of the bugs that Paul fixed in hadoop for HAR to be able to work with CHIF.

        As a general comment, is it easier to just modify the pathToPartitionInfo to add a Path.SEPARATER at the end? You don't need to introduce the recursive checking and still can use the prefix matching.

        Show
        Ning Zhang added a comment - In HiveFileFormatUtils, removing scheme and authorities from the URI and only retain the path part may cause problem when the URI is har:// rather than hdfs://. This is one of the bugs that Paul fixed in hadoop for HAR to be able to work with CHIF. As a general comment, is it easier to just modify the pathToPartitionInfo to add a Path.SEPARATER at the end? You don't need to introduce the recursive checking and still can use the prefix matching.
        Hide
        He Yongqiang added a comment -

        will also test against har. even the old code also remove the scheme and authorities part when try to match partitionDesc.
        no offense – the old code is not very clear, and not efficient. The new code does the same thing with simplified logic.

        Show
        He Yongqiang added a comment - will also test against har. even the old code also remove the scheme and authorities part when try to match partitionDesc. no offense – the old code is not very clear, and not efficient. The new code does the same thing with simplified logic.
        Hide
        Ning Zhang added a comment -

        It's fine for me if you feel strong for it. The concern from me (besides har+CHIF support) is the performance implication when using CHIF merging large number of small files inside a partition. Siying has a use case where the pathToPartitionInfo is very large and the # of files in the splits is also very large. Determining whether partitionDesc for each input path takes a long time. In your patch, you have another HashMap for the path part of the pathToPartitionInfo (which trade memory for speed), but introduced another loop for comparing parent of paths. It would be nice (better performance) if you could avoid this loop by simply appending '/' at the end. But if it doesn't hurt the performance or appending '/' doesn't work, the current patch is fine for me too.

        As an aside, we should find out why pathToPartitionInfo in some cases contains paths only rather than the full URI. The ideal case is that it should always contains the full URI so that we don't rely on heuristics. But this could be another JIRA.

        Show
        Ning Zhang added a comment - It's fine for me if you feel strong for it. The concern from me (besides har+CHIF support) is the performance implication when using CHIF merging large number of small files inside a partition. Siying has a use case where the pathToPartitionInfo is very large and the # of files in the splits is also very large. Determining whether partitionDesc for each input path takes a long time. In your patch, you have another HashMap for the path part of the pathToPartitionInfo (which trade memory for speed), but introduced another loop for comparing parent of paths. It would be nice (better performance) if you could avoid this loop by simply appending '/' at the end. But if it doesn't hurt the performance or appending '/' doesn't work, the current patch is fine for me too. As an aside, we should find out why pathToPartitionInfo in some cases contains paths only rather than the full URI. The ideal case is that it should always contains the full URI so that we don't rely on heuristics. But this could be another JIRA.
        Hide
        He Yongqiang added a comment -

        Will update the patch once https://issues.apache.org/jira/browse/HIVE-1515 is in.

        Show
        He Yongqiang added a comment - Will update the patch once https://issues.apache.org/jira/browse/HIVE-1515 is in.
        Hide
        He Yongqiang added a comment -

        Since HIVE-1515 depends on Hadoop, can we close this jira without adding new archive testcases.

        Show
        He Yongqiang added a comment - Since HIVE-1515 depends on Hadoop, can we close this jira without adding new archive testcases.
        Hide
        Ning Zhang added a comment -

        As discussed offline with Yongqiang, we should clean up the pathToPartitionInfo to contain only canonical representations for each partition. This could result in much cleaner code. If we do that IOPrepareCache is not needed at all and the function getPartitionDescFromPath is just simple hash lookup. We can make it as a follow up JIRA along with cleaning up the unnecessary info in pathToPartitionInfo as well.

        Here's some comments on the current patch:

        • the IOPrepareCache is cleared in Driver, which should only contain generic code irrespect to task types. Can you do it in ExecDriver.execute()? This will new cache is only used in ExecDriver anyways.
        • some comments on why you need a new hash map keyed with the paths only will be helpful.
        Show
        Ning Zhang added a comment - As discussed offline with Yongqiang, we should clean up the pathToPartitionInfo to contain only canonical representations for each partition. This could result in much cleaner code. If we do that IOPrepareCache is not needed at all and the function getPartitionDescFromPath is just simple hash lookup. We can make it as a follow up JIRA along with cleaning up the unnecessary info in pathToPartitionInfo as well. Here's some comments on the current patch: the IOPrepareCache is cleared in Driver, which should only contain generic code irrespect to task types. Can you do it in ExecDriver.execute()? This will new cache is only used in ExecDriver anyways. some comments on why you need a new hash map keyed with the paths only will be helpful.
        Hide
        He Yongqiang added a comment -

        >>the IOPrepareCache is cleared in Driver, which should only contain generic code irrespect to task types. Can you do it in ExecDriver.execute()? This will new cache is only used in ExecDriver anyways.

        ExecDriver is per map-reduce task. Driver is per query. We should do this for query granularity. I think the pathToPartitionDesc is also per query map?

        >>some comments on why you need a new hash map keyed with the paths only will be helpful.
        will do it in a next patch.

        Show
        He Yongqiang added a comment - >>the IOPrepareCache is cleared in Driver, which should only contain generic code irrespect to task types. Can you do it in ExecDriver.execute()? This will new cache is only used in ExecDriver anyways. ExecDriver is per map-reduce task. Driver is per query. We should do this for query granularity. I think the pathToPartitionDesc is also per query map? >>some comments on why you need a new hash map keyed with the paths only will be helpful. will do it in a next patch.
        Hide
        He Yongqiang added a comment -

        About the additional hashmap added, it is used to match path to partitionDesc by discarding partitionDesc's schema information.

        In the long run, we should normalize all input path to let them contain full schema and authorization information. This is a must to let hive work with multiple hdfs clusters.

        Show
        He Yongqiang added a comment - About the additional hashmap added, it is used to match path to partitionDesc by discarding partitionDesc's schema information. In the long run, we should normalize all input path to let them contain full schema and authorization information. This is a must to let hive work with multiple hdfs clusters.
        Hide
        Ning Zhang added a comment -

        Other than the clean architecture concerns (Driver should be generic and should not assume tasks contain MR jobs), it seems also doesn't work if parallel execution is enabled: IOPrepareCache is thread local and parallel MR jobs are launched in different threads.

        Show
        Ning Zhang added a comment - Other than the clean architecture concerns (Driver should be generic and should not assume tasks contain MR jobs), it seems also doesn't work if parallel execution is enabled: IOPrepareCache is thread local and parallel MR jobs are launched in different threads.
        Hide
        Ning Zhang added a comment -

        +1, will commit if tests pass

        Show
        Ning Zhang added a comment - +1, will commit if tests pass
        Hide
        Ning Zhang added a comment -

        Yongqiang, the 0.17 test failed on index_compact3.q and script_pipe.q (the latter may be a false alarm). Can you take a look?

        Show
        Ning Zhang added a comment - Yongqiang, the 0.17 test failed on index_compact3.q and script_pipe.q (the latter may be a false alarm). Can you take a look?
        Hide
        He Yongqiang added a comment -

        even without this patch, the 0.17 test failed on index_compat3.q. Please file a separate jira for this issue.

        Show
        He Yongqiang added a comment - even without this patch, the 0.17 test failed on index_compat3.q. Please file a separate jira for this issue.
        Hide
        Ning Zhang added a comment -

        it does't fail on trunk but caused by parallel test. HIVE-1576 was filed for this.

        Will tes again and commit once HIVE-1307 is committed.

        Show
        Ning Zhang added a comment - it does't fail on trunk but caused by parallel test. HIVE-1576 was filed for this. Will tes again and commit once HIVE-1307 is committed.
        Hide
        Ning Zhang added a comment -

        Committed. Thanks Yongqiang!

        Show
        Ning Zhang added a comment - Committed. Thanks Yongqiang!

          People

          • Assignee:
            He Yongqiang
            Reporter:
            He Yongqiang
          • Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development