Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.8.0
    • Component/s: None
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      Speedup the addInputPaths for combined symlink inputformat, and added some other micro optimizations which also work for normal cases.

      This can help reducing the start time of one query from 5 hours to less than 20 mins.

      1. HIVE-2218.1.patch
        10 kB
        He Yongqiang
      2. HIVE-2218.2.patch
        3 kB
        He Yongqiang
      3. HIVE-2218.3.patch
        3 kB
        He Yongqiang

        Activity

        Carl Steinbach made changes -
        Status Resolved [ 5 ] Closed [ 6 ]
        Hide
        Hudson added a comment -

        Integrated in Hive-trunk-h0.21 #790 (See https://builds.apache.org/job/Hive-trunk-h0.21/790/)

        Show
        Hudson added a comment - Integrated in Hive-trunk-h0.21 #790 (See https://builds.apache.org/job/Hive-trunk-h0.21/790/ )
        Ning Zhang made changes -
        Status Patch Available [ 10002 ] Resolved [ 5 ]
        Hadoop Flags [Reviewed]
        Fix Version/s 0.8.0 [ 12316178 ]
        Resolution Fixed [ 1 ]
        Hide
        Ning Zhang added a comment -

        Committed. Thanks Yongqiang!

        Show
        Ning Zhang added a comment - Committed. Thanks Yongqiang!
        Hide
        Ning Zhang added a comment -

        +1

        Show
        Ning Zhang added a comment - +1
        He Yongqiang made changes -
        Attachment HIVE-2218.3.patch [ 12482719 ]
        Hide
        He Yongqiang added a comment -

        use path.getParent instead of writing code to get the parent of a path

        Show
        He Yongqiang added a comment - use path.getParent instead of writing code to get the parent of a path
        Hide
        jiraposter@reviews.apache.org added a comment -

        -----------------------------------------------------------
        This is an automatically generated e-mail. To reply, visit:
        https://reviews.apache.org/r/898/#review846
        -----------------------------------------------------------

        trunk/ql/src/java/org/apache/hadoop/hive/ql/io/CombineHiveInputFormat.java
        <https://reviews.apache.org/r/898/#comment1839>

        If we change accept(), we need to remove + File.separator here.

        trunk/ql/src/java/org/apache/hadoop/hive/ql/io/CombineHiveInputFormat.java
        <https://reviews.apache.org/r/898/#comment1837>

        I think it would be better to use path.getParent().toString() instead of doing string manipulation.

        • Ning

        On 2011-06-15 20:24:15, Yongqiang He wrote:

        -----------------------------------------------------------

        This is an automatically generated e-mail. To reply, visit:

        https://reviews.apache.org/r/898/

        -----------------------------------------------------------

        (Updated 2011-06-15 20:24:15)

        Review request for hive.

        Summary

        -------

        speedup addInputPaths

        This addresses bug HIVE-2218.

        https://issues.apache.org/jira/browse/HIVE-2218

        Diffs

        -----

        trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/ExecDriver.java 1135335

        trunk/ql/src/java/org/apache/hadoop/hive/ql/io/CombineHiveInputFormat.java 1135335

        Diff: https://reviews.apache.org/r/898/diff

        Testing

        -------

        yes.

        Thanks,

        Yongqiang

        Show
        jiraposter@reviews.apache.org added a comment - ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/898/#review846 ----------------------------------------------------------- trunk/ql/src/java/org/apache/hadoop/hive/ql/io/CombineHiveInputFormat.java < https://reviews.apache.org/r/898/#comment1839 > If we change accept(), we need to remove + File.separator here. trunk/ql/src/java/org/apache/hadoop/hive/ql/io/CombineHiveInputFormat.java < https://reviews.apache.org/r/898/#comment1837 > I think it would be better to use path.getParent().toString() instead of doing string manipulation. Ning On 2011-06-15 20:24:15, Yongqiang He wrote: ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/898/ ----------------------------------------------------------- (Updated 2011-06-15 20:24:15) Review request for hive. Summary ------- speedup addInputPaths This addresses bug HIVE-2218 . https://issues.apache.org/jira/browse/HIVE-2218 Diffs ----- trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/ExecDriver.java 1135335 trunk/ql/src/java/org/apache/hadoop/hive/ql/io/CombineHiveInputFormat.java 1135335 Diff: https://reviews.apache.org/r/898/diff Testing ------- yes. Thanks, Yongqiang
        He Yongqiang made changes -
        Attachment HIVE-2218.2.patch [ 12482715 ]
        Hide
        jiraposter@reviews.apache.org added a comment -

        -----------------------------------------------------------
        This is an automatically generated e-mail. To reply, visit:
        https://reviews.apache.org/r/898/
        -----------------------------------------------------------

        (Updated 2011-06-15 20:24:15.631412)

        Review request for hive.

        Changes
        -------

        address Ning's comments. Did the minimum change and the performance is acceptable. We can try to remove empty path check if in future we see the latency is not good.

        Summary
        -------

        speedup addInputPaths

        This addresses bug HIVE-2218.
        https://issues.apache.org/jira/browse/HIVE-2218

        Diffs (updated)


        trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/ExecDriver.java 1135335
        trunk/ql/src/java/org/apache/hadoop/hive/ql/io/CombineHiveInputFormat.java 1135335

        Diff: https://reviews.apache.org/r/898/diff

        Testing
        -------

        yes.

        Thanks,

        Yongqiang

        Show
        jiraposter@reviews.apache.org added a comment - ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/898/ ----------------------------------------------------------- (Updated 2011-06-15 20:24:15.631412) Review request for hive. Changes ------- address Ning's comments. Did the minimum change and the performance is acceptable. We can try to remove empty path check if in future we see the latency is not good. Summary ------- speedup addInputPaths This addresses bug HIVE-2218 . https://issues.apache.org/jira/browse/HIVE-2218 Diffs (updated) trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/ExecDriver.java 1135335 trunk/ql/src/java/org/apache/hadoop/hive/ql/io/CombineHiveInputFormat.java 1135335 Diff: https://reviews.apache.org/r/898/diff Testing ------- yes. Thanks, Yongqiang
        Hide
        jiraposter@reviews.apache.org added a comment -

        -----------------------------------------------------------
        This is an automatically generated e-mail. To reply, visit:
        https://reviews.apache.org/r/898/#review842
        -----------------------------------------------------------

        trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/ExecDriver.java
        <https://reviews.apache.org/r/898/#comment1826>

        This change will make the order of paths in pathProcessed non-deterministic. This means mapred.input.dir will have not have the same order as before. Not sure if it is safe or not, but if you change HashSet with LinkedHashSet, the order will be preserved.

        trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/ExecDriver.java
        <https://reviews.apache.org/r/898/#comment1828>

        Here needs some comments why this case doesn't need to check empty paths.

        In terms of efficiency, it seems to me that checking empty paths is not the most expensive part (# of RPCs is large but each listStatus() should be fast). Also we should be able to cache (needs to extend Utilities.isEmpty) the results of listStatus for each path, which are anyway needed in other operations (compute splits etc). If

        trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/ExecDriver.java
        <https://reviews.apache.org/r/898/#comment1829>

        It seem that we are doing redundant work as FileInputFormat.setInputPaths(JobConf, CommaSeparatedString). I think it would be safer and cleaner to first get an array of paths and call:

        FileInputFormat.setInputPaths(StringUtils.stringToPath(String[] paths))

        trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java
        <https://reviews.apache.org/r/898/#comment1830>

        indentation

        trunk/ql/src/java/org/apache/hadoop/hive/ql/io/CombineHiveInputFormat.java
        <https://reviews.apache.org/r/898/#comment1833>

        why do we need it here?

        trunk/ql/src/java/org/apache/hadoop/hive/ql/io/CombineHiveInputFormat.java
        <https://reviews.apache.org/r/898/#comment1834>

        ! mrwork.getPartDescToRework().isEmpty()

        trunk/ql/src/java/org/apache/hadoop/hive/ql/io/CombineHiveInputFormat.java
        <https://reviews.apache.org/r/898/#comment1835>

        The logic here is too complex and I think it's better to be refactored. Is the following what you wanted?

        if (all_partitions_are_rework())

        { prepareNullCombineFilter(combine); }

        else

        { prepareNormalCombineFilter(combine); }


        InputSplitShim[] iss = combine.getSplits()

        trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/MapredWork.java
        <https://reviews.apache.org/r/898/#comment1832>

        Does here just need a HashSet<PartitionDesc> rather than a HashMap<PartitionDesc, Boolean>?

        • Ning

        On 2011-06-14 21:09:13, Yongqiang He wrote:

        -----------------------------------------------------------

        This is an automatically generated e-mail. To reply, visit:

        https://reviews.apache.org/r/898/

        -----------------------------------------------------------

        (Updated 2011-06-14 21:09:13)

        Review request for hive.

        Summary

        -------

        speedup addInputPaths

        This addresses bug HIVE-2218.

        https://issues.apache.org/jira/browse/HIVE-2218

        Diffs

        -----

        trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/ExecDriver.java 1135335

        trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java 1135335

        trunk/ql/src/java/org/apache/hadoop/hive/ql/io/CombineHiveInputFormat.java 1135335

        trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/MapredWork.java 1135335

        Diff: https://reviews.apache.org/r/898/diff

        Testing

        -------

        yes.

        Thanks,

        Yongqiang

        Show
        jiraposter@reviews.apache.org added a comment - ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/898/#review842 ----------------------------------------------------------- trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/ExecDriver.java < https://reviews.apache.org/r/898/#comment1826 > This change will make the order of paths in pathProcessed non-deterministic. This means mapred.input.dir will have not have the same order as before. Not sure if it is safe or not, but if you change HashSet with LinkedHashSet, the order will be preserved. trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/ExecDriver.java < https://reviews.apache.org/r/898/#comment1828 > Here needs some comments why this case doesn't need to check empty paths. In terms of efficiency, it seems to me that checking empty paths is not the most expensive part (# of RPCs is large but each listStatus() should be fast). Also we should be able to cache (needs to extend Utilities.isEmpty) the results of listStatus for each path, which are anyway needed in other operations (compute splits etc). If trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/ExecDriver.java < https://reviews.apache.org/r/898/#comment1829 > It seem that we are doing redundant work as FileInputFormat.setInputPaths(JobConf, CommaSeparatedString). I think it would be safer and cleaner to first get an array of paths and call: FileInputFormat.setInputPaths(StringUtils.stringToPath(String[] paths)) trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java < https://reviews.apache.org/r/898/#comment1830 > indentation trunk/ql/src/java/org/apache/hadoop/hive/ql/io/CombineHiveInputFormat.java < https://reviews.apache.org/r/898/#comment1833 > why do we need it here? trunk/ql/src/java/org/apache/hadoop/hive/ql/io/CombineHiveInputFormat.java < https://reviews.apache.org/r/898/#comment1834 > ! mrwork.getPartDescToRework().isEmpty() trunk/ql/src/java/org/apache/hadoop/hive/ql/io/CombineHiveInputFormat.java < https://reviews.apache.org/r/898/#comment1835 > The logic here is too complex and I think it's better to be refactored. Is the following what you wanted? if (all_partitions_are_rework()) { prepareNullCombineFilter(combine); } else { prepareNormalCombineFilter(combine); } InputSplitShim[] iss = combine.getSplits() trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/MapredWork.java < https://reviews.apache.org/r/898/#comment1832 > Does here just need a HashSet<PartitionDesc> rather than a HashMap<PartitionDesc, Boolean>? Ning On 2011-06-14 21:09:13, Yongqiang He wrote: ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/898/ ----------------------------------------------------------- (Updated 2011-06-14 21:09:13) Review request for hive. Summary ------- speedup addInputPaths This addresses bug HIVE-2218 . https://issues.apache.org/jira/browse/HIVE-2218 Diffs ----- trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/ExecDriver.java 1135335 trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java 1135335 trunk/ql/src/java/org/apache/hadoop/hive/ql/io/CombineHiveInputFormat.java 1135335 trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/MapredWork.java 1135335 Diff: https://reviews.apache.org/r/898/diff Testing ------- yes. Thanks, Yongqiang
        Hide
        jiraposter@reviews.apache.org added a comment -

        -----------------------------------------------------------
        This is an automatically generated e-mail. To reply, visit:
        https://reviews.apache.org/r/898/
        -----------------------------------------------------------

        Review request for hive.

        Summary
        -------

        speedup addInputPaths

        This addresses bug HIVE-2218.
        https://issues.apache.org/jira/browse/HIVE-2218

        Diffs


        trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/ExecDriver.java 1135335
        trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java 1135335
        trunk/ql/src/java/org/apache/hadoop/hive/ql/io/CombineHiveInputFormat.java 1135335
        trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/MapredWork.java 1135335

        Diff: https://reviews.apache.org/r/898/diff

        Testing
        -------

        yes.

        Thanks,

        Yongqiang

        Show
        jiraposter@reviews.apache.org added a comment - ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/898/ ----------------------------------------------------------- Review request for hive. Summary ------- speedup addInputPaths This addresses bug HIVE-2218 . https://issues.apache.org/jira/browse/HIVE-2218 Diffs trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/ExecDriver.java 1135335 trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java 1135335 trunk/ql/src/java/org/apache/hadoop/hive/ql/io/CombineHiveInputFormat.java 1135335 trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/MapredWork.java 1135335 Diff: https://reviews.apache.org/r/898/diff Testing ------- yes. Thanks, Yongqiang
        Show
        He Yongqiang added a comment - https://reviews.apache.org/r/898/
        He Yongqiang made changes -
        Status Open [ 1 ] Patch Available [ 10002 ]
        He Yongqiang made changes -
        Field Original Value New Value
        Attachment HIVE-2218.1.patch [ 12482573 ]
        He Yongqiang created issue -

          People

          • Assignee:
            He Yongqiang
            Reporter:
            He Yongqiang
          • Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development