HIVE-2282

Local mode needs to work well with block sampling

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.8.0
    • Component/s: Query Processor
    • Labels: None

      Description

      If block sampling is enabled and a large data set is sampled down to a small one, the query should run in local mode. Currently, local mode does not kick in for this case.
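A minimal sketch of the decision this issue asks for, assuming the sampled input size is estimated as the total input size scaled by the sampling percentage and then compared against a byte limit for local mode. The class name, method names, and the 128 MiB limit are illustrative assumptions, not the code from the patch.

```java
// Hypothetical sketch: none of these names come from the Hive source tree.
final class LocalModeSketch {

    /** Estimates the bytes read when block sampling keeps `percent` of `totalBytes`. */
    static long estimateSampledBytes(long totalBytes, double percent) {
        if (totalBytes < 0 || percent < 0.0 || percent > 100.0) {
            throw new IllegalArgumentException("invalid size or percentage");
        }
        // Round up so a tiny sample is never estimated as zero bytes.
        return (long) Math.ceil(totalBytes * (percent / 100.0));
    }

    /** Local mode is chosen when the estimated sampled input fits under the limit. */
    static boolean shouldRunLocally(long totalBytes, double percent, long maxLocalBytes) {
        return estimateSampledBytes(totalBytes, percent) <= maxLocalBytes;
    }

    public static void main(String[] args) {
        long tenGiB = 10L * 1024 * 1024 * 1024;
        long limit = 128L * 1024 * 1024; // assumed local-mode input limit

        // A 1 percent sample of 10 GiB is roughly 102 MiB, which is under the limit.
        System.out.println(shouldRunLocally(tenGiB, 1.0, limit));
        // Reading the whole 10 GiB exceeds the limit.
        System.out.println(shouldRunLocally(tenGiB, 100.0, limit));
    }
}
```

The point of the sketch is that the comparison uses the estimated sample size rather than the full table size, so a heavily sampled large table can still qualify for local mode.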

      1. HIVE-2282.4.patch.txt
        19 kB
        Kevin Wilfong
      2. HIVE-2282.3.patch.txt
        12 kB
        Kevin Wilfong
      3. HIVE-2282.2.patch.txt
        12 kB
        Kevin Wilfong
      4. HIVE-2282.1.patch.txt
        9 kB
        Kevin Wilfong

        Activity

        Transition                  Time In Source Status   Execution Times   Last Executer    Last Execution Date
        Open -> Patch Available     19h 34m                 1                 Kevin Wilfong    14/Jul/11 19:29
        Patch Available -> Open     22s                     1                 Kevin Wilfong    14/Jul/11 19:29
        Open -> In Progress         23s                     1                 Kevin Wilfong    14/Jul/11 19:29
        In Progress -> Resolved     29d 1h 28m              1                 Siying Dong      12/Aug/11 20:57
        Resolved -> Closed          126d 3h 58m             1                 Carl Steinbach   16/Dec/11 23:56
        Carl Steinbach made changes -
        Status Resolved [ 5 ] Closed [ 6 ]
        Carl Steinbach made changes -
        Fix Version/s 0.8.0 [ 12316178 ]
        Component/s Query Processor [ 12312586 ]
        Siying Dong made changes -
        Status In Progress [ 3 ] Resolved [ 5 ]
        Resolution Fixed [ 1 ]
        Siying Dong added a comment -

        Committed. Thanks Kevin!

        Siying Dong added a comment -

        I don't know why, but I ran the test suite twice and it failed both times. Can you rebase your code, run the whole test suite, and see whether all the tests pass? I'll try again too.

        Kevin Wilfong made changes -
        Attachment HIVE-2282.4.patch.txt [ 12487457 ]
        jiraposter@reviews.apache.org added a comment -

        -----------------------------------------------------------
        This is an automatically generated e-mail. To reply, visit:
        https://reviews.apache.org/r/1132/
        -----------------------------------------------------------

        (Updated 2011-07-22 17:40:44.736466)

        Review request for hive and Siying Dong.

        Changes
        -------

        I added the q.out file which I had forgotten for the new q file.

        I also modified the test queries to select count(1) instead of selecting keys and values.

        Summary
        -------

        A query should run in local mode when block sampling is used and the sample is small enough. The size of the sample is currently being estimated, as it is done to estimate the number of reducers.

        This addresses bug HIVE-2282.
        https://issues.apache.org/jira/browse/HIVE-2282

        Diffs (updated)


        ql/src/java/org/apache/hadoop/hive/ql/exec/MapRedTask.java 53769a0
        ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java cd3de76
        ql/src/test/org/apache/hadoop/hive/ql/hooks/VerifyIsLocalModeHook.java PRE-CREATION
        ql/src/test/queries/clientpositive/sample_islocalmode_hook.q PRE-CREATION
        ql/src/test/results/clientpositive/sample_islocalmode_hook.q.out PRE-CREATION

        Diff: https://reviews.apache.org/r/1132/diff

        Testing
        -------

        TestCliDriver TestNegativeCliDriver, manually tested

        Thanks,

        Kevin

        Siying Dong added a comment -

        Also, a query like "select key, value from sih_src tablesample(1 percent)" doesn't generate a stable result. You can use select count(1) instead; that will generate correct results.

        Siying Dong added a comment -

        Kevin, you forgot to add the file ql/src/test/results/clientpositive/sample_islocalmode_hook.q.out to the patch.

        Siying Dong added a comment -

        +1, will commit after testing.

        jiraposter@reviews.apache.org added a comment -

        -----------------------------------------------------------
        This is an automatically generated e-mail. To reply, visit:
        https://reviews.apache.org/r/1132/
        -----------------------------------------------------------

        (Updated 2011-07-15 21:45:16.168124)

        Review request for hive and Siying Dong.

        Changes
        -------

        That's a good point, sorry I misunderstood it originally.

        Renamed estimateSampledInputSize to estimateInputSize.

        Summary
        -------

        A query should run in local mode when block sampling is used and the sample is small enough. The size of the sample is currently being estimated, as it is done to estimate the number of reducers.

        This addresses bug HIVE-2282.
        https://issues.apache.org/jira/browse/HIVE-2282

        Diffs (updated)


        ql/src/java/org/apache/hadoop/hive/ql/exec/MapRedTask.java 53769a0
        ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java cd3de76
        ql/src/test/org/apache/hadoop/hive/ql/hooks/VerifyIsLocalModeHook.java PRE-CREATION
        ql/src/test/queries/clientpositive/sample_islocalmode_hook.q PRE-CREATION

        Diff: https://reviews.apache.org/r/1132/diff

        Testing
        -------

        TestCliDriver TestNegativeCliDriver, manually tested

        Thanks,

        Kevin

        Kevin Wilfong made changes -
        Attachment HIVE-2282.3.patch.txt [ 12486688 ]
        jiraposter@reviews.apache.org added a comment -

        -----------------------------------------------------------
        This is an automatically generated e-mail. To reply, visit:
        https://reviews.apache.org/r/1132/#review1084
        -----------------------------------------------------------

        I mean you can just change the function name to something like estimateInputSize().

        - Siying

        On 2011-07-15 20:48:38, Kevin Wilfong wrote:

        -----------------------------------------------------------
        This is an automatically generated e-mail. To reply, visit:
        https://reviews.apache.org/r/1132/
        -----------------------------------------------------------

        (Updated 2011-07-15 20:48:38)

        Review request for hive and Siying Dong.

        Summary
        -------

        A query should run in local mode when block sampling is used and the sample is small enough. The size of the sample is currently being estimated, as it is done to estimate the number of reducers.

        This addresses bug HIVE-2282.
        https://issues.apache.org/jira/browse/HIVE-2282

        Diffs
        -----

        ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java cd3de76
        ql/src/test/org/apache/hadoop/hive/ql/hooks/VerifyIsLocalModeHook.java PRE-CREATION
        ql/src/test/queries/clientpositive/sample_islocalmode_hook.q PRE-CREATION
        ql/src/java/org/apache/hadoop/hive/ql/exec/MapRedTask.java 53769a0

        Diff: https://reviews.apache.org/r/1132/diff

        Testing
        -------

        TestCliDriver TestNegativeCliDriver, manually tested

        Thanks,

        Kevin

        Kevin Wilfong made changes -
        Attachment HIVE-2282.2.patch.txt [ 12486679 ]
        Kevin Wilfong made changes -
        Attachment HIVE-2282.2.patch.txt [ 12486680 ]
        Kevin Wilfong made changes -
        Attachment HIVE-2282.2.patch.txt [ 12486679 ]
        jiraposter@reviews.apache.org added a comment -

        -----------------------------------------------------------
        This is an automatically generated e-mail. To reply, visit:
        https://reviews.apache.org/r/1132/
        -----------------------------------------------------------

        (Updated 2011-07-15 20:48:38.625544)

        Review request for hive and Siying Dong.

        Changes
        -------

        I added comments to the estimateSampledInputSize function. This function does set the input size even if there is no sampling, but this means that we do not need to create two cases everywhere we might need to use an estimated input size or an actual input size. Instead, we can just run the function (which only does significant work the first time it is run thanks to a boolean flag) and the input size will be set to the appropriate values. It only estimates the input size if sampling is used.

        I also added the header to VerifyIsLocalModeHook.java
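The behaviour described above for the estimation function (always set the input size, scale it only when sampling is used, and do the real work only on the first call thanks to a boolean flag) can be sketched roughly as follows. This is an illustrative assumption, not the code in MapRedTask.java: the class, its fields, and the `LongSupplier` standing in for the expensive size computation are all hypothetical.

```java
import java.util.function.LongSupplier;

// Hypothetical sketch of a compute-once input size estimate.
final class InputSizeEstimator {
    private final LongSupplier totalBytesSource; // stands in for the expensive size computation
    private final Double samplePercent;          // null when the query uses no block sampling
    private boolean computed = false;            // the boolean flag guarding repeated work
    private long inputBytes;

    InputSizeEstimator(LongSupplier totalBytesSource, Double samplePercent) {
        this.totalBytesSource = totalBytesSource;
        this.samplePercent = samplePercent;
    }

    /** Sets the input size once; later calls return without recomputing. */
    void estimateInputSize() {
        if (computed) {
            return;
        }
        long total = totalBytesSource.getAsLong();
        // Without sampling the actual size is used; with sampling it is scaled down.
        inputBytes = (samplePercent == null)
                ? total
                : (long) Math.ceil(total * (samplePercent / 100.0));
        computed = true;
    }

    /** Callers never need separate code paths for sampled and unsampled input. */
    long getInputBytes() {
        estimateInputSize();
        return inputBytes;
    }

    public static void main(String[] args) {
        int[] calls = {0};
        InputSizeEstimator sampled =
                new InputSizeEstimator(() -> { calls[0]++; return 4000L; }, 25.0);
        System.out.println(sampled.getInputBytes());
        System.out.println(sampled.getInputBytes());
        System.out.println("size computations: " + calls[0]);
    }
}
```

Because callers always go through one method, the reducer estimate and the local-mode check can share the same value without each having to handle the sampled and unsampled cases separately.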

        Summary
        -------

        A query should run in local mode when block sampling is used and the sample is small enough. The size of the sample is currently being estimated, as it is done to estimate the number of reducers.

        This addresses bug HIVE-2282.
        https://issues.apache.org/jira/browse/HIVE-2282

        Diffs (updated)


        ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java cd3de76
        ql/src/test/org/apache/hadoop/hive/ql/hooks/VerifyIsLocalModeHook.java PRE-CREATION
        ql/src/test/queries/clientpositive/sample_islocalmode_hook.q PRE-CREATION
        ql/src/java/org/apache/hadoop/hive/ql/exec/MapRedTask.java 53769a0

        Diff: https://reviews.apache.org/r/1132/diff

        Testing
        -------

        TestCliDriver TestNegativeCliDriver, manually tested

        Thanks,

        Kevin

        jiraposter@reviews.apache.org added a comment -

        -----------------------------------------------------------
        This is an automatically generated e-mail. To reply, visit:
        https://reviews.apache.org/r/1132/#review1081
        -----------------------------------------------------------

        ql/src/test/org/apache/hadoop/hive/ql/hooks/VerifyIsLocalModeHook.java
        <https://reviews.apache.org/r/1132/#comment2210>

        We need a header for licensing.

        - Siying

        On 2011-07-15 02:16:34, Kevin Wilfong wrote:

        -----------------------------------------------------------
        This is an automatically generated e-mail. To reply, visit:
        https://reviews.apache.org/r/1132/
        -----------------------------------------------------------

        (Updated 2011-07-15 02:16:34)

        Review request for hive and Siying Dong.

        Summary
        -------

        A query should run in local mode when block sampling is used and the sample is small enough. The size of the sample is currently being estimated, as it is done to estimate the number of reducers.

        This addresses bug HIVE-2282.
        https://issues.apache.org/jira/browse/HIVE-2282

        Diffs
        -----

        ql/src/test/queries/clientpositive/sample_islocalmode_hook.q PRE-CREATION
        ql/src/java/org/apache/hadoop/hive/ql/exec/MapRedTask.java 53769a0
        ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java cd3de76
        ql/src/test/org/apache/hadoop/hive/ql/hooks/VerifyIsLocalModeHook.java PRE-CREATION

        Diff: https://reviews.apache.org/r/1132/diff

        Testing
        -------

        TestCliDriver TestNegativeCliDriver, manually tested

        Thanks,

        Kevin

        jiraposter@reviews.apache.org added a comment -

        -----------------------------------------------------------
        This is an automatically generated e-mail. To reply, visit:
        https://reviews.apache.org/r/1132/#review1080
        -----------------------------------------------------------

        ql/src/java/org/apache/hadoop/hive/ql/exec/MapRedTask.java
        <https://reviews.apache.org/r/1132/#comment2209>

        This function name seems to be confusing. Looks like the input size is set even if there is no sampling, right? Also, can you add comments to this function?

        Other than that, the patch looks OK.

        - Siying

        On 2011-07-15 02:16:34, Kevin Wilfong wrote:

        -----------------------------------------------------------
        This is an automatically generated e-mail. To reply, visit:
        https://reviews.apache.org/r/1132/
        -----------------------------------------------------------

        (Updated 2011-07-15 02:16:34)

        Review request for hive and Siying Dong.

        Summary
        -------

        A query should run in local mode when block sampling is used and the sample is small enough. The size of the sample is currently being estimated, as it is done to estimate the number of reducers.

        This addresses bug HIVE-2282.
        https://issues.apache.org/jira/browse/HIVE-2282

        Diffs
        -----

        ql/src/test/queries/clientpositive/sample_islocalmode_hook.q PRE-CREATION
        ql/src/java/org/apache/hadoop/hive/ql/exec/MapRedTask.java 53769a0
        ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java cd3de76
        ql/src/test/org/apache/hadoop/hive/ql/hooks/VerifyIsLocalModeHook.java PRE-CREATION

        Diff: https://reviews.apache.org/r/1132/diff

        Testing
        -------

        TestCliDriver TestNegativeCliDriver, manually tested

        Thanks,

        Kevin

        jiraposter@reviews.apache.org added a comment -

        -----------------------------------------------------------
        This is an automatically generated e-mail. To reply, visit:
        https://reviews.apache.org/r/1132/
        -----------------------------------------------------------

        Review request for hive and Siying Dong.

        Summary
        -------

        A query should run in local mode when block sampling is used and the sample is small enough. The size of the sample is currently being estimated, as it is done to estimate the number of reducers.

        This addresses bug HIVE-2282.
        https://issues.apache.org/jira/browse/HIVE-2282

        Diffs


        ql/src/test/queries/clientpositive/sample_islocalmode_hook.q PRE-CREATION
        ql/src/java/org/apache/hadoop/hive/ql/exec/MapRedTask.java 53769a0
        ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java cd3de76
        ql/src/test/org/apache/hadoop/hive/ql/hooks/VerifyIsLocalModeHook.java PRE-CREATION

        Diff: https://reviews.apache.org/r/1132/diff

        Testing
        -------

        TestCliDriver TestNegativeCliDriver, manually tested

        Thanks,

        Kevin

        Kevin Wilfong added a comment - https://reviews.apache.org/r/1132/
        Kevin Wilfong made changes -
        Attachment HIVE-2282.1.patch.txt [ 12486484 ]
        Kevin Wilfong made changes -
        Attachment HIVE-2282.1.patch.txt [ 12486485 ]
        Kevin Wilfong made changes -
        Attachment HIVE-2282.1.patch.txt [ 12486484 ]
        Kevin Wilfong made changes -
        Status Open [ 1 ] In Progress [ 3 ]
        Kevin Wilfong made changes -
        Status Patch Available [ 10002 ] Open [ 1 ]
        Kevin Wilfong made changes -
        Status Open [ 1 ] Patch Available [ 10002 ]
        John Sichi made changes -
        Field Original Value New Value
        Assignee Kevin Wilfong [ kevinwilfong ]
        Siying Dong created issue -

          People

          • Assignee:
            Kevin Wilfong
            Reporter:
            Siying Dong
          • Votes:
            0
            Watchers:
            0
