Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.4.0
    • Component/s: Query Processor
    • Labels:
      None

      Description

      It would be good to have a test mode in Hive. This would help in checking the validity of a Hive drop on a production cluster.

      The following would be good to have:

      Test mode: all input tables are sampled (if not already sampled) and all output tables are prefixed by a user-supplied name.
      This way, multiple Hive drops can be compared quickly for correctness.

      New Options:

      
      // whether hive is running in test mode. If yes, it turns on sampling and prefixes the output table name
      set hive.test.mode=true;
      // if hive is running in test mode, prefixes the output table name with this string
      set hive.test.mode.prefix=;
      // if hive is running in test mode and the table is not bucketed, sampling frequency
      set hive.test.mode.samplefreq=256;
      // if hive is running in test mode, don't sample the tables in this comma-separated list
      set hive.test.mode.nosamplelist=;
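
As a rough illustration of how these options interact, the following Python sketch models the two decisions test mode has to make: whether to sample an input table, and what to call an output table. It is not Hive's implementation. `ModeConf`, `output_table`, `should_sample` and the table names are hypothetical, and the sampling frequency and the handling of bucketed tables are left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModeConf:
    """Hypothetical holder for three of the hive.test.mode.* options."""
    test_mode: bool = False   # hive.test.mode
    prefix: str = ""          # hive.test.mode.prefix
    nosamplelist: str = ""    # hive.test.mode.nosamplelist (comma separated)


def output_table(conf, table):
    """Prefix the output table name only when test mode is on."""
    return conf.prefix + table if conf.test_mode else table


def should_sample(conf, table, already_sampled):
    """Sample an input table unless the query already samples it or it is excluded."""
    if not conf.test_mode or already_sampled:
        return False
    excluded = {name.strip() for name in conf.nosamplelist.split(",") if name.strip()}
    return table not in excluded


# Table names below are made up for the example.
conf = ModeConf(test_mode=True, prefix="test_", nosamplelist="dim_country, dim_gender")
print(output_table(conf, "daily_summary"))        # test_daily_summary
print(should_sample(conf, "fact_clicks", False))  # True
print(should_sample(conf, "dim_country", False))  # False
```

With test mode off, both helpers leave the query untouched, which is the property that lets the same pipeline run unchanged in production.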
      
      
      1. hive.518.7.patch
        20 kB
        Namit Jain
      2. hive.518.6.patch
        20 kB
        Namit Jain
      3. hive-518.5.patch
        19 kB
        Raghotham Murthy
      4. hive.518.4.patch
        18 kB
        Namit Jain
      5. hive.518.3.patch
        15 kB
        Namit Jain
      6. hive.518.2.patch
        14 kB
        Namit Jain
      7. hive.518.1.patch
        13 kB
        Namit Jain

        Activity

        Raghotham Murthy added a comment -

        Committed. Thanks Namit!

        Raghotham Murthy added a comment -

        running tests now.

        Namit Jain added a comment -

        resolved conflicts

        Raghotham Murthy added a comment -

        There is now a test failure because of a change in hive-511 which was committed after you created the patch. Can you regenerate the patch with the latest code?

        [junit] 09/05/27 18:25:14 INFO exec.FileSinkOperator: Moving tmp dir: /mnt/vol/devrs005.snc1/rmurthy/hive-committer/build/ql/tmp/_tmp.816501927.10000.insclause-0 to: /mnt/vol/devrs005.snc1/rmurthy/hive-committer/build/ql/tmp/816501927.10000.insclause-0
        [junit] diff -a -I (file:)\|(/tmp/.*) /mnt/vol/devrs005.snc1/rmurthy/hive-committer/build/ql/test/logs/clientpositive/input30.q.out /mnt/vol/devrs005.snc1/rmurthy/hive-committer/ql/src/test/results/clientpositive/input30.q.out
        [junit] 23c23
        [junit] < expr: (((hash(rand(UDFToLong(460476415))) & 2147483647) % 32) = 0)
        [junit] ---
        [junit] > expr: (((default_sample_hashfn(rand(UDFToLong(460476415))) & 2147483647) % 32) = 0)
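
The predicate in this plan keeps a row when its hash, masked with 2147483647 and taken modulo 32, equals 0. The mask clears the sign bit so the value is non-negative before the modulo, and roughly one row in 32 passes. The Python sketch below mirrors only that arithmetic. The random integers are a stand-in for `hash(rand(...))`, not Hive's hash function, and `keep_row` is a hypothetical name.

```python
import random

MASK = 2147483647  # 0x7FFFFFFF, clears the sign bit of a 32-bit value


def keep_row(hash_value, freq):
    """Mirror of ((hash & 2147483647) % freq) = 0 from the plan."""
    return (hash_value & MASK) % freq == 0


rng = random.Random(460476415)
# Stand-in for hash(rand(...)): random signed 32-bit integers.
hashes = [rng.getrandbits(32) - 2**31 for _ in range(100_000)]
kept = sum(keep_row(h, 32) for h in hashes)
print(f"kept {kept} of {len(hashes)} rows")
```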

        Raghotham Murthy added a comment -

        +1

        will commit once tests pass.

        Raghotham Murthy added a comment -

        Seeing some weird errors. input30.q, input31.q and input32.q individually succeed. But, when run along with other queries, it seems like the specific queries are not being run in test mode.

        Raghotham Murthy added a comment -

        +1

        looks good. will commit once tests pass.

        Raghotham Murthy added a comment -

        can you add a test for the unsampled tables feature?

        Namit Jain added a comment -

        incorporated comments

        Raghotham Murthy added a comment -

        one solution would be to provide another set option with the list of tables which should not be sampled in test mode.

        set hive.test.mode.unsampled.tables=table1,table2

        Another option might be to actually allow users to specify the entire tablesample clause for every table that needs to be sampled in test mode. But that seems like a lot more work for not a lot of additional benefit.

        Namit Jain added a comment -

        I agree with it - it will not lead to any problem since the join results will be empty in both the new and
        the old drop, but the whole purpose of testing may be lost.

        Hinting seems useless, because if the pipelines can be modified to add query level hints, the queries themselves
        can be modified.

        Via a configuration parameter, the list of tables can be specified and sampling may only be applicable to
        those tables. It will need the pipelines to be modified, or we can take a more aggressive approach and add
        sampling to all tables unless the user asks us not to do so. This way, only the offending pipelines (e.g.
        the one pointed out by Raghu) need to be modified.

        Raghotham Murthy added a comment -

        What happens if the production query has one sampled table joined against an unsampled table? A common example is a fact table sampled by user, joined with a dimension table on a dimension attribute like gender or country. By adding an arbitrary sample clause on the dimension table, the join result may be empty.
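
The concern can be reproduced in a small Python simulation. This is a sketch under assumptions: the tables, the 1-in-32 frequency and the helper names `sample` and `join_on_country` are invented for the example, and Python's `random` stands in for a rand()-based sample clause.

```python
import random

SAMPLE_FREQ = 32  # illustrative sampling frequency


def sample(rows, freq, rng):
    """Keep each row with probability 1/freq, like a rand()-based sample clause."""
    return [row for row in rows if rng.randrange(freq) == 0]


def join_on_country(facts, dims):
    """Inner join of fact rows to dimension rows on the country code."""
    names = {code: name for code, name in dims}
    return [(user, code, names[code]) for user, code in facts if code in names]


rng = random.Random(0)
countries = [("US", "United States"), ("IN", "India"), ("BR", "Brazil"), ("JP", "Japan")]
facts = [(f"user{i}", countries[i % 4][0]) for i in range(10_000)]

sampled_facts = sample(facts, SAMPLE_FREQ, rng)
sampled_dims = sample(countries, SAMPLE_FREQ, rng)

with_full_dim = join_on_country(sampled_facts, countries)
with_sampled_dim = join_on_country(sampled_facts, sampled_dims)
print(len(sampled_facts), len(with_full_dim), len(with_sampled_dim))
```

Every sampled fact row finds its match when the dimension table is left whole. A four-row dimension table sampled at 1 in 32 usually loses all of its rows, which takes the join result with it.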

        Zheng Shao added a comment -

        One additional comment: Can you use random(460476415) instead of random(1)? random(1) is likely to appear in user's query as well, which may make the query sampling non-uniform.

        This is a really simple change now, but might save a lot of time debugging in the future.
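
A small Python illustration of the seed collision follows. It assumes that two generators created with the same seed produce the same sequence, which holds for Python's `random.Random` and is used here only as an analogy for two `rand(1)` expressions in one query. The helper `rand_stream` and the 1-in-32 threshold are made up for the example.

```python
import random


def rand_stream(seed, n):
    """Return n pseudo-random floats from a generator seeded with `seed`."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(n)]


n = 10_000
user_values = rand_stream(1, n)          # the user's own rand(1) expression
same_seed = rand_stream(1, n)            # sampler that also uses seed 1
other_seed = rand_stream(460476415, n)   # sampler with an unlikely seed

# Keep roughly one row in 32 according to the sampler's value.
kept_same = [u for u, s in zip(user_values, same_seed) if s < 1 / 32]
kept_other = [u for u, s in zip(user_values, other_seed) if s < 1 / 32]

print(max(kept_same) < 1 / 32)  # True: every kept row has a small user value
print(len(kept_other), max(kept_other))
```

With a shared seed the sample is perfectly correlated with the user's expression, so the kept rows are not a uniform slice of the data. With a distinct seed the user's values in the sample spread across the whole range.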

        Namit Jain added a comment -

        incorporated comments

        Zheng Shao added a comment -

        1. Can you add a comment for SemanticAnalyzer.genSamplePredicate? Especially for the new planExpr parameter. It's not clear what this parameter means.

        2. Can you add comments on what we will do in case hive is running in test mode and table is bucketed?

        +<property>
        + <name>hive.test.mode.samplefreq</name>
        + <value>32</value>
        + <description>if hive is running in test mode and table is not bucketed, sampling frequency</description>
        +</property>


          People

          • Assignee:
            Namit Jain
            Reporter:
            Namit Jain
          • Votes:
            0
            Watchers:
            2
