Hive
  1. Hive
  2. HIVE-976

test outputs should compare the file/directory for sampling

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.5.0
    • Component/s: Testing Infrastructure
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      Currently, all lines starting with file: are ignored.

      It means that

      file:/Users/heyongqiang/Documents/workspace/Hive-Test/build/ql/test/data/warehouse/srcbucket/srcbucket0.txt [s]

      and

      file:/Users/heyongqiang/Documents/workspace/Hive-Test/build/ql/test/data/warehouse/srcbucket [s]

      are same - that is not good because it will hide some of the optimizations of sampling.
      This should be changed to compare the last token.

      1. HIVE-976.2.patch
        630 kB
        John Sichi
      2. HIVE-976.patch
        617 kB
        John Sichi

        Activity

        Hide
        Zheng Shao added a comment -

        I am seeing sample6.q and sample7.q test failures due to this.
        When running them by themselves, they are working correctly. But when running all tests together, they are failure because somehow they took the whole srcbucket as input while table sample should really take only srcbucket0.txt.

        Show
        Zheng Shao added a comment - I am seeing sample6.q and sample7.q test failures due to this. When running them by themselves, they are working correctly. But when running all tests together, they are failure because somehow they took the whole srcbucket as input while table sample should really take only srcbucket0.txt.
        Hide
        Paul Yang added a comment -

        I'm looking at sample7.q.out, which has the lines that are quoted in the description. As I understand it, the test should check lines of the form 'file:/[directory]/[file]' and verify that the file part (or the last token) is the same as those in the reference test files. But aren't there lines in the test output that begin with 'file:/' that should be ignored completely?

        Anyway, assuming that we can look at just the last token of lines that begin with 'file:/', it doesn't seem like there is a way to use only diff to handle this case. The man page for diff does not describe any useful options. I have two ideas

        1. Instead of using diff directly, write a script that uses diff + some additional logic to detect this condition. Probably will be a little slower.

        2. in mapredWork, add an additional function that would display the last token of the paths. Sort of an ugly idea, but would work with diff in the current form.

        Show
        Paul Yang added a comment - I'm looking at sample7.q.out, which has the lines that are quoted in the description. As I understand it, the test should check lines of the form 'file:/ [directory] / [file] ' and verify that the file part (or the last token) is the same as those in the reference test files. But aren't there lines in the test output that begin with 'file:/' that should be ignored completely? Anyway, assuming that we can look at just the last token of lines that begin with 'file:/', it doesn't seem like there is a way to use only diff to handle this case. The man page for diff does not describe any useful options. I have two ideas 1. Instead of using diff directly, write a script that uses diff + some additional logic to detect this condition. Probably will be a little slower. 2. in mapredWork, add an additional function that would display the last token of the paths. Sort of an ugly idea, but would work with diff in the current form.
        Hide
        Namit Jain added a comment -

        I would go with 2. - tests take a long time anyway, slowing them down further will be difficult.

        Show
        Namit Jain added a comment - I would go with 2. - tests take a long time anyway, slowing them down further will be difficult.
        Hide
        John Sichi added a comment -

        I have implemented 2 and am currently running through tests; will submit a patch with updated logs once they pass.

        I added a new derived attribute (normalExplain=false) "base file name" on partitionDesc, and since the whole point is to avoid filtering this information out, I think there will be a lot of log updates required as part of the patch (unless there's a way to make it selective for only tests that need it).

        Show
        John Sichi added a comment - I have implemented 2 and am currently running through tests; will submit a patch with updated logs once they pass. I added a new derived attribute (normalExplain=false) "base file name" on partitionDesc, and since the whole point is to avoid filtering this information out, I think there will be a lot of log updates required as part of the patch (unless there's a way to make it selective for only tests that need it).
        Hide
        John Sichi added a comment -

        Code changes together with log updates, for review by Namit. I had to add ORDER BY to a number of existing .q files due to non-determinism in the expected results.

        Show
        John Sichi added a comment - Code changes together with log updates, for review by Namit. I had to add ORDER BY to a number of existing .q files due to non-determinism in the expected results.
        Hide
        Namit Jain added a comment -

        1. Make the new field in partitionDesc transient.
        2. Are you getting diff in input40.q - if not, remove the order by - it will change the query plan and I think the test was added to run without a map-reduce job.
        3. sample6.q - change the explain plan statement also along with the query in all cases.

        Show
        Namit Jain added a comment - 1. Make the new field in partitionDesc transient. 2. Are you getting diff in input40.q - if not, remove the order by - it will change the query plan and I think the test was added to run without a map-reduce job. 3. sample6.q - change the explain plan statement also along with the query in all cases.
        Hide
        John Sichi added a comment -

        Regarding input40.q: yes, I get a diff because the results from this query come out in a different order:

        select * from tmp_insert_test_p where ds = '2009-08-01';

        A big chunk of rows moves in the output. Diff-based testing doesn't really work if it's not possible to add order by uniformly, does it?

        Show
        John Sichi added a comment - Regarding input40.q: yes, I get a diff because the results from this query come out in a different order: select * from tmp_insert_test_p where ds = '2009-08-01'; A big chunk of rows moves in the output. Diff-based testing doesn't really work if it's not possible to add order by uniformly, does it?
        Hide
        Namit Jain added a comment -

        I agree - feel free to add order by in input40.q

        Show
        Namit Jain added a comment - I agree - feel free to add order by in input40.q
        Hide
        John Sichi added a comment -

        Here's the updated patch which addresses Namit's comments (1) and (3).

        Show
        John Sichi added a comment - Here's the updated patch which addresses Namit's comments (1) and (3).
        Hide
        Namit Jain added a comment -

        +1

        looks good - will commit if the tests pass

        Show
        Namit Jain added a comment - +1 looks good - will commit if the tests pass
        Hide
        Namit Jain added a comment -

        Committed. Thanks John

        Show
        Namit Jain added a comment - Committed. Thanks John

          People

          • Assignee:
            John Sichi
            Reporter:
            Namit Jain
          • Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development