Hive / HIVE-1131

Add column lineage information to the pre execution hooks

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.6.0
    • Component/s: Query Processor
    • Labels:
      None
    • Hadoop Flags:
      Incompatible change, Reviewed
    • Release Note:
      This changes the signature of PostExecute.java

      Description

      We need a mechanism to pass the lineage information of the various columns of a table to a pre execution hook so that applications can use that for:

      • auditing
      • dependency checking

      and many other applications.

      The proposal is to expose this information to clients through a set of classes passed to the pre execution hook interface, and to add the necessary transformation logic to the optimizer to generate it.

      1. hive.1131.9.patch
        2.16 MB
        Namit Jain
      2. HIVE-1131_2.patch
        87 kB
        Ashish Thusoo
      3. HIVE-1131_3.patch
        87 kB
        Ashish Thusoo
      4. HIVE-1131_4.patch
        2.13 MB
        Ashish Thusoo
      5. HIVE-1131_5.patch
        2.20 MB
        Ashish Thusoo
      6. HIVE-1131_6.patch
        2.17 MB
        Ashish Thusoo
      7. HIVE-1131_7.patch
        2.18 MB
        Ashish Thusoo
      8. HIVE-1131_8.patch
        2.18 MB
        Ashish Thusoo
      9. HIVE-1131.patch
        92 kB
        Ashish Thusoo

        Activity

        Namit Jain added a comment -

        Committed. Thanks Ashish

        Namit Jain added a comment -

        +1

        looks good, running tests again, will merge if the tests pass

        Namit Jain added a comment -

        uploaded a new patch with updated test results

        Ashish Thusoo added a comment -

        Another one with test fixes.

        Zheng Shao added a comment -

        Still seeing test failures from HIVE-1131_7.patch

        .ptest_0/test.17.2.1.log:    [junit] Begin query: groupby8.q
        .ptest_0/test.17.2.1.log:    [junit] junit.framework.AssertionFailedError: Client execution results failed with error code = 1
        --
        .ptest_1/test.17.2.1.log:    [junit] Begin query: groupby8_map_skew.q
        .ptest_1/test.17.2.1.log:    [junit] junit.framework.AssertionFailedError: Client execution results failed with error code = 1
        --
        .ptest_1/test.17.2.1.log:    [junit] Begin query: multi_insert.q
        .ptest_1/test.17.2.1.log:    [junit] junit.framework.AssertionFailedError: Client execution results failed with error code = 1
        --
        .ptest_1/test.17.2.1.log:    [junit] Begin query: reduce_deduplicate.q
        .ptest_1/test.17.2.1.log:    [junit] junit.framework.AssertionFailedError: Client execution results failed with error code = 1
        --
        .ptest_1/test.17.2.1.log:    [junit] Begin query: union18.q
        .ptest_1/test.17.2.1.log:    [junit] junit.framework.AssertionFailedError: Client execution results failed with error code = 1
        --
        .ptest_2/test.17.2.1.log:    [junit] Begin query: groupby7.q
        .ptest_2/test.17.2.1.log:    [junit] junit.framework.AssertionFailedError: Client execution results failed with error code = 1
        --
        .ptest_2/test.17.2.1.log:    [junit] Begin query: groupby8_noskew.q
        .ptest_2/test.17.2.1.log:    [junit] junit.framework.AssertionFailedError: Client execution results failed with error code = 1
        --
        .ptest_2/test.17.2.1.log:    [junit] Begin query: input12.q
        .ptest_2/test.17.2.1.log:    [junit] junit.framework.AssertionFailedError: Client execution results failed with error code = 1
        --
        
        Ashish Thusoo added a comment -

        submitting.

        Ashish Thusoo added a comment -

        Another patch which fixes the QueryPlan to have LinkedHashMaps as that was also creating instability in the tests.
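The ordering instability mentioned here comes from the fact that `HashMap` makes no guarantee about iteration order, while `LinkedHashMap` iterates in insertion order. A minimal sketch of the difference that matters for test output follows; the class name `OrderedLineageOutputSketch` and the sample keys are hypothetical and are not taken from the patch.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration; not code from the HIVE-1131 patch. */
class OrderedLineageOutputSketch {

  /** Returns the keys in the order a LinkedHashMap iterates over them. */
  static List<String> keysInIterationOrder() {
    // LinkedHashMap iterates in insertion order, so anything printed from
    // this map (for example hook output compared against .q.out files)
    // comes out in the same order on every run.
    Map<String, String> lineage = new LinkedHashMap<>();
    lineage.put("dest1.value", "src.value");
    lineage.put("dest1.key", "src.key");
    lineage.put("dest2.key", "src.key");
    return new ArrayList<>(lineage.keySet());
  }

  public static void main(String[] args) {
    // With a plain HashMap the order below would depend on hash values
    // and table capacity rather than on insertion order.
    for (String column : keysInIterationOrder()) {
      System.out.println(column);
    }
  }
}
```

Switching the map type changes only iteration order, not lookup behavior, which is why it is a low-risk way to stabilize golden-file tests.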

        Ashish Thusoo added a comment -

        With fixes to tests and with null dropped.

        Zheng Shao added a comment -

        The following tests failed. Mostly because the order of Can you take a look?
        Also, it will be great to get rid of the "null" after EXPRESSION in the following example.

        groupby11.q
        groupby7_map_skew.q
        input13.q
        script_pipe.q
        groupby9.q
        multi_insert.q
        union17.q
        
        example:
            [junit] diff -a -I file: -I /tmp/ -I invalidscheme: -I lastUpdateTime -I lastAccessTime -I owner -I transient_lastDdlTime\
         -I java.lang.RuntimeException -I at org -I at sun -I at java -I at junit -I Caused by: -I [.][.][.] [0-9]* more /data/users/\
        zshao/hadoop_hive_trunk/.ptest_1/build/ql/test/logs/clientpositive/groupby9.q.out /data/users/zshao/hadoop_hive_trunk/.ptest_\
        1/ql/src/test/results/clientpositive/groupby9.q.out
            [junit] 238,239d237
            [junit] < POSTHOOK: Lineage: dest1.key EXPRESSION null[(src)src.FieldSchema(name:key, type:string, comment:default), ]
            [junit] < POSTHOOK: Lineage: dest1.value EXPRESSION null[(src)src.FieldSchema(name:value, type:string, comment:default), \
        ]
            [junit] 242a241,242
            [junit] > POSTHOOK: Lineage: dest1.key EXPRESSION null[(src)src.FieldSchema(name:key, type:string, comment:default), ]
            [junit] > POSTHOOK: Lineage: dest1.value EXPRESSION null[(src)src.FieldSchema(name:value, type:string, comment:default), \
        ]
        
        
        Ashish Thusoo added a comment -

        Added a more centralized function to decide the dependency type. Also reduced the number of dependency types to SIMPLE, EXPRESSION and SCRIPT. SIMPLE = a copy of the column, EXPRESSION = UDF, UDAF, UDTF or union all, SCRIPT = if a user script is used.

        Also changed the HashMap to a LinkedHashMap.
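As a rough illustration of what a centralized function for deciding the dependency type might look like, here is a minimal Java sketch. The class name `DependencyTypeSketch`, the method `combine`, and the precedence rule (SCRIPT over EXPRESSION over SIMPLE) are assumptions made for this example; the patch may resolve nested dependencies differently.

```java
/** Hypothetical sketch; names and the precedence rule are assumptions. */
class DependencyTypeSketch {

  /** The three dependency types, declared from weakest to strongest. */
  enum DependencyType {
    SIMPLE,      // the column is a plain copy of a source column
    EXPRESSION,  // the column passes through a UDF, UDAF, UDTF or union all
    SCRIPT       // the column passes through a user script
  }

  /**
   * Combines the type accumulated so far with the type contributed by the
   * current operator. Assumed rule: the stronger of the two types wins,
   * where strength follows the enum declaration order.
   */
  static DependencyType combine(DependencyType soFar, DependencyType current) {
    if (soFar == null) {
      return current; // nothing accumulated yet
    }
    return soFar.compareTo(current) >= 0 ? soFar : current;
  }

  public static void main(String[] args) {
    // A copy followed by a UDF is an EXPRESSION under the assumed rule.
    System.out.println(combine(DependencyType.SIMPLE, DependencyType.EXPRESSION));
    // Once a script is involved, later simple copies do not weaken the type.
    System.out.println(combine(DependencyType.SCRIPT, DependencyType.SIMPLE));
  }
}
```

Under this assumed rule, nested cases such as round(sum(col)) and sum(round(col)) both resolve to EXPRESSION, because every step contributes EXPRESSION.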

        Ashish Thusoo added a comment -

        I looked at the ExecutionCtx stuff. There are at least 3 different unrelated fields in SessionState that we should also move to the ExecutionCtx. I will file a follow up JIRA for it but I think we should get this one in. I did see some test failures due to using HashMaps and the consequent change in ordering after I refreshed. Will fix that and upload a new patch.

        Zheng Shao added a comment -

        > Look at the DataContainer class. That has a partition in it. And the Dependency has a mapping from Partition to the dependencies. Can you explain more your concerns on inefficiency?

        I see. So the DataContainer captures the output partition information, but we don't have input partition information (BaseColumnInfo/TableAliasInfo). This is reasonable since the input can be lots of partitions.

        > For S6 actually the queryplan is the wrong place to store the lineageinfo. Because of the dynamic partitioning work that Ning is doing, I have to generate the partition to dependency mapping at run time. So I would rather store it in a run time structure as opposed to a compile time structure. SessionState fits that bill, though I think we should have another structure called ExecutionCtx for this. But otherwise I think we want to store this in a runtime structure.

        +1 on the ExecutionCtx idea. SessionState is at the session level, and LineageInfo is at the query level. It will be great to put LineageInfo into ExecutionCtx.

        Ashish Thusoo added a comment -

        Look at the DataContainer class. That has a partition in it. And the Dependency has a mapping from Partition to the dependencies. Can you explain more your concerns on inefficiency?

        For S6 actually the queryplan is the wrong place to store the lineageinfo. Because of the dynamic partitioning work that Ning is doing, I have to generate the partition to dependency mapping at run time. So I would rather store it in a run time structure as opposed to a compile time structure. SessionState fits that bill, though I think we should have another structure called ExecutionCtx for this. But otherwise I think we want to store this in a runtime structure.

        For S2, I will add some more comments.

        Zheng Shao added a comment -

        > S1. Can we make lineage partition-level instead of table-level?
        I don't see this implemented in the new patch. After looking at the code more, I'd agree that this is too hard (and inefficient) to do, when the query has a range over a lot of partitions.

        > S3. Use "{}" even for single statement in "if", "for" etc.
        I cannot find any instances of these now.

        Still have some questions:
        > S2. We might want to define formally the concepts of these levels, especially how they are composited (What will be UDAF of UDF, or UDF of UDAF, like round(sum(col)), or sum(round(col)))
        LineageInfo.java: Can you add some comments on what DependencyType the nested dependencies like "round(sum(col))" or "sum(round(col))" have?

        S6. The best place to store LineageInfo is probably in the QueryPlan instead of SessionState. Otherwise the LineageInfo will be lost when we run a query that is compiled earlier. Thoughts?

        Ashish Thusoo added a comment -

        This patch has all the tests updated as well.

        Ashish Thusoo added a comment -

        Also I did not find any instance of S3 in the code. Perhaps you just mentioned it for completeness but in case you do find an instance please let me know the offending file.

        Ashish Thusoo added a comment -

        This fixes all the review comments. Will post the patch with tests separately.

        Ashish Thusoo added a comment -

        Comment 3 from Raghu and comment S2-S4 from Zheng are not yet incorporated.

        The new patch overhauls things a bit to support Partition level lineage and does this in a post execute hook. It gets rid of the visits and the iterator classes. Will fix the other comments in the patch with the test cases.

        Ashish Thusoo added a comment -

        Patch with all the review comments incorporated. This is just the source patch. Will be uploading the fixed tests shortly.

        Zheng Shao added a comment -

        S1. Can we make lineage partition-level instead of table-level?
        S2. We might want to define formally the concepts of these levels, especially how they are composited (What will be UDAF of UDF, or UDF of UDAF, like round(sum(col)), or sum(round(col)))

        +  /**
        +   * Enum to track dependency. This enum has two values:
        +   * 1. SCALAR - Indicates that the column is derived from a scalar expression.
        +   * 2. AGGREGATION - Indicates that the column is derived from an aggregation.
        +   */
        +  public static enum DependencyType {
        +    SIMPLE, UDF, UDAF, UDTF, SCRIPT, SET
        +  }
        +  
        

        S3. Use "{}" even for single statement in "if", "for" etc.
        S4. Use "ArrayList" instead of "Vector" when it's accessed by a single thread.
        S5. Remove "private HashMap<FileSinkOperator, Table> fopToTable;" since it's not used.

        Raghotham Murthy added a comment -

        Went over code with Ashish. A few things:

        1. The hash<key1, hash<key2, value>> paradigm can be changed to hash<pair<key1,key2>, value>. That will reduce the amount of code needed. For example, there is no need for special iterator and item classes.
        2. Code which records visits to nodes can be removed
        3. PreOrderWalker.java does not have any change
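Point 1 above can be sketched as follows. This is a minimal example that uses `java.util.AbstractMap.SimpleImmutableEntry` as the pair type because it already implements `equals` and `hashCode`; the class name `PairKeyMapSketch`, the `String` key and value types, and the table and column names are hypothetical and do not come from the patch.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of a map keyed by a (key1, key2) pair. */
class PairKeyMapSketch {

  // One flat map replaces Map<String, Map<String, String>>. Callers no longer
  // need to create inner maps on demand or walk two levels to iterate.
  private final Map<SimpleImmutableEntry<String, String>, String> deps = new HashMap<>();

  /** Stores the dependency for the given (table, column) pair. */
  void put(String table, String column, String dependency) {
    deps.put(new SimpleImmutableEntry<>(table, column), dependency);
  }

  /** Returns the stored dependency, or null if the pair is unknown. */
  String get(String table, String column) {
    return deps.get(new SimpleImmutableEntry<>(table, column));
  }

  /** Number of (table, column) pairs stored. */
  int size() {
    return deps.size();
  }

  public static void main(String[] args) {
    PairKeyMapSketch map = new PairKeyMapSketch();
    map.put("dest1", "key", "src.key");
    map.put("dest1", "value", "src.value");
    System.out.println(map.get("dest1", "key"));
  }
}
```

Iterating over the flat map's `entrySet()` visits every pair directly, which is the reason separate iterator and item classes become unnecessary.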

        Zheng Shao added a comment -

        1. LineageInfo and related classes (that are used in PreExecutionHook/PostExecutionHook) need to implement Serializable so that we can serialize out the whole execution plan (including the hooks).
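To illustrate the requirement, here is a minimal sketch of a lineage-style class that implements `Serializable` and survives a round trip through standard Java object streams. The class `ColumnDependency` and its fields are hypothetical stand-ins, and the use of `ObjectOutputStream` is an assumption for the example; the mechanism Hive actually uses to serialize plans may differ.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

/** Hypothetical sketch; ColumnDependency is not a class from the patch. */
class SerializableLineageSketch {

  /** A stand-in for a lineage class that must be serializable. */
  static class ColumnDependency implements Serializable {
    private static final long serialVersionUID = 1L;

    final String table;
    final String column;

    ColumnDependency(String table, String column) {
      this.table = table;
      this.column = column;
    }
  }

  /** Serializes the object to bytes and reads it back. */
  static ColumnDependency roundTrip(ColumnDependency in) {
    try {
      ByteArrayOutputStream bytes = new ByteArrayOutputStream();
      try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
        out.writeObject(in);
      }
      try (ObjectInputStream ois =
          new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
        return (ColumnDependency) ois.readObject();
      }
    } catch (IOException | ClassNotFoundException e) {
      // Wrapped so callers of this sketch need no checked-exception handling.
      throw new IllegalStateException("serialization round trip failed", e);
    }
  }

  public static void main(String[] args) {
    ColumnDependency copy = roundTrip(new ColumnDependency("src", "key"));
    System.out.println(copy.table + "." + copy.column);
  }
}
```

Every field type reachable from such a class must itself be serializable (or marked `transient`), otherwise `writeObject` throws `NotSerializableException`.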

        Ashish Thusoo added a comment -

        This is just the source patch. Will publish the test patch soon.


          People

          • Assignee:
            Ashish Thusoo
            Reporter:
            Ashish Thusoo
          • Votes:
            0
            Watchers:
            10