HIVE-896

Add LEAD/LAG/FIRST/LAST analytical windowing functions to Hive.
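
As a hedged orientation sketch (not code from the patch itself): in the form this feature eventually took in Hive 0.11, such queries use the standard PARTITION BY/ORDER BY syntax debated later in this thread; FIRST/LAST ship as first_value/last_value, and the part table columns are the ones used throughout the comments below.

    select p_mfgr, p_name, p_retailprice,
    lag(p_retailprice, 1) over (partition by p_mfgr order by p_name) as prev_price,
    lead(p_retailprice, 1) over (partition by p_mfgr order by p_name) as next_price,
    first_value(p_retailprice) over (partition by p_mfgr order by p_name) as first_price,
    last_value(p_retailprice) over (partition by p_mfgr order by p_name) as last_price
    from part;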

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.11.0
    • Component/s: OLAP, UDF
    • Labels: None
    Attachments

    1. DataStructs.pdf (339 kB) - Harish Butani
    2. HIVE-896.1.patch.txt (614 kB) - Harish Butani
    3. Hive-896.2.patch.txt (1.14 MB) - Harish Butani
    4. hive-896.3.patch.txt (1.18 MB) - Harish Butani
    5. HIVE-896.4.patch (1.19 MB) - Gunther Hagleitner
    6. HIVE-896.5.patch.txt (2.30 MB) - Harish Butani

        Activity

        Zhuoluo (Clark) Yang added a comment -

        I think it is necessary to have a kind of function called a UDWF (User-Defined Windowing Function).

        Jeff Hammerbacher added a comment -

        A guy who appears to be from SAP has posted some code for windowing functions using Hive on GitHub: https://github.com/hbutani/SQLWindowing

        Harish Butani added a comment -

        Hi,

        Yes, I have been developing SQL Windowing on top of Hive (https://github.com/hbutani/SQLWindowing/wiki). Users can intersperse Hive Queries with Windowing Clauses and Table functions. Users can use this from a CLI (hive --service windowingCLI) or via a Query API.
        The hope is to provide Users:

        • Windowing Clauses and Table functions as a simple extension to HQL (see the sketch after this list)
        • over time, to fold the implementation into the Hive code base with hopefully minimal impact to the User experience.
          Here are some thoughts on how to fold this functionality into Hive: https://github.com/hbutani/SQLWindowing/wiki/MoveToHive
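
        For a concrete flavor of the Table function side, this is the kind of invocation the work enables; the noop example below (a pass-through PTF) appears verbatim later in this thread:

        select p_mfgr, p_name, p_size
        from noop(part distribute by p_mfgr sort by p_name);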

        Any thoughts/comments/suggestions are greatly welcome.

        regards,
        Harish Butani.

        Harish Butani added a comment -

        Hi,

        We are posting a preliminary patch for a Partitioned Table Function mechanism and
        Windowing clause support based on it. The solution lets you invoke a
        Partitioned Table Function anywhere a Table/SubQuery can appear in HQL.
        The Windowing clause support matches standard SQL as much as possible:
        the ability to define windows with the Query or an individual Function; the ability to
        specify a range or value based window with any UDAF. But since Windowing is
        handled as a PTF invocation, all Window specifications must have the same Partition
        and Order specification.

        You can read about the details in a (work in progress) document
        here: http://tinyurl.com/ck4nopn. We have added a lot of tests to showcase the
        functionality. A good starting point is ptf_general_queries.q, which has 49 queries.

        But let us emphasize that this is a preliminary patch. We wanted to get this out early
        to get your feedback sooner rather than later. We need to do a lot of cleanup,
        refactoring and documentation. The starting point was our SQLWindowing on top of Hive
        project, which used Hive's metadata and runtime components but had its own Query form.
        So some components still reflect the assumptions from that project. We started by
        taking all the code from that project and placing it in the ql.ptf package.
        Gradually we have been dispersing the code out of this package, but we still have some
        way to go. For background it may help to look at our Hadoop Summit
        presentation (http://tinyurl.com/bm4qb7z).

        Finally, and most importantly, we are not completely finished. We are missing support for
        Queries with multiple Inserts. We have to address the case of Queries with aggregations
        with no group by and with constants as columns in the Select List. On the entire ql
        test suite there are still around 15 failures because of these 2 issues.

        Harish Butani, Prajakta Kalmegh

        Alan Gates added a comment -

        Harish,

        Could you point out the interfaces (in the API sense, not the Java sense) that are most important in this patch? In particular I'm interested in the interfaces between UDFs and Hive. Based on my review so far, the classes that stand out as important in this regard are TableFunctionEvaluator, TableFunctionResolver, and PTFPartition. Are there others I should be looking at?

        Questions I have so far:

        • If I read this right you are using CLUSTER BY and SORT BY instead of PARTITION BY and ORDER BY for syntax in OVER. Why?
        • Does it ever make sense for a windowing function to return a partition? Should there be an interface/abstract class specific to windowing functions that only returns a single entry?
        • Can I put one of the existing aggregate functions in an OVER clause using this?
        • Could you explain how the partition is handled in memory? It looks to me as if the entire partition is read into memory. Is that correct? If so, does it read it aggressively or as the iterator moves through the records? It also appears there is no effort to drop earlier parts of the partition that are now out of range of the window. Is that also correct?
        Harish Butani added a comment -

        Hi Alan,
        Thanks for taking the time. Here are my responses:

        1. Could you point out the interfaces...
        Yes, you are right: from a function writer's perspective TableFunctionEvaluator and TableFunctionResolver are the important interfaces; PTFPartition (and PTFPartitionIterator) is the data container interface.

        2. If I read this right you are using CLUSTER BY and SORT BY instead of PARTITION BY and ORDER BY for syntax in OVER. Why?
        To highlight the similarity. The Partition/Order specs in a Window clause have the same meaning as Cluster/Distribute in HQL. Note you can use a Cluster/Distribute at the query level and not specify any Partition spec in a Window clause. So the following are different ways for saying the same thing:

        a.
        select p_mfgr, p_name,
        sum(p_retailprice) over (distribute by p_mfgr sort by p_name rows between unbounded preceding and current row)
        from part;
        b.
        select p_mfgr, p_name, p_size,
        sum(p_retailprice) over (rows between unbounded preceding and current row)
        from part
        distribute by p_mfgr
        sort by p_name;
        c.
        select p_mfgr, p_name, p_size,
        sum(p_retailprice) over (w1)
        from part
        window w1 as distribute by p_mfgr sort by p_name rows between 2 preceding and 2 following;

        (I just realized that there are no examples of using Cluster/Distribute in Window clauses in the tests; we are adding them now)

        3. Can I put one of the existing aggregate functions in an OVER clause using this?
        I am not exactly clear what your question is; I may have answered it above. To be clear, there is no special Window Function: any existing Hive UDAF invocation can have a Windowing specification.
        Tests 31, 40, 41 cover most of the UDAFs; see also the sketch below.
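
        For illustration (not a query from the patch's test suite), here is a sketch that applies the existing avg and max UDAFs with windowing specifications, using the Distribute/Sort form shown in point 2 above:

        select p_mfgr, p_name,
        avg(p_retailprice) over (distribute by p_mfgr sort by p_name rows between 2 preceding and 2 following),
        max(p_size) over (distribute by p_mfgr sort by p_name rows between unbounded preceding and current row)
        from part;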

        4. Could you explain how the partition is handled in memory...
        Partitions are backed by a Persistent List (see ptf.ds.PartitionedByteBasedList). We need to do some work to refactor this package. Yes, you are right: more can be done to delay bringing rows into a partition and to drop rows once they fall outside the window. This is true for the Windowing Table Function, especially for Range based Windows.

        But for a general PTF the contract is Partition in, Partition out. For example, a CandidateFrequency function will read the rows in a partition multiple times.

        The PartitionedByteBasedList is backed by a set of PersistentByteBasedLists, which use weak refs and store their data on disk. We have done some testing with partitions of a million rows. But I agree with what you are getting at: there is stuff that can be done to reduce the memory footprint. We haven't gotten around to it yet.

        Alan Gates added a comment -

        If I read this right you are using CLUSTER BY and SORT BY instead of PARTITION BY and ORDER BY for syntax in OVER. Why? To highlight the similarity. The Partition/Order specs in a Window clause have the same meaning as Cluster/Distribute in HQL.

        This is only true as long as you have only one OVER clause, right? As soon as you add the ability to have separate OVER clauses partitioning by different keys (which users will want very soon) you lose this identity.

        Even if you decide to retain this, I would argue that the standard PARTITION BY/ORDER BY syntax should be accepted as well. HQL already has enough one-off syntax that makes life hard for people coming from more standard SQL. It should not be exacerbated.

        Could you explain how the partition is handled in memory...

        Partitions are backed by a Persistent List (see ptf.ds.PartitionedByteBasedList). We need to do some work to refactor this package. Yes, you are right: more can be done to delay bringing rows into a partition and to drop rows once they fall outside the window. This is true for the Windowing Table Function, especially for Range based Windows.
        But for a general PTF the contract is Partition in, Partition out. For example, a CandidateFrequency function will read the rows in a partition multiple times.

        This is part of where I was going with my earlier question on why a windowing function would ever return a partition. I am becoming less convinced that it makes sense to combine windowing and partition functions. While they both take partitions as inputs they return different things. Partition functions return partitions and windowing functions return a single value. As you point out here the partition functions will also not be interested in the range limiting features of windowing functions. But taking advantage of this in windowing functions will be very important for performance optimizations, I suspect. At the very least it seems like partitioning functions and windowing functions should be presented as separate entities to users and UDF writers, even if for now Hive shares some of the framework for handling them underneath. This way in the future optimizations and new features can be added in a way that is advantageous for each.

        Harish Butani added a comment -

        1.
        This is only true as long as you have only one OVER clause, right? As soon as you add the ability to have separate OVER clauses partitioning by different keys (which users will want very soon) you lose this identity. Even if you decide to retain this I would argue that the standard PARTITION BY/ORDER BY syntax should be accepted as well. HQL already has enough one off syntax that makes life hard for people coming from more standard SQL. It should not be exacerbated.

        I am agnostic about the second point. We can support Partition/Order or Distribute/Sort or both...

        Regarding the first point: no, it applies beyond having the same partitioning. If you say something like:

        select p_mfgr, p_name,
        sum(p_size) over (distribute by p_mfgr sort by p_name rows between unbounded preceding and current row)
        from part
        distribute by p_mfgr
        sort by p_name;

        This is allowed in the language; if we don't relate windowing to the distribute/sort, this means do the Windowing and then do the distribute/sort. I doubt this would ever be what the user intended. So we propose to associate the distribute/sort with windowing, and use it as the 'default' partitioning specification. So we allow a Query to be specified this way:

        select p_mfgr, p_name,
        sum(p_size) over (rows between unbounded preceding and current row)
        from part
        distribute by p_mfgr
        sort by p_name;

        The concept of inheriting the default partitioning still works even when we allow different partitioning specifications. So in the future this will be how you specify multiple ordering:

        select p_mfgr, p_name,
        sum(p_size) over ( rows between unbounded preceding and current row),
        sum(p_retailprice) over (sort by p_size rows between unbounded preceding and current row)
        from part
        distribute by p_mfgr
        sort by p_name;

        Similarly you can have multiple distribute specs.
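
        As a purely hypothetical sketch of what multiple distribute specs might look like, extrapolating the proposed future syntax above (p_brand is another column of the part table; this form is not claimed to work in the posted patch):

        select p_mfgr, p_name,
        sum(p_size) over (rows between unbounded preceding and current row),
        sum(p_retailprice) over (distribute by p_brand sort by p_name rows between unbounded preceding and current row)
        from part
        distribute by p_mfgr
        sort by p_name;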

        2.
        This is part of where I was going with my earlier question on why a windowing function would ever return a partition. I am becoming less convinced that it makes sense to combine windowing and partition functions. While they both take partitions as inputs they return different things. Partition functions return partitions and windowing functions return a single value. As you point out here the partition functions will also not be interested in the range limiting features of windowing functions. But taking advantage of this in windowing functions will be very important for performance optimizations, I suspect. At the very least it seems like partitioning functions and windowing functions should be presented as separate entities to users and UDF writers, even if for now Hive shares some of the framework for handling them underneath. This way in the future optimizations and new features can be added in a way that is advantageous for each.

        There are several points here.

        a. Windowing doesn't return a single value. The output of applying a Window Function on a Partition is a Column with the same number of rows as the partition.
        b. The combined output of all the Window Functions in a Statement is a Partition that combines the output from the individual Window Functions.

        Now let me say something about the separation of Windowing and PTF functionality. There are 4 areas:

        a. The Language level: there is no intersection. The user will not see the connection. One is used as a UDAF; the other wherever tables can appear.

        b. The Interface/Function writer level: there is no intersection. There is no 'Window Function'; a UDAF writer can continue to write UDAFs. They automatically become available in Window expressions. Table Functions are written using the TableFunctionResolver and Evaluator interfaces. This is very different from writing a UDAF. (We have set up a functions branch; you will see some examples of Table Functions beyond the pedantic Noop and NoopWithMap; we were also hoping to add NPath into the first patch.)

        c. The Query Specification Level: here things get a little messy. First let me describe the situation, and then how it is relatively easy to fix. We have extended the QueryBlock (QB) to have the following information (per destination):

        • a map of PTF invocations. These are any PTF invocations that appear in the from clause, roughly equivalent to how SubQuery information is captured in the QB. This information is held in an instance of PTFSpec, which captures all the details of the PTF invocation.
        • a destination may also have a PTFSpec attached, which represents the Windowing processing associated with this QB destination.

        Here is where things need correction. Our implementation details are leaking into the Specification classes. Since today we use the PTFOp to execute windowing, we are capturing the windowing clauses' information in a PTFSpec. But it is relatively easy to correct this: we can have a different set of classes to capture the Window processing.

        c.2. The other place where the implementation is leaking is how we try to optimize Windowing processing when it is combined with a PTF invocation. (This may be too much detail; the gist is that this too is easily fixable; if not interested, skip to point d.)

        When we see that the from clause is only a PTF invocation, we associate the windowing clauses with its PTFSpec, so as to treat both things as one PTF Chain. Once in a chain, we use our PTF Chain breaking logic to decide whether Windowing can be done in the same PTFOp or we need to break them. Again, this is relatively easy to fix; for now we remove the logic that tries to associate Windowing processing with an existing PTFSpec on the QB. This keeps the translation clean; the decision to combine can be pushed off to a later stage.

        d. The execution of Windowing: think of PTFOperator as an implementation for Windowing. With the changes above, it will be easy to choose other available implementations in the future.

        Finally the execution of Windowing by the PTFOp has some uses:
        d.1

        • the handling of value based ranges: it is more work to predict the boundary of the window, and sometimes it may just make sense to keep the whole partition. For example:

        select p_mfgr, p_name, p_size,
        sum(p_size) as s2 over (range between p_size 5 less and current row)
        from part
        distribute by p_mfgr
        sort by p_mfgr, p_name;

        The tradeoff is doing the window calculation for each row to decide how much to keep around vs. just keeping the whole thing (spilled to disk if needed).

        d.2
        Support for multiple orderings with the same partitioning is possible by doing just one shuffle and then sorting the PersistentPartitionedList backing the Partition. This is much easier than invoking multiple MR jobs and assembling the output into a final Partition.

        Alan Gates added a comment -

        Harish,

        Thanks for your replies. I want to think on your explanation in 2 above some more, but at least I think I understand your rationale now.

        One other question. I tried playing around with this but kept getting an error. I'm not sure what I'm doing wrong. I have a table that I created with the following statement:

        create table studenttab10k (name string, age int, gpa float);
        

        When I run

        select avg(gpa) over (cluster by age) from studenttab10k;
        

        I get

        FAILED: SemanticException 1:43 No partition specification associated with start of PTF chain . Error encountered near token 'age'
        

        I looked through the syntax file and I think I'm doing the right thing, but obviously I'm not.

        Harish Butani added a comment -

        Alan,

        Thanks for spending the time.

        Yes, your example is going to fail. There was a bug in the patch we posted.
        This was fixed in commit 0eff864d765c91e0bece497e0f007c6cd2cec72f in our repo on Jan 9th.
        I can send you a patch privately or post the updated patch here.
        Sorry about this.

        Alan Gates added a comment -

        I'd definitely like to get a new version of the patch. I'm happy to pull from github. I looked at the repo referenced above (https://github.com/hbutani/SQLWindowing) but it didn't have any recent updates.

        Harish Butani added a comment -

        It's https://github.com/hbutani/hive (ptf branch).
        The SQLWindowing repo has the work we did on top of Hive.

        Harish Butani added a comment -

        Alan,

        I have attached a document describing the PTF & Windowing Specification classes. This is a formal description of the changes I was describing above. The output of Phase 1 will be Windowing & PTFInvocation objects attached to the QB.

        The thought process is:

        • Phase 1 generates a Specification that doesn't tie PTFs to Windowing
        • We alter the PTFTranslator to handle both Windowing & PTFInvocations; for now we keep the PTFDef classes mostly intact.
        • In the future we:
          • build a separate Operator for Windowing, based on the Spec classes.
          • tbd: refactor the PTFTranslator and Definition classes to share more translation code with the Windowing Operator.
        Ashutosh Chauhan added a comment -

        Harish Butani, the image in your html doc didn't make it. Maybe attach a pdf version containing the image? Or just attaching the standalone image is fine as well.

        Harish Butani added a comment -

        Done. Sorry about that.

        Harish Butani added a comment -

        Hi,

        Just posted a second preliminary patch. This has:

        • support for multiple Inserts
        • finished all the cleanup
        • added the NPath Table Function
        • added more tests; there are now around 70 tests
        • many bug fixes and minor enhancements, to list a few: don't interfere with aggregation queries that have no Group By clause; support multiple invocations of the same UDAF on different windows (see the sketch after this list); allow aliases to be optional in window clauses; fix issues with the default window specification and inheriting partitioning; enhance the PTF interface to specify output column names; fix bugs in value based windows; fix issues when mixing distinct and windowing clauses
        • made sure all the ql tests pass (though this is a moving target; we apologize if failures have crept in since we checked yesterday)
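
        To illustrate the 'multiple invocations of the same UDAF on different windows' fix, here is a hedged sketch in the syntax used earlier in this thread (not a query taken from the test suite):

        select p_mfgr, p_name,
        sum(p_retailprice) over (rows between 2 preceding and current row),
        sum(p_retailprice) over (rows between unbounded preceding and current row)
        from part
        distribute by p_mfgr
        sort by p_name;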

        We now:

        • plan to make the changes listed in the comments to Alan, so there is a clean separation between Windowing & PTFs at the specification level.
        • In parallel, we have started a functions branch, where we are testing more PTFs. See our github repo for details.
        Ashutosh Chauhan added a comment -

        Hey Harish,
        Thanks for updating the patch. While I was playing with the previous patch, unit tests ran fine, but I found PTF queries failed when I ran them on a cluster. I found out that the antlr-runtime jar is required at the tasktracker, so I need to 'add jar' to send it across. This increases our runtime dependency. Though, that's not my biggest concern. On investigating a bit, I found this is because we are serializing antlr structures from the client to the backend. I think we should get rid of this. We should populate all the information we need in Hive's data structures in the front end (probably in *desc objects) and then use these structures to pull out the information. Having pieces of antlr structures at the backend is not ideal. Does the latest patch improve this situation?

        Harish Butani added a comment -

        No, this needs addressing. Currently the Definition classes (same as Desc classes) are holding onto the Spec classes, which have the ASTNodes. In most cases it is easy to make the Spec reference transient (or better yet, move the association outside of the Def classes). There are a couple of places where the ASTNode is being referenced; I need to take a closer look. I am in the process of refactoring the Spec classes; I will have a better sense in a couple of days.

        Ashutosh Chauhan added a comment -

        PTFDesc only contains a serialized string for PTFDef. I think we should just merge these two classes: rename the existing PTFDef to PTFDesc and remove the existing PTFDesc. And then make sure that PTFDesc is serializable. Does that sound right?

        Ashutosh Chauhan added a comment -

        We also need to make sure that ASTNode and other antlr data structures referenced (directly or via contained fields) in this new PTFDesc are not required in PTFOperator and are thus not serialized, thereby eliminating the antlr runtime dependency.

        Harish Butani added a comment -

        Yes, exactly. I will start to introduce the new Spec classes as noted in the DataStructs attachment, and refactor the Def classes to remove the antlr dependency.

        But before doing this I had to handle the following issue. The plan we generate has the form
        ... -> ReduceSink -> Extract -> PTF Op -> ...
        The ReduceSink RowResolver contains the Virtual Columns from its input Operators. During translation we set the RowResolver of the Extract Op to be the same as the ReduceSink RR, and this same RR was used to set up the ExprNodeDescs in PTF translation. But at runtime the Extract Op doesn't contain the Virtual Columns, so the internal column names can be different. For example, in our testJoinWithLeadLag test case, which is a self join on part and also has a Windowing expression, the RR of the RS op at translation time looks something like this:
        (_col1,_col2,..,_col7, _col8(vc=true),_col9(vc=true),_col10,_col11,.._col15(vc=true),_col16(vc=true),..)
        At runtime the Virtual Columns are removed and all the columns after _col7 are shifted 1 or 2 positions. So in child Operators the ColumnExprNodeDescs are no longer referring to the right columns.
        We were handling this issue by recreating the ExprNodeDescs from the ASTNodes at runtime.
        So to avoid carrying forward the ASTNodes, we now build a new RR for the Extract Op with the Virtual Columns removed. We hand this to the PTFTranslator as the starting RR to use to translate a PTF Chain.

        With the above change, it should now be possible to use the ExprNodeDescs created during translation in the execution of the PTF Op. So we will now start a sequence of steps to move to the new data structures and avoid recreation of ExprNodeDescs at runtime.

        I apologize if I am not being clear. This is a little hard to explain w/o walking through an example. Happy to go over this in detail offline.

        Ashutosh Chauhan added a comment -

        That sounds like a bug in the existing trunk. If I got this right, then a query involving sort/distribute by and virtual columns would yield incorrect results on current trunk. If so, perhaps we should address this bug in a separate ticket for trunk?

        Prajakta Kalmegh added a comment -

        This is not exactly a bug. In the existing trunk, the ExtractOperator is followed by a FileSinkOperator and hence does not have this problem. For queries like the one below:

        select p1.p_mfgr, p1.p_name,
        p1.p_size
        from part p1 join part p2 on p1.p_partkey = p2.p_partkey
        distribute by p1.p_mfgr
        sort by p1.p_name;

        a SelectOperator after JoinOperator solves this problem by filtering the virtual columns (VCs) and setting up a correct RR for ReduceSinkOperator. We cannot insert a SelectOperator in our case as the PTF chain is a black-box for us.

        In queries with the PTFOperator, we use the RowResolver of the ExtractOperator to construct ExprNodeDescs during translation. The problem here is: if we do not filter out the VCs from the ExtractOperator and use them during translation, the ColumnPrunerTableScanProc adds these VCs to the newVirtualCols List. This causes a non-empty virtualCols on TableScanDesc. At runtime, in the MapOperator the 'hasVC' boolean is set to true, eventually resulting in a ClassCastException in the ReduceSinkOperator during row evaluation. This problem occurs particularly for queries involving a join with a PTF (we can walk through some examples offline to explain why this is not a problem for queries with a PTF and no join). So currently, we are filtering the VCs and setting up a new RowResolver for the ExtractOperator during translation so that the columns at runtime match those during translation.

        Ashutosh Chauhan added a comment -

        Thanks Prajakta for the explanation. Makes sense. But to your last point, which Harish also noted in a previous comment:

        So currently, we are filtering the VCs and setting up a new RowResolver for ExtractOperator during translation so that the columns at runtime match with those during translation.

        I think we cannot always filter out VCs from the RR of the ExtractOp. That will depend on whether the user has selected VCs or not in the query. No?

        Ashutosh Chauhan added a comment -

        Did some further testing on my 1-node pseudo cluster. After getting past dependency problems, I hit this OOM exception. This is with

        -- 10. testOnlyPTF 
        select p_mfgr, p_name, p_size from noop(part distribute by p_mfgr sort by p_name);
        
        
        java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row (tag=0) {"key":{"reducesinkkey0":"Manufacturer#2","reducesinkkey1":"almond antique violet chocolate turquoise"},"value":{"_col0":105685,"_col1":"almond antique violet chocolate turquoise","_col2":"Manufacturer#2","_col3":"Brand#22","_col4":"MEDIUM ANODIZED COPPER","_col5":14,"_col6":"MED CAN","_col7":1690.68,"_col8":"ly pending requ"},"alias":0}
        	at org.apache.hadoop.hive.ql.exec.ExecReducer.reduce(ExecReducer.java:274)
        	at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:518)
        	at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:419)
        	at org.apache.hadoop.mapred.Child$4.run(Child.java:259)
        	at java.security.AccessController.doPrivileged(Native Method)
        	at javax.security.auth.Subject.doAs(Subject.java:396)
        	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
        	at org.apache.hadoop.mapred.Child.main(Child.java:253)
        Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row (tag=0) {"key":{"reducesinkkey0":"Manufacturer#2","reducesinkkey1":"almond antique violet chocolate turquoise"},"value":{"_col0":105685,"_col1":"almond antique violet chocolate turquoise","_col2":"Manufacturer#2","_col3":"Brand#22","_col4":"MEDIUM ANODIZED COPPER","_col5":14,"_col6":"MED CAN","_col7":1690.68,"_col8":"ly pending requ"},"alias":0}
        	at org.apache.hadoop.hive.ql.exec.ExecReducer.reduce(ExecReducer.java:262)
        	... 7 more
        Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.reflect.InvocationTargetException
        	at org.apache.hadoop.hive.ql.exec.PTFPersistence.createList(PTFPersistence.java:47)
        	at org.apache.hadoop.hive.ql.exec.PTFPartition.init(PTFPartition.java:42)
        	at org.apache.hadoop.hive.ql.exec.PTFPartition.<init>(PTFPartition.java:35)
        	at org.apache.hadoop.hive.ql.exec.PTFOperator.createFirstPartitionForChain(PTFOperator.java:371)
        	at org.apache.hadoop.hive.ql.exec.PTFOperator.processOp(PTFOperator.java:133)
        	at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:521)
        	at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:849)
        	at org.apache.hadoop.hive.ql.exec.ExtractOperator.processOp(ExtractOperator.java:45)
        	at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:521)
        	at org.apache.hadoop.hive.ql.exec.ExecReducer.reduce(ExecReducer.java:253)
        	... 7 more
        Caused by: java.lang.reflect.InvocationTargetException
        	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
        	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
        	at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
        	at org.apache.hadoop.hive.ql.exec.PTFPersistence.createList(PTFPersistence.java:43)
        	... 16 more
        Caused by: java.lang.OutOfMemoryError: Java heap space
        	at org.apache.hadoop.hive.ql.exec.PTFPersistence$ByteBasedList.<init>(PTFPersistence.java:77)
        	at org.apache.hadoop.hive.ql.exec.PTFPersistence$PartitionedByteBasedList.addPartition(PTFPersistence.java:407)
        	at org.apache.hadoop.hive.ql.exec.PTFPersistence$PartitionedByteBasedList.<init>(PTFPersistence.java:386)
        	... 21 more
        

        I tried with other queries as well and all of them failed there. Note that these queries are from ptf_general_queries.q, which passes fine when run via ant as a unit test.

        Harish Butani added a comment -

        Can you try setting hive.ptf.partition.persistence.memsize to something lower? The default is 64MB.
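
        For example, to drop it to roughly 10MB (assuming the value is specified in bytes, which is worth double-checking against HiveConf):

        set hive.ptf.partition.persistence.memsize=10485760;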

        Ashutosh Chauhan added a comment -

        Yeah, that worked fine. My 1-node pseudo cluster was running with the default config, which resulted in an Xmx value of 256M that wasn't well equipped to handle 64MB arrays : ) It worked fine with 10MB.
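
        (The other workaround, presumably, would be to raise the reducer JVM heap itself, e.g. via mapred.child.java.opts=-Xmx512m in the cluster config, rather than shrinking the PTF partition buffer.)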

        Ashutosh Chauhan added a comment -

        Created a Phabricator review entry to ease review: https://reviews.facebook.net/D8331

        Harish Butani added a comment -

        Ashutosh,

        So the plan looks like this:

        ... -> ReduceSink -> Extract -> PTFOp

        We don't know which columns a PTF will access; the contract is that it has access to all columns in its input. So we don't want any column pruning to happen, and we don't put a Select operator before the ReduceSink. At translation time we see all the columns, including the virtual columns (VCs). It appears that during optimization VCs are carried forward only if required, so at runtime the ColumnExprNodeDescs end up referring to the wrong internalNames. Does this make sense? Is there a way to carry forward the VCs when a PTF is present? The other option, which is the one we have taken, is to say that VCs are not available to PTFs.

        Having said this, when the PTF is Windowing, we do know which columns are referenced, so we should put a Select operator in front of the ReduceSink.
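
        In the Windowing case the plan would then presumably look like:

        ... -> Select -> ReduceSink -> Extract -> PTFOp

        with the Select projecting only the columns the window expressions actually reference, so that column pruning can still apply upstream.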

        Alan Gates added a comment -

        Harish Butani, I've been looking at the syntax for invoking a PTF in your patch. If I read it correctly, it looks something like:

        functionname ( expr_list ON source ... )

        where expr_list is a comma-separated list of expressions, source is a table, subquery, or another PTF, and ... is the DISTRIBUTE BY etc.

        Is that right? If so, what is that list of expressions? Are those arguments to be passed to the PTF? Also, if I read it correctly, the ON keyword is optional. Does ON not being there have some meaning, or do you just see it as a noise word and so are allowing it to be optional?

        Harish Butani added a comment -

        Yes, you got most of it correct. If there are no arguments the form is:

        functionname(source ...), e.g. ptf1(part distribute by mfgr sort by name)

        If there are arguments the form is:

        functionname(args on source ...), e.g. ptf2(arg1, arg2 on part distribute by mfgr sort by name)

        If you have args, the ON is required. We could probably make the ON optional here too; I think there was some ambiguity when ON is optional, and I didn't want to deal with it in the first pass... If you want, I can go back and check.
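
        For a complete, runnable instance of the no-argument form, the invocation of the built-in noop PTF from the ptf_general_queries.q tests mentioned earlier in this thread is:

        select p_mfgr, p_name, p_size
        from noop(part distribute by p_mfgr sort by p_name);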

        Alan Gates added a comment -

        I was wondering if there's a "standard" way to do selections from partitioned table functions in SQL. (I know it isn't actually part of the standard yet.) If there's a way others have done it, it makes sense to align Hive's syntax with theirs, as long as theirs is rational.

        The only other one I'm aware of is Aster. Their syntax is:

        functionname ( ON source [PARTITION BY expression [ORDER BY expression]] [key_value_list] )

        where a key_value_list is: key(value)[, key(value)...], in which key is the name of the argument in the function and value is the value to be passed.

        We've already agreed to add PARTITION BY and ORDER BY as synonyms for DISTRIBUTE/CLUSTER BY and SORT BY. It seems to me it would make sense to do the function argument passing in the same way, to avoid confusion for users.
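
        To illustrate the key_value_list style with a purely hypothetical function and argument names (not actual syntax for any real Aster function):

        sessionize ( ON clicks PARTITION BY userid ORDER BY ts timeout(60) )

        where timeout is the name of an argument of the hypothetical sessionize function and 60 is the value passed to it.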

        Harish Butani added a comment -

        Yes, Aster is the one I had looked at too. The other one I looked at was Pipelined Partitioned Table Functions in Oracle; they are a bit different, in that the partitioning information is part of the function definition.

        I am assuming key is an identifier, so this is really an invocation using named parameters.
        Since Hive doesn't have named parameters for other functions, could we do positional parameters for now?
        Later, we could do named parameters for all functions, not just PTFs.
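
        For illustration, with a hypothetical function and made-up argument names, the two styles would look like:

        -- positional, as proposed for the first pass
        myptf('A.B+', 15 on t distribute by c sort by d)

        -- named, Aster-style key(value)
        myptf(ON t PARTITION BY c ORDER BY d pattern('A.B+') threshold(15))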

        Harish Butani added a comment -

        Attached a patch to be used as the starting point for the Hive branch.
        It has minor changes since the last patch.

        Ashutosh Chauhan added a comment -

        This is an umbrella JIRA for this feature. All the sub-tasks are linked to it.

        Gunther Hagleitner added a comment -

        .4 has all the changes from the branch in a single patch.

        Harish Butani added a comment -

        uploaded .5

        Ashutosh Chauhan added a comment -

        I am +1 on this. I have overseen this entire effort and have reviewed all ~45 patches that went into the branch. With ~200 tests (including negative ones), the feature is well tested. The code is well commented and relatively easy to follow.

        Ashutosh Chauhan added a comment -

        Running tests. Will commit if tests pass.

        Ashutosh Chauhan added a comment -

        Committed to trunk. Thanks, Harish, for this awesome feature!

        Brock Noland added a comment -

        +1

        Nice work, guys! Sorry I have been OOO, so I haven't been able to help lately.

        Ashutosh Chauhan added a comment -

        I must acknowledge that other folks have also contributed to this effort, including Brock Noland, Prajakta Kalmegh, and myself. Thanks to everyone for their time and enthusiasm in seeing this through.

        Gunther Hagleitner added a comment -

        Amazing! Thanks Harish!

        eric baldeschwieler added a comment -

        very cool

        Harish Butani added a comment -

        I second Ashutosh's comment. Thanks Prajakta, Ashutosh, Brock and Alan for your help and support.

        Hudson added a comment -

        Integrated in Hive-trunk-hadoop2 #138 (See https://builds.apache.org/job/Hive-trunk-hadoop2/138/)
        HIVE-896 : Add LEAD/LAG/FIRST/LAST analytical windowing functions to Hive. (Harish Butani via Ashutosh Chauhan) (Revision 1463556)

        Result = FAILURE
        hashutosh : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1463556
        Files :

        • /hive/trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
        • /hive/trunk/data/files/flights_tiny.txt
        • /hive/trunk/data/files/over10k
        • /hive/trunk/data/files/part.rc
        • /hive/trunk/data/files/part.seq
        • /hive/trunk/data/files/part_tiny.txt
        • /hive/trunk/ql/if/queryplan.thrift
        • /hive/trunk/ql/src/gen/thrift/gen-cpp/queryplan_types.cpp
        • /hive/trunk/ql/src/gen/thrift/gen-cpp/queryplan_types.h
        • /hive/trunk/ql/src/gen/thrift/gen-javabean/org/apache/hadoop/hive/ql/plan/api/Operator.java
        • /hive/trunk/ql/src/gen/thrift/gen-javabean/org/apache/hadoop/hive/ql/plan/api/OperatorType.java
        • /hive/trunk/ql/src/gen/thrift/gen-javabean/org/apache/hadoop/hive/ql/plan/api/Query.java
        • /hive/trunk/ql/src/gen/thrift/gen-javabean/org/apache/hadoop/hive/ql/plan/api/Stage.java
        • /hive/trunk/ql/src/gen/thrift/gen-javabean/org/apache/hadoop/hive/ql/plan/api/Task.java
        • /hive/trunk/ql/src/gen/thrift/gen-php/Types.php
        • /hive/trunk/ql/src/gen/thrift/gen-py/queryplan/ttypes.py
        • /hive/trunk/ql/src/gen/thrift/gen-rb/queryplan_types.rb
        • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/QueryProperties.java
        • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/FunctionRegistry.java
        • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/OperatorFactory.java
        • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/PTFFunctionInfo.java
        • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/PTFOperator.java
        • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/PTFPartition.java
        • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/PTFPersistence.java
        • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/PTFUtils.java
        • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/PartitionTableFunctionDescription.java
        • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/WindowFunctionDescription.java
        • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/WindowFunctionInfo.java
        • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/hooks/LineageInfo.java
        • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/ColumnPruner.java
        • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/ColumnPrunerProcFactory.java
        • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/lineage/Generator.java
        • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/ASTNode.java
        • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/FromClauseParser.g
        • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/HiveLexer.g
        • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/HiveParser.g
        • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/IdentifiersParser.g
        • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/PTFInvocationSpec.java
        • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/PTFTranslator.java
        • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/ParseDriver.java
        • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/QB.java
        • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/QBParseInfo.java
        • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SelectClauseParser.g
        • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java
        • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/TypeCheckProcFactory.java
        • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/WindowingComponentizer.java
        • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/WindowingExprNodeEvaluatorFactory.java
        • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/WindowingSpec.java
        • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/PTFDesc.java
        • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/PTFDeserializer.java
        • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/ppd/OpProcFactory.java
        • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/ppd/PredicatePushDown.java
        • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFCumeDist.java
        • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFDenseRank.java
        • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFFirstValue.java
        • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFLag.java
        • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFLastValue.java
        • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFLead.java
        • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFLeadLag.java
        • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFNTile.java
        • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFPercentRank.java
        • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFRank.java
        • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFRowNumber.java
        • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFLeadLag.java
        • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/ptf
        • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/ptf/NPath.java
        • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/ptf/Noop.java
        • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/ptf/NoopWithMap.java
        • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/ptf/TableFunctionEvaluator.java
        • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/ptf/TableFunctionResolver.java
        • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/ptf/WindowingTableFunction.java
        • /hive/trunk/ql/src/test/queries/clientnegative/ptf_negative_AggrFuncsWithNoGBYNoPartDef.q
        • /hive/trunk/ql/src/test/queries/clientnegative/ptf_negative_AmbiguousWindowDefn.q
        • /hive/trunk/ql/src/test/queries/clientnegative/ptf_negative_DistributeByOrderBy.q
        • /hive/trunk/ql/src/test/queries/clientnegative/ptf_negative_DuplicateWindowAlias.q
        • /hive/trunk/ql/src/test/queries/clientnegative/ptf_negative_HavingLeadWithNoGBYNoWindowing.q
        • /hive/trunk/ql/src/test/queries/clientnegative/ptf_negative_HavingLeadWithPTF.q
        • /hive/trunk/ql/src/test/queries/clientnegative/ptf_negative_InvalidValueBoundary.q
        • /hive/trunk/ql/src/test/queries/clientnegative/ptf_negative_JoinWithAmbigousAlias.q
        • /hive/trunk/ql/src/test/queries/clientnegative/ptf_negative_PartitionBySortBy.q
        • /hive/trunk/ql/src/test/queries/clientnegative/ptf_negative_WhereWithRankCond.q
        • /hive/trunk/ql/src/test/queries/clientnegative/ptf_window_boundaries.q
        • /hive/trunk/ql/src/test/queries/clientnegative/ptf_window_boundaries2.q
        • /hive/trunk/ql/src/test/queries/clientpositive/leadlag.q
        • /hive/trunk/ql/src/test/queries/clientpositive/leadlag_queries.q
        • /hive/trunk/ql/src/test/queries/clientpositive/ptf.q
        • /hive/trunk/ql/src/test/queries/clientpositive/ptf_general_queries.q
        • /hive/trunk/ql/src/test/queries/clientpositive/ptf_npath.q
        • /hive/trunk/ql/src/test/queries/clientpositive/ptf_rcfile.q
        • /hive/trunk/ql/src/test/queries/clientpositive/ptf_seqfile.q
        • /hive/trunk/ql/src/test/queries/clientpositive/windowing.q
        • /hive/trunk/ql/src/test/queries/clientpositive/windowing_columnPruning.q
        • /hive/trunk/ql/src/test/queries/clientpositive/windowing_expressions.q
        • /hive/trunk/ql/src/test/queries/clientpositive/windowing_multipartitioning.q
        • /hive/trunk/ql/src/test/queries/clientpositive/windowing_navfn.q
        • /hive/trunk/ql/src/test/queries/clientpositive/windowing_ntile.q
        • /hive/trunk/ql/src/test/queries/clientpositive/windowing_rank.q
        • /hive/trunk/ql/src/test/queries/clientpositive/windowing_udaf.q
        • /hive/trunk/ql/src/test/queries/clientpositive/windowing_windowspec.q
        • /hive/trunk/ql/src/test/results/clientnegative/ptf_negative_AggrFuncsWithNoGBYNoPartDef.q.out
        • /hive/trunk/ql/src/test/results/clientnegative/ptf_negative_AmbiguousWindowDefn.q.out
        • /hive/trunk/ql/src/test/results/clientnegative/ptf_negative_DistributeByOrderBy.q.out
        • /hive/trunk/ql/src/test/results/clientnegative/ptf_negative_DuplicateWindowAlias.q.out
        • /hive/trunk/ql/src/test/results/clientnegative/ptf_negative_HavingLeadWithNoGBYNoWindowing.q.out
        • /hive/trunk/ql/src/test/results/clientnegative/ptf_negative_HavingLeadWithPTF.q.out
        • /hive/trunk/ql/src/test/results/clientnegative/ptf_negative_InvalidValueBoundary.q.out
        • /hive/trunk/ql/src/test/results/clientnegative/ptf_negative_JoinWithAmbigousAlias.q.out
        • /hive/trunk/ql/src/test/results/clientnegative/ptf_negative_PartitionBySortBy.q.out
        • /hive/trunk/ql/src/test/results/clientnegative/ptf_negative_WhereWithRankCond.q.out
        • /hive/trunk/ql/src/test/results/clientnegative/ptf_window_boundaries.q.out
        • /hive/trunk/ql/src/test/results/clientnegative/ptf_window_boundaries2.q.out
        • /hive/trunk/ql/src/test/results/clientpositive/leadlag.q.out
        • /hive/trunk/ql/src/test/results/clientpositive/leadlag_queries.q.out
        • /hive/trunk/ql/src/test/results/clientpositive/ptf.q.out
        • /hive/trunk/ql/src/test/results/clientpositive/ptf_general_queries.q.out
        • /hive/trunk/ql/src/test/results/clientpositive/ptf_npath.q.out
        • /hive/trunk/ql/src/test/results/clientpositive/ptf_rcfile.q.out
        • /hive/trunk/ql/src/test/results/clientpositive/ptf_seqfile.q.out
        • /hive/trunk/ql/src/test/results/clientpositive/show_functions.q.out
        • /hive/trunk/ql/src/test/results/clientpositive/windowing.q.out
        • /hive/trunk/ql/src/test/results/clientpositive/windowing_columnPruning.q.out
        • /hive/trunk/ql/src/test/results/clientpositive/windowing_expressions.q.out
        • /hive/trunk/ql/src/test/results/clientpositive/windowing_multipartitioning.q.out
        • /hive/trunk/ql/src/test/results/clientpositive/windowing_navfn.q.out
        • /hive/trunk/ql/src/test/results/clientpositive/windowing_ntile.q.out
        • /hive/trunk/ql/src/test/results/clientpositive/windowing_rank.q.out
        • /hive/trunk/ql/src/test/results/clientpositive/windowing_udaf.q.out
        • /hive/trunk/ql/src/test/results/clientpositive/windowing_windowspec.q.out
        Guilherme Braccialli added a comment -

        Harish,

        I noticed that the NPath class is in the Hive 0.11 source and that it's also a registered function in Hive. Is it working? Could you please give us a sample query? I tried the query below, but it's not working.

        Thanks.

        create external table flights_tiny (ORIGIN_CITY_NAME string, DEST_CITY_NAME string, YEAR int, MONTH int, DAY_OF_MONTH int, ARR_DELAY float, FL_NUM string)
        location '/user/xxxxx';

        select npath(
        'ONTIME.LATE+',
        'LATE', arr_delay > 15,
        'EARLY', arr_delay < 0,
        'ONTIME', arr_delay >=0 and arr_delay <= 15,
        'origin_city_name, fl_num, year, month, day_of_month, size(tpath) as sz, tpath as tpath'
        )
        from flights_tiny;

        FAILED: NullPointerException null

        Harish Butani added a comment -

        The syntax is now different from the original patch. Here is an example from the ptf_npath.q test file:

        select origin_city_name, fl_num, year, month, day_of_month, sz, tpath
        from npath(on
               flights_tiny
               distribute by fl_num
               sort by year, month, day_of_month
               arg1('LATE.LATE+'),
               arg2('LATE'), arg3(arr_delay > 15),
               arg4('origin_city_name, fl_num, year, month, day_of_month, size(tpath) as sz, tpath[0].day_of_month as tpath')
             );
        

        One thing I should point out is that it is not clear if NPath will remain in its current form.


          People

          • Assignee: Harish Butani
          • Reporter: Amr Awadallah
          • Votes: 13
          • Watchers: 46
