Details

    • Type: New Feature New Feature
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.11
    • Component/s: None
    • Labels:
    • Release Note:
      Hide
      Pig includes a new RANK operator:
      RANK <relation> ( BY <column> (ASC|DES)? (DENSE)? )?
      This operator prepends a consecutive integer to each tuple in the relation starting from 1.
      If the BY clause is present, RANK sorts the relation before ranking it, otherwise it uses the order in which it receives the relation (e.g. the order in which the relation is stored if RANK is performed right after a LOAD).
      The DENSE modifier produces a dense rank, which has no gaps in it regardless of ties.

      RANK is now a reserved keyword and is *not* backward compatible.
      Please review your scripts to avoid usage of RANK as a relation name.
      Show
      Pig includes a new RANK operator: RANK <relation> ( BY <column> (ASC|DES)? (DENSE)? )? This operator prepends a consecutive integer to each tuple in the relation starting from 1. If the BY clause is present, RANK sorts the relation before ranking it, otherwise it uses the order in which it receives the relation (e.g. the order in which the relation is stored if RANK is performed right after a LOAD). The DENSE modifier produces a dense rank, which has no gaps in it regardless of ties. RANK is now a reserved keyword and is *not* backward compatible. Please review your scripts to avoid usage of RANK as a relation name.

      Description

      Implement a function that given a (sorted) bag adds to each tuple a unique, increasing identifier without gaps, like what RANK does for SQL.

      This is a candidate project for Google summer of code 2012. More information about the program can be found at https://cwiki.apache.org/confluence/display/PIG/GSoc2012

      Functionality implemented so far, is available at https://reviews.apache.org/r/5523/diff/#index_header

      1. PIG2353.patch
        9 kB
        Jonathan Coveney
      2. PIG-2353-2
        73 kB
        Allan Avendaño
      3. PIG-2353-3.txt
        251 kB
        Allan Avendaño
      4. PIG-2353-4.txt
        286 kB
        Allan Avendaño
      5. PIG-2353-5.txt
        294 kB
        Allan Avendaño

        Issue Links

          Activity

          Hide
          David Ciemiewicz added a comment -

          Did anyone look at the solution I proposed in JIRA PIG 821 for partitioned rank computation on billions of rows? There may be better solutions but I do know that that one works without need for serialization of all rows, only on the histogram.

          Show
          David Ciemiewicz added a comment - Did anyone look at the solution I proposed in JIRA PIG 821 for partitioned rank computation on billions of rows? There may be better solutions but I do know that that one works without need for serialization of all rows, only on the histogram.
          Hide
          Gianmarco De Francisci Morales added a comment -

          Hi, sorry I guess I misunderstood.
          I thought that PIG-2947 was sufficient as documentation and that we just wanted to clarify the release notes.

          Should I open a separate Jira to include the release notes of the Jira inside RELEASE_NOTES.txt ?

          Show
          Gianmarco De Francisci Morales added a comment - Hi, sorry I guess I misunderstood. I thought that PIG-2947 was sufficient as documentation and that we just wanted to clarify the release notes. Should I open a separate Jira to include the release notes of the Jira inside RELEASE_NOTES.txt ?
          Hide
          Olga Natkovich added a comment -

          I believe we agreed that the document changes are included and reviewed as part of the patch. Since this was not done this way, we need to get a separate patch for docs,

          Show
          Olga Natkovich added a comment - I believe we agreed that the document changes are included and reviewed as part of the patch. Since this was not done this way, we need to get a separate patch for docs,
          Hide
          Rohini Palaniswamy added a comment -

          Shouldn't we make it some part of documentation or RELEASE_NOTES.txt, instead of just Release note in JIRA?

          Show
          Rohini Palaniswamy added a comment - Shouldn't we make it some part of documentation or RELEASE_NOTES.txt, instead of just Release note in JIRA?
          Hide
          Jonathan Coveney added a comment -

          You cannot, so this is not without precedent. We should document it, and ideally introduce better error messages around it (separate JIRA, and for other keywords it is equally as bad).

          Show
          Jonathan Coveney added a comment - You cannot, so this is not without precedent. We should document it, and ideally introduce better error messages around it (separate JIRA, and for other keywords it is equally as bad).
          Hide
          Gianmarco De Francisci Morales added a comment -

          Hi Jonathan,
          Yes, RANK is now an operator and thus a reserved keyword.
          We can add it to the release notes.

          The parser is definitely a bit rough and could use some reworking, especially in the error messages, so I am all in for it. Not sure if it is a known issue. Can you use LOAD or FOREACH as column names?

          Show
          Gianmarco De Francisci Morales added a comment - Hi Jonathan, Yes, RANK is now an operator and thus a reserved keyword. We can add it to the release notes. The parser is definitely a bit rough and could use some reworking, especially in the error messages, so I am all in for it. Not sure if it is a known issue. Can you use LOAD or FOREACH as column names?
          Hide
          Jonathan Coveney added a comment -

          Did this make rank a reserved keyword? We may need to document this as a non-backwards compatible change if it is, as many scripts use "rank" as a column name. Example:

          A = load 'thing';
          B = FOREACH (GROUP A all) GENERATE MIN(A.rank);
          

          Of all the errors you'd expect, I wasn't expecting this one:

          2012-12-18 23:18:36,142 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: <line 2, column 38> mismatched input '(' expecting SEMI_COLON
          Details at logfile: /var/log/pig/pig_1355872714665.log

          I culled this example from a larger script, and it looks like removing rank as a column name fixed it. Is this a known issue? I think we can refine the parser to work with rank in that position, but I thought it would be worth asking.

          Show
          Jonathan Coveney added a comment - Did this make rank a reserved keyword? We may need to document this as a non-backwards compatible change if it is, as many scripts use "rank" as a column name. Example: A = load 'thing'; B = FOREACH (GROUP A all) GENERATE MIN(A.rank); Of all the errors you'd expect, I wasn't expecting this one: 2012-12-18 23:18:36,142 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: <line 2, column 38> mismatched input '(' expecting SEMI_COLON Details at logfile: /var/log/pig/pig_1355872714665.log I culled this example from a larger script, and it looks like removing rank as a column name fixed it. Is this a known issue? I think we can refine the parser to work with rank in that position, but I thought it would be worth asking.
          Hide
          Olga Natkovich added a comment -

          Yes, I think that's fine - I did not realize it was covered in a separate JIRA, thanks!

          Show
          Olga Natkovich added a comment - Yes, I think that's fine - I did not realize it was covered in a separate JIRA, thanks!
          Hide
          Allan Avendaño added a comment -

          Hi Olga!

          Does PIG-2947 apply as release notes?

          Show
          Allan Avendaño added a comment - Hi Olga! Does PIG-2947 apply as release notes?
          Hide
          Olga Natkovich added a comment -

          Can you please add usage example to release notes section, thanks!

          Show
          Olga Natkovich added a comment - Can you please add usage example to release notes section, thanks!
          Hide
          Gianmarco De Francisci Morales added a comment -

          +1 for me.
          Passes local tests and manual testing.
          Committed to trunk.

          Thank Allan!

          Show
          Gianmarco De Francisci Morales added a comment - +1 for me. Passes local tests and manual testing. Committed to trunk. Thank Allan!
          Hide
          Allan Avendaño added a comment -

          New approach to set counter into the MapReduce Job with PORank operator, only if there are PORank operator(s) on roots.

          Two tests added (ten in total) with complex scripts which prove particular scenarios.

          Show
          Allan Avendaño added a comment - New approach to set counter into the MapReduce Job with PORank operator, only if there are PORank operator(s) on roots. Two tests added (ten in total) with complex scripts which prove particular scenarios.
          Hide
          Gianmarco De Francisci Morales added a comment -

          There is a regression in the latest patch.
          It does not work properly in a multi-machine environment.
          It seems that the values of the counters are not properly serialized in the JobConf.
          We need to add a test and fix the bug before committing the patch.

          Show
          Gianmarco De Francisci Morales added a comment - There is a regression in the latest patch. It does not work properly in a multi-machine environment. It seems that the values of the counters are not properly serialized in the JobConf. We need to add a test and fix the bug before committing the patch.
          Hide
          Allan Avendaño added a comment -

          All unit and e2e tests passed.

          Show
          Allan Avendaño added a comment - All unit and e2e tests passed.
          Hide
          Allan Avendaño added a comment -

          Code generated so far.
          It passed all junit and e2e test on a cluster.
          New code has been documented.

          Show
          Allan Avendaño added a comment - Code generated so far. It passed all junit and e2e test on a cluster. New code has been documented.
          Hide
          Allan Avendaño added a comment -

          Current implementation is now available for your review at https://reviews.apache.org/r/5523/diff/#index_header

          Show
          Allan Avendaño added a comment - Current implementation is now available for your review at https://reviews.apache.org/r/5523/diff/#index_header
          Hide
          Allan Avendaño added a comment -

          Here, rank operator is fully implemented (rank, dense rank and row number), now I'm working on refactoring, tests and documentation.

          Looking forward to your comments.

          Show
          Allan Avendaño added a comment - Here, rank operator is fully implemented (rank, dense rank and row number), now I'm working on refactoring, tests and documentation. Looking forward to your comments.
          Hide
          Gianmarco De Francisci Morales added a comment -

          Yes, partitioned rank can be simply group by + UDF.
          Global rank should follow the implementation blueprint that I outlined in this Jira, or something similar to make it fully scalable.

          Show
          Gianmarco De Francisci Morales added a comment - Yes, partitioned rank can be simply group by + UDF. Global rank should follow the implementation blueprint that I outlined in this Jira, or something similar to make it fully scalable.
          Hide
          Daniel Dai added a comment -

          So partitioned and non-partitioned RANK are using different implementation, right?

          Show
          Daniel Dai added a comment - So partitioned and non-partitioned RANK are using different implementation, right?
          Hide
          Gianmarco De Francisci Morales added a comment -

          No, sorry, there is a typo in my previous comment.
          What I meant is that partitioned rank is only group by + UDF.
          The main aim of this project is a distributed implementation of the global RANK, which needs to be implemented from scratch.

          Show
          Gianmarco De Francisci Morales added a comment - No, sorry, there is a typo in my previous comment. What I meant is that partitioned rank is only group by + UDF. The main aim of this project is a distributed implementation of the global RANK, which needs to be implemented from scratch.
          Hide
          Daniel Dai added a comment -

          You mean the global rank is implemented by group all + UDF? Do we have a plan for a distributed implementation in this project?

          Show
          Daniel Dai added a comment - You mean the global rank is implemented by group all + UDF? Do we have a plan for a distributed implementation in this project?
          Hide
          Gianmarco De Francisci Morales added a comment -

          My current view is that non-partitioned rank is simply a group by + a rank UDF.
          So there is no need for a separate implementation of it unless we have some performance gain. Maybe we need something specific if we plan to support nested rank.

          Show
          Gianmarco De Francisci Morales added a comment - My current view is that non-partitioned rank is simply a group by + a rank UDF. So there is no need for a separate implementation of it unless we have some performance gain. Maybe we need something specific if we plan to support nested rank.
          Hide
          Daniel Dai added a comment -

          We can use secondary sort to implement partitioned rank. However, I think partitioned rank and non-partitioned rank may have to adopt a totally different implementation. We can focus on non-partitioned rank first.

          Show
          Daniel Dai added a comment - We can use secondary sort to implement partitioned rank. However, I think partitioned rank and non-partitioned rank may have to adopt a totally different implementation. We can focus on non-partitioned rank first.
          Hide
          Allan Avendaño added a comment -

          Hi to everybody,

          I am working on this functionality for GSOC 2012, with Gianmarco as my mentor.
          I had been working on syntax, and now is recognized this syntax, recommended by Gianmarco:

          RANK <relation> ( BY <column> (ASC|DES)? )?

          I was also looking for some other functionality that can be incorporated, and on SQL Server, Oracle and Postgresql [1][2][3], it is also possible to specify a "partition" (ranking over a specific group) at the same rank operation. Gianmarco already pointed me out that it could imply some performance flaws.

          Looking forward for yours feedback/suggestion.

          References:

          [1] http://msdn.microsoft.com/en-us/library/ms176102.aspx
          [2] http://www.techonthenet.com/oracle/functions/rank.php
          [3] http://www.postgresql.org/docs/9.1/static/tutorial-window.html

          Show
          Allan Avendaño added a comment - Hi to everybody, I am working on this functionality for GSOC 2012, with Gianmarco as my mentor. I had been working on syntax, and now is recognized this syntax, recommended by Gianmarco: RANK <relation> ( BY <column> (ASC|DES)? )? I was also looking for some other functionality that can be incorporated, and on SQL Server, Oracle and Postgresql [1] [2] [3] , it is also possible to specify a "partition" (ranking over a specific group) at the same rank operation. Gianmarco already pointed me out that it could imply some performance flaws. Looking forward for yours feedback/suggestion. References: [1] http://msdn.microsoft.com/en-us/library/ms176102.aspx [2] http://www.techonthenet.com/oracle/functions/rank.php [3] http://www.postgresql.org/docs/9.1/static/tutorial-window.html
          Hide
          Gianmarco De Francisci Morales added a comment -

          Assigning to myself as per Apache guidelines as I'd like to mentor this.

          Show
          Gianmarco De Francisci Morales added a comment - Assigning to myself as per Apache guidelines as I'd like to mentor this.
          Hide
          Allan Avendaño added a comment -

          Dear all,

          Let me introduce myself, I am Allan Avendaño, student of Master Computing Engineering at Rome. I am interested into collaborate with Rank function like SQL PIG-2353 for Gsoc.

          I have been working with MR paradigm since three years, mainly with two research projects (which were part of undergraduate projects).

          One was aimed to analyze the incidence of navigability factors on websites university network, by creating a inverse correlation among them through links and citations.
          My undergraduate project was driven to solve the previous navigability problem, by establishing relations among them according to terms used and topics. Was really interesting to interleave some MR phases (some modifications to Mahout code) and Pig.

          I was checking the activities of this feature, and also an initial approach at PIG-821, I think also could be useful dense rank an nth-tile, and other statistical inference operations.

          Really thankful for your guidance and comments.

          Best Regards

          Show
          Allan Avendaño added a comment - Dear all, Let me introduce myself, I am Allan Avendaño, student of Master Computing Engineering at Rome. I am interested into collaborate with Rank function like SQL PIG-2353 for Gsoc. I have been working with MR paradigm since three years, mainly with two research projects (which were part of undergraduate projects). One was aimed to analyze the incidence of navigability factors on websites university network, by creating a inverse correlation among them through links and citations. My undergraduate project was driven to solve the previous navigability problem, by establishing relations among them according to terms used and topics. Was really interesting to interleave some MR phases (some modifications to Mahout code) and Pig. I was checking the activities of this feature, and also an initial approach at PIG-821 , I think also could be useful dense rank an nth-tile, and other statistical inference operations. Really thankful for your guidance and comments. Best Regards
          Hide
          Apurv Verma added a comment -

          Hello,
          I am an undergraduate student from India and I would be interested in working on this as a GSoC project. I have a beginner level knowledge of writing map-reduce tasks so would need help with it. I have understood the algorithm which Gianmarco has outlined in the comments.

          Show
          Apurv Verma added a comment - Hello, I am an undergraduate student from India and I would be interested in working on this as a GSoC project. I have a beginner level knowledge of writing map-reduce tasks so would need help with it. I have understood the algorithm which Gianmarco has outlined in the comments.
          Hide
          Gianmarco De Francisci Morales added a comment -

          Thanks Daniel, I am excited this Jira is going to be a candidate for GSoC, I was going to propose it myself!

          Show
          Gianmarco De Francisci Morales added a comment - Thanks Daniel, I am excited this Jira is going to be a candidate for GSoC, I was going to propose it myself!
          Hide
          David Ciemiewicz added a comment -

          There is a much more efficient way to compute RANK, DENSE_RANK, CUMULATIVE_SUM and more if you have billions of rows of data, especially if the data follows a power law/zipf distribution (like queries do). It involves using Map-Reduce to compute a histogram of the frequencies/counts and then serializing and sorting the histogram which is something like 20,000 rows for 1B queries.

          https://issues.apache.org/jira/browse/PIG-821

          Show
          David Ciemiewicz added a comment - There is a much more efficient way to compute RANK, DENSE_RANK, CUMULATIVE_SUM and more if you have billions of rows of data, especially if the data follows a power law/zipf distribution (like queries do). It involves using Map-Reduce to compute a histogram of the frequencies/counts and then serializing and sorting the histogram which is something like 20,000 rows for 1B queries. https://issues.apache.org/jira/browse/PIG-821
          Hide
          Jonathan Coveney added a comment -

          Weird, the above got garbled and I can't edit it, but I think the idea is clear.

          Show
          Jonathan Coveney added a comment - Weird, the above got garbled and I can't edit it, but I think the idea is clear.
          Hide
          Jonathan Coveney added a comment -

          I see no reason why it couldn't do both. The grammar syntax could be

          
          

          RANK relation ( BY column )? such that if you specify the column to rank by, it'll sort it, but if you don't, it just sorts it as it got it.

          Show
          Jonathan Coveney added a comment - I see no reason why it couldn't do both. The grammar syntax could be RANK relation ( BY column )? such that if you specify the column to rank by, it'll sort it, but if you don't, it just sorts it as it got it.
          Hide
          Gianmarco De Francisci Morales added a comment -

          Actually I was thinking that RANK would only do the counting and appending.
          This way you could get a sort + rank with

          B = RANK ( ORDER A BY <column> ASC);
          

          But you could also get your dataset from file and rank it directly, without any specific order

          A = LOAD 'path/to/file';
          B = RANK A;
          C = ORDER B BY <column>
          

          This, for example, gives you the permutation that was used to sort the dataset, which might be useful.
          Also, RANK would allow to create a data column that reflects the ordering that you have in your data.

          Show
          Gianmarco De Francisci Morales added a comment - Actually I was thinking that RANK would only do the counting and appending. This way you could get a sort + rank with B = RANK ( ORDER A BY <column> ASC); But you could also get your dataset from file and rank it directly, without any specific order A = LOAD 'path/to/file'; B = RANK A; C = ORDER B BY <column> This, for example, gives you the permutation that was used to sort the dataset, which might be useful. Also, RANK would allow to create a data column that reflects the ordering that you have in your data.
          Hide
          Jonathan Coveney added a comment -

          So Gianmarco, are you thinking this sort of syntax:

          A = <some relation>
          B = RANK A BY <column name> <ASC | DESC>;
          

          IE it'd just follow the order syntax, but add the rank to the end?

          And I assume your n map job would run after already sorting the, right? So first rank would run the order by, and then it would run the two jobs that would actually append the rank?

          Show
          Jonathan Coveney added a comment - So Gianmarco, are you thinking this sort of syntax: A = <some relation> B = RANK A BY <column name> <ASC | DESC>; IE it'd just follow the order syntax, but add the rank to the end? And I assume your n map job would run after already sorting the, right? So first rank would run the order by, and then it would run the two jobs that would actually append the rank?
          Hide
          Ashutosh Chauhan added a comment -

          I was also thinking of this problem of implementing statistical measures (like top-K, median, quantiles) etc. efficiently in a distributed manner which is amenable to MR framework. Rank is a basis of it. I came up with similiar outline as yours, your have laid it out well. I think this is pretty useful to be in Pig and these are kind of features which higher level language like Pig should make available to its users. Sophisticated users will expect this and this will derive adoption.
          +1 for distributed implementation of RANK in Pig.

          Show
          Ashutosh Chauhan added a comment - I was also thinking of this problem of implementing statistical measures (like top-K, median, quantiles) etc. efficiently in a distributed manner which is amenable to MR framework. Rank is a basis of it. I came up with similiar outline as yours, your have laid it out well. I think this is pretty useful to be in Pig and these are kind of features which higher level language like Pig should make available to its users. Sophisticated users will expect this and this will derive adoption. +1 for distributed implementation of RANK in Pig.
          Hide
          Gianmarco De Francisci Morales added a comment -

          My idea would be to have a distributed implementation of RANK in the following manner:

          Run a Map-only job with n mapper, each mapper just computes the number of records in each input split and accumulates it in an internal variable (or alternatively it uses dynamic counters).
          At the end, we have a map(partition_id => number_of_records).
          This map is small enough to be put in the distributed cache.
          Compute the cumulative sum of each number of records.
          Then launch a second Map-only job with exactly n mappers, each will read it's input split and the cumulative number of records preceding it, initialize the counter with this value and finally RANK the records as they come in.

          This would be a distributed implementation of RANK that could scale very well.
          I haven't figured out how to integrate it into Pig yet.

          Show
          Gianmarco De Francisci Morales added a comment - My idea would be to have a distributed implementation of RANK in the following manner: Run a Map-only job with n mapper, each mapper just computes the number of records in each input split and accumulates it in an internal variable (or alternatively it uses dynamic counters). At the end, we have a map(partition_id => number_of_records). This map is small enough to be put in the distributed cache. Compute the cumulative sum of each number of records. Then launch a second Map-only job with exactly n mappers, each will read it's input split and the cumulative number of records preceding it, initialize the counter with this value and finally RANK the records as they come in. This would be a distributed implementation of RANK that could scale very well. I haven't figured out how to integrate it into Pig yet.
          Hide
          Jonathan Coveney added a comment -

          Is there anything in the piggybank using the e2e? I'm not sure what the word is about piggybank and when to use e2e. Someone else will have to weigh in on that.

          As far as that other JIRA...you should make it and link it, though I'm curious what benefit/optimization you forsee RANK having if it has access to Pig's internals.

          Show
          Jonathan Coveney added a comment - Is there anything in the piggybank using the e2e? I'm not sure what the word is about piggybank and when to use e2e. Someone else will have to weigh in on that. As far as that other JIRA...you should make it and link it, though I'm curious what benefit/optimization you forsee RANK having if it has access to Pig's internals.
          Hide
          Gianmarco De Francisci Morales added a comment -

          Hi Jonathan,
          thanks for giving it a try!

          I think the approach is fine for an initial implementation.
          To scale it out, we need a deeper integration with Pig (i.e. it need to be an operator and not a UDF), but this is the subject for another Jira.

          Just one more comment.
          I am not sure about testing in piggybank.
          Should we use e2e testing instead of JUnit?

          Show
          Gianmarco De Francisci Morales added a comment - Hi Jonathan, thanks for giving it a try! I think the approach is fine for an initial implementation. To scale it out, we need a deeper integration with Pig (i.e. it need to be an operator and not a UDF), but this is the subject for another Jira. Just one more comment. I am not sure about testing in piggybank. Should we use e2e testing instead of JUnit?
          Hide
          Jonathan Coveney added a comment -

          This provides a rank function, which needs a sorted input. Equal tuples will get increasing rank numbers.

          Show
          Jonathan Coveney added a comment - This provides a rank function, which needs a sorted input. Equal tuples will get increasing rank numbers.

            People

            • Assignee:
              Allan Avendaño
              Reporter:
              Gianmarco De Francisci Morales
            • Votes:
              0 Vote for this issue
              Watchers:
              11 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development