  1. Pig
  2. PIG-1926

Sample/Limit should take scalar


Details

    • Improvement
    • Status: Closed
    • Major
    • Resolution: Fixed
    • None
    • 0.10.0
    • None
    • Release Note:
      Limit and Sample now accept a scalar variable as an argument.

      For example, the new Limit syntax allows the following script to get the top 1% of a sorted file:

      a = LOAD 'a.txt';
      b = GROUP a all;
      c = FOREACH b GENERATE COUNT(a) AS sum;
      d = ORDER a BY $0;
      e = LIMIT d c.sum/100;

      Only scalar variables may be used in the Limit or Sample expression; columns of the operation's input relation cannot be used. A statement like [ e = LIMIT d $0; ] is therefore invalid.
      The new Sample command accepts the same syntax.

      Using a variable instead of a constant in Limit automatically disables most of the Limit optimizations (only push-before-foreach is still performed). More work is needed to enable the optimizations for limit-after-sort, limit duplication before cross/union, and limit merging.
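
      The Sample case is described above but not shown. A minimal sketch, assuming the same a.txt input and a hypothetical target of roughly 1000 sampled rows, might look like the following. The alias num_rows and the figure 1000 are illustrative assumptions, not taken from the patch. The (double) cast is there because Sample expects a fraction between 0 and 1, and dividing two integer values would otherwise truncate to 0.

      ```pig
      a = LOAD 'a.txt';
      b = GROUP a ALL;
      -- count the rows once; c holds a single tuple, so c.num_rows can be used as a scalar
      c = FOREACH b GENERATE COUNT_STAR(a) AS num_rows;
      -- keep each row with probability 1000/num_rows (about 1000 rows expected)
      e = SAMPLE a (double)1000/c.num_rows;
      ```

      This sketch assumes the input has more than 1000 rows; otherwise the computed fraction would exceed 1.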

    Description

      Currently, Limit and Sample take only a constant. It would be better if we could use a scalar in place of the constant. E.g.:

      a = load 'a.txt';
      b = group a all;
      c = foreach b generate COUNT(a) as sum;
      d = order a by $0;
      e = limit d c.sum/100;
      

      This is a candidate project for Google Summer of Code 2011. More information about the program can be found at http://wiki.apache.org/pig/GSoc2011

      Attachments

        1. PIG-1926.10.patch
          45 kB
          Gianmarco De Francisci Morales
        2. PIG-1926.11.patch
          51 kB
          Gianmarco De Francisci Morales
        3. PIG-1926.12.1.patch
          56 kB
          Thejas Nair
        4. PIG-1926.12.patch
          55 kB
          Gianmarco De Francisci Morales
        5. PIG-1926.7.patch
          34 kB
          Gianmarco De Francisci Morales
        6. PIG-1926.8.patch
          41 kB
          Gianmarco De Francisci Morales
        7. PIG-1926.9.patch
          47 kB
          Gianmarco De Francisci Morales
        8. PIG-1926.patch
          26 kB
          Gianmarco De Francisci Morales
        9. PIG-1926.patch
          26 kB
          Gianmarco De Francisci Morales
        10. PIG-1926.patch
          86 kB
          Gianmarco De Francisci Morales
        11. PIG-1926.patch
          76 kB
          Gianmarco De Francisci Morales
        12. PIG-1926.patch
          74 kB
          Gianmarco De Francisci Morales
        13. PIG-1926.patch
          28 kB
          Gianmarco De Francisci Morales


          People

            Assignee: Gianmarco De Francisci Morales (azaroth)
            Reporter: Daniel Dai (daijy)
            Votes: 0
            Watchers: 1

            Dates

              Created:
              Updated:
              Resolved:
