  1. Pig
  2. PIG-1926

Sample/Limit should take scalar

Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.10.0
    • None
    • Release Note:
      Limit and Sample now accept a variable (scalar) as argument.

      For example, the new Limit command allows the following syntax to get the top 1% of a sorted file:
        a = LOAD 'a.txt';
        b = GROUP a all;
        c = FOREACH b GENERATE COUNT(a) AS sum;
        d = ORDER a BY $0;
        e = LIMIT d c.sum/100;

      Only scalar variables may be used in the expression given to Limit or Sample; columns of the operation's input relation cannot be used in it. A statement like [ e = LIMIT d $0; ] is invalid.
      The new Sample command allows for the same syntax.

      Using a variable instead of a constant in Limit automatically disables most of the optimizations (only push-before-foreach is performed). More work is needed to enable optimizations for limit-after-sort, limit duplication before cross/union and limit merging.
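
      As an illustration of the Sample case mentioned above, the following is a minimal sketch, not taken from the patch. It assumes the same input file 'a.txt' as the Limit example; the alias names and the target of roughly 1000 sampled rows are hypothetical choices made for the example.

```pig
-- Sketch only: alias names and the 1000-row target are illustrative assumptions.
a = LOAD 'a.txt';
b = GROUP a ALL;
-- c holds a single row, so c.num_rows can be used as a scalar.
c = FOREACH b GENERATE COUNT_STAR(a) AS num_rows;
-- Sampling fraction computed at run time; the cast avoids integer division.
d = SAMPLE a (double)1000/c.num_rows;
-- Not valid: $0 is a column of the input relation, not a scalar.
-- e = SAMPLE a $0;
```

      Because Sample is probabilistic, the number of rows in d should only be expected to be close to the target, not exactly equal to it.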

    Description

      Currently, Limit and Sample only take a constant. It would be better if a scalar could be used in place of the constant. For example:

      a = load 'a.txt';
      b = group a all;
      c = foreach b generate COUNT(a) as sum;
      d = order a by $0;
      e = limit d c.sum/100;
      

      This is a candidate project for Google Summer of Code 2011. More information about the program can be found at http://wiki.apache.org/pig/GSoc2011

      Attachments

        1. PIG-1926.patch
          28 kB
          Gianmarco De Francisci Morales
        2. PIG-1926.patch
          74 kB
          Gianmarco De Francisci Morales
        3. PIG-1926.patch
          76 kB
          Gianmarco De Francisci Morales
        4. PIG-1926.patch
          86 kB
          Gianmarco De Francisci Morales
        5. PIG-1926.patch
          26 kB
          Gianmarco De Francisci Morales
        6. PIG-1926.patch
          26 kB
          Gianmarco De Francisci Morales
        7. PIG-1926.9.patch
          47 kB
          Gianmarco De Francisci Morales
        8. PIG-1926.8.patch
          41 kB
          Gianmarco De Francisci Morales
        9. PIG-1926.7.patch
          34 kB
          Gianmarco De Francisci Morales
        10. PIG-1926.12.patch
          55 kB
          Gianmarco De Francisci Morales
        11. PIG-1926.12.1.patch
          56 kB
          Thejas Nair
        12. PIG-1926.11.patch
          51 kB
          Gianmarco De Francisci Morales
        13. PIG-1926.10.patch
          45 kB
          Gianmarco De Francisci Morales

            People

              Assignee: Gianmarco De Francisci Morales (azaroth)
              Reporter: Daniel Dai (daijy)
              Votes: 0
              Watchers: 1
