Pig / PIG-795

Command that selects a random sample of the rows, similar to LIMIT

Details

    • Type: New Feature
    • Status: Closed
    • Priority: Trivial
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.3.0
    • Component/s: impl
    • Labels: None
    • Patch Info: Patch Available

    Description

      When working with very large data sets (imagine that!), running a Pig script can take a long time. In some situations it may be useful to run on a small subset of the data, e.g. for debugging or testing, or to get fast results even if they are less accurate.

      The command "LIMIT N" selects the first N rows of the data, but these are not necessarily randomized. A command "SAMPLE X" would retain each row only with probability X%.

      Note: it is possible to implement this feature with FILTER BY and a UDF, but the same is true of LIMIT, and LIMIT is built in.
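
The difference between the two commands can be sketched outside of Pig. The Python below is only an illustration of the semantics described above, not the code in the attached patches: the function names `limit` and `sample`, the `percent` and `seed` parameters, and the integer rows are all assumptions made for the example. It shows why sampling is equivalent to a filter with a random predicate, which is what a FILTER BY with a UDF would do.

```python
import random
from itertools import islice


def limit(rows, n):
    """Keep the first n rows, like LIMIT N: cheap, but not random."""
    return list(islice(rows, n))


def sample(rows, percent, seed=None):
    """Keep each row independently with probability percent / 100.

    This is a filter whose predicate is a random draw, so the output size
    varies from run to run and is only approximately percent% of the input.
    The seed parameter is a hypothetical convenience for repeatable runs.
    """
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    rng = random.Random(seed)
    # rng.random() is uniform on [0, 1), so the test passes with
    # probability percent / 100 for every row, independently.
    return [row for row in rows if rng.random() * 100 < percent]


rows = list(range(1000))
first_ten = limit(rows, 10)                    # always rows 0..9
about_ten_percent = sample(rows, 10, seed=42)  # roughly 100 rows, spread over the input
```

Unlike `limit`, `sample` has to read every row, but the rows it keeps are spread across the whole input instead of coming only from its beginning.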

      Attachments

        1. sample3.diff
          8 kB
          Eric Gaudet
        2. sample2.diff
          6 kB
          Eric Gaudet

          People

            Assignee: Eric Gaudet
            Reporter: Eric Gaudet
            Votes: 1
            Watchers: 1