Description
When working with very large data sets (imagine that!), running a pig script can take time. It may be useful to run on a small subset of the data in some situations (eg: debugging / testing, or to get fast results even if less accurate.)
The command "LIMIT N" selects the first N rows of the data, but these are not necessarily randomzed. A command "SAMPLE X" would retain the row only with the probability x%.
Note: it is possible to implement this feature with FILTER BY and an UDF, but so is LIMIT, and limit is built-in.