[PIG-2014] SAMPLE shouldn't be pushed up - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 0.9.0, 0.10.0
Fix Version/s: 0.9.0, 0.10.0
Component/s: None
Labels: None

Release Note:

A new annotation, @Nondeterministic, is introduced to allow UDF authors to mark their UDFs as non-deterministic.

A non-deterministic UDF is one that can produce different results when invoked on the same input. Examples include getCurrentTime() and RANDOM.

Certain Pig optimizations depend on UDFs being deterministic. It is therefore very important for correctness that non-deterministic UDFs be annotated as such.


Description

Consider the following code:

tfidf_all = LOAD '$TFIDF' AS (doc_id:chararray, token:chararray, weight:double);
grouped   = GROUP tfidf_all BY doc_id;
vectors   = FOREACH grouped GENERATE group AS doc_id, tfidf_all.(token, weight) AS vector;
DUMP vectors;

This, of course, runs just fine. In a real example, tfidf_all contains 1,428,280 records. The number of reduce output records should equal the number of documents, which turns out to be 18,863 in this case. All well and good.

The strangeness comes when you add a SAMPLE command:

sampled = SAMPLE vectors 0.0012;
DUMP sampled;

Running this results in 1,513 reduce output records. The number of reduce output records should be much closer to 22 or 23 (e.g. 0.0012 * 18,863).
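The expected counts can be checked with a few lines of arithmetic. This is only an illustrative sketch in Python using the figures quoted in this report; the variable names are made up for the example and are not part of Pig.

```python
# Figures quoted in the report.
total_records = 1_428_280  # rows in tfidf_all
total_docs = 18_863        # rows produced by GROUP tfidf_all BY doc_id
sample_rate = 0.0012       # argument given to SAMPLE

# If SAMPLE runs after the GROUP, each grouped row is kept with
# probability sample_rate, so the expected output size is:
expected_after_group = sample_rate * total_docs

# If the sampling filter runs before the GROUP, it is the raw rows
# that are kept with probability sample_rate instead:
expected_rows_before_group = sample_rate * total_records

print(f"{expected_after_group:.1f}")        # about 22.6 grouped rows
print(f"{expected_rows_before_group:.1f}")  # about 1713.9 raw rows
```

Roughly 1,714 surviving raw rows can fall into at most that many distinct documents, which is in line with the 1,513 reduce output records observed rather than the 22 or 23 expected.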

Evidently, Pig rewrites SAMPLE into a filter and then pushes that filter in front of the GROUP. It shouldn't push that filter, since the UDF in that filter is non-deterministic.
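To see why moving the filter changes the result, here is a small simulation. It is a sketch in Python, not Pig's implementation: the helper names (sample, group_by_doc, run), the synthetic data sizes, and the fixed seed are all assumptions chosen for illustration. It models SAMPLE as a filter that keeps each row when a random draw falls below the rate.

```python
import random
from collections import defaultdict


def sample(rows, rate, rng):
    """Keep each row independently with probability `rate` (a random filter)."""
    return [row for row in rows if rng.random() < rate]


def group_by_doc(rows):
    """Mimic GROUP ... BY doc_id: one output row per distinct doc_id."""
    groups = defaultdict(list)
    for doc_id, token, weight in rows:
        groups[doc_id].append((token, weight))
    return list(groups.items())


def run(num_docs=2000, tokens_per_doc=75, rate=0.0012, seed=0):
    rng = random.Random(seed)
    # Synthetic stand-in for tfidf_all: (doc_id, token, weight) rows.
    rows = [
        (f"doc{d}", f"tok{t}", rng.random())
        for d in range(num_docs)
        for t in range(tokens_per_doc)
    ]
    # Plan as written in the script: GROUP first, then SAMPLE the groups.
    intended = sample(group_by_doc(rows), rate, rng)
    # Plan after the filter is pushed up: SAMPLE raw rows, then GROUP.
    pushed_up = group_by_doc(sample(rows, rate, rng))
    return len(intended), len(pushed_up)


intended_count, pushed_up_count = run()
print("sample after group: ", intended_count)
print("sample before group:", pushed_up_count)
```

With these assumed sizes the intended plan keeps only a handful of documents, while the pushed-up plan keeps on the order of a hundred or more. In the pushed-up plan each surviving document also carries only its sampled tokens rather than its full vector, so the two plans differ in content as well as in count.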

Quick fix: adding "-t PushUpFilter" to the command line when invoking pig turns off that optimization, so this won't happen.

Attachments

PIG-2014.5.patch | 12/May/11 09:51 | 32 kB | Dmitriy V. Ryaboy
PIG-2014.4.patch | 12/May/11 09:47 | 38 kB | Dmitriy V. Ryaboy
PIG-2014.3.patch | 12/May/11 07:58 | 2 kB | Daniel Dai
PIG-2014.2.patch | 11/May/11 15:31 | 11 kB | Dmitriy V. Ryaboy
PIG-2014.patch | 10/May/11 06:24 | 7 kB | Dmitriy V. Ryaboy

Issue Links

relates to: PIG-2137 SAMPLE should not be pushed above DISTINCT (Closed)

People

Assignee: Dmitriy V. Ryaboy

Reporter: Jacob Perkins

Votes: 0

Watchers: 2

Dates

Created: 26/Apr/11 14:35

Updated: 04/Aug/11 00:34

Resolved: 12/May/11 10:01