[PIG-3642] Direct HDFS access for small jobs (fetch) - ASF JIRA

Details

Type: Improvement
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 0.13.0
Component/s: None
Labels: None

Release Note:

When the DUMP operator is used to execute Pig Latin statements, Pig can minimize latency by reading data directly from HDFS rather than launching MapReduce jobs.

Direct fetch is turned on by default. To turn it off, set the property opt.fetch to false or start Pig with the "-N" or "-no_fetch" option.


Description

With this patch I'd like to make it possible to read data directly from HDFS instead of launching MR jobs for simple (map-only) tasks. Hive already has this feature (fetch). This patch shares some similarities with the local mode of Pig 0.6. Here, fetching kicks in when all of the following hold for a script:

- it contains only LIMIT, FILTER, UNION (if no split is generated), STREAM, (nested) FOREACH with expression operators, custom UDFs, etc.
- no scalar aliases
- no SampleLoader
- single leaf job
- DUMP (no STORE)
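
As a rough illustration of the conditions above, a script along these lines would be a candidate for direct fetch. The input path, schema and aliases are made up for the example, and whether fetch is actually chosen depends on the Pig version and the compiled plan.

```pig
-- Hypothetical input file and schema, for illustration only.
A = LOAD 'input/users.txt' USING PigStorage('\t') AS (name:chararray, age:int);
B = FILTER A BY age > 30;      -- FILTER is on the allowed list
C = FOREACH B GENERATE name;   -- FOREACH with a simple expression
D = LIMIT C 10;                -- LIMIT is on the allowed list
DUMP D;                        -- single leaf, DUMP rather than STORE
```

Ending the script with a STORE instead of DUMP, or adding an operator that is not map-only (a GROUP or JOIN, say), would fall outside the listed conditions, so the script would presumably run as regular MapReduce job(s).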

The feature is enabled by default and can be toggled with:

-N or -no_fetch
set opt.fetch true/false;
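
For instance, the property form might be used in a script or the Grunt shell to switch fetch off before a DUMP. This is only a sketch; the path and aliases are hypothetical.

```pig
-- Disable direct fetch for the statements that follow.
set opt.fetch false;

A = LOAD 'input/users.txt' AS (name:chararray, age:int);  -- hypothetical path
B = LIMIT A 5;
DUMP B;   -- with fetch disabled, this is expected to launch a MapReduce job
```

The command-line equivalent is to start Pig with -N (or -no_fetch), for example pig -N script.pig, where script.pig is a placeholder name.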

There's no STORE support because I wanted to make it explicit that this "optimization" is for launching small/simple scripts during development, rather than for querying and filtering a large number of rows on the client machine. However, a threshold on the (estimated) input size could be used to decide whether to prefer fetch over MR jobs, similar to what Hive's 'hive.fetch.task.conversion.threshold' does (perhaps through Pig's LoadMetadata#getStatistics?).

Attachments

PIG-3642.patch, 26/Dec/13 23:15, 64 kB, Lorand Bendig
PIG-3642-3.patch, 05/Jan/14 01:12, 72 kB, Cheolsoo Park
PIG-3642-4.patch, 12/Jan/14 08:59, 73 kB, Lorand Bendig
PIG-3642-5.patch, 30/Jan/14 00:45, 80 kB, Lorand Bendig
PIG-3642-6.patch, 02/Feb/14 05:34, 82 kB, Cheolsoo Park

Issue Links

is related to

PIG-2864 LOAD with FILTER should read data without MapReduce (Resolved)

relates to

PIG-4135 Fetch optimization should be disabled if plan contains no limit (Closed)
PIG-3740 Document direct fetch optimization (Closed)

People

Assignee: Lorand Bendig
Reporter: Lorand Bendig
Votes: 0
Watchers: 6

Dates

Created: 26/Dec/13 23:13
Updated: 20/Aug/14 23:34
Resolved: 03/Feb/14 04:02