Pig / PIG-3059

Global configurable minimum 'bad record' thresholds

    Details

    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 0.11
    • Fix Version/s: None
    • Component/s: impl
    • Labels: None

      Description

      See PIG-2614.

      Pig dies when one record in a LOAD of a billion records fails to parse. This is almost certainly not the desired behavior. elephant-bird and some other storage UDFs have minimum thresholds in terms of percent and count that must be exceeded before a job will fail outright.

      We need these limits to be configurable for Pig, globally. I've come to realize what a major problem Pig's crashing on bad records is for new Pig users. I believe this feature can greatly improve Pig.

      An example of a config would look like:

      pig.storage.bad.record.threshold=0.01
      pig.storage.bad.record.min=100

      A thorough discussion of this issue is available here: http://www.quora.com/Big-Data/In-Big-Data-ETL-how-many-records-are-an-acceptable-loss
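As a rough, hypothetical sketch (not Pig's actual implementation) of how these two settings could interact: a job survives bad records until both the absolute minimum count and the rate threshold are exceeded. Class and method names here are illustrative only.

```java
// Hypothetical sketch of enforcing the two proposed settings.
// BadRecordTracker is an illustrative name, not a real Pig class.
public class BadRecordTracker {
    private final double threshold; // pig.storage.bad.record.threshold (fraction)
    private final long minErrors;   // pig.storage.bad.record.min (absolute count)
    private long records = 0;
    private long errors = 0;

    public BadRecordTracker(double threshold, long minErrors) {
        this.threshold = threshold;
        this.minErrors = minErrors;
    }

    /** Call once per record read, before attempting to parse it. */
    public void incRecords() {
        records++;
    }

    /** Call when a record fails to parse; fails the job only past both limits. */
    public void incErrors(Throwable cause) {
        errors++;
        double rate = (double) errors / records;
        if (errors >= minErrors && rate > threshold) {
            throw new RuntimeException(
                "bad record rate " + rate + " exceeded threshold " + threshold, cause);
        }
    }

    public static void main(String[] args) {
        BadRecordTracker t = new BadRecordTracker(0.01, 100);
        for (int i = 0; i < 10_000; i++) {
            t.incRecords();
            if (i % 200 == 0) { // 0.5% bad records: under the 1% threshold
                t.incErrors(new Exception("parse failure"));
            }
        }
        System.out.println("job survived with " + t.errors + " bad records");
    }
}
```

With the example values above, a single corrupt record in a billion no longer kills the job, while a systematically broken input still fails fast.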

      1. avro_test_files-2.tar.gz
        20 kB
        Cheolsoo Park
      2. PIG-3059.patch
        24 kB
        Cheolsoo Park
      3. PIG-3059-2.patch
        33 kB
        Cheolsoo Park


          Activity

          Russell Jurney added a comment -

          High five!

          Cheolsoo Park added a comment -

          Hi Joe,

          I'm sorry if you're already working on this JIRA, but I was working with a customer who has issues with bad input files, so I thought I might just solve it once and for all.

          As per your suggestion, I moved the error-handling code to PigRecordReader, so now the threshold should work for any record reader, including PigAvroRecordReader. I uploaded the patch to the RB for your review:
          https://reviews.apache.org/r/8765/

          Please let me know what you think.

          Thanks!

          Cheolsoo Park added a comment -

          In addition to applying the patch, the following commands should be run to commit it:
          wget https://issues.apache.org/jira/secure/attachment/12555546/test_avro_files.tar.gz
          tar -xf test_avro_files.tar.gz
          svn rm contrib/piggybank/java/src/test/java/org/apache/pig/piggybank/test/storage/avro/avro_test_files/test_corrupted_file.avro
          svn add contrib/piggybank/java/src/test/java/org/apache/pig/piggybank/test/storage/avro/avro_test_files/test_corrupted_file
          svn rm contrib/piggybank/java/src/test/java/org/apache/pig/piggybank/test/storage/avro/avro_test_files/expected_testCorruptedFile.avro
          svn add contrib/piggybank/java/src/test/java/org/apache/pig/piggybank/test/storage/avro/avro_test_files/expected_testCorruptedFile2.avro
          svn add contrib/piggybank/java/src/test/java/org/apache/pig/piggybank/test/storage/avro/avro_test_files/expected_testCorruptedFile3.avro

          Russell Jurney added a comment -

          Awesome!

          Show
          Russell Jurney added a comment - Awesome!
          Hide
          Russell Jurney added a comment -

          I have a couple comments here:

          1) We need good reporting on how many records are bad, in terms of percent, count, and which split/file they came from.
          2) We need to log/store the bad records somewhere, so we can view/debug them and build a more robust loadfunc, ETL process, etc.

          Russell Jurney added a comment -

          I actually just reviewed this; I think that's my first time doing a code review.

          1) It would be cool if InputErrorTracker could print the error rate at the end of the job. Not sure I saw this? This would be my only suggestion that couldn't be pushed to another JIRA.

          2) Adding the file the error came from would be helpful. I imagine knowing error distribution would be really helpful.

          3) If there is any way for InputErrorTracker to write the exact input record out to an error file, that would be awesome.

          Cheolsoo Park added a comment -

          Hi Russell, thank you very much for the review.

          I think that your suggestions are all good, and I will try to incorporate them in a new patch.

          Regarding the error rate, I actually found a problem in the current implementation. Currently, when we encounter a bad input split, I skip all the remaining records in that split. This means we don't really count the total number of records. So if we calculate the error rate as (errors thrown) / (records processed) * 100, the result won't be accurate (unless each split includes only one record).

          So I am thinking that I will count the number of splits instead of records, which can be counted correctly. Please let me know if you have a better suggestion.

          Thanks!
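As a rough numerical illustration of the inaccuracy described above (all numbers are assumed for the example, not taken from a real job):

```java
// Illustration of why (errors / records processed) misleads when the rest
// of a bad split is skipped. All quantities below are assumed example values.
public class ErrorRateExample {
    public static void main(String[] args) {
        long recordsPerSplit = 100_000;  // assumed records per split
        long splits = 10;
        long badSplits = 1;              // one split hits a bad record...
        long recordsReadInBadSplit = 1;  // ...and its remaining records are skipped

        long recordsProcessed =
            (splits - badSplits) * recordsPerSplit + recordsReadInBadSplit;
        double recordRate = 1.0 / recordsProcessed * 100; // 1 error over records read
        double splitRate = (double) badSplits / splits * 100;

        // The record-based rate hides the loss: ~0.0001% vs 10% of splits lost.
        System.out.printf("record-based: %.5f%%, split-based: %.1f%%%n",
            recordRate, splitRate);
    }
}
```

Because skipped records are never counted in the denominator, the record-based rate stays tiny even when a tenth of the input is discarded, which is the motivation for counting splits instead.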

          Russell Jurney added a comment -

          Is it possible to change the strategy we use when we encounter a bad record? Does the one record mess up the rest of the read, or can we reasonably resume?

          Cheolsoo Park added a comment -

          I think it depends on the file format. But for Avro, one case that we should handle is when a sync() call throws an exception. In this case, we can't really find the next position where we can resume the read. Given that we're implementing this logic in PigRecordReader (a wrapper class for underlying record readers), I don't think that skipping records, not splits, is always possible. Please correct me if I am wrong.

          Thanks!

          Cheolsoo Park added a comment -

          I am uploading a new patch that includes the following changes:

          • The error rate is printed as part of job stats. (See the last two lines.)
            Counters:
            Total records written : 1
            Total bytes written : 10
            Spillable Memory Manager spill count : 0
            Total bags proactively spilled: 0
            Total records proactively spilled: 0
            Total input splits processed : 2
            Total bad input splits : 1 ( 50.0% )
            

            To do this, I made changes to the PigStats interface. Please let me know if this is not recommended.

          • The error message is improved. Now the location of the bad split that causes the run-time exception is printed.
            Backend error : error while reading input split: hdfs://cheolsoo-mr1-0.ent.cloudera.com:8020/user/cheolsoo/bad.avro:0+81
            
          • As discussed, InputErrorTracker counts the number of splits instead of records.
          • Since the ignore_bad_files option in AvroStorage is already in branch-0.11, I decided to not delete it for backward compatibility. When the ignore_bad_files option is enabled, it is equivalent to setting pig.load.bad.split.threshold to 1.0.
          • Lastly, I am uploading a new tarball for new test cases. To run them, the following commands should be executed:
            tar -xf test_avro_files-2.tar.gz
            svn rm contrib/piggybank/java/src/test/java/org/apache/pig/piggybank/test/storage/avro/avro_test_files/test_corrupted_file.avro
            svn add contrib/piggybank/java/src/test/java/org/apache/pig/piggybank/test/storage/avro/avro_test_files/test_corrupted_file
            svn rm contrib/piggybank/java/src/test/java/org/apache/pig/piggybank/test/storage/avro/avro_test_files/expected_testCorruptedFile.avro
            svn add contrib/piggybank/java/src/test/java/org/apache/pig/piggybank/test/storage/avro/avro_test_files/expected_testCorruptedFile2.avro
            svn add contrib/piggybank/java/src/test/java/org/apache/pig/piggybank/test/storage/avro/avro_test_files/expected_testCorruptedFile3.avro
            svn add contrib/piggybank/java/src/test/java/org/apache/pig/piggybank/test/storage/avro/avro_test_files/expected_testCorruptedFile4.avro
            

          Thanks!

          Russell Jurney added a comment -

          Is it possible at all to provide an optional interface that counts bad records, as well?

          At the moment I don't understand the issues around resuming after bad records. I believe some readers can handle one bad record and continue; I think elephant-bird works this way with some formats? That being the case, if possible, we should accommodate bad record counts, as well as bad input splits, in the reporting. Not sure if that makes sense, but check this out: https://github.com/kevinweil/elephant-bird/blob/master/core/src/main/java/com/twitter/elephantbird/mapreduce/input/LzoRecordReader.java

          void incErrors(Throwable cause) {
            numErrors++;
            if (numErrors > numRecords) {
              // incorrect use of this class
              throw new RuntimeException("Forgot to invoke incRecords()?");
            }

            if (cause == null) {
              cause = new Exception("Unknown error");
            }

            if (errorThreshold <= 0) { // no errors are tolerated
              throw new RuntimeException("error while reading input records", cause);
            }

            LOG.warn("Error while reading an input record ("
                + numErrors + " out of " + numRecords + " so far): ", cause);

            double errRate = numErrors / (double) numRecords;

            // will always excuse the first error. We can decide if single
            // error crosses threshold inside close() if we want to.
            if (numErrors >= minErrors && errRate > errorThreshold) {
              LOG.error(numErrors + " out of " + numRecords
                  + " crosses configured threshold (" + errorThreshold + ")");
              throw new RuntimeException("error rate while reading input records crossed threshold", cause);
            }
          }
          Also, when we report a bad split, is it possible to say how large that split is? If we could report the total size and the size of the inputsplits lost, that gives a user better context as to the overall %. If it is hard to implement the total size of the job, a user might compare the size of the lost inputsplits with 'hadoop fs -ls /my/input' and see the size difference?

          Dmitriy V. Ryaboy added a comment -

          I agree with the principle that inspired this patch, but the solution seems to fall short of ideal.

          Dealing in splits is misleading and hard to reason about:

          • Good records read from a split that contains a bad record still get processed, so it's not the case that a "bad split" is ignored, even though you are controlling how many bad splits to ignore.
          • A single bad record stops the whole rest of the split from being processed, whether or not your loader could recover. This is unnecessary data loss.
          • Most users of Pig have no idea (or should have no idea) what a split is.
          • Pig combines splits – but this deals with pre-combination splits. Especially when combining small but unequal files, splits are very different from each other, and some may contain 100 records while others contain 100,000 records.

          This all means that no matter what the user sets these values to, they actually have no idea what error threshold they are telling Pig to ignore.

          I think the Elephant-Bird way of dealing with errors – a minimal threshold of record errors plus a percentage of total records read – is quite robust and easy to explain. If Avro can't recover from a bad record in a single split, it can do whatever is appropriate for Avro – estimate how many records it's dropping and throw that many exceptions, or just pretend that this one error is all that was left in the split, or maybe fix the format so that it can recover properly (OK, that was a troll comment).

          Cheolsoo Park added a comment -

          Hi Russell and Dmitriy, thank you for your comments/suggestions.

          Sure, agreed. I will give it more thought and try to upload a new patch.

          Joseph Adler added a comment -

          Sorry to take so long to get back to this. It was a long break from work...

          Thanks so much for taking this over. I like the way you've implemented this.

          Russell Jurney added a comment -

          Regarding Avro, in reading https://github.com/apache/avro/blob/trunk/lang/java/avro/src/main/java/org/apache/avro/file/DataFileReader.java - it looks like you can still sync to the next record after most bad reads. We should do so.

          You're right about a bad sync halting things, but in the case of a bad sync - you might try advancing by some amount using seek() and then sync'ing again? I think this would work. I could be wrong, but in looking at how seeks work - I think that would be OK. Kinda neat, maybe? Worst case, we would only throw out input splits on a bad sync(), not a bad read(). length() should help, as might pastSync(), skip(), and available().
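To make the seek-then-resync idea concrete, here is a minimal, untested sketch. SeekableRecordReader is a hypothetical stand-in for (not a copy of) Avro's DataFileReader-style API, and SKIP_BYTES is an assumed skip granularity:

```java
// Sketch of "seek ahead and re-sync" recovery after a bad read.
// SeekableRecordReader is a hypothetical interface, loosely modeled on the
// kind of API Avro's DataFileReader exposes; it is not a real Avro type.
interface SeekableRecordReader {
    long tell();                    // current byte position
    long length();                  // total input length in bytes
    void sync(long position);       // advance to the next sync marker at/after position
    Object next() throws Exception; // read one record; may throw on corruption
    boolean hasNext();
}

public class ResyncingReader {
    private static final long SKIP_BYTES = 4096; // assumed skip granularity

    /** Try to read the next record, skipping forward past corruption if needed. */
    public static Object nextOrResync(SeekableRecordReader r) {
        while (r.hasNext()) {
            try {
                return r.next();
            } catch (Exception bad) {
                long pos = r.tell();
                if (pos + SKIP_BYTES >= r.length()) {
                    return null; // too close to EOF to recover; give up on the split
                }
                r.sync(pos + SKIP_BYTES); // jump past the damage to the next marker
            }
        }
        return null;
    }
}
```

The point of the sketch is just that a bad read() need not discard the rest of the split: only a failure to find any further sync marker would.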

          I agree with Dmitriy's feedback, thanks for taking the time.


  People

  • Assignee: Unassigned
  • Reporter: Russell Jurney
  • Votes: 0
  • Watchers: 6