Description
There are use cases in Pig:
- A directory is used as the input of a load operation. It is possible that one or more files in that directory are bad files (for example, corrupted or bad data caused by compression).
- A directory is used as the input of a load operation. The current user may not have permission to access any subdirectories or files of that directory.
The current Pig implementation will abort the whole Pig job for such cases. It would be useful to have option to allow the job to continue and ignore the bad files or inaccessible files/folders without abort the job, ideally, log or print a warning for such error or violations. This requirement is not trivial because for big data set for large analytics applications, this is not always possible to sort out the good data for processing; Ignore a few of bad files may be a better choice for such situations.
We propose to use “Ignore bad files” flag to address this problem. AvroStorage and related file format in Pig already has this flag but it is not complete to cover all the cases mentioned above. We would improve the PigStorage and related text format to support this new flag as well as improve AvroStorage and related facilities to completely support the concept.
The flag is “Storage” (For example, PigStorage or AvroStorage) based and can be set for each load operation respectively. The value of this flag will be false if it is not explicitly set. Ideally, we can provide a global pig parameter which forces the default value to true for all load functions even if it is not explicitly set in the LOAD statement.
Attachments
Attachments
Issue Links
- duplicates
-
PIG-3059 Global configurable minimum 'bad record' thresholds
- Open