Details
- Type: Improvement
- Status: Closed
- Priority: Major
- Resolution: Fixed
Description
HCatalog tasks currently fail when deserializing corrupt records. In some cases, large data sets contain a small number of corrupt records and it is acceptable to skip them. In fact, Hadoop has built-in support for skipping bad records for exactly this reason.
However, the Hadoop-native record skipping feature (which Hive relies on) is very coarse: it produces a large number of failed task attempts, adds task scheduling overhead, and offers limited control over the skipping behavior, as the configuration sketch below illustrates.
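For reference, the coarse mechanism referred to above is the skip mode in the old mapred API, driven by the SkipBadRecords settings. A minimal sketch of how a job might enable it follows; the specific thresholds are illustrative, not taken from this issue:

    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.SkipBadRecords;

    JobConf conf = new JobConf();
    // Enter skip mode only after the same task attempt has failed twice.
    SkipBadRecords.setAttemptsToStartSkipping(conf, 2);
    // Tolerate at most one skipped record per bad range the framework narrows down.
    SkipBadRecords.setMapperMaxSkipRecords(conf, 1);

Because skip mode only kicks in after repeated task failures and re-executions, each corrupt record costs several full task attempts before it is finally skipped.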
HCatalog should have native support for skipping a user-defined number of bad records.
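One possible shape for such support is a small policy object consulted by the HCatalog record reader: it counts deserialization failures and only fails the task once a configurable threshold is exceeded. The sketch below is an assumption about how this could look; the property name hcat.input.bad.record.threshold, the class, and the default value are hypothetical, not a committed HCatalog API:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;

    /** Hypothetical policy: tracks corrupt-record counts and decides skip vs. fail. */
    public class BadRecordPolicy {
      // Hypothetical property: maximum fraction of records allowed to be corrupt.
      public static final String BAD_RECORD_THRESHOLD = "hcat.input.bad.record.threshold";

      private final float threshold;
      private long total;
      private long bad;

      public BadRecordPolicy(Configuration conf) {
        this.threshold = conf.getFloat(BAD_RECORD_THRESHOLD, 0.0001f);
      }

      /** Call once per record read; returns true if a corrupt record should be skipped. */
      public boolean onRecord(boolean corrupt) throws IOException {
        total++;
        if (!corrupt) {
          return false;
        }
        bad++;
        if ((double) bad / total > threshold) {
          throw new IOException("Too many corrupt records: " + bad + " of " + total);
        }
        return true; // still under the threshold: skip this record and keep reading
      }
    }

A record reader wrapper could catch deserialization exceptions in next(), consult the policy, and either silently skip the record or rethrow, so a handful of corrupt records no longer costs whole task attempts.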