HCATALOG-487: HCatalog should tolerate a user-defined amount of bad records

Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.5
    • Component/s: None
    • Labels: None

    Description

      HCatalog tasks currently fail when deserializing corrupt records. In some cases, large data sets contain a small number of corrupt records, and it's acceptable to skip them. In fact, Hadoop supports skipping bad records for exactly this reason.

      However, the Hadoop-native record-skipping feature (which Hive uses) is very coarse: it produces a large number of failed tasks, incurs task-scheduling overhead, and offers limited control over the skipping behavior.

      HCatalog should have native support for skipping a user-defined number of bad records.
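      A minimal sketch of the idea, not the actual HCATALOG-487 patch: a small counter class tracks deserialization failures and permits skipping a record only while the count stays within a user-configured threshold (the class and method names here are illustrative assumptions).

      ```java
      // Hypothetical sketch of threshold-based bad-record skipping.
      // The class name, method names, and configuration source are
      // illustrative; they are not the names used in the real patch.
      public class BadRecordTracker {
          // Maximum number of corrupt records a task may skip,
          // e.g. read from a job configuration property.
          private final long maxBadRecords;
          private long badRecords = 0;

          public BadRecordTracker(long maxBadRecords) {
              this.maxBadRecords = maxBadRecords;
          }

          /**
           * Record one corrupt record. Returns true if the record may be
           * skipped; false once the threshold is exceeded, at which point
           * the caller should fail the task as before.
           */
          public boolean permitSkip() {
              badRecords++;
              return badRecords <= maxBadRecords;
          }

          public long getBadRecordCount() {
              return badRecords;
          }

          public static void main(String[] args) {
              BadRecordTracker tracker = new BadRecordTracker(2);
              System.out.println(tracker.permitSkip()); // true  (1st bad record)
              System.out.println(tracker.permitSkip()); // true  (2nd bad record)
              System.out.println(tracker.permitSkip()); // false (threshold exceeded)
          }
      }
      ```

      A record reader would call `permitSkip()` in the catch block of its deserialization loop, logging and dropping the record on true and rethrowing on false, so a handful of corrupt records no longer fails the whole task.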

      Attachments

        1. HCATALOG-487_skip_bad_records.1.patch
          14 kB
          Travis Crawford
        2. HCATALOG-487_skip_bad_records.2.patch
          15 kB
          Travis Crawford

          People

            Assignee: Travis Crawford (traviscrawford)
            Reporter: Travis Crawford (traviscrawford)
            Votes: 0
            Watchers: 1

            Dates

              Created:
              Updated:
              Resolved: