Hadoop Common / HADOOP-3144

better fault tolerance for corrupted text files


Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.15.3
    • Fix Version/s: 0.18.0
    • Component/s: None
    • Labels: None
    • Hadoop Flags: Reviewed

    Description

      Every once in a while we encounter corrupted text files (corrupted at the source, prior to copying into Hadoop). Inevitably, some of the data looks like a really, really long line, and Hadoop trips over it while trying to buffer the entire line into an in-memory object, failing with an OutOfMemoryError. The code looks the same in trunk as well.

      So we are looking for an option on TextInputFormat (and the like) to ignore overly long lines. Ideally, we would just skip errant lines above a certain size limit.
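      The skip-long-lines idea the description asks for can be sketched as a minimal, self-contained reader that caps how much of any one line it will ever buffer. This is an illustration only, not Hadoop's actual LineRecordReader; the class and method names below are hypothetical, and the real fix (see the attached patches) integrates a configurable limit into the record reader itself.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

/**
 * Illustrative reader that skips any line longer than maxLength characters,
 * so a single corrupt "line" can never exhaust memory: at most maxLength
 * characters of the current line are ever buffered.
 */
public class LengthLimitedLineReader {

    public static List<String> readLines(Reader in, int maxLength) throws IOException {
        List<String> lines = new ArrayList<>();
        StringBuilder buf = new StringBuilder();
        boolean tooLong = false;   // true while discarding the rest of an oversized line
        int c;
        while ((c = in.read()) != -1) {
            if (c == '\n') {
                if (!tooLong) {
                    lines.add(buf.toString());
                }
                buf.setLength(0);
                tooLong = false;
            } else if (!tooLong) {
                if (buf.length() >= maxLength) {
                    // Line exceeded the limit: drop what we buffered and
                    // keep consuming (without buffering) until the newline.
                    buf.setLength(0);
                    tooLong = true;
                } else {
                    buf.append((char) c);
                }
            }
        }
        // Handle a final line with no trailing newline.
        if (buf.length() > 0 && !tooLong) {
            lines.add(buf.toString());
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        // One good line, one 100-character "corrupt" line, one more good line.
        String data = "ok\n" + "x".repeat(100) + "\nalso ok\n";
        List<String> lines = readLines(new StringReader(data), 20);
        System.out.println(lines); // prints [ok, also ok]
    }
}
```

      The key property is that the errant line is consumed character by character and never materialized, which is exactly what distinguishes "skip lines above a size limit" from "read the line, then check its length".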

      Attachments

        1. 3144-7.patch
          8 kB
          Zheng Shao
        2. 3144-6.patch
          7 kB
          Zheng Shao
        3. 3144-5.patch
          7 kB
          Zheng Shao
        4. 3144-4.patch
          8 kB
          Zheng Shao
        5. 3144-ignore-spaces-3.patch
          8 kB
          Zheng Shao
        6. 3144-ignore-spaces-2.patch
          8 kB
          Zheng Shao

        Activity

          People

            Assignee: zshao (Zheng Shao)
            Reporter: jsensarma (Joydeep Sen Sarma)
            Votes: 0
            Watchers: 1

            Dates

              Created:
              Updated:
              Resolved: