  ORC / ORC-633

Skip broken ORC files when reading


Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.6.3
    • Fix Version/s: None
    • Component/s: Reader
    • Labels: None

    Description

      I am reading a path containing ORC files with Flink. However, some of the files are broken.

      I get exceptions like this:

      org.apache.orc.FileFormatException: Not a valid ORC file /user/orc/0.orc (maxFileLength= 9223372036854775807) 
      at org.apache.orc.impl.ReaderImpl.extractFileTail(ReaderImpl.java:546) 
      at org.apache.orc.impl.ReaderImpl.<init>(ReaderImpl.java:370) 
      at org.apache.orc.OrcFile.createReader(OrcFile.java:342) 
      at org.apache.flink.orc.OrcRowInputFormat.open(OrcRowInputFormat.java:225) 
      at org.apache.flink.orc.OrcRowInputFormat.open(OrcRowInputFormat.java:63) 
      at org.apache.flink.runtime.operators.DataSourceTask.invoke(DataSourceTask.java:173) 
      at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:705) 
      at org.apache.flink.runtime.taskmanager.Task.run(Task.java:530)

      I have also enabled the "skip corrupt data" option in my configuration:

      conf.setBoolean(OrcConf.SKIP_CORRUPT_DATA.getAttribute(), true);

      However, this option only handles one specific case; it does not skip broken files.

      Would it be possible for the reader to skip any kind of broken ORC file instead of throwing an exception, and return only the data from the valid ones?
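
      Until the reader supports this, one possible workaround is to filter the input files before handing them to the job. The sketch below is only an illustration under several assumptions: the files are on a local filesystem reachable through java.nio (for HDFS the same check would have to go through the Hadoop FileSystem API), and a file counts as "plausible" if it is longer than the 3-byte "ORC" magic and starts with it. The class and method names (OrcPreFilter, looksLikeOrc, keepPlausibleOrcFiles) are hypothetical and not part of ORC or Flink. This is a cheap heuristic, not full validation: a file with a valid header but a truncated footer would still pass.

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of ORC or Flink.
final class OrcPreFilter {
    // ORC files begin with the three ASCII bytes "ORC".
    private static final byte[] MAGIC = "ORC".getBytes(StandardCharsets.US_ASCII);

    // Returns true only if the file exists, is longer than the magic,
    // and starts with the magic bytes. Any I/O problem counts as "broken".
    static boolean looksLikeOrc(Path file) {
        try {
            if (!Files.isRegularFile(file) || Files.size(file) <= MAGIC.length) {
                return false;
            }
            try (DataInputStream in = new DataInputStream(Files.newInputStream(file))) {
                byte[] head = new byte[MAGIC.length];
                in.readFully(head); // throws EOFException if the file is too short
                return Arrays.equals(head, MAGIC);
            }
        } catch (IOException e) {
            return false;
        }
    }

    // Keeps the candidates that pass the header check, preserving input order.
    static List<Path> keepPlausibleOrcFiles(List<Path> candidates) {
        List<Path> kept = new ArrayList<>();
        for (Path candidate : candidates) {
            if (looksLikeOrc(candidate)) {
                kept.add(candidate);
            }
        }
        return kept;
    }
}
```

      Only the paths returned by keepPlausibleOrcFiles would then be passed to the input format. A stricter variant could open each file with OrcFile.createReader inside a try/catch for FileFormatException, at the cost of reading every file tail twice.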


          People

            Assignee: Unassigned
            Reporter: Nikola (nikobearrr)