Details
Improvement
Status: Open
Major
Resolution: Unresolved
1.6.3
None
None
Description
I am reading a path containing ORC files with Flink, but some of the files are broken.
I get exceptions like this:
org.apache.orc.FileFormatException: Not a valid ORC file /user/orc/0.orc (maxFileLength= 9223372036854775807)
    at org.apache.orc.impl.ReaderImpl.extractFileTail(ReaderImpl.java:546)
    at org.apache.orc.impl.ReaderImpl.<init>(ReaderImpl.java:370)
    at org.apache.orc.OrcFile.createReader(OrcFile.java:342)
    at org.apache.flink.orc.OrcRowInputFormat.open(OrcRowInputFormat.java:225)
    at org.apache.flink.orc.OrcRowInputFormat.open(OrcRowInputFormat.java:63)
    at org.apache.flink.runtime.operators.DataSourceTask.invoke(DataSourceTask.java:173)
    at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:705)
    at org.apache.flink.runtime.taskmanager.Task.run(Task.java:530)
I have also set the "skip corrupt data" option in my configuration:
conf.setBoolean(OrcConf.SKIP_CORRUPT_DATA.getAttribute(), true);
However, this only handles one specific case and does not skip broken files.
Is it possible to skip any kind of broken ORC file without throwing an exception, and read only the valid ones?
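Until something like this is supported, one possible workaround is to filter the file list before handing it to the input format. The following is only a minimal sketch under stated assumptions: `OrcPreFilter`, `hasOrcMagic` and `keepReadable` are hypothetical names, not part of Flink or ORC, and the check uses `java.nio` on a local file system (for HDFS the same logic would have to go through the Hadoop `FileSystem` API instead). It relies on the fact that ORC files start with the three-byte magic `ORC`, so it only catches files whose header is missing or wrong, such as empty files. It does not detect a damaged footer or stripe.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration only; not part of Flink or ORC.
final class OrcPreFilter {
    // ORC files begin with the ASCII bytes "ORC".
    private static final byte[] ORC_MAGIC = "ORC".getBytes(StandardCharsets.US_ASCII);

    // Returns true only if the file can be opened and starts with the ORC magic.
    static boolean hasOrcMagic(Path file) {
        try (InputStream in = Files.newInputStream(file)) {
            byte[] header = new byte[ORC_MAGIC.length];
            int off = 0;
            while (off < header.length) {
                int n = in.read(header, off, header.length - off);
                if (n < 0) {
                    return false; // file is shorter than the magic
                }
                off += n;
            }
            return Arrays.equals(header, ORC_MAGIC);
        } catch (IOException e) {
            return false; // unreadable or missing files are treated as broken
        }
    }

    // Keeps only the candidates that pass the header check, preserving order.
    static List<Path> keepReadable(List<Path> candidates) {
        List<Path> valid = new ArrayList<>();
        for (Path p : candidates) {
            if (hasOrcMagic(p)) {
                valid.add(p);
            }
        }
        return valid;
    }
}
```

The surviving paths could then be passed to the ORC source one by one. A stricter pre-check would try to open each file with the ORC reader itself and drop the files for which it throws, at the cost of reading every file tail twice.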