Hadoop Common / HADOOP-17755

EOF reached error reading ORC file on S3A


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 3.2.0
    • Fix Version/s: None
    • Component/s: fs/s3
    • Labels: None
    • Environment: Hadoop 3.2.0

    Description

      Hi, I am trying to run some transformations using Spark 3.1.1 with Hadoop 3.2 on K8s, reading the data through s3a.

      I have around 700 GB of data to read and around 200 executors (5 vCores and 30 GB each).

      The job is able to read most of the files in the problematic stage (Scan orc => Filter => Project), but it fails on a few files at the end with the error below. The file named in the error is around 140 MB, and all the other files are of similar size.
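      For a rough sense of scale, the figures above can be turned into a quick estimate. This is only a sketch: it assumes decimal units (700 GB = 700,000 MB), that every file is about 140 MB as described, and that each vCore runs one task at a time. The class and method names are made up for illustration.

```java
public class ScanScale {
    /** Concurrent task slots, assuming one running task per vCore. */
    public static long taskSlots(long executors, long coresPerExecutor) {
        return executors * coresPerExecutor;
    }

    /** Approximate number of files in the dataset (integer division with round-up). */
    public static long approxFileCount(long totalMegabytes, long fileMegabytes) {
        return (totalMegabytes + fileMegabytes - 1) / fileMegabytes;
    }

    public static void main(String[] args) {
        long slots = taskSlots(200, 5);               // 200 executors x 5 vCores
        long files = approxFileCount(700_000L, 140L); // 700 GB split into ~140 MB files
        System.out.println(slots + " slots, ~" + files + " files");
    }
}
```

      Under those assumptions, up to about 1000 tasks may be reading from S3 at the same time, across roughly 5000 files.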

      I am able to read and rewrite the specific file mentioned in the error, which suggests the file is not corrupted.

      Let me know if further information is required.

       

      java.io.IOException: Error reading file: s3a://<bucket-with-prefix>/part-00001-5e22a873-82a5-4781-9eb9-473b483396bd.c000.zlib.orc
          at org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1331)
          at org.apache.orc.mapreduce.OrcMapreduceRecordReader.ensureBatch(OrcMapreduceRecordReader.java:78)
          at org.apache.orc.mapreduce.OrcMapreduceRecordReader.nextKeyValue(OrcMapreduceRecordReader.java:96)
          at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:37)
          at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
          at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93)
          at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
          at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:511)
          at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
          at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
          at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:177)
          at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
          at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
          at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
          at org.apache.spark.scheduler.Task.run(Task.scala:131)
          at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
          at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
          at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
          at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
          at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
          at java.base/java.lang.Thread.run(Unknown Source)
      Caused by: java.io.EOFException: End of file reached before reading fully.
          at org.apache.hadoop.fs.s3a.S3AInputStream.readFully(S3AInputStream.java:702)
          at org.apache.hadoop.fs.FSDataInputStream.readFully(FSDataInputStream.java:111)
          at org.apache.orc.impl.RecordReaderUtils.readDiskRanges(RecordReaderUtils.java:566)
          at org.apache.orc.impl.RecordReaderUtils$DefaultDataReader.readFileData(RecordReaderUtils.java:285)
          at org.apache.orc.impl.RecordReaderImpl.readPartialDataStreams(RecordReaderImpl.java:1237)
          at org.apache.orc.impl.RecordReaderImpl.readStripe(RecordReaderImpl.java:1105)
          at org.apache.orc.impl.RecordReaderImpl.advanceStripe(RecordReaderImpl.java:1256)
          at org.apache.orc.impl.RecordReaderImpl.advanceToNextRow(RecordReaderImpl.java:1291)
          at org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1327)
          ... 20 more
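      The root cause in the trace is an `EOFException` raised from a `readFully` call. As a reminder of what that contract generally means, here is a minimal sketch using the JDK's `java.io.DataInputStream` as a stand-in. It is not the S3A implementation, and the class and helper names are hypothetical. The point it illustrates is that a `readFully`-style read must fill the whole requested range, so a stream that ends early surfaces as `EOFException` rather than as a short read.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;

public class ReadFullyDemo {
    /**
     * Tries to read exactly `wanted` bytes from an in-memory source.
     * Returns true if the buffer was filled, false if the stream ended first.
     */
    public static boolean tryReadFully(byte[] source, int wanted) throws IOException {
        byte[] buffer = new byte[wanted];
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(source))) {
            in.readFully(buffer); // throws EOFException if fewer than `wanted` bytes remain
            return true;
        } catch (EOFException e) {
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] data = new byte[100];
        System.out.println(tryReadFully(data, 100)); // whole range is available
        System.out.println(tryReadFully(data, 101)); // one byte more than the source holds
    }
}
```

      In the reported failure the reader asked for a byte range that the stream could not fully deliver, even though the same file could later be read and rewritten without error.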
      

       

       


          People

            Assignee: Unassigned
            Reporter: Arghya Saha (arghya18)
            Votes: 0
            Watchers: 5
