Uploaded image for project: 'Beam'
  1. Beam
  2. BEAM-2110

HDFSFileSource throws IndexOutOfBoundsException when trying to read big file (gaming_data1.csv)

Details

    • Bug
    • Status: Resolved
    • P2
    • Resolution: Not A Problem
    • None
    • Not applicable
    • None
    • CentOS 7, Oracle JDK 8

    Description

      I modified the wordcount example to read from HDFS with this code:

      pipeline.apply(Read.from(HDFSFileSource.fromText(options.getInput())))
      

      This worked for a number of small files I tried with. But with the included example: gs://apache-beam-samples/game/gaming_data*.csv (moved to HDFS) fails with the following trace:

      Caused by: java.lang.IndexOutOfBoundsException
      	at java.nio.Buffer.checkBounds(Buffer.java:567)
      	at java.nio.ByteBuffer.get(ByteBuffer.java:686)
      	at java.nio.DirectByteBuffer.get(DirectByteBuffer.java:285)
      	at org.apache.hadoop.hdfs.RemoteBlockReader2.read(RemoteBlockReader2.java:168)
      	at org.apache.hadoop.hdfs.DFSInputStream$ByteArrayStrategy.doRead(DFSInputStream.java:775)
      	at org.apache.hadoop.hdfs.DFSInputStream.readBuffer(DFSInputStream.java:831)
      	at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:891)
      	at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:934)
      	at java.io.DataInputStream.read(DataInputStream.java:149)
      	at org.apache.hadoop.mapreduce.lib.input.UncompressedSplitLineReader.fillBuffer(UncompressedSplitLineReader.java:59)
      	at org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:216)
      	at org.apache.hadoop.util.LineReader.readLine(LineReader.java:174)
      	at org.apache.hadoop.mapreduce.lib.input.UncompressedSplitLineReader.readLine(UncompressedSplitLineReader.java:91)
      	at org.apache.hadoop.mapreduce.lib.input.LineRecordReader.skipUtfByteOrderMark(LineRecordReader.java:144)
      	at org.apache.hadoop.mapreduce.lib.input.LineRecordReader.nextKeyValue(LineRecordReader.java:184)
      	at org.apache.beam.sdk.io.hdfs.HDFSFileSource$HDFSFileReader.advance(HDFSFileSource.java:492)
      	at org.apache.beam.sdk.io.hdfs.HDFSFileSource$HDFSFileReader.start(HDFSFileSource.java:465)
      	at org.apache.beam.runners.flink.metrics.ReaderInvocationUtil.invokeStart(ReaderInvocationUtil.java:50)
      	at org.apache.beam.runners.flink.translation.wrappers.SourceInputFormat.open(SourceInputFormat.java:79)
      	at org.apache.beam.runners.flink.translation.wrappers.SourceInputFormat.open(SourceInputFormat.java:45)
      	at org.apache.flink.runtime.operators.DataSourceTask.invoke(DataSourceTask.java:144)
      	at org.apache.flink.runtime.taskmanager.Task.run(Task.java:655)
      	at java.lang.Thread.run(Thread.java:745)
      

      Attachments

        Activity

          People

            jbonofre Jean-Baptiste Onofré
            GergelyNovak Gergely Novák
            Votes:
            0 Vote for this issue
            Watchers:
            5 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: