Flink / FLINK-13292

NullPointerException when reading a string field in a nested struct from an Orc file.


Description

      When I try to read an Orc file using flink-orc, a NullPointerException is thrown.
      I think this issue could be related to this closed issue: https://issues.apache.org/jira/browse/FLINK-8230

      This happens when trying to read the string fields in a nested struct. This is my schema (passed as a Java string concatenation):

            "struct<" +
              "operation:int," +
              "originalTransaction:bigInt," +
              "bucket:int," +
              "rowId:bigInt," +
              "currentTransaction:bigInt," +
              "row:struct<" +
              "id:int," +
              "headline:string," +
              "user_id:int," +
              "company_id:int," +
              "created_at:timestamp," +
              "updated_at:timestamp," +
              "link:string," +
              "is_html:tinyint," +
              "source:string," +
              "company_feed_id:int," +
              "editable:tinyint," +
              "body_clean:string," +
              "activitystream_activity_id:bigint," +
              "uniqueness_checksum:string," +
              "rating:string," +
              "review_id:int," +
              "soft_deleted:tinyint," +
              "type:string," +
              "metadata:string," +
              "url:string," +
              "imagecache_uuid:string," +
              "video_id:int" +
              ">>",
      [error] Caused by: java.lang.NullPointerException
      [error] 	at java.lang.String.checkBounds(String.java:384)
      [error] 	at java.lang.String.<init>(String.java:462)
      [error] 	at org.apache.flink.orc.OrcBatchReader.readString(OrcBatchReader.java:1216)
      [error] 	at org.apache.flink.orc.OrcBatchReader.readNonNullBytesColumnAsString(OrcBatchReader.java:328)
      [error] 	at org.apache.flink.orc.OrcBatchReader.readField(OrcBatchReader.java:215)
      [error] 	at org.apache.flink.orc.OrcBatchReader.readNonNullStructColumn(OrcBatchReader.java:453)
      [error] 	at org.apache.flink.orc.OrcBatchReader.readField(OrcBatchReader.java:250)
      [error] 	at org.apache.flink.orc.OrcBatchReader.fillRows(OrcBatchReader.java:143)
      [error] 	at org.apache.flink.orc.OrcRowInputFormat.ensureBatch(OrcRowInputFormat.java:333)
      [error] 	at org.apache.flink.orc.OrcRowInputFormat.reachedEnd(OrcRowInputFormat.java:313)
      [error] 	at org.apache.flink.runtime.operators.DataSourceTask.invoke(DataSourceTask.java:190)
      [error] 	at org.apache.flink.runtime.taskmanager.Task.run(Task.java:711)
      [error] 	at java.lang.Thread.run(Thread.java:748)
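      The trace shows the NPE coming from java.lang.String's internal bounds check, which dereferences the byte array before validating offsets. If OrcBatchReader.readString constructs the string via new String(bytes, offset, length) and the backing byte array for a nested string cell is null, it would fail exactly there. A minimal sketch of that failure mode (hypothetical illustration, not Flink code):

      ```java
      public class NpeDemo {
          public static void main(String[] args) {
              // Assumption: the byte array backing a string cell can be null
              // for rows inside a nested struct.
              byte[] cell = null;
              try {
                  // Same constructor shape the reader appears to use:
                  // checkBounds() calls bytes.length and NPEs on a null array.
                  String s = new String(cell, 0, 4);
                  System.out.println(s);
              } catch (NullPointerException e) {
                  System.out.println("NPE from String bounds check, as in the trace");
              }
          }
      }
      ```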

      Instead of using the Table API, I am trying to read the ORC files in batch mode as follows:

            env
              .readFile(
                new OrcRowInputFormat(
                  "",
                  "SCHEMA_GIVEN_BEFORE",
                  new HadoopConfiguration()
                ),
                "PATH_TO_FOLDER"
              )
              .writeAsText("file:///tmp/test/fromOrc")

      Thanks for your support.

      Attachments

        1. LinkField.png
          166 kB
          Nithish
        2. output.orc
          3 kB
          Alejandro Sellero
        3. one_row.json
          1 kB
          Alejandro Sellero


          People

            Assignee: Unassigned
            Reporter: Alejandro Sellero (alexsell)
            Votes: 0
            Watchers: 7
