Apache Avro / AVRO-1953

ArrayIndexOutOfBoundsException in org.apache.avro.io.parsing.Symbol$Alternative.getSymbol

Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.7.4
    • Fix Version/s: None
    • Component/s: java
    • Labels: None

    Description

      We are facing an issue where our Avro MapReduce job cannot process the Avro file in the reducer.

      Here is the schema of our data:

      {
        "namespace" : "our package name",
        "type" : "record",
        "name" : "Lists",
        "fields" : [
          {"name" : "account_id", "type" : "long"},
          {"name" : "list_id", "type" : "string"},
          {"name" : "sequence_id", "type" : ["int", "null"]},
          {"name" : "name", "type" : ["string", "null"]},
          {"name" : "state", "type" : ["string", "null"]},
          {"name" : "description", "type" : ["string", "null"]},
          {"name" : "dynamic_filtered_list", "type" : ["int", "null"]},
          {"name" : "filter_criteria", "type" : ["string", "null"]},
          {"name" : "created_at", "type" : ["long", "null"]},
          {"name" : "updated_at", "type" : ["long", "null"]},
          {"name" : "deleted_at", "type" : ["long", "null"]},
          {"name" : "favorite", "type" : ["int", "null"]},
          {"name" : "delta", "type" : ["boolean", "null"]},
          {"name" : "list_memberships", "type" : {
            "type" : "array", "items" : {
              "name" : "ListMembership", "type" : "record",
              "fields" : [
                {"name" : "channel_id", "type" : "string"},
                {"name" : "created_at", "type" : ["long", "null"]},
                {"name" : "created_source", "type" : ["string", "null"]},
                {"name" : "deleted_at", "type" : ["long", "null"]},
                {"name" : "sequence_id", "type" : ["int", "null"]}
              ]
            }
          }}
        ]
      }
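
      For illustration, a record of this schema can be built and written with the generic API roughly as sketched below. This is not our job code; the field values, the lists.avsc / lists.avro file names, and the class name are made up, and the nullable union fields are simply left unset so they resolve to the "null" branch:

      import java.io.File;
      import java.util.Collections;

      import org.apache.avro.Schema;
      import org.apache.avro.file.DataFileWriter;
      import org.apache.avro.generic.GenericData;
      import org.apache.avro.generic.GenericDatumWriter;
      import org.apache.avro.generic.GenericRecord;

      public class WriteLists {
        public static void main(String[] args) throws Exception {
          // Parse the schema shown above (stored in lists.avsc for this sketch).
          Schema lists = new Schema.Parser().parse(new File("lists.avsc"));
          Schema membership = lists.getField("list_memberships").schema().getElementType();

          // One element of the nested list_memberships array.
          GenericRecord m = new GenericData.Record(membership);
          m.put("channel_id", "channel-1");          // required (non-union) field
          // created_at, created_source, deleted_at, sequence_id stay null -> "null" branch

          GenericRecord rec = new GenericData.Record(lists);
          rec.put("account_id", 42L);                // required long
          rec.put("list_id", "list-1");              // required string
          rec.put("updated_at", 1464652800000L);     // ["long","null"] union, long branch
          rec.put("delta", Boolean.TRUE);            // ["boolean","null"] union
          rec.put("list_memberships", Collections.singletonList(m));

          // Write a one-record Avro data file with this schema.
          DataFileWriter<GenericRecord> writer =
              new DataFileWriter<GenericRecord>(new GenericDatumWriter<GenericRecord>(lists));
          writer.create(lists, new File("lists.avro"));
          writer.append(rec);
          writer.close();
        }
      }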

      Our MapReduce job computes the delta of the above dataset and uses our merge logic to merge the latest changes into the dataset.
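
      Conceptually the merge is like the reducer sketched below. This is illustrative only, not our job code: the Text key, the grouping, and the "latest updated_at wins" rule are stand-ins for our real key and merge rule.

      import java.io.IOException;

      import org.apache.avro.generic.GenericData;
      import org.apache.avro.generic.GenericRecord;
      import org.apache.avro.mapred.AvroKey;
      import org.apache.avro.mapred.AvroValue;
      import org.apache.hadoop.io.NullWritable;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapreduce.Reducer;

      // Illustrative merge reducer: for each list key, keep only the record with the
      // largest updated_at, so the latest delta wins.
      public class MergeListsReducer
          extends Reducer<Text, AvroValue<GenericRecord>, AvroKey<GenericRecord>, NullWritable> {

        @Override
        protected void reduce(Text key, Iterable<AvroValue<GenericRecord>> values, Context context)
            throws IOException, InterruptedException {
          GenericRecord latest = null;
          long latestTs = Long.MIN_VALUE;

          for (AvroValue<GenericRecord> value : values) {
            GenericRecord r = value.datum();
            Long ts = (Long) r.get("updated_at");    // nullable in the schema
            long t = (ts == null) ? Long.MIN_VALUE : ts;
            if (t >= latestTs) {
              latestTs = t;
              // The framework reuses the datum object between iterations, so keep a deep copy.
              latest = GenericData.get().deepCopy(r.getSchema(), r);
            }
          }
          if (latest != null) {
            context.write(new AvroKey<GenericRecord>(latest), NullWritable.get());
          }
        }
      }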

      The whole MR job runs daily and has worked fine for 18 months. In that time we have twice seen the merge MapReduce job fail with the error below. The failure happens in the reducer stage, which means the Avro data was read successfully in the mappers and sent to the reducers, where we sort the data by key and timestamp so the delta can be merged on the reducer side:

      java.lang.ArrayIndexOutOfBoundsException
          at org.apache.avro.io.parsing.Symbol$Alternative.getSymbol(Symbol.java:364)
          at org.apache.avro.io.ResolvingDecoder.doAction(ResolvingDecoder.java:229)
          at org.apache.avro.io.parsing.Parser.advance(Parser.java:88)
          at org.apache.avro.io.ResolvingDecoder.readIndex(ResolvingDecoder.java:206)
          at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:152)
          at org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:177)
          at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:148)
          at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:139)
          at org.apache.avro.hadoop.io.AvroDeserializer.deserialize(AvroDeserializer.java:108)
          at org.apache.avro.hadoop.io.AvroDeserializer.deserialize(AvroDeserializer.java:48)
          at org.apache.hadoop.mapreduce.task.ReduceContextImpl.nextKeyValue(ReduceContextImpl.java:142)
          at org.apache.hadoop.mapreduce.task.ReduceContextImpl.nextKey(ReduceContextImpl.java:117)
          at org.apache.hadoop.mapreduce.lib.reduce.WrappedReducer$Context.nextKey(WrappedReducer.java:297)
          at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:165)
          at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:652)
          at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:420)
          at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
          at java.security.AccessController.doPrivileged(AccessController.java:366)
          at javax.security.auth.Subject.doAs(Subject.java:572)
          at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1502)
          at org.apache.hadoop.mapred.Child.main(Child.java:249)
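
      As far as I can tell from the Avro 1.7.4 source at those line numbers, ResolvingDecoder.doAction handles a union in the writer's schema by reading the branch index straight from the byte stream and using it to index the Alternative symbols built from the writer schema. If that index is out of range for the union, whether because the bytes are corrupt or because the bytes were not actually written with the schema the reader was told the writer used, it ends in exactly this ArrayIndexOutOfBoundsException. A minimal standalone sketch that should hit the same path (the schemas here are made up for the illustration and have nothing to do with our dataset):

      import java.io.ByteArrayOutputStream;

      import org.apache.avro.Schema;
      import org.apache.avro.generic.GenericData;
      import org.apache.avro.generic.GenericDatumReader;
      import org.apache.avro.generic.GenericDatumWriter;
      import org.apache.avro.generic.GenericRecord;
      import org.apache.avro.io.BinaryEncoder;
      import org.apache.avro.io.DecoderFactory;
      import org.apache.avro.io.EncoderFactory;

      public class UnionIndexOutOfRange {
        // Schema actually used to produce the bytes: a 3-branch union.
        static final String ACTUAL_WRITER =
            "{\"type\":\"record\",\"name\":\"R\",\"fields\":["
            + "{\"name\":\"f\",\"type\":[\"null\",\"string\",\"int\"]}]}";

        // Schema the reader is told the bytes were written with: only 2 branches.
        static final String CLAIMED_WRITER =
            "{\"type\":\"record\",\"name\":\"R\",\"fields\":["
            + "{\"name\":\"f\",\"type\":[\"null\",\"string\"]}]}";

        public static void main(String[] args) throws Exception {
          Schema actual = new Schema.Parser().parse(ACTUAL_WRITER);
          Schema claimed = new Schema.Parser().parse(CLAIMED_WRITER);

          GenericRecord rec = new GenericData.Record(actual);
          rec.put("f", 42);                          // selects union branch index 2

          ByteArrayOutputStream out = new ByteArrayOutputStream();
          BinaryEncoder enc = EncoderFactory.get().binaryEncoder(out, null);
          new GenericDatumWriter<GenericRecord>(actual).write(rec, enc);
          enc.flush();

          // Decoding the same bytes while claiming the 2-branch writer schema makes the
          // resolving decoder read branch index 2 and index past the end of the Alternative
          // built from the claimed schema -> ArrayIndexOutOfBoundsException in
          // Symbol$Alternative.getSymbol, as in the trace above.
          GenericDatumReader<GenericRecord> reader =
              new GenericDatumReader<GenericRecord>(claimed, claimed);
          reader.read(null, DecoderFactory.get().binaryDecoder(out.toByteArray(), null));
        }
      }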

      The MapReduce job eventually fails in the reducer stage. I don't think our input data is corrupted, since it is read fine in the map stage. Every time we hit this error, we have to pull the whole huge dataset from the source again, rebuild the Avro files, and restart the daily merge, until after several months we hit this issue again for a reason we don't yet understand.


People

    • Assignee: Unassigned
    • Reporter: Yong Zhang (java8964)
    • Votes: 1
    • Watchers: 6
