Fix mapred AvroMultipleOutputs class to write the schema to the JobConf rather than a private HashMap

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.7.5
    • Component/s: java
    • Labels:
      None

      Description

      The current version of mapred AvroMultipleOutputs stores schemas in a private HashMap, which causes problems when the class is used in a MapReduce job.
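
      To illustrate the idea in the title, here is a minimal sketch in plain Java. It does not use the Hadoop or Avro APIs: java.util.Properties stands in for the JobConf, and the class name, method names, and key prefix are hypothetical, not those of AvroMultipleOutputs. The assumption is that a map held in a private field exists only in the JVM that set the job up, while entries written into the job configuration are serialized and travel with the job to each task.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.util.Properties;

public class SchemaStorageSketch {
    // Hypothetical key prefix; the key actually used by AvroMultipleOutputs may differ.
    public static final String KEY_PREFIX = "avro.sketch.schema.";

    // Store the schema (as a JSON string) in the configuration under a per-output key.
    public static void addNamedOutput(Properties conf, String name, String schemaJson) {
        conf.setProperty(KEY_PREFIX + name, schemaJson);
    }

    // Simulate shipping the configuration to a task: write it out as text
    // and load it into a fresh object, as a separate JVM would.
    public static Properties shipToTask(Properties conf) {
        try {
            StringWriter out = new StringWriter();
            conf.store(out, null);
            Properties taskConf = new Properties();
            taskConf.load(new StringReader(out.toString()));
            return taskConf;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Task side: look the schema up by output name; returns null if none was stored.
    public static String schemaFor(Properties taskConf, String name) {
        return taskConf.getProperty(KEY_PREFIX + name);
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        addNamedOutput(conf, "outA", "{\"type\":\"string\"}");

        Properties taskConf = shipToTask(conf);
        System.out.println("outA: " + schemaFor(taskConf, "outA"));
        System.out.println("outB: " + schemaFor(taskConf, "outB"));
    }
}
```

      In this sketch the schema registered for "outA" survives the round trip, while "outB", which was never registered, has no entry on the task side.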

      1. AVRO-1266-v1.patch
        13 kB
        Ashish Nagavaram
      2. AVRO-1266.patch
        2 kB
        Ashish Nagavaram
      3. AVRO-1266.patch
        12 kB
        Ashish Nagavaram

        Issue Links

          Activity

          Hudson added a comment -

          Integrated in AvroJava #361 (See https://builds.apache.org/job/AvroJava/361/)
          AVRO-1266. Java: Fix mapred.AvroMultipleOutputs to support multiple different schemas. Contributed by Ashish Nagavaram. (Revision 1467543)

          Result = SUCCESS
          martinkl :
          Files :

          • /avro/trunk/CHANGES.txt
          • /avro/trunk/lang/java/mapred/src/main/java/org/apache/avro/mapred/AvroMultipleOutputs.java
          • /avro/trunk/lang/java/mapred/src/test/java/org/apache/avro/mapred/TestAvroMultipleOutputs.java
          Martin Kleppmann added a comment -

          I committed this (r1467543). Thanks Ashish!

          Martin Kleppmann added a comment -

          Unit tests pass and look good, and I have successfully used this patch in a real Hadoop job (running on a Hadoop-1.0.4 cluster). Can't tell for sure, but it looks like the error that Pierre was seeing is unrelated to this patch. I'll commit this soon unless there are any objections.

          Pierre Mariani added a comment -

          I am really sorry, but I won't be trying to make this work anymore. I have reached my limit in terms of time spent and frustration.

          FYI, this is what I am getting when trying to run a job with a configuration similar to the one in the test file:
          13/03/18 14:30:55 INFO mapred.JobClient: Task Id : attempt_201302121658_15642_m_000003_2, Status : FAILED
          java.lang.Throwable: Child Error
          at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:475)
          Caused by: java.io.IOException: Task process exit with nonzero status of 1.

          Pierre Mariani added a comment -

          I'll try again

          Ashish Nagavaram added a comment -

          Hi Pierre,

          Did you get a chance to test the patch again?

          I am attaching the test code just in case you need any reference.

          http://svn.apache.org/viewvc/avro/trunk/lang/java/mapred/src/test/java/org/apache/avro/mapred/TestAvroMultipleOutputs.java?view=log

          Pierre Mariani added a comment -

          I am not getting anywhere, and opted to port my job to the mapreduce API. I am running into trouble there as well, which I'll mention in a mailing list email.

          Pierre Mariani added a comment -

          schemaA is not null.

          I confirmed that I am not getting the NullPointerException with the new patch, but bugs and difficulties in configuring the job in my own code are preventing me from confirming that it works. I am still trying and will update if I get anywhere.

          Ashish Nagavaram added a comment -

          Is the schemaA passed to AvroMultipleOutputs.addNamedOutput(conf, "outA", AvroOutputFormat.class, schemaA) null?

          Either way, it shouldn't throw a NullPointerException. I am attaching a new patch.

          Can you try the new patch?

          Pierre Mariani added a comment -

          I am getting a NullPointerException when using Avro 1.7.4 and the patch from March 3rd.

          Details of the exception:

          java.lang.NullPointerException
                  at java.io.StringReader.<init>(StringReader.java:33)
                  at org.apache.avro.Schema$Parser.parse(Schema.java:917)
                  at org.apache.avro.Schema.parse(Schema.java:966)
                  at org.apache.avro.mapred.AvroMultipleOutputs$InternalFileOutputFormat.getRecordWriter(AvroMultipleOutputs.java:611)
                  at org.apache.avro.mapred.AvroMultipleOutputs.getRecordWriter(AvroMultipleOutputs.java:411)
                  at org.apache.avro.mapred.AvroMultipleOutputs.getCollector(AvroMultipleOutputs.java:570)
                  at org.apache.avro.mapred.AvroMultipleOutputs.getCollector(AvroMultipleOutputs.java:506)
                  <call to amos.getCollector("outA", reporter).collect(object);>
          

          Configuration of my job:

          JobConf conf = new JobConf(getConf(), getClass());
          AvroJob.setInputSchema(conf, schemaA);
          AvroJob.setMapOutputSchema(conf, Pair.getPairSchema(Schema.create(Schema.Type.STRING), schemaA));
          AvroJob.setOutputSchema(conf, schemaA);
          AvroJob.setMapperClass(conf, MyMapper.class);
          AvroJob.setReducerClass(conf, MyReducer.class);
          AvroMultipleOutputs.addNamedOutput(conf, "outA", AvroOutputFormat.class, schemaA);
          AvroMultipleOutputs.addNamedOutput(conf, "outB", AvroOutputFormat.class, schemaB);
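
          The top frames of the trace above show the java.io.StringReader constructor failing because it was handed a null string. The following is a minimal, stdlib-only sketch of that failure mode, not the AvroMultipleOutputs code: java.util.Properties stands in for the JobConf, and the class name, method names, and key prefix are hypothetical. It assumes the schema is looked up as a string from the configuration, and shows that an explicit check for a missing entry gives a clearer error than the NullPointerException.

```java
import java.io.StringReader;
import java.util.Properties;

public class NullSchemaSketch {
    // Hypothetical key prefix, for illustration only.
    public static final String KEY_PREFIX = "avro.sketch.schema.";

    // Unguarded lookup: a missing entry yields null, and
    // new StringReader(null) throws NullPointerException.
    public static StringReader openUnguarded(Properties conf, String name) {
        String json = conf.getProperty(KEY_PREFIX + name);
        return new StringReader(json);
    }

    // Guarded lookup: fail early with a message naming the output.
    public static StringReader openGuarded(Properties conf, String name) {
        String json = conf.getProperty(KEY_PREFIX + name);
        if (json == null) {
            throw new IllegalStateException(
                "No schema configured for named output '" + name + "'");
        }
        return new StringReader(json);
    }

    public static void main(String[] args) {
        Properties conf = new Properties(); // nothing registered for "outA"

        try {
            openUnguarded(conf, "outA");
        } catch (NullPointerException e) {
            System.out.println("unguarded: NullPointerException");
        }

        try {
            openGuarded(conf, "outA");
        } catch (IllegalStateException e) {
            System.out.println("guarded: " + e.getMessage());
        }
    }
}
```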
          
          Doug Cutting added a comment -

          Looks good to me. I'll commit this soon unless someone objects.

          Ashish Nagavaram added a comment -

          Attached a new patch with the required changes. It adds functions that take a base file name and a schema, and write output according to that schema.

          Ashish Nagavaram added a comment -

          Yes, it is the mapred version of AVRO-1215.

          Currently the only way it supports multiple schemas is by creating a namedOutput and schema for each output file. But it makes sense to have collect(namedoutput,K,V,schema) and collect(K,V,schema,baseOutputFile) methods too.

          I will work on these and should have a patch out by Monday.

          Doug Cutting added a comment -

          This looks like the mapred version of AVRO-1215. Is that right? If so, let's link it to that issue.

          Also, does this address the following question raised on the user list?

          http://s.apache.org/mapredmultiple


            People

            • Assignee:
              Ashish Nagavaram
              Reporter:
              Ashish Nagavaram
            • Votes:
              0
              Watchers:
              5
