Avro / AVRO-512

define and implement mapreduce connector protocol

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.4.0
    • Component/s: java
    • Labels:
      None

      Description

      Avro should provide Hadoop Mapper and Reducer implementations that connect to a subprocess in another programming language, transmitting raw binary values to and from that process. This should be modeled after Hadoop Pipes. It would allow one to easily write efficient mapreduce programs in non-Java languages that process Avro-format data.
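      [Editorial note: the core idea of the connector - a parent task JVM that launches a subprocess and streams raw binary values to and from it, as Hadoop Pipes does - can be illustrated with the minimal sketch below. This is purely conceptual and is not the Avro tether API; `cat` stands in for the child program, echoing bytes back unchanged.]

      ```java
      import java.io.InputStream;
      import java.io.OutputStream;

      public class SubprocessEcho {
        // Send raw bytes to a child process and read back what it writes:
        // the parent/child byte exchange at the heart of a Pipes-style connector.
        public static byte[] roundTrip(byte[] payload) throws Exception {
          Process child = new ProcessBuilder("cat").start();

          // Parent -> child: write the raw bytes, then close stdin so the
          // child sees end-of-input and finishes.
          try (OutputStream toChild = child.getOutputStream()) {
            toChild.write(payload);
          }

          // Child -> parent: read everything the child wrote to stdout.
          byte[] echoed;
          try (InputStream fromChild = child.getInputStream()) {
            echoed = fromChild.readAllBytes();
          }
          child.waitFor();
          return echoed;
        }

        public static void main(String[] args) throws Exception {
          byte[] data = {1, 2, 3, 4};
          byte[] back = roundTrip(data);
          System.out.println(java.util.Arrays.equals(data, back)); // prints "true"
        }
      }
      ```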

      1. AVRO-512.patch
        84 kB
        Doug Cutting
      2. AVRO-512.patch
        84 kB
        Doug Cutting
      3. AVRO-512.patch
        80 kB
        Doug Cutting
      4. AVRO-512.patch
        58 kB
        Doug Cutting

        Issue Links

          Activity

          Doug Cutting created issue -
          Doug Cutting made changes -
          Field Original Value New Value
          Link This issue is blocked by AVRO-285 [ AVRO-285 ]
          Doug Cutting added a comment -

          To use an Avro protocol for this, we'll want request-only messages (AVRO-285).
          Owen O'Malley added a comment -

          I think it is a very bad idea to make Avro depend on MapReduce. At the very least, please put this code into a separate jar rather than the main Avro jar, so that MapReduce doesn't depend on a jar that depends on it.

          Why not build this as a library on top of MapReduce? It shouldn't be bundled with Avro...
          Scott Carey added a comment - edited

          I agree that Avro should not require MapReduce – specifically, the Maven POM should not cause consumers to pull in MapReduce by default.

          But I think we already prevent that. The POM generated by the build marks hadoop-core as "optional", meaning downstream projects that consume Avro won't automatically pull in the Hadoop jar. Another option with a similar effect is to declare the dependency with "provided" scope instead of "compile", which makes the jar available for build and test but does not bundle it. That is probably preferable for MapReduce. If users want to use those APIs, they must supply their own hadoop-core jar or declare the dependency themselves.

          Putting the code in Hadoop is probably a problem, unless we want to release new versions of 0.18, 0.19, 0.20, etc. Placing it in Hadoop means that changes to Avro's lower-level APIs will break compatibility with the version in Hadoop. Honestly, some of those APIs are going to keep evolving, and dot-releases of Avro can break these APIs (but not encoded formats). Until these APIs are more locked down, it is better to keep packages like this in the Avro project.

          -----------
          Going slightly off topic now:

          A few other libraries Avro bundles have similar issues – optional side features should be given either "provided" scope or the "optional" flag in the Maven POM. Or the project needs to be split up into a few jars:

          avro-core
          -> avro-genavro
          -> avro-protocol
          -> avro-mapred
          -> avro-reflect

          probably covers the main dependency chunks. avro-core can get away with only jackson, slf4j, and commons-lang, I think – meaning the generic and specific APIs, file formats, etc. would still work.
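          [Editorial note: the two scoping options Scott describes can be written in the POM roughly as follows. This is a sketch; the hadoop-core version shown is illustrative, not taken from the issue.]

          ```xml
          <!-- Option 1: "optional" - consumers of Avro do not inherit this
               dependency transitively, but it is bundled at compile time. -->
          <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-core</artifactId>
            <version>0.20.2</version>
            <optional>true</optional>
          </dependency>

          <!-- Option 2: "provided" - available for build and test, never
               bundled; users must supply their own hadoop-core jar. -->
          <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-core</artifactId>
            <version>0.20.2</version>
            <scope>provided</scope>
          </dependency>
          ```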
          Doug Cutting made changes -
          Assignee Doug Cutting [ cutting ]
          Fix Version/s 1.4.0 [ 12314789 ]
          Doug Cutting added a comment -

          Here's an early version of this patch. I think it's feature-complete. It compiles but as yet has no tests or documentation.
          Doug Cutting made changes -
          Attachment AVRO-512.patch [ 12445601 ]
          Doug Cutting added a comment -

          New version of the patch that runs and passes tests.
          Doug Cutting made changes -
          Attachment AVRO-512.patch [ 12446522 ]
          Doug Cutting made changes -
          Attachment AVRO-512.patch [ 12446723 ]
          Doug Cutting made changes -
          Status Open [ 1 ] Patch Available [ 10002 ]
          Doug Cutting made changes -
          Link This issue blocks AVRO-570 [ AVRO-570 ]
          Doug Cutting added a comment -

          I'm going to commit this later today, so that we can start trying to implement it in other languages. I've updated the documentation to note that this is an experimental feature, subject to change.
          Doug Cutting made changes -
          Attachment AVRO-512.patch [ 12447052 ]
          Doug Cutting added a comment -

          I just committed this.
          Doug Cutting made changes -
          Status Patch Available [ 10002 ] Resolved [ 5 ]
          Resolution Fixed [ 1 ]
          Doug Cutting made changes -
          Status Resolved [ 5 ] Closed [ 6 ]
          Jeremy Lewi added a comment -

          I think a deadlock can occur if the subprocess fails to start (e.g. if the executable is specified incorrectly). This happens because the constructor for TetheredProcess starts the subprocess and then calls outputService.inputPort(). But inputPort() will block until the child process sends a configure message to the parent; if the child process wasn't started, then I think the parent deadlocks.

          At a minimum we could check that the subprocess hasn't exited yet. This probably won't prevent all possible deadlocks, but it might help.

          Below is some code for checking if the process has exited.

          // is there a better way to check if the process has exited than the roundabout way below?
          boolean hasexited = false;
          try {
            // exitValue throws an exception if the process hasn't exited
            this.subprocess.exitValue();
            hasexited = true;
          } catch (IllegalThreadStateException e) {
            // it hasn't exited yet
            hasexited = true;
          }
          if (hasexited) {
            // What's the best way to log this?
            System.out.println("Error: Could not start subprocess");
            throw new RuntimeException("Error: Could not start subprocess");
          }
          Jeremy Lewi added a comment -

          Small bug in the above code. In the catch block (i.e. the process hasn't exited yet), it should be hasexited = false.

          Aside:
          Is there a way to edit comments?
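          [Editorial note: with that fix applied, the check can be made self-contained as below. This is a sketch, not the committed patch; hasExited and SubprocessCheck are hypothetical names, and in TetheredProcess the real check would run before the blocking inputPort() call.]

          ```java
          public class SubprocessCheck {
            // Returns true iff the process has already terminated.
            // Process.exitValue() throws IllegalThreadStateException while the
            // process is still running, which is the (roundabout) signal used.
            public static boolean hasExited(Process subprocess) {
              try {
                subprocess.exitValue();
                return true;
              } catch (IllegalThreadStateException e) {
                return false; // still running - the branch the fix corrects
              }
            }

            public static void main(String[] args) throws Exception {
              // A process that exits immediately:
              Process quick = new ProcessBuilder("true").start();
              quick.waitFor();
              System.out.println(hasExited(quick)); // prints "true"

              // A process that is still running:
              Process slow = new ProcessBuilder("sleep", "5").start();
              System.out.println(hasExited(slow)); // prints "false"
              slow.destroy();
            }
          }
          ```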
          Doug Cutting added a comment -

          Yes, this would be a good thing to fix. Can you please open a new issue for it? Thanks!

            People

            • Assignee:
              Doug Cutting
              Reporter:
              Doug Cutting
            • Votes:
              0
              Watchers:
              8
