Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

      Description

      The work in HADOOP-3412 provided an API to support pluggable schedulers. However, implementations have to live in the org.apache.hadoop.mapred package, which is undesirable. The goal of this issue is to create a public API for scheduler writers to code against.
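
      For context, a pluggable scheduler is selected through JobTracker configuration. A minimal sketch, assuming the mapred.jobtracker.taskScheduler property introduced by HADOOP-3412 (the implementation class name below is a hypothetical example):

      <property>
        <name>mapred.jobtracker.taskScheduler</name>
        <value>org.example.scheduler.MyTaskScheduler</value>
      </property>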

      1. hadoop-3822.sh
        0.4 kB
        Tom White
      2. hadoop-3822.patch
        25 kB
        Tom White

        Issue Links

          Activity

          Allen Wittenauer made changes -
          Status Open [ 1 ] Resolved [ 5 ]
          Resolution Fixed [ 1 ]
          Allen Wittenauer added a comment -

          Fixed forever ago.

          Owen O'Malley made changes -
          Project Hadoop Common [ 12310240 ] Hadoop Map/Reduce [ 12310941 ]
          Key HADOOP-3822 MAPREDUCE-342
          Component/s mapred [ 12310690 ]
          Tom White made changes -
          Fix Version/s 0.19.0 [ 12313211 ]
          Tom White added a comment -

          We don't need this change for 0.19 since the new schedulers have gone into contrib.

          Steve Loughran added a comment -

          I don't know about the interplay of Java annotations and javadoc. You can put an annotation into the Java .class files themselves and then run code over them, but javadoc itself doesn't seem to pick annotations up. There are custom javadoc tags, but they are a separate mechanism. If all that is needed is a javadoc tag, then a custom tag like @hadoop.internal could be added.
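
          As an illustration of the custom-tag route (a sketch only: the tag name and the "Internal API:" header are assumptions, not an agreed convention), the standard javadoc -tag option can surface such a tag in the generated docs:

          javadoc -tag hadoop.internal:a:"Internal API:" \
              -sourcepath src/java -d build/docs/api org.apache.hadoop.mapred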

          Doug Cutting added a comment -

          The line we've drawn to date has been that, if it's in the released javadoc, back-compatibility requirements apply. (We promise to deprecate features for one major release before removing them. This is in fact the defining characteristic of a major release. See http://wiki.apache.org/hadoop/Roadmap for details.)

          Currently, the plan is not to release javadoc for the mapreduce and hdfs daemon code packages. The scheduler is part of the daemon code, so it's already out of scope for back-compatibility. So, for now, this is not an issue. It will only become an issue if we intend to start including javadoc for the scheduler in releases.

          Longer-term we might wish to have a finer-grained line. I still believe it ought to be easy to determine from the javadoc whether something is a supported API. So, if we were to, e.g., start releasing Scheduler javadoc, we'd need to make sure that unstable methods were clearly marked. In Lucene we adopted the convention of adding an "Expert:" to the beginning of javadoc comments for APIs that most users should ignore. Similarly, we might add something like "Unstable:" to APIs that we expect will change.
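
          As a sketch of that convention (the method name and wording below are hypothetical), it would just be a prefix on the javadoc comment:

          /**
           * Unstable: this API is expected to change in a future release.
           * Expert: most users should not need to call this directly.
           */
          public abstract void reassignTasks();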

          Would an @Internal tag be able to affect the javadoc?

          Steve Loughran added a comment -

          This is what an @Internal annotation would look like:

          import java.lang.annotation.Retention;
          import java.lang.annotation.RetentionPolicy;

          // Retained in source only; the marker exists purely as documentation.
          @Retention(RetentionPolicy.SOURCE)
          public @interface Internal {
              String value() default "";
          }

          It could then be used to annotate classes, methods, interfaces, etc.:

          @Internal
          public interface HealthChecks {
              boolean healthy() throws IOException;
          }

          Even member variables you want better access to:

          public class NameNode {
              @Internal("namesystem should be accessed through getNamesystem()")
              FSNamesystem namesystem;
              // ...
          }

          The annotation would not do anything, except to say "be careful when you use this, we may change it".

          Steve Loughran added a comment -

          In open source there is no way to stop anyone getting at your internals. What you can do is flag which interfaces or classes have no stability guarantees. People are free to work with them, subclass the classes, and implement the interfaces, but it is not the Hadoop team's concern if a new release stops their code from recompiling.

          Interfaces can be easier to mock, and easier to patch into other object hierarchies - and there is no requirement to never change an interface. It's something that Sun and Joshua Bloch do for the Sun libraries, but even they have internal stuff they don't like you touching and which moves about from time to time.

          This is why the idea of having an @Internal tag appeals to me - something you can put on an interface to say "this is internal, track SVN HEAD if you don't want to be surprised".

          Tom White added a comment -

          > Can you summarize?

          According to Joshua Bloch, the virtues are all to do with the fact that interfaces don't enforce a hierarchy on your design in the way that abstract classes do. This chimes with another principle in the book: "Favor composition over inheritance". So for example, you can retrofit existing classes to implement interfaces (you often can't do this with abstract classes - it "causes great collateral damage to the type hierarchy"); also you can define mixins and non-hierarchical type frameworks with interfaces. However, he does say that "It is far easier to evolve an abstract class than an interface" (which is where this whole discussion started).
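
          A small sketch of the retrofit point (all of the names below are made up for illustration): an existing class can pick up a new capability simply by declaring an interface, without disturbing its place in an inheritance hierarchy.

          interface Configurable {
              void configure(java.util.Properties props);
          }

          class LegacyBase { /* pre-existing framework base class */ }

          // The existing class keeps its superclass and just adds the new capability:
          class LegacyScheduler extends LegacyBase implements Configurable {
              public void configure(java.util.Properties props) {
                  // read settings from props
              }
          }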

          > Perhaps we should document these (and more) in the wiki or someplace, and try to validate our abstract APIs against them. There are added potential pitfalls when using abstract classes instead of interfaces, and we should work hard to avoid them.

          +1

          > So do we intend to permit implementations of the scheduler that are not in Hadoop's source tree?

          I think it is OK to say that we should only support those in the tree for the time being. (That doesn't stop folks implementing their own schedulers of course, it just means they may have to change them from release to release if the interfaces change.) However, I do think we should make it possible for scheduler implementations to live in their own packages. (E.g. I think we should do this before committing HADOOP-3746.) To do that, I suggest we do the following:

          • Repackage the current patch so the scheduler code goes in org.apache.hadoop.mapreduce.server.scheduler (with abstract classes changed to interfaces, as appropriate)
          • Remove javadoc for the server packages (or generate it separately, labelled as developer javadoc - see HADOOP-3742)
          • Restructure the MapReduce code into org.apache.hadoop.mapreduce for the user API (HADOOP-1230), and the server API (HADOOP-2884 - is anyone working on this for reorganising MapReduce?)
          Doug Cutting added a comment -

          > You don't need to freeze interfaces. You only need to freeze interfaces that you promise to keep stable

          Right. For example, we use interfaces for RPC protocols, but those are not for public consumption and are only used internally. We do not intend to support clients or servers that are not in Hadoop's source tree.

          So do we intend to permit implementations of the scheduler that are not in Hadoop's source tree? If so, then we should not use interfaces. If not, then interfaces are fine, since we can update all implementations if/when we alter the interface.

          Steve Loughran added a comment -

          You don't need to freeze interfaces. You only need to freeze interfaces that you promise to keep stable.

          There's one other risk with abstract classes - that the base class adds a new public or protected method that accidentally clashes with one a subclass has already implemented, in which case the subclass's version gets used. There are some checks in C# to catch this, but not in Java, not even with the @Override attribute, because that attribute is optional.
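
          A sketch of that failure mode (the class names are hypothetical):

          // Version 2 of a base class adds refresh(); a subclass written against
          // version 1 already had an unrelated method with the same signature.
          abstract class AbstractScheduler {
              protected void refresh() { /* new in v2: reload base-class state */ }
          }

          class CustomScheduler extends AbstractScheduler {
              // Compiles silently as an override (no @Override required), and
              // base-class callers of refresh() now run this unrelated logic.
              protected void refresh() { /* pre-existing subclass housekeeping */ }
          }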

          The best way to avoid problems is to have people building and testing their code with your SVN_HEAD version, as it finds problems the minute they are checked in, rather than when a release occurs. This is something we can do in open source development, across teams and organisations.

          Doug Cutting added a comment -

          > he extols the virtues of interfaces over abstract classes

          What are they? Can you summarize? Are they significant?

          > So perhaps the developer API (which is "less public") is the place where it is OK to introduce interfaces [...]

          Perhaps, if there are significant advantages to interfaces. The ability to implement multiple interfaces in a single class can save a few lines of code, but doesn't really seem significant to me when compared to the huge cost of freezing the API.

          I like Steve's guidelines (above) for abstract classes. Perhaps we should document these (and more) in the wiki or someplace, and try to validate our abstract APIs against them. There are added potential pitfalls when using abstract classes instead of interfaces, and we should work hard to avoid them.

          Steve Loughran added a comment -

          With inner classes, a lot of this is moot. What is important for an abstract class is that:

          • it doesn't do anything in its constructor that calls non-final methods (see the sketch below)
          • it doesn't offer protected fields (better to use private fields and protected methods), as you can't change much from then on
          • you don't rely on package-scoped fields/methods/classes
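
          A sketch of the first guideline (the class names are hypothetical):

          abstract class AbstractService {
              private final String name;

              AbstractService(String name) {
                  this.name = name;
                  // Do NOT call an overridable method such as init() here: a subclass
                  // override would run before the subclass's own fields are initialised.
              }

              // Safe pattern: callers invoke start() once construction is complete.
              public final void start() {
                  init();
              }

              protected abstract void init();

              public String getName() { return name; }
          }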

          The biggest problem I have with either API choice is exception handling - you are always stuck with whatever exceptions the API designer thought were likely.

          Currently Hadoop has a lot of package-private code, and that is bad because it forces people subclassing to use that package name. This is bad for the Hadoop team, as their packages appear in the stack traces - they get to field the support calls.

          One thing we could consider is a Java 5 annotation, @Unstable or @Internal, which would be a bit like @deprecated and warn people that things could change without warning - and hence that there are no stability guarantees at all.

          Tom White added a comment -

          +1 on separating developer and user APIs.

          There's some discussion on this very issue in the context of interfaces/abstract classes (yes, that again) here http://kasparov.skife.org/blog-live/src/java/versioning-apr-almost.writeback and here http://www.dehora.net/journal/2008/06/24/icompatible/.

          Also in "Item 18: Prefer interfaces to abstract classes" of Joshua Bloch's Effective Java (http://java.sun.com/docs/books/effective/), he extols the virtues of interfaces over abstract classes. However, at the end he says:

          Once an interface is released and widely implemented, it is almost impossible to change. You really must get it right the first time. If an interface contains a minor flaw, it will irritate you and its users forever. If an interface is severely deficient, it can doom an API. The best thing to do when releasing a new interface is to have as many programmers as possible implement the interface in as many ways as possible before the interface is frozen. This will allow you to discover flaws while you can still correct them.

          So perhaps the developer API (which is "less public") is the place where it is OK to introduce interfaces - if we feel there is a good case for them - so we can get feedback on them by having different groups produce different implementations before they are frozen by appearing in user javadocs. (According to Bloch, three implementations is enough: http://lcsd05.cs.tamu.edu/slides/keynote.pdf)

          Doug Cutting added a comment -

          There's public and there's public. I'm not sure we're yet ready for this to appear in the user javadocs, but making it easier for developers to experiment is fine. The published javadoc is the API that we're promising to keep back-compatible, and we should expand it only cautiously. As we restructure the code, we should create a new classification we didn't have before: public but not in javadoc. This includes most of the hdfs and mapred daemon implementations. And the scheduler API belongs there for now.
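
          One way to realise "public but not in javadoc" is simply to leave the daemon packages out when the release javadoc is generated; a sketch using javadoc's standard -subpackages/-exclude options (the package names below are placeholders, not the agreed layout):

          javadoc -d build/docs/api -sourcepath src/java \
              -subpackages org.apache.hadoop \
              -exclude org.apache.hadoop.mapred.scheduler:org.apache.hadoop.dfs.server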

          Tom White made changes -
          Attachment hadoop-3822.patch [ 12386803 ]
          Tom White added a comment -

          A patch that moves the TaskScheduler class to a new org.apache.hadoop.mapred.scheduler package. I've also made JobInProgressListener public, as well as a number of related classes: JobInProgress, Task, TaskTrackerStatus. For these last three I've changed the visibility of any public methods that are only used within the package to package-private, so we start out with as little exposed as possible.

          I've also made TaskTrackerManager public and turned it into an abstract class.

          The attached script should be run before applying the patch.

          Tom White made changes -
          Attachment hadoop-3822.sh [ 12386802 ]
          Tom White made changes -
          Link This issue relates to HADOOP-1230 [ HADOOP-1230 ]
          Tom White added a comment (edited) -

          See discussions on interfaces vs abstract classes in HADOOP-3801 and HADOOP-1230.

          Tom White made changes -
          Link This issue relates to HADOOP-3801 [ HADOOP-3801 ]
          Tom White created issue -

            People

             • Assignee: Unassigned
             • Reporter: Tom White
             • Votes: 0
             • Watchers: 8

              Dates

              • Created:
                Updated:
                Resolved:

                Development