Details

    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.0.0
    • Fix Version/s: None
    • Component/s: None
    • Labels: None
    • Target Version/s:

      Description

      Coordination Engine (CE) is a system that allows distributed processes to agree on a sequence of events. To be reliable, the CE should itself be distributed.
      A Coordination Engine can be based on different algorithms (Paxos, Raft, 2PC, ZAB) and have different implementations, depending on use cases and on reliability, availability, and performance requirements.
      The CE should have a common API, so that it can serve as a pluggable component in different projects. The immediate beneficiaries are HDFS (HDFS-6469) and HBase (HBASE-10909).
      The first implementation is proposed to be based on ZooKeeper.
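
      A rough, hypothetical Java sketch of what such a common API could look like, based only on names used in the discussion and patches below (Proposal, Agreement, submitProposal, GlobalSequenceNumber); the attached patches are authoritative and may differ:

          import java.io.IOException;
          import java.io.Serializable;

          /** A value that a node proposes for agreement. */
          interface Proposal extends Serializable {}

          /** A proposal the engine has agreed on, in the agreed global order. */
          interface Agreement extends Serializable {}

          /** A pluggable engine that turns concurrent proposals into one agreed sequence. */
          interface CoordinationEngine {
            /** Submit a proposal; throws if it cannot be accepted for agreement. */
            void submitProposal(Proposal proposal) throws IOException;

            /** Monotonically increasing number of the last agreement applied locally. */
            long getGlobalSequenceNumber();
          }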

      1. zkCEBenchmark.pdf
        46 kB
        Andrey Stepachev
      2. zkbench.pdf
        49 kB
        Henry Wang
      3. NNThroughputBenchmark Results.pdf
        61 kB
        Henry Wang
      4. hadoop-coordination.patch
        63 kB
        Mikhail Antonov
      5. HADOOP-10641.patch
        60 kB
        Konstantin Boudnik
      6. HADOOP-10641.patch
        75 kB
        Plamen Jeliazkov
      7. HADOOP-10641.patch
        60 kB
        Plamen Jeliazkov
      8. HADOOP-10641.patch
        72 kB
        Plamen Jeliazkov
      9. HADOOP-10641.patch
        65 kB
        Plamen Jeliazkov
      10. ce-tla.zip
        38 kB
        Michael Parkin

        Issue Links

          Activity

          Konstantin Shvachko added a comment -

          Steve, thanks for the thorough code review.
          The question about "what's next" has been discussed in this jira (see Aaron's and others' comments), at the meetup on July 15, and in person. The decision everybody agreed on is to continue the CE and CNode development on a branch.

          • I would be glad to commit it to hadoop-common as you propose. That way we can see faster adoption.
          • I don't think the interface should be tied to ZooKeeper itself or to higher-level projects on top of it, like Curator, because the CE is intended to have implementations for different consensus algorithms.
          Konstantin Shvachko added a comment -

          Thanks Henry and Andrey for the benchmark results. To summarize, we have three benchmarks:

          1. NNThroughputBenchmark, which gives us the upper bound of NN throughput.
          2. ZK benchmark, which measures the performance of ZK itself.
          3. ZK-CE benchmark measuring performance of CE based on ZK.

          Ideally we would like to see NNThroughput <= ZK-CE throughput <= ZK throughput.
          Just looking at the create operation:

          • NNThroughput yields 13K ops/sec with 400 threads, which seems to be optimal for that hardware configuration.
          • ZK throughput is substantially higher on SSD: 34K ops/sec
          • ZK-CE runs at 8.4K ops/sec, which is slower than NNThroughput.

          So there is work to do here. I think the CE implementation can be optimized to get on par with, or close to, ZK performance.

          Steve Loughran added a comment -

          Oh, one more thing: could you add the .tla file to the patch too? Maybe we could start having src/tla as the home for these files.

          Steve Loughran added a comment -

          Thank you for the really good TLA document. I do think it makes what's going on a lot clearer, as now others can see strictly what implementations do. (It now places a requirement on me to do the same for my proposals, but I'm happy with that.)

          It also places a requirement for me to look at the code alongside the spec, so here goes:

          Algorithm

          1. Presumably usedIds aren't collected forever in real implementations; instead they use enough of a time marker in their IDs to be self-windowing.

          core code

          1. CoordinationEngine should extend Hadoop common's AbstractService; this makes it trivial to integrate with the lifecycle of YARN apps, and when someone migrates the NN/DN to that lifecycle it will hook into HDFS the same way. The init/start/stop operations all match (see the sketch after this list).
          2. ZKConfigKeys needs a name more tied to the coordination engine.
          3. ZKCoordinationEngine could use a guava Precondition as a check on localNodeId and move it up to the init method.
          4. I don't like downgrading all ZK exceptions to "IOE", as it hides things like security exceptions, missing parents &c. Again, this is something we could do that is more generic than just for the co-ord engine, as I've ended up doing some of this in my code.
          5. ZKCoordinationEngine.loopLearningUntilCaughtUp()
          6. createOrGetGlobalSequenceNumber's while(true) loop appears to spin forever if the exception raised in its ZK actions is KeeperException.NoAuthException... that is, if starting on a secure cluster where it can't access the path. More filtering of exception types is needed, with the unrecoverable ones thrown up (SessionExpiredException and maybe some others).
          7. ZKCoordinationEngine.submitProposal(): needs better exception text, ideally including the path and the text of the nested exception.
          8. I'm not sure I like the way ZKCoordinationEngine.processImpl() shuts itself down. I'd prefer some exception for the caller to process, so that the owner of the engine stays in sync.
          9. If an AgreementsThread fails, its exception doesn't get picked up... these need to be propagated back to the JUnit thread. Ideally also logged in that AgreementsThread after Thread.currentThread().setName() has given it a name for the log statements.
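
          To make points 1 and 6 concrete, here is a minimal, hypothetical sketch (class and helper names invented for illustration) of an engine built on AbstractService that retries transient ZK failures but rethrows unrecoverable ones instead of spinning:

              import java.io.IOException;

              import org.apache.hadoop.conf.Configuration;
              import org.apache.hadoop.service.AbstractService;
              import org.apache.zookeeper.KeeperException;

              public class ZKCoordinationEngineService extends AbstractService {

                public ZKCoordinationEngineService() {
                  super("ZKCoordinationEngine");
                }

                @Override
                protected void serviceInit(Configuration conf) throws Exception {
                  // Validate configuration early, e.g. Preconditions.checkNotNull(localNodeId).
                  super.serviceInit(conf);
                }

                @Override
                protected void serviceStart() throws Exception {
                  // Connect to ZooKeeper and start the learner thread here.
                }

                @Override
                protected void serviceStop() throws Exception {
                  // Close the ZK session and join worker threads here.
                }

                /** Retries transient ZK failures; rethrows unrecoverable ones. */
                private <T> T retryZkAction(ZkAction<T> action) throws IOException {
                  while (true) {
                    try {
                      return action.run();
                    } catch (KeeperException.SessionExpiredException
                           | KeeperException.NoAuthException e) {
                      // Unrecoverable: surface to the caller instead of spinning.
                      throw new IOException("Unrecoverable ZooKeeper failure", e);
                    } catch (KeeperException e) {
                      // Transient (e.g. connection loss): loop and retry.
                    } catch (InterruptedException e) {
                      Thread.currentThread().interrupt();
                      throw new IOException("Interrupted during ZooKeeper action", e);
                    }
                  }
                }

                interface ZkAction<T> {
                  T run() throws KeeperException, InterruptedException;
                }
              }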

          tests

          1. MiniZKCluster... I have the one from Twill converted to a YARN service; this is one I'd like to see this switch to once the code is checked in. It's lighter weight than the HBase one.
          2. There's always a delay for ZK startup. Could you make the test cases start one as a @BeforeClass and all share the same one?
          3. Needs tests for failure conditions: no ZK cluster.
          4. If there's a way to do this, tests against a secure cluster, both succeeding and failing.
          5. Add a test timeout via @Rule public final Timeout testTimeout = new Timeout(30000); (points 2 and 5 are sketched after this list).
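
          A minimal JUnit 4 sketch of points 2 and 5; MiniZKCluster here is a stand-in for whichever mini ZooKeeper cluster the tests adopt, and its start/stop methods are hypothetical:

              import org.junit.AfterClass;
              import org.junit.BeforeClass;
              import org.junit.Rule;
              import org.junit.Test;
              import org.junit.rules.Timeout;

              public class TestZKCoordinationEngine {

                // Stand-in for whichever mini ZK cluster the tests adopt (e.g. Twill's).
                static class MiniZKCluster {
                  void start() { /* boot an in-process ZooKeeper ensemble */ }
                  void stop()  { /* shut it down */ }
                }

                private static MiniZKCluster zkCluster;

                // Fails any test that hangs, instead of blocking the whole build.
                @Rule
                public final Timeout testTimeout = new Timeout(30000);

                @BeforeClass
                public static void startCluster() {
                  zkCluster = new MiniZKCluster(); // started once, shared by all tests
                  zkCluster.start();
                }

                @AfterClass
                public static void stopCluster() {
                  if (zkCluster != null) {
                    zkCluster.stop();
                  }
                }

                @Test
                public void testSimpleProposals() {
                  // ... exercise the engine against the shared zkCluster ...
                }
              }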

          minor

          1. Needs formatting to style, and use of {} in all conditional clauses.
          2. Can you use SLF4J everywhere? It's more efficient, as it does string expansion only when needed, and it will stop people complaining that every log action needs to be wrapped in log-level checks. Then switch calls like LOG.info("Got watched event: " + watchedEvent) to LOG.info("Got watched event: {}", watchedEvent), as in the sketch below.
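
          For illustration, the suggested SLF4J switch looks like this (the surrounding class is hypothetical):

              import org.slf4j.Logger;
              import org.slf4j.LoggerFactory;

              public class WatcherLogging {
                private static final Logger LOG =
                    LoggerFactory.getLogger(WatcherLogging.class);

                void process(Object watchedEvent) {
                  // Before: the string is concatenated even when INFO is disabled,
                  // unless every call is wrapped in if (LOG.isInfoEnabled()) { ... }.
                  // LOG.info("Got watched event: " + watchedEvent);

                  // After: the {} placeholder is expanded only if INFO is enabled.
                  LOG.info("Got watched event: {}", watchedEvent);
                }
              }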

          What now?

          I'm pretty happy with it, though more reviewers are needed. What now?

          1. We commit hadoop-coordination with the implementation into Hadoop common, and work towards getting it into a version of the NN which can use it for committing operations; maybe later in YARN and downstream.
          2. We work with the Curator team to get it into Curator and pull that into server-side Hadoop. YARN-913 is going to need that in the RM anyway, though so far HDFS hasn't. Strengths: lives with the rest of their ZK work. Weaknesses: looser coupling to Hadoop and a different release cycle.
          3. It goes into a hadoop-zookeeper module that depends on ZK and Curator, and on which anything that needs these can depend, including client-side code that wants it (ZK trickles out somehow already; pulling it apart would be cleaner). For example, my mini-ZK cluster as a YARN service could be another service to add.

          Strategy 3 appeals to me: it's not that different from what is there today (indeed, we just fix the module name for now and add more features as/when needed).

          Andrey Stepachev added a comment -

          I made improvements to this patch. The improved version was benchmarked; results are attached.
          I'll file the modified version of the CE as a separate jira later.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12663803/HADOOP-10641.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 6 new or modified test files.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          -1 javadoc. The javadoc tool appears to have generated 2 warning messages.
          See https://builds.apache.org/job/PreCommit-HADOOP-Build/4537//artifact/trunk/patchprocess/diffJavadocWarnings.txt for details.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          -1 findbugs. The patch appears to introduce 1 new Findbugs (version 2.0.3) warnings.

          -1 release audit. The applied patch generated 1 release audit warnings.

          +1 core tests. The patch passed unit tests in hadoop-common-project/hadoop-coordination.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-HADOOP-Build/4537//testReport/
          Release audit warnings: https://builds.apache.org/job/PreCommit-HADOOP-Build/4537//artifact/trunk/patchprocess/patchReleaseAuditProblems.txt
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HADOOP-Build/4537//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-coordination.html
          Console output: https://builds.apache.org/job/PreCommit-HADOOP-Build/4537//console

          This message is automatically generated.

          Plamen Jeliazkov added a comment -

          The test testSimpleProposals was also updated to validate that the CoordinationEngine's GlobalSequenceNumber increments monotonically per Agreement reached.

          Plamen Jeliazkov added a comment -

          Attaching new patch.

          Here are the major updates:

          1. Added ZK implementation interfaces to make Agreement handling clearer.
          2. ZKCoordinationEngine takes a collection of ZKAgreementHandlers. It is the job of the handlers to type-cast the Agreements. The prerequisite for the type cast is that ZKAgreementHandler.handles(Agreement) must return true for that particular Agreement.
          3. ZKCoordinationEngine executes each Agreement amongst all the ZKAgreementHandlers. Look at ZKCoordinationEngine.executeAllHandlers(), sketched after this list.
          4. SampleHandler sets the GlobalSequenceNumber of the SampleProposal before executing it.
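
          A minimal sketch of that dispatch, with hypothetical signatures inferred from the class names above:

              import java.util.List;

              interface Agreement {}

              interface ZKAgreementHandler {
                /** True iff this handler knows how to execute the given Agreement. */
                boolean handles(Agreement agreement);

                /** Called only after handles() returns true, so the downcast inside is safe. */
                void executeAgreement(Agreement agreement);
              }

              class ZKCoordinationEngineDispatch {
                private final List<ZKAgreementHandler> handlers;

                ZKCoordinationEngineDispatch(List<ZKAgreementHandler> handlers) {
                  this.handlers = handlers;
                }

                /** Offers the agreement to every registered handler that accepts it. */
                void executeAllHandlers(Agreement agreement) {
                  for (ZKAgreementHandler handler : handlers) {
                    if (handler.handles(agreement)) {
                      handler.executeAgreement(agreement);
                    }
                  }
                }
              }
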
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12663755/zkbench.pdf
          against trunk revision .

          -1 patch. The patch command could not apply the patch.

          Console output: https://builds.apache.org/job/PreCommit-HADOOP-Build/4535//console

          This message is automatically generated.

          Henry Wang added a comment -

          I ran zookeeper-benchmark and NNThroughputBenchmark.
          ZooKeeper write throughput on HDD is comparable to NameNode write-operation throughput, while ZooKeeper throughput on SSD is higher than NameNode throughput. Based on the results, ZooKeeper is not expected to be the bottleneck.

          Alex Newman added a comment -

          Michael Parkin this looks really neat.

          Steve Loughran I am curious if that helps at all?

          Michael Parkin added a comment -

          Please find attached a first attempt at a CE specification in TLA+. With TLC and the attached configuration file, model checking is successful.

          The specification is for a CE that accepts proposals (containing values submitted by proposers) and produces a sequence of agreements. The mechanism through which proposals are agreed will depend on the coordination algorithm used by the CE implementation; at the moment, all the specification states is that a submitted proposal ends up in the agreement sequence.

          Konstantin Boudnik added a comment -

          Looks like the new pom.xml file is missing the ASL boilerplate.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12659686/HADOOP-10641.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 6 new or modified test files.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. There were no new javadoc warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          -1 findbugs. The patch appears to introduce 1 new Findbugs (version 2.0.3) warnings.

          -1 release audit. The applied patch generated 1 release audit warnings.

          +1 core tests. The patch passed unit tests in hadoop-common-project/hadoop-coordination.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-HADOOP-Build/4420//testReport/
          Release audit warnings: https://builds.apache.org/job/PreCommit-HADOOP-Build/4420//artifact/trunk/patchprocess/patchReleaseAuditProblems.txt
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HADOOP-Build/4420//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-coordination.html
          Console output: https://builds.apache.org/job/PreCommit-HADOOP-Build/4420//console

          This message is automatically generated.

          Plamen Jeliazkov added a comment -

          Attaching new patch, here are the changes I've made:

          1. Moved updateCurrentGSN() to after executeAgreement(). We will replay the agreement if we crashed before updating the GSN in ZK (see the sketch after this list).
          2. Removed the ProposalReturnCode(s). We just return if submission was successful, and throw an exception if unsuccessful.
          3. Renamed ProposalNotAcceptedException to ProposalSubmissionException.
          4. NoQuorumException now extends ProposalSubmissionException.
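
          A small sketch (hypothetical method names) of why the reordering in item 1 makes a crash replayable rather than lossy:

              import java.io.IOException;

              abstract class GsnOrderingSketch {
                abstract void executeAgreement(Object agreement) throws IOException;
                abstract void updateCurrentGSN(long gsn) throws IOException; // persisted in ZK

                void applyAgreement(Object agreement, long gsn) throws IOException {
                  executeAgreement(agreement); // 1. apply the agreed operation locally
                  updateCurrentGSN(gsn);       // 2. only then record progress in ZK
                  // A crash after (1) but before (2) leaves the stored GSN pointing at
                  // this agreement, so it is executed again on restart; execution must
                  // therefore be idempotent for the replay to be safe.
                }
              }
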
          Steve Loughran added a comment -

          It seems that you cite HADOOP-9361 as an example of a specification of an API. Same as Andrew, I could not find specifications or any documents linked there. Maybe you can clarify. Is it a TLA+ spec of HDFS?

          Look in the site docs.

          I actually used the Z model with a Python syntax in the hope it would be more broadly understood, though now that I'm trying to integrate it with other work I'm trying to think "shall I go back and TLA+ it?"... because then it's possible to start thinking about code using the public module definitions.

          If FileSystem specs are done and the tests developed according to specs, should that be sufficient to safeguard FS behaviour of any internal changes including introduction of CE? Which I assume is your main concern with this jira.

          My concern is slightly different: we're producing a plugin point for co-ordination across the Hadoop stack, and we need to know what it is meant to do.

          If I introduce an implementation of CE using ZK and it does not break the tests and therefore does not alter FileSystem semantics, isn't that a verification of the implementation.

          The real HDFS tests are the full stack, with failure injection...those FS API ones are limited to those API calls and how things like seek() work. It'll be the full stack tests with failure injection that will highlight where a test fails after the changes.

          As for the tests in this JIRA so far, they're pretty minimal and verify that the ZK implementation does "something" provided nothing appears to fail. They don't seem designed to be run against any other implementation of the interface, nor is there any failure injection. Even without the fault injection, any tests for this should be designed to be targetable at any implementation, to show consistent behaviour in the core actions.

          Such tests won't verify robustness though... it looks to me that I could implement this API using in-memory data structures, something that would be utterly lacking the durability things need. Or I could try to use Gossip, which may have the durability but a different ordering guarantee, which may show up in production. It's things like the latter I'm hoping to catch, by spelling out to implementors what they have to do.

          It looks like you propose to introduce a new requirement for Hadoop contributions, that new features should be formally specified and mathematically proven. I think this should be discussed in a separate thread before it can be enforced. Would be good to hear what people think.

          Proven? Not a chance. What I would like to see is those critical co-ordination points to be defined formally enough that there's no ambiguity about what they can do. This proposal is for a plugin to define what HDFS, HBase &c will expect from a consensus service, so we have to ensure there's no ambiguity. Then we can use those documents to derive the tests to break things.

          Generally great minds were thinking about disciplined software development techniques way back, probably starting from Dijkstra and Donald Knuth. I found them very useful dealing with complex algorithms, not sure about APIs.

          +Parnas, who invented "interfaces" as the combination of (signature, semantics) in 1972. What I'm trying to do here is get those semantics nailed down.

          Konstantin Shvachko added a comment -

          I asked this question in the linked jira and want to post it here.
          Are the points raised so far objections to creating a branch and starting the implementation of CoordinationEngine along with ConsensusNode on it, as was agreed at the July 15 meetup?

          Konstantin Shvachko added a comment -

          Steve, I am glad we are talking about the same thing now, which is the CoordinationEngine interface.

          1. It seems that you cite HADOOP-9361 as an example of a specification of an API. Same as Andrew, I could not find specifications or any documents linked there. Maybe you can clarify. Is it a TLA+ spec of HDFS?
          2. If FileSystem specs are done and the tests developed according to specs, should that be sufficient to safeguard FS behaviour of any internal changes including introduction of CE? Which I assume is your main concern with this jira.
            If I introduce an implementation of CE using ZK and it does not break the tests and therefore does not alter FileSystem semantics, isn't that a verification of the implementation.
          3. It looks like you propose to introduce a new requirement for Hadoop contributions, that new features should be formally specified and mathematically proven. I think this should be discussed in a separate thread before it can be enforced. Would be good to hear what people think.
          4. Generally great minds were thinking about disciplined software development techniques way back, probably starting from Dijkstra and Donald Knuth. I found them very useful dealing with complex algorithms, not sure about APIs.
          Steve Loughran added a comment -

          this jira is not proposing new Consensus protocols, as stated in this comment. CoordinationEngine here is an interface to be used with existing consensus algorithms,

          Exactly. This JIRA is proposing a plugin interface to co-ordination systems using consensus algorithms, a plugin point intended for use by HDFS and others. It is absolutely critical that all implementations of this plugin do exactly what is expected of them, and we cannot ensure that without a clear definition of what they are meant to do, what guarantees must be met, and what failure modes are expected.

          The consensus node design document is not such a document. It's an outline of what can be done, but it doesn't specify the API. The current patch for this JIRA contains some interfaces, a ZK class, and a single test case. Can we trust this ZK class to do what is required? Not without a clear definition of what is required. Can we trust the test case to verify that the ZK implementation does what is required? Not now, no. What do we do if there is a difference between what the ZK implementation does and what the interface defines: is the interface at fault, or the ZK implementation? What if a third-party implementation does something differently? Whose implementation is considered the correct one?

          For the filesystems, HDFS defines the behavior; my '9361 JIRA was deriving a specification from that implementation, generating more corner-case tests, and making the details of how (every) other filesystem behaves differently a declarative bit of XML for each FS; now we can see how they differ. We've even used it to bring the other filesystems (especially S3N) more in line with what is expected.

          This new plugin point is intended to become a critical failure point for HDFS and YARN, where the incorrect behaviour of an implementation potentially places data at risk. Yet to date, all we have is a PDF file which, as Amazon describes it, "conventional design documents consist of prose, static diagrams, and perhaps pseudo-code in an ad hoc untestable language."

          This is not a full consensus protocol; it will be straightforward to specify strictly enough to derive tests, to tell implementors of consensus protocol-based systems how to hook up their work to Hadoop. And, as those implementors are expected to be experts in distributed systems and such topics, we should be able to expect them to pick up basic specification languages just as we expect submitters of all patches to be able to write JUnit tests.

          Konstantin Shvachko added a comment -

          Consensus protocols are expected to provide proofs of the algorithms' correctness

          Steve, this jira is not proposing new consensus protocols, as stated in this comment.
          CoordinationEngine here is an interface to be used with existing consensus algorithms, which indeed go through rigorous mathematical scrutiny before they become trustworthy.

          Steve Loughran added a comment -

          This is a good idea in the abstract, but the notion of applying Amazon's process to a volunteer open source project is problematic.

          Consensus protocols are expected to provide proofs of the algorithms' correctness; anything derived from Paxos, Raft et al. relies on those algorithms being considered valid, and on the implementors being able to understand the algorithms. Open source consensus protocol implementations are expected to publish their inner workings, else they can't be trusted. I will cite Apache ZooKeeper's ZAB protocol, and Anubis's consistent T-space model, as examples of two OSS products that I have used and implementations that I trust.

          In terms of the Hadoop contribution process, this is a novel requirement.

          Implementations of distributed consensus protocols are already one place where the team needs people who understand the maths. If a team implementing a protocol isn't able to specify it formally in some form or other: run. And if someone who can't prove that it works tries to submit changes to the core protocols of an OSS implementation, I would hope that the patch will be rejected.

          Which is why I believe this specific JIRA, "provide an API and reference implementation of distributed updates", is suitable for the criterion "provide a strict specification". I'm confident that someone on the WanDisco dev team will be able to do this, and I would make "understand this specification" a prerequisite for anyone else doing their own implementation.

          Even so, we can't expect complete proofs of correctness. Which is why I said "any maths that can be provided, and test cases".

          For HADOOP-9361, the test cases were the main outcome: by enumerating invariants and pre/post conditions, some places where we didn't have enough tests became apparent. These were mostly failure modes of some operations (e.g. what happens when preconditions aren't met).

          Derived tests are great as:

          1. Jenkins can run them; you can't get mathematicians to prove things during automated regression tests.
          2. It makes it easier to decide if a test failure is due to an error in the test, or a failure of the code. If a specification-derived test fails, then it is now due to either an error in the specification or the code.

          I think we need to do the same here: from a specification of the API, build the test cases which can verify the behavior as well as local tests can. Those implementors of the back end now get those tests alongside a specification which defines what they have to implement.

          The next issue becomes "can people implementing things understand the specification?". It's why I used a notation that uses Python expressions and data structures; one that should be easy to understand. It's also why users of the TLA+ stuff in the Java & C/C++ world tend to use the curly-braced form of the language.

          I'm sorry if this appears harsh, or if I've suddenly added a new criterion to what Hadoop patches have to do, but given this Coordination Manager is proposed as a central part of a future HDFS and YARN RM, then yes, we do have to define it properly.

          Andrew Purtell added a comment -

          I want a formal specification of the API, and what we have in the current PDF design document is not it. I will also need evidence that the reference ZK implementation is consistent with that specification, both by any maths that can be provided, and the test cases derived from the specification.

          This is a good idea in the abstract, but the notion of applying Amazon's process to a volunteer open source project is problematic. In terms of the Hadoop contribution process, this is a novel requirement. It is up to the Hadoop committership to determine commit criteria of course, but I humbly suggest that the intersection of contributors able to mathematically prove the correctness of a large code change while simultaneously being able to implement production quality systems code is vanishingly small. In this case, the contributors might be able to meet the challenge but going forward if significant changes to Hadoop will require a team of engineers and mathematicians, probably this marks the end of external contributions to the project. Also, I looked at HADOOP-9361. The documentation updates there are fantastic but I did not find any mathematical proofs of correctness.

          Steve Loughran added a comment -

          I'm very much a +0 on the changes to HDFS, as that level is not an area of my understanding. If something does go into HDFS, then as noted, hadoop-common does seem an appropriate location, if it can't go into hadoop-hdfs itself.

          Before that happens, consider this:

          Consensus protocols are where CS-hard mathematics comes out of the textbooks and into the codebase; it is a key place where you are expected to prove the correctness of your algorithm before your peers will trust it. And, hopefully, before you make the correctness of that algorithm a critical part of your own application.

          If Hadoop is going to provide a plug-in point for distributed co-ordination systems – which is what this proposal is – then we need to specify what is expected of an implementation strictly enough that it is possible to prove that implementations meet the specification, and that downstream projects can demonstrate that if an implementation meets this specification then their own algorithms will be correct.

          More succinctly: I want a formal specification of the API, and what we have in the current PDF design document is not it. I will also need evidence that the reference ZK implementation is consistent with that specification, both by any maths that can be provided, and the test cases derived from the specification.

          This may seem a harsh requirement, but HADOOP-9361 shows that it is nothing I would not impose on myself. It is what Amazon is doing in their stack, and it has also been done for Distributed File Systems.

          I would recommend using TLA+ here, and for any downstream uses. Once the foundations are done, then we can move on to YARN, and then finally to the applications which run on it.

          I'm not going to comment on the code at all at this point, except to observe that you should be making this a YARN service to integrate with the rest of the services and workflow being built around them. The core classes are in hadoop-common.

          Konstantin Boudnik added a comment -

          I think in the code we refer to it as CNode or ConsensusNode, hence the name for the branch.

          Allen Wittenauer added a comment -

          Did you mean ConsensusNameNode?

          Alex Newman added a comment -

          Hey dude. Should we delay this a bit?


          Konstantin Boudnik added a comment -

          As proposed above and agreed during the meet-up yesterday, I will go ahead and create the new branch ConsensusNode off trunk, so we'll start adding the implementation there.

          Plamen Jeliazkov added a comment -

          We hosted a meet-up at the WANdisco office in San Ramon today. Thank you to everyone who came. I'd especially like to thank Aaron T. Myers and Sanjay Radia for taking the time to connect with us.

          I took the liberty to record some of the comments / concerns people raised during our meet-up. I will list all of them here and provide a few responses.

• Are NoQuorumException and ProposalNotAcceptedException enough? Are there other exceptions CoordinationEngine might throw?
  • My own feeling is that these two in particular were the most general and universal. We could always add IOException, if desired.
• In submitProposal() there is a ProposalReturnCode return value and a possible Exception to be thrown. It is unclear which one to use.
  • I agree. Konstantin looked at me for an answer during this but I remained silent. The intent is for ProposalReturnCode to report deterministic results (NoQuorum is a deterministic event: the Proposal was not sent), and to reserve the Exception case for something wrong with the Proposal itself (i.e., it doesn't implement equals() or hashCode() correctly, or cannot be serialized properly). I understand the confusion, and we could do better with just the Exception case; see the sketch after this list.
          • ConsensusNode is non-specific. Consider renaming the project to ConsensusNameNode.
            • This applies to HDFS-6469. I think ConsensusNameNode is a good name. I'll probably always continue to call them CNodes though.
• Concern about PAXOS's ability to effectively load-balance clients. Two round trips make writes slow.
• CNodeProxyProvider should allow for deterministic host selection. Consider a round-robin approach.
• We are weakening read semantics to provide the fast read path. This makes stale reads possible.
  • Konstantin discussed the 'coordinated read' mechanism and how we ensure clients talk to up-to-date NameNodes via Proposals.
• Sub-namespace WAN replication is highly desirable, but double-journaling in the CoordinationEngine and the EditsLog is concerning.
• The community would like the impact on write performance addressed.
• HBase is coming up with a WAL plugin for possible coordination. Wary of membership coordination (multiple Distributed State Machines) for HBase WALs.
• A small separate project might make it more likely for people to import CE into their own projects and build their own CoordinationEngines. A separate branch is also possible.

Some of these clearly correspond to the HDFS and HBase projects and not just the CoordinationEngine itself. Apologies if I missed anyone's concern or point; I'm pretty sure I captured everybody though.
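
As a reader aid, here is a minimal sketch of the API shape debated in the first two items; the names come from this thread, but the exact signatures in the patch may differ.

    import java.io.IOException;
    import java.io.Serializable;

    // Illustrative sketch only, combining names from the discussion above.
    public interface CoordinationEngine {

      /** Deterministic outcomes of submitting a proposal. */
      enum ProposalReturnCode {
        OK,        // accepted for global ordering
        NO_QUORUM  // quorum unavailable; the proposal was never sent
      }

      /**
       * Submit a proposal for global ordering. Return codes report
       * deterministic engine states; exceptions are reserved for problems
       * with the proposal itself, e.g. it cannot be serialized.
       */
      ProposalReturnCode submitProposal(Serializable proposal) throws IOException;
    }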

          Aaron T. Myers added a comment -

          I'm saying you should commit the Coordination Engine interface to the ConsensusNode feature branch and use it on that branch, and then at some point we may merge the whole branch to trunk, CE and CN simultaneously. This is exactly what I said previously:

          I'm fine with you proceeding with this on a development branch. That will give you an opportunity to commit the coordination engine interface and start making progress on HDFS-6469. If and when that materializes as a stable system that the community wants to adopt into Hadoop, then we'll merge it back to trunk just like we've done with many large features that are better accomplished via multiple JIRAs and doing the work piecemeal.

          Konstantin Boudnik added a comment -

I'm not comfortable with committing this to Hadoop trunk before it's actually something that Hadoop trunk will use.

This is a chicken-and-egg problem, don't you think? You don't want to get this piece into Common before something in trunk uses it. However, it isn't possible to have anything in trunk use the API until it is committed. Am I missing anything?

          Aaron T. Myers added a comment -

There is no resistance. The plan has always been to build CNode on a branch. I am just trying to optimize development of CNode and HBase region replication, which are going on in parallel. My thinking was to commit the CE interface to trunk and then branch HDFS off of it. That way both HDFS and HBase can use the interface.

I'm not comfortable with committing this to Hadoop trunk before it's actually something that Hadoop trunk will use. How about committing this to both HBase and the HDFS-6469 development branch? Or, you could of course go the route I originally suggested of making the CE interface and ZK reference implementation an entirely separate project that both HBase and the HDFS-6469 branch could depend on.

          Eric Yang added a comment -

Should executeAgreement call updateGSN first or doExecute first? It seems more reliable to do a 2-phase commit; otherwise you can run into situations where the GSN is updated but the execution failed.
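
For illustration, here is a minimal sketch of the ordering Eric is asking about, with a hypothetical doExecute step passed in as a Runnable (none of these names are from the patch). Applying the agreement before advancing the GSN means a crash between the two steps is recovered by replaying from currentGSN + 1, provided execution is idempotent:

    // Hypothetical sketch, not the patch's code.
    public final class AgreementExecutor {
      private long currentGSN;

      public synchronized void executeAgreement(long gsn, Runnable doExecute) {
        if (gsn <= currentGSN) {
          return;           // already applied; replay after a crash is a no-op
        }
        doExecute.run();    // step 1: apply the agreement to local state
        currentGSN = gsn;   // step 2: only then record it as executed
      }
    }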

          Konstantin Shvachko added a comment -

Sorry if it sounded like an overreaction; none intended.

doing this work on a branch
There is no resistance. The plan has always been to build CNode on a branch. I am just trying to optimize development of CNode and HBase region replication, which are going on in parallel. My thinking was to commit the CE interface to trunk and then branch HDFS off of it. That way both HDFS and HBase can use the interface.

the primary goal of this work is to introduce a plugin point for WANdisco's coordination engine implementation
I don't see anything bad with plugging the WANdisco CE into Hadoop, as I argued in the other jira comment. But saying it's a primary goal is not fair; you know me better than that.

Let me comment on the design. We actually looked at multiple consensus algorithms and their implementations and came up with abstractions that suit the area in the most general way. In particular, the callback from an agreement to update the application state is separated from the proposing action because it is more generic. With some implementations of Raft a proposer can just wait until the agreement is made and then proceed with its execution - synchronously. But with ZK you have to set a watcher and wait for a callback acknowledging the event - asynchronously. So the asynchronous approach wins as more generic.
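
A minimal sketch of that distinction, assuming a hypothetical propose() method rather than the patch's actual API: modelling agreement delivery as an asynchronous completion covers both kinds of engines.

    import java.util.concurrent.CompletableFuture;

    // Hypothetical sketch. A Raft-style engine may complete the future
    // before propose() returns (the synchronous case); a ZooKeeper-based
    // engine completes it later, from the watcher callback (the
    // asynchronous case). A caller that wants blocking semantics simply
    // calls propose(p).join().
    interface AsyncCoordination<P, A> {
      CompletableFuture<A> propose(P proposal);
    }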

If you take BART we can organize a pickup for participants from the nearest station. Also, we should have a dial-in.

          Konstantin Boudnik added a comment -

          Please give me a little more credit than this. Even though it may not be mentioned in the design doc, it's fairly transparent that the primary goal of this work is to introduce a plugin point for WANdisco's coordination engine implementation into Hadoop.

And as Konstantin said elsewhere: Hadoop has a number of features that are targeted at proprietary technologies, which doesn't seem to have bothered anyone thus far. So I can't consider this a real objection.

          Aaron T. Myers added a comment -

          Konst, you are overreacting here. As I said previously, I'm fine with you proceeding with this on a development branch. That will give you an opportunity to commit the coordination engine interface and start making progress on HDFS-6469. If and when that materializes as a stable system that the community wants to adopt into Hadoop, then we'll merge it back to trunk just like we've done with many large features that are better accomplished via multiple JIRAs and doing the work piecemeal. In particular, this comment doesn't make any sense:

          Committing this to a development branch wouldn't make sense without you being convinced or comfortable to have it merged to trunk once the work is done.

          The work on HDFS-6469 is not done, so how can I possibly know whether or not I'll be convinced and comfortable to have it merged to trunk, when it is time to merge the feature branch? My skepticism of this feature at this point is not a good reason to commit it to trunk first without a development branch. I think the exact opposite is the case: doing this on a feature branch is a way for you to be able to demonstrate the benefits and prove out how low-risk the NN changes are that are required for this work. I do not understand at all your resistance to doing this work on a branch.

Not sure which 3rd party system dependencies you see here. There are none mentioned in the CNode design. And ZK is already a dependency for Hadoop HA.

          Please give me a little more credit than this. Even though it may not be mentioned in the design doc, it's fairly transparent that the primary goal of this work is to introduce a plugin point for WANdisco's coordination engine implementation into Hadoop.

          I will try to make the meeting next week but at the moment my schedule does not allow it. I will try to move things around, though. San Ramon is also a very inconvenient location for me. Will there be a dial-in provided for those who cannot attend in-person?

          Konstantin Shvachko added a comment -

          several unanswered or un-retracted objections

• I did address the "complexity" issue in the first paragraph of my reply to Suresh.
  I cannot and probably should not address the comfort levels of community members in general, but I can and will gladly address technical issues should you raise any.
  These two jiras do introduce some concepts which may be new to somebody (as they were to me when I started the project). But distributed coordination is the direction in which distributed systems are moving as they mature; I'll just mention Google's Spanner and Facebook's HydraBase as examples. In my experience such concepts in fact simplify system architectures rather than complicate them.
• I will address Todd's comment in HDFS-6469 in more detail.

          the design does not add much (or perhaps any) benefit over a simpler solution that builds on the current HA system in Hadoop

• I discussed the alternative solution in my reply to Todd; see the section on Active-Active vs Active-Standby HA.
  That approach faces essentially the same problems as ConsensusNode, or "opens the same can of worms as the ConsensusNode" in Todd's words. But CNode in the end gives us all-active NNs, rather than a single active with the others as read-only standbys.
• Coordination opens an opportunity for a geographically distributed HDFS, which allows the file system to scale across data centers.
• Coordination opens an opportunity for active-active Yarn.
• Coordination opens an opportunity for replicated regions in HBase.

I'm concerned about baking in dependence on a proprietary 3rd party system for HA capabilities
Not sure which 3rd party system dependencies you see here. There are none mentioned in the CNode design. And ZK is already a dependency for Hadoop HA.

general agreement
I really don't know how to answer the rest of your comments, Aaron.

• You seem to have issues with the design of HDFS-6469, but did not present any technical reasons there.
• You make HDFS-6469 a pre-condition for HADOOP-10641, but the CNode implementation cannot start without the CE interface.
• Committing this to a development branch wouldn't make sense without you being convinced or comfortable having it merged to trunk once the work is done.
• You do not give a clue as to what would indicate a "general agreement" or what would convince you that there is one.

We are hosting a community meeting next week, which was announced on the dev lists. The topics on the agenda include technical discussion as well as the logistics of moving forward. Are you available to talk about these issues at the meeting and potentially work out a general agreement or a compromise?

          Aaron T. Myers added a comment -

          HDFS-6469 still has several unanswered or un-retracted objections, perhaps most notably this one from Suresh:

          I am very uncomfortable about adding all this complexity into HDFS.

          and this one from Todd:

          Lastly, a fully usable solution would be available to the community at large, whereas the design you're proposing seems like it will only be usably implemented by a proprietary extension (I don't consider the ZK "reference implementation" likely to actually work in a usable fashion).

I personally share both of these concerns with Suresh and Todd: specifically, this design seems overly complex and does not add much (or perhaps any) benefit over a simpler solution that builds on the current HA system in Hadoop, and I'm concerned about baking in a dependence on a proprietary 3rd party system for HA capabilities.

          My main point here is that this work (HADOOP-10641) will not be useful to Hadoop except in the context of HDFS-6469. So, I'm not OK with committing this to trunk, at least until there's general agreement that HDFS-6469 is a reasonable design that we should move forward with in Hadoop. As it stands, I don't think there is such agreement; I certainly have not been convinced of this yet. I'm OK with you committing this to a development branch if you want to try to make progress that way, though.

          Konstantin Boudnik added a comment -

Aaron T. Myers, HDFS-6469 doesn't have a lot of objections really. It has some questions, but they are already pretty much covered, IMO, by Konstantin Shvachko. So it doesn't look like the HDFS ticket is blocking the passage of this one.

          Aaron T. Myers added a comment -

          I don't think we should be committing this to the trunk of Hadoop Common until it's actually 100% clear that HDFS will have a dependency on it. As it stands, there are a lot of objections in HDFS-6469 about adding this to the NN at all.

          If you want to add it to a development branch in Hadoop then I won't stand in the way of that, but I honestly believe you'll be able to make a lot more headway more quickly if you go the route of putting this in a separate project.

          Konstantin Shvachko added a comment -

I was thinking about ATM's suggestion of taking this as a separate project some place other than Hadoop. The main problem with that is cross-dependencies: the 'Coordination Engine' project depends on types from Hadoop Common, and HDFS will depend on the CE project. Since Common and HDFS are packaged together, this will be hard to break unless the CE project is part of Common. Currently the interface uses Configuration and logging, and looking forward it will be rather tightly integrated with the Hadoop RPC layer, since many RPC calls should trigger coordination.

          I think the patch looks good now. +1
          I see how the interfaces can be the base for HDFS and HBase coordination. I also see that ZKCoordinationEngine needs more work. I think once we start using it we will see where it should evolve.
          I propose to commit this upon Jenkins approval and if there are no objections. This should free up the implementation of coordination for HBase and HDFS.

          Rakesh R added a comment -

          Hi Alex Newman,

          zab or zk?

ZAB is a fresh project idea and is in its initial phase; I don't have many details now. Please follow ZOOKEEPER-1931 to learn more.
Motivation: there could be many use cases where you need quorum-based replication. So the fresh thought came up to define the consensus algorithm (ZAB) more cleanly, so that users can define their own data models and use ZAB to replicate their own data.

Actually, after seeing the 'Coordination Engine' feature in HDFS, I thought of introducing this new idea.

          Mikhail Antonov added a comment -

          Guys,

attached is an improved version of the patch, with the following changes:

• The ZkCoordinationEngine class is now not abstract; rather, it relays agreement execution to a registered AgreementHandler class (see below).
• Added an interface AgreementHandler<L> to control execution of the agreements learned by the CoordinationEngine. This interface is parametrized with the type L of a custom Learner class. Also added a method to register the handler in the CoordinationEngine interface. A SampleLearner class is added in the tests as an example. (A rough sketch of the handler shape follows this list.)
• Improved handling of the ZooKeeper 'Disconnected' event inside the ZK-based reference implementation of the Coordination Engine: if the ZooKeeper client gets disconnected from the ensemble, the engine now pauses and resumes automatically when it reconnects to ZooKeeper.
• Cleaned up ZkConfigKeys.
• Improved javadoc, refined pom.xml-s.
• Moved the exception and proposal classes to the o.a.hadoop.coordination package, rather than subpackages of that package.
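
A rough sketch of the handler shape described in the second item above; the parameter order and names are guesses from this description, not the patch's exact code.

    // Illustrative only. The engine relays each learned agreement,
    // in the globally agreed order, to the handler registered for the
    // application's Learner type L, instead of executing it itself.
    public interface AgreementHandler<L> {
      void executeAgreement(Object agreement, L learner);
    }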
          Alex Newman added a comment -

          Rakesh R zab or zk?

          Rakesh R added a comment -

Thanks Konstantin Shvachko for the interest in the ZAB internship; if everything goes well we will try to use ZAB as a base for the CoordinationEngine. I'll keep an eye on this JIRA to watch the progress.

          Konstantin Shvachko added a comment -

Rakesh, thanks for the link. This is indeed in line with this effort.
The interface defined here does not depend on the storage provided by ZooKeeper; it only needs the coordination piece in the form defined. So if you implement "stand-alone" ZAB, it could potentially be used as a base for the CoordinationEngine.
I see the work is just starting on github - hope the internship will be fun and productive.

          Andrey Stepachev added a comment -

          Plamen Jeliazkov, good work.
Just a couple of picky notes:

org/apache/hadoop/coordination/zk/ZkCoordinationEngine.java:117
An unconfigured localNodeId can lead to uncontrolled creation of new ZK sessions. Should the code check that before creating the ZooKeeper object? Also, the 'catch' should close ZooKeeper if it was opened.

org/apache/hadoop/coordination/zk/ZkCoordinationEngine.java:132
split() doesn't take chrooted ZK configurations into account; it is much better to use org.apache.zookeeper.client.ConnectStringParser.

org/apache/hadoop/coordination/zk/ZkCoordinationEngine.java:301
currentGSN is a long but is treated as an integer, so it is easy to create an overflow.

A more serious problem is that ZK sequential nodes are integers. The code should guard against the sequence overflowing past zero and handle that using more than one parent znode, zero-crossing detection, or other techniques that prevent integer sequence-id overflow. (A minimal sketch of one such guard follows.)
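
A minimal sketch of one such guard, assuming a hypothetical SequenceGuard helper; the limit and names are illustrative, not from the patch.

    // Roll to a fresh parent znode before the 31-bit sequence counter
    // that ZooKeeper appends to PERSISTENT_SEQUENTIAL children overflows.
    public final class SequenceGuard {
      private static final int SEQUENCE_LIMIT = Integer.MAX_VALUE - 1000000;

      private int epoch;  // suffix of the current parent znode

      /** Parent path under which sequential children are created. */
      public String currentParent() {
        return "/ce/agreements-" + epoch;
      }

      /** Called with the sequence number ZooKeeper assigned to a new child. */
      public void onSequenceAssigned(int seq) {
        if (seq >= SEQUENCE_LIMIT) {
          epoch++;  // later proposals go to a fresh parent; counter resets
        }
      }
    }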

          Rakesh R added a comment -

          The intent of this jira is not to solve the general problem of distributed consensus. That is, I do not propose to build an implementation of paxos or other coordination algorithms here. This is only to introduce a common interface, so that real implementations such as ZooKeeper could be plugged into hadoop projects.

This sounds interesting. Thanks for the effort! If I understood the discussion here correctly, the idea is to build quorum-based replication. For example, the events (I think these represent data) are submitted as proposals to a quorum of nodes. In ZooKeeper terms, the Leader proposes values to the Followers. The Leader then waits for acknowledgements from a quorum of Followers before considering a proposal committed. Also, the Leader queues COMMIT(zxid) events to all Followers so that all other nodes learn the events. This ensures that the events reach all nodes in the system. One more point: in general, ZK provides strong ordering guarantees.

Some time back, ZooKeeper folks initiated discussions to decouple ZAB from ZooKeeper, so that users can make use of it, define their own models, and reliably replicate data. There is a related JIRA, ZOOKEEPER-1931, that discusses a similar feature; it is now in an initial dev stage. Please have a look at it. I hope this will help to define a common interface, and it is also an opportunity for us to learn more about the use cases.

          Regards,
          Rakesh

          Aaron T. Myers added a comment -

          I just want to make sure we are on the same page here. The intent of this jira is not to solve the general problem of distributed consensus. That is, I do not propose to build an implementation of paxos or other coordination algorithms here. This is only to introduce a common interface, so that real implementations such as ZooKeeper could be plugged into hadoop projects.

          Totally get that, but I think the point still remains that there's little expertise for defining a common interface for coordination engines in general in this project, and no real reason that the Hadoop project should necessarily be the place where that interface is defined. The ZooKeeper project, a ZK sub-project, or an entirely new TLP makes more sense to me.

          Konstantin Shvachko added a comment -

          > there's not much expertise in Hadoop for the general problem of distributed consensus

          I just want to make sure we are on the same page here. The intent of this jira is not to solve the general problem of distributed consensus. That is, I do not propose to build an implementation of paxos or other coordination algorithms here. This is only to introduce a common interface, so that real implementations such as ZooKeeper could be plugged into hadoop projects.

          Aaron T. Myers added a comment -

Not sure where this leaves QJM. I thought it satisfies all these requirements, including Hadoop's aim.

          The NameNode QuorumJournalManager and JNs are expressly for storing HDFS NN edit logs. Not for general purpose consensus, not for use by other projects like HBase, etc.

Hadoop Common is chosen for the coordination engine interface as the lowest common ancestor.

          But a separate project could just as well be a common ancestor, just like both Hadoop and HBase separately depend on ZooKeeper. There's no actual need for it to be in Hadoop Common if HBase is to use it.

          It could be used for anything Hadoop from here: HDFS, HBase, Yarn.

But it seems like it could also be used for arbitrary, non-Hadoop things, correct? If so, why put it in Hadoop?

I don't have other use cases for it in my mind at the moment. I did not have enough experience with TLPs and Incubators, but thought it needs something bigger. Say, in this case, implementations for more than one CE. Sorry if I misunderstood you, but if you want to take it to the Apache Incubator I'll be on your side.

          I personally don't think there's any good reason for this to start out as part of a larger project, and honestly think there are several downsides. For example, Hadoop's release cadence is too slow for a new project like this, there's not much expertise in Hadoop for the general problem of distributed consensus, possible desire for other non-Hadoop projects to want to use it, etc.

          Konstantin Shvachko added a comment -

          > reliable distributed coordination, which is not what Hadoop aims to do at all.

Not sure where this leaves QJM. I thought it satisfies all these requirements, including Hadoop's aim.
Anyway, we have it, and we also do distributed coordination between active and standby NNs and RMs.

Hadoop Common is chosen for the coordination engine interface as the lowest common ancestor. It could be used for anything Hadoop from here: HDFS, HBase, Yarn. I don't have other use cases for it in my mind at the moment. I did not have enough experience with TLPs and Incubators, but thought it needs something bigger. Say, in this case, implementations for more than one CE. Sorry if I misunderstood you, but if you want to take it to the Apache Incubator I'll be on your side.

          Aaron T. Myers added a comment -

          Aaron, a separate project is a great idea.

          Cool, glad we agree. Shall we resolve this JIRA then and take this proposal to the Apache Incubator? A standalone separate project seems like a much more reasonable place to put this work to me.

ZK is probably not the best home for the interface, as ZKCoordinationEngine is only one of its implementations. People talked to me about possible implementations using Raft, BookKeeper, Paxos, and two other engines. I think the variety makes sense, as there could be different performance, network, reliability, etc. requirements. Having said that, I am flexible if the ZK community wants to adopt it.

          I was only suggesting ZK since that project's focus is about reliable distributed coordination, which is not what Hadoop aims to do at all. If for some reason you didn't want to try to make this a TLP (which, again, seems more reasonable to me) then trying to contribute it to that project makes a lot more sense to me.

          Plamen Jeliazkov added a comment -

          Hi Lohit, thanks for your comments!

1. checkQuorum is an optimization some coordination engines may choose to implement in order to fail fast on client requests. In the NameNode case, if quorum loss were suspected, that NameNode could start issuing StandbyExceptions.
2. You are correct that the ZKCoordinationEngine does not currently implement ZNode clean-up. That is because it was made as a proof of concept for the CoordinationEngine API. Nonetheless, proper clean-up can be implemented: all one has to do is delete the ZNodes that everyone else has already learned about. (A minimal sketch follows the example below.)
            1. Suppose you have Node A, B, and C, and Agreements 1, 2, 3, 4, and 5.
            2. Node A and B learn Agreement 1 first. Node C is a lagging node. A & B contain 1. C contains nothing.
            3. Node A and B continue onwards, learning up to Agreement 4. A & B contain 1, 2, 3, and 4 now. C contains nothing.
            4. Node C finally learns Agreement 1. A & B contain 1, 2, 3, and 4 now. C contains 1.
            5. We can now discard Agreement 1 from persistence because we know that all the Nodes, A, B, and C, have safely learned about and applied Agreement 1.
            6. We can apply this process for all other Agreements.
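
A minimal sketch of the resulting clean-up rule, with hypothetical names: an agreement is safe to delete once the slowest node's last-learned GSN has passed it.

    import java.util.Collection;
    import java.util.Collections;

    // Illustrative only, not the patch's code.
    public final class AgreementGc {
      /** Smallest GSN that every node has learned up to. */
      static long minLearnedGsn(Collection<Long> lastLearnedPerNode) {
        return Collections.min(lastLearnedPerNode);
      }

      /** An agreement may be deleted once all nodes have learned past it. */
      static boolean safeToDelete(long agreementGsn, Collection<Long> learned) {
        return agreementGsn <= minLearnedGsn(learned);
      }
    }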
          Lohit Vijayarenu added a comment -

          Minor comments.

• It looks like checkQuorum is kind of a no-op for submitProposal in the ZK-based implementation, since zooKeeper.create would fail if there is no quorum anyway?
• In the ZK-based Coordination Engine implementation, how are ZNodes cleaned up? Looking at the patch, each proposal creates a PERSISTENT_SEQUENTIAL znode, but there is no mention of cleanup.
          Konstantin Shvachko added a comment -

Aaron, a separate project is a great idea. Although I am not sure right now how it will evolve, it may spin off from hadoop, like hadoop did from nutch. ZK is probably not the best home for the interface, as ZKCoordinationEngine is only one of its implementations. People talked to me about possible implementations using Raft, BookKeeper, Paxos, and two other engines. I think the variety makes sense, as there could be different performance, network, reliability, etc. requirements.
Having said that, I am flexible if the ZK community wants to adopt it.

          Konstantin Shvachko added a comment -

I think the latest patch is a better example of how the coordination engine should work: it accepts a proposal, then agrees on its order, then invokes Agreement.execute() to trigger the learner state update. This is an initial patch and a work in progress. We should also introduce a generic Proposer, which submits a proposal and then waits for the agreement to be executed in order to reply to the client.
          Nicholas, good points:

          • We chose Serializable interface to keep serialization as generic as possible, as Plamen mentioned. We routinely define readObject(), writeObject() for our proposals using protobuf. Actually in many cases we avoid extra serialization by passing already serialized requests directly from RPC to CoordinationEngine.
          • Not sure if it is exactly what you mean by the version of coordination engine. I see two versions here:
            1. CE software version. This should be tracked in the implementation of CE. We assume there could be different implementations of CE. If one implementation changes it should not effect other.
            2. The version of a set of Proposals / Agreements. This should be tracked on the application level. The set of proposals / agreements reflects a particular application logic and is orthogonal to the engine implementation.
          • Rolling upgrade is divided into two parts. Rolling upgrade of the application (e.g. HDFS) and rolling upgrade of the engine itself. If both support it then everything is good. If CE does not support RU, but the application does, then we should still be able to upgrade the application in the rolling manner but without upgrading CE. CE in this case will continue assigning GSNs and produce agreements.
          • Proposal and Agreement are made as generic as possible. Proposal is formally empty, but it has everything it needs. One need to define equals(), hashCode(), and serialization. We also routinely implement toString() so that CE could print proposals in the logs.
            Agreement has one method execute(), so that CE could make a call back to update the application state. It also has two generic parameters: first defines the application type, the second - the return type of the execute() method, which is intended to be returned to the client.
            ConsensusProposal from the patch combines Proposal and Agreement. This is probably the most typical use case. But there could be Proposals that don't assume Agreements, like control messages to the CE, and Agreements, which are different from corresponding Proposals, or even agreements without proposals, like control commands from CE to the application.
          • ZKCoordinationEngine should not be abstract. Made the same comment to Plamen.
          • Documentation for ZKCoordinationEngine makes sense.
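As referenced in the Proposal/Agreement item above, here is a sketch of the interface shapes as described in this thread. This is a reading of the discussion, not the patch's verbatim source, so names and signatures may differ.

import java.io.Serializable;

interface Proposal extends Serializable {
  // Formally empty: implementations define equals(), hashCode(),
  // serialization, and ideally toString() for the engine's logs.
}

interface Agreement<L, R> extends Serializable {
  // Invoked by the engine, in the agreed global order, to update the
  // learner state; L is the learner (application) type, R is the type
  // returned to the client.
  R execute(L learner);
}

// The typical combined form mentioned above.
interface ConsensusProposal<L, R> extends Proposal, Agreement<L, R> {
}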
          Plamen Jeliazkov added a comment -

          Tsz Wo Nicholas Sze, thanks for the review! It was nice to finally see a face at the Summit as well.
          Aaron T. Myers, thanks for the comments! I think I am outside of that discussion, most likely Konstantin Boudnik or Konstantin Shvachko can comment better on where to take the project.

I posted a new patch around the same time your review came in; there were mistakes in the way agreement executions worked.

• ProtoBuf is certainly a nice choice for serialization. However, we shouldn't need to bind ourselves to any one serialization format, which is why we use Serializable. It is certainly possible to have the writeObject call write out a ProtoBuf of the proposal itself, for example, and read the values back using ProtoBuf as well. This is feasible with the current interfaces (see the sketch after this list).
• Good point on version compatibility. AFAIK, version checking would have to take place once the quorum is established, as prior to that there is no communication between the engines. So the coordination engine, as part of bootstrap, should perform a version check against its quorum peers. Perhaps this means extending the API, or making it part of a larger interface (a VersionedCoordinationEngine?). Konstantin Shvachko might be able to comment better.
          • Please see my new patch. The idea is indeed to make the agreement execute on some callBack object, SampleLearner in this case. The new patch should show the test making use of it.
          • Yes we can probably do some refactoring here. I'll work on a new patch.
• Yes, we can add details for ZkCoordinationEngine. I am unsure of any clear advantages and disadvantages. The only thing that comes to mind right away is that it may be possible to build Paxos directly into the CoordinationEngine implementation, thus co-locating the coordination service with the server / application itself, rather than having to make RPC calls and wait for responses, as with ZooKeeper. I don't think the intent of this work is to compare any one coordination mechanism with another so much as to provide a common interface against which one can implement whichever mechanism they prefer.
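As referenced in the serialization item above, a hedged sketch of the "Serializable outside, ProtoBuf inside" idea. MyProposalProto is a hypothetical generated protobuf class; the real proposals in the patch may be laid out differently.

import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class RenameProposal implements Serializable {
  // The protobuf message holds the actual fields; it is transient because
  // Java serialization is delegated to protobuf below.
  private transient MyProposalProto proto;

  private void writeObject(ObjectOutputStream out) throws IOException {
    byte[] bytes = proto.toByteArray();        // protobuf does the encoding
    out.writeInt(bytes.length);
    out.write(bytes);
  }

  private void readObject(ObjectInputStream in) throws IOException {
    byte[] bytes = new byte[in.readInt()];
    in.readFully(bytes);
    proto = MyProposalProto.parseFrom(bytes);  // protobuf does the decoding
  }
}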
          Konstantin Boudnik added a comment -

Good point, Aaron T. Myers! While other projects might take advantage of the coordinated-intent approach, HDFS and HBase are the only two right now that are focusing on real HA with low MTTR. I am just a bit hesitant about over-generalizing things.

          Aaron T. Myers added a comment -

          Hey Konstantin and Plamen, have y'all given any thought to contributing the Coordination Engine somewhere other than the Hadoop project? Sounds like it's a pretty general-purpose system that other projects that have nothing to do with Hadoop might want to use. Seems to me like it might reasonably be a separate top-level Apache project, which Hadoop and HBase (and perhaps others) could then depend on. It might also make sense for it to be a sub-project of the ZooKeeper project, much like BookKeeper is.

          With something this new that you presumably want to iterate on quickly, seems like a shame to have to wait around for a Hadoop release to be able to pick up an updated Coordination Engine.

          Tsz Wo Nicholas Sze added a comment -

          Looked at the patch. Some questions:

• For serialization, I think we should use ProtoBuf rather than java.io.Serializable; it makes compatibility easier to support.
          • It seems that there is no version check between CoordinationEngine(s). How to detect if the engines are compatible with each other? Also, how to support rolling upgrade?
          • Do we really need Proposal and Agreement as two interfaces? Currently, Proposal extending Serializable is an empty interface and Agreement also extending Serializable only has one method.
          • I also don't understand why Agreement.execute(..) needs a callBackObject (learner?) parameter. The test does not make use of it.
          • ZkCoordinationEngine also has a doExecute(..) method. Could it be combined with Agreement.execute(..)? I suspect we can get rid of ZkCoordinationEngine.doExecute(..) in order to make ZkCoordinationEngine non-abstract. Then, applications like namenode only have to use the engine but not extend it.
          • In the CNode design doc, the Coordination Engine section describes some general concepts and some details for the Paxos algorithm. Could you also add some details for the ZkCoordinationEngine? Also, what are the advantages and disadvantages of the ZkCoordinationEngine compared with the Paxos algorithm?
          Plamen Jeliazkov added a comment -

          New patch. Not clear what I was doing earlier. Indeed the correct usage here is to have agreement executions call the learner, not the other way around.

          1. Factored out SampleLearner.
2. SampleProposal is now parameterized as <SampleLearner, Void>.
          3. Added setCurrentUser to SampleProposal.
4. Made SampleLearner use UserGroupInformation.doAs when executing agreements (see the sketch after this list).
          5. Removed LICENSE and NOTICE files. Sorry about that.
          6. Unit tests now correctly wait for all agreements to arrive (they were not prior).
          7. Made use of ClientBase in MiniZooKeeperCluster.
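As referenced in item 4 above, a minimal sketch of executing an agreement under the proposer's identity. SampleProposal here is a hypothetical view of the proposal, with getProposerUser() standing in for the user captured by setCurrentUser(); only UserGroupInformation.createRemoteUser and doAs are real Hadoop API.

import java.security.PrivilegedExceptionAction;
import org.apache.hadoop.security.UserGroupInformation;

// Hypothetical view of the proposal used only by this sketch.
interface SampleProposal {
  String getProposerUser();              // user captured via setCurrentUser()
  Void execute(SampleLearner learner);   // agreement side of the proposal
}

class SampleLearner {
  void apply(final SampleProposal proposal) throws Exception {
    // Re-create the identity of the original proposer...
    UserGroupInformation ugi =
        UserGroupInformation.createRemoteUser(proposal.getProposerUser());
    // ...and execute the agreement on its behalf.
    ugi.doAs(new PrivilegedExceptionAction<Void>() {
      @Override
      public Void run() {
        proposal.execute(SampleLearner.this);  // agreement updates this learner
        return null;
      }
    });
  }
}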
          Konstantin Shvachko added a comment -
1. NOTICE and LICENSE files are not needed, as they are already contained in hadoop-common.
2. MiniZooKeeperCluster should use ClientBase methods (like setupTestEnv() and waitForServer*()) instead of copying them.
3. Your SampleLearner is a good example of how it should not be used. SampleLearner represents the state of the server, like a namespace or a region. The Learner should be the first generic parameter of the Agreement, so that Agreement.execute() can call the update methods of the Learner, not the other way around as in the example provided (see the sketch below).
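A minimal sketch of the intended direction from point 3; the learner state and method names are illustrative assumptions, and the Agreement interface is inlined here so the sketch stands alone.

interface Agreement<L, R> extends java.io.Serializable {
  R execute(L learner);
}

class SampleLearner {
  private long lastAppliedGsn;             // learner state, e.g. a namespace

  // Update method driven by agreements, in the agreed global order.
  void updateState(long gsn, String value) {
    this.lastAppliedGsn = gsn;
  }
}

class SampleAgreement implements Agreement<SampleLearner, Void> {
  private final long gsn;
  private final String value;

  SampleAgreement(long gsn, String value) {
    this.gsn = gsn;
    this.value = value;
  }

  @Override
  public Void execute(SampleLearner learner) {
    learner.updateState(gsn, value);       // the agreement calls the learner,
    return null;                           // not the other way around
  }
}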
          Plamen Jeliazkov added a comment -

          Attaching new patch based on Konstantin's suggestions.

          1. Added private static class SampleLearner in unit test.
          2. SampleLearner executes SampleProposals and LOG.info's the return value.
          3. Added NOTICE and LICENSE files.
          4. Reduced the code of MiniZooKeeperCluster.
          Konstantin Shvachko added a comment -

          Couple suggestions after reviewing the patch:

          1. In Agreement it would be better to rename type T to L, and explain in the JavaDoc that this is the type of a learner.
2. In the test, it would be good to introduce some SampleLearner, which would actually process the SampleAgreement. This would exemplify how agreements are used.
3. MiniZooKeeperCluster should belong to the ZK project in the long run. For now, could you do some code reduction: a few methods are not used, some are copies of ClientBase methods and can be called directly, and the configuration member is not used.
          Plamen Jeliazkov added a comment -

          Attaching initial patch.

Initial implementation shows using ZooKeeper as a Coordination Engine. Transactions are sequenced using a single persistent-sequential Znode.

The ZooKeeper connection thread is used for learning agreements: it continually checks the Znode mentioned above for new sequence values and reads them one by one.

          Proposing values and learning about them happen in parallel.
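To make the mechanism concrete, a minimal sketch of proposing and reading under a persistent-sequential path; the parent path "/ce/agreements" and the surrounding class are assumptions, not the patch's actual layout.

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

class ZkSequencer {
  private final ZooKeeper zk;

  ZkSequencer(ZooKeeper zk) {
    this.zk = zk;
  }

  // Propose: ZooKeeper appends a 10-digit sequence suffix to the znode
  // name, which serves as the global sequence number for the agreement.
  String propose(byte[] serializedProposal) throws Exception {
    return zk.create("/ce/agreements/prop-", serializedProposal,
        ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT_SEQUENTIAL);
  }

  // Learn: read agreements back one by one, in sequence order.
  byte[] readNext(long nextSeq) throws Exception {
    String path = String.format("/ce/agreements/prop-%010d", nextSeq);
    return zk.getData(path, false, null);
  }
}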


People

• Assignee: Plamen Jeliazkov
• Reporter: Konstantin Shvachko
• Votes: 2
• Watchers: 52