Hadoop HDFS / HDFS-3092

Enable journal protocol based editlog streaming for standby namenode

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version/s: 0.24.0, 0.23.3
    • Fix Version/s: None
    • Component/s: ha, namenode
    • Labels: None

      Description

      Currently the standby namenode relies on reading shared editlogs to stay current with the active namenode's namespace changes. BackupNode used streaming of edits from the active namenode for the same purpose. This jira is to explore using journal protocol based editlog streams for the standby namenode. A daemon in the standby will get the editlogs from the active and write them to local edits. To begin with, the existing standby mechanism of reading from a file will continue to be used, but reading from the local edits instead of the shared edits.
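
      The streaming mechanism described above can be pictured as a small receiver in the standby: it accepts batches of edits pushed by the active namenode over the journal protocol and appends them to a local edits file, which the standby's existing file-based tailing then reads. The sketch below is purely illustrative; the class and method names are hypothetical and are not the HDFS-3092 code.

      import java.io.DataOutputStream;
      import java.io.FileOutputStream;
      import java.io.IOException;

      // Hypothetical receiver running in the standby namenode process.
      public class StandbyJournalReceiver {
          private final DataOutputStream localEdits;

          public StandbyJournalReceiver(String localEditsPath) throws IOException {
              // Append to the standby's local edits file.
              this.localEdits = new DataOutputStream(new FileOutputStream(localEditsPath, true));
          }

          // Called for each batch of serialized edits streamed from the active NN.
          public synchronized void journal(long firstTxId, int numTxns, byte[] records)
                  throws IOException {
              localEdits.write(records);
              localEdits.flush();   // the standby's existing tailer reads from this file
          }

          public synchronized void close() throws IOException {
              localEdits.close();
          }
      }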

      Attachments

      1. Removingfilerdependency.pdf (59 kB, Eli Collins)
      2. MultipleSharedJournals.pdf (150 kB, Suresh Srinivas)
      3. MultipleSharedJournals.pdf (161 kB, Suresh Srinivas)
      4. MultipleSharedJournals.pdf (191 kB, Tsz Wo Nicholas Sze)
      5. JNStates.png (48 kB, Tsz Wo Nicholas Sze)
      6. ComparisonofApproachesforHAJournals.pdf (72 kB, Bikas Saha)

        Issue Links

        1. Add a service that enables JournalDaemon (Sub-task, Resolved, Suresh Srinivas)
        2. Journal stream from the namenode to backup needs to have a timeout (Sub-task, Resolved, Hari Mankude)
        3. Add states for journal synchronization in journal daemon (Sub-task, Resolved, Tsz Wo Nicholas Sze)
        4. Add JournalManager implementation to JournalDaemons for storing edits (Sub-task, Open, Unassigned)
        5. Setup configuration for Journal Manager and Journal Services (Sub-task, Resolved, Hari Mankude)
        6. Sync lagging journal service from the active journal service (Sub-task, Resolved, Brandon Li)
        7. Implement JournalListener for writing journal to local disk (Sub-task, Resolved, Tsz Wo Nicholas Sze)
        8. Implement Journal reader for reading transactions from local edits log or remote Journal node (Sub-task, Resolved, Brandon Li)
        9. JournalProtocol changes required for introducing epoch and fencing (Sub-task, Resolved, Suresh Srinivas)
        10. Persist the epoch received by the JournalService (Sub-task, Open, Unassigned)
        11. JournalDaemon (server) should persist the cluster id and nsid in the storage directory (Sub-task, Resolved, Hari Mankude)
        12. Add JournalProtocol RPCs to list finalized edit segments, and read edit segment file from JournalNode (Sub-task, Resolved, Brandon Li)
        13. Fix synchronization issues with journal service (Sub-task, Resolved, Hari Mankude)
        14. Refactor BackupImage and FSEditLog (Sub-task, Resolved, Tsz Wo Nicholas Sze)
        15. Create a new journal_edits_dir key to support journal nodes (Sub-task, Resolved, Hari Mankude)
        16. Create http server for Journal Service (Sub-task, Resolved, Brandon Li)
        17. Handle block pool ID in journal service (Sub-task, Open, Unassigned)
        18. Handle creation time also in journal service (Sub-task, Open, Unassigned)
        19. Stream the edit segments to NameNode when NameNode starts up (Sub-task, Open, Brandon Li)
        20. Create a protocol for journal service synchronization (Sub-task, Resolved, Brandon Li)
        21. Remove JournalServiceProtocols (Sub-task, Open, Brandon Li)
        22. GetJournalEditServlet should close output stream only if the stream is used (Sub-task, Resolved, Brandon Li)
        23. Enable standby namenode to tail edits from journal service (Sub-task, Open, Brandon Li)
        24. Replace Kerberized SSL for journal segment transfer with SPNEGO-based solution (Sub-task, Open, Brandon Li)
        25. Fix the build failure in branch HDFS-3092 (Sub-task, Resolved, Brandon Li)

          Activity

          Suresh Srinivas added a comment -

          Design document.

          Todd Lipcon added a comment -

          Hi Suresh. I took a look at the design document, and I think it actually shares a lot with what I'm doing in HDFS-3077. Hopefully we can share some portions of the code and design.

          Here are some points I think need elaboration in the design doc:

          • How does the fencing command ensure that prior NNs can no longer access the JD after it completes? I think you need to have the JDs record the sequence number of each NN so they can reject past NNs from coming back to life.
          • I don't think the following can be done correctly:

            a. Choose the transaction >= Q JDs from JournalList.
            b. Else choose the highest transaction ID from a JD in the JournalList.
            3. All the JDs perform recovery to transaction ID Ft .

            because it's possible that there are distinct transactions with the same txid at the beginning of a log segment. For example, consider the following situation with two NNs (NN1 and NN2) and three JDs (JD1, JD2, JD3):

          1. NN1 writes txid 1 to JD1, crashes before writing to JD2 and JD3
          2. NN2 initiates fencing, but only succeeds in contacting JD2 and JD3. So, it does not see the edit made in step 1
          3. NN2 writes txid 1 to JD2 and JD3, then crashes
          4. One of the two NNs recovers. It sees that all JDs have txid 1. The fencing/synchronization process you've described cannot distinguish between the correct txn which was written to a quorum and the incorrect txn which was only written to JD1.

          Suresh Srinivas added a comment -

          Hi Suresh. I took a look at the design document, and I think it actually shares a lot with what I'm doing in HDFS-3077. Hopefully we can share some portions of the code and design.

          Sounds good. Will add more details on how the code will be organized, so we can better reuse the code.

          fencing command ensure that prior NNs can no longer access the JD after it completes

          The fence command will include a version number that we got from the JournalList ZK node. The higher number wins at the JD; a fence command with a lower version # is rejected.

          For the scenario you described, NN2, after it rolls JD2 and JD3, updates the JournalList with JD2 and JD3. JD1 will no longer be used.

          Here are some points I think need elaboration in the design doc:

          We decided to keep the document light to ensure the details do not distract from the core mechanism. Will add more details in the next version, including some use cases.
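
          A minimal sketch of the fencing rule described above, assuming a hypothetical JournalDaemon-side class (none of these names correspond to actual HDFS code): the JD remembers the highest fencing version it has accepted, rejects fence commands carrying a lower version, and rejects journal writes from a fenced writer.

          public class JournalDaemonFencer {
              private long acceptedVersion = Long.MIN_VALUE;

              // Fence request from a NN, carrying the version read from the JournalList ZK node.
              public synchronized boolean fence(long version) {
                  if (version < acceptedVersion) {
                      return false;                 // stale NN: reject
                  }
                  acceptedVersion = version;        // the higher version wins (would also be persisted)
                  return true;
              }

              // Journal writes must carry the currently accepted version.
              public synchronized void journal(long version, byte[] records) {
                  if (version < acceptedVersion) {
                      throw new IllegalStateException("write from fenced writer, version=" + version);
                  }
                  // ... append records to the local edits ...
              }
          }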

          Todd Lipcon added a comment -

          For the scenario you described, NN2 after it rolls JD2 and JD3, updates the JournalList with JD2 and JD3. JD1 will no longer be used.

          The situation here is at the beginning of a segment - for example the very first transaction. So, when NN2 rolls, the starting txid of the next segment is 1. I think you need to add an epoch number which is separate from the txid, to distinguish different "startings" of the same segment.

          Fence command will include a version number that we got from the JournalList ZK node. The number that is higher wins at the JD. The fence command with lower version # is rejected.

          You'll also need to atomically write this to local storage.

          I'll upload my in-progress code to HDFS-3077 so that we might be able to start sharing code earlier rather than later.
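
          A small sketch of the atomic persistence mentioned above, using the standard write-to-a-temp-file-then-rename pattern; the file name and format are illustrative and do not reflect the on-disk layout of either proposal.

          import java.io.IOException;
          import java.nio.charset.StandardCharsets;
          import java.nio.file.*;

          public class PersistedVersion {
              // Durably record the accepted version: write a temp file, force it to disk,
              // then atomically rename it over the previous copy.
              public static void write(Path storageDir, long version) throws IOException {
                  Path tmp = storageDir.resolve("accepted-version.tmp");
                  Path dst = storageDir.resolve("accepted-version");
                  Files.write(tmp, Long.toString(version).getBytes(StandardCharsets.UTF_8),
                          StandardOpenOption.CREATE, StandardOpenOption.TRUNCATE_EXISTING,
                          StandardOpenOption.WRITE, StandardOpenOption.SYNC);
                  Files.move(tmp, dst, StandardCopyOption.REPLACE_EXISTING,
                          StandardCopyOption.ATOMIC_MOVE);
              }

              public static long read(Path storageDir) throws IOException {
                  Path dst = storageDir.resolve("accepted-version");
                  if (!Files.exists(dst)) {
                      return Long.MIN_VALUE;   // nothing accepted yet
                  }
                  return Long.parseLong(new String(Files.readAllBytes(dst), StandardCharsets.UTF_8).trim());
              }
          }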

          Suresh Srinivas added a comment -

          The situation here is at the beginning of a segment - for example the very first transaction. So, when NN2 rolls, the starting txid of the next segment is 1. I think you need to add an epoch number which is separate from the txid, to distinguish different "startings" of the same segment.

          This is what I was thinking - when roll is called from NN, it includes the version # from journalList ZK node. It gets recorded in the start segment transaction in the editlog. Does that work? Did I understand your comment correctly?

          Todd Lipcon added a comment -

          This is what I was thinking - when roll is called from NN, it includes the version # from journalList ZK node. It gets recorded in the start segment transaction in the editlog. Does that work? Did I understand your comment correctly?

          Yes, that solves the problem, but then requires that the recovery protocol open up and read the edits out of the logs. It also leaks the edits storage mechanism into the edits contents, so I don't think it's quite the right design.

          The solution I'm using in HDFS-3077 is to keep an extra file in the edits directory which includes the version number. During recovery, edits-in-progress on a logger whose version number is out of date with respect to the others can be removed.

          The thing is, once you've implemented all of this, I don't think your solution is any less complex than the one in HDFS-3077.
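
          A rough sketch of the recovery idea being discussed: if each in-progress segment records the epoch (version number) under which it was started, recovery can keep only the segments from the newest epoch and discard stale ones, which distinguishes different "startings" of a segment that share the same first txid. The types below are hypothetical and do not correspond to the HDFS-3077 or HDFS-3092 implementations.

          import java.util.ArrayList;
          import java.util.List;

          public class SegmentRecovery {
              // An in-progress log segment as seen on one logger during recovery.
              public static final class InProgressSegment {
                  final long epoch;        // version number recorded when the segment was started
                  final long firstTxId;    // may collide across epochs, e.g. txid 1
                  public InProgressSegment(long epoch, long firstTxId) {
                      this.epoch = epoch;
                      this.firstTxId = firstTxId;
                  }
              }

              // Keep only segments written under the most recent epoch; older ones are discarded.
              public static List<InProgressSegment> selectForRecovery(List<InProgressSegment> found) {
                  long maxEpoch = Long.MIN_VALUE;
                  for (InProgressSegment s : found) {
                      maxEpoch = Math.max(maxEpoch, s.epoch);
                  }
                  List<InProgressSegment> keep = new ArrayList<>();
                  for (InProgressSegment s : found) {
                      if (s.epoch == maxEpoch) {
                          keep.add(s);     // same firstTxId, but the newer epoch wins
                      }
                  }
                  return keep;
              }
          }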

          Suresh Srinivas added a comment -

          Yes, that solves the problem, but then requires that the recovery protocol open up and read the edits out of the logs. It also leaks the edits storage mechanism into the edits contents, so I don't think it's quite the right design.

          Given the reliance on rolling, three pieces of information are needed per JD: the last txid, the segment start txid, and the associated information.

          The thing is, once you've implemented all of this, I don't think your solution is any less complex than the one in HDFS-3077

          Maybe. But I am not sure I know what you are doing in HDFS-3077. At least you know the details of what is being done here.

          Todd Lipcon added a comment -

          Could I request that this work be done on a feature branch, as there are multiple competing proposals to fill the same need? Then, when the features are complete, we can either choose to merge all, some, or none of them, based on their merits.

          Suresh Srinivas added a comment -

          Could I request that this work be done on a feature branch, as there are multiple competing proposals to fill the same need? Then, when the features are complete, we can either choose to merge all, some, or none of them, based on their merits.

          Sounds good.

          If we plan to share the code, a separate branch for HDFS-3077 does not make sense either.

          Suresh Srinivas added a comment -

          Could I request that this work be done on a feature branch, as there are multiple competing proposals to fill the same need? Then, when the features are complete, we can either choose to merge all, some, or none of them, based on their merits.

          The problem with this is unnecessary duplication of effort, which we could avoid. Like we did in HDFS-1623, we could divide the implementation into multiple jiras. Different people can pick up jiras and contribute. Right now I see a 3K+ line patch in HDFS-3077. Posting such a huge patch makes review, code reuse, etc. difficult.

          For now, perhaps development can proceed and we can see whether code reuse can happen later.

          Todd Lipcon added a comment -

          Right now I see a 3K+ line patch in HDFS-3077. Posting such a huge patch makes review, code reuse etc. difficult.

          I agree. But when I post work piecemeal, I get complaints that it's hard to see the overall direction of the work. So I'm not sure how to satisfy folks... Given this is all new code, and a lot of it is tests and "protobuf translators", it's actually not too bad.

          If you will agree to review the pieces as they come in, I'm happy to split HDFS-3077 into several patches. But if no one is signing up to review the pieces, splitting it up just slows down progress in my experience.

          Flavio Junqueira added a comment -

          Hi Suresh, Thanks for sharing a design document. I have a few comments and questions if you don't mind:

          1. I find this design to be very close to bookkeeper, with a few important differences. One noticeable difference that has been mentioned elsewhere is that bookies implement mechanisms to enable high performance when there are multiple concurrent ledgers being written to. Your design does not seem to consider the possibility of multiple concurrent logs, which you may want to have for federation. Federation will be useful for large deployments, but not for small deployments. It sounds like a good idea to have a solution that covers both cases.
          2. There have been comments about comparing the different approaches discussed, and I was wondering what criteria you have been thinking of using to compare them. I guess it can't be performance because, as the numbers Ivan has generated show, the current bottleneck is the namenode code, not the logging. Until the existing bottlenecks in the namenode code are removed, having a fast logging mechanism won't make much difference with respect to throughput.
          3. I was wondering about how reads to the log are executed if writes only have to reach a majority quorum. Once it is time to read, how does the reader get a consistent view of the log? One JD alone may not have all entries, so I suppose the reader may need to read from multiple JDs to get a consistent view? Do the transaction identifiers establish the order of entries in the log? One quick note is that I don't see why a majority is required; bk does not require a majority.

          Here are some notes I took comparing the bk approach with the one in this jira, in case you're interested:

          1. Rolling: The notion of rolling here is equivalent to closing a ledger and creating a new one. As ledgers are identified with numbers that are monotonically increasing, the ledger identifiers can be used to order the sequence of logs created over time.
          2. Single writer: Only one client can add new entries to a ledger. We have the notion of a recovery client, which is essentially a reader that makes sure that there is agreement on the end of the ledger. Such a recovery client is also able to write entries, but these writes are simply to make sure that there is enough replication.
          3. Fencing: We fence ledgers individually, so that we guarantee that all operations a ledger writer returns successfully are persisted on enough bookies. This is different from the approach proposed here, which essentially fences logging as a whole.
          4. Split brain: In a split-brain situation, bk can have two writers each writing to a different ledger. However, my understanding is that a namenode that is failing over cannot make progress without reading the previous log (ledger), consequently this situation cannot occur with bk and we don’t require writes to a majority for correctness.
          5. Adding JDs: The mechanism described here mentions explicitly adding a new JD. My understanding is that a new JD is brought up and it is told somehow to connect to the namenode and to another JD in the JournalList to sync up. bk currently only picks bookies from a pool of available bookies through zookeeper. It shouldn’t be a problem to allow a fixed list of bookies to be passed upon creating a ledger.
          6. Striping: bk implements striping, although that’s an optional feature. It is possible to use a configuration like 2-2 or 3-3 (Q-N, Q=quorum size and N=ensemble size).
          7. Failure detection: bk uses zookeeper ephemeral nodes to track bookies that are available. A client also changes its ensemble view if it loses a bookie by adding a new bookie. I’m not exactly sure how you monitor crashes here. Is it the namenode that keeps track of which JDs in the JournalList are available?
          Suresh Srinivas added a comment -

          Your design does not seem to consider the possibility of multiple concurrent logs, which you may want to have for federation.

          For HDFS editlogs, my feeling is that there will only be three JDs: one on the active namenode, a second on the standby, and a third JD on one of the other machines. In federation, one has to configure a JD per federated namespace. An alternative is to use BookKeeper, since it could make the deployment simpler for a large federated cluster.

          There has been comments about comparing the different approaches discussed, and I was wondering what criteria you have been thinking of using to compare them.

          I think the comment was more about comparing the design and complexity of deployment and not benchmarks for two systems. Performance is not the motivation for this jira.

          I was wondering about how reads to the log are executed if writes only have to reach a majority quorum. Once it is time to read, how does the reader gets a consistent view of the log? One JD alone may not have all entries, so I suppose the reader may need to read from multiple JDs to get a consistent view? Do the transaction identifiers establish the order of entries in the log? One quick note is that I don't see why a majority is required; bk does not require a majority.

          We decided on a majority quorum to keep the design simple, though it is not strictly necessary. A JD in the JournalList is supposed to have all the entries, and any JD from the list can be used to read the journals.

          Here are some notes I took comparing the bk approach with the one in this jira, in the case you're interested

          I noticed that as well. After we went through many issues that this solution had to take care of, the solution looks very similar to BK. That is comforting.

          Flavio Junqueira added a comment -

          Thanks for the responses, Suresh.

          For HDFS editlogs, my feeling is that there will only be three JDs. One on the active namenode, second on the standby and a third JD on one of the machines. In federation, one has to configure a JD per Federated namespace. Alternative is to use BookKeeper, since it could make the deployment simpler for federated large cluster.

          When you say three JDs, that's the degree of replication, right? When I said multiple logs, I was referring to multiple namenodes writing to different logs, as with federation.

          I think the comment was more about comparing the design and complexity of deployment and not benchmarks for two systems. Performance is not the motivation for this jira.

          Got it. You're thinking about a qualitative comparison of the designs based on the requirements identified. Correctness sounds like an obvious candidate.

          We decided on majority quorum to keep the design simple, though it is strictly not necessary. A JD in JournalList is supposed to have all the entries and any JD from the list can be used to read the journals.

          I think my confusion here is that you require a quorum to be able to acknowledge the operation, but in reality you try to write to everyone. If you can't write to everyone, then you induce a view change (change to JournalList). Is this right?

          Suresh Srinivas added a comment -

          When you say three JDs, that's the degree of replication, right? When I said multiple logs, I was referring to multiple namenodes writing to different logs, as with federation.

          Right, three JDs for the degree of replication. I do understand multiple logs, however - that is, a log per namespace. For every namespace in federation, active and standby namenodes + an additional JD are needed.

          I think my confusion here is that you require a quorum to be able to acknowledge the operation, but in reality you try to write to everyone. If you can't write to everyone, then you induce a view change (change to JournalList). Is this right?

          Yes. In the first cut we write to all the JDs that are active; at least a quorum should be written. This can be improved in the future by waiting for only a quorum of JDs.
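
          The write rule just described (write to all active JDs, require at least a quorum of acknowledgements before the edits are considered durable) could look roughly like the following sketch; all interfaces and names here are hypothetical.

          import java.util.List;
          import java.util.concurrent.*;

          public class QuorumEditWriter {
              public interface JournalDaemonClient {
                  void journal(long firstTxId, byte[] records) throws Exception;
              }

              private final List<JournalDaemonClient> activeJds;
              private final ExecutorService executor = Executors.newCachedThreadPool();

              public QuorumEditWriter(List<JournalDaemonClient> activeJds) {
                  this.activeJds = activeJds;
              }

              // Returns true once at least a majority of the active JDs have acked the write.
              public boolean write(long firstTxId, byte[] records) throws InterruptedException {
                  int quorum = activeJds.size() / 2 + 1;
                  CompletionService<Boolean> acks = new ExecutorCompletionService<>(executor);
                  for (JournalDaemonClient jd : activeJds) {
                      acks.submit(() -> { jd.journal(firstTxId, records); return true; });
                  }
                  int ok = 0;
                  for (int i = 0; i < activeJds.size() && ok < quorum; i++) {
                      try {
                          if (acks.take().get()) {
                              ok++;
                          }
                      } catch (ExecutionException failed) {
                          // a JD failed; in the design above it would be resynced or
                          // dropped from the JournalList rather than silently ignored
                      }
                  }
                  return ok >= quorum;
              }
          }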

          Sanjay Radia added a comment -

          Is there a way to turn off the striping even if Quorum size (Q) is less than Ensemble size (E)?
          We like the idea that each Journal file contains ALL entries.
          Our default config: Q is 2 and the set of JDs is 3 (roughly equivalent to E).

          Suresh Srinivas added a comment -

          Updated design with additional use cases.

          Flavio Junqueira added a comment -

          Is there a way to turn off the striping even if Quorum size (Q) is less than Ensemble size (E)?

          We like the idea that each Journal file contains ALL entries.
          Our default config: Q is 2 and the set of JDs is 3 (roughly equivalent to E).

          Right now the write set is the same as the ack set, since we haven't had the use case you suggest so far. We have anticipated the need for different ways of scheduling reads and writes, and it is fairly simple to make it cover your use case. Ivan had actually suggested this separation between ack sets and write sets previously, to deal with slow disks.

          I have tried to reflect this discussion in BOOKKEEPER-208 if you're interested.

          Tsz Wo Nicholas Sze added a comment -

          Added a section for journal service states to the design doc.

          Kihwal Lee added a comment -

          I imagine a newly brought up or recovering journal daemon will get synced when the edit log is rolled. But is it going to be tied to checkpointing?

          Tsz Wo Nicholas Sze added a comment -

          Hi Kihwal,

          If the NN already has enough JDs to write to, a new JD will be synced asynchronously - it will start syncing and accepting journal at the same time. If the NN does not have enough JDs, the NN will be blocked until it has enough synced JDs.
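
          A toy sketch of that policy, with illustrative names only: the NN tracks how many JDs are synced, new JDs catch up in the background when enough are already synced, and writers block otherwise.

          public class JournalSetPolicy {
              private final int required;       // e.g. a quorum of the configured JDs
              private int syncedJds;

              public JournalSetPolicy(int required, int alreadySynced) {
                  this.required = required;
                  this.syncedJds = alreadySynced;
              }

              // A newly added JD may catch up in the background while accepting journal
              // only if the NN already has enough synced JDs to write to.
              public synchronized boolean canSyncAsynchronously() {
                  return syncedJds >= required;
              }

              // Called once a JD has caught up with the active NN.
              public synchronized void markSynced() {
                  syncedJds++;
                  notifyAll();
              }

              // The NN blocks here until it has enough synced JDs.
              public synchronized void awaitEnoughSyncedJds() throws InterruptedException {
                  while (syncedJds < required) {
                      wait();
                  }
              }
          }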

          Bikas Saha added a comment -

          The combination of this and HDFS-3077 sounds very much like ZAB, with one difference: the use of 2-phase commit for the write broadcast.

          Let's say there is one active NN writing to a quorum set of journal daemons. This is the same as ZAB: the active NN writes edits and the ZAB leader writes new states.

          ZAB uses a 2-phase commit (without abort) for each write, while our design gets away without it. I am wondering why we can get away with it.

          My guess is that each follower in ZAB can also serve reads from clients. Hence, it cannot serve an update until it is guaranteed that a quorum of followers has agreed on that update. That is what the 2-phase commit gives.
          In our case, the active NN is the only server for client reads. Hence, updates are not served to clients until a quorum acks back.

          However, the above would break for us if the standby NN is using any journal daemon to refresh its state. Because, ideally, a journal node should not inform the standby about an update until it knows that the update has been accepted by a quorum of journal daemons. That would require a 2-phase commit.
          E.g. standby NN3 reads the last edit written to JN1 by old active NN1, before NN1 realizes that it has lost the quorum to NN2 (by failing to write to JN2 and JN3).

          Perhaps we can get away with this by using some assumptions on timeouts, or by additional constraints on the standby, e.g. that it only syncs with finalized edit segments.

          If we say that the standby syncs with only the finalized log segments in order to be safe from the above, then IMO the tailing of the edits should not be done by the standby directly but via a journal daemon API for the standby. This JD API would ensure that only valid edits are sent to the standby (edits from finalized segments, or edits known to be safely committed to a quorum of journal daemons). This way the correctness of the journal protocol would remain inside it, instead of leaking into the standby by having the standby code remember rules for tailing edits.
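
          The JD-side read API suggested here could be as simple as gating reads by the highest txid the daemon knows to be safe (from finalized segments or quorum-committed edits). A toy sketch, with hypothetical names and an in-memory store standing in for the real edit log:

          import java.util.Arrays;

          public class CommittedEditsServer {
              private volatile long committedTxId;      // highest txid known durable on a quorum
              private volatile byte[][] editsByTxId = new byte[0][];   // toy store, index = txid - 1

              // Called when the JD learns that everything up to txId is safely committed.
              public void advanceCommitted(long txId, byte[][] allEditsSoFar) {
                  this.editsByTxId = allEditsSoFar;
                  this.committedTxId = txId;
              }

              // Returns edits in [fromTxId, committedTxId]; uncommitted edits are never exposed
              // to the standby, so no tailing rules leak into the standby code.
              public byte[][] readCommitted(long fromTxId) {
                  long upTo = committedTxId;
                  if (fromTxId > upTo) {
                      return new byte[0][];
                  }
                  return Arrays.copyOfRange(editsByTxId, (int) (fromTxId - 1), (int) upTo);
              }
          }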

          Todd Lipcon added a comment -

          Perhaps we can get away with this by using some assumptions on timeouts, or by additional constraints on the standby, e.g. that it only syncs with finalized edit segments.

          That's my plan in HDFS-3077, and in fact that's the current behavior of the SBN, even when operating on NFS.

          Bikas Saha added a comment -

          Todd, I was going to post this same comment on HDFS-3077 because I did not see 2-phase commit mentioned there either, even though the initial comments mentioned implementing ZAB. The doc did mention standby node tailing as a requirement, but it wasn't clear how this is being achieved in the absence of a coordinated distributed commit.

          That's my plan in HDFS-3077, and in fact that's the current behavior of the SBN, even when operating on NFS.

          Hari updated me offline that standby NN currently tails only the finalized edits. So this works fine. By 'my plan' are you referring to an API on the journal node to read latest edits that replaces the current standby NN tailing code?

          Todd Lipcon added a comment -

          By 'my plan' are you referring to an API on the journal node to read latest edits that replaces the current standby NN tailing code?

          Yep - well, not replaces, but rather just implements the correct APIs in JournalManager. We already have "read side" APIs there to get an input stream starting at a given txid. We just need implementations that do the "remote" reads.
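
          Very roughly, the missing "remote" read side could look like the sketch below: given a starting txid, open a stream of edits served over HTTP by the journal node (compare the GetJournalEditServlet sub-task above). The interface and URL here are simplifications and assumptions, not the actual JournalManager API or servlet parameters.

          import java.io.IOException;
          import java.io.InputStream;
          import java.net.URL;

          public class RemoteJournalReader {
              private final String journalNodeHttpAddress;   // e.g. "http://jn1:8480" (illustrative)

              public RemoteJournalReader(String journalNodeHttpAddress) {
                  this.journalNodeHttpAddress = journalNodeHttpAddress;
              }

              // Opens a stream of serialized edits starting at the given transaction id;
              // a servlet on the journal node would serve the finalized segment(s) containing it.
              public InputStream openEditsFrom(long startTxId) throws IOException {
                  URL url = new URL(journalNodeHttpAddress + "/getJournal?startTxId=" + startTxId);
                  return url.openStream();
              }
          }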

          dhruba borthakur added a comment -

          I still somehow like the BK approach, especially because it can store transaction logs from multiple namenodes; this is something that is very much needed for HDFS federation. If we can get BK to scale and be performant, it is a possibility that HBase can store transaction logs in there too, especially if the latencies are much smaller (and disk-failure recoveries faster) than a typical HDFS write pipeline.

          Bikas Saha added a comment -

          Sanjay and I were trying to objectively compare different approaches (HDFS-3077, HDFS-3092, BookKeeper). I have attached a document outlining the observations.
          Hopefully, this will help in structuring the discussions going forward.

          Todd Lipcon added a comment -

          Can you clarify a few things in this document?

          • In ParallelWritesWithBarrier, what happens to the journals which time out or fail? It seems you need to mark them as failed in ZK or something in order to be correct. But if you do that, why do you need Q to be a "quorum"? Q=1 should suffice for correctness, and Q=2 should suffice in order to always be available to recover.

          It seems the protocol should be closer to:
          1) send out write request to all active JNs
          2) wait until all respond, or a configurable timeout
          3) any that do not respond are marked as failed in ZK
          4) If the remaining number of JNs is sufficient (I'd guess 2) then succeed the write. Otherwise fail the write and abort.
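
          A rough sketch of the barrier-style protocol outlined in the four steps above: send the write to all active JNs in parallel, wait up to a timeout, mark non-responders as failed (in ZK, in the real design), and succeed only if enough JNs remain. The FailureRegistry interface below stands in for the ZK update and is hypothetical, as are all other names.

          import java.util.ArrayList;
          import java.util.List;
          import java.util.concurrent.*;

          public class BarrierWriter {
              public interface JournalNodeClient {
                  String name();
                  void journal(long firstTxId, byte[] records) throws Exception;
              }
              public interface FailureRegistry {           // would be backed by ZooKeeper
                  void markFailed(String journalNodeName);
              }

              private final List<JournalNodeClient> activeJns;
              private final FailureRegistry registry;
              private final long timeoutMs;
              private final int minRemaining;              // e.g. 2, as suggested above
              private final ExecutorService executor = Executors.newCachedThreadPool();

              public BarrierWriter(List<JournalNodeClient> jns, FailureRegistry registry,
                                   long timeoutMs, int minRemaining) {
                  this.activeJns = jns;
                  this.registry = registry;
                  this.timeoutMs = timeoutMs;
                  this.minRemaining = minRemaining;
              }

              public boolean write(long firstTxId, byte[] records) throws InterruptedException {
                  // Step 1: send the write to every active JN in parallel.
                  List<Future<?>> pending = new ArrayList<>();
                  for (JournalNodeClient jn : activeJns) {
                      pending.add(executor.submit(() -> { jn.journal(firstTxId, records); return null; }));
                  }
                  int remaining = 0;
                  long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
                  for (int i = 0; i < activeJns.size(); i++) {
                      try {
                          long left = Math.max(0, deadline - System.nanoTime());
                          pending.get(i).get(left, TimeUnit.NANOSECONDS);   // step 2: wait or time out
                          remaining++;
                      } catch (ExecutionException | TimeoutException e) {
                          registry.markFailed(activeJns.get(i).name());     // step 3: record the failure
                      }
                  }
                  return remaining >= minRemaining;                          // step 4: enough JNs left?
              }
          }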

          The recovery protocol here is also a little tricky. I haven't seen a description of the specifics - there are a number of cases to handle - e.g. even if a write appears to fail from the perspective of the writer, it may have actually succeeded. Another situation: what happens if the writer crashes between step 2 and step 3 (so the JNs have differing numbers of txns, but ZK indicates they're all up to date)?

          Regarding quorum commits:

          b. The journal set is fixed in the config. Hard to add/replace hardware.

          There are protocols that could be used to change the quorum size/membership at runtime. They do add complexity, though, so I think they should be seen as a future improvement - but not be discounted as impossible.
          Another point is that hardware replacement can easily be treated the same as a full crash and loss of disk. If one node completely crashes, a new node could be brought in with the same hostname with no complicated protocols.
          Adding or removing nodes shouldn't be hard to support during a downtime window, which I think satisfies most use cases pretty well.

          Regarding bookkeeper:

          • other operational concerns aren't mentioned: e.g. it doesn't use Hadoop metrics, doesn't use the same style of configuration files, daemon scripts, etc.
          Flavio Junqueira added a comment -

          Thanks for posting this comparison, Bikas. Let me try to address the last two points on bookkeeper:

          Tools for recovery - have a bookie recover tool. others??

          That's correct, we have a bookie recovery tool that reconstructs the ledger fragments of a dead bookie. This has been part of bookkeeper for a while. We have some other tools proposed in BOOKKEEPER-183 to read and check bookie files, but they are not checked in yet. We also have some other tools we want to develop for more extreme failure scenarios. We are targeting release 4.2.0 for them (a draft of our feature roadmap is here: https://cwiki.apache.org/confluence/display/BOOKKEEPER/Roadmap).

          Release frequency, committers, projects that use it??

          We started planning for releases every 6 months, but we have been thinking about releasing more frequently, every 3 months.

          We are currently 6 committers, but only 3 have been really active. Four of us are from Yahoo!, one from Twitter, and one from Facebook. Given that it is still a young project, I don't see why other hdfs folks cannot become committers of bookkeeper if they contribute and there is interest. It would actually be quite natural in the case that bookkeeper ends up being used with the namenode. For us, having committers from the hdfs community would be useful to make sure we don't miss important requirements of yours.

          As for projects using it, we have applications that incorporated bookkeeper (and hedwig) inside Yahoo! recently, and we have people from other companies on the mailing list discussing their setups and asking questions. If you're on the list, you have possibly seen those.

          Bikas Saha added a comment -

          @Todd
          The definition in the doc for ParallelWritesWithBarrier is deliberately shallow. The point was just to differentiate between waiting and not waiting. The doc does not go into specifics of algorithms, so your feedback on different issues should be directed to the proposal you are commenting on. On future improvements - again, the doc is meant to be a comparison of the proposals as we saw them in the design docs submitted to the jiras and the bookkeeper online references.
          Basically, going by existing documentation of proposals, the doc tries to outline the high level salient points to consider.

          @Flavio
          Thanks for the roadmap pointer.

          dhruba borthakur added a comment -

          I am trying to digest the meat of this approach, but one question I do not have an answer for: is it possible for the journal daemon to write data to disks and nodes that do not share load with other, non-journal writers? I feel this requirement will be critical to ensuring low variance in write latencies for the journal. My experience is that a 5% increase in the latency of writes to the transaction log causes a 20% degradation in namenode throughput on a large cluster.

          Bikas Saha added a comment -

          From what I understand, the approach is to dedicate a disk per journal daemon. That would be easy when running JDs on the NN machines. For the third JD, one could use a disk on the JobTracker/ResourceManager machine.

          Tsz Wo Nicholas Sze added a comment -

          JNStates.png: Attached the state diagram separately.

          dhruba borthakur added a comment -

          We might continue to invest in storing namenode transaction logs in BK, simply because BK is a general-purpose logging service that non-Hadoop software services could also use in the future.

          Todd Lipcon added a comment -

          A few of us are meeting up today at 3:30pm PST to discuss this JIRA. If you'd like to join by phone, please ping me by email. Sorry for the late notice; this was planned impromptu.

          Eli Collins added a comment -

          Notes from the meeting:

          • Discussed the two approaches for the client side (HDFS-3077 and HDFS-3092); the server sides are similar modulo small differences in recovery. Discussed the tradeoffs between tolerating all but one server going down versus requiring a quorum, and latency sensitivity, in the context of typical cluster configurations. We think it's possible to wed the client side of 3077 (quorum journal) to the server side of 3092 (journal daemon); this will be pursued further on the jira.
          • Discussed journal-based fencing: the current NN fencers are not needed when the journal itself handles fencing (a rough sketch of this idea follows these notes).
          • Discussed IP-based failover and STONITH; the primary motivation for the current approach is that other master services that are not yet HA often run on the same machines, where STONITH doesn't work.
          • Discussed making the state machine for automatic failover more explicit.
          • Discussed separate versus embedded FCs; separate works well for now, though it currently means we'll need an FC per service that gets failover (versus embedding an FC in each service that needs failover).
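
          As a rough illustration of the journal-based fencing point above, here is a minimal sketch, in Java, of epoch-based fencing in a journal daemon, in the spirit of the epoch/fencing subtasks on this jira. The class and method names (JournalFencingSketch, newEpoch, recordJournal) are hypothetical and are not the actual JournalProtocol API.

          // Minimal sketch of epoch-based fencing in a journal daemon.
          // Illustrative only; not the actual JournalProtocol implementation.
          public class JournalFencingSketch {
            // Highest epoch this journal daemon has promised to accept writes from.
            private long promisedEpoch = 0;

            // Called when a (possibly new) writer asks to become the active writer.
            // Any writer holding a smaller epoch is fenced off from this point on.
            public synchronized long newEpoch(long proposedEpoch) {
              if (proposedEpoch <= promisedEpoch) {
                throw new IllegalStateException(
                    "Epoch " + proposedEpoch + " is not newer than " + promisedEpoch);
              }
              promisedEpoch = proposedEpoch; // a real daemon would persist this to disk
              return promisedEpoch;
            }

            // Called for every journal write; a stale (fenced) writer is rejected here,
            // which is why no separate NN-level fencer is required.
            public synchronized void recordJournal(long writerEpoch, byte[] records) {
              if (writerEpoch < promisedEpoch) {
                throw new IllegalStateException(
                    "Writer with stale epoch " + writerEpoch + " is fenced");
              }
              // append records to the local edit log ...
            }
          }

          Because every write is checked against the promised epoch, a writer that has been superseded is rejected by the journal itself rather than by an external fencing script.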
          Eli Collins added a comment -

          Attaching some thoughts and a framework for comparing the approaches that I wrote last month.

          Nathan Roberts added a comment -

          Eli, thanks for the writeup.

          One question on the statement "(ie option 1 is fundamentally slower)": we already double-buffer the edit streams in the namenode, so isn't it true that users will only see latency effects if the buffering isn't able to keep up? In other words, isn't it the case that as long as the slowest journal keeps up with demand, there's no significant difference in performance?
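
          For context, a minimal sketch of the double-buffering idea referred to above, assuming a simple two-buffer scheme; this is illustrative only and is not the actual FSEditLog/EditsDoubleBuffer code.

          import java.io.ByteArrayOutputStream;
          import java.io.IOException;
          import java.io.OutputStream;

          // Sketch of double-buffered edit writing: new transactions go into bufCurrent
          // while a previously filled buffer is flushed, so callers only block on journal
          // latency when the buffers cannot keep up. Concurrency details are omitted.
          public class DoubleBufferSketch {
            private ByteArrayOutputStream bufCurrent = new ByteArrayOutputStream();
            private ByteArrayOutputStream bufReady = new ByteArrayOutputStream();

            // Writers append transactions here without waiting for the journal.
            public synchronized void writeTxn(byte[] txn) {
              bufCurrent.write(txn, 0, txn.length);
            }

            // Swap buffers so the filled one can be flushed while new writes continue.
            public synchronized void setReadyToFlush() {
              ByteArrayOutputStream tmp = bufReady;
              bufReady = bufCurrent;
              bufCurrent = tmp;
            }

            // Flush the ready buffer to the (possibly slow) journal; only this step
            // observes the journal's write latency.
            public void flushTo(OutputStream journal) throws IOException {
              bufReady.writeTo(journal);
              bufReady.reset();
            }
          }

          As long as the flush keeps pace with incoming transactions, writers only pay for an in-memory copy; the journal's sync latency becomes visible to clients only when the ready buffer cannot be drained fast enough.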

          Eli Collins added a comment -

          Specifically, in HDFS-3092 we "write in parallel to all active/syncing journals. Writes must complete on all active journals or fail/timeout." Therefore the latency of a transaction is that of the slowest JD. Note that the write must complete; a JD can't ack a write that's not on disk. In 3077 the latency is that of the slowest JD in the quorum (i.e., if you have 5 JDs, the latency is that of the 3rd-fastest JD).
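
          To make the latency difference concrete, here is a small hypothetical calculation (not HDFS code) comparing the write-to-all scheme described for HDFS-3092 with the majority-quorum scheme of HDFS-3077, given made-up per-JD sync times.

          import java.util.Arrays;

          // Illustrative comparison of per-transaction commit latency under the two schemes.
          public class JournalLatencySketch {
            // HDFS-3092 style: the write must complete on every active journal,
            // so commit latency is that of the slowest JD.
            static long allJournalsLatency(long[] jdSyncMillis) {
              return Arrays.stream(jdSyncMillis).max().getAsLong();
            }

            // HDFS-3077 style: only a majority must ack, so with n = 2f+1 JDs the
            // commit latency is the (f+1)-th fastest sync (index n/2 after sorting).
            static long quorumLatency(long[] jdSyncMillis) {
              long[] sorted = jdSyncMillis.clone();
              Arrays.sort(sorted);
              return sorted[jdSyncMillis.length / 2];
            }

            public static void main(String[] args) {
              long[] syncMillis = {2, 3, 4, 5, 40}; // one slow JD out of five (made-up numbers)
              System.out.println("all journals: " + allJournalsLatency(syncMillis)); // 40
              System.out.println("quorum:       " + quorumLatency(syncMillis));      // 4
            }
          }

          With one straggling JD out of five, the all-journals latency is dominated by the straggler (40 ms in this made-up example), while the quorum latency is that of the 3rd-fastest JD (4 ms).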

          Todd Lipcon added a comment -

          Hey folks. Is anyone still working on this, given that HDFS-3077 has been committed for a year or so and is in use at lots of sites? Maybe we should close it as Won't Fix, given that a lot of the original ideas were incorporated into 3077?

          Suresh Srinivas added a comment -

          Let's close this. If any work is needed, we can open another jira.


            People

            • Assignee: Suresh Srinivas
            • Reporter: Suresh Srinivas
            • Votes: 0
            • Watchers: 37
