Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.9.0, 3.0.0-alpha1, 2.8.2
    • Component/s: None
    • Labels: None

      Description

      The current HDFS DN rolling upgrade procedure requires sequential DN restarts to minimize the impact on data availability and read/write operations. The side effect is a longer upgrade duration for large clusters. This might be acceptable for a quick DN JVM restart to pick up new hadoop code or configuration. However, for an OS upgrade that requires a machine reboot, the overall upgrade duration will be too long if we continue to do sequential DN rolling restarts.

      1. HDFS-7541.patch
        67 kB
        Ming Ma
      2. HDFS-7541-2.patch
        89 kB
        Ming Ma
      3. SupportforfastHDFSdatanoderollingupgrade.pdf
        125 kB
        Ming Ma
      4. UpgradeDomains_design_v2.pdf
        288 kB
        Ming Ma
      5. UpgradeDomains_Design_v3.pdf
        242 kB
        Ming Ma

        Issue Links

        There are no Sub-Tasks for this issue.

          Activity

          mingma Ming Ma added a comment -

          We (Chris Trezzo, John Meagher, Lohit Vijayarenu, Siqi Li, Kihwal Lee and others) discussed ways to address this. Attached is the initial high-level design document.

          • Upgrade domain support. HDFS-3566 outlines the idea, but it isn't applicable to hadoop 2 and it uses the network topology to store the upgrade domain definition. We can make the load balancer more extensible to support different policies. (A rough placement check is sketched after this list.)
          • Have NN support for a new "maintenance" datanode state. In this state, the DN won't process read/write requests, but its replicas remain in the BlocksMap and are still considered valid from a block replication point of view.
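          For illustration, here is a minimal sketch of the kind of check an upgrade-domain-aware placement policy would perform. The class, method, and the exact rule (replicas must span at least min(replication, upgradeDomainFactor) distinct domains) are assumptions for this sketch, not the actual patch.

          import java.util.Arrays;
          import java.util.HashSet;
          import java.util.List;
          import java.util.Set;

          // Illustrative sketch only, not the HDFS-7541 patch. Assumed rule: a block's
          // replicas should cover at least min(replication, upgradeDomainFactor) distinct
          // upgrade domains, so restarting every node in one domain still leaves live replicas.
          public class UpgradeDomainPlacementCheck {

            // Hypothetical minimal view of a datanode: host plus its assigned upgrade domain.
            static class Node {
              final String host;
              final String upgradeDomain;
              Node(String host, String upgradeDomain) {
                this.host = host;
                this.upgradeDomain = upgradeDomain;
              }
            }

            static boolean isPlacementSatisfied(List<Node> replicas, int replication,
                int upgradeDomainFactor) {
              Set<String> domains = new HashSet<>();
              for (Node n : replicas) {
                domains.add(n.upgradeDomain);
              }
              return domains.size() >= Math.min(replication, upgradeDomainFactor);
            }

            public static void main(String[] args) {
              List<Node> replicas = Arrays.asList(
                  new Node("dn1", "ud0"), new Node("dn2", "ud1"), new Node("dn3", "ud2"));
              // 3 replicas across 3 domains: taking one whole domain down leaves 2 live replicas.
              System.out.println(isPlacementSatisfied(replicas, 3, 3)); // true
            }
          }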

          Appreciate any input.

          mingma Ming Ma added a comment -

          Here are the updated design document and initial draft patch. Appreciate any input others might have on the design and the general approach.

          The maintenance state support is somewhat independent of the upgrade domain work, so the latest design doesn't include it; that will be discussed separately. We have tested the patch on a small cluster. Others are welcome to try it out.

          After we finalize the design, we can break the feature into subtasks.

          mingma Ming Ma added a comment -

          We have been running the upgrade domain policy on one of our large production clusters; here are the results.

          • No performance impact on write operations, specifically the addBlock RPC latency.
          • All blocks have been migrated to the upgrade domain policy.

          Here is the updated version of the patch. I'd appreciate any high-level comments on the design. If people are ok with the approach, I will open subtasks.

          During this work, we also found that the balancer has a hard-coded rack-based policy instead of leveraging the block placement policy, e.g. HDFS-1431. This is something we should follow up on so that the balancer doesn't need to be modified when we introduce a new block placement policy.
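          As a rough illustration of that follow-up idea (the names are hypothetical, not the current Balancer code), the balancer's move check could delegate to whatever placement policy is configured rather than encoding the rack rule itself:

          import java.util.ArrayList;
          import java.util.List;

          // Hypothetical sketch: instead of hard-coding "a move must not reduce the number
          // of racks", the balancer asks the configured placement policy whether the replica
          // set that would result from the move is still valid.
          public class BalancerMoveCheck {

            // Hypothetical minimal placement-policy hook.
            interface PlacementPolicy {
              boolean isPlacementValid(List<String> replicaHosts, int replication);
            }

            static boolean isMoveAllowed(PlacementPolicy policy, List<String> currentHosts,
                String source, String target, int replication) {
              List<String> proposed = new ArrayList<>(currentHosts);
              proposed.remove(source);   // the replica leaves the source datanode
              proposed.add(target);      // and lands on the target datanode
              return policy.isPlacementValid(proposed, replication);
            }
          }

          With such a hook, a rack-based policy, the upgrade domain policy, or any future policy would plug in without changes to the balancer itself.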

          eddyxu Lei (Eddy) Xu added a comment -

          Hi, Ming Ma

          Thanks a lot for working on this very useful feature. I had some similar thoughts for a while.

          The design is very reasonable. It preserves the perf/availability characteristics of the current block placement policy while accurately controlling the availability of replicas when multiple DNs are shut down.

          A few small questions:

          • How about calling it Availability Domain, similar to AWS's availability zones? I think this concept can be used in a broader way.
          • Is this upgrade domain on each DN a soft state or a hard state? Would choosing soft/hard state have some implications for admins? For example, can admins re-purpose DN machines and move them around?
          • What do you anticipate as a good strategy for choosing upgrade domains (UDs)? For instance, suppose we have 40 machines per rack and 100 racks. Should we choose 40 different UDs, with each rack having one machine in each, or 10 UDs with 4 machines per UD in each rack? Or should 50 racks use 20 of the UDs and the remaining racks the others? What are the pros/cons of having 3-5 UDs vs 40 UDs?
          • Regarding the performance impact, could you share the approximate scale: # of racks, # of different UDs, and # of concurrent writes?
          • In design v2.pdf, would you mind rephrasing the description of the "Replica delete operation"? It is a little difficult to understand.
          • The last one may not be relevant: would this design work well with erasure coding (HDFS-7285)?

          Looking forward to hearing more.

          mingma Ming Ma added a comment -

          Thanks Lei (Eddy) Xu! These are very good points. Here is the updated design doc that answers some of your questions in detail. Please find specific replies below.

          How about calling it Availability Domain

          Availability might be too general in this context. The service can become unavailable due to an unplanned event such as a TOR outage or planned maintenance such as a software upgrade; both impact availability. If we define "Availability Domain" as "if all machines in that domain aren't available, the service can still function", then the machines belonging to one rack could also be considered one availability domain.

          Is this upgrade domain on each DN a soft state or a hard state?

          It is a hard state, just like the node's network location. While admins will likely keep the upgrade domain unchanged during common operations, the design allows them to move machines around as long as the machines are properly decommissioned first; when they rejoin under different upgrade domains, the appropriate replicas will be removed. The updated design doc provides more details on this.

          What do you anticipate as a good strategy for choosing upgrade domains (UDs)?

          The updated design doc has more on this. The number of upgrade domains affects data loss risk, replica recovery time and rolling upgrade parallelism.
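          As a back-of-the-envelope illustration of the parallelism aspect only (using the hypothetical 100-rack, 40-hosts-per-rack cluster from the earlier question, not our production numbers):

          // Rough illustration, not from the design doc: more upgrade domains means more
          // restart waves, but fewer hosts down (and less re-replication exposure) per wave.
          public class UpgradeDomainTradeoff {
            public static void main(String[] args) {
              int racks = 100, hostsPerRack = 40;
              int totalHosts = racks * hostsPerRack;      // 4000 hosts
              for (int numUds : new int[] {4, 10, 40}) {
                int hostsPerWave = totalHosts / numUds;   // hosts rebooted concurrently
                System.out.printf("UDs=%d: %d waves, %d hosts down per wave%n",
                    numUds, numUds, hostsPerWave);
              }
              // Compare with a purely sequential rolling restart: 4000 waves of 1 host each.
            }
          }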

          Regarding the performance impact

          # of racks is on the order of 100, # of upgrade domains is in the ballpark of 40, and # of addBlock operations is around 1000 ops/sec at peak.

          In design v2.pdf, would you mind rephrasing the description of the "Replica delete operation"?

          The updated design adds more description.

          The last one may not be relevant: would this design work well with erasure coding (HDFS-7285)?

          A similar question was asked in HDFS-7613 about how we can reuse different block placement policies. Like you said, we can address this issue separately.

          mingma Ming Ma added a comment - edited

          Thanks everyone for the input and contribution. Special thanks to Lei (Eddy) Xu for the review and suggestion.

          kihwal Kihwal Lee added a comment -

          Ming Ma, since you guys have been using it in production for a while, wouldn't it make sense to bring this to 2.8?

          zhz Zhe Zhang added a comment -

          This feature would be very useful for our clusters. We'd love to see it in 2.8/2.7.

          mingma Ming Ma added a comment -

          Sure, I can backport HDFS-9005, HDFS-9016 and HDFS-9922 to 2.8. Which 2.8 release do we want, 2.8.1 or 2.8.2? Pushing the feature to 2.7 would require much more work though. Regarding production quality, yes, it has been pretty reliable. The only piece we don't use in production is HDFS-9005. We used the script-based configuration approach while the feature was developed and tested and haven't spent time changing the configuration mechanism.

          kihwal Kihwal Lee added a comment -

          Just found out I somehow clicked the evil "Assign to me" link 6 days ago. I didn't mean to do that.

          Ming Ma, fork & exec with a large NN is not desirable for us and using the new host file format doesn't seem attractive since a lot of automation is built around the old format. We could probably add something similar to the network topology java plugin.
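          For the sake of discussion, a minimal sketch of what such a plugin could look like, modeled loosely on the topology resolver idea; the interface and class below are made up for illustration and are not an existing HDFS API:

          import java.util.HashMap;
          import java.util.List;
          import java.util.Map;

          // Hypothetical plugin-style resolver: the NN asks an in-process resolver for each
          // host's upgrade domain instead of forking a script or switching host file formats.
          interface UpgradeDomainResolver {
            // Returns a map from host name to its upgrade domain for the given hosts.
            Map<String, String> resolve(List<String> hosts);
          }

          // Example implementation that derives the domain from a host's position within the
          // supplied list, e.g. its slot within the rack (purely illustrative).
          class PositionBasedResolver implements UpgradeDomainResolver {
            private final int numUpgradeDomains;

            PositionBasedResolver(int numUpgradeDomains) {
              this.numUpgradeDomains = numUpgradeDomains;
            }

            @Override
            public Map<String, String> resolve(List<String> hosts) {
              Map<String, String> result = new HashMap<>();
              for (int i = 0; i < hosts.size(); i++) {
                result.put(hosts.get(i), "ud" + (i % numUpgradeDomains));
              }
              return result;
            }
          }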

          I think 2.8.1 is for immediate stabilization and fixes after 2.8.0, so 2.8.2 will be a more reasonable target.

          mingma Ming Ma added a comment -

          Sounds good. HDFS-9005, HDFS-9016 and HDFS-9922 have been committed to 2.8.2.


            People

            • Assignee: mingma Ming Ma
            • Reporter: mingma Ming Ma
            • Votes: 0
            • Watchers: 38

              Dates

              • Created:
                Updated:
                Resolved:
