Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

      Description

      There are many instances when the same piece of data resides on multiple HDFS clusters in different data centers, primarily because the physical capacity of a single data center is insufficient to host the entire data set. In that case, the administrators typically partition the data across two (or more) HDFS clusters in two different data centers and then duplicate some subset of that data into both clusters.

      In such a situation, there will be six physical copies of the duplicated data: three copies in one data center and another three in the other. It would be nice if we could keep fewer than 3 replicas in each data center and have the ability to repair a replica in the local data center by copying data from the remote data center. For example, keeping 2 replicas per data center would reduce the duplicated data from 6 physical copies to 4, a 33% saving.

        Issue Links

          Activity

          Jeff Hammerbacher added a comment -

          > BTW, what is a BCP cluster?

          http://en.wikipedia.org/wiki/Business_continuity_planning
          Sriram Rao added a comment -

          > Performance: Map-reduce jobs could have a performance impact if the number of replicas is reduced from 3 to 2. So, the tradeoff is reducing the total amount of storage while possibly increasing job latencies.

          With 2 copies in 2 racks, you are still preserving rack locality. That may be sufficient.

          dhruba borthakur added a comment -

          > Or, you want the same data-set in multiple data-centres for BCP clusters, but do not want to store 6 copies

          Yes, that sounds right. BTW, what is a BCP cluster?

          Joydeep Sen Sarma added a comment -

          Who manages the replication in both data centers? It seems simpler to put the logic for asserting equivalence at the same layer that manages this replication (at the time of replicating the data, the replication agent knows that the two sides are now equivalent).

          The follow-up question then is: why doesn't the HighTideNode itself take care of the cross-data-center file replication?

          The other problem with choosing one of the two file systems and then passing that down to the client is that the client cannot recover from bad/missing block problems that have yet to be healed by the system. It seems better to have a HighTide client-side file system layer that can switch to the surviving copy in the other data center (just like HDFS-RAID).
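
          A minimal sketch of what such a client-side layer could look like, assuming the local and remote clusters are reachable as ordinary Hadoop FileSystem instances (the class name HighTideClientFs and the failover-on-open behaviour are illustrative, not part of any existing HighTide code; a real layer would also have to fail over in the middle of a read):

          import java.io.IOException;
          import java.net.URI;
          import org.apache.hadoop.conf.Configuration;
          import org.apache.hadoop.fs.FSDataInputStream;
          import org.apache.hadoop.fs.FileSystem;
          import org.apache.hadoop.fs.Path;

          // Hypothetical client-side wrapper: read from the local cluster and
          // fall back to the equivalent path on the remote cluster when the
          // local copy cannot be opened.
          public class HighTideClientFs {
            private final FileSystem localFs;
            private final FileSystem remoteFs;

            public HighTideClientFs(URI localUri, URI remoteUri, Configuration conf)
                throws IOException {
              this.localFs = FileSystem.get(localUri, conf);
              this.remoteFs = FileSystem.get(remoteUri, conf);
            }

            public FSDataInputStream open(Path path) throws IOException {
              try {
                return localFs.open(path);
              } catch (IOException localFailure) {
                // The local copy is unreadable; try the same path in the
                // remote data center.
                return remoteFs.open(path);
              }
            }
          }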

          In short, I think things seem more consistent if HighTide (node/client) takes care of discovery/replication/recovery of data transparently across the two data centers (with some policy on which parts of the namespace need replication to which data centers).

          In certain cases we do not want complete location transparency. For example, a client may want to execute a job only when all required data sets have been replicated to the data center where the job will run. That can be done easily by having new APIs (provided by the HighTideNode) that report the available data centers for a given part of the namespace; the caller can then wait for availability at the desired data center.
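
          For example, such an API (purely hypothetical names, something the HighTideNode could expose) might look like:

          import java.io.IOException;
          import java.util.Set;
          import org.apache.hadoop.fs.Path;

          // Hypothetical HighTideNode client API for data-center awareness.
          public interface HighTideAvailability {
            // Data centers that currently hold a complete, up-to-date copy of
            // everything under this part of the namespace.
            Set<String> getAvailableDataCenters(Path namespaceRoot) throws IOException;

            // Block until the given data center has a complete copy or the timeout
            // expires; returns true if the data became available in time.
            boolean waitForAvailability(Path namespaceRoot, String dataCenter,
                                        long timeoutMillis) throws IOException;
          }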

          The other aspect that is not addressed here is concurrent modification (regardless of who manages the replication, there is replication lag and we have to account for it). I will assume for now that HighTide manages replication and gives the appearance of a single file system. A CRC checksum can distinguish a modified file tree from the original, but it cannot figure out which of the two is the later one.

          At least for our use case, a simple 'latest wins' policy is good enough:

          • The client application generates a file id by means external to HighTide.
          • It creates files/directories stamped with this id.
          • When HighTide replicates data to a remote data center, before overwriting the existing data (if any) it checks whether the incoming id is bigger than the existing one, and only then performs the overwrite.

          For us, an id encoding the timestamp will be sufficient (since we have a central entity doling out timestamps). But in general we would want to keep this abstract (more generally, at least a <Data-Center, time> tuple would be required).
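
          A minimal sketch of that comparison, assuming ids are <data-center, time> tuples with externally issued timestamps (the class and method names are illustrative):

          import java.util.Comparator;

          // Illustrative replication id: a <data-center, time> tuple, ordered by
          // timestamp first, with the data-center name breaking ties deterministically.
          public final class ReplicationId implements Comparable<ReplicationId> {
            final String dataCenter;
            final long timestamp;   // issued by an external central entity

            public ReplicationId(String dataCenter, long timestamp) {
              this.dataCenter = dataCenter;
              this.timestamp = timestamp;
            }

            @Override
            public int compareTo(ReplicationId other) {
              return Comparator.comparingLong((ReplicationId id) -> id.timestamp)
                  .thenComparing(id -> id.dataCenter)
                  .compare(this, other);
            }

            /** 'Latest wins': overwrite only if the incoming id is strictly newer. */
            public static boolean shouldOverwrite(ReplicationId existing, ReplicationId incoming) {
              return incoming.compareTo(existing) > 0;
            }
          }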

          Another option that may be feasible in certain scenarios is that HighTide prevents concurrent updates to a part of the namespace that is required to be replicated but has not been replicated so far. However, this makes the system very tightly coupled and will not have good availability characteristics.

          dhruba borthakur added a comment - edited

          Goal: The goal of the HighTideNode is to keep only one physical replica per data center. This is mostly for older files that change very infrequently. The HighTide server watches over the two HDFS namespaces from two different NameNodes in two different data centers. These two equivalent namespaces will be populated via means that are external to HighTide. The HighTide server verifies (via checksums of the crc files) that two directories in the two HDFS clusters contain identical data, and if so, reduces the replication factor to 2 on both clusters. (One or both clusters could be using HDFS-RAID too.) The HighTideNode monitors for missing replicas on both NameNodes, and if it finds any, it fixes them by copying data from the other NameNode in the remote data center.
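
          A rough sketch of the per-file equivalence check and replication reduction, assuming both clusters are reachable as Hadoop FileSystem instances (getFileChecksum and setReplication are existing FileSystem calls; note that HDFS file checksums are only comparable when block size and bytes-per-checksum match on both clusters):

          import java.io.IOException;
          import org.apache.hadoop.fs.FileChecksum;
          import org.apache.hadoop.fs.FileSystem;
          import org.apache.hadoop.fs.Path;

          // Sketch: if the same file has a matching checksum in both clusters,
          // drop each cluster's replication factor from 3 to 2.
          public class EquivalenceChecker {
            static void reduceIfEquivalent(FileSystem fs1, FileSystem fs2, Path file)
                throws IOException {
              FileChecksum c1 = fs1.getFileChecksum(file);
              FileChecksum c2 = fs2.getFileChecksum(file);
              if (c1 != null && c1.equals(c2)) {
                fs1.setReplication(file, (short) 2);
                fs2.setReplication(file, (short) 2);
              }
            }
          }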

          In short, replication within an HDFS cluster will occur via the NameNode as usual. Each NameNode will maintain fewer than 3 copies of the data. Replication across HDFS clusters will be coordinated by the HighTideNode, which invokes the -list-corruptFiles RPC on each NameNode periodically (every minute) to detect missing replicas.

          DataNodeGateway: I envision a single HighTideNode coordinating replication between multiple HDFS clusters. An alternative would be a gateway approach: a specialized DataNode that exports the DataNode protocol and appears to an HDFS cluster as one very large DataNode, but instead of storing blocks on local disks, the GatewayDataNode would store data in a remote HDFS cluster. This is similar to existing NFS gateways, e.g. NFS-CIFS interaction. The downside is that this design is more complex and intrusive to HDFS, rather than being a layer on top of it.

          Mean-Time-To-Recover (MTR): Will this approach of having remote replicas increase the probability of data loss? My claim is that we should try to keep the MTR practically the same as it is today. If all the replicas of a block on HDFS1 go missing, then the HighTideNode will first increase the replication factor of the equivalent file in HDFS2. This ensures that we get back to 3 overall copies as soon as possible, keeping the MTR the same as it is now. Then the HighTideNode will copy this block from HDFS2 to HDFS1 and wait for HDFS1 to attain a replica count of 2 before decreasing the replica count on HDFS2 from 3 back to 2.
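
          A sketch of that repair sequence for a single file, assuming plain FileSystem handles for the damaged and healthy clusters and using FileUtil.copy as a stand-in for whatever block-level copy mechanism the HighTideNode would actually employ:

          import java.io.IOException;
          import org.apache.hadoop.conf.Configuration;
          import org.apache.hadoop.fs.BlockLocation;
          import org.apache.hadoop.fs.FileStatus;
          import org.apache.hadoop.fs.FileSystem;
          import org.apache.hadoop.fs.FileUtil;
          import org.apache.hadoop.fs.Path;

          // Sketch of the cross-cluster repair sequence described above.
          public class CrossClusterRepair {
            static void repair(FileSystem damagedFs, FileSystem healthyFs, Path file,
                               Configuration conf) throws IOException, InterruptedException {
              // 1. Get back to 3 overall copies quickly: bump the healthy cluster to 3.
              healthyFs.setReplication(file, (short) 3);

              // 2. Re-copy the file into the damaged cluster from the healthy one.
              //    (A real HighTideNode would presumably repair individual blocks.)
              FileUtil.copy(healthyFs, file, damagedFs, file, false, true, conf);

              // 3. Wait until every block in the damaged cluster has at least 2 replicas.
              while (minReplicaCount(damagedFs, file) < 2) {
                Thread.sleep(10000L);
              }

              // 4. Drop the healthy cluster back down to 2 replicas.
              healthyFs.setReplication(file, (short) 2);
            }

            private static int minReplicaCount(FileSystem fs, Path file) throws IOException {
              FileStatus status = fs.getFileStatus(file);
              BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
              int min = Integer.MAX_VALUE;
              for (BlockLocation block : blocks) {
                min = Math.min(min, block.getHosts().length);
              }
              // An empty file has no blocks to repair.
              return blocks.length == 0 ? Integer.MAX_VALUE : min;
            }
          }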

          HDFS-RAID: HighTide can co-exist with HDFS-RAID. HDFS-RAID allows us to keep fewer physical copies of data + parity. The MTR from RAID is smaller compared to HighTide, but the savings from HighTide are greater because the savings percentage does not depend on the RAID stripe size or on file lengths. One can use RAID to achieve a replication factor of 1.2 in each HDFS cluster and then use HighTide to have an additional 1.2 replicas on the remote HDFS cluster, i.e. about 2.4 physical copies in total instead of 6.

          Performance: Map-reduce jobs could see a performance impact if the number of replicas is reduced from 3 to 2. So the tradeoff is reducing the total amount of storage while possibly increasing job latencies.

          HBase: With the current HBase design it is difficult to use HighTide to replicate across data centers. This is something that we need to delve more into.

          Arun C Murthy added a comment -

          > There are many instances when the same piece of data resides on multiple HDFS clusters in different data centers. The primary reason being that the physical limitation of one data center is insufficient to host the entire data set.

          Or, you want the same data-set in multiple data-centres for BCP clusters, but do not want to store 6 copies...


            People

            • Assignee: dhruba borthakur
            • Reporter: dhruba borthakur
            • Votes: 1
            • Watchers: 54
