Hadoop Common
HADOOP-124

don't permit two datanodes to run from same dfs.data.dir

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 0.2.0
    • Fix Version/s: 0.3.0
    • Component/s: None
    • Labels: None
    • Environment: ~30 node cluster

      Description

      DFS files are still rotting.

      I suspect that there's a problem with block accounting/detecting identical hosts in the namenode. I have 30 physical nodes, with various numbers of local disks, meaning that my current "bin/hadoop dfs -report" shows 80 nodes after a full restart. However, when I discovered the problem (which resulted in losing about 500 GB worth of temporary data because of missing blocks in some of the larger chunks), -report showed 96 nodes. I suspect somehow there were extra datanodes running against the same paths, and that the namenode was counting those as replicated instances, which then showed up as over-replicated, and one of them was told to delete its local block, leading to the block actually getting lost.

      I will debug it more the next time the situation arises. This is at least the 5th time I've had a large amount of file data "rot" in DFS since January.

      Attachments

      1. Hadoop-124-v3.patch
        90 kB
        Konstantin Shvachko
      2. Hadoop-124.patch
        81 kB
        Konstantin Shvachko
      3. DatanodeRegister.txt
        4 kB
        Konstantin Shvachko

          Activity

          Doug Cutting added a comment -

          Did you potentially start identically configured datanodes as a different user?

          Right now the only "lock" preventing this is the pid file used by the nutch-daemon.sh script. Perhaps the datanode should lock each directory in dfs.data.dir? That should prevent this, no?

          I suppose this could also happen if the datanode lost its connection to the namenode, but the namenode had not yet timed out the datanode. Then the datanode would reconnect and blocks might be doubly-reported. To fix this, perhaps the namenode should refuse to represent more than one copy of a block from a given IP? If a second is reported, the first should be forgotten?
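           As a rough illustration of the "one copy per IP" rule suggested above, the sketch below keeps at most one credited replica of a block per host and lets a newer report displace the older one. The class and method names are illustrative assumptions, not code from the namenode.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch (not the actual NameNode code): keep at most one
// replica of a block per host. A later report from the same host replaces
// the earlier one, so a stale datanode instance cannot inflate the
// replication count.
class PerHostBlockMap {
    // key: blockId + "@" + host, value: the datanode (host:port) currently credited
    private final Map<String, String> replicaByHost = new HashMap<>();

    /** Records a block report; returns the previous holder if one is displaced. */
    String report(long blockId, String host, String datanodeHostPort) {
        return replicaByHost.put(blockId + "@" + host, datanodeHostPort);
    }
}
```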

          Owen O'Malley added a comment -

           It seems like it would help to have the datanode generate a unique identifier the first time it is run, and save it in the data directory. On datanode restarts it uses the unique identifier from the data directory. The namenode would then be able to complain about multiple instances of the same datanode.
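           A minimal sketch of that idea: the file name "storageID" and the use of a UUID here are assumptions made for illustration, not what the eventual patch does.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.UUID;

// Sketch: persist a per-data-directory identifier, creating it on first run
// and reusing it on every restart.
class StorageIdFile {
    /** Returns the id stored in dataDir, generating and persisting one on first run. */
    static String loadOrCreate(Path dataDir) throws IOException {
        Path idFile = dataDir.resolve("storageID");   // illustrative file name
        if (Files.exists(idFile)) {
            return new String(Files.readAllBytes(idFile), StandardCharsets.UTF_8).trim();
        }
        String id = "DS-" + UUID.randomUUID();         // unique per data directory
        Files.write(idFile, id.getBytes(StandardCharsets.UTF_8));
        return id;
    }
}
```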

          Bryan Pendleton added a comment -

          I have certainly seen some weirdness with my cluster where a stop-all seemed to think it had succeeded but there were still datanode and tasktracker instances running. Locking at the directory level seems like a good hedge against that sort of problem. As a bonus - it'll prevent blocks from getting removed if someone mistakenly starts up a second set of datanodes against a different namenode, as long as the datanode daemon for the competing DFS instance is running on the machine.

           Keeping track of only one copy of a block for a given IP seems like a good start, but might be too simple. What if, by some unfortunate process, a lot of duplicates get stored on a node? If there's no way to detect this, a bunch of copies of a block could end up duplicated on a node unnecessarily. One way out of that would be to notice when a host is reporting more than one copy of a block and kick off a knee-jerk fix: 1) extra-replicate the block, 2) ask the node with dups to remove its copies of the block. Of course, in the case where two datanode instances are somehow serving from the same directory, the knee-jerk reaction would kick off for all blocks that get stored into that space - so locking would definitely be necessary to make this work.

          Doug Cutting added a comment -

          This seems like something we should fix in the .2 release, no?

          Doug Cutting added a comment -

          Changing the summary to better describe what we intend to fix.

          Doug Cutting added a comment -

          I just talked with Owen, and we came up with the following plan:

          (0) store a node id in each dfs.data.dir;

          (1) pass the node id to the name node in all DataNodeProtocol calls;

           (2) the namenode tracks datanodes by <id,host:port> pairs, only talking to one id from a given host:port at a time. Requests from an unknown host:port return a status that causes the datanode to exit and not restart itself.

           (3) add a hello() method to DataNodeProtocol, called at datanode startup only. This erases any entries for a datanode id, replacing them with the new entry.

          Thus when a second datanode is started it causes any existing datanode running on that host to be forgotten and to exit when it next contacts the namenode.
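           An illustrative sketch of the namenode side of this plan (assumed names, not the committed implementation): one datanode id is tracked per host:port, a hello() from a new id displaces the old one, and later calls from a displaced or unknown datanode are told to shut down.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of steps (2) and (3) above, under assumed class and method names.
class DatanodeRegistry {
    enum Status { OK, SHUTDOWN }

    private final Map<String, String> idByHostPort = new HashMap<>();

    /** Step (3): called once at datanode startup; replaces any previous entry for this host:port. */
    synchronized void hello(String hostPort, String nodeId) {
        idByHostPort.put(hostPort, nodeId);
    }

    /** Step (2): every other DataNodeProtocol call carries the node id. */
    synchronized Status check(String hostPort, String nodeId) {
        String current = idByHostPort.get(hostPort);
        if (current == null || !current.equals(nodeId)) {
            return Status.SHUTDOWN;   // unknown or superseded datanode: exit, don't restart
        }
        return Status.OK;
    }
}
```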

          Konstantin Shvachko added a comment -

          There is a problem with shutting down the old datanode when the new one starts.
          The new datanode must be sure the old one is gone before taking over.

          I think a datanode should merely keep a FileLock in the dfs.data.dir while it is running.
          In this case the new node will not be able to start with the same data directory.
          Is that the problem we are trying to solve here?
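           A minimal sketch of the FileLock approach, assuming a lock file named "in_use.lock" inside each dfs.data.dir (the file name is illustrative). The lock is held for the lifetime of the datanode process and is released by the operating system if the process dies, so a crashed datanode does not leave the directory permanently locked.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.channels.OverlappingFileLockException;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Sketch: take an exclusive OS-level lock on a file inside the data directory
// so a second datanode pointed at the same dfs.data.dir fails to start.
class DataDirLock {
    private FileChannel channel;
    private FileLock lock;

    void acquire(Path dataDir) throws IOException {
        channel = FileChannel.open(dataDir.resolve("in_use.lock"),
                StandardOpenOption.CREATE, StandardOpenOption.WRITE);
        try {
            lock = channel.tryLock();   // null if another process holds the lock
        } catch (OverlappingFileLockException e) {
            lock = null;                // already locked within this JVM
        }
        if (lock == null) {
            throw new IOException("Data directory " + dataDir
                    + " is already in use by another datanode");
        }
    }
}
```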

          eric baldeschwieler added a comment -

          I believe we have three problems to address:

           1) The namenode needs to know to purge old identical entries when a new datanode registers. Else we get rot. See Doug's suggestions above.

           2) You could have 2 or more datanodes on one server. They need to always be unique. We should assign a unique ID to each datanode home directory and make sure the datanode is started with a valid home directory as well. I like the idea of assigning a uniqueID to each datanode home.

           3) Your concern that two daemons might run on the same datadir.

           We should address all these concerns. Your suggestion of a startup lock and a uniqueID, plus the hello() method, should together handle all of these.

          Konstantin Shvachko added a comment -

           I see exactly 2 problems now.
           1) Two datanodes can be started sharing the same data directory, reporting the same blocks twice.
           I'm planning to solve this by placing a lock in the data dir.
           2) Starting a replacement datanode.
           If a datanode fails, or is intentionally shut down, a new datanode
           server can be started on the same node and port, or on a different node or port,
           but connected to the same data directory.
           In the current implementation, if the new datanode starts before the namenode has removed the old datanode from
           its list, identical blocks will be considered as belonging to both nodes, and consequently over-replicated.
           To solve this second problem the datanode should store a unique id per data directory, as proposed.
           And the namenode should identify block reports by those ids (only!), rather than by hostname:port.
           Each datanode reporting the same id (no matter which hostname or port it has) should be considered
           a replacement for the old node, and the old datanode at this point should be removed from the namenode
           the same way it is removed when its heartbeats have not reached the namenode for 10 minutes.

          eric baldeschwieler added a comment -

          +1

          Konstantin Shvachko added a comment -

          Here is an algorithm that imo solves the problem in the most general case.
          This is the registration part only, since the rest is rather straightforward.

          I'm trying to cover two issues.
          1) Data nodes should be banned from reporting the same block copies multiple times
          if they are intentionally or unintentionally started to serve the same data storage.
          That is why data nodes need to register, and need to keep a persistent storageID.
          2) Name node should be able to recognize registered data nodes, even if it is restarted
          or replaced by a spare name node serving the same name space.
          That is why name nodes need to keep a persistent namespaceID.

          Comments are highly appreciated.
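           A rough sketch of the registration handshake described above (the attached DatanodeRegister.txt has the full algorithm; field and method names below are assumptions for illustration only): the datanode presents its persistent storageID and namespaceID, the name node rejects a storage that belongs to a different namespace, and a brand-new storage is assigned a fresh storageID.

```java
// Illustrative registration check, not the code from the patch.
class RegistrationSketch {
    static final String NAMESPACE_ID = "NS-example";      // persisted by the name node

    /** What a data node presents when it registers. */
    static class DatanodeInfo {
        String storageID;      // null/empty for a brand-new data directory
        String namespaceID;    // null/empty if the storage has never registered
    }

    /** Name node side: validate and, if needed, assign a persistent storageID. */
    static String register(DatanodeInfo d) {
        if (d.namespaceID != null && !d.namespaceID.equals(NAMESPACE_ID)) {
            throw new IllegalStateException(
                "Data storage belongs to a different namespace: " + d.namespaceID);
        }
        if (d.storageID == null || d.storageID.isEmpty()) {
            d.storageID = "DS-" + System.nanoTime();  // new storage: assign an id
        }
        d.namespaceID = NAMESPACE_ID;                 // data node persists both ids
        return d.storageID;
    }
}
```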

          Konstantin Shvachko added a comment -

          This is the patch that fixes the problem.
          DFS_CURRENT_VERSION has been changed to -2,
          since internal file layouts have changed.

          I created a new package for exceptions.

          Konstantin Shvachko added a comment -

           For future development in this direction.
           We should persistently store on the name node all storage IDs that the
           name node has ever assigned any blocks to.
           With that knowledge the name node can reject blocks from any newly
           registered data storages that are not on the name node's list.
           In other words, when a data node registers a NEW data storage it should not
           report any blocks from that storage, and the name node can effectively verify
           that, since it never assigned any blocks to this storage.
           This would prevent us from accidentally connecting data nodes representing
           different clusters (DFS instances).
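           A small sketch of that future check, under assumed names: a storage the name node has never assigned blocks to must register empty.

```java
import java.util.HashSet;
import java.util.Set;

// Sketch of the proposed future check (assumed names, not committed code).
class KnownStorages {
    private final Set<String> storagesWithBlocks = new HashSet<>(); // persisted on the name node

    void recordBlockAssignment(String storageID) {
        storagesWithBlocks.add(storageID);
    }

    void checkBlockReport(String storageID, int reportedBlockCount) {
        if (!storagesWithBlocks.contains(storageID) && reportedBlockCount > 0) {
            throw new IllegalStateException("Storage " + storageID
                + " was never assigned blocks by this name node; rejecting its report");
        }
    }
}
```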

          Konstantin Shvachko added a comment -

          Eric Baldeschwieler wrote:
          > why not store the cluster in the data node?

           We can alternatively store the namespaceID in every data storage belonging to the cluster.
           Maybe this is conceptually cleaner. I preferred storing storageIDs in the meta data just
           because this gives the name node knowledge of which storages are missing, and lets it
           report them.

          Doug Cutting added a comment -

          +1, except the exceptions should not be in a separate package, but in the dfs package. If you make that change then I'd be happy to commit this, since it fixes a very critical bug. Thanks!

          Doug Cutting added a comment -

          I just applied this to trunk and unit tests failed with UnregisteredDatanodeExceptions.

          Konstantin Shvachko added a comment -

           Yes, that was a MiniDFSCluster incompatibility.
           Fixed that, merged with the current version,
           and removed the exception package.

          Doug Cutting added a comment -

          This patch does not apply to the current trunk. Can you please update your sources to the current trunk, resolving any conflicts, and re-generate the patch? Thanks!

          Doug Cutting added a comment -

          This adds a number of new public classes that I'm not certain should be public. Should user code ever need to access a DataStorage, DatanodeID or DatanodeRegistration, or are these only used internally? Also, several of these exceptions appear only to be used internally, but I'm not certain about all of them. Would you object if I simply make all of these new classes package-private? Then, if we need, we can reveal more later as needed.

          Doug Cutting added a comment -

          I just committed this, modified to make most of the new classes package-private rather than public. Thanks, Konstantin!

          Konstantin Shvachko added a comment -

           Yes, I think it is OK to make the classes inaccessible from outside of the project.
           Exceptions are different: they are returned to a client in the form of IOException right now,
           but if the client wants to distinguish between them, then we need to keep them public.


            People

             • Assignee:
               Konstantin Shvachko
             • Reporter:
               Bryan Pendleton
             • Votes:
               2
             • Watchers:
               0
