HADOOP-101: DFSck - fsck-like utility for checking DFS volumes

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.2.0
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

      Description

      This is a utility to check health status of a DFS volume, and collect some additional statistics.

      Attachments

      1. Fsck.html (16 kB, Ravi Phulari)
      2. DFSck.java (18 kB, Andrzej Bialecki)

        Activity

        Konstantin Shvachko added a comment -

        Looks like a good utility for collecting meta-data statistics.

        It does not detect missing blocks, though.
        The missing block problem is when the namenode "thinks" a block
        is stored on one or several datanodes, while it is not.
        In order to detect this the utility needs to actually read blocks from
        each of the specified locations. Is there an API call for that?
        And it does not have an option to convert the system into a consistent state,
        which is usually provided by fscks.

        It would be very useful to integrate this code with the DFSShell.report(),
        providing more reporting options for the system.

        Andrzej Bialecki added a comment -

        It does detect missing blocks, that's the whole point. However, you need to restart the namenode first to be sure you get the latest block reports from datanodes.

        Current failure modes for DFS involve blocks that are completely missing. The only way to "fix" them would be to recover chains of blocks and put them into lost+found - and here we could do better than fsck, because we know the file name they belong to.

        Andrzej Bialecki added a comment -

        Oops, forgot to comment on another point you make: yes, it would be great if we could add an API call to request a datanode to check the status of a given block without sending it to the client, e.g. read it physically off the disk and report success/failure. We can do this now only by actually retrieving the block, which is too costly.

        This is the first version of the tool, any contributions are of course welcome.

        Hairong Kuang added a comment -

        The dfsck is done at the client side and hence does not lock the file system. It has a potential inconsistency problem: what if a client deletes a directory while the tool is checking it?

        Another concern is performance. The tool needs to issue an RPC to the namenode for each dir/file in the system. Any benchmark?

        Doug Cutting added a comment -

        I think not locking the fs is a feature. If it gets a file-not-found error from the namenode, it should probably back up and recheck whether the file still exists, and, if it doesn't, check whether its parent still exists, etc., attempting to ignore errors that are only the result of changes to the FS while checking it. In the short term, this tool is superior to anything else we currently have, and I'd vote for adding it as a 'bin/hadoop dfs -check' command. As for performance, as a single-threaded client process it shouldn't be able to overwhelm the namenode.
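The "back up and recheck" idea above can be sketched in plain Java. This is only an illustration of the control flow, not the actual DFSck code; the class and method names are hypothetical, and the namespace lookup is abstracted behind a predicate:

```java
import java.util.function.Predicate;

// Sketch: when a file-not-found error arrives mid-scan, decide whether it
// is merely the result of a concurrent delete by walking up the path and
// rechecking each ancestor against the live namespace.
public class RecheckSketch {
    /**
     * Returns true if the error can be ignored because the path (or an
     * ancestor) no longer exists, i.e. the tree changed under the scanner.
     */
    static boolean isConcurrentChange(String path, Predicate<String> exists) {
        String p = path;
        while (!p.isEmpty()) {
            if (!exists.test(p)) {
                return true;            // path or a parent vanished: not corruption
            }
            int slash = p.lastIndexOf('/');
            if (slash <= 0) break;      // reached the root
            p = p.substring(0, slash);  // back up to the parent directory
        }
        return false;                   // everything still exists: a real error
    }

    public static void main(String[] args) {
        // Simulated namespace after a concurrent "delete /a/b".
        java.util.Set<String> ns = new java.util.HashSet<>(
            java.util.Arrays.asList("/a", "/a/c"));
        System.out.println(isConcurrentChange("/a/b/file", ns::contains)); // true
        System.out.println(isConcurrentChange("/a/c", ns::contains));      // false
    }
}
```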

        Yoram Arnon added a comment -

        This is a good and awaited feature, filed previously as bug hadoop-95. I vote to check it in, because as you say, it's much better than anything we have, and of critical importance.

        Regarding performance, clearly the nameserver will not be overwhelmed, but the operation may take a very long time to execute. It's one thing to traverse a million entries in memory (for a modest 32TB FS), but another matter to execute a hundred thousand RPC calls from a single client. Also, when we change the open command to not return the entire list of blocks, in the interest of shortening the time of opening a file, especially when reading just a few blocks from a very large file, the implementation will need to change.

        Lastly, there's extensibility. We'll want to test for things that are available only on the name server, like blocks that are not used by any file.

        Wouldn't it be better to request the server to execute this code internally, and report results either to the client or to a local file?

        Doug Cutting added a comment -

        I like that this does not use anything more than the client API to check the server. That keeps the server core lean and mean. The use of RPCs effectively restricts the impact of the scan on the FS.

        A datanode operation that streams through a block without transferring it over the wire won't correctly check checksums using our existing mechanism. To check file content we could instead simply implement a map-reduce job that streams through all the files in the fs. This would not take much code: nothing additional in the core. MapReduce should handle the locality, so that most data shouldn't go over the wire.
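The map-reduce content check suggested above could look roughly like the following. Hadoop's real MapReduce API is deliberately omitted; this is a plain-Java illustration of the data flow, with all names hypothetical — each "map task" streams through one file (which forces checksum verification on read) and emits the file name only if the read fails:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of a full-content scan: read every file end to end and collect
// <file, error> pairs for the ones that cannot be read cleanly.
public class ScanSketch {
    interface Reader { void readFully(String file) throws Exception; }

    static Map<String, String> scan(Iterable<String> files, Reader reader) {
        Map<String, String> corrupt = new LinkedHashMap<>();
        for (String f : files) {              // in MapReduce, one map task per split
            try {
                reader.readFully(f);          // checksums are verified on read
            } catch (Exception e) {
                corrupt.put(f, e.getMessage()); // emit <file, error> to the reducer
            }
        }
        return corrupt;
    }

    public static void main(String[] args) {
        Map<String, String> bad = scan(
            java.util.Arrays.asList("good.dat", "bad.dat"),
            f -> { if (f.startsWith("bad")) throw new Exception("checksum error"); });
        System.out.println(bad); // {bad.dat=checksum error}
    }
}
```

Because the reads go through the normal client path, locality handling and checksum verification come for free, which is the point of Doug's suggestion.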

        BTW, blocks not used by any file are not known to the name node, are they? When they're reported by a datanode the datanode is told to remove them.

        Andrzej Bialecki added a comment -

        Wow, lots of comments, let me address some of them:

        • re: locking. I also see this as an advantage, fsck can run in parallel with normal operations. If someone else deletes a file, no big deal - the name is removed from the namesystem, so if we suddenly detect missing blocks we could always check if a file with this name still exists in the namesystem.
        • re: performance. Sure, we could parallelize this, which should speed things up (currently it's rather slow, checking ~1TB takes > 2 hours), but then it would put a higher load on the namenode. Perhaps we could make this an option, e.g. start a configurable pool of fsck threads in parallel.
        • re: blocks not in use by any file. I think this is already handled internally by namenode<->datanode protocol (for good and for bad), i.e. namenode detects orphaned blocks and tells datanodes to remove them. See FSNamesystem:924 .
        • handling the reverse situation (missing blocks in existing files) should be straightforward, with the use of /lost+found directory: for each corrupted file a directory would be created there, and remaining chains of consecutive blocks would be stored in that directory.
        • re: checking blocks through streaming: +1, I like the concept, could you perhaps implement it? Also, what happens if a mapred task tries to retrieve a missing/corrupted block? I think currently this hangs the task, due to a missing break in the while loop in DFSClient:354
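The /lost+found recovery step described in the list above amounts to splitting a file's block list into maximal chains of consecutive healthy blocks, each of which would be saved as a separate file under a directory named after the corrupted file. A minimal plain-Java sketch of that split (illustrative names, not the actual DFSck implementation):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: given per-block health flags, find the maximal runs of
// consecutive retrievable blocks that can be salvaged into /lost+found.
public class ChainSketch {
    /** ok[i] == true means block i is still retrievable. Returns [start, end] pairs. */
    static List<int[]> healthyChains(boolean[] ok) {
        List<int[]> chains = new ArrayList<>();
        int start = -1;
        for (int i = 0; i <= ok.length; i++) {
            boolean healthy = i < ok.length && ok[i];
            if (healthy && start < 0) start = i;          // chain begins
            if (!healthy && start >= 0) {                 // chain ends
                chains.add(new int[] { start, i - 1 });
                start = -1;
            }
        }
        return chains;
    }

    public static void main(String[] args) {
        // Blocks 0-1 and 3 survive; block 2 is missing.
        for (int[] c : healthyChains(new boolean[] { true, true, false, true }))
            System.out.println("blocks " + c[0] + ".." + c[1]);
    }
}
```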
        Konstantin Shvachko added a comment -

        Just posted a map-reduce test that checks all blocks of all files.
        See HADOOP-95.
        The infinite loop in DFSClient is fixed now, so it works.

        With respect to some of the previous comments.
        Restarting the cluster (a big one) just to check its consistency is not an exciting option.
        This means that we will have to wait up to 55 minutes before missing blocks will
        be detected by examining just the namenode data.

        A drawback of the map-reduce test is that we cannot force the system to check all replicas
        of a block, so a corrupted block is reported only if all of its replicas are bad.
        But yes, this is better than nothing.

        Andrzej Bialecki added a comment -

        Regarding the restarting time - yes, but I couldn't find any other way to force datanodes to update their block reports with the namenode, perhaps we should extend BlockCommand to support this (the namenode can't call datanodes, it can just return BlockCommands when datanodes call in).

        Andrzej Bialecki added a comment -

        Updated version. Added options to treat inconsistencies: ignore, move to /lost+found, delete.

        Hairong Kuang added a comment -

        Andrzej, for your benchmark experiment (~1TB takes > 2 hours), how many directories and files were there in your dfs? For the performance of your dfsck, I think the number of files and dirs matters rather than the size of the data in dfs.

        Andrzej Bialecki added a comment -

        There were around 40,000 directories, and 200,000 files.

        Andrzej Bialecki added a comment -

        Added as 'bin/hadoop fsck'.

        Ravi Phulari added a comment -

        Attaching test plan for Fsck.


          People

          • Assignee: Andrzej Bialecki
          • Reporter: Andrzej Bialecki
