Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.16.0
    • Component/s: None
    • Labels: None

      Description

      Currently, on-disk corruption of data blocks is detected only when a block is read by a client or by another datanode. These errors would be detected much earlier if the datanode periodically verified the data checksums for its local blocks.

      Some of the issues to consider:

      • How often should we check the blocks (no more often than once every couple of weeks?)
      • How do we keep track of when a block was last verified (there is a .meta file associated with each block)?
      • What action to take once corruption is detected
      • Scanning should be done at a very low priority, with the rest of the datanode disk traffic in mind.
      Attachments

      1. HADOOP-2012.patch
        21 kB
        Raghu Angadi
      2. HADOOP-2012.patch
        22 kB
        Raghu Angadi
      3. HADOOP-2012.patch
        22 kB
        Raghu Angadi
      4. HADOOP-2012.patch
        22 kB
        Raghu Angadi
      5. HADOOP-2012.patch
        69 kB
        Raghu Angadi
      6. HADOOP-2012.patch
        49 kB
        Raghu Angadi
      7. HADOOP-2012.patch
        51 kB
        Raghu Angadi
      8. HADOOP-2012.patch
        56 kB
        Raghu Angadi
      9. HADOOP-2012.patch
        61 kB
        Raghu Angadi
      10. HADOOP-2012.patch
        61 kB
        Raghu Angadi
      11. HADOOP-2012.patch
        26 kB
        Raghu Angadi
      12. HADOOP-2012.patch
        61 kB
        Raghu Angadi
      13. HADOOP-2012.patch
        62 kB
        Raghu Angadi
      14. HADOOP-2012.patch
        61 kB
        Raghu Angadi

        Issue Links

          Activity

          dhruba borthakur added a comment -

          Another proposal would be to dedicate a certain percentage of the disk bandwidth (e.g. 5%) to verification tasks. (I believe HDFS adopts a similar approach for rebalancing, wherein a certain bandwidth is reserved for rebalancing jobs.)

          The Datanode would continuously keep verifying blocks in a round-robin manner. Would it be enough to store the last verification time of a block only in memory (and not on disk) and start on a random block when the datanode restarts? Once a corruption is detected, it can remove that block and trigger a block report.

          Raghu Angadi added a comment -

          The main issue I am wondering about is also whether to store the last verification time persistently or not. Even if we want to store it persistently, this would be non-critical data; some simple text file would do. Persisting it is not that important for real clusters, since we expect the datanodes to stay up for many days. It mainly matters during development, where we start and restart all the time.

          So I vote for storing only in memory as well.

          Yes, we should throttle the scan rate. We can do something like at most 5% (what 5% means is another issue!) and at most once a week. Rebalancing already adds an optional throttler for the reader; we will use the same. Also, whenever a complete block is read without errors as part of normal operation, we will update the last verification time.
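
          A minimal sketch of the kind of rate throttler discussed above, assuming a simple sleep-based byte budget per accounting period; the class and method names here are illustrative, not the actual Hadoop throttler API.

            // Illustrative throttler: callers report how many bytes they just read and the
            // throttler sleeps long enough to keep the average rate at or below the limit.
            public class SimpleThrottler {
              private final long bytesPerPeriod;
              private final long periodMillis = 500;
              private long bytesUsed = 0;
              private long periodStart = System.currentTimeMillis();

              public SimpleThrottler(long bytesPerSecond) {
                this.bytesPerPeriod = bytesPerSecond * periodMillis / 1000;
              }

              /** Call after reading numBytes; blocks if the rate limit is exceeded. */
              public synchronized void throttle(long numBytes) {
                bytesUsed += numBytes;
                long elapsed = System.currentTimeMillis() - periodStart;
                if (elapsed >= periodMillis) {            // start a new accounting period
                  periodStart = System.currentTimeMillis();
                  bytesUsed = numBytes;
                } else if (bytesUsed >= bytesPerPeriod) { // over budget: sleep out the period
                  try {
                    Thread.sleep(periodMillis - elapsed);
                  } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                  }
                  periodStart = System.currentTimeMillis();
                  bytesUsed = 0;
                }
              }
            }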

          Robert Chansler added a comment -

          I might think it important for the verifier to keep a record of what work it has done. Else how would we know the verifier has been verifying?

          Also, there needs to be a systematic survey of all blocks to ensure that each block is eventually visited even in the event of frequent restarts of the datanode (even if that is not the intention).

          Raghu Angadi added a comment -

          > I might think it important for the verifier to keep a record of what work it has done. Else how would we know the verifier has been verifying?

          Datanode will log each time it verifies a block. Also, a simple servlet can list all the blocks and their verification times on the web interface.

          > Also, there needs to be a systematic survey of all blocks to ensure that each block is eventually visited even in the event of frequent restarts of the datanode (even if that is not the intention).

          Yes, will keep this in mind. Since the information is not persistent, it is important to make sure some blocks don't get starved. Initially starting in random order might work.

          dhruba borthakur added a comment -

          Thinking more about this one, I agree with Rob that it is important to have an algorithm that eventually verifies all blocks even in the face of frequent datanode restarts. In fact, if we want the datanode to scale to a hundred thousand blocks, then this algorithm is essential.

          Instead of storing the last modification time of each block, can we have some other algorithm where each block's metadata need not be updated every time a block is verified? How about if we start verifying blocks in increasing blockid order and record the current blockid that was verified? Maybe we need to persist this information only once every 100 blocks or so. If we reach the largest known blockid, we cycle back to the lowest blockid and start verifying from there. For a datanode that has 100K blocks, it will take only about 1MB of memory to keep a lazily-sorted list of blockids.
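
          A rough sketch of the blockid-ordered scan described above, persisting the scan cursor only once every 100 blocks; verifyBlock() and persistCursor() are placeholders for whatever the datanode would actually do, not real Hadoop methods.

            import java.util.TreeSet;

            // Scan blocks in increasing blockid order, persisting the cursor occasionally.
            public class BlockIdOrderScanner {
              private static final int PERSIST_EVERY = 100;
              private final TreeSet<Long> blockIds = new TreeSet<Long>(); // sorted block ids
              private long cursor = Long.MIN_VALUE;                       // last persisted id
              private int sincePersist = 0;

              /** One full pass over all known blocks, resuming from the persisted cursor. */
              public void scanOnePass() {
                if (blockIds.isEmpty()) return;
                Long start = blockIds.higher(cursor);
                if (start == null) start = blockIds.first();  // wrap to the lowest id
                Long id = start;
                do {
                  verifyBlock(id);
                  cursor = id;
                  if (++sincePersist >= PERSIST_EVERY) {
                    persistCursor(cursor);
                    sincePersist = 0;
                  }
                  id = blockIds.higher(id);
                  if (id == null) id = blockIds.first();      // cycle back to the lowest id
                } while (!id.equals(start));
              }

              private void verifyBlock(long id)   { /* read the block and check its CRCs */ }
              private void persistCursor(long id) { /* write the cursor to a small file  */ }
            }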

          Raghu Angadi added a comment -

          Chatted with Rob and Dhruba. Here is the proposal:

          • No data is stored persistently for now. If we need this, we can add it later.
          • Datanode maintains the last verification time for all the blocks in memory.
          • When it first starts up, it keeps the blocks in random order, and from then onwards verifies them based on last verification time. The initial random order is to avoid never verifying some blocks across frequent restarts.
          • Provides detailed stats on the web UI so that we can know if there are any blocks that have not been verified in a long time. This also lets us know whether we should store any data persistently.
          • Instead of just deleting a block when corruption is detected, the datanode could let the namenode decide whether to remove the block immediately or not. This might need a few changes on the Namenode.
          Sameer Paranjpye added a comment -

          Why not keep the last verification time in each block's .meta file, or use the native filesystem timestamp for this purpose?

          Raghu Angadi added a comment -

          > Why not keep the last verification time in each block's .meta file, or use the native filesystem timestamp for this purpose?

          Many disk errors occur during the write path. If we modify the .meta file, which contains the CRC data, every few days, we might actually increase the error rate.

          Also, writing to the meta file will be at least one random seek/write for each verification.

          If the above two are ok, we can put this info in the .meta file. I viewed the last verification time as non-critical data that does not need to be maintained persistently.

          Sameer Paranjpye added a comment -

          > Many disk errors occur during the write path. If we modify the .meta file, which contains the CRC data, every few days, we might actually increase the error rate.

          Would writing more to the .meta file increase the chance of the .meta file being corrupted? I'm not sure I see that.

          > Also, writing to the meta file will be at least one random seek/write for each verification.

          This matters only if the validation is done very frequently. The scheme proposed here suggests slow background checking. Do we really need to check more than a few blocks an hour?

          Raghu Angadi added a comment -

          > Would writing more to the .meta file increase the chance of the .meta file being corrupted? I'm not sure I see that.

          Even if we are writing only 8 bytes, if there is a problem during the write (hardware/driver error), it would usually corrupt a few sectors rather than just a few bytes. The more times we write to the same location of a file, the greater the chance of an error.

          I might be wrong; maybe errors that occur during writes are much, much rarer than the other ways disks go corrupt, but I have seen quite a few kernel/driver errors while writing.

          Sameer Paranjpye added a comment -

          > Even if we are writing 8 bytes, if there a problem during the write (hardware / driver error) ...

          Maybe, but the number of writes that will be done as a result of this process is a couple of orders of magnitude less than those from regular block writes, replications, appends ... This is just noise.

          Raghu Angadi added a comment -

          Without appends, a block and its .meta file are written to only once. With this, the .meta file will be written to many times. Ignoring appends, the probability of write corruption for a given byte around the 'last verification time' would be orders of magnitude higher than for other data on the disk, not lower.

          I will increase the .meta version and write this information at the beginning of the file. I am still not very sure of the benefits of persisting the verification time, but there might be more.

          Raghu Angadi added a comment -

          Currently, the checksum is not verified when a block is read to serve to a client. Should we add checksum verification during these reads? Otherwise I won't update the block verification time when it is read.

          Doug Cutting added a comment -

          > Currently checksum is not verified when a block is read to serve to client.

          The checksum is verified on the client, and failures there are reported back to DFS. They may or may not indicate an on-disk problem, but that's also true of tests done on the datanode. So I don't think there's much point in validating on the datanode at read time.

          Raghu Angadi added a comment -

          Yes, the only motivation was so that we could update the verification time. A failure is not currently reported to the Namenode if the checksum failure occurs during a transfer to another datanode.

          Actually, we can still update the verification time if the client responds with success. I guess for now we won't do that (since the client might return success if it is specifically asked to ignore checksum errors). I will just note the option in comments.

          Sameer Paranjpye added a comment -

          > The checksum is verified on the client, and failures there are reported back to DFS

          Are these failures reported to the Namenode or the Datanode? From what I know it's the Namenode, which I believe doesn't do anything with the reports other than logging the failure. It would be better to report the failure to the Datanode, have the Datanode validate what's on disk and report corruption to the Namenode if the validation fails.

          Raghu Angadi added a comment -

          > From what I know it's the Namenode, which I believe doesn't do anything with the reports other than logging the failure.

          The Namenode actually deletes the block if there is more than one copy. Periodic verification uses the same RPC.

          > It would be better to report the failure to the Datanode, have the Datanode validate what's on disk and report corruption to the Namenode if the validation fails.

          I agree. This will avoid spurious block deletion in the case of in-memory corruption.

          Raghu Angadi added a comment -

          Attaching the patch for review.

          • Currently the period is 2 weeks and is not configurable.
          • This patch does not throttle yet. It will need the throttler from HADOOP-1912; it will be a simple mod to add it.
          • This patch does not store the last scan times persistently. How we want to modify the meta file will be clearer once we have a better picture of how it is affected by the append feature.
          • This does not have a unit test yet. For a unit test, we need to make SCAN_PERIOD configurable.
          Raghu Angadi added a comment -

          Also this patch should increment DatanodeProtocolVersion.

          Raghu Angadi added a comment -

          Updated the patch. SCAN_PERIOD was set to a negative value by mistake.

          Raghu Angadi added a comment -

          Another minor fix.

          Also, I forgot to mention this earlier:
          http://datanode:50075/blockScannerReport shows summary of periodic verification.
          http://datanode:50075/blockScannerReport?listblocks lists all the blocks along with summary.

          Raghu Angadi added a comment -

          Updated the patch again. When verification fails, we repeat the verification to minimize the effect of transient errors.

          eric baldeschwieler added a comment -

          A couple of comments:

          1) The idea of not keeping scan times strikes me as "really bad". Also, randomizing the scan order weirds me out. I do think Raghu's point about modifying the meta files may be valid. Why not simply always scan blocks in numeric order starting from zero and log actions? On restart we can tail this log to find the last block validated and start from there. Inserted blocks are recent by definition, so it is ok if we don't get around to them until the next scan.

          2) It seems to me it might be better to try to repair the block if possible, rather than just delete it. This avoids bad corner cases. It adds complexity though. Thoughts? A simple variant is just to copy a new version locally.

          3) Throttling might simply entail spacing and scheduling when you scan the next block so that you complete within roughly two weeks. This would imply that we want to persist when the current scan started. If we do that, the penalty of scanning quickly might be fairly ignorable, when you consider the other variations in workload a DN is exposed to anyway. You'd want some rule like always waiting 10x the time it took you to validate a block between blocks, to avoid weird corner cases where the node gets a huge number of blocks added near the end of the time period.
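
          A tiny sketch of the pacing rule suggested in (3), assuming the datanode simply sleeps a multiple of the time the previous verification took; the factor and the nextBlock()/verify() methods are illustrative placeholders.

            // Pace the scan: after verifying a block, wait roughly 10x the time the
            // verification took before starting the next one, so steady-state scan load
            // stays around 1/(1 + factor) of what the disk could sustain.
            public class PacedScanner {
              private static final int SPACING_FACTOR = 10;

              public void run() throws InterruptedException {
                while (true) {
                  long blockId = nextBlock();
                  long start = System.currentTimeMillis();
                  verify(blockId);
                  long took = System.currentTimeMillis() - start;
                  Thread.sleep(SPACING_FACTOR * took);
                }
              }

              private long nextBlock()     { return 0L; /* pick the next block to verify */ }
              private void verify(long id) { /* read the block and check its CRCs */ }
            }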

          dhruba borthakur added a comment -

          I like Eric's idea of ordering the blocks in numeric order. I had described this approach in my earlier comment in this Jira. The Datanode can persist the last block that was verified (for Datanode restarts).

          Raghu Angadi added a comment -

          > 2) It seems to me it might be better to try to repair the block if possible, rather than just delete it. This avoids bad corner cases. It adds complexity though. Thoughts? A simple variant is just to copy a new version locally.

          The Datanode does not actually physically delete the block after detecting corruption. It asks the Namenode to delete the block (just as a client does when it detects corruption). The Namenode deletes the block only if there are other replicas left, and then replicates the block. Does this address the case?

          Regarding (1) and Dhruba's comment, a log file would be my preferred approach too. But the main question I get asked is "Why add another file?".

          Will think about (3).

          dhruba borthakur added a comment -

          The metadata about the entire Datanode is stored in the VERSION file. It is possible that we can store the last verified-blockid in this file (instead of adding a new file).

          Do we really need a scan period? Your proposal that the Datanode spends a certain percentage of the disk bandwidth to verify blocks sounds effective by itself. If a datanode has 100K blocks of 128MB each, and it is configured to use 5MB/sec of disk bandwidth for verification, it would take the Datanode about 4 days to verify each and every block it has in the system. The next iteration could start immediately. If a datanode has few blocks, each iteration would finish quickly and the next iteration would start immediately. Is there a disadvantage in starting iterations back-to-back? We can get away with not having another configuration parameter.

          eric baldeschwieler added a comment -

          What value is there in scanning a single block continuously in the extreme case?

          Seems like we should do our part to keep our carbon footprint down.

          Raghu Angadi added a comment - - edited

          > The metadata about the entire Datanode is stored in the VERSION file. It is possible that we can store the last verified-blockid in this file (instead of adding a new file).

          The VERSION file contains only the basic information needed for starting up a Datanode and is vital for Datanode startup. It is not updated at runtime. I don't think it is suited for this.

          Regarding continuous scanning, I think most users would not prefer that. Even 5MB per sec is close to 20% of a single disk's read bandwidth, and much more if the read/write is not very sequential (which happens when there are multiple reads and writes).

          We can certainly make SCAN_PERIOD and the throttle bandwidth configurable (maybe not in hadoop-defaults.xml), so power users can tweak them as appropriate. I know there is strong resistance to adding any config variables. But my personal opinion is that a few more config vars are ok when 99% of users don't need to worry about them because of good defaults and little effect on performance.

          Sameer Paranjpye added a comment -

          Why not have a scan period only?

          The scan period defines a window in which every block that exists at the beginning of the window will be examined (barring blocks that are deleted). A Datanode would construct a schedule for examining blocks in a scan period, with the least recently examined blocks going first. New blocks would be scheduled in the next window. The schedule could be constructed by dividing a window into scanperiod/n intervals, one interval per block. A Datanode would determine how much bandwidth it needs to scan a block based on when the next block is scheduled.

          This would guarantee that every block that exists at the beginning of a scan period is examined once in the scan period. It would also guarantee an upper bound of 2*scan period between two scans of a given block. This is also an upper bound on the amount of time that elapses before a new block is scanned. In both cases, the time elapsed will, in the average case, be close to the scan period and approach 2*scan period if a large number of blocks are added in a window. These seem like reasonable guarantees.

          It would make sense to have a reasonable upper bound on the amount of bandwidth used for scanning and emit a warning if this is not enough to examine all blocks in a scan period. So if someone sets a scan period of 1 minute or something else silly, the Datanode doesn't spend all its time scanning.
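
          A back-of-the-envelope sketch of the scheduling arithmetic described above: divide the scan period into one slot per block and warn if finishing a block inside its slot would need more bandwidth than a configured cap. All names and the numbers in main() are illustrative assumptions.

            // Check whether a scan period is feasible under a bandwidth cap.
            public class ScanSchedule {
              public static void check(long numBlocks, long avgBlockBytes,
                                       long scanPeriodMillis, long maxBytesPerSec) {
                if (numBlocks == 0) return;
                long slotMillis = scanPeriodMillis / numBlocks;     // one interval per block
                double neededBytesPerSec = avgBlockBytes / (slotMillis / 1000.0);
                if (neededBytesPerSec > maxBytesPerSec) {
                  System.err.println("WARNING: scan period too short: need "
                      + (long) neededBytesPerSec + " B/s but the cap is "
                      + maxBytesPerSec + " B/s");
                }
              }

              public static void main(String[] args) {
                // e.g. 100,000 blocks of 128 MB over a 3-week period, capped at 8 MB/s
                check(100000L, 128L << 20, 21L * 24 * 3600 * 1000, 8L << 20);
              }
            }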

          Raghu Angadi added a comment -

          My preference would also be to make the scan period configurable. Also, I can make the bandwidth used for scanning adaptive.

          In my implementation, there is no 'start' and 'end' of a period. All the blocks are kept sorted by their last verification time. The loop just looks at the first block, and if its last verification time is older than the scan period, it is verified. Each new block is assigned a (pseudo) last verification time of randLong(now - SCAN_PERIOD) so that it gets verified within the scan period.

          So if we want to make the scan bandwidth adaptive, it needs to be changed every time a block is added, removed, or verified by a client (verification by a client comes at zero cost). This is of course doable; will do it.

          > It would make sense to have a reasonable upper bound on the amount of bandwidth used for scanning and emit a warning if this is not enough to examine all blocks in a scan period. So if someone sets a scan period of 1 minute or something else silly, the Datanode doesn't spend all its time scanning.

          Yes. If the datanode is not able to complete verification within the configured period, it will print a warning (no more than once a day).
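
          A condensed sketch of the in-memory bookkeeping described in this comment, assuming blocks are kept ordered by last verification time and new blocks get a random pseudo time within the last scan period; class and method names are illustrative, not the actual DataBlockScanner code.

            import java.util.Random;
            import java.util.TreeMap;

            // Keep blocks ordered by last verification time; new blocks get a random
            // pseudo time in the last SCAN_PERIOD so they are verified within one period.
            public class VerificationQueue {
              private static final long SCAN_PERIOD = 14L * 24 * 3600 * 1000;   // two weeks
              // last verification time -> block id (assumes distinct times for simplicity)
              private final TreeMap<Long, Long> byTime = new TreeMap<Long, Long>();
              private final Random rand = new Random();

              public synchronized void addNewBlock(long blockId) {
                long now = System.currentTimeMillis();
                long pseudoTime = now - (long) (rand.nextDouble() * SCAN_PERIOD);
                byTime.put(pseudoTime, blockId);
              }

              /** Returns the next block due for verification, or null if none is due yet. */
              public synchronized Long nextDueBlock() {
                if (byTime.isEmpty()) return null;
                long oldest = byTime.firstKey();
                if (System.currentTimeMillis() - oldest < SCAN_PERIOD) return null;
                return byTime.remove(oldest);
              }

              /** Record a successful verification (periodic scan or a full client read). */
              public synchronized void markVerified(long blockId) {
                byTime.put(System.currentTimeMillis(), blockId);
              }
            }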

          Sameer Paranjpye added a comment -

          > So if we want to make scan b/w adaptive, it needs to be changed every time a new block is added or removed,

          I'm not suggesting that we change the b/w for the current window. Just schedule new blocks to the next scan period.

          Raghu Angadi added a comment - - edited

          I am preparing a patch for this Jira. I am thinking of disabling this on Windows for now. This is because, as part of this feature, we want to modify the block metadata file. Until now, once a file is written on the Datanode, it is never modified. This issue needs to be fixed for appends anyway, I think. The reasons why this is a pain on Windows (a small sketch of one possible workaround follows the list):

          1. A file may not be opened for writing if some other thread is reading from it (I need to verify this).
          2. file.renameTo(existingFile) is not allowed. This is simpler to handle.
          3. fileOpenForRead.renameTo(newFile) is not allowed. This is harder to fix since Datanode does not keep track of which files are being read, etc. To replace a file, we need to wait till all the readers are done.
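
          A minimal sketch of one possible workaround for item 2 above, using the old Java File API: delete the existing target before renaming, with a short retry loop in case a reader still holds the file open. This is illustrative only, not the approach the patch actually takes.

            import java.io.File;
            import java.io.IOException;

            // Replace an existing file on platforms where renameTo() over an existing
            // target fails (e.g. Windows): delete the target first, retrying briefly.
            public class ReplaceFile {
              public static void replace(File tmp, File target) throws IOException {
                for (int attempt = 0; attempt < 5; attempt++) {
                  if ((!target.exists() || target.delete()) && tmp.renameTo(target)) {
                    return;                                     // replaced successfully
                  }
                  try { Thread.sleep(100); }                    // wait for readers to finish
                  catch (InterruptedException e) { Thread.currentThread().interrupt(); }
                }
                throw new IOException("Could not replace " + target + " with " + tmp);
              }
            }
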
          eric baldeschwieler added a comment -

          I'd really rather see us get this right for supported platforms rather than declare an issue done when we know it does not work on some platforms.

          This is particularly vexing when design choices could clearly be made that would avoid these issues.

          -1

          Raghu Angadi added a comment -

          > This is particularly vexing when design choices could clearly be made that would avoid these issues.
          Our initial design did not modify metadata files on the Datanode; that was my preference too. All of this stems from the fact that we are now modifying these files.

          Raghu Angadi added a comment -

          For Windows, I will make it work most of the time; very rarely some updates to the last verification time might fail, and that's ok. I will see how much work this will take.

          Another option, of course, is to fix it properly and make it work equally well everywhere.

          Konstantin Shvachko added a comment -

          The question here is whether we would go with our current decision if we knew it would not be supported on Windows.
          If we let the balancer write log-type data (verified block #s) into a special file "balancer.log" instead of modifying meta-data files, will that be a problem? Looks like Eric already had a proposal of scanning blocks in a predetermined order. Should we reconsider this?

          Raghu Angadi added a comment -

          > The question here is whether we would go with our current decision if we knew it would not be supported on Windows.
          A related concern is whether this is required for appends anyway.

          dhruba borthakur added a comment -

          For appends, a reader can be reading the data file and meta file while the writer is still writing to them. This is supported on Linux as well as Windows.

          Raghu Angadi added a comment -

          > This is supported on Linux as well as windows.
          Can we use that code here? I'm guessing it handles upgraded directories also...

          Raghu Angadi added a comment -

          The latest patch, FWIW, is attached. It does the following:

          • Verifies blocks and adjusts the read bandwidth between 1-8 MBps in order to complete the verification within a given period. The period is configurable. Its progress can be tracked on the page '/blockScannerReport'.
          • When a client reads a complete block and the checksum succeeds, it informs the datanode. The datanode considers that a verification of the data.
          • The last verification time is stored in the block metadata file (this file contains the CRCs). This itself includes quite a few changes:
            • BlockMetadata.java handles metadata-related operations: reading and writing headers, handling versions, upgrading to new versions, etc.
            • It is now simpler to add new fields to the metadata.
            • It handles file modifications when the file is linked from a backup (after an upgrade) (not yet on Windows).

          I will try to make another version of the patch where the last verification time is stored in a separate text file.

          Raghu Angadi added a comment -

          The next iteration of the patch stores the verification times in a log file (human-readable text).
          The above comment is still valid, except for the third bullet.

          The log file is managed in the following manner:

          • At any time there are up to two files in the top-level datanode directory: dncp_block_verification.log.curr and dncp_block_verification.log.prev.
          • New verification times are appended to the current file. Once it has 5*num_of_blocks lines, it will be rolled (i.e. prev will be replaced by curr).
          • The log file is managed by the new static class LogFileHandler in DataBlockScanner. It also provides an iterator.
          • Each line is about 80 bytes and looks like:
             date="2008-01-08 01:11:11,674"   time="1199754671674"    id="3144315593631418455"
            This is easily extendable; "date" is present just for readability and can be removed. (A small sketch of writing and parsing these lines follows this comment.)
          • During upgrade these files are copied, as opposed to blocks, which are hard-linked. DataStorage.java is modified to copy any file that starts with "dncp_".

          A review of this patch is much appreciated. I will add a test case in the meantime.
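
          A small sketch of writing and parsing log lines in the format quoted above; the parsing is simplified and these are not the actual LogFileHandler methods.

            import java.text.SimpleDateFormat;
            import java.util.Date;
            import java.util.regex.Matcher;
            import java.util.regex.Pattern;

            // Write and parse dncp_block_verification.log style entries.
            public class VerificationLogEntry {
              private static final SimpleDateFormat DATE_FMT =
                  new SimpleDateFormat("yyyy-MM-dd HH:mm:ss,SSS");
              private static final Pattern LINE =
                  Pattern.compile("time=\"(\\d+)\"\\s+id=\"(-?\\d+)\"");

              /** Format one log line; "date" is redundant and kept only for readability. */
              public static String format(long verificationTimeMs, long blockId) {
                return "date=\"" + DATE_FMT.format(new Date(verificationTimeMs)) + "\"\t"
                    + "time=\"" + verificationTimeMs + "\"\t"
                    + "id=\"" + blockId + "\"";
              }

              /** Parse a line back into {timeMs, blockId}; returns null if malformed. */
              public static long[] parse(String line) {
                Matcher m = LINE.matcher(line);
                if (!m.find()) return null;
                return new long[] { Long.parseLong(m.group(1)), Long.parseLong(m.group(2)) };
              }

              public static void main(String[] args) {
                String line = format(System.currentTimeMillis(), 3144315593631418455L);
                long[] parsed = parse(line);
                System.out.println(line + " -> time=" + parsed[0] + " id=" + parsed[1]);
              }
            }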

          Raghu Angadi added a comment -

          The patch is updated with a unit test and a couple of minor fixes. NullOutputStream is moved to IOUtils.

          Raghu Angadi added a comment -

          Oops, the previous patch was missing the TestBlockVerification.java file.

          Raghu Angadi added a comment -

          The updated patch disables this feature for a few more upgrade tests that expect the datanode directory not to change even after the datanode starts.

          dhruba borthakur added a comment -

          Code looks good. +1.

          Some issues related to the design of this patch:

          1. The DFSClient makes an additional RPC to the namenode after it has received and verified the block contents. It might be a good idea to run a DFSIO benchmark to validate that it does not impact read performance.

          2. This patch adds additional Simon metrics to record block validation statistics. The Simon config file for existing clusters might need to be updated to view these new metrics.

          3. The additional data structure that maps Blocks to BlockInfo may be merged with existing blocksmap in FSDataset.java.

          Raghu Angadi added a comment -

          Thanks Dhruba.

          Regarding 1): It's not an extra RPC to the Namenode. When a client receives the full block (say 64MB), it writes back 2 bytes to the Datanode on the same connection to say that the checksum was ok. So the overhead is just 2 bytes. I think it is probably not required to run benchmarks before committing this, but we certainly can; I will talk to Mukund.

          2) Yes, we should add the new stats to the Simon config.

          3) Well noted. Right now periodic verification is quite separate; merging the BlockMap in FSDataset with the one here would take quite a few code changes (not complicated, but they might look messy). We can certainly merge these when we want to reduce memory. It should save around 64 bytes per block on a 64-bit JVM.

          Raghu Angadi added a comment -

          The updated patch has one minor change: rename TestBlockVerification.java to TestDatanodeBlockVerification.java.

          Note to committers: there are two files added: dfs/DataBlockScanner.java and test/../dfs/DataBlockScanner.java.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12373186/HADOOP-2012.patch
          against trunk revision r612314.

          @author +1. The patch does not contain any @author tags.

          javadoc -1. The javadoc tool appears to have generated messages.

          javac +1. The applied patch does not generate any new compiler warnings.

          findbugs -1. The patch appears to introduce 3 new Findbugs warnings.

          core tests +1. The patch passed core unit tests.

          contrib tests +1. The patch passed contrib unit tests.

          Test results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1606/testReport/
          Findbugs warnings: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1606/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1606/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1606/console

          This message is automatically generated.

          Raghu Angadi added a comment -

          New findbugs warnings are at http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1606/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html

          • The first one does not let me ignore an IOException. Is this a new restriction?
          • The second one is ok.
          • For the third one, I had to add pseudo accesses just to tell findbugs that we know what we are doing. See comment at DataBlockScanner.java:458

          Javadoc is also fixed. The warning does not appear with Java 6, only with Java 1.5.

          Raghu Angadi added a comment -

          A slightly different fix for the 3rd findbugs warning above.

          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12373305/HADOOP-2012.patch
          against trunk revision r612561.

          @author +1. The patch does not contain any @author tags.

          javadoc +1. The javadoc tool did not generate any warning messages.

          javac +1. The applied patch does not generate any new compiler warnings.

          findbugs +1. The patch does not introduce any new Findbugs warnings.

          core tests +1. The patch passed core unit tests.

          contrib tests +1. The patch passed contrib unit tests.

          Test results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1614/testReport/
          Findbugs warnings: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1614/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1614/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1614/console

          This message is automatically generated.

          Raghu Angadi added a comment -

          The patch blessed by Hudson had an extra comment I forgot to remove. Please commit the patch attached (the latest). I don't think there is a need for another Hudson run; the only difference is a comment.

          dhruba borthakur added a comment -

          I just committed this. Thanks Raghu!


            People

            • Assignee:
              Raghu Angadi
              Reporter:
              Raghu Angadi
            • Votes:
              0
              Watchers:
              1

              Dates

              • Created:
                Updated:
                Resolved:

                Development