Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.14.0
    • Component/s: None
    • Labels: None

      Description

      Currently CRCs are handled at the FileSystem level and are transparent to core HDFS. See the recent improvement HADOOP-928 (which can add checksums to a given filesystem) for more about it. Though this has served us well, there are a few disadvantages:

      1) This doubles the namespace in HDFS (or other filesystem implementations). In many cases, it nearly doubles the number of blocks. Taking the namenode out of CRCs would nearly double namespace performance, both in terms of CPU and memory.

      2) Since CRCs are transparent to HDFS, it cannot actively detect corrupted blocks. With block-level CRCs, the Datanode can periodically verify the checksums and report corruptions to the namenode so that new replicas can be created.

      We propose to have CRCs maintained for all HDFS data in much the same way as in GFS. I will update the jira with detailed requirements and design. This will provide the same guarantees as the current implementation and will include an upgrade of existing data.

      1. BlockCrcFeatureTestPlan.pdf
        19 kB
        Raghu Angadi
      2. DfsBlockCrcDesign.htm
        21 kB
        Raghu Angadi
      3. HADOOP-1134-03.patch
        203 kB
        Raghu Angadi
      4. HADOOP-1134-02.patch
        203 kB
        Raghu Angadi
      5. HADOOP-1134-01.patch
        189 kB
        Raghu Angadi
      6. BlockLevelCrc-07122007.patch
        193 kB
        Raghu Angadi
      7. BlockLevelCrc-07102007.patch
        186 kB
        Raghu Angadi
      8. BlockLevelCrc-07062007.patch
        177 kB
        Raghu Angadi
      9. BlockLevelCrc-07052007.patch
        178 kB
        Raghu Angadi
      10. BlockLevelCrc-07032007.patch
        168 kB
        Raghu Angadi
      11. readBuffer.java
        1.0 kB
        Raghu Angadi

        Issue Links

          Activity

          Doug Cutting added a comment -

          Sub-block checksums are useful, as they permit efficient, checksummed random-access without scanning the entire block.

          Sameer Paranjpye added a comment -

          +1 on sub-block checksums. HDFS ought to be able to support small random reads; we could go to larger blocks than 512 bytes, but not by very much. We should seriously consider storing checksums inline with block data. This makes the upgrade harder, but it enables us to get data with just 1 seek vs 2 if the checksums are stored in a separate file.

          Raghu Angadi added a comment -

          This is surely going to be sub-block, for every 64k bytes or so. "Block level CRCs" is probably a misleading title; suggestions are welcome.

          Inline CRCs sound good to me:

          1) Not sure of any disadvantages of storing them inline, since how blocks are stored is totally internal to Datanodes.

          2) It does save a seek if we are reading small chunks of data. In many cases we would be reading many megabytes serially, so the seeks saved may not be great, but every seek saved helps.

          3) It saves a file open and close as well for each block accessed.

          4) It also matches how I was thinking of sending CRCs to clients and peers (inline on the same data connection instead of on a separate channel).

          Raghu Angadi added a comment -

          Regarding Datanode upgrade: online or offline?

          Offline: When datanodes restart with the new version, they essentially go offline for a couple of hours (100+ GB) and come back up with new and shiny blocks.
          Pros:
          1) Simpler code. The upgrade code does not need to be maintained in future versions. Users with a very old dfs could upgrade iteratively.
          2) No change in block filenames or other maintenance is required.
          Cons:
          1) The cluster would be inaccessible for a couple of hours, which implies a lot of one-time work, especially for admins.
          2) Upgrading to future versions from the current version requires multiple upgrades.

          Online: Datanodes would immediately start serving existing data after restart and upgrade the blocks in the background.
          Pros:
          1) No extra downtime.
          Cons:
          1) Increases the code, and this extra code would be spread around the datanode code.
          2) The code to handle mixed data may not be removed anytime in the near future.

          My (selfish) preference is to upgrade offline. It does not seem unreasonable before the 1.0 release.

          Doug Cutting added a comment -

          I like the offline approach: smaller, simpler code should be more reliable and easier to maintain long-term, which outweighs a one-time downtime.

          It'd be nice to not have to rewrite all block files, which argues for non-interleaved checksums. Ideally we could even reuse existing checksum files, so that all blocks would not need to be read. Then the upgrade would primarily consist of migrating checksum data from HDFS to local files on the datanode beside each block file. That should be pretty fast.

          Owen O'Malley added a comment -

          I think the inline crcs are too problematic. They will add a mapping between logical and physical offsets into the block that will hit a fair amount of code. If the side file is opened with a 4k buffer, it will only take 2 reads of the side file to handle the entire block (assuming 4B CRC/64KB and 128MB blocks). It also is much much easier to handle upgrade.
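
          A quick back-of-envelope check of the sizing above, assuming the figures quoted here (4-byte CRC per 64 KB chunk, 128 MB blocks, 4k read buffer); this is only an illustrative sketch, not code from any attached patch:

              // Sanity-check the side-file sizing: 128 MB / 64 KB = 2048 chunks,
              // 2048 * 4 B = 8 KB of CRC data, i.e. 2 reads with a 4 KB buffer.
              public class CrcSideFileMath {
                public static void main(String[] args) {
                  long blockSize = 128L * 1024 * 1024;   // 128 MB block (assumed)
                  int bytesPerChecksum = 64 * 1024;      // 64 KB per CRC (assumed)
                  int checksumSize = 4;                  // CRC32 is 4 bytes

                  long chunks = blockSize / bytesPerChecksum;          // 2048 chunks
                  long sideFileBytes = chunks * checksumSize;          // 8 KB of CRCs
                  long readsWith4kBuffer = (sideFileBytes + 4095) / 4096;

                  System.out.println(chunks + " chunks, " + sideFileBytes
                      + " bytes of CRCs, " + readsWith4kBuffer + " reads with a 4k buffer");
                }
              }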

          Doug Cutting added a comment -

          I'd also vote for using a format like the existing CRC files, with a header that includes a version and bytes/crc. That way we can switch algorithms to CRC64 or MD5, and also perhaps reuse existing CRCs, even if we change the default bytes/crc for new files.

          Sameer Paranjpye added a comment -

          +1 for offline upgrades.

          >> Owen O'Malley [20/Mar/07 12:59 PM] I think the inline crcs are too problematic. They will add a mapping between logical and physical offsets into the block that will hit a
          >> fair amount of code. If the side file is opened with a 4k buffer, it will only take 2 reads of the side file to handle the entire block (assuming 4B CRC/64KB and 128MB
          >> blocks). It also is much much easier to handle upgrade.

          It takes only 2 reads to handle the entire block which is good. But it takes those same 2 reads to handle a tiny fraction of the block as well, which is where the downside appears. It's quite clear that doing inline checksums makes the upgrade process a lot harder. The question is whether or not taking the hit of a difficult upgrade and complicating the data access code is a reasonable price to pay for halving the number of seeks in the system for good. It feels like it is, thoughts?

          Raghu Angadi added a comment -

          I don't think using existing CRCs helps much. First, it will increase the upgrade code by quite a bit:
          1) The Datanode needs to contact the namenode to fetch the filename for the block.
          2) Then it needs to get the blocks for the CRC file.
          On top of this, during this upgrade most datanodes are down and the namenode does not know where blocks are located.

          Raghu Angadi added a comment -

          I do agree that a separate checksum file looks 'cleaner'. Also, when we combine the 'Version' upgrade (HADOOP-702) and the offline CRC upgrade, datanodes would need to be able to store each block twice if we want inline CRCs. This might be unacceptable in practice.

          > It's quite clear that doing inline checksums makes the upgrade process a lot harder.
          I am not sure inline CRCs increase upgrade complexity. Surely the upgrade will take less time.. but it would be more like 1 hour instead of 2-3 hours, which is not a big issue.

          Doug Cutting added a comment -

          Another thing to think about is reverting if the upgrade doesn't work. If the upgrade purely adds new files next to block files then reversion is easy until you remove the old CRC files. So the removal of the old CRC files should probably be a separate step, only performed after the rest of the upgrade is shown to be satisfactory.

          > I don't think using existing CRCs helps much.

          I suspect it would greatly speed the upgrade. Yes, the filesystem would need to be brought up in a read-only mode so that the old CRC files could be read. But note that the old CRCs were computed on the client as the data was created (as should be new CRC files). If a block has been corrupted, simply CRCing its data on the datanode would hide that. So the old CRCs are what we want for correctness too.

          Raghu Angadi added a comment -

          > Another thing to think about is reverting if the upgrade doesn't work.

          The version upgrade feature lets us go back to the old state even if we removed the old CRC files. Taking advantage of version upgrade would require that we not modify block files (i.e. no inline CRCs; otherwise the DN needs to be able to store two copies). As such, I hadn't planned on removing old CRC files as part of the upgrade. We could use an external script to delete all .crc files after the system has been running well.

          Raghu Angadi added a comment -

          >> I don't think using existing CRCs helps much.
          > I suspect it would greatly speed the upgrade.

          Do you mean upgrade would not actually read the blocks if we are using old CRC?

          Sameer Paranjpye added a comment -

          If we did CRCs for 32 or 64k blocks we could keep the CRCs in side files and cache them in RAM on the Datanodes with only a small amount of overhead. If we did a CRC for every 64k then a 128MB block would have a 4k CRC file. With as many as 3000 such blocks on a node (3TB) we'd only have 24MB of CRC data which could easily be kept in RAM. This would let us work with side files and eliminate seeks as well.

          Raghu Angadi added a comment -

          Good idea. If required we could limit the amount of memory used for this and treat it as an LRU cache.

          Raghu Angadi added a comment -

          Correction: it is 64k of memory per GB => 192 MB for 3 TB.
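
          For reference, the corrected arithmetic (assuming a 4-byte CRC32 per 64 KB chunk, as above): 1 GB / 64 KB = 16,384 chunks, and 16,384 x 4 B = 64 KB of CRC data per GB of block data; for 3 TB that is 3072 GB x 64 KB = 192 MB.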

          Doug Cutting added a comment -

          > Do you mean upgrade would not actually read the blocks if we are using old CRC?

          Yes. And, as I noted above, strictly speaking, this is required for correctness. We cannot simply checksum blocks, assuming they're not corrupt. The existing checksum data was created on the client and is much more trustworthy. So if we're going to re-compute checksums we should first validate the data against the old checksums. But it might be easier to simply re-use the existing checksums.

          I'm all for moving to something like 64k bytes/checksum for new files (and old files, if they're validated), although we ought to benchmark the cost of transferring and checksumming 64k before we do this as that will be added to the cost of a seek. We should test that seek performance does not significantly suffer. Note that the entire 64k chunk must be transferred to the client for checksumming, so the added cost per seek is not just computation and disk time, but network bandwidth too.

          Raghu Angadi added a comment -

          Correctness is a big advantage of using existing CRCs. Still, we could choose to use 64k checksums during the upgrade, which implies reading all of the blocks and an associated increase in downtime. Whether to recreate checksums or not depends on how far off 512 byte CRCs are from our (gu)estimated optimal value. Another option to get more trust after the upgrade is to compare checksums of replicas and choose the majority. For the current design we can assume we compare with the old CRCs.

          If a client asks for 2k of data, we are saying we should send the 64k (or 128k) where the 2k is located to the client. Why not send only the 2k and send a newly calculated CRC for the 2k? The Datanode still verifies all the 64k chunks involved in the read, but recalculates the CRC for just the bytes sent. One argument against this is that it is a weaker guarantee than sending full chunks for verification. But the datanode will calculate the new CRC right next to where the on-disk CRC is verified, thus minimizing some other corruptions. I feel this compromise is probably worth it, but I guess many will disagree. Requiring whole chunks will also increase overhead when we support appends in the future.

          How do we benchmark for a good CRC-chunk size? It heavily depends on the workload. I will find out more about a typical MapReduce load.
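
          A rough sketch of the idea above (verify the stored CRC for the full chunk that was read, then send a freshly computed CRC for just the requested bytes), using java.util.zip.CRC32; the names and error handling are hypothetical, not taken from any patch here:

              import java.util.zip.CRC32;

              class RangeChecksum {
                /** Verify a full checksum chunk, then return a fresh CRC over a sub-range. */
                static long crcForRange(byte[] chunk, long storedCrc,
                                        int rangeOff, int rangeLen) throws java.io.IOException {
                  CRC32 crc = new CRC32();
                  crc.update(chunk, 0, chunk.length);      // verify the whole 64k chunk
                  if (crc.getValue() != storedCrc) {
                    throw new java.io.IOException("Checksum error in on-disk chunk");
                  }
                  crc.reset();
                  crc.update(chunk, rangeOff, rangeLen);   // re-checksum just the bytes sent
                  return crc.getValue();
                }
              }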

          Doug Cutting added a comment -

          > Why not send only 2k and send newly calculated CRC for 2k?

          Perhaps that could work. So we'd still need to transfer 64k off disk and checksum it, but, once it's validated, we could re-checksum the 2k we send. It should be re-checksummed before it's validated, so that the re-checksummed data is guaranteed valid. On the other hand, the simplicity of the end-to-end checksum makes it more certain that we've implemented things correctly and will properly detect corruptions. +0

          > How do we benchmark for good CRC-chunk size?

          Can we push that to a separate issue? This issue is about removing checksum files from the HDFS namespace. We might then optimize things in other ways later.

          Doug Cutting added a comment -

          To elaborate on my last comment: I would prefer that this issue just move CRC files to side files on the datanode. The default bytes/crc should not be changed, no crc cache should be added, etc. Those potential improvements should be evaluated separately and individually afterwards. The datanode's protocols will need to change in order to receive and transmit checksums along with file data, but, other than that, we should leave things as-is. Ideally we'll be able to share much of the checksum code that's currently used on the client.

          Raghu Angadi added a comment -

          >> How do we benchmark for good CRC-chunk size?
          > Can we push that to a separate issue? This issue is about removing checksum files from the HDFS namespace. We might then optimize things in other ways later.

          +1. The CRC-chunk size should be configurable for new blocks, and the client does not need to know a priori what the chunk size is.

          Doug Cutting added a comment -

          > The CRC-chunk size should be configurable for new blocks, and the client does not need to know a priori what the chunk size is.

          I'm not sure what "CRC-chunk" is, but I see no reason to add new config parameters right off. We can continue to use io.bytes.per.checksum and perhaps throw an exception if dfs.block.size is not an even multiple of this. If it later makes sense to increase the default bytes/checksum for DFS, it may also make sense to increase the default for all other filesystems too, continuing to use the same parameter, or it may make sense to split this to a new parameter, but, for now, I see no reason to add a new parameter.
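
          A sketch of the validation suggested above; it assumes the usual Configuration getters, and the default values shown are only illustrative:

              import org.apache.hadoop.conf.Configuration;
              import java.io.IOException;

              class ChecksumConfigCheck {
                static int getBytesPerChecksum(Configuration conf) throws IOException {
                  int bytesPerChecksum = conf.getInt("io.bytes.per.checksum", 512);
                  long blockSize = conf.getLong("dfs.block.size", 64L * 1024 * 1024);
                  if (blockSize % bytesPerChecksum != 0) {
                    throw new IOException("io.bytes.per.checksum (" + bytesPerChecksum
                        + ") does not evenly divide dfs.block.size (" + blockSize + ")");
                  }
                  return bytesPerChecksum;
                }
              }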

          Raghu Angadi added a comment -

          My mistake, yes, I had io.bytes.per.checksum in mind. It will be read by DN and we don't need a new config var.

          Raghu Angadi added a comment -

          Also, DFSClient will no longer use io.bytes.per.checksum. The value depends on the particular block being read or written. If DN1 is sending a block to DN2, then DN1 decides what the value is.

          Doug Cutting added a comment -

          I'm confused. Aren't checksums computed in the client, then passed to datanodes? So only the client reads io.bytes.per.checksum and dfs.block.size, not the datanode. Or am I missing something?

          Raghu Angadi added a comment -

          The client still calculates checksums when it writes. Is it required for the client to decide what the size of the checksum block (i.e. CRC-chunk) should be? I thought that since the DN has complete control of how checksums are done, it could as well decide the various parameters for them. But functionally they are the same. One advantage of the client deciding is that it need not contact the DN until a full DFS block is written (as it does now). But if it starts streaming to the DN immediately, then either one could decide. I am fine either way.

          Raghu Angadi added a comment -

          I will summarize the various points discussed here later today.

          Regarding informing the namenode (NN) about corrupted blocks:

          1) Initially we will do the same thing DFSClient does now: tell the NN to delete DN:Block. The NN schedules it for deletion if there is more than one replica. Ideally we want the new replica to be created before deleting the known corrupted block.

          2) Initially the DN detects corrupted blocks when they are read for any reason. Pretty soon we will add a periodic scan of blocks on the datanode. The periodic checker should throttle itself based on how busy the disks are.

          3) Later, if required, we could add an option such that DFSClient can read even if the checksum fails (with the failure promptly noted). For now, it will always be an error.

          Doug Cutting added a comment -

          > I thought that since the DN has complete control of how checksums are done, it could as well decide the various parameters for them.

          The datanode primarily only has control of how checksums are stored. Checksums are created and consumed primarily by the client. So my instinct would be to have the client determine checksum creation parameters. This is also simpler, as you observed.

          > Initially DN detects corrupted blocks when they are read for any reason.

          To be clear: I think the first priority should be to detect corruption in the client. Then, only subsequently might we add checking on the datanode itself. Checks on the datanode might only happen when a client-side check fails, to determine whether the problem is on the disk or happened in transit, and perhaps during periodic scans. The most important check however is the client's check.

          Raghu Angadi added a comment -

          I think I need a little bit more clarification. There are multiple tasks:

          Client:
          --------
          a) It needs confidence that DFS stores the data it writes and gives back the same data when it reads.
          1) For the client to be absolutely sure of this, we need the current .crc files, i.e. the client assumes the responsibility of maintaining checksums.
          2) The current proposal slightly relaxes this and expects the client to trust DFS to maintain the data to the best of its ability, while the client tries its best to verify it. I.e., it trusts that the DN verifies the checksum that the client calculates while writing, maintains it after that, etc.

          If (2) is not acceptable, then probably we don't need block level checksums.

          Raghu Angadi added a comment -

          > To be clear: I think the first priority should be to detect corruption in the client.

          I agree.. absolutely. I am curious whether any part of the discussion implied otherwise.. this will let me correct the mistake.

          Doug Cutting added a comment -

          > current proposal slightly relaxes this and expects the clients to trust DFS to maintain data

          The client is a part of DFS. The primary change here should be how checksums are stored. Instead of storing complete checksums for the entire file in a FileSystem-based file, we store checksums per block on the datanodes. But the checksums should still be computed in the client as data is written, and validated in the client as it's read. Instead of writing a parallel file, the client will send to the datanode with each block its checksums, and, when reading, the client will receive a checksum with each range of data bytes. Reads should always be aligned with checksum boundaries. Since the bytes/checksum of a file may differ from the client's configuration, each read should probably return something like <start, len, checksum, data>, where the start is the position in the block of the data (at or before the requested position), length is the length of the data, and the checksum is the checksum for the data.

          So this issue should not alter what's checksummed, when it's checksummed, where it's checksummed, etc., but only where and how the checksums are stored. Right?
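
          As a rough illustration of such a read response, one possible <start, len, checksum, data> record could be deserialized like the sketch below; the field names, types and ordering are hypothetical, not the protocol of the attached patches:

              import java.io.DataInputStream;
              import java.io.IOException;

              class ChecksummedChunk {
                long startInBlock;   // offset of this data in the block (<= requested offset)
                int length;          // number of data bytes that follow
                long checksum;       // CRC32 of the data bytes
                byte[] data;

                static ChecksummedChunk read(DataInputStream in) throws IOException {
                  ChecksummedChunk c = new ChecksummedChunk();
                  c.startInBlock = in.readLong();
                  c.length = in.readInt();
                  c.checksum = in.readLong();
                  c.data = new byte[c.length];
                  in.readFully(c.data);
                  return c;
                }
              }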

          Raghu Angadi added a comment -

          I think the thinner and more detached the client is, the better in the long term. DFSClient is not part of the DFS cluster, because it is not expected to be up and running when the cluster is running. The tighter the coupling, the more difficult it will be to implement new features in the future.

          I don't think the issue is just about where the checksums are stored. I interpreted it as: DFS (excluding the DFSClient) will be aware of and in complete control of how data is stored and how its integrity is maintained. To me it does not sound right if DFS can't change how its data is maintained in the future. Of course DFS still needs to be able to serve the data it is supposed to serve.. that's its fundamental duty.

          Just my thoughts.. I will ask more issue-specific questions soon.

          Raghu Angadi added a comment -

          One case where we don't want the client to delete data: if corruption happens on the network (and escapes TCP checks), or a client has a bug, it would end up deleting a perfectly good block.

          Raghu Angadi added a comment -

          Doug,

          I am not sure what exactly we are not agreeing on. To be specific, do you agree with the following:

          1) The client checksums the data to write and sends both data and checksum to DFS.
          2) It does not really care how it is stored etc. When it reads, it gets back data with an associated checksum.
          3) How the data is checksummed at the client while reading could be different from how the data is checksummed while writing.

          Raghu Angadi added a comment -

          Regarding which CRCs to use during the upgrade to block CRCs:

          1) Ideal case: use the existing CRC files.
          2) Reasonable (I think, given the not-so-large deployments): calculate the on-disk CRC and verify it with the other datanodes. If some checksum does not match with two others, then it is considered corrupt.

          I think (2) is easier in terms of one-time code for upgrading. If (1) is required, then I don't mind trying to do it.

          Doug Cutting added a comment -

          > I am not sure what exactly we are not agreeing on.

          I'm not sure whether we are disagreeing. I'm just trying to clarify things that sound ambiguous and limit the scope of this issue. You've implied that checksums may be computed on datanodes, and I don't think they ought to be, at least not by the patch for this issue. That might be a subsequent optimization.

          > How the data is checksummed at the client while reading could be different from how the data is checksummed while writing.

          I'd vote against that for this issue. For this issue I think we ought to focus on changing where checksums are stored, but keep the end-to-end checksum story simple. The checksums we compute & check should be the same before and after this. Subsequently we may optimize and further complicate things.

          > If some checksum does not match with two others, then it is considered corrupt.

          That would not catch corruptions that happened before the first copy is received. I'm not fully comfortable with this shortcut.

          Raghu Angadi added a comment -

          Thanks, Doug.

          >> How the data is checksummed at the client while reading could be different from how the data is checksummed while writing.
          > I'd vote against that for this issue.

          Do you agree with the policy in principle? We may not do that as part of this issue.

          How the overall interaction will look in the future will affect how we organize our code for this issue, even if we limit the changes. I will list the various things we want or intend to do, and we can pick the required ones as part of this issue.

          >> If some checksum does not match with two others, then it is considered corrupt.
          > That would not catch corruptions that happened before the first copy is received. I'm not fully comfortable with this shortcut.

          True. I will see how bad getting old CRCs would be...

          Doug Cutting added a comment -

          > Do you agree with the policy in principle?

          I don't oppose it in principle, but might in practice. I think it should only be pursued if we find that it offers significant performance improvements, sufficient to offset the increased complexity and consequent risk of introducing checksum validation logic errors. We can benchmark this after this issue is committed to find out whether it is warranted.

          > How the overall interaction will look in the future will affect how we organize our code for this issue

          True. So we should try to architect the protocol with future flexibility in mind. But this is a private protocol, and there is still only a single client, so revising the protocol in the future should also not be that difficult, so we needn't prepare for every possibility now. I think the protocol I proposed above, where read requests return something like <start, length, checksum, data> records, would permit checksums to be re-generated at read-time or not, no?

          Raghu Angadi added a comment -

          Outline of requirements and decisions made:
          -------------------------------------------------------------

          I will include more details on some of these points as the implementation or discussion continues. I think I can start implementing some of the parts.

          Here DFS refers to the Namenode (NN) and Datanodes (DNs), and client refers to the DFSClient.

          1) Checksums are maintained end to end and kept on datanodes as metadata for each block file. A 4-byte CRC32 is calculated for a configurable amount of sub-block data. The default is a 4-byte CRC for each 64KB of block data.

          2) The DN keeps the checksums for each block as a separate file (e.g. blk_<id>.crc) that includes a header at the front (a sketch of such a header appears after this outline).

          3) Checksums are calculated by the client while writing and passed on to the DN. The DN verifies them before writing to disk. The DN also verifies the checksum each time it reads block data to serve to a client, and the client verifies it as well.

          4) The data transfer protocol between the client and datanodes includes inline checksums transmitted along with the data on the same TCP connection. The client reads from a different replica when a checksum from a DN fails.

          5) When a DN notices a checksum failure, it informs the namenode. The namenode will treat this as a deleted block in the initial implementation. A later improvement will delay deletion of the block until a new valid replica is created.

          6) The DistributedFileSystem class will not extend ChecksumFileSystem, since checksums will be integral to DFS. We could have a ChecksumDistributedFileSystem if we need user-visible checksums.

          7) Upgrade: when DFS is upgraded to this new version, the DFS cluster will be in safemode until all (or most of) the datanodes have upgraded their local files with checksums. This process is expected to last a couple of hours.

          8) Currently each DFS file has an associated checksum stored in a ".crc" file. During the upgrade, datanodes fetch the relevant parts of the .crc files to verify checksums for each block. Since this involves interaction with the namenode, the namenode could be busy or even a bottleneck for the upgrade.

          9) We haven't decided how and when to delete the .crc files. They could be deleted by a shell script as well.

          Future enhancements:
          ---------------------

          1) Benchmark the CPU and I/O overhead of checksums on Datanodes. Most of the CPU overhead can be hidden by overlapping with network and disk I/O. Tests showed that java CRC32 takes 5-6 microseconds for each 1MB of data. Because of the overlap I don't expect any noticeable increase in latency due to CPU. Disk I/O overhead might contribute more to latency.

          2) Based on benchmark tests, investigate whether an in-memory cache can be used for CRCs.

          3) Datanodes should periodically scan and verify the checksums for their blocks.

          4) Option to change the CRC-chunk size. E.g., during the upgrade datanodes keep a 4-byte CRC for every 512 bytes, since the ".crc" files used 512-byte chunks. We might want to convert them to 4 bytes for every 64K bytes of data.

          5) The namenode should delay deletion of a corrupted block until a new replica is created.
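
          A minimal sketch of the kind of side-file header discussed in point 2 above (a version plus bytes-per-checksum, followed by raw CRC32 values); the field layout, checksum-type code and file name here are illustrative assumptions, not the format of the committed patch:

              import java.io.DataOutputStream;
              import java.io.FileOutputStream;
              import java.io.IOException;

              class BlockCrcSideFile {
                static final short VERSION = 1;

                static void writeHeader(DataOutputStream out, int bytesPerChecksum)
                    throws IOException {
                  out.writeShort(VERSION);           // format version
                  out.writeByte(1);                  // checksum type: 1 = CRC32 (assumed code)
                  out.writeInt(bytesPerChecksum);    // e.g. 512 during upgrade
                  // ... followed by one 4-byte CRC per bytesPerChecksum bytes of block data
                }

                public static void main(String[] args) throws IOException {
                  try (DataOutputStream out = new DataOutputStream(
                      new FileOutputStream("blk_1234.crc"))) {
                    writeHeader(out, 512);
                  }
                }
              }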

          Doug Cutting added a comment -

          > Default is 4 byte CRC for each 64KB of block data.

          I thought we'd keep this at 512B until we'd benchmarked things?

          > DN verifies the checksum each time it reads the block data

          I thought that only the client would validate, reporting failures to the DN. Why validate twice, since we have to validate in the client anyway?

          > During upgrade, datanodes fetch relevant parts of .crc files to verify checksums for each block.

          Aren't we going to simply copy the existing checksums, rather than re-generate new checksums or verify against existing? That'd be a lot faster.

          Raghu Angadi added a comment -

          > I thought we'd keep this at 512B until we'd benchmarked things?
          64KB is only for new blocks. We will keep the 512 byte checksums during the upgrade.

          > Aren't we going to simply copy the existing checksums, rather than re-generate new checksums or verify against existing? That'd be a lot faster.

You are right, I missed that; I will correct it. We will just read the existing CRCs.

          >> DN verifies the checksum each time it reads the block data
          > I thought that only the client would validate, reporting failures to the DN. Why validate twice, since we have to validate in the client anyway?

I am not sure of this. I personally prefer datanode verification (and general ownership of CRCs). I don't think it adds noticeable latency. I think the CPU overhead would be reasonable.

          Doug Cutting added a comment -

          > 64KB is only for new blocks.

          I thought we'd benchmark before we changed any parameters, no? I'd vote to keep this at 512 until we benchmark.

          > I personally prefer datanode verification (and general ownership of CRCs). I don't think it adds noticeable latency.

          This sounds like another change we should benchmark before we make. This patch should confine itself to a single issue: removing the checksum data from the DFS namespace. Subsequently we should determine the performance impacts of double-checksumming (on both datanode & client), larger checksum windows, etc., before committing such changes.

          Changing the storage of checksum data should alone improve performance, since fewer datanode and namenode accesses are required. It will be useful to see how much this change alone improves things before we start making other optimizations.

          Raghu Angadi added a comment -

Upgrade from previous versions requires quite a bit of interaction with the namenode. This results in quite a bit of upgrade-specific code in both the namenode and datanode. This will add a few new RPCs too. I propose to clearly mark such code as temporary and remove it in next major version. If anyone tries to upgrade from pre block-crc version to post block-crc version, the namenode and datanode will exit with a clear message that DFS first needs to be upgraded to block-crc version.

          Regd, getting rid of .crc files:

          It will be done as part of upgrade. Namenode gets into a special safe mode during upgrade.

          1) It keeps track of all the blocks that belong to non-".crc" files.
          2) Once all the replicas are upgraded, it marks the corresponding .crc file to be deleted at the end of special mode.
          3) At the end of special mode, it deletes all the .crc files that are enqueued.
          4) When everything goes well, there should not be any .crc files left. Namenode prints all the .crc files still left in the system to log.
          5) Users will be advised to delete the extra .crc files manually.

          Another option is to let namenode delete all ".crc" files at the end of upgrade.

During the upgrade, datanodes do not delete the blocks that belong to .crc files. These blocks will be deleted when the cluster resumes normal functionality (during block report, or if the blocks are read by a client). These blocks will not have corresponding crc files on the datanode.

          Also note that Namenode and Datanodes are expected to restart anytime during the upgrade process. So all the code will be written assuming the java process could be killed anytime.

          When is upgrade considered complete?
          ------------------------------------------------------

I propose that the following conditions be met (a rough sketch of this check follows the list):

          1) All the datanodes that registered should report completion of their upgrade. If Namenode restarts, each datanode will re-register and inform again about their completion.
          2) Similar to current safemode, "dfs.safemode.threshold.pct" of the blocks that belong to non .crc files, should have at least "dfs.replication.min" replicas reported upgraded.
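
A small sketch of how this completion test could be expressed, assuming the namenode already tracks, for each block of a non-.crc file, how many replicas have reported an upgraded copy; the map and parameter names are hypothetical, with threshold and minReplication standing in for dfs.safemode.threshold.pct and dfs.replication.min.

import java.util.Map;

public class UpgradeCompletion {
    public static boolean isUpgradeComplete(Map<Long, Integer> upgradedReplicasPerBlock,
                                            int totalBlocks,
                                            float threshold,      // dfs.safemode.threshold.pct
                                            int minReplication,   // dfs.replication.min
                                            boolean allRegisteredDatanodesDone) {
        if (!allRegisteredDatanodesDone) {
            return false;                       // condition (1): every registered datanode reported done
        }
        int satisfied = 0;
        for (int replicas : upgradedReplicasPerBlock.values()) {
            if (replicas >= minReplication) {
                satisfied++;
            }
        }
        // condition (2): enough blocks have the minimum number of upgraded replicas
        return totalBlocks == 0 || ((float) satisfied / totalBlocks) >= threshold;
    }
}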

          Doug Cutting added a comment -

          > I propose to clearly mark such code as temporary and remove it in next major version. If anyone tries to upgrade from pre block-crc version to post block-crc version, the namenode and datanode will exit with a clear message that DFS first needs to be upgraded to block-crc version.

          +1

          The rest also sounds reasonable to me.

          It's worth noting that, if a .crc file cannot be found for a block, then the upgrade should probably generate one (along with a warning). So that, after the upgrade, all blocks should be guaranteed to have CRC files. We still might fail softly when a CRC file is missing, logging a warning and either generating CRCs on the fly or regenerating the CRC file.

          Raghu Angadi added a comment -

          >> 64KB is only for new blocks.
          > I thought we'd benchmark before we changed any parameters, no? I'd vote to keep this at 512 until we benchmark.

I am fine with keeping it at 512; the default config will not change then. All the more motivation for me to run good benchmarks.

          > This sounds like another change we should benchmark before we make.

          I think this is as much a policy decision as it is a performance decision. Do we ever want datanode to ship corrupted data? If the benchmarks don't show any negative effect of double checksum on real clusters, would that be enough?

          If source datanode does not check for errors, how would you propose we handle CRC errors while transferring to another datanode? I guess, destination datanode should report the problem.

          Doug Cutting added a comment -

          > If source datanode does not check for errors, how would you propose we handle CRC errors while transferring to another datanode?

          Initially, as we do today: when a CRC error is detected in the client, it makes a callback to the namenode. Subsequently we can discuss ways to improve this. But I'd rather keep this change as simple as possible. It will already be complicated enough without taking on more challenges.

          Raghu Angadi added a comment -

          > It's worth noting that, if a .crc file cannot be found for a block, then the upgrade should probably generate one (along with a warning).
          > So that, after the upgrade, all blocks should be guaranteed to have CRC files. We still might fail softly when a CRC file is missing,
          > logging a warning and either generating CRCs on the fly or regenerating the CRC file.

I agree. We will generate a local CRC file, with a warning, if we cannot find the ".crc" file data. I think the datanode should fail softly with a warning if a CRC file is missing. I don't think it is necessary to regenerate the crc file if one is missing outside upgrade mode; it should be a normal error condition.

Thanks for the feedback. I am preparing an HTML file that will include all the decisions made till now and later comments for this jira. I will attach the HTML file here.

          Sameer Paranjpye added a comment -

          This is complicated. The general design direction appears to be the right one, but I think some key details need to be spelled out.

          We say we're going to use the default of 512 bytes/chksum. What do we do for HDFS installations that use a different value in configuration? What do we do for installations that have used more than one value of bytes/chksum when generating data?

          One option is to simply use the existing checksum data, as is, with the understanding that we could end up with different values of bytes/chksum across HDFS installations and across different files in the same installation. The alternative would be to re-generate checksum data on the Datanodes with 512 bytes/chksum and validate the new checksums against the existing data.

          Do we keep the io.bytes.per.checksum configuration variable or do we kill it?

          Do we simply copy existing checksum data or do we re-generate it?

          I don't think simply copying checksum data is enough since the checksums can themselves be corrupt. We need some level of validation. We can compare copies of the checksum data against each other, if we find a majority of copies that match then we treat those as authoritative. But what happens when we don't find a majority? Or we can re-generate checksum data on the Datanode and validate it against the existing data.

How does a Datanode discover authoritative sources of checksum data for its blocks?

This is presumably done with a call to the Namenode that, given a block id, responds with the name of a checksum file. The Datanode then reads the header, determines the offset and length where the specified block's checksums lie, then reads the checksum data and validates it. This works while the upgrade is in progress, but perhaps it can be extended to deal with Datanodes that join the system after the upgrade is complete. If a Datanode joins after a complete upgrade and crc file deletion, the Namenode could redirect it to other Datanodes that have copies of its blocks; the new Datanode can then pull block-level CRC files from its peers, validate its data and perform an upgrade even though the .crc files are gone.

          Thoughts?
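
For illustration, the offset arithmetic described above is simple enough to sketch. This hypothetical helper assumes 4-byte CRC32 entries (as in the existing ChecksumFileSystem layout) and takes the .crc file's header length and bytes/chksum as inputs already read from that header.

public class CrcOffsets {
    public static final int CRC_SIZE = 4;   // one CRC32 value per chunk

    /** Byte offset within the .crc file of the first checksum covering this block. */
    public static long checksumOffset(long blockStartInFile, int bytesPerChecksum,
                                      long crcHeaderLength) {
        long firstChunkIndex = blockStartInFile / bytesPerChecksum;
        return crcHeaderLength + firstChunkIndex * CRC_SIZE;
    }

    /** Number of checksum bytes covering a block of the given length. */
    public static long checksumLength(long blockLength, int bytesPerChecksum) {
        long chunks = (blockLength + bytesPerChecksum - 1) / bytesPerChecksum;  // round up
        return chunks * CRC_SIZE;
    }
}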

          Raghu Angadi added a comment -

          "io.bytes.per.checksum" will still be used for the expected purpose and will dictate bytes/checksum for all the new data created by the client. Doug prefered client dictating this value and I preferred Namenode or Datanode informing client (using the same config).

Similar to current behavior, each checksum file has a header that indicates bytes/checksum. Thus at any time each block has its own bytes/checksum; it does not need to match other blocks or even other replicas. When a block is copied to another datanode, the source datanode decides what the bytes/checksum value is.
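
As an illustration only, a minimal header carrying bytes/checksum might look like the sketch below; the field layout and version value are assumptions for the example, not the format the patch actually uses.

import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class BlockCrcHeader {
    public static final int VERSION = 1;     // hypothetical version number

    public final int bytesPerChecksum;

    public BlockCrcHeader(int bytesPerChecksum) {
        this.bytesPerChecksum = bytesPerChecksum;
    }

    /** Writes the header at the front of a block's checksum file. */
    public void write(DataOutputStream out) throws IOException {
        out.writeInt(VERSION);
        out.writeInt(bytesPerChecksum);
    }

    /** Reads and validates the header from a block's checksum file. */
    public static BlockCrcHeader read(DataInputStream in) throws IOException {
        int version = in.readInt();
        if (version != VERSION) {
            throw new IOException("Unsupported checksum file version: " + version);
        }
        return new BlockCrcHeader(in.readInt());
    }
}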

          > Do we simply copy existing checksum data or do we re-generate it?
During upgrade, we simply copy, of course with a new header. This is Doug's preference since this will speed up the upgrade. I agree but don't mind implementing a forced check during upgrade.

During block replication, the destination datanode verifies the checksum and creates its local copy. The end result would be that source and destination have the same content in the checksum file (unless the header format has changed).

          > I don't think simply copying checksum data is enough since the checksums can themselves be corrupt. We need some level of validation. > We can compare copies of the checksum data against each other, if we find a majority of copies that match then we treat those as
          > authoritative. But what happens when we don't find a majority? Or we can re-generate checksum data on the Datanode and validate it
          > against the existing data.

          If we cannot get old CRC data for any reason, we will generate one based on the local data (which could be wrong). There are two options to validate upgraded data (for simplicity all the details and error conditions are not explained) :
          1) use old CRCs (Doug's choice)
          2) check CRC of each replica and choose the majority (Sameer's choice)
          3) Combination of (1) and (2). i.e. use (2) if (1) fails etc. This option is proposed only now.

At this point, I would leave it to you guys to decide which one we should do. Please choose one.

> How does a Datanode discover authoritative sources of checksum data for its blocks?
          Is this during upgrade?

> This works while the upgrade is in progress but perhaps it can be extended to deal with Datanodes that join the system after the
> upgrade is complete. If a Datanode joins after a complete upgrade and crc file deletion, the Namenode could redirect it to other
> Datanodes that have copies of its blocks, the new Datanode can then pull block level CRC files from its peers, validate its data and
> perform an upgrade even though the .crc files are gone.

My expectation was that once the upgrade is considered done by the namenode, any datanode that comes up with old data will shut down with a clear error message without entering the upgrade phase. The following two conditions should be met before the upgrade is considered done:
          1) All the datanodes that registered should report completion of their upgrade. If Namenode restarts, each datanode will re-register and inform again about their completion.
          2) Similar to current safemode, "dfs.safemode.threshold.pct" of the blocks that belong to non .crc files, should have at least "dfs.replication.min" replicas reported upgraded.

Of course, we can still do what you propose for datanodes that come up with old data after the above conditions are met, if it is considered required. It means that some of the upgrade-specific code could spread a little bit more into the normal operation of the namenode. The upgrade is already complicated enough that this might not add much more code.

          Sameer Paranjpye added a comment -

          > If we cannot get old CRC data for any reason, we will generate one based on the local data (which could be wrong). There are two options to validate upgraded data (for
          > simplicity all the details and error conditions are not explained) :
          > 1) use old CRCs (Doug's choice)
          > 2) check CRC of each replica and choose the majority (Sameer's choice)
          > 3) Combination of (1) and (2). i.e. use (2) if (1) fails etc. This option is proposed only now.

Even if we can get the old CRC data, how do we know that it is not corrupt? There are 3 copies of each CRC file, and one or more of these could be corrupt. We need some way to ensure that we're copying correct checksum data to the Datanode. As I said before, we can do this by comparing copies of the existing CRC data against each other and electing a set of authorities, OR by validating the checksum data that we pull against the local blocks.
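
The "elect a majority" idea can be sketched as a simple vote over the checksum bytes pulled from each replica; how those copies are fetched, and what to do when no strict majority exists, is left to the surrounding logic. This is purely illustrative.

import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ChecksumVote {
    /** Returns the majority copy of the checksum data, or null if there is no strict majority. */
    public static byte[] electAuthority(List<byte[]> checksumCopies) {
        Map<String, Integer> votes = new HashMap<String, Integer>();
        Map<String, byte[]> copies = new HashMap<String, byte[]>();
        for (byte[] copy : checksumCopies) {
            String key = Arrays.toString(copy);          // content-based key for comparison
            Integer count = votes.get(key);
            votes.put(key, count == null ? 1 : count + 1);
            copies.put(key, copy);
        }
        for (Map.Entry<String, Integer> e : votes.entrySet()) {
            if (e.getValue() * 2 > checksumCopies.size()) {
                return copies.get(e.getKey());           // strict majority wins
            }
        }
        return null;                                     // no majority; caller must re-validate
    }
}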

          Raghu Angadi added a comment -

That's a good point, Sameer. I agree. I don't think checking the CRCs during upgrade will add much complexity. Downtime-wise, the upgrade is already quite disruptive; 2 hours with verification is not going to be much worse than half an hour without verification.

          Raghu Angadi added a comment -

          Sameer, does the following sum up your proposal (a refinement of option (3) above) :

1) For each block for which a CRC is generated (in one of the ways mentioned below), the Datanode reports the CRC of the checksum file to the namenode.

          2) First a datanode checks its block data with CRC.

3) If the CRC check fails, it verifies against a different block of the old CRC data.

4) If the check against the old CRC matches, the checksum of the CRC file will be reported to the Namenode as authoritative.

5) If this fails as well, it will generate a CRC based on the local data and report the checksum of the CRC file to the Namenode.

6) For blocks that go through step (5), the datanode periodically checks with the Namenode to see whether its CRC file matches either the authoritative CRC or the majority. If the answer is yes, its CRC is considered valid. If the answer is no, the block is scheduled to be deleted (note that an authoritative or majority CRC already exists).

7) If the namenode cannot say yes or no for some reason, the local CRC is kept, with a warning.

8) If the block cannot even be read properly or has an incorrect file length for some reason, it is considered not to exist (it could be marked for deletion).

Note that since the Namenode needs to track extra information, its memory footprint will be larger than pre-upgrade. If we want to avoid this, we could do the authoritative/majority check with one of the datanodes (the lexically first node) for each replica.

To reduce the wait for matching with the authoritative or majority copy, each datanode will sort its blocks and upgrade them in order.
Once an authoritative match is found for a block, the namenode no longer needs to track the meta-crc (crc of the checksum file) from each datanode.
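
A hypothetical helper for the "meta-crc" mentioned here: a single CRC32 computed over a block's checksum file, so that copies of the CRC file can be compared cheaply. The file handling is illustrative only.

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.CRC32;

public class MetaCrc {
    /** CRC32 over the whole checksum file, used only to compare CRC files across datanodes. */
    public static long metaCrcOf(String checksumFilePath) throws IOException {
        CRC32 crc = new CRC32();
        InputStream in = new FileInputStream(checksumFilePath);
        try {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) > 0) {
                crc.update(buf, 0, n);
            }
        } finally {
            in.close();
        }
        return crc.getValue();
    }
}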

          Raghu Angadi added a comment -

          minor correction:

          > 2) First a datanode checks its block data with CRC.
          I meant : '.. with old CRC'.

          James P. White added a comment -

          Shouldn't this design information go into a wiki page or into version control under /docs?

          Raghu Angadi added a comment -

Jira was preferred mainly to facilitate discussion. I am currently writing a wiki page and will either attach an HTML version here or move it to the Hadoop wiki.

A description of how checksums work in DFS will be generally useful (i.e. the overview of the doc I am preparing). A big chunk of the doc will also describe the upgrade, hopefully something most users and developers will soon not need to worry about.

          Sameer Paranjpye added a comment -

          > Sameer, does the following sum up your proposal (a refinement of option (3) above) :
          >
> 1) For each block for which a CRC is generated (in one of the ways mentioned below), the Datanode reports the CRC of the checksum file to the namenode.
          >
          >

          Not really. I don't think I mentioned the Namenode keeping CRCs of block CRCs.

There is another issue that appears not to have been dealt with yet: we say that the Datanode locally generates CRCs if the .crc file for a block is unavailable. We need to be very careful here IMO. During registration many, many blocks are missing simply because they haven't been reported in yet. We don't want the Namenode to report missing .crcs to the Datanodes until a certain threshold of blocks has been reached. This should probably be the same as dfs.safemode.threshold.pct.

          I would propose the following:

1) Extend the Namenode interface, adding a getChecksumAuthority(long blockId) method (a sketch of this interface follows the list). This method takes a block-id as input and responds with a <type, authority, offset> tuple, where 'authority' is the name of the .crc file in the Namenode and 'offset' is the offset of the specified block in the data file. It throws an appropriate exception when the input block-id is unrecognized or belongs to a .crc file or the checksum authority is missing. The 'type' field indicates the authority type, which is either CRCFILE, DATANODE or YOU. The latter two codes are used by the Namenode when it has determined that blocks of a .crc file are missing.

          2) Each Datanode does the following during the upgrade:

          • For each block it calls getChecksumAuthority(), discovers the checksum file, opens it for read, reads the header and discovers the bytes/chksum for the current block
• If getChecksumAuthority() fails, the Datanode moves on to the next block; it will return to this block when it has run through all its remaining blocks
• It uses the bytes/chksum and the data offset to determine where in the .crc file the current block's checksums lie
          • It reads the checksums from one of the replicas and validates the block data against the checksums, if validation succeeds, the checksum data is written to disk and it moves on to the next block. The checksum upgrade for the current block is reported to the Namenode.
          • If validation fails it tries the other replicas of the checksum data. If validation fails against all checksum replicas it arbitrarily chooses one replica, copies checksum data from it and reports a corrupt block to the Namenode

          3) When the Namenode determines that the .crc file corresponding to a block is unavailable, it chooses a representative from one of the Datanodes hosting the block to locally generate CRCs for the block. It does so by sending YOU in the type field when getChecksumAuthority is invoked. For the remaining Datanodes hosting the block the Namenode sends DATANODE in the type field and asks them to copy CRCs from the chosen representative.
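
To make the <type, authority, offset> tuple concrete, here is a sketch of what the proposed getChecksumAuthority() contract might look like; the names, nesting, and exception choice are illustrative and are not the actual Namenode interface.

import java.io.IOException;

interface ChecksumUpgradeProtocol {
    enum AuthorityType { CRCFILE, DATANODE, YOU }

    final class ChecksumAuthority {
        final AuthorityType type;   // where to get checksums from
        final String authority;     // .crc file name, or a datanode address; unused for YOU
        final long offset;          // offset of the block's data within its file
        ChecksumAuthority(AuthorityType type, String authority, long offset) {
            this.type = type;
            this.authority = authority;
            this.offset = offset;
        }
    }

    /** Throws IOException if the block is unknown, belongs to a .crc file, or has no authority. */
    ChecksumAuthority getChecksumAuthority(long blockId) throws IOException;
}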

          The upgrade is considered complete when dfs.replication.min replicas of all known blocks have been transitioned to block level CRCs.

In some cases, this condition will not be met because some data blocks are MIA.

          During the upgrade process 'dfs -report' should indicate how many blocks have not been upgraded and for what reason. It should also indicate whether
          a) the upgrade is incomplete
          b) the upgrade is complete
          c) the upgrade is wedged because some blocks are missing

In the case of b) or c) occurring, the sysadmin can issue a 'finishUpgrade' command to the Namenode which causes the .crc files to be removed and their blocks marked for deletion. Note that this is different from 'finalizeUpgrade', which causes state from the previous version to be discarded. Datanodes that join the system after the upgrade is finished are handled using 3) above.

          This is a more complex proposal, but the additional complexity has been introduced in order to provide much stronger correctness guarantees, so I feel that it is warranted. Comments welcome.

          Raghu Angadi added a comment -

          > There is another issue that appears not to have been dealt with yet, we say that the Datanode locally generates CRCs if the .crc file for a
          > block is unavailable. We need to be very careful here IMO. During registration many many blocks are missing simply because they haven't
          > been reported in yet. We don't want the Namenode to report missing .crcs to the Datanodes until a certain threshold of blocks has been
          > reached. This should probably be the same as dfs.safemode.threshold.pct.

          Certainly. During upgrade, namenode will first wait for current "SafeMode" requirements to be met.

          Raghu Angadi added a comment -

          Just a few clarifications on your proposal :

1) We don't look for a "majority checksum" if the ".crc" file does not exist, but select the first datanode to be the master?

          > ... reports a corrupt block to the Namenode
          2) What should the namenode do with such blocks? Delete them?

> ... Datanodes that join the system after the upgrade is finished are handled using 3) above.
This implies that in normal operation, the namenode should know whether any datanode has finished this upgrade or not. If it cannot find an upgraded node, then it will reply with "YOU".

          Konstantin Shvachko added a comment -

Sorry for joining the discussion at such a late stage. There's been a lot of progress here. I've got some fresh thoughts.

          Currently the name-node does not know anything about .crc files. IMO we should keep it that way or at least minimize the impact.
          That said, I'd not support the idea of implementing the getChecksumAuthority() method on the name-node, since if we do
          it the name-node will act as a client to itself and its data-nodes.
I would propose implementing a separate crcConverter - a client which acts like the fsck client we used to have in the past.
          We can make it MR-distributed if we want to accelerate things.
          The crcConverter

          • reads a set of files and checks each block determining which crc block is valid, basically implementing
            Sameer's getChecksumAuthority() but on the client.
          • Then asks each data-node to takeChecksumAuthority() over the block by either copying its crc from another node or generating it locally.
          • And then removes the crc file, which will also lead to removal of old crc blocks.

In case of failure the crcConverter will restart as if nothing had been done before.
For those files that have already been converted, the converter will not find their crc files and will ask the data-nodes to
takeChecksumAuthority() over the corresponding blocks by generating them locally. Data-nodes will see that the corresponding
crc files have already been generated and will do nothing.

With respect to the name-node, during crc conversion it should be started with the -upgrade option, and the converter should wait
until everything is upgraded, replicated and stabilized on the cluster. Then it can enter the manual safe mode and do the conversion.
I am sure there are a lot of details missing in my proposal, but it seems simpler to me because it requires fewer changes on the name- and data-nodes.
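
A rough skeleton of the crcConverter loop described above; every type here (FileLister, BlockInfo, DatanodeHandle, DfsHandle) is a hypothetical stand-in for whatever client-side API the converter would actually use, and takeChecksumAuthority() is the proposed datanode call, not an existing RPC.

import java.io.IOException;
import java.util.List;

public class CrcConverter {
    interface FileLister { List<String> dataFiles() throws IOException; }
    interface BlockInfo  { long blockId(); List<DatanodeHandle> replicas(); }
    interface DatanodeHandle {
        /** Ask the datanode to create its local block-level CRC, copying from a source or generating it. */
        void takeChecksumAuthority(long blockId, String validCrcSource) throws IOException;
    }
    interface DfsHandle {
        List<BlockInfo> blocksOf(String path) throws IOException;
        String findValidCrcSource(BlockInfo block) throws IOException; // null if old .crc data is gone
        void deleteCrcFileFor(String path) throws IOException;         // removes the old .crc sibling
    }

    public static void convert(FileLister lister, DfsHandle dfs) throws IOException {
        for (String path : lister.dataFiles()) {
            for (BlockInfo block : dfs.blocksOf(path)) {
                String source = dfs.findValidCrcSource(block);   // may be null: generate locally
                for (DatanodeHandle dn : block.replicas()) {
                    dn.takeChecksumAuthority(block.blockId(), source);
                }
            }
            dfs.deleteCrcFileFor(path);   // drop the old namespace-level checksum file
        }
    }
}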

          Doug Cutting added a comment -

          > Even if we can get the old CRC data, how do we know that it is not corrupt?

          The same way we do today: we don't. If checksums are corrupt, we assume the data is corrupt, which is right 99% of the time, since the data is 100 times larger. I don't think we should try to improve the checksum quality for existing data with this patch. (Who will checksum the checksums?) That's mission creep. It's a laudable long-term goal. This patch should confine itself to removing checksums from the HDFS namespace. That will be difficult enough to develop, debug, review, test, deploy and support. We should not use this issue for a complete re-design of checksumming. Rather we can incrementally optimize and improve the checksum system after this issue is committed.

          Sameer Paranjpye added a comment -

          > The same way we do today: we don't.

          When the HDFS client encounters a checksum error, it doesn't know whether it is the data or the checksum that is corrupt. But it takes care to switch both the data block and the checksum block that it is reading. So while we don't know whether the checksum is corrupt, we certainly don't assume that it's not.

Computing CRCs is cheap; every job that we run computes them on input and on output, and it hasn't been a burden so far. We can benchmark how long it takes to compute checksums for a block, but I doubt it would add significantly to the time for the upgrade.

          > This patch should confine itself to removing checksums from the HDFS namespace

          Perhaps, but even so, the removal should be done correctly. The system as it exists today deals with the possibility of checksum corruption, it should do so during and after the upgrade. Not doing so would be a regression.

          Doug Cutting added a comment -

          > When the HDFS client encounters a checksum error, it doesn't know whether it is the data or the checksum that is corrupt.

Okay, I see your point. If we import only a single replica of the checksums on upgrade then we'd increase the false-positive rate of checksum errors, right? But for every false-positive we'd still have 100 real data corruptions, so I'm not sure this is a big deal.

          > The system as it exists today deals with the possibility of checksum corruption [ ... ]

          Yes, in some cases. If the corruption happened to a checksum on a datanode, then it does. If it happened before the data reached the datanode, then it doesn't.

So, sure, we could read all copies of the checksum data when updating and vote. It adds complexity, introducing potential bugs, but with some benefit. +0

          However I don't see how checking the data against the checksums during upgrade helps much. If they don't agree, the block's probably corrupt, but the checksum could be (or both could be). It seems the best we can do in this case is let the client discover the problem if/when the data is read, and, if they insist, use the unvalidated data. Or am I missing something?

          Sameer Paranjpye added a comment -

          As Konstantin suggests, using a client program to perform validation is also reasonable. It has the advantage of keeping upgrade code in HDFS very simple and decoupling the Namenode and Datanode upgrades.

Datanodes would perform local upgrades during which they'd re-generate checksums for all their blocks and put them in side checksum files. Once this is done, we could launch a Map/Reduce job that reads data files and validates them against the existing .crc files, ensuring that it reads all blocks. If it discovers corruption, it reports the corrupt blocks to the Namenode, which can then proceed to invalidate them and replicate the correct instances. For every file that is successfully validated, the client would delete the .crc file from the namespace. Dealing with missing replicas is a bit tricky with this approach; the other downside is that it is potentially much slower, since validating with Map/Reduce could cause a lot of data transfer over the network.
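
The per-block check that such a validation task would run can be sketched with plain streams: recompute CRC32 over each bytesPerChecksum-sized chunk of the data and compare it with the stored 4-byte value. Where the two streams come from (the HDFS data and the matching region of its old .crc file) is assumed rather than shown.

import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.CRC32;

public class CrcValidator {
    /** Returns the index of the first corrupt chunk, or -1 if everything matches. */
    public static long firstMismatch(InputStream data, DataInputStream storedCrcs,
                                     int bytesPerChecksum) throws IOException {
        byte[] buf = new byte[bytesPerChecksum];
        CRC32 crc = new CRC32();
        long chunk = 0;
        int read;
        while ((read = fill(data, buf)) > 0) {
            crc.reset();
            crc.update(buf, 0, read);
            int expected = storedCrcs.readInt();       // stored 4-byte CRC32 for this chunk
            if ((int) crc.getValue() != expected) {
                return chunk;                          // corrupt chunk to report to the caller
            }
            chunk++;
        }
        return -1;
    }

    // Fills buf as far as possible; returns the number of bytes read (0 at end of stream).
    private static int fill(InputStream in, byte[] buf) throws IOException {
        int off = 0;
        while (off < buf.length) {
            int n = in.read(buf, off, buf.length - off);
            if (n < 0) break;
            off += n;
        }
        return off;
    }
}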

          Doug Cutting added a comment -

          > it is potentially much slower since validating with Map/Reduce could cause a lot of data transfer over the network

          Why wouldn't the map tasks run on a node where the block is local? The checksum data would need to be read over the network, but checksums are 1% the size of data, and we typically assume that net reads from a random node are 10x slower than local disk reads, so the checksum network i/o should only add 10% to the cost of reading the block, right?

          Sameer Paranjpye added a comment -

          > Okay, I see your point. If we import only a single replica of the checksums on upgrade then we'd increase the false-positive rate of checksum errors, right? But for every
          > false-positive we'd still have a 100 real data corruptions, so I'm not sure this is a big deal.

          Yes, but in each of those 100 real data corruptions data can be salvaged by switching to a valid instance of the block, deleting the corrupt instance and replicating the valid instances. If we blindly copy checksums we'll introduce errors that cannot be recovered from, period.

          > Yes, in some cases. If the corruption happened to a checksum on a datanode, then it does. If it happened before the data reached the datanode, then it doesn't.

          Not in just some cases, but in the overwhelming majority of cases. Data corruption before data reaches the Datanode would occur either in RAM or during network transmission; the likelihood of this happening is orders of magnitude lower than that of 1 out of 3 replicas on disk becoming corrupt. Disks are flaky, far flakier than RAM or network, and bit rot happens all the time when you have a large number of them. The percentage of corruptions that would occur before data reaches the Datanodes is far lower than 1 percent, based on the error rates I'm aware of.

          Sameer Paranjpye added a comment -

          > Why wouldn't the map tasks run on a node where the block is local? The checksum data would need to be read over the network, but checksums are 1% the size of data, and
          > we typically assume that net reads from a random node are 10x slower than local disk reads, so the checksum network i/o should only add 10% to the cost of reading the
          > block, right?

          Yes, it could be done that way, if each split were a set of block instances on a node. The client would need a way to go from a block id to a .crc file via an extension of the Namenode API. The difficulty there is in determining the set of validated files from the set of validated blocks and so knowing which .crc files can be deleted. Of course, all the .crc files could be deleted at the end.

          The way I was thinking about it was to have each split be a file or a set of files; it would be hard to schedule local to all the blocks in that case. This requires practically no API changes, since there already exists an API to report corrupt blocks. Once a file is validated the .crc file would be deleted by the client. The set of .crc files remaining at the end tells you exactly which data is suspect. This feels very clean, but doesn't do such a great job of ensuring data locality.

          Doug Cutting added a comment -

          > Yes, but in each of those 100 real data corruptions data can be salvaged by switching to a valid instance of the block

          Assuming the corruption happened after replication.

          > Data corruption before data reaches the Datanode would occur either in RAM or during network transmission, the likelihood of this happening is orders of magnitude lower than 1 out of 3 replicas on disk becoming corrupt.

          That's not the universal experience. Many if not most of the checksum errors I've heard of traced back to memory errors. Someone recently reported a non-reproducible checksum error from the InMemoryFileSystem, didn't they?

          Doug Cutting added a comment -

          > The client would need a way to go from a block id to a .crc file via an extension of the Namenode API.

          The split could include the datanode name, the block ID, the file name and the offset of the block within the file. Then the mapper could access the CRC file using normal namenode and datanode calls.

          But actually, now that I think about it, if we're primarily not validating checksums against the data, but rather comparing all the checksums for a block, then locality may not be worthwhile. In that case we'd want a temporary datanode extension that permits writing the checksum file for a block. Then the updater map task can read through all copies of the checksum file, construct the best possible checksum for each block, then send these to datanodes. So, in aggregate, 6% of the filesystem would cross the wire during the upgrade. Could that work?
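          [Editor's note] The 6% figure appears to come out of arithmetic along these lines, taking checksum data as roughly 1% of the data size and a replication factor of 3 (my reading of the estimate, not stated explicitly above):

            read all checksum replicas:   3 replicas * ~1%  ≈ 3% of filesystem size
            write merged checksums back:  3 replicas * ~1%  ≈ 3%
            total network traffic during upgrade            ≈ 6%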

          > Once a file is validated the .crc file would be deleted by the client.

          Is the upgrade the time to detect corrupt blocks? Won't these be detected through the normal mechanisms later? We don't want to perform any replication during the upgrade. As a subsequent patch, we should validate checksums during replication, so that we don't replicate a corrupt block, but I don't think we need to do that as a part of this patch. (Some might wait until both patches are committed before updating a particular filesystem.)

          Sameer Paranjpye added a comment -

          > That's not the universal experience. Many if not most of the checksum errors I've heard of traced back to memory errors. Someone recently reported a non-reproducible
          > checksum error from the InMemoryFileSystem, didn't they?

          Perhaps, but it is contrary to the published error rates for disk and RAM. Also, people tend to report errors that bring their jobs to a halt. When disk errors occur in HDFS, jobs don't fail; corrupt blocks get deleted and replaced with good ones, so they don't get noticed as much. The reports in this case are biased IMO; we'd get a much better picture by going through logs.

          Sameer Paranjpye added a comment -

          > Is the upgrade the time to detect corrupt blocks? Won't these be detected through the normal mechanisms later? We don't want to perform any replication during the upgrade.

          Not necessarily, and we don't want to be replicating during the upgrade. Reporting corrupt blocks during the upgrade gives us some early warning, and the blocks could be deleted right after the upgrade. So it's a minor advantage, but not necessary.

          Konstantin Shvachko added a comment -

          > The split could include the datanode name, the block ID, the file name and the offset of the block within the file.

          We already have a similar test: DistributedFSCheck.
          It is a fs test, not dfs, so the data-node name was not included as a part of the split key.

          > Then the updater map task can read through all copies of the checksum file, construct the best possible checksum for each block, then send these to datanodes. [...] Could that work?

          I was thinking about letting the data-node containing the data block read the corresponding crc from another node,
          but sending crcs from the client, which should read them anyway, is even better.

          Sameer Paranjpye added a comment -

          > But actually, now that I think about it, if we're primarily not validating checksums against the data, but rather comparing all the checksums for a block, then locality may not be
          > worthwhile. In that case we'd want a temporary datanode extension that permits writing the checksum file for a block. Then the updater map task can read through all copies of
          > the checksum file, construct the best possible checksum for each block, then send these to datanodes. So, in aggregate, 6% of the filesystem would cross the wire during the
          > upgrade. Could that work?

          If the client is going to send CRCs to Datanodes then we can have each split be a filename or a list of filenames; this will work fine if we don't want to validate data. If we want validation, a map task can be scheduled local to one instance of most or all blocks of a file, i.e. scheduled on the node where the file was generated; it can validate local data, fall back to remote data if local validation fails, then write checksums to all the block instances.

          How do we manage blocks that are missing during the upgrade? There are 3 cases really:

          • Blocks that are entirely missing i.e. no instances are available
          • Blocks that have some instances missing
          • Blocks whose checksums are missing
          Raghu Angadi added a comment -

          Sameer talked to me about the following approach for upgrade. The following is based on that discussion:

          1) Namenode considers the upgrade complete only when 100% of blocks have at least one replica upgraded (as described below). We will still have a manual override call to mark the upgrade complete (i.e. the finalizeUpgrade() call that was mentioned earlier).

          2) During upgrade, datanodes fetch all the replicas of CRC data (C1, C2, C3, ...) for a given block B. If the CRC replicas match, the datanode generates the new CRC file with that data. There is no verification against the block data in this case. If not all CRCs match but a majority of them match, then the majority CRC will be used. If all are different, then we just pick one (with least block id?). A sketch of this selection is given after the list.

          3) If there is no existing CRC data available for some reason, we generate it from the existing block data. A warning will be printed.

          4) If a datanode comes into the cluster with old data directories and notices that the namenode has completed the upgrade, it upgrades each of its blocks like this:
          4.a) It gets the datanode that has an 'upgraded replica' for a block and contacts that datanode for the CRC data.
          4.b) If it cannot find an upgraded replica, it generates the CRC locally.. Or should it delete the block instead?
          4.c) It waits for some time (half an hour) before deciding that an upgraded replica is missing.
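          [Editor's sketch] A minimal illustration of the majority selection in step 2, assuming the .crc data fetched from each replica is available as a byte array. The class and method names are invented for illustration and are not part of the actual patch.

          import java.util.*;

          /** Picks the CRC data to keep for a block, given the .crc contents fetched
           *  from each replica: the content the largest number of replicas agree on,
           *  with a deterministic fallback when all replicas differ. */
          class CrcReplicaChooser {

            static byte[] chooseCrc(List<byte[]> crcReplicas) {
              if (crcReplicas.isEmpty()) {
                return null;                       // no CRC data: handled separately (case 3)
              }
              // Count identical replicas by content.
              Map<String, Integer> counts = new HashMap<String, Integer>();
              for (byte[] crc : crcReplicas) {
                String key = Arrays.toString(crc);
                Integer c = counts.get(key);
                counts.put(key, c == null ? 1 : c + 1);
              }
              // Pick the most frequent content; break ties deterministically by key.
              byte[] best = null;
              int bestCount = 0;
              String bestKey = null;
              for (byte[] crc : crcReplicas) {
                String key = Arrays.toString(crc);
                int c = counts.get(key);
                if (c > bestCount || (c == bestCount && key.compareTo(bestKey) < 0)) {
                  best = crc;
                  bestCount = c;
                  bestKey = key;
                }
              }
              return best;   // if all replicas differ, this is just a deterministic pick
            }
          }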

          Doug Cutting added a comment -

          > 3) If there is no existing CRC data available for some reason, we generate from existing block data. A warning will be printed.

          I'm not sure about this one. Generating a checksum from existing data gives the appearance that the data is correct when it might not be, which could be dangerous. I think we should handle this the same way we'll handle things if/when a CRC file is missing after the upgrade. That shouldn't happen, but it might, and we need to think about what we should do in that case. My guess is that we should return a null checksum with the data when it is read, and let the client decide whether to accept or reject the unchecksummed data. Currently I think we emit a warning and accept it.

          > 4.b) If can not find an upgraded replica, it generates local CRC.. Or should it delete it?

          This should be handled as above. We should not generate a new CRC, but rather leave the block without a CRC file and pass that information along to the client at read time. The client can then decide to look for a replica with a CRC, emit a warning, fail, etc.

          Raghu Angadi added a comment -

          > I think we should handle this the same way we'll handle things if/when a CRC file is missing after the upgrade.

          After the upgrade, I think it is cleaner and simpler to treat this as a hard error on the block, i.e. the block will be considered badly corrupt and handled accordingly.

          > That shouldn't happen, but it might, and we need to think about what we should do in that case. My guess is that we should
          > return a null checksum with the data when it is read, and let the client decide whether to accept or reject the unchecksummed data

          How do we handle transfers from one datanode to another?

          I understand it would be better to be flexible, but one way or the other we have to deal with real hard errors (mostly because of hardware errors). If our software is so buggy that we need to expect the CRC file not to exist and handle it as an 'expected condition', I think it would be better to spend more time fixing those bugs. I vote against treating this as a soft error.

          Regd the option to serve possibly corrupt data, I was thinking of making the client explicitly ask the datanode to ignore checksum errors at the beginning of reading data from the datanode (possibly based on client config). Since the CRC is served inline on the connection, we should have some convention like 'a checksum of 0000 followed by some magic 8 bytes means the checksum is incorrect' or some such thing.

          Raghu Angadi added a comment -

          > Since CRC is served inline on the connection, [...]

          This is for handling checksum errors dynamically on the client connections.

          Doug Cutting added a comment -

          > After the upgrade, I think it is cleaner and simpler to treat this as hard error on the block.

          If the only copy of a block has no CRC, shouldn't we still permit folks to access the data somehow? I don't think we should just remove the block in that case.

          > If our software is so buggy that we need to expect CRC file not to exists and handle it as an 'expected condition', I think it would be better to spend more time fixing those bugs.

          It's not expected, and it should normally cause an exception to be thrown by the client. But folks should still be able to scavenge their data by setting a config parameter that permits them to access the data even if it doesn't have a checksum. Wouldn't that be preferable to data loss?

          Raghu Angadi added a comment -

          > If the only copy of a block has no CRC, shouldn't we still permit folks to access the data somehow? I don't think we should just remove the block in that case.

          Agreed. Earlier in the Jira I mentioned that the Namenode should be enhanced to treat the 'block corrupt' message from a Datanode properly, i.e. try to create a new replica and not delete the block until a new replica can be created.

          I wanted to imply that if the CRC file does not exist, it will be treated just like a corrupt block (whatever the policy may be). I somehow thought you meant to distinguish between 'missing checksum' and 'checksum mismatch'.

          > access the data even if it doesn't have a checksum. Wouldn't that be preferable to data loss?
          Yes. This will be built into the client-datanode data transfer protocol.

          Doug Cutting added a comment -

          > I wanted to imply if CRC file does not exist, it will be treated just like a corrupt block (whatever the policy may be).

          We should, during upgrade, implement a policy consistent with this. I think that means that, if no CRC files exist for a block before the upgrade, then no CRC files should exist for it after the upgrade, right?

          Raghu Angadi added a comment -

          Doing so implies that we should implement improved treatment of corrupt blocks in the Namenode and Datanode as a part of this jira instead of as a follow-up feature. I am fine with that.

          Doug Cutting added a comment -

          > Doing so implies that we should implement improved treatment of corrupt blocks in Namenode and datanode as apart of this jira instead of as a follow up feature.

          I don't follow. For this issue the client should continue to behave as it currently does when a checksum file is missing. We might change that in a subsequent issue, perhaps urgently. But let's avoid mission creep in this issue.

          Sameer Paranjpye added a comment -

          We should not generate new CRCs or do anything with blocks that have missing CRCs.

          > We might change that in a subsequent issue, perhaps urgently.

          +1

          My understanding is that the client currently ignores missing CRCs, which is probably not desirable. Let's use this issue to get CRCs out of the Namenode and address client behavior in a follow-up issue.

          Doug Cutting added a comment -

          > we should have some conventions like 'checksum of 0000 followed by some magic 8 bytes means checksum is incorrect' or some such thing

          That assumes that the checksum is validated, doesn't it? And we're not yet validating checksums except in the client, so I don't see where such an encoding would be used. Right now we only need a way for the datanode to indicate either (a) that there is a checksum, and here it is; or (b) that there is no checksum.

          > if CRC file does not exist, it will be treated just like a corrupt block

          The client might handle these cases differently. If the data does not match the checksum, then odds are the data is invalid. However if the data has no checksum then odds are good that the data is valid. So one might reasonably configure the client to permit reading of data without checksums but to throw exceptions for data whose checksum does not match the data.

          Raghu Angadi added a comment -

          Not generating missing CRCs sounds fine, and in the initial release the datanode will not delete any blocks for checksum reasons.

          > That assumes that the checksum is validated, doesn't it? And we're not yet validating checksums except in the client,

          Please confirm the desired behaviour in normal operation (after the upgrade). Two options I have seen so far:

          1) Datanode does not verify any checksums anytime. This might or might not change in future.

          2) Datanode also verifies checksums while receiving and sending data. It closes the client connection if the checksum does not match while receiving from the client. In the short term, it at most prints a warning for checksum errors while sending data and does not report the error.

          Doug, I guess you are proposing (1). Though I prefer (2), I am ok with (1). If this is not true I would like to know what the desired behavior is.

          Doug Cutting added a comment -

          This issue should implement (1). (2) is desirable long-term, since it will permit us to detect corruptions before they're stored to the filesystem, but it should be done as a separate issue.

          Raghu Angadi added a comment -

          Proposed protocol for data transfers, used for 'OPs' e.g. OP_WRITE_BLOCK, OP_READ_BLOCK etc. Note that this is not a fixed font and packets might look skewed.

          Common header for all the OPs. The requesting side sends the following header:

          | 2 byte version | 1 byte OP | OP specific data ... |

          Version should match exactly on both sides.

          Read and write OPs transfer data in DATA_CHUNKS that contain the <offset, len, checksum, data> that Doug mentioned earlier.

          DATA_CHUNK: the current checksum is CRC32, so the checksum is 4 bytes.

          | 4 byte offset | 4 byte len | data ... | checksum |

          A DATA_CHUNK packet with 0 offset and 0 length indicates proper end of stream.

          When OP is OP_WRITE_BLOCK ( used when blocks are written to a datanode ):
          ==========================

          | 1 byte checksum type | 4 byte bytes.per.checksum | 4 byte num nodes to copy | DatanodeInfos ... | DATA_CHUNKS ... |

          bytes.per.checksum is fixed at the start and can not change. An empty DATA_CHUNK indicates proper end of stream.

          When OP is OP_READ_BLOCK ( used to read data from a block ):
          ===========================

          | 8 byte block id | 4 byte start offset | 4 byte end offset |

          end offset == -1 indicates reading till the end of the block. Reply from the datanode:

          | 1 byte checksum type | 4 byte bytes.per.checksum | DATA_CHUNKS ... |
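          [Editor's sketch] To make the framing above concrete, here is how a sender might write the common header and one DATA_CHUNK using DataOutputStream. The version and opcode constants and all names are assumptions made for illustration, not values from the actual patch.

          import java.io.*;
          import java.util.zip.CRC32;

          /** Illustrative framing of the common header and DATA_CHUNK layouts above. */
          class DataChunkWriter {
            static final short DATA_TRANSFER_VERSION = 1;   // assumed value
            static final byte OP_WRITE_BLOCK = 80;          // assumed opcode value

            static void writeHeader(DataOutputStream out, byte op) throws IOException {
              out.writeShort(DATA_TRANSFER_VERSION);  // 2 byte version
              out.writeByte(op);                      // 1 byte OP
            }

            /** Writes one <offset, len, data, checksum> chunk. */
            static void writeChunk(DataOutputStream out, int offset,
                                   byte[] data, int len) throws IOException {
              CRC32 crc = new CRC32();
              crc.update(data, 0, len);
              out.writeInt(offset);                   // 4 byte offset
              out.writeInt(len);                      // 4 byte len
              out.write(data, 0, len);                // data
              out.writeInt((int) crc.getValue());     // 4 byte CRC32 checksum
            }

            /** A zero offset / zero length chunk marks proper end of stream. */
            static void writeEndOfStream(DataOutputStream out) throws IOException {
              out.writeInt(0);
              out.writeInt(0);
            }
          }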
          Raghu Angadi added a comment -

          The existing protocol and the one proposed in the previous comment don't have a confirmation from the receiver that the complete block was received. It looks like it is necessary for the sender to be confident that the block was properly received. Extension to the above:

          In response to OP_WRITE_BLOCK, the receiver sends one of the following one-byte codes to the sender:

          | 1 byte write status |

          Status is one of the following:

          OP_STATUS_SUCCESS
          OP_STATUS_ERROR_CHECKSUM ( currently not used )
          OP_STATUS_ERROR ( Misc error. e.g. offset mismatch for successive DATA_CHUNKS )
          OP_STATUS_EXISTS

          OP_STATUS_EXISTS could be used when the client tries to resend the block because it failed to receive the OP_STATUS_SUCCESS response from the previous attempt. Sometimes this could be an error. The receiver could close the connection at any time, and it could write the status byte before closing the connection.

          Konstantin Shvachko added a comment -

          > For this issue the client should continue to behave as it currently does when a checksum file is missing.
          >We might change that in a subsequent issue, perhaps urgently.

          An urgent +1

          Currently, once a missing crc is detected for a block, the client does not check any crcs for any blocks of the file until it is reopened.
          So yes, missing crcs are ignored, but in a bad way, discarding further verification with existing crcs.

          I agree with the idea that for now checksum verification should remain only on the client, but in the
          long term we should also verify it on data-nodes. Probably a good time to do that would be
          when we introduce periodic checksum verification.

          I'd like to underline some advantages of using a client program to perform the checksum upgrade.

          • This code will be used only once, and it should be as isolated as possible.
          • The client approach requires very few if any changes to the existing communication protocols.
          • It does not require changes to the name-node code and very few changes to the data-node.
          • It has the potential to be transformed into a distributed fsck later on.
          • Everything that Raghu is proposing to run during the checksum upgrade on the data-node can be implemented on the client.
          • Regular upgrade procedures can be used to save the current file system state before starting the checksum upgrade, leaving a possibility of doing a rollback.
          • We do not need to worry about old data-nodes coming up after the checksum upgrade was done. These data-nodes should fail and be reformatted, since the cluster has already collected/replicated all necessary blocks.
          Hairong Kuang added a comment -

          The proposal tries to read the existing crc files to generate block-level crc files during the upgrade. This does not work when an existing file contains blocks with a size which is not a multiple of bytesPerChecksum. Such files may exist because FileSystem allows a user to specify the block size when creating a file. In this case, a data node may have to verify the block data first and then regenerate the checksums.

          Doug Cutting added a comment -

          > This does not work when an existing file contains blocks with a size which is not a multiple of bytesPerChecksum.

          I'd be surprised if there are many files, if any, in this state, but it is technically possible, so we should consider it. It would be easy to write a program that scans a filesystem to see which files' block sizes are not evenly divisible by their bytesPerChecksum.

          > In this case, a data node may have to verify block data first and then regenerates the checksums.

          Yes, that would be best. It'd be nice to know if this actually occurs before we implement that however.

          Raghu Angadi added a comment -

          When a block's data does not start or end at a bytesPerChecksum (bpc) boundary, I am planning to do the following in the Datanode:

          Assume the block's data is divided like this: x + n * bpc + y, where x and y are less than bpc (a worked example follows the list).

          1. Using DFSClient, fetch bpc-x bytes before the block and bpc-y bytes after the block.
          2. Just like for the rest of the blocks, fetch the .crc file data for the block as if the block had bpc + n*bpc + bpc bytes (i.e. the DN fetches crc data from multiple replicas and selects the majority that match).
          3. Read the actual block, verify that the data matches, and generate new CRC data with the same bpc.
          4. When there is a mismatch, we could have a wrong CRC just for the affected range, or we could have a null checksum for the entire block. I am thinking of having the wrong CRC value just for the affected range since it does not increase the code by much.
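          [Editor's example] A hypothetical worked example of the layout above, with numbers chosen only for illustration: assume bpc = 512 and a file written with a 1200-byte block size, so its second block covers file offsets [1200, 2400).

            x = 1536 - 1200 = 336      (partial chunk at the start of the block)
            n = 1                      (one full 512-byte chunk, file offsets [1536, 2048))
            y = 2400 - 2048 = 352      (partial chunk at the end of the block)
            check: 336 + 1*512 + 352 = 1200 bytes of block data

            step 1 fetches bpc - x = 176 bytes before the block (offsets [1024, 1200))
                   and     bpc - y = 160 bytes after it         (offsets [2400, 2560)),
            so the old .crc chunks covering [1024, 1536) and [2048, 2560) can be verified
            before new per-block checksums are written with the same bpc.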
          Raghu Angadi added a comment -

          Overview of Block Level CRC. I will attach current patch.

          Raghu Angadi added a comment -

          Current patch is attached. This is a VERY EXPERIMENTAL patch. A few notes:

          1. Applies to trunk as of 05/30/07.
          2. This will not start on an existing DFS. You need to format first with this patch.
          3. DFS is fully functional and is not expected to be any slower than the current DFS. But this is in no way close to being the final patch.
          4. There is a big chunk of upgrade-related code in DataNode.java and a little bit in FSNameSystem.java, but it is not currently executed.
          5. DistributedFileSystem in trunk becomes ChecksumDistributedFileSystem with the patch. DistributedFileSystem in the patch does not create .crc files.
          6. Core functionality and protocol are described in the html file attached with the previous comment.
          7. The important changes are in DFSClient.java and DataNode.java.
          8. The DFSClient.BlockReader class handles reading from a datanode and unpacks the data.
          9. There are quite a few XXX comments and they will be handled in later patches.
          Doug Cutting added a comment -

          Some comments on the design doc:

          1. .meta seems like a more obvious extension than .mtd for metadata.

          2. Re: "common header": should packets be length-prefixed? That might make processing easier.

          3. Re: "DATA_CHUNK": Unless I misunderstand, this only supports a single checksum per chunk and a single chunk per packet. So, if we want to support the current 512-byte checksums, this requires a packet every 512 bytes, which seems small to me. I'd prefer something like:

          <header, OP=WRITE, start, length, bytesPerChecksum, <data, checksum>*>

          So, if length<=bytesPerChecksum, there's just a single checksum, as in your proposal, but if length > bytesPerChecksum then there are multiple checksums in the data.

          Doug Cutting added a comment -

          Re the patch: As I've mentioned before, I'd prefer it if the CRC code could be shared with ChecksumFileSystem. In particular, it seems to me that FSInputChecker and FSOutputSummer could be extended to support pluggable sources and sinks for checksums, respectively, and DFSDataInputStream and DFSDataOutputStream could use these. Advantages of this are: (a) a single implementation of checksum logic to debug and maintain; (b) it keeps checksumming as close as possible to data generation and use.

          This patch computes checksums after data has been buffered, and validates them before it is buffered. We sometimes use large buffers and would like to guard against in-memory errors. The current checksum code catches a lot of such errors. So we should compute checksums after minimal buffering (just bytesPerChecksum, ideally) and validate them at the last possible moment (e.g., through the use of a small final buffer with a larger buffer behind it). I do not think this will significantly affect performance, and data integrity is a high priority.
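          [Editor's sketch] A rough illustration of the 'pluggable sink' idea on the write path: an output stream that checksums every bytesPerChecksum bytes as they are written and hands each checksum to a sink. The class and method names here are invented for the sketch and are not the actual FSOutputSummer API.

          import java.io.*;
          import java.util.zip.CRC32;
          import java.util.zip.Checksum;

          /** Checksums data with only bytesPerChecksum bytes of buffering, handing each
           *  completed chunk to a subclass-defined sink. Illustrative only. */
          abstract class SummingOutputStream extends OutputStream {
            private final Checksum sum = new CRC32();
            private final byte[] buf;       // at most bytesPerChecksum bytes buffered
            private int count = 0;

            SummingOutputStream(int bytesPerChecksum) {
              this.buf = new byte[bytesPerChecksum];
            }

            /** Subclasses deliver one checksummed chunk to their sink:
             *  a datanode connection for DFS, a .crc file for ChecksumFileSystem. */
            protected abstract void writeChunk(byte[] b, int len, int crc) throws IOException;

            public void write(int b) throws IOException {
              buf[count++] = (byte) b;
              sum.update(b);
              if (count == buf.length) {
                flushChunk();
              }
            }

            public void flush() throws IOException {
              flushChunk();
            }

            private void flushChunk() throws IOException {
              if (count > 0) {
                writeChunk(buf, count, (int) sum.getValue());
                sum.reset();
                count = 0;
              }
            }
          }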

          Raghu Angadi added a comment -

          1. Yes, .meta was my first thought. For some reason it became .mtd; will change.
          2. The "common header" is extremely general and very small. What is 'length'? E.g., when reading from a block, is it the length of the data the client wants to read?

          3. Yes, there is one DATA_CHUNK for every bytesPerChecksum. But DATA_CHUNK does not have any other header than start_offset and length. "<data, checksum>*" in your example is the DATA_CHUNK (of course DATA_CHUNK has 16 more bytes than 'data, checksum'). So, is it the 16 bytes for every bytesPerChecksum you want to save?

          Raghu Angadi added a comment -

          Regarding checksum calculation at the source, as we discussed on hadoop-dev, it should ideally be solved by not buffering data at the higher level. There is no reason to think the higher level knows best what to buffer and how to buffer.

          That leaves the problem of validation while reading. DistributedFileSystem is not a ChecksumFileSystem any more. Should it be? Similarly, once we make DistributedFileSystem not buffer any data, would that address this issue?

          This issue needs to be and will be addressed, especially since we know that most often, memory is the culprit.

          Regarding sharing the code: except for the fact that both use the CRC32 class, almost everything about the implementation is different. Making these share code would result in quite a few changes to ChecksumFileSystem. Maybe that should be a different Jira?

          Doug Cutting added a comment -

          > What is 'length'?

          The total size of the packet in bytes. Having this up front might make it easier to, e.g., write an NIO-based datanode that uses async io. Ideally we could re-write datanode to be async without modifying the on-the-wire protocol.

          > Yes, there is one DATA_CHUNK for every bytesPerChecksum.

          Okay. So I take it that when you write 'XXX ...' you mean '<XXX>*' in BNF, right? That wasn't clear to me.

          > So, is it the 16 bytes for every bytesPerChecksum you want to save?

          It wouldn't hurt to save those, but, moreover, I don't see the use case for transmitting the start and length with each checksum, rather it seems like it only makes sense once per request, no? So why not factor it to the OP-level?

          > There is no reason to think higher level knows best what to buffer and how to buffer.

          The primary method for opening a file is:

          FSDataInputStream FileSystem#open(Path, bufferSize);

          I do not expect this patch to change that. So the FileSystem implementation creates buffers for applications. The FileSystem implementation also computes checksums. When, e.g., passed a 10MB bufferSize, the FileSystem implementation should attempt to read-ahead and cache 10MB chunks of data: that's what the application is asking it to do. But, in such cases, the FileSystem implementation should also try to arrange to checksum that data as it is delivered to the application from the stream, rather than as it is read into the stream's internal buffer. Do you disagree with this? I expect this patch to implement this by sharing as much code as is reasonable with ChecksumFileSystem.

          > DistributedFileSystem does not a ChecksumFileSystem any more. Should it be?

          No, but ChecksumFileSystem's stream implementations should be made public and reusable. Both ChecksumFileSystem and DistributedFileSystem should build on common checksumming input and output stream abstract base classes. These classes should have abstract methods to write and read checksum data. ChecksumFileSystem can extend them to read and write from separate checksum file streams, while DistributedFileSystem can extend them to access checksum data from DFSClient.
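
          To sketch roughly what such a shared base class might look like (the class and method names below are hypothetical, not existing Hadoop classes; a ChecksumFileSystem subclass would implement the abstract methods against a separate checksum-file stream, while a DFS subclass would implement them against data received from a datanode):

          import java.io.IOException;
          import java.io.InputStream;
          import java.util.zip.CRC32;

          /** Hypothetical sketch of a shared checksumming input stream. */
          public abstract class CheckedInputStreamSketch extends InputStream {
            protected final int bytesPerChecksum;
            private final byte[] chunk;
            private int chunkLen = 0;   // valid bytes in 'chunk'
            private int chunkPos = 0;   // next byte to return

            protected CheckedInputStreamSketch(int bytesPerChecksum) {
              this.bytesPerChecksum = bytesPerChecksum;
              this.chunk = new byte[bytesPerChecksum];
            }

            /** Read up to bytesPerChecksum bytes of chunk data; return -1 at EOF. */
            protected abstract int readChunkData(byte[] buf) throws IOException;

            /** Return the stored checksum for the chunk just read. */
            protected abstract long readStoredChecksum() throws IOException;

            @Override
            public int read() throws IOException {
              if (chunkPos >= chunkLen && !fillChunk()) {
                return -1;
              }
              return chunk[chunkPos++] & 0xff;
            }

            private boolean fillChunk() throws IOException {
              int n = readChunkData(chunk);
              if (n < 0) {
                return false;
              }
              CRC32 crc = new CRC32();
              crc.update(chunk, 0, n);
              if (crc.getValue() != readStoredChecksum()) {
                throw new IOException("Checksum error");  // real code would retry another replica
              }
              chunkLen = n;
              chunkPos = 0;
              return true;
            }
          }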

          > Process of making these share the code would result in quite a few changes to ChecksumFileSystem. May be that should be a different Jira?

          No, I think it should be a part of this issue. We need shared checksumming stream classes. ChecksumFileSystem and DistributedFileSystem should both be converted to use them. We can't know whether they're really generic and sharable until they're successfully used in more than one place, so I think this is properly included in this issue.

          However, if you want to make it a separate issue, then this issue should be made dependent on that issue.

          Raghu Angadi added a comment -

          > The current checksum code catches a lot of such errors.

          Btw, ChecksumDistributedFileSystem has these problems, i.e. data is buffered before it reaches FSOutputSummer while writing and after it is out of FSInputChecker while reading.

          Doug Cutting added a comment -

          > btw, ChecksumDistributedFileSystem has these problems. i.e. data is buffered before it reaches FSOutputSummer while writing and after it is out of FSInputChecker while reading.

          Yes. I've complained about this several times. We need to fix that. We should not design new things similarly.

          Raghu Angadi added a comment -

          > the FileSystem implementation should also try to arrange to checksum that data as it is delivered to the application from the stream, rather than as it is read into the stream's internal buffer. Do you disagree with this?

          I agree. How do we implement this? My preferred method is for the FS implementation to do the buffering; the implementation can still respect the user-supplied bufferSize. This will also fix the current problem with ChecksumFileSystem. Does that make sense?

          Do you want the code sharing mainly in the software sense, or mainly to fix the buffering issue?

          Doug Cutting added a comment -

          > How do we implement this? My preferred method is for the FS implementation to do the buffering.

          Yes, FileSystem#open(Path, bufferSize) should return a buffered stream, and that method is abstract, so buffering is implemented by the FileSystem. But two FileSystem implementations might reasonably share some of their buffering code. In this case, I think there's lots to share between ChecksumFileSystem and DistributedFileSystem. We've found a number of subtle bugs in the checksumming code, and there may be more. Two separate implementations will double the bugs.

          I've submitted a separate issue to fix the buffering issue, HADOOP-1450. It's simple & will probably be committed long before this issue is complete.

          Raghu Angadi added a comment -

          A separate Jira will be filed about changes to ChecksumFileSystem to support generic sources for data and checksums so that most of the code can be shared.

          Raghu Angadi added a comment -

          > The total size of the packet in bytes. Having this up front might make it easier to, e.g., write an NIO-based datanode that uses async io. Ideally we could re-write datanode to be async without modifying the on-the-wire protocol.

          I am still not clear which length is missing. The length of the common header is a constant. Both OP_READ_BLOCK and OP_WRITE_BLOCK include lengths. In the case of WRITE, the html doc is out of date. I will update.

          > I don't see the use case for transmitting the start and length with each checksum, rather it seems like it only makes sense once per request, no? So why not factor it to the OP-level?

          E.g. OP_READ_BLOCK:
          Right now start_offset is required for the first 'DATA_CHUNK' and length is required for the last two DATA_CHUNKs (at least one data chunk for sure) to indicate the end of the stream (for whatever reason). Using 'Vint' will bring the byte overhead to 5-6. So start_offset can be removed from DATA_CHUNK. I would prefer to keep the length so that loops that read and write to these streams can be a little simpler.

          Raghu Angadi added a comment -

          > A separate Jira will be filed about changes to ChecksumFileSystem to support generic sources for data and checksums so that most of the code can be shared.

          In the meantime, I will have full checksum failure handling functionality in my patch. The only notable code replication I see is the retry logic in FSInputChecker.readBuffer() where SeekToNewSource() and reportChecksumFailure are executed. Though these changes won't be checked in, they will give me full functionality for my testing. And it might help us see how sharing this part will reduce complexity and/or code.

          Doug Cutting added a comment -

          > I am still not clear which length is missing.

          It's a minor point, but if we write async-io daemons for this protocol, then the easier it is to parse the total packet length, the easier it will be to write those daemons. Placing the total packet length in a fixed position at the front of the packet, so that it can be read generically without having to determine what kind of packet it is, will simplify things. An async daemon will typically buffer entire requests as they arrive in small bits and then, once a request is complete, perform an action. However, we can always add such a length later, if and when we write async daemons, but it may then take longer to roll out such daemons, as it may require protocol changes.
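
          To make this concrete, here is a hypothetical length-prefixed read; this is an illustration only, not the actual on-the-wire format from the design doc:

          import java.io.DataInputStream;
          import java.io.IOException;

          class FramedReadSketch {
            /** Read one length-prefixed packet; the leading int is the total body length. */
            static byte[] readPacket(DataInputStream in) throws IOException {
              int length = in.readInt();     // fixed position, independent of the packet type
              byte[] body = new byte[length];
              in.readFully(body);            // an async daemon would buffer until 'length' bytes arrive
              return body;
            }
          }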

          > So start_offset can be removed from DATA_CHUNK. I would prefer to keep the length so that loops that read and write to these streams could be a little simpler.

          That sounds fine.

          > only notable code replication I see is retry logic in FSInputChecker.readBuffer() where SeekToNewSource() and reportChecksumFailure are executed

          Which is some of the most delicate code, that has taken several revisions to get to its current level of correctness. In other words, logic that shouldn't be replicated if at all possible.

          Doug Cutting added a comment -

          Calvin Yu noted on hadoop-user that join() seems to sometimes hang even if the thread has been interrupted. In other places we use the idiom of a 'running' flag that's checked in a thread's loop in conjunction with an interrupt, rather than interrupt+join, and that seems to be reliable. So I think we should switch to that here too.

          Also, in the current patch, I don't see why the thread is held in a field. I worry that someone might add code like 'if (sortProgressThread == null) ...', and that we might somehow not always null this field. If it is kept in a local variable around the call then this is much less of a risk.

          So I think we should convert the createProgressThread method to a nested class whose constructor starts the thread and which has a stop() method that sets a flag. It would also be good if the 'try' block could be shared between 'collect()' and 'flush()'. I think this calls for a new method something like:

          private void sortWithProgress() {
            ProgressThread progress = new ProgressThread();
            try {
              sortAndSpillToDisk();
            } finally {
              progress.stop();
            }
          }
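
          A minimal sketch of that idiom, assuming a Runnable progress callback (the names are illustrative, not the actual task code):

          /** Sketch of a progress reporter stopped with a 'running' flag plus interrupt (no join). */
          class ProgressThread implements Runnable {
            private final Thread thread;
            private final Runnable reporter;            // assumed progress callback
            private volatile boolean running = true;

            ProgressThread(Runnable reporter) {
              this.reporter = reporter;
              this.thread = new Thread(this, "sort-progress");
              this.thread.setDaemon(true);
              this.thread.start();                      // constructor starts the thread
            }

            public void run() {
              while (running) {
                reporter.run();                         // report progress
                try {
                  Thread.sleep(1000);
                } catch (InterruptedException e) {
                  // interrupted by stop(); loop re-checks 'running'
                }
              }
            }

            void stop() {
              running = false;
              thread.interrupt();
            }
          }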

          Raghu Angadi added a comment -

          >> only notable code replication I see is retry logic in FSInputChecker.readBuffer() where SeekToNewSource() and reportChecksumFailure are executed
          > Which is some of the most delicate code, that has taken several revisions to get to its current level of correctness. In other words, logic that shouldn't be replicated if at all possible.

          Of course, code reuse is good. But in this case we need to write equally tricky logic (in more than one place) to support sharing another piece of tricky code. But this is probably considered better since there is no code replication.

          Also, most Jiras filed for ChecksumFileSystem have to do with the complication of maintaining two independent streams that are related to each other by offsets. Also, each stream needs to be retried correctly with different blocks. It is complicated, but block-level CRCs do not have that problem since they only need to deal with one stream and one type of retry. But a new Jira will be filed anyway.

          Raghu Angadi added a comment -

          > It's a minor point, but if we write async-io daemons for this protocol, then the easier it is to parse the total packet length, the easier it will be to write those daemons. Placing the total packet length in a fixed position at the front of the packet, so that it can be read generically without having to determine what kind of packet it is, will simplify things.

          Agreed. Btw, these are not really packets in any sense; these are streams with typical lengths of 100s of MB. Of course each DATA_CHUNK is like a packet, and it does include a length, though using Vints negates the simplicity of async reading. In fact, knowing the lengths of DATA_CHUNKs helps even non-async processing, and the new code uses the length heavily.

          We might even get rid of the initial length (or set it to -1) for OP_WRITE_BLOCK if we want to move to streaming the block to the DN in parallel with the client's writes (instead of transferring the whole block inside endBlock()).

          Doug Cutting added a comment -

          > Btw, these are not really packets in any sense; these are streams with typical lengths of 100s of MB.

          Well, there's long been an interest in experimenting with flow control in HDFS. So we might, e.g., break HDFS reads and writes into 64k chunks, waiting for an ACK after each, rather than streaming entire blocks. I'd assumed that was the purpose of these packets. Even if that wasn't your intent, they will make this possible. Currently we rely on the TCP stack to handle flow control, but things may be better if it moved to the application level.

          Raghu Angadi added a comment -

          Attaching an implementation of readBuffer() that handles the retries similarly to ChecksumFileSystem. I am planning to use this in my development. IMHO the complexity of this should be compared with what is required for HADOOP-1470 (both under fs and dfs).

          Doug Cutting added a comment -

          > complexity of this should be compared with what is required for HADOOP-1470

          Or perhaps this can be used as a template for the generic version. The only DFS-specific bits are in the catch clause, and they could be factored into an abstract method, no?

          Owen O'Malley added a comment -

          I realized that we need some extension to FileSystem or DistributedFileSystem that lets us get the checksum information for a file. The most convenient would be:

          FileSystem :
            /** Return a single checksum for the entire file. It may be either a checksum of block 
                 checksums or a single checksum of the entire contents. */
            byte[] getChecksum(Path filename);
          

          Otherwise the distributed file cache won't have any convenient way to detect dirty files in the file cache. Currently the distributed file cache uses an md5 of the crc file, but clearly the crc file is going away...
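
          As an illustration only, a dirty check built on the proposed getChecksum (this method does not exist in FileSystem today; the helper class is hypothetical):

          import java.io.IOException;
          import java.util.Arrays;
          import org.apache.hadoop.fs.FileSystem;
          import org.apache.hadoop.fs.Path;

          class DistributedCacheCheckSketch {
            /** Dirty check using the proposed FileSystem#getChecksum(Path), not an existing API. */
            static boolean isDirty(FileSystem fs, Path file, byte[] cachedChecksum) throws IOException {
              byte[] current = fs.getChecksum(file);   // proposed method from the comment above
              return !Arrays.equals(current, cachedChecksum);
            }
          }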

          Doug Cutting added a comment -

          > Otherwise the distributed file cache won't have any convenient way to detect dirty files in the file cache.

          Do we really want to force all FileSystem implementations to implement public checksums? Maybe we should, but we might instead force them all to implement modification times, which should work for this too (HADOOP-1377). Checksums are stronger here, but are considerably more expensive to compute if not directly supported by the FileSystem. The combination of date+length is usually good enough for cache invalidation.

          Raghu Angadi added a comment -

          +1 for lightweight invalidation. From my understanding of the filecache from Owen and Mahadev, the filecache requirements are not very strict (that's good). It does not really validate the cache every time the file is read. What it wants to know is whether a given file has changed between the time the Job tracker starts and the task tracker fetches the same file for the first time.

          That said, DFS can provide something like getChecksum().

          Raghu Angadi added a comment -

          Brief outline of when an upgrade is considered complete and how a manual override works:

          • Upgrade process starts when normal safemode conditions are met.
          • Any datanode that is already registered or heartbeats after the upgrade starts will be asked to perform datanode upgrade.
          • Without manual override, Namenode waits for all the nodes that were known to complete their upgrade.
          • Every few minutes it prints a brief message in the namenode log about the current status.
          • The brief report lists the datanodes that have not finished if there are only a handful left (maybe <= 10?).
          • If some datanodes go down after the upgrade starts, the automatic upgrade might not finish at all.

          Manual override:

          • When an admin notices an upgrade that seems to be stuck, a 'detailed report' can be requested.
          • A detailed report iterates through all the blocks and checks how many blocks belong to the following categories:
            1. At least minReplicas replicas placed on upgraded datanodes.
            2. All the replicas placed on upgraded nodes.
            3. None of the replicas placed on upgraded nodes.
          • Based on the above data, an admin can decide to either manually stop the upgrade (if there are no blocks with zero upgraded replicas) or, alternatively, bring up a dead datanode.
          • Detailed report and manual override can be done through new dfsadmin commands that will be added.
          Konstantin Shvachko added a comment -

          > A detailed report iterates through all the blocks and checks how many blocks belong to following categories

          1. The categories you define do not add up to the complete number of blocks. If I counted blocks I'd define
            the following categories:
            • Fully replicated blocks (r >= target replication);
            • minimally-replicated blocks (minReplicas <= r < target replication);
            • under-replicated blocks (r < minReplicas)
          2. But I'd rather report it in more general terms, like an overall percentage of the upgrade actually completed.
            For the CRC upgrade it is the percentage of blocks that have at least the minimal # of replicas;
            other distributed upgrades can measure their completeness in different terms.
          Sameer Paranjpye added a comment -

          > Without manual override, Namenode waits for all the nodes that were known to complete their upgrade

          Why should upgrade completion depend on the set of Datanodes present? Datanodes can come and go; what matters is the number of blocks that have been upgraded. Wouldn't it be better to declare victory when all blocks have been upgraded? All blocks may not be available, of course, and in the event of that occurring the manual override can be used.

          Raghu Angadi added a comment -

          Yes, we could check for completion w.r.t. blocks every few minutes, and if all the replicas are upgraded, we should get out of the upgrade even if some datanodes have not completed their upgrade. Will do that.

          Raghu Angadi added a comment -

          To calculate the detailed report (as in Konstantin's comment above, % of fullyReplicated blocks etc.), we need to iterate over all the blocks. If we do it as a straight iteration, it will lock the namenode for a long time. Instead, I am planning to iterate over all the files and look at the blocks belonging to each file. This will lock the namenode once for each block (to get file info). This will take more CPU but will not affect the responsiveness of the namenode. Feedback is welcome. During this operation, the namespace does not change since we are still in SafeMode.
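
          A rough sketch of that per-file scan (the interfaces below are illustrative stand-ins for namenode internals, not real classes):

          import java.util.List;

          /** Hypothetical sketch of the per-file report scan described above. */
          class UpgradeReportSketch {
            interface Namespace {
              List<String> listFiles();                 // all file paths
              List<String> blocksOf(String file);       // block ids of one file
            }
            interface Classifier {
              void classify(String blockId);            // counts the replica categories
            }

            static void scan(Namespace ns, Classifier counters) {
              for (String file : ns.listFiles()) {
                List<String> blocks;
                synchronized (ns) {                     // hold the lock briefly, once per file
                  blocks = ns.blocksOf(file);
                }
                for (String b : blocks) {
                  counters.classify(b);
                }
              }
            }
          }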

          Raghu Angadi added a comment -

          Attaching BlockLevelCrc-07032007.patch

          This pretty much works. Sort with validation passed on 500 nodes. The upgrade works but is not very well tested. This is a good patch for review and for running a test cluster with fresh data.

          To apply this patch, you need to apply the patch in HADOOP-1286 first. This patch does not use InputChecker yet (HADOOP-1470).

          We are still testing this. Haven't done much performance testing yet. Sort on 500 nodes runs at comparable speed.

          This is still work in progress (close to finishing).
          Comments are welcome.

          Raghu Angadi added a comment -

          Another patch. More upgrade-related changes. Added a 'dfsadmin -crcUpgrade' command.

          Raghu Angadi added a comment -

          Looks like fixing pread needs more changes than I imagined. Is it ok if I make sure that the current patch is not a regression and fix pread properly as a follow-up Jira? These are some of the issues I see:

          1. There is no synchronization around the call to chooseDataNode(), updating deadNodes, etc.
          2. Inside chooseDataNode, if no datanode could be found, it does openInfo(). But the newly updated information is not used in the next iteration of the loop.
          3. A new call to openInfo() might fetch only part of the list of blocks.
          4. Ideally the method that invokes chooseDataNode should retry chooseDataNode so that chooseDataNode can be called with updated block locations.
          5. Making chooseDataNode() synchronized will trigger a findBugs warning since it can sleep for 3 seconds.
          6. We need to support the case where a file might stay open for a very long time (one hour?) with many simultaneous preads at the same time.

          For now, I can make sure that fetchBlockByteRange() (used by pread) synchronizes accesses to deadNodes etc., so that it does not mess up the InputStream's state.

          Raghu Angadi added a comment -

          Another patch with pread fixes. Will remove some of my buffering changes in the current patch to minimize conflicts with HADOOP-1470

          Raghu Angadi added a comment -

          Yet another patch. This applies after patches for HADOOP-1286 and HADOOP-1470 are applied.

          It has the generic checksum patch but does not actually use the generic checker inside DFS. I am adding that.

          Raghu Angadi added a comment -

          The checksum code in this patch is new but has been running for a couple of months. HADOOP-1470 is also very new. Is there any advantage to letting both exist and merging them in the very near future once both stabilize? I am not sure it is such a bad idea.

          Raghu Angadi added a comment -

          More patch clean up.

          Raghu Angadi added a comment -

          Another patch. I had left a debug sleep in BlockCrcUpgrade.java.

          Raghu Angadi added a comment -

          Attaching the latest patch (07062007) for weekend perusal.

          This one extends FSInputChecker for the BlockReader class in DFSClient.

          This does not use FSOutputSummer.

          Doug, I don't think it is really necessary for my application. It probably removes 5 lines and adds whatever is required for using the interface. Also, for writes smaller than bytesPerChecksum, FSOutputSummer buffers once more. This extra buffering is sort of required for InputChecker but not really required while writing, since writing is very simple. I hope that is ok. I wonder how big map-reduce writes are?

          Raghu Angadi added a comment -

          Another patch, with a couple of fixes to the use of InputChecker.

          Doug Cutting added a comment -

          > This does not use FSOutputSummer.

          And consequently it can have bugs that FSOutputSummer does not. One I can see is that single-byte writes are implemented inefficiently. These are common for a DataOutputStream, e.g., under writeInt(), writeLong(), writeFloat(), etc.

          Raghu Angadi added a comment -

          > One I can see is that single-byte writes are implemented inefficiently.

          True. I did the same for read(int) and write(int) so as not to duplicate the logic in the normal read() and write(). I will use a pre-allocated one-byte buffer instead of allocating one each time. Even if I use FSOutputSummer, I still need this since DFSOutputStream needs to override write() anyway (in each write, it needs to decide whether we need to close the current block and open a new one).
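
          For what it's worth, a minimal sketch of the pre-allocated one-byte buffer approach (the class name is illustrative):

          import java.io.IOException;
          import java.io.OutputStream;

          /** Sketch: single-byte writes reuse one pre-allocated buffer and delegate to the chunked path. */
          abstract class OneByteWriteSketch extends OutputStream {
            private final byte[] oneByteBuf = new byte[1];

            @Override
            public void write(int b) throws IOException {
              oneByteBuf[0] = (byte) b;          // no per-call allocation
              write(oneByteBuf, 0, 1);           // delegate to the multi-byte write
            }

            @Override
            public abstract void write(byte[] buf, int off, int len) throws IOException;
          }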

          Doug Cutting added a comment -

          > Even if I use FSOutputSummer, I still need this since DFSOutputStream needs to override write() anyway (in each write, it needs to decide whether we need to close the current block and open a new one).

          I don't follow. If block sizes are always a multiple of bytesPerChecksum, as we agreed, then a chunk should never cross a block boundary, right?

          Also, using a one-byte-buffer may still impact performance, since it must call multiple methods per call to write(int). It can make a significant performance difference if write(int) does much more than check a buffer boundary and buffer[count++] = byte.

          Raghu Angadi added a comment -

          > I don't follow. If block sizes are always a multiple of bytesPerChecksum, as we agreed, then a chunk should never cross a block boundary, right?

          DFSOutputStream needs to have its own write() in any case, since it needs to decide how much can be written to the current block and write the rest of the data to the next block. This is the case even if the block size is a multiple of bytesPerChecksum. So DFSOutputStream needs to have a 'BlockWriter' class that subclasses OutputSummer (similar to BlockReader in DFSInputStream).

          Regarding enforcing the block size to be a multiple of bytesPerChecksum, I did not know whether a consensus was reached. I am more than happy to enforce it. I think Owen thought it is not needed yet.

          > Also, using a one-byte-buffer may still impact performance, since it must call multiple methods per call to write(int).
          > It can make a significant performance difference if write(int) does much more than check a buffer boundary and
          > buffer[count++] = byte.

          It depends on what we are aiming to optimize. Yes, this write has a couple of extra calls (mainly checksum.update()).
          If you think we have such a large percentage of single-byte writes, then this might just be part of a bigger performance issue.

          Raghu Angadi added a comment -

          > DFSOutputStream needs to have its own write() in any case, since it needs to decide how much can be written to the current block and write the rest of the data to the next block. This is the case even if the block size is a multiple of bytesPerChecksum. So DFSOutputStream needs to have a 'BlockWriter' class that subclasses OutputSummer (similar to BlockReader in DFSInputStream).

          I was mistaken; this is not correct. Multiple blocks can be handled inside writeChunk(). Maybe we should also change writeChunk() to pass in the checksum, just like readChunk() takes the checksum in the same call.

          Doug Cutting added a comment -

          > Regd, enforcing block size to be multiple of bytesperchecksum, I did not know if there was a consensus reached.

          I can't recall hearing a reason not to enforce this, if it proves advantageous. This seems like a case where it might be advantageous. HDFS should be able to assume that calls to writeChunk() are block-aligned.

          Raghu Angadi added a comment -

          My current patch (which uses FSOutputSummer) throws an IOException if blockSize is not a multiple of bytesPerChecksum. Looks like some current tests will fail under such a policy. Please let me know what we want to do here (a rough sketch of the check follows the list):

          1. fix the tests.
          2. adjust blockSize with a warning to match the condition.
          3. don't enforce this condition, do whatever is necessary while writing.
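
          For reference, a minimal sketch of the check in question (purely illustrative; the real patch may word the message differently):

          import java.io.IOException;

          class BlockSizeCheckSketch {
            /** Reject block sizes that are not a multiple of bytes.per.checksum. */
            static void checkBlockSize(long blockSize, int bytesPerChecksum) throws IOException {
              if (bytesPerChecksum <= 0 || blockSize % bytesPerChecksum != 0) {
                throw new IOException("io.bytes.per.checksum (" + bytesPerChecksum +
                    ") must be positive and evenly divide the block size (" + blockSize + ")");
              }
            }
          }
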
          Sameer Paranjpye added a comment -

          What tests are these? Existing unit tests or new tests introduced from block CRC work?

          Either way I'd suggest fixing the tests.

          Raghu Angadi added a comment -

          testfilesystem is one of them. I think fixing the tests will be pretty straightforward as soon as I find which tests depend on this. My preference is also to fix the tests.

          Hairong Kuang added a comment -

          When I tested the block-level crc upgrade, I set a file's replication factor to 4. Afterwards I manually corrupted 3 out of 4 replicas of the crc file and then upgraded to the block-level crc dfs. Although there was one correct crc replica before the upgrade, the current block upgrade algorithm does not handle this case well, and reading the file gets a ChecksumError after the upgrade.

          It would be nice if the block upgrade kept all the crc replicas instead of taking the majority when there is a crc mismatch. For example, for data block A, if we have replicas A1, A2, and A3, and crc replicas CRC1, CRC2, and CRC3, then we can have Ai create its meta file by reading CRCi (i=1,2,3). To form a one-to-one mapping from replica Ai to CRCi, we can sort all the data replicas and sort all the crc replicas respectively.
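
          A rough sketch of the pairing idea (sorting both replica lists by, say, datanode id; everything here is illustrative):

          import java.util.ArrayList;
          import java.util.Collections;
          import java.util.List;

          class ReplicaPairingSketch {
            /** Pair the i-th data replica with the i-th crc replica after sorting both lists. */
            static List<String[]> pair(List<String> dataReplicaNodes, List<String> crcReplicaNodes) {
              List<String> data = new ArrayList<String>(dataReplicaNodes);
              List<String> crc = new ArrayList<String>(crcReplicaNodes);
              Collections.sort(data);                      // same ordering criterion on both sides
              Collections.sort(crc);
              List<String[]> pairs = new ArrayList<String[]>();
              int n = Math.min(data.size(), crc.size());
              for (int i = 0; i < n; i++) {
                pairs.add(new String[] { data.get(i), crc.get(i) });   // Ai <-> CRCi
              }
              return pairs;
            }
          }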

          Raghu Angadi added a comment -

          Yet another patch. This needs the patches from HADOOP-1564, HADOOP-1286, and HADOOP-1470. Let me know if you want the combined patch.

          The current patch uses FSOutputSummer and has a few more fixes.

          Regarding the bytes.per.checksum mismatch, only TestSmallBlock failed. I fixed it by setting bytes.per.checksum to 1, since its block size is 1 (a sketch of that change is below). The 'testfilesystem' failure I saw earlier was because of my config.
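
          A minimal sketch of that test fix, assuming the configuration keys are io.bytes.per.checksum and dfs.block.size (both key names are assumptions here, not quoted from the test):

          import org.apache.hadoop.conf.Configuration;

          // Hypothetical sketch of the TestSmallBlock fix: shrink the checksum chunk
          // so that it still divides the (1-byte) block size evenly.
          class SmallBlockTestConfig {
            static Configuration configure() {
              Configuration conf = new Configuration();
              conf.setInt("io.bytes.per.checksum", 1); // assumed key name
              conf.setLong("dfs.block.size", 1);       // assumed key name
              return conf;
            }
          }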

          Raghu Angadi added a comment -

          Attaching another patch. This requires the first patch (attached on July 11th) from HADOOP-1597. The CRC upgrade completed successfully on a very large cluster.

          One thing remaining is the 'offline upgrade' (i.e. when a datanode joins after the cluster-wide upgrade finishes). Konstantin and I are working on the changes required for both HADOOP-1597 and this patch.

          Raghu Angadi added a comment -

          Finally attaching the patch. It needs the patch from HADOOP-1597.

          All the tests pass, and the patch has been tested quite a bit. The upgrade also ran on a large cluster and tested well.

          The following files are added under dfs/: DataChecksum.java, BlockCrcUpgrade.java, and ChecksumDistributedFileSystem.java.
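
          As a rough illustration of what per-chunk checksumming looks like (this is not the code in DataChecksum.java, just a sketch using plain java.util.zip.CRC32):

          import java.util.zip.CRC32;

          // Sketch only: compute one CRC32 value per bytesPerChecksum-sized chunk of
          // a block's data, which is the granularity at which checksums are kept.
          class ChunkedCrcExample {
            static long[] crcPerChunk(byte[] data, int bytesPerChecksum) {
              int chunks = (data.length + bytesPerChecksum - 1) / bytesPerChecksum;
              long[] crcs = new long[chunks];
              CRC32 crc = new CRC32();
              for (int i = 0; i < chunks; i++) {
                int off = i * bytesPerChecksum;
                int len = Math.min(bytesPerChecksum, data.length - off);
                crc.reset();
                crc.update(data, off, len);
                crcs[i] = crc.getValue();
              }
              return crcs;
            }
          }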

          There is an observed slowdown of maybe 5-10% (the smaller the cluster, the larger the difference) in sort benchmarks. I am pretty sure it is related to the buffering changes made in this patch; I just did not get enough time to investigate them well. I am pretty sure this can be fixed and will be looking into it next. Most of the difference also shows up in randomWriter.

          I will write up more in the upgrade guide for cluster admins. It will include details on how to monitor progress and how to handle some error situations.

          Feedback is welcome.

          Raghu Angadi added a comment -

          Forgot to tick the license box.

          Raghu Angadi added a comment -

          Submitting the patch. The automated Hudson build will fail since this patch depends on HADOOP-1597.

          Raghu Angadi added a comment -

          02.patch: this patch includes an indentation fix in DistributedFileSystem.java. What used to be an inner class becomes the main class in the file. I did not change the indentation earlier, to minimize conflicts with other patches I was working with (e.g. HADOOP-1470 and HADOOP-1286).

          Hadoop QA added a comment -

          -1, could not apply patch.

          The patch command could not apply the latest attachment http://issues.apache.org/jira/secure/attachment/12361744/HADOOP-1134-02.patch as a patch to trunk revision r555813.

          Console output: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/404/console

          Please note that this message is automatically generated and may represent a problem with the automation system and not the patch.

          Raghu Angadi added a comment -

          03.patch: a minor change in a status message.

          Doug Cutting added a comment -

          I just committed this. Thanks, Raghu!

          Raghu Angadi added a comment -

          Thanks to Konstantin, Hairong, Dhruba, Nigel, Doug, Sameer, and others for helping make this happen. I am currently looking into the performance regression.

          Hadoop QA added a comment -

          -1, could not apply patch.

          The patch command could not apply the latest attachment http://issues.apache.org/jira/secure/attachment/12361802/HADOOP-1134-03.patch as a patch to trunk revision r556754.

          Console output: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/423/console

          Please note that this message is automatically generated and may represent a problem with the automation system and not the patch.

          Nigel Daley added a comment -

          Raghu, please attach the latest design document and its test plan to this Jira.

          dhruba borthakur added a comment -

          Raghu, it would be nice if we could have the latest design documents attached to this JIRA.

          Raghu Angadi added a comment -

          The latest design doc is attached.

          Raghu Angadi added a comment -

          Attached the test plan (PDF).


            People

            • Assignee:
              Raghu Angadi
              Reporter:
              Raghu Angadi
            • Votes:
              0
              Watchers:
              2
