Hadoop HDFS – HDFS-2699

Store data and checksums together in block file

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

      Description

      The current implementation of HDFS stores the data in one block file and the metadata (checksum) in another block file. This means that every read from HDFS actually consumes two disk iops, one to the data file and one to the checksum file. This is a major problem for scaling HBase, because HBase is usually bottlenecked on the number of random disk iops that the storage hardware offers.
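      To make the cost concrete, the sketch below shows roughly what a checksummed positional read has to do today: one seek into the block (data) file plus a second seek into the separate checksum (.meta) file at a derived offset. The constants and the meta-file layout here are illustrative assumptions, not the DataNode's exact code.

        import java.io.IOException;
        import java.nio.ByteBuffer;
        import java.nio.channels.FileChannel;

        // Illustrative only: constants and the .meta layout are assumptions.
        static final int BYTES_PER_CHECKSUM = 512; // io.bytes.per.checksum
        static final int CHECKSUM_SIZE = 4;        // one CRC per chunk
        static final int META_HEADER_LEN = 7;      // hypothetical .meta header size

        static void checksummedPread(FileChannel dataFile, FileChannel metaFile,
                                     long offset, int len) throws IOException {
          long firstChunk = offset / BYTES_PER_CHECKSUM;
          long lastChunk  = (offset + len - 1) / BYTES_PER_CHECKSUM;

          // 1st disk iop: the data itself, rounded out to chunk boundaries
          // (a real implementation would clamp to the block length).
          long dataStart = firstChunk * BYTES_PER_CHECKSUM;
          long dataEnd   = (lastChunk + 1) * BYTES_PER_CHECKSUM;
          ByteBuffer data = ByteBuffer.allocate((int) (dataEnd - dataStart));
          dataFile.read(data, dataStart);

          // 2nd disk iop: the matching checksums, at a derived offset in the
          // separate .meta file -- this is the extra seek this issue is about.
          long metaOffset = META_HEADER_LEN + firstChunk * CHECKSUM_SIZE;
          ByteBuffer sums = ByteBuffer.allocate(
              (int) (lastChunk - firstChunk + 1) * CHECKSUM_SIZE);
          metaFile.read(sums, metaOffset);

          // ... verify each chunk against its CRC, then return the requested range.
        }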

        Issue Links

          Activity

          dhruba borthakur added a comment -

          The number of random reads shown via iostat is almost twice the iops issued by HBase. Each HBase random io translates to a positional read (pread) to HDFS.

          In my workload, HBase is issuing 300 preads/sec. iostat on the machine shows 600 reads/sec. I switched off "verifyChecksum" in the pread calls, and that reduces the iops (via iostat) to about 350/sec, thus validating the claim that storing data and checksum in two different files is very costly for an iops-bound workload.

          Any thoughts on how we can put data and checksums together on the same block file?

          Luke Lu added a comment -

          The number of random reads shown via iostat is almost twice the iops issued by HBase. Each HBase random io translates to a positional read (pread) to HDFS.

          As I mentioned in our last conversation, you can embed an application-level checksum in the HBase block (a la Hypertable) and turn off verifyChecksum in preads. You'd need HFile v3 for this, of course.

          Any thoughts on how we can put data and checksums together on the same block file?

          As discussed in HADOOP-1134, inline checksums not only make the code more complex, but also make in-place upgrade a lot more expensive (you have to copy the content). We can solve the latter by supporting two block formats simultaneously, at the expense of code complexity.

          Luke Lu added a comment -

          You'd need HFile v3 for this

          Or come up with new, checksum-carrying variants of the compression codecs (including none) that don't have checksums.

          dhruba borthakur added a comment -

          There are various alternatives, like the one you proposed. The advantage of an application-level checksum (at the HBase block level) is that it sounds easy to do. The disadvantage is that HDFS still has to generate/store checksums to periodically validate data that has not been accessed for a long time.

          > by supporting two block formats simultaneously at the expense of code complexity
          Are you saying that the same data is stored in two places, one in the current format and another in the format with inline checksums?

          dhruba borthakur added a comment -

          Another option that I am going to try is to fadvise away pages from data files (because those are cached in the HBase cache anyway) so that more file system cache is available to cache data from checksum files. Do people think this is a good idea?
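          A minimal sketch of that idea, assuming a small JNI shim around posix_fadvise(2) (the posixFadvise method below is a placeholder, not an existing Hadoop API):

            // Hypothetical: after serving a read, drop the data-file pages from
            // the OS cache (HBase caches that data itself), leaving more room
            // for pages from the checksum files.
            private static final int POSIX_FADV_DONTNEED = 4; // value from Linux fadvise.h

            static void dropDataPages(java.io.FileDescriptor fd, long offset, long len) {
              posixFadvise(fd, offset, len, POSIX_FADV_DONTNEED);
            }

            // Placeholder for a JNI binding to posix_fadvise(2); a real version
            // would forward these arguments to the native call.
            private static void posixFadvise(java.io.FileDescriptor fd,
                                             long offset, long len, int advice) {
              // no-op in this sketch
            }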

          Andrew Purtell added a comment -

          IMHO, this is a design evolution question for HDFS. Is pread a first class use case? How many clients beyond HBase?

          If so, I think it makes sense to consider changes to DN storage that reduce IOPS.

          If not, and/or if changes to DN storage are too radical by consensus, then a means to optionally fadvise away data file pages seems worthwhile to try. There are other considerations that suggest deployments should use a reasonable amount of RAM; this will be available in part for the OS block cache.

          There are other various alternatives: application level checksums, mixed device deployment (flash + disk), etc. Given the above two options, it may be a distraction to consider more options unless there is a compelling reason. (For example, optimizing IOPS for disk provides the same benefit for flash devices.)

          dhruba borthakur added a comment -

          Hi Andrew, all of the points you mentioned could decrease the number of iops needed for a particular workload. But my point is: if we keep the other pieces constant (amount of RAM, amount of flash, etc.), what can we do to reduce iops for the same workload? If the machine has more RAM, I would rather give all of it to the HBase block cache, because accesses from the HBase block cache are more efficient than accesses through the file system cache. The HBase block cache can use better caching policies (because it is closer to the application) than the OS file cache (I am making the same argument for why databases typically do unbuffered io from the filesystem).

          Most disks are getting larger and larger in size (4TB disks are coming next year), but the iops per spindle have not changed much. Given that, shouldn't an efficient storage system strive to optimize iops?

          Andrew Purtell added a comment -

          @Dhruba, yes I agree with you fully. From the HBase point of view optimizing IOPS in HDFS is very important.

          Allen Wittenauer added a comment -

          Would it make sense to make the equivalent of a logging device instead? In other words, put the meta files on a dedicated fast/small disk to segregate them from the actual data blocks? Besides just being able to pick a better storage medium, it might potentially allow for better caching strategies at the OS level (depending upon the OS of course).

          Todd Lipcon added a comment -

          +1 on considering putting them in the same file. Block files already have a metadata header, so we could backward-compatibly support the earlier format without requiring a data rewrite on upgrade (which would be prohibitively expensive).

          Regarding the other ideas, like caching checksums in the buffer cache or on SSD, I think the issue here is that the 0.78% overhead (4/512) still makes for a fairly large checksum size on a big DN. For example, if the application has a dataset of 4TB per node, then even caching just the checksums takes 31GB of RAM. If you're mostly missing HBase's data cache, then you'll probably be missing the checksum cache too (are you really going to devote 30G to it?).
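          For reference, the arithmetic behind those figures, assuming the default 4-byte CRC per 512 bytes of data:

            overhead per byte of data: 4 / 512 ≈ 0.78%
            checksum bytes for a 4 TB dataset: 4 TB × 4 / 512 ≈ 31 GB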

          Scott Carey added a comment -

          That brings up a related question: Why a 4 byte crc per 512 bytes and not per 4096 bytes?

          512 aligns with the old hard drive block size: the physical media had ECC at 512-byte blocks and could not read or write in smaller chunks than that. New hard drives all have 4096-byte blocks and ECC at that granularity – no smaller chunk can be read or written. SSDs use 4096- or 8192-byte blocks these days.

          If the physical media is corrupting blocks, these will most likely be corrupted in 4K chunks. A CRC per 4K decreases the checksum overhead by a factor of 8, increasing the likelihood of finding it in the OS cache if it is in a side file. Now that CRC is accelerated by the processor and very fast, I don't think the overhead of the larger CRC chunk for reads smaller than 4K will matter either.

          Inlining the CRC could decrease seek and OS page-cache overhead a lot. Since most file systems and OSes work on 4K blocks, HDFS could store a 4-byte CRC and 4092 bytes of data in a single OS/disk page (or eight 4-byte CRCs and 4064 bytes of data in a page). This has big advantages: if your data is in the OS page cache, the CRC will be too – one will never be written to disk without the other, nor evicted from cache without the other.
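          As a rough illustration of that layout (a sketch under the stated assumptions, not HDFS code), a writer could pack each 4096-byte page as a 4-byte CRC32 followed by up to 4092 bytes of data, so a chunk's data and its checksum always land in the same OS page:

            import java.nio.ByteBuffer;
            import java.util.zip.CRC32;

            // Hypothetical inline-checksum page layout: [4-byte CRC][<= 4092 bytes data].
            static final int PAGE_SIZE = 4096;
            static final int CRC_SIZE = 4;
            static final int DATA_PER_PAGE = PAGE_SIZE - CRC_SIZE; // 4092

            static ByteBuffer packPage(byte[] data, int off, int len) {
              if (len > DATA_PER_PAGE) {
                throw new IllegalArgumentException("chunk larger than one page");
              }
              CRC32 crc = new CRC32();
              crc.update(data, off, len);
              ByteBuffer page = ByteBuffer.allocate(PAGE_SIZE);
              page.putInt((int) crc.getValue()); // checksum shares the page with its data
              page.put(data, off, len);          // the final page of a block may be short
              page.flip();
              return page;
            }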

          dhruba borthakur added a comment -

          Thanks for your comments Scott, Andrew, Todd and Allen.

          Scott: most of our HBase production clusters have io.bytes.per.checksum set to 4096 (instead of 512).

          Allen: one can put CRCs on a logging device, e.g. BookKeeper perhaps? But at the end of the day, each random io from an HDFS file will still consume two disk iops (one on the HDFS block storage and one on the logging device), won't it? Wouldn't it be better to inline CRC and data?

          If we decide to implement inline CRCs, can we make HDFS support two different data formats and not do any automatic data format upgrade for existing data? Pre-existing data can remain in the older format while newly created files will have data in the new inline-data-and-CRC format. What do people think about this idea?

          Todd Lipcon added a comment -

          The idea of introducing the new format as a backward-compatible option sounds good to me. That's what we did for the CRC32C checksums - new files are written with that checksum algorithm but old files continue to operate with the old one.

          M. C. Srivas added a comment -

          A couple of observations:

          a. If you want to eventually support random IO, then a block size of 4096 is too large for the CRC, as it will cause a read-modify-write cycle on the entire 4K. 512 bytes reduces this overhead.

          b. Can the value of the variable "io.bytes.per.checksum" be transferred from the *-site.xml file into the file properties at the NN at the time of file creation? If someone messes around with it, old files will still work as before.

          dhruba borthakur added a comment -

          Thanks Srivas for your comments.

          > a block size of 4096 is too large for the CRC

          The HBase block size is 16K, the HDFS checksum chunk size is 4K, and the HDFS block size is 256 MB. Which one are you referring to here? Can you please explain the read-modify-write cycle? HDFS does mostly large sequential writes (no overwrites).

          > "io.bytes.per.checksum" be transferred from the *-site.xml
          It is already stored in the DataNode meta file associated with each block. Different HDFS files in the same HDFS cluster can have different io.bytes.per.checksum values.

          M. C. Srivas added a comment -

          @dhruba:

          >> a block size of 4096 is too large for the CRC

          > The HBase block size is 16K, the HDFS checksum chunk size is 4K, and the HDFS block size is 256 MB. Which one are you referring to here? Can you please explain the read-modify-write cycle? HDFS does mostly large sequential writes (no overwrites).

          The CRC block size (that is, the contiguous region of the file that a CRC covers). Modifying any portion of that region requires that the entire data for the region be read in, the CRC recomputed for that entire region, and the entire region written out again.

          Note that it also introduces a new failure mode ... data that was previously written safely a long time ago could now be deemed "corrupt" because the CRC is no longer good due to a minor modification during an append. The failure scenario is as follows:

          1. A thread writes to a file and closes it. Let's say the file length is 9K. There are 3 CRCs embedded inline – one for 0-4K, one for 4K-8K, and one for 8K-9K. Call the last one CRC3.

          2. An append happens a few days later to extend the file from 9K to 11K. CRC3 is now recomputed for the 3K-sized region spanning offsets 8K-11K and written out as CRC3-new. But there is a crash, and the entire 3K is not all written out cleanly (CRC3-new and some of the data are written out before the crash – all 3 copies crash and recover).

          3. A subsequent read on the region 8K-9K now fails with a CRC error ... even though the write was stable and used to succeed before.

          If this file was the HBase WAL, wouldn't this result in a data loss?

          Todd Lipcon added a comment -

          Modifying any portion of that region will require that the entire data for the region be read in, and the CRC recomputed for that entire region and the entire region written out again

          But the cost of random-reading 4K is essentially the same as the cost of reading 512 bytes. Once you seek to the offset, the data transfer time is insignificant.

          Plus, given the 4KB page size used by Linux, all IO is already at this granularity.

          An append happens a few days later to extend the file from 9K to 11K. CRC3 is now recomputed for the 3K-sized region spanning offsets 8K-11K and written out as CRC3-new. But there is a crash...

          This is an existing issue regardless of whether the checksums are interleaved or separate. The current solution is that we allow a checksum error on the last "checksum chunk" of a file in the case that it's being recovered after a crash – iirc only in the case that all replicas have this issue. If there is any valid replica, then we use that and truncate/rollback the other files to the sync boundary.

          M. C. Srivas added a comment -

          @Todd: no one is arguing that putting the CRC inline is not beneficial wrt seek time. Recalculating the CRC over a 4K block is substantially slower than over a 512-byte block (256 bytes vs 2K on average is roughly an 8x factor). Imagine appending continuously to the HBase WAL with the 128-byte records that you mentioned in another thread ... the CPU burn will be much worse with 4K CRC blocks.

          Secondly, the disk manufacturers guarantee only 512-byte atomicity on disk. Linux doing a 4K block write guarantees almost nothing wrt atomicity of that 4K write to disk. On a crash, unless you are running some sort of RAID or data journal, there is a likelihood of the 4K block that's in flight getting corrupted.

          Suresh Srinivas added a comment -

          Dhruba, say CRC inlining is done and reduces the number of iops by half. Does it solve the iops bottleneck for HBase, or does it just provide some relief?

          dhruba borthakur added a comment -

          Suresh: it provides great relief for an HBase workload that is mostly disk-iops bound. HBase performance for many of my workloads is bottlenecked by the number of disk iops you can do. That is the precise reason why local short-circuit reads helped HBase performance. This one (with inline CRCs) almost doubles HBase performance for one workload that I am testing here.

          Hari Mankude added a comment -

          Using application-level checksums, as suggested by Luke, at the application block level (16K in HBase?) and disabling HDFS checksums might be a better approach.

          Another possibility is to see if T10-DIX/DIF can be used for data verification if there is appropriate hardware support.

          Todd Lipcon added a comment -

          Dhruba had a good idea in HBASE-5074 – I think it's similar to Luke's idea.

          It would be nicer to provide for fast random IO from HDFS without having to push checksumming responsibility up to the application. But, in the short/medium term, it's probably the right solution to make HBase do its own checksums and think about HDFS changes from a longer term vantage point.

          Ramkumar Vadali added a comment -

          How about having InputStream/OutputStream wrapper classes that deal with checksums for the applications? Then we don't need to change HDFS, and any application interested in having inline checksums can just use the wrappers.
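          A minimal sketch of such a wrapper (the [chunk bytes][4-byte CRC32] framing and the class name are made-up assumptions, not an existing Hadoop API); a matching FilterInputStream would read the same framing and verify each chunk:

            import java.io.FilterOutputStream;
            import java.io.IOException;
            import java.io.OutputStream;
            import java.util.zip.CRC32;

            // Hypothetical wrapper: appends a CRC32 after every CHUNK bytes, so the
            // checksums ride inline in the application's own stream, not in HDFS.
            public class ChecksummedOutputStream extends FilterOutputStream {
              private static final int CHUNK = 4096;
              private final byte[] buf = new byte[CHUNK];
              private int used = 0;

              public ChecksummedOutputStream(OutputStream out) { super(out); }

              @Override public void write(int b) throws IOException {
                buf[used++] = (byte) b;
                if (used == CHUNK) flushChunk();
              }

              private void flushChunk() throws IOException {
                CRC32 crc = new CRC32();
                crc.update(buf, 0, used);
                out.write(buf, 0, used);
                int sum = (int) crc.getValue();
                out.write(sum >>> 24); out.write(sum >>> 16); out.write(sum >>> 8); out.write(sum);
                used = 0;
              }

              @Override public void close() throws IOException {
                if (used > 0) flushChunk(); // the trailing short chunk gets its own CRC
                super.close();
              }
            }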

          Hari Mankude added a comment -

          One of the issues with inlining checksums (512 bytes of data + 4-byte checksum) is that it would make the application layer slower, since applications might have to copy data to get contiguous data-only chunks. The tradeoff is between checksum disk iops and application-side copies on the read path. Similarly, writes will have to inline checksum bytes into the data stream buffers.

          Todd Lipcon added a comment -

          We already basically inline them on the wire in 64K chunks. But yes, I agree that if we inlined a checksum after every 512 bytes of data it would be expensive.

          This is one of the advantages of doing it at the HBase layer - we're already reading an HFile "block" at a time (~64kb) - so putting the checksums in the header of the block doesn't add any extra read ops or non-contiguous IO.
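          For illustration, verification of such a block could look like the sketch below (the [4-byte CRC32][payload] framing is a made-up assumption, not the actual HFile layout); since the whole block is read contiguously anyway, checking the checksum costs no extra disk iop:

            import java.util.Arrays;
            import java.util.zip.CRC32;

            // Hypothetical block framing: a 4-byte CRC32 of the payload stored in
            // the block header, verified after the block is read in one piece.
            static byte[] verifyBlock(byte[] block) {
              int stored = ((block[0] & 0xff) << 24) | ((block[1] & 0xff) << 16)
                         | ((block[2] & 0xff) << 8)  |  (block[3] & 0xff);
              CRC32 crc = new CRC32();
              crc.update(block, 4, block.length - 4);
              if ((int) crc.getValue() != stored) {
                throw new IllegalStateException("application-level checksum mismatch");
              }
              return Arrays.copyOfRange(block, 4, block.length); // payload only
            }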

          Scott Carey added a comment -

          @Srivas:

          If you want to eventually support random-IO, then a block size of 4096 is too large for the CRC, as it will cause a read-modify-write cycle on the entire 4K. 512-bytes reduces this overhead.

          With CRC hardware-accelerated now, this is not a big overhead. Without hardware acceleration it is ~800MB/sec for 4096-byte chunks, or 200,000 chunks per second – about 25% of one CPU at 200MB/sec of writes. With hardware acceleration this drops by a factor of 4 to 8.

          This is beside the point: a paranoid user could configure smaller CRC chunks and test that. I'm suggesting that 4096 is a much saner default.

          @Srivas:

          Secondly, the disk manufacturers guarantee only 512-byte atomicity on disk. Linux doing a 4K block write guarantees almost nothing wrt atomicity of that 4K write to disk. On a crash, unless you are running some sort of RAID or data journal, there is a likelihood of the 4K block that's in flight getting corrupted.

          Actually, disk manufacturers are all using 4096-byte atomicity these days (starting with 500GB platters for most manufacturers; see the links below). HDFS should not target protecting power_of_two_bytes of data with a checksum, but rather (power_of_two_bytes - checksum_size) of data, so that the hardware atomicity (and OS page cache) lines up exactly with the HDFS checksum chunk + inlined CRC.

          @Srivas:

          2. An append happens a few days later to extend the file from 9K to 11K. CRC3 is now recomputed for the 3K-sized region spanning offsets 8K-11K and written out as CRC3-new. But there is a crash, and the entire 3K is not all written out cleanly

          This can be avoided entirely.
          A. The OS and hardware can avoid partial page writes. ext4 and others can avoid partial page writes; the OS only flushes a page at a time, and hardware these days writes blocks in atomic 4096-byte chunks.
          B. The inlined CRC can be done so that a single 4096-byte OS page contains all of the data and the CRC in one atomic chunk, and the CRC and its corresponding data are therefore never split across pages.

          Under the above conditions, the performance would be excellent, and the data safety higher than in the current situation or with any application-level CRC (unless the application is inlining the CRC to prevent splitting the data and CRC across pages).

          About the transition to 4096-byte blocks on hard drives ("Advanced Format" disks):
          http://www.zdnet.com/blog/storage/are-you-ready-for-4k-sector-drives/731
          http://en.wikipedia.org/wiki/Advanced_Format
          http://www.seagate.com/docs/pdf/whitepaper/tp613_transition_to_4k_sectors.pdf
          http://lwn.net/Articles/322777/
          http://www.anandtech.com/show/2888

          Guo Ruijing added a comment -

          What's the plan for including this improvement?

          It would be nice to include this improvement. When this feature is implemented, we need to consider HDFS upgrade, since the block format is changed.

          The new block format could be:

          BLOCK HEADER:

          1. MAGIC_NUMBER (can be "HDFSBLOCK")
          2. VERSION
          3. CRC/COMPRESSION_TYPE
          4. ONE BLOCK LENGTH (for example 512 byte)
          5. PADDING (optional)
          6. DATA_OFFSET

          BLOCK DATA 1 (for example; the exact format depends on CRC/COMPRESSION_TYPE):
          1. RAW DATA (502 bytes)
          2. DATA LENGTH (2 bytes)
          3. DATA CRC (8 bytes)

          How to stay compatible with the existing format?

          if (a *.meta file exists) {
              use the original format
          } else {
              use the new format
          }
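          A rough sketch of that compatibility check (hypothetical naming and placement; real block and meta file names differ):

            import java.io.File;
            import java.io.IOException;
            import java.io.RandomAccessFile;
            import java.nio.charset.StandardCharsets;

            // Hypothetical format selection: a block that still has a side checksum
            // file uses the original layout; otherwise look for the proposed header's
            // magic number at the start of the block file.
            static boolean usesInlineChecksums(File blockFile, File metaFile)
                throws IOException {
              if (metaFile.exists()) {
                return false; // original format: data file + separate *.meta file
              }
              byte[] magic = new byte[9]; // "HDFSBLOCK" from the proposed header
              try (RandomAccessFile raf = new RandomAccessFile(blockFile, "r")) {
                raf.readFully(magic);
              }
              return new String(magic, StandardCharsets.US_ASCII).equals("HDFSBLOCK");
            }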

            People

            • Assignee: dhruba borthakur
            • Reporter: dhruba borthakur
            • Votes: 0
            • Watchers: 34
