This patch adds a random read test to TestDFSIO. The test provides quite a few options and as a result its mapper is larger than the other mappers in TestDFSIO, so I moved the main implementation of this mapper into its own file.
The patch currently includes a native Java app that performs equivalent random reads on the same set of physical files (more on this later). This gives a baseline to assess HDFS overhead and helps us experiment with different policies (e.g. does it help if we don't close the file after each read, or don't start a new thread for each read?).
A few of the changes in the datanode are temporary. An option to skip the CRC file while serving data is added; this should ideally be a client option.
Instructions on how to run these tests are provided at the bottom of this comment. See the JavaDoc for RandomReadMapperImpl for the various configuration parameters of the random read test.
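The native reader itself is in the patch (NativeFSRandomReader); purely for illustration, its core loop can be sketched in self-contained Java with a hypothetical class name. The `closeEachRead` flag mirrors the open/close-per-read policy question above:

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.Random;

/** Minimal sketch (not the patch's code) of the native random-read
 *  baseline: positioned 4KB reads at random offsets, optionally
 *  reopening the file for every read. */
public class NativeRandomReadSketch {
  static long randomReads(File file, int numReads, int readSize,
                          boolean closeEachRead) throws IOException {
    Random rand = new Random(0);
    byte[] buf = new byte[readSize];
    long bytesRead = 0;
    // When not closing per read, keep one handle open for the whole run.
    RandomAccessFile shared = closeEachRead ? null : new RandomAccessFile(file, "r");
    for (int i = 0; i < numReads; i++) {
      RandomAccessFile in = closeEachRead ? new RandomAccessFile(file, "r") : shared;
      long off = (long) (rand.nextDouble() * (file.length() - readSize));
      in.seek(off);
      in.readFully(buf);
      bytesRead += readSize;
      if (closeEachRead) in.close();
    }
    if (shared != null) shared.close();
    return bytesRead;
  }

  public static void main(String[] args) throws IOException {
    File f = File.createTempFile("blocks", ".dat");
    f.deleteOnExit();
    try (FileOutputStream out = new FileOutputStream(f)) {
      out.write(new byte[1 << 20]);  // 1 MB stand-in for a block file
    }
    System.out.println(randomReads(f, 100, 4096, true));   // prints 409600
    System.out.println(randomReads(f, 100, 4096, false));  // prints 409600
  }
}
```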
- How bad is HDFS random access?
- Random access in HDFS has always had bad PR, though hardly anyone used the interface. Claims/rumours range from "transfers a lot of excess data" (not true) to "we noticed it is 10 times slower than our non-HDFS app" (hard to see how, if the app is I/O bound and/or is doing at least semi-random reads).
- It was good to see HBase successfully use the interface for its speed up. HBase cannot achieve competitive performance without reasonable random access performance in HDFS (for HFile).
- How important is connection caching for pread (positional read)?
- Clearly it saves 1-2 ms of latency (closer to 1 ms). It should not have an effect on throughput or scalability with multiple readers.
- For many loads this latency could be important: with a 10 ms seek, a 10% latency reduction is roughly a 10% throughput increase.
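A back-of-the-envelope illustration of the point above, assuming a seek-bound, single-threaded reader (the 10 ms seek and ~1 ms connection-setup saving are the figures quoted in this comment):

```java
/** Illustrates why shaving ~1 ms of connection setup off a ~10 ms
 *  seek-bound read translates almost directly into throughput. */
public class LatencyThroughput {
  public static void main(String[] args) {
    double seekMs = 10.0;   // disk seek dominates each random read
    double savedMs = 1.0;   // latency saved by caching the connection
    double before = 1000.0 / (seekMs + savedMs);  // reads/sec without caching
    double after  = 1000.0 / seekMs;              // reads/sec with caching
    System.out.printf("throughput gain: %.0f%%%n", (after / before - 1) * 100);
    // prints: throughput gain: 10%
  }
}
```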
- Do checksum files add noticeable overhead?
- Each 64 MB block has a 0.5 MB checksum file. Each read reads a few bytes at the front and a few bytes at a random offset in the file.
- This could cost a seek or two, doubling the seek load.
- Preliminary tests with a 10 GB data set show this is not the case.
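The 0.5 MB figure follows directly from the default checksum parameters (a 4-byte CRC32 per 512 bytes of data, i.e. the default 'io.bytes.per.checksum'):

```java
/** Derives the size of the per-block checksum file from the
 *  default HDFS checksum parameters. */
public class ChecksumSizeSketch {
  public static void main(String[] args) {
    long blockSize = 64L << 20;   // 64 MB HDFS block
    int bytesPerChecksum = 512;   // default io.bytes.per.checksum
    int checksumSize = 4;         // CRC32 is 4 bytes
    long crcFileBytes = blockSize / bytesPerChecksum * checksumSize;
    System.out.println(crcFileBytes);  // prints 524288, i.e. 0.5 MB
  }
}
```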
- Other HDFS specific issues :
- Each read opens and closes the data and checksum files. Does it help to cache them?
- Same with starting a new thread for each read.
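The file-caching question above could be prototyped with a small LRU cache of open handles. A hypothetical sketch (not part of the patch), using `LinkedHashMap` in access order with `removeEldestEntry` for eviction:

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical illustration of caching open block/CRC file handles
 *  instead of reopening them on every read; bounded by LRU eviction. */
public class FileHandleCache {
  private final LinkedHashMap<String, RandomAccessFile> cache;

  public FileHandleCache(final int maxOpen) {
    // Access-ordered map: least recently used handle is evicted first.
    this.cache = new LinkedHashMap<String, RandomAccessFile>(16, 0.75f, true) {
      @Override
      protected boolean removeEldestEntry(Map.Entry<String, RandomAccessFile> e) {
        if (size() > maxOpen) {
          try { e.getValue().close(); } catch (IOException ignored) { }
          return true;
        }
        return false;
      }
    };
  }

  /** Returns a cached open handle, opening the file on a miss. */
  public synchronized RandomAccessFile get(String path) throws IOException {
    RandomAccessFile f = cache.get(path);
    if (f == null) {
      f = new RandomAccessFile(path, "r");
      cache.put(path, f);
    }
    return f;
  }
}
```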
- How well does random access scale?
- I have not done any tests on larger clusters, but plan to.
- I don't see any reason why it would not scale.
The results depend on the hardware more than I expected. The numbers presented here are for a single node cluster with one spindle.
- Single node Hadoop cluster
- Random access over 10 1GB files.
- CPU : Dual core Opteron 2GHz, Memory : 4GB
- Harddisk : 400 GB WD (WD4000YR-01P)
- Kernel : 2.6.9-22.12 64bit (Based on RedHat kernel I think)
- Kernel I/O scheduler : cfq
Preliminary results :
All the tests are done with a single map at a time; it is very important to set "mapred.tasktracker.map.tasks.maximum" to '1'. A single mapper is used to simplify interpretation of the results. The actual commands run are given below the results. Each of the 10 maps performs 500 random reads over one of the 10 files (there is an option to read over all the 10 files). The results vary a bit over runs; usually the first run or the first few accesses are costlier since the OS is still caching file block indexes. All the reads are for 4 KB of data.
|| Description of read || Time for each read in ms ||
| 1000 native reads over block files | |
| Random Read 10x500 | |
| Random Read without CRC | |
| Random Read with 'seek() and read()' | |
| Read with sequential offsets | |
| 1000 native reads without closing files | |
- It was surprising to see that with native reads, not closing the files saves 2 ms per read (increasing to 3 ms with 5000 reads). So closing the file probably affects kernel caching in important ways. I didn't notice such a difference on a similar machine with a 4-disk hardware RAID (both over 10 GB of data made up of 160 64 MB block files).
- The effect of CRC reads is smaller than expected. This might be attributable to the not-so-big range of 10 GB, but such a range might not be far from practical. Even if the range increases, we could easily increase 'io.bytes.per.checksum' to something larger (4 KB or 16 KB). Tests show latency does not increase noticeably until 64 KB or so.
- This implies that inline CRCs or caching of CRC data in the datanode are probably not immediately required.
- Reads with sequential offsets are a good indicator of all the overhead other than hard disk seeks.
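If one wanted to raise the checksum granularity as suggested above, it is a configuration change; a sketch with a 16 KB value (note the setting only applies to files written after the change):

```xml
<!-- hadoop-site.xml: larger checksum chunks mean fewer, larger CRC reads.
     Affects newly written files only; existing files keep their old chunk size. -->
<property>
  <name>io.bytes.per.checksum</name>
  <value>16384</value>
</property>
```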
- Tests on larger clusters
- larger data set
- We would very much appreciate others running the tests on realistic data sets and in different environments.
Running the tests :
The following commands are used to run the tests; comments are inline.
# First start the cluster:
# write 10 files
$ bin/hadoop jar build/hadoop-0.21.0-dev-test.jar TestDFSIO -write -nrFiles 10 -fileSize 1024
# Run random read test :
$ bin/hadoop jar build/hadoop-0.21.0-dev-test.jar TestDFSIO -randomread -nrFiles 10 -fileSize 1024
# Example of the options: perform 1000 reads per map
$ bin/hadoop jar build/hadoop-0.21.0-dev-test.jar TestDFSIO -Dtest.io.randomread.num.reads=1000 -randomread -nrFiles 10 -fileSize 1024
# Make each map iterate over all 10 files instead of one per map
$ bin/hadoop jar build/hadoop-0.21.0-dev-test.jar TestDFSIO -Dtest.io.randomread.num.files=10 -randomread -nrFiles 10 -fileSize 1024
# To test without crc files, set "dfs.datanode.skip.crc" to true and restart the datanode.
# Native FS Reader :
# to create the jar :
$ jar -cvf build/randomread.jar -C build/examples org/apache/hadoop/examples/NativeFSRandomReader*.class
$ bin/hadoop jar build/randomread.jar org.apache.hadoop.examples.NativeFSRandomReader -i tmp/test-blocks -n 1000
# "tmp/test-blocks" contains hard links to all the blocks in datanode directory.
# Run without any options for help on the command line options.