Details

    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: mrv2
    • Labels:
      None
    • Hadoop Flags:
      Reviewed
    • Release Note:
      Enhancement to TestDFSIO to do random read test.

      Description

      We should have at least one random read benchmark that can be run regularly with the rest of the Hadoop benchmarks.

      Please provide benchmark ideas or requirements.

      1. RndRead-TestDFSIO-MR2593-trunk121211.patch
        20 kB
        Dave Thompson
      2. RndRead-TestDFSIO-061011.patch
        20 kB
        Dave Thompson
      3. RndRead-TestDFSIO.patch
        19 kB
        Dave Thompson
      4. HDFS-236.patch
        30 kB
        Raghu Angadi

        Issue Links

          Activity

          Allen Wittenauer added a comment -

          Ha. Cancelling the patch so jenkins isn't freaked out.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12507064/RndRead-TestDFSIO-MR2593-trunk121211.patch
          against trunk revision .

          -1 patch. The patch command could not apply the patch.

          Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/4751//console

          This message is automatically generated.

          Allen Wittenauer added a comment -

          Ping!

          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12507064/RndRead-TestDFSIO-MR2593-trunk121211.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 6 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed unit tests in .

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/1425//testReport/
          Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/1425//console

          This message is automatically generated.

          Dave Thompson added a comment -

          Updated path names for trunk 0.23 (12/12/11). Otherwise, it's the same patch.

          Dave Thompson added a comment -

          Jenkins appears to be triggering a failure because the "delete a file in archive" and "rename a file in archive" unit tests are failing, even though those tests have nothing to do with this patch.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12482070/RndRead-TestDFSIO-061011.patch
          against trunk revision 1136261.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 6 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these core unit tests:
          org.apache.hadoop.cli.TestMRCLI
          org.apache.hadoop.fs.TestFileSystem

          -1 contrib tests. The patch failed contrib unit tests.

          +1 system test framework. The patch passed system test framework compile.

          Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/400//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/400//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/400//console

          This message is automatically generated.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12482070/RndRead-TestDFSIO-061011.patch
          against trunk revision 1135462.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 6 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these core unit tests:
          org.apache.hadoop.cli.TestMRCLI
          org.apache.hadoop.fs.TestFileSystem

          -1 contrib tests. The patch failed contrib unit tests.

          +1 system test framework. The patch passed system test framework compile.

          Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/395//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/395//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/395//console

          This message is automatically generated.

          Todd Lipcon added a comment -

          Sorry, I must have missed the MR precommit when I updated the hudson jobs. I just fixed it and will kick this build again.

          Konstantin Shvachko added a comment -

          This was done to avoid circular project dependencies. Still valid. I tried to kick off the build, but it is trying to find mapreduce under "http://svn.apache.org/repos/asf/hadoop/mapreduce/trunk", which is not there anymore.

          Dave Thompson added a comment -

          Not sure just yet. TestDFSIO is a mapreduce program. I assume a few had some opinions on the matter of location when they moved the whole mr test directory out of hdfs and into mapreduce. Perhaps if it is to be moved back, that could be handled as a separate Jira.

          stack added a comment -

          How hard would it be to pull it around to hdfs, Dave?

          Dave Thompson added a comment -

          Hmmm... It looks like Jenkins is attempting to apply this patch to the HDFS trunk. That is a reasonable assumption given this is an HDFS Jira, and it was the correct place when Raghu created this bug. However, TestDFSIO has since been moved into the mapreduce trunk.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12482070/RndRead-TestDFSIO-061011.patch
          against trunk revision 1134170.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 6 new or modified tests.

          -1 patch. The patch command could not apply the patch.

          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/764//console

          This message is automatically generated.

          Dave Thompson added a comment -

          I agree. I've taken Raghu's random read tests, ported to trunk (above), and tied in the config params to command line (Kihwal's comments). Submitting patch now.

          stack added a comment -

          Reread Raghu's comments above. It's (still) great.

          Kihwal Lee added a comment -
          • Some of the test.io.randomread.* parameters seem to deserve a spot in the command-line args.
          • The buffer size can be used as the read size in random reads. I see no reason to separate the two in the random read mode.
          • By default, one random reader operates on just one file out of N files. Since it already has the ability to limit the number of files that each reader can access, it might be better to make it work on all N files by default.
          Dave Thompson added a comment -

          I've taken Raghu's patch from 6/27/09 with random read TestDFSIO enhancement, and ported it to the latest (now mapreduce) trunk 5/4/11 svn rev 1099590. Patch attached RndRead-TestDFSIO.patch.

          enjoy,
          Dave

          Raghu Angadi added a comment -

          This patch adds a random read test to TestDFSIO. The test provides quite a few options and as a result is larger than the other mappers in TestDFSIO, so I moved the main implementation of this mapper into its own file.

          The patch currently includes a native Java app that performs equivalent random reads on the same set of physical files (more on this later). This gives a baseline to assess HDFS overhead and helps us experiment with different policies (e.g. does it help if we don't close the file after each read, or don't start a new thread?).
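To make the idea of a native baseline concrete, here is a minimal Python sketch of such a reader. It is not the NativeFSRandomReader from the patch: the function name `random_reads` and the `keep_open` switch are made up for illustration, and it relies on `os.pread`, which is only available on Unix-like systems. It times fixed-size positional reads at random offsets over a set of local files, either reopening the file for every read or keeping descriptors open.

```python
import os
import random
import tempfile
import time


def random_reads(paths, num_reads, read_size=4096, keep_open=False, seed=None):
    """Time random positional reads over local files.

    Returns the mean time per read in milliseconds. All names here are
    illustrative; this is not the reader shipped with the patch.
    """
    rng = random.Random(seed)
    open_fds = {}  # path -> file descriptor, used only when keep_open is True
    try:
        start = time.perf_counter()
        for _ in range(num_reads):
            path = rng.choice(paths)
            size = os.path.getsize(path)
            # Pick an offset so that a full read_size read fits when possible.
            offset = rng.randrange(max(1, size - read_size + 1))
            if keep_open:
                fd = open_fds.get(path)
                if fd is None:
                    fd = open_fds[path] = os.open(path, os.O_RDONLY)
                os.pread(fd, read_size, offset)
            else:
                # Open and close around every read, like the default policy.
                fd = os.open(path, os.O_RDONLY)
                try:
                    os.pread(fd, read_size, offset)
                finally:
                    os.close(fd)
        elapsed = time.perf_counter() - start
    finally:
        for fd in open_fds.values():
            os.close(fd)
    return elapsed * 1000.0 / num_reads


if __name__ == "__main__":
    # Tiny demo on throwaway files; real runs would point at block files.
    with tempfile.TemporaryDirectory() as tmp:
        files = []
        for i in range(2):
            p = os.path.join(tmp, f"blk_{i}")
            with open(p, "wb") as f:
                f.write(os.urandom(1024 * 1024))
            files.append(p)
        print("reopen each read:", random_reads(files, 200, seed=1), "ms")
        print("keep files open :", random_reads(files, 200, keep_open=True, seed=1), "ms")
```

On small files like the ones in the demo, everything is served from the page cache, so the numbers only reflect software overhead. Measuring disk seeks requires a data set much larger than memory, as in the 10GB runs described below.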

          A few changes in the datanode are temporary. An option to skip the CRC file while serving data is added. This should ideally be a client option.

          Instructions on how to run these tests are provided at the bottom of this comment. See the JavaDoc for RandomReadMapperImpl for the various configuration parameters of the random read test.

          Concerns :
          =========

          • How bad is HDFS random access?
            • Random access in HDFS always seemed to have bad PR, though hardly anyone used the interface. Claims/rumours range from "transfers a lot of excess data" (not true) to "we noticed it is 10 times slower than our non-hdfs app" (hard to see how, if the app is I/O bound and/or is doing at least semi-random reads).
            • It was good to see HBase successfully use the interface for its speed-up. It cannot achieve competitive performance without reasonable random access performance in HDFS (for HFile).
          • How important is connection caching for pread (positional read)?
            • Clearly it saves 1-2ms of latency (closer to 1ms). It should not have an effect on throughput or scalability with multiple readers.
            • For many loads, this latency could be important. With a 10ms seek, a 10% latency reduction is roughly a 10% throughput increase.
          • Do checksum files add noticeable overhead?
            • Each 64MB block has a 0.5MB checksum file. Each read reads a few bytes at the front and a few bytes at a random offset in the file.
            • This could cost a seek or two, doubling the seek load.
            • The preliminary tests with a 10GB data set show this is not the case.
          • Other HDFS-specific issues:
            • Each read opens and closes the data and checksum files. Does it help to cache them?
            • Same with starting a new thread for each read.
          • How well does random access scale?
            • I have not done any tests on larger clusters, but plan to.
            • I don't see any reason why it would not scale.
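The latency-versus-throughput remark in the connection caching item can be checked with a few lines of arithmetic. The sketch below assumes a single reader that issues one read at a time, so throughput is simply the inverse of per-read latency; the function names are made up for illustration.

```python
def reads_per_second(latency_ms):
    """Throughput of one synchronous reader, in reads per second."""
    return 1000.0 / latency_ms


def throughput_gain(latency_ms, saved_ms):
    """Fractional throughput increase when saved_ms is shaved off each read."""
    before = reads_per_second(latency_ms)
    after = reads_per_second(latency_ms - saved_ms)
    return after / before - 1.0


if __name__ == "__main__":
    # 10 ms per read with 1 ms saved by connection caching.
    print(f"{reads_per_second(10.0):.0f} reads/s before")
    print(f"{reads_per_second(9.0):.1f} reads/s after")
    print(f"gain: {throughput_gain(10.0, 1.0):.1%}")
```

Under that assumption, saving 1ms out of 10ms moves a reader from 100 to about 111 reads per second, which is slightly more than 10%. With many concurrent readers saturating the disk, the per-read saving would not translate into throughput in the same way.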

          The results depend on hardware more than I thought. The numbers presented here are for a single node cluster with one spindle.

          Environment :
          ===========

          • Single node Hadoop cluster
          • Random access over 10 1GB files.
          • CPU : Dual core Opteron 2GHz, Memory : 4GB
          • Harddisk : 400 GB WD (WD4000YR-01P)
          • Kernel : 2.6.9-22.12 64bit (Based on RedHat kernel I think)
          • Kernel I/O scheduler : cfq

          Preliminary results :
          ===============

          All the tests are done with a single map at a time. It is very important to set "mapred.tasktracker.map.tasks.maximum" to '1'. A single mapper is used to simplify interpretation of the results. The actual commands run are given below the results. Each of the 10 maps performs 500 random reads over one of the 10 files (there is an option to read over all 10 files). The results vary a bit between runs; usually the first run or the first few accesses are costlier, since the OS is still caching file block indexes. All the reads are for 4KB of data.

          Description of read                         Time for each read (ms)
          1000 native reads over block files           9.5
          Random Read 10x500                          10.8
          Random Read without CRC                     10.5
          Random Read with 'seek() and read()'        12.5
          Read with sequential offsets                 1.7
          1000 native reads without closing files      7.5
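The differences between the rows are easier to read as deltas. The short Python sketch below only restates the reported numbers; the dictionary keys are labels invented here, and treating "Random Read 10x500" as the default HDFS configuration with CRC enabled is an interpretation of the table.

```python
# Per-read times in ms, copied from the results above (keys are made-up labels).
RESULTS_MS = {
    "native": 9.5,
    "hdfs_random_read": 10.8,
    "hdfs_no_crc": 10.5,
    "hdfs_seek_read": 12.5,
    "hdfs_sequential": 1.7,
    "native_keep_open": 7.5,
}


def delta_ms(slower, faster):
    """Difference between two rows, rounded to the table's precision."""
    return round(RESULTS_MS[slower] - RESULTS_MS[faster], 1)


if __name__ == "__main__":
    print("HDFS over native baseline :", delta_ms("hdfs_random_read", "native"), "ms")
    print("cost of reading CRC data  :", delta_ms("hdfs_random_read", "hdfs_no_crc"), "ms")
    print("seek()+read() over default:", delta_ms("hdfs_seek_read", "hdfs_random_read"), "ms")
    print("native close/reopen cost  :", delta_ms("native", "native_keep_open"), "ms")
```

Read this way, HDFS adds 1.3ms (roughly 14%) over the native reader on this machine, of which 0.3ms is attributable to CRC data, and the seek-then-read path costs a further 1.7ms per read.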

          Comments :
          ==========

          • It was surprising to see that with native reads, not closing the files saves 2ms per read (this increases to 3ms with 5000 reads). So closing the file probably affects kernel caching in important ways. I didn't notice such a difference on a similar machine with a 4-disk hardware RAID (both over 10GB of data made up of 160 64MB block files).
          • The effect of CRC reads is smaller than expected. This might be attributable to the not-so-big range of 10GB, but such a range might not be far from practical. Even if the range increases, we could easily increase 'io.bytes.per.checksum' to something large (4kB or 16kB). Tests show latency does not increase noticeably until 64KB or so.
            • This implies inline CRCs or caching of CRC data in the datanode is probably not immediately required.
          • Reads with sequential offsets are a good indicator of all the overhead other than the hard disk seeks.
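The checksum file sizes behind the 'io.bytes.per.checksum' suggestion can be worked out directly. The sketch below assumes a 4-byte CRC per chunk and a default of 512 bytes per checksum, and it ignores any header in the checksum file; the function name is made up for illustration.

```python
def checksum_file_bytes(block_size, bytes_per_checksum, crc_size=4):
    """Approximate checksum data size for one block (header ignored)."""
    chunks = -(-block_size // bytes_per_checksum)  # ceiling division
    return chunks * crc_size


if __name__ == "__main__":
    block = 64 * 1024 * 1024  # 64MB block, as in the tests above
    for bpc in (512, 4 * 1024, 16 * 1024):
        size_kb = checksum_file_bytes(block, bpc) / 1024
        print(f"bytes per checksum {bpc:>6}: {size_kb:.0f} KB of CRC data per block")
```

With these assumptions a 64MB block carries 512KB of CRC data at 512 bytes per checksum, which matches the 0.5MB figure above, and that drops to 64KB at 4kB and 16KB at 16kB per checksum.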

          TODO :
          ======

          • Tests on larger clusters
          • Larger data sets
          • We would very much appreciate others running the tests on realistic data sets and in different environments

          Running the tests :
          ===============
          The following commands are used to run the tests. Comments are inline.

          # TestDFSIO:
          # ==========
          # first start the cluster :
          $ bin/start-dfs.sh
          $ bin/start-mapred.sh
          
          # write 10 files 
          $ bin/hadoop jar build/hadoop-0.21.0-dev-test.jar TestDFSIO -write -nrFiles 10 -fileSize 1024
          
          # Run random read test :
          $ bin/hadoop jar build/hadoop-0.21.0-dev-test.jar TestDFSIO -randomread -nrFiles 10 -fileSize 1024
          
          # Examples of options: perform 1000 reads per map
          $ bin/hadoop jar build/hadoop-0.21.0-dev-test.jar TestDFSIO -Dtest.io.randomread.num.reads=1000 -randomread -nrFiles 10 -fileSize 1024

          # Make each map iterate over all the 10 files instead of one per map
          $ bin/hadoop jar build/hadoop-0.21.0-dev-test.jar TestDFSIO -Dtest.io.randomread.num.files=10 -randomread -nrFiles 10 -fileSize 1024
          
          # To test without crc files, set "dfs.datanode.skip.crc" to true and restart the datanode.
          
          # Native FS Reader :
          # ================

          # to create the jar :
          $ jar -cvf build/randomread.jar -C build/examples build/examples/org/apache/hadoop/examples/NativeFSRandomReader*.class
          # run
          $ bin/hadoop jar build/randomread.jar org.apache.hadoop.examples.NativeFSRandomReader -i tmp/test-blocks -n 1000
          
          # "tmp/test-blocks" contains hard links to all the blocks in datanode directory.
          
          # run without any options for help on command line options.
          
          
          Doug Cutting added a comment -

          TestSetFile tests random access. The unit test creates a SetFile containing 10,000 RandomDatum instances, then seeks to sqrt(10,000)=100 random entries in that file and checks that the expected entry is found.

          We could also run it on a MiniDFS cluster, specifying a small block size (~32kb?) to test random access on HDFS.

          We might also specify block compression for a run, to better test seeks within a block-compressed SequenceFile.
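The sampling scheme described above (write N sorted entries, then look up sqrt(N) of them at random) can be sketched as follows. This is only a stand-in in Python: a sorted list with binary search plays the role of the SetFile, plain integers play the role of RandomDatum, and the function name is made up.

```python
import bisect
import math
import random


def check_random_seeks(count=10_000, seed=0):
    """Build `count` sorted keys, then verify sqrt(count) random lookups.

    Returns the number of lookups that were verified.
    """
    rng = random.Random(seed)
    # Distinct random keys in sorted order, standing in for the file contents.
    entries = sorted(rng.sample(range(count * 10), count))
    probes = rng.sample(entries, math.isqrt(count))
    for key in probes:
        # The "seek": locate the key by binary search and check it is there.
        i = bisect.bisect_left(entries, key)
        if i >= len(entries) or entries[i] != key:
            raise AssertionError(f"key {key} not found")
    return len(probes)


if __name__ == "__main__":
    print("verified lookups:", check_random_seeks())
```

Sampling sqrt(N) entries keeps the lookup phase cheap relative to writing the file while still touching positions spread across it.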


            People

            • Assignee:
              Dave Thompson
              Reporter:
              Raghu Angadi
            • Votes:
              1
              Watchers:
              22

              Dates

              • Created:
                Updated:
