Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.22.0
    • Fix Version/s: 0.22.0
    • Component/s: test, tools
    • Labels: None
    • Hadoop Flags: Reviewed
    • Release Note:
      Does not currently provide anything but uniform distribution.
      Uses some older deprecated class interfaces (for mapper and reducer).
      This was tested on 0.20 and 0.22 (locally), so it should be fairly backwards compatible.

      Description

      It would be good to have a tool for automatic stress testing of HDFS, which would provide an IO-intensive load on an HDFS cluster.
      The idea is to start the tool, let it run overnight, and then be able to analyze possible failures.

      1. SLiveTest.pdf
        42 kB
        Konstantin Shvachko
      2. SLiveTest.pdf
        41 kB
        Konstantin Shvachko
      3. SLiveTest.pdf
        41 kB
        Konstantin Shvachko
      4. slive.patch.1
        216 kB
        Joshua Harlow
      5. slive.patch
        229 kB
        Joshua Harlow

          Activity

          Konstantin Shvachko added a comment -

          Updating documentation, adding multiple reducers per MAPREDUCE-1893.

          Hudson added a comment -

          Integrated in Hadoop-Mapreduce-trunk #324 (See http://hudson.zones.apache.org/hudson/job/Hadoop-Mapreduce-trunk/324/)
          MAPREDUCE-1804. Stress-test tool for HDFS introduced in HDFS-708. Contributed by Joshua Harlow.

          Konstantin Shvachko added a comment -

          I committed this as part of MAPREDUCE-1804. Thank you Joshua.

          Konstantin Shvachko added a comment -

          Updating the design doc. Section "Data Generation and Verification" has been updated. The old design turned out to be incompatible with file renames. Now it is compatible with renames, but files have to be read from the beginning.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12444885/slive.patch.1
          against trunk revision 944566.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 112 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed core unit tests.

          -1 contrib tests. The patch failed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/370/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/370/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/370/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/370/console

          This message is automatically generated.

          Konstantin Shvachko added a comment -

          +1 Code looks good. I am submitting the patch.

          Joshua Harlow added a comment -

          Removed/cleaned up classes.

          Joshua Harlow added a comment -

          Added more unit tests, plus logging for the reducer/mapper so that the log output shows what is being reduced and the timings associated with mapping.

          Joshua Harlow added a comment -

          Updated with a unit test that tries various operations, selection, range, and config merging, and checks that each is handled in an acceptable fashion; it also runs the setup locally and does basic validation that it worked. JUnit 4 compatible. Some other small updates to data write and data verification. Added an optional "dfs.write.packet.size" option as packetSize, which just overrides the configuration value when merging.

          Joshua Harlow added a comment -

          Updated with in-file "header"-like information so that when a file is renamed the "header" information can still be used to do basic data verification. Tried this out and it seems to work. Also updated so that file-not-found exceptions are reported differently, and added a bad-file exception which signifies a data read failure (i.e. a header cannot be read or is invalid) or an end of file happening when it should not.

          Joshua Harlow added a comment -

          Updated using read & write buffering.
          Added clean operation and sleep operation.
          Adjusted help messages and config checks.
          Started initial test (needs work).

          Joshua Harlow added a comment -

          Attempted that using the following:

          try {
            Block b = DFSTestUtil.getFirstBlock(fs, fn);
            blockId = b.getBlockId();
          } catch (IOException e) {
            LOG.warn("Failure to get first block info " + StringUtils.stringifyException(e));
          }

          It seems, though, that an EOFException occurs if the file is empty even though it has been created ("Failure to get first block info java.io.EOFException"), and since the block id is needed to write the first byte set, it is required before that write occurs.

          Konstantin Shvachko added a comment -

          With respect to (14), I found the following solution.

          public DataGenerator(FileSystem fs, Path fn) throws IOException {
            if(!(fs instanceof DistributedFileSystem)) {
              this.fileId = -1L;
              return;
            }
            DFSDataInputStream in = null;
            try {
              in = (DFSDataInputStream) ((DistributedFileSystem)fs).open(fn);
              this.fileId = in.getCurrentBlock().getBlockId();
            } finally {
              if(in != null) in.close();
            }
          }
          

          Right after creating a file for write you can get the id of the first block of the file and store it in DataGenerator.fileId - a new field. This id does not change across renames, and can be reliably used as a file-specific mix-in for the hash in data generation and verification. The data value of a file at a specific offset is then calculated as hash(fileId, offset).
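
          A minimal sketch of how a data generator keyed on (fileId, offset) might produce and verify bytes (the mixing function below is a hypothetical illustration, not the actual SLive code):

          // Hypothetical sketch: deterministic data generation keyed on (fileId, offset).
          // The hash mixing below is illustrative only; SLive's actual function may differ.
          class OffsetDataGenerator {
            private final long fileId;

            OffsetDataGenerator(long fileId) {
              this.fileId = fileId;
            }

            // Deterministic byte for a given offset, independent of the file name.
            byte byteAt(long offset) {
              long h = fileId * 31 + offset;
              h ^= (h >>> 33);
              h *= 0xff51afd7ed558ccdL;   // finalizer-style mixing constant
              h ^= (h >>> 33);
              return (byte) h;
            }

            // Verification simply regenerates the expected byte and compares.
            boolean verify(long offset, byte actual) {
              return byteAt(offset) == actual;
            }
          }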

          Joshua Harlow added a comment -

          Updated for code comments.

          Joshua Harlow added a comment -

          1. Done
          2. Done
          3. These methods have a meaning for null (mainly for default existence checks when merging), and for the random seed null means no seed, which is a valid case. Duration in milliseconds could return an int, though. The point is that null has a meaning when the default value for a config option is a null object, which it is in a couple of cases.
          4 & 5. Done (we are now measuring only the time around readByte and write())
          6. Done
          7. Done
          8. Done
          9 & 10. Done
          11. Done and most classes made package private
          15. Will add some tests.

          Jakob Homan added a comment -

          Canceling patch pending review updates.

          Konstantin Shvachko added a comment -

          Sorry, (4) was a bit vague. I meant that the start and end times should be taken right around the actual HDFS action. E.g. for write it would be

          get_start_time;
          outputStream.write();
          get_elapsed_time;
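
          A minimal Java sketch of this timing pattern, assuming a hypothetical helper; the point is that only the HDFS call itself sits between the two timestamps:

          // Sketch only (hypothetical helper): the timer wraps just the HDFS action,
          // excluding config parsing, parameter selection, and data generation.
          static long timedWrite(java.io.OutputStream out, byte[] buffer) throws java.io.IOException {
            long start = System.currentTimeMillis();
            out.write(buffer);                          // the actual HDFS action being measured
            return System.currentTimeMillis() - start;  // timeTaken
          }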
          

          15. Forgot to mention that SLive should have a test. It can be simple. It can call slive on local MR and the local FS with some reasonable parameters, which trigger most of the code paths. An alternative is to start Mini clusters and run slive on them. The important thing is that it should not take a long time to run.

          Konstantin Shvachko added a comment -

          This is great. The patch covers main functionality and worked on a cluster as we tested it. Some comments:

          1. Consistently use annotations for overridden methods with the indication which class or interface it overrides, like:
            @Override // MapReduceBase
            

            SliveMapper and SliveReducer don't have annotations

          2. In SliveMapper.map() if unlimitiedTime is equivalent to duration == max_int then the code becomes simpler.
          3. ConfigExtractor methods like Integer getDuration(), Long getRandomSeed(), etc. should return base types int, long.
            I don't see where returning objects is useful except that you have to check for null value, which is not useful at all.
          4. Measuring op timeTaken includes noise. The timer is started at the beginning of op.run() and includes time for parsing
            configs, choosing parameters, creating output streams, and generation of data as I see for write and append.
            The best way would be to measure the performance of the operation by
            get_start_time;
            op.run();
            get_elapsed_time;
            
          5. timeTaken for read should not include data verification time.
          6. MinMax<> class is more like a Range. If you decide to change it the getMinMax*() should be also renamed.
          7. DataGenerator has import warning. Look for more.
          8. RandomInstance.betweenPositive() can be moved into the MinMax class, because you need this to get a random
            number in the range. After that it may make sense to move the instance of Random into Operation base class, and
            get rid of RandomInstance class.
          9. OperationType enumerator should be moved into Constants.
          10. Same with Distribution.
          11. I recommend merging all the classes into one package o.a.h.fs.slive rather than many sub-packages.
            That way most classes may be declared package private to emphasize they are specific to this application.
          12. Also, maybe some of the small classes can be merged into others.
          13. I just realized that although the issue is in HDFS, the commit will have to go into MapReduce. Let's keep tracking
            this here, and I'll create an MR issue when the commit is ready.
          14. There is one problem remaining in our design: checksums don't work after a file is renamed, because the file name
            is mixed into the hash.
          Joshua Harlow added a comment -

          Updated to fix audit warning.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12442927/slive.patch
          against trunk revision 937914.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 133 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          -1 release audit. The applied patch generated 113 release audit warnings (more than the trunk's current 112 warnings).

          -1 core tests. The patch failed core unit tests.

          -1 contrib tests. The patch failed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h2.grid.sp2.yahoo.net/162/testReport/
          Release audit warnings: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h2.grid.sp2.yahoo.net/162/artifact/trunk/patchprocess/releaseAuditDiffWarnings.txt
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h2.grid.sp2.yahoo.net/162/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h2.grid.sp2.yahoo.net/162/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h2.grid.sp2.yahoo.net/162/console

          This message is automatically generated.

          Joshua Harlow added a comment -

          This utility provides the basic functionality for the defined feature.
          It allows a set of mappers to run which will perform various operations on a file system with various distributions and ratios.
          It implements most of the features provided in the given specification (minus the different distributions) and has been tested on an individual machine (using a local file system) and on single- and multi-node clusters (using an HDFS filesystem).

          It can be run by building the test jar file and invoking -help, through a command such as:
          hadoop org.apache.hadoop.fs.slive.SliveTest -help

          A sample run can be performed by doing the following:
          hadoop org.apache.hadoop.fs.slive.SliveTest -create 100 -files 1000 -ops 500 -duration 1 -maps 1 -baseDir /XYZ

          Konstantin Shvachko added a comment -

          Joshua asked what random file generation means, as per this sentence from the design doc:
          2. Randomly chooses a file name. File names are enumerated, so choosing a file means choosing its sequence number, which defines the entire file path.

          I mean by this that we have a static enumeration of files. We choose a random number, and then calculate the full path of the corresponding file from that number.
          The static enumeration is like a heap structure. We have an array f0, f1, f2, ... There is a root r. The root's children are the files f0 and f1, and two directories d0 and d1. The children of d0 are the files f2, f3 (and the directories d2, d3). The children of d1 are the files f4, f5 as well as the directories d4, d5. And so on. This gives 2 files per directory.
          We can generalize it to p files per directory for a fixed p. Here the root's children are p files f0,...,f(p-1) and p directories d0,...,d(p-1). And so on. Importantly, if you have a file fz, then its parent is always the directory dz', where z' = z/p - 1.
          I don't want to use long numbers for file names, so within a directory the child files are named file_i and the sub-directories are named dir_i for i = 0,...,p-1.
          Then, given a number z, the path of file fz is calculated recursively. The file name of fz is file_(z%p). Its parent is the directory dz', where z' = z/p - 1, and the name of dz' is dir_(z'%p). We continue up the tree while the indexes remain non-negative.

          In the test we choose a random z and build a path out of it. If the operation is create we create a file with this path. In HDFS all missing directories along the path will be created automatically. If fz already exists the create fails.
          For read we do the same, but the operation fails if the file does not exist.

          A similar approach is used in the class FileNameGenerator.
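
          A minimal sketch of this enumeration, assuming p files (and p sub-directories) per directory; this is a hypothetical helper, not the actual FileNameGenerator code:

          // Sketch: build the path of file fz from its sequence number z.
          static String pathFor(long z, int p) {
            StringBuilder path = new StringBuilder("file_" + (z % p));
            long parent = z / p - 1;                    // index of the parent directory dz'
            while (parent >= 0) {
              path.insert(0, "dir_" + (parent % p) + "/");
              parent = parent / p - 1;                  // walk up while the index stays non-negative
            }
            return "/" + path;                          // rooted at r
          }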

          Joshua Harlow added a comment -

          Konstantin>
          That sounds like a good first approach.
          I think that will keep it simpler and allow for improvement later.

          Konstantin Shvachko added a comment -

          I am a bit confused about your iterations and inner loops.
          In the document attached here there is one loop, and on each iteration of this loop the individual task first chooses a random operation. The operation is chosen randomly with a configurable ratio slive.op.<op>.pct. For the uniform distribution, ratio/100 is the probability of generating that operation. In non-uniform cases this probability is skewed by the distribution factor. So on each iteration you produce probabilities for each operation, then generate one according to them, and then execute the op. So it is not about how many operations of a type you execute on an iteration (as you state it); you execute only one op. It is about the probability of generating a particular operation.
          I wouldn't worry about distributions now. Let's assume there is only the uniform distribution for now. We can add distributions later.
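
          A minimal sketch of that per-iteration choice in the uniform case, assuming the configured percentages have already been extracted (hypothetical helper, not the actual SliveMapper code):

          // Sketch: pick one operation per iteration according to configured percentages
          // (slive.op.<op>.pct). Hypothetical helper only.
          class OpChooser {
            enum Op { READ, WRITE, APPEND, DELETE, RENAME }

            // pct[i] is the percentage for Op.values()[i]; the values sum to 100.
            static Op choose(double[] pct, java.util.Random rnd) {
              double roll = rnd.nextDouble() * 100.0;
              double cumulative = 0.0;
              for (int i = 0; i < pct.length; i++) {
                cumulative += pct[i];
                if (roll < cumulative) {
                  return Op.values()[i];
                }
              }
              return Op.values()[pct.length - 1];   // guard against rounding drift
            }
          }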

          Joshua Harlow added a comment -

          For the distributions I was thinking that this could occur.
          The input would be expected to be in [0,1] and the output in [0,1].
          The way I was thinking this would work is that the mapper would take the current time and divide it by the maximum time (both known), and for each iteration of the mapper's inner loop (the one producing & running operations) it would calculate the distribution using these simple formulas for each operation type and the given distribution. This would then give a list of numbers in [0,1], which can be multiplied by a new config variable (slive.ops.per.iteration) and also by the operation's ratio (percentage) to determine how many operations should occur in that iteration. If the total number of operations after a loop reaches slive.map.ops, or the current time reaches the maximum time, the loop would stop and the results would be sent to the reducer.

          Here are possible equations to be used:
          Beg would be defined by x^2 (having a number approaching 1 at the end)
          End would be defined by (x-1)^2 (having a number approach 0 at the end)
          Mid would be defined by -2*(x-1/2)^2+1/2 (having a bell shaped curve)
          Uniform would just return 1/3 (the above equations have areas of 1/3 so this seems to make sense)
          Suggestions are welcome.
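
          A small sketch of those shaping functions, with x = currentTime / maxTime in [0,1]; these are just the formulas above, not code from the patch:

          // Sketch: evaluate the proposed distribution weight for a given x in [0,1].
          class DistributionWeights {
            static double weight(String distribution, double x) {
              switch (distribution) {
                case "beg": return x * x;                              // approaches 1 at the end
                case "end": return (x - 1) * (x - 1);                  // approaches 0 at the end
                case "mid": return -2 * (x - 0.5) * (x - 0.5) + 0.5;   // bell-shaped, peak at x = 0.5
                default:    return 1.0 / 3.0;                          // uniform; same area as the others
              }
            }
          }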

          Joshua Harlow added a comment -

          Konstantin>

          1. Giving the stress test that is running the set of jobs the ability to control its own data creation might be useful, allowing for a simpler runtime as well as a simpler procedure for running (you don't need to remember the steps). It could also allow the different "jobs" that will be run to specify to some "data" creation entity the type of data they would like for the current stress test, instead of having to link these two steps via a previous job.

          Konstantin Shvachko added a comment -

          Steve> Might be good for this to be independent of hdfs. That way, we can stress any filesystem.

          DFSIO uses generic FileSystem API, so it should be able to stress any fs implementing it.
          I remember somebody actually used it to benchmark MR over Lustre.

          Joshua>
          1. What is the difference between running a "create" set of jobs vs. "generating in a previous job"?
          2. I agree, it's a good thing to have.
          3. These are good generalizations of the framework. We should consider them as possible improvements of the tool in the future.

          Joshua Harlow added a comment -

          Looks good to me as well.
          Just a couple thoughts/questions.

          1. Would it be correct to have a "create" set of jobs that would ensure, before reads/deletes/writes, that the files exist (instead of generating them in a previous job)? That way the data is created on demand, instead of needing a separate job that runs beforehand and just does data population (this stage would not count against the overall time allotted and could be done at the start of the testing).
          2. It would probably be useful to add a seed number so that the tests can be "mostly" repeated (writes and deletes can't really be truly repeated since they modify the underlying storage).
          3. Might it be useful to add, in the future, the ability to specify your own distribution "objects" that "generate" operation objects, so that the current set of operations can be expanded without core changes, i.e. a plugin-like framework for generating the distribution and the actual set of operations that will occur (allowing for something like an AppendReadDelete operation, or operations distributed according to a square wave, as examples)?

          steve_l added a comment -

          Might be good for this to be independent of hdfs. That way, we can stress any filesystem.

          Konstantin Boudnik added a comment -

          Looks pretty good to me and sophisticated enough to be useful.

          An additional requirement I'd like to add is about interfacing with this tool from other Java applications. E.g. I'd propose to have an API which would allow something like

           Configuration conf = ...;
           SLive sLiveTool = new SLive(conf);
           sLiveTool.execute();

          Having such or a similar API will allow seamless integration with the cluster-based framework (HADOOP-6332) to create end-to-end automated load and stress tests.

          Konstantin Shvachko added a comment -

          Yes, two mappers can choose the same file for append. One of them is supposed to fail - a good test case.

          dhruba borthakur added a comment -

          Thanks for the document. a few questions:

          1. Can two mappers pick the same file at the same time?
          2. If the answer to 1 is "yes", then how do you prevent two mappers from trying to append to the same file at the same time?

          Konstantin Shvachko added a comment -

          Attaching a more detailed design document.

          Konstantin Shvachko added a comment -

          Hi Wang Xu. Thanks for the link. I think people will find your scripting solution useful for testing HDFS directly without the MapReduce framework.
          What we want to achieve with this is to model the workload of a real cluster and test HDFS under that load. MR is a part of testing for us.

          Wang Xu added a comment -

          I cleaned up our code and put it on Google Code ( http://code.google.com/p/hadoop-test/ ), and described it briefly. Hope it helps.

          Wang Xu added a comment -

          I posted our design illustration here:
          http://gnawux.info/hadoop/2010/01/a-simple-hdfs-performance-test-tool/

          And I will post the code on Google Code or another place tomorrow.

          In our test program, the synchronizer is a server written in Python; it accepts requests from the test programs running on the test nodes. Having received requests from all nodes, it lets them start applying pressure simultaneously.

          The test program is written in Java, and it starts several threads to write or read with DFSClient. Each pressure thread records the data it has written in a variable, and the main thread of the test program collects these periodically and writes them into an XML file.

          By analyzing the XML output file, we can tell the read and write performance.

          The test program supports read-only, write-only, and read-write modes, and it can be set to read files written by itself or random files.

          Konstantin Boudnik added a comment -

          I'd suggest calling it a 'load-test tool' for the sake of correct definitions. The purpose of this tool isn't to crash the system but rather to see that it works under a significant load. That said, with certain argument values the system might be exercised to the point of failure. However, that scenario is a stressful application of the said load tool.

          Todd Lipcon added a comment -

          Wang: yep, I'd be interested in taking a look. Thanks!

          Wang Xu added a comment -

          It's interesting.

          We had written a simple test program with Python and Java. It could write or read in parallel from specified nodes, with a specified number of threads per node.

          It only supports some simple, pre-specified read/write patterns and simple logging, not as complicated as what Konstantin described.

          Is anyone interested in it if I post it?

          gary murry added a comment -

          An additional idea would be to maintain and validate hadoop metrics while the tool is running.

          Konstantin Shvachko added a comment -

          This is the result of a discussion held last week.
          Participated: Dhruba, Nicholas, Rob, Suresh, Todd, shv.
          Here is draft description of the tool as it was discussed.

          Similar to DFSIO benchmark, the tool should run as a map-reduce job.
          Each map task iterates choosing:

          1. a random hdfs operation (write, read, append)
            possible extensions may include deletes, renames, write-with-hflush
            random generator should be able to generate a configurable percentage of reads, writes, appends, etc.
          2. random file. File names may be enumerated as in DFSIO, so choosing a file means choosing its sequence number.
          3. random size of data to read, write or append;
            the size should be in a configurable range [min, max]
          4. random block size for write
          5. random replication for write
          6. possible random offset for reads

          After that the map task performs chosen operation with chosen parameters.
          Reads must verify that the data read is exactly the same as what was written.
          In order to do that write operation should store a seed for a standard pseudo-random generator.
          The seed can be either encoded in the file name or stored inside the file as a header.
          Appends may use the same seed for data generation.
          An alternative would be to view a file as a set of records, where each record starts with a header containing the seed. That way each append may choose an independent seed.
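
          A minimal sketch of that record-header alternative (hypothetical layout; the design adopted later mixes in the first block id instead, as the follow-up comments describe):

          // Sketch: each record starts with a small header carrying its own seed, so a
          // reader can regenerate and verify the payload independently of other records.
          static void writeRecord(java.io.DataOutputStream out, long seed, int payloadLength)
              throws java.io.IOException {
            out.writeLong(seed);                 // header: seed for this record
            out.writeInt(payloadLength);         // header: payload length
            byte[] payload = new byte[payloadLength];
            new java.util.Random(seed).nextBytes(payload);   // deterministic payload from the seed
            out.write(payload);
          }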

          Job completion can be defined by one of the following conditions:

          • run T hours. In that case map tasks iterate until their elapsed time exceeds T.
          • run N operations. Each map performs N / numMaps operations and stops.
          • terminate on first error flag. This may be useful for investigation purposes. The cluster state is as close to the failure condition as it gets.

          Error reporting. Errors are reported by the maps as their output.
          Map output also should include reasonable stats:

          • number of operations performed in each category
          • errors, % of errors;

          The reduce task combines all error messages and stats, writes the results into HDFS.

          Potential extension with Fault Injection.
          For example, one may choose to intentionally corrupt data while writing, and then make sure it is corrupt on read.


            People

            • Assignee: Joshua Harlow
            • Reporter: Konstantin Shvachko
            • Votes: 0
            • Watchers: 22
