Details

    • Type: Test
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None

      Description

      The fact that we do not have any large-scale reliability tests bothers me. I'll be the first to admit that it isn't the easiest of tasks, but I'd like to start a discussion around this... especially given that the code-base has grown to the point where the interactions caused by small changes are very hard to predict.

      One of the simple scripts I run for every patch I work on does something very simple: it runs sort500 (or greater), then randomly picks n tasktrackers from ${HADOOP_CONF_DIR}/conf/slaves and kills them; a similar script kills and restarts the tasktrackers.
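      A minimal sketch of that kind of script, assuming passwordless ssh to the slaves, GNU coreutils (shuf) on the client, and that the TaskTracker JVM can be found by its main class; the kill count is illustrative:

        #!/usr/bin/env bash
        # Kill a few randomly chosen tasktrackers while a sort job is running.
        # N_TO_KILL and the slaves path mirror the description above.
        N_TO_KILL=${N_TO_KILL:-3}
        SLAVES_FILE="${HADOOP_CONF_DIR}/conf/slaves"

        # Pick N_TO_KILL random hosts from the slaves file.
        for host in $(shuf -n "$N_TO_KILL" "$SLAVES_FILE"); do
          echo "Killing TaskTracker on $host"
          ssh "$host" 'pkill -9 -f org.apache.hadoop.mapred.TaskTracker'
        done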

      This helps check a fair number of reliability stories: lost tasktrackers, task failures, etc. Clearly this isn't enough to cover everything, but it's a start.

      Let's discuss: what do we do for HDFS? We need more for Map-Reduce!

        Activity

        dhruba borthakur added a comment -

        I would like an error-injection package for HDFS. It should be possible to inject errors into the disk subsystem and the network subsystem. This would simulate disk I/O errors and network partitions, thereby triggering hard-to-test code paths in HDFS.

        Devaraj Das added a comment -

        I have started looking at this. Some thoughts:

        1) Have a script that would launch a couple of large randomwriter/sort/sortvalidator jobs (via command line using Java). Some of these jobs would set speculative execution to true.
        1.1) Some randomwriter jobs would generate 5x the amount of data per map. "Sort" for such data might use a very high value for mapred.min.split.size, leading to large reduce partitions.

        2) Have another script that would query the JobTracker (via another Java program) to get the list of TaskTrackers. It would randomly issue SIGSTOP (via ssh) to a bunch of trackers. After a certain period, the JobTracker would mark these trackers as lost. The script would then send SIGCONT to the same processes, allowing these trackers to rejoin the cluster (see the sketch after this list).

        3) Have a script that gets the task reports from the JobTracker and kills/fails a bunch of random tasks.
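        A rough sketch of step (2), taking the tracker hosts from the slaves file rather than from a JobTracker query; the counts, the wait time, and the pkill pattern are illustrative assumptions:

          #!/usr/bin/env bash
          # Freeze a random subset of tasktrackers with SIGSTOP, wait long enough
          # for the JobTracker to declare them lost, then resume them with SIGCONT.
          N_TRACKERS=${N_TRACKERS:-5}
          LOST_WAIT_SECS=${LOST_WAIT_SECS:-700}   # comfortably past the tracker-expiry interval

          victims=$(shuf -n "$N_TRACKERS" "${HADOOP_CONF_DIR}/conf/slaves")

          for host in $victims; do
            ssh "$host" 'pkill -STOP -f org.apache.hadoop.mapred.TaskTracker'
          done

          sleep "$LOST_WAIT_SECS"   # the JobTracker should now have marked these trackers as lost

          for host in $victims; do
            ssh "$host" 'pkill -CONT -f org.apache.hadoop.mapred.TaskTracker'
          done

        Step (1) could be driven the same way with the stock example jobs (e.g. hadoop jar hadoop-*-examples.jar randomwriter / sort), and step (3) with hadoop job -kill-task / -fail-task on randomly chosen task attempts.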

        (2) and (3) could be done multiple times. The test is to see whether the jobs launched by the first script all complete successfully in the face of such random failures. It would also test the JobTracker's reliability in dealing with a couple of large jobs. Similarly, the map and reduce tasks would be tested for reliability in handling big inputs, and the shuffle would also be stressed (especially due to 1.1).

        Going one step further, a script could grep for exceptions in the generated log files (JobTracker, TTs, and tasks) and archive them on the client machine for someone to look at (some exceptions could be indicators of bugs).
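        A sketch of that harvesting step, assuming the daemon and task logs live under HADOOP_LOG_DIR on each slave; the fallback path and archive layout are illustrative:

          #!/usr/bin/env bash
          # Collect exception traces from each slave's Hadoop logs and archive
          # them on the client machine for later inspection.
          ARCHIVE_DIR="exceptions-$(date +%Y%m%d-%H%M%S)"
          mkdir -p "$ARCHIVE_DIR"

          while read -r host; do
            # -n keeps ssh from swallowing the rest of the slaves list on stdin.
            ssh -n "$host" 'grep -rn "Exception" "${HADOOP_LOG_DIR:-/var/log/hadoop}"' \
              > "$ARCHIVE_DIR/$host.txt" 2>/dev/null
          done < "${HADOOP_CONF_DIR}/conf/slaves"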

        These are some early thoughts I had. Please chime in with suggestions here.

        steve_l added a comment -

        I'm assuming that the goal here is to see how the system handles network, host, and disk failures. A good first step would be to ask: which problems happen most often? And which are traumatic enough to set everyone's pagers off? Those are the ones to care about.

        • disk failures could be mocked with a filesystem which simulates problems: bad data, missing data, even hanging reads and writes.
        • network failures are harder to simulate, as there are so many kinds. DNS failures and exceptions at every single point of an IO operation are all candidates. Perhaps we could have a special mock IPC client that raises these exceptions in test runs.
        • this is the kind of thing that virtualized clusters are good for, but they have odd timing quirks that make you worry about what is going on.
        Devaraj Das added a comment -

        Steve, this issue is more about how the various distributed parts of the system work together in the event of failures and under stress. So yes, while you are right that disk failures can be a part of this test, they can probably be covered fairly easily in unit tests as well. As part of this issue, we could inject a fault that simply deletes a bunch of map output files from some trackers, which would exercise the case where the corresponding maps are killed and re-executed elsewhere. Network failures are partly handled in my proposal, where we STOP/CONT trackers at will: STOP can be seen as faking a network failure, while CONT can be seen as a network recovery.
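        A sketch of that fault, assuming map outputs sit under each tracker's mapred.local.dir as file.out files; the local-dir path and the node count are assumptions:

          #!/usr/bin/env bash
          # Delete map output files on a few random tasktrackers so the
          # corresponding maps have to be re-executed elsewhere.
          N_NODES=${N_NODES:-2}
          MAPRED_LOCAL_DIR=${MAPRED_LOCAL_DIR:-/tmp/hadoop/mapred/local}

          for host in $(shuf -n "$N_NODES" "${HADOOP_CONF_DIR}/conf/slaves"); do
            echo "Deleting map outputs on $host"
            ssh "$host" "find '$MAPRED_LOCAL_DIR' -name 'file.out*' -delete"
          done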

        Sharad Agarwal added a comment -

        Perhaps we can install the error-injection code with each daemon (Datanode/TaskTracker). The code would be triggered by a random function driven by, say, a cluster error-injection ratio = number of nodes to inject errors into / total nodes in the cluster. The default package would inject system-level errors; if required, each daemon could extend it to inject its own, more granular errors.
        This way error generation could be decentralized and controlled via config params, avoiding the need to fetch the slaves list for a cluster and inject everything from a single client.
        The question is: do we need that kind of extensibility, or are we fine with a few error types? Thoughts?
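        A script-level prototype of that trigger, run on each node before its daemon starts; the INJECTION_RATIO variable and the system property it sets are illustrative, and Sharad's proposal would put this logic (and the fault code itself) inside the daemon, driven by a config parameter:

          #!/usr/bin/env bash
          # Each node decides independently, with probability INJECTION_RATIO
          # (= nodes to inject errors into / total nodes), whether to enable injection.
          INJECTION_RATIO=${INJECTION_RATIO:-0.1}

          # $RANDOM is uniform over 0..32767; compare it against ratio * 32768.
          threshold=$(awk -v r="$INJECTION_RATIO" 'BEGIN { printf "%d", r * 32768 }')
          if (( RANDOM < threshold )); then
            # Hypothetical flag: the daemon's injection hooks would check this property.
            export HADOOP_OPTS="$HADOOP_OPTS -Dtest.error.injection=on"
            echo "Error injection enabled on $(hostname)"
          fi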

        steve_l added a comment -

        This is interesting to me; I'm effectively doing some of this in our codebase as we try out some of the lifecycle work, but my tests are still trying to bring up and stress a functional cluster and not yet test how that cluster copes with various failure modes, such as:

        • transient loss of the namenode
        • loss of 10%, 20%, 30%, 50%, 50%+ of the workers, through either outages or network partitioning
        • DNS playing up. Because it will, you know
        • JT and TT failures
        • MR job progress when namenodes start failing
          There is also performance testing.

        Paper to read: http://googletesting.blogspot.com/2008/05/performance-testing-of-distributed-file.html


          People

          • Assignee:
            Devaraj Das
            Reporter:
            Arun C Murthy
          • Votes:
            1
            Watchers:
            11
