Hadoop Common / HADOOP-128

Failure to replicate dfs block kills client

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.1.1
    • Fix Version/s: 0.2.0
    • Component/s: None
    • Labels:
      None
    • Environment:

      ~200-node Linux cluster (kernel 2.6, Red Hat, 2 hyper-threaded CPUs)

      Description

      The datanode gets an exception, which is logged as:

      060407 155835 13 DataXCeiver
      java.io.EOFException
      at java.io.DataInputStream.readFully(DataInputStream.java:178)
      at java.io.DataInputStream.readLong(DataInputStream.java:380)
      at org.apache.hadoop.dfs.DataNode$DataXceiver.run(DataNode.java:462)
      at java.lang.Thread.run(Thread.java:595)

      It then closes the client's connection to the datanode, which causes the client to get an IOException from:

      at java.io.DataInputStream.readFully(DataInputStream.java:178)
      at java.io.DataInputStream.readLong(DataInputStream.java:380)
      at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.internalClose(DFSClient.java:883)

      1. conf.patch
        0.6 kB
        Owen O'Malley
      2. datanode.no-ws-diff
        10 kB
        Owen O'Malley
      3. datanode-mirroring.patch
        31 kB
        Owen O'Malley

        Activity

        Owen O'Malley added a comment -

        This patch changes the client so that:
        1. It allows a timeout of replication * 1 minute for the block replicas to be written.
        2. Logging is improved to include the filename and remote hostname when things fail.
        3.
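
As a rough illustration of item 1 (a sketch, not the code in the attached patch), the client-side timeout can be derived from the replication factor. The class and method names here are hypothetical, and the one-minute base is taken from the comment above. A value like this could then be passed to `Socket.setSoTimeout(int)`.

```java
import java.util.concurrent.TimeUnit;

public class ReplicaTimeout {
    // One minute per replica, as described in the comment above.
    static final long PER_REPLICA_MILLIS = TimeUnit.MINUTES.toMillis(1);

    // Hypothetical helper: timeout in milliseconds for a block with the
    // given replication factor.
    static int writeTimeoutMillis(int replication) {
        if (replication < 1) {
            throw new IllegalArgumentException("replication must be >= 1");
        }
        // toIntExact throws ArithmeticException instead of silently overflowing.
        return Math.toIntExact(replication * PER_REPLICA_MILLIS);
    }

    public static void main(String[] args) {
        System.out.println(writeTimeoutMillis(3)); // prints 180000
    }
}
```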

        It patches the DataNode so that:
        1. Failures downstream (from the mirror nodes) never propagate back upstream.
        2. Logging is improved to include filenames and remote host names.
        3. The patch touches a lot of whitespace because of block changes, so I'll include a separate upload that ignores whitespace.
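
The idea in item 1 can be sketched roughly as follows. This is an assumption about the general shape of the fix, not the actual DataNode.java code: the class name, method name, and parameters are hypothetical. A failed write to the mirror is logged with the remote host name and swallowed, while a failed local write still reaches the caller.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;

public class MirrorIsolation {
    // Hypothetical helper: writes buf locally, then forwards it to the mirror.
    // Returns true if the mirror is still usable after this call.
    static boolean writeChunk(byte[] buf, OutputStream local,
                              OutputStream mirror, String mirrorHost)
            throws IOException {
        local.write(buf); // local failures still propagate to the caller
        if (mirror == null) {
            return false; // no downstream node, or it was already dropped
        }
        try {
            mirror.write(buf);
            return true;
        } catch (IOException e) {
            // Downstream failure: log it with the remote host, do not rethrow.
            System.err.println("Failed to mirror to " + mirrorHost + ": " + e);
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream local = new ByteArrayOutputStream();
        OutputStream broken = new OutputStream() {
            @Override
            public void write(int b) throws IOException {
                throw new IOException("mirror down");
            }
        };
        boolean mirrorOk = writeChunk(new byte[] {1, 2, 3}, local, broken, "mirror-host");
        System.out.println(mirrorOk + " " + local.size()); // prints "false 3"
    }
}
```

A caller would stop forwarding to the mirror once `writeChunk` returns false and keep serving the upstream writer from the local copy.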

        Owen O'Malley added a comment -

        These are the diffs to DataNode.java, ignoring whitespace.

        Owen O'Malley added a comment -

        The read and write block functionality needs to be factored out of the huge if/then/else. I'll open a new bug for that.

        Owen O'Malley added a comment -

        I forgot to include the default value for the retries setting.

        Doug Cutting added a comment -

        I just committed this.

        I note that you increased the timeout in the client, presumably to account for timeouts down the replication chain. But shouldn't we then also increase the timeout in the datanode when it connects to the next link in the chain? It didn't look like you added that.

        +1 for refactoring this (in another patch). The logic of this is hard to follow!


          People

          • Assignee: Owen O'Malley
          • Reporter: Owen O'Malley
          • Votes: 0
          • Watchers: 0
