Hadoop Common
  HADOOP-66

dfs client writes all data for a chunk to /tmp

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.1.0
    • Fix Version/s: 0.1.0
    • Component/s: None
    • Labels: None

      Description

      The dfs client writes all the data for the current chunk to a file in /tmp; when the chunk is complete, it is shipped out to the Datanodes. This can cause /tmp to fill up fast when a lot of files are being written. A potentially better scheme is to buffer the written data in RAM (application code can set the buffer size) and flush it to the Datanodes when the buffer fills up.
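      A minimal sketch of the buffering scheme proposed above, not the actual DFSClient code: written data accumulates in a RAM buffer whose size the application chooses, and is shipped to the datanode stream when the buffer fills. BufferedBlockStream and dataNodeOut are illustrative names, with a plain OutputStream standing in for the connection to a datanode.

      ```java
      import java.io.IOException;
      import java.io.OutputStream;

      public class BufferedBlockStream extends OutputStream {
          private final byte[] buffer;             // application-sized RAM buffer
          private int count;                       // bytes currently buffered
          private final OutputStream dataNodeOut;  // stream carrying the block to a datanode

          public BufferedBlockStream(OutputStream dataNodeOut, int bufferSize) {
              this.dataNodeOut = dataNodeOut;
              this.buffer = new byte[bufferSize];
          }

          @Override
          public void write(int b) throws IOException {
              if (count == buffer.length) {
                  flushBuffer();                   // buffer is full: ship it to the datanode
              }
              buffer[count++] = (byte) b;
          }

          private void flushBuffer() throws IOException {
              if (count > 0) {
                  dataNodeOut.write(buffer, 0, count);
                  count = 0;
              }
          }

          @Override
          public void flush() throws IOException {
              flushBuffer();
              dataNodeOut.flush();
          }
      }
      ```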

      Attachments

      1. no-tmp.patch
        8 kB
        Doug Cutting
      2. tmp-delete.patch
        0.6 kB
        Owen O'Malley

        Activity

        Doug Cutting added a comment -

        Should we worry about the space this consumes? For each block a client writes there will be memory allocated containing the path name to the temporary file that cannot be gc'd. If that's 100 bytes, then 1M blocks would generate 100MB, which would be a big leak. But writing 1M blocks means writing 32TB from a single JVM, which would take around a month (at current dfs speeds). If we increase the block size (as has been discussed) then the rate is slowed proportionally. So I guess we don't worry about this "leak"?

        Owen O'Malley added a comment -

        There is a method on File to delete the file when the JVM exits. Here is a patch that calls the method on the temporary block files. Apparently there is a bug in the JVM under Windows such that the files are only deleted if they are closed. For Linux or Solaris users, this should make sure that no block files end up being left behind by the application.
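        The method Owen refers to is java.io.File.deleteOnExit(); a minimal illustration follows (the file name prefix here is made up, not taken from the patch):

        ```java
        import java.io.File;
        import java.io.IOException;

        public class TempBlockFileDemo {
            public static void main(String[] args) throws IOException {
                // Stand-in for the client-side temporary block file.
                File blockFile = File.createTempFile("dfs-block-", ".tmp");
                // Ask the JVM to remove the file when it exits normally; as noted
                // above, on Windows this only works if the file has been closed.
                blockFile.deleteOnExit();
            }
        }
        ```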

        Doug Cutting added a comment -

        I just committed a fix for this and some problems that it was hiding.

        Doug Cutting added a comment -

        Here's a minimally-tested patch that removes the use of temp files.

        Doug Cutting added a comment -

        It's hard to resume writing a block when a connection fails, since you don't know how much of the previous write succeeded. Currently the block is streamed over TCP connections. We could instead write it as a series of length-prefixed buffers, and query the remote datanode on reconnect about which buffers it had received, etc. But that seems like reinventing a lot of TCP.

        If the datanode goes down then currently the entire block is in a temp file so that it can instead be written to a different datanode. Thus if datanodes die during, e.g., a reduce, then the reduce task does not have to restart. But if reduce tasks are running on the same pool of machines as datanodes, then, when a node fails, some reduce tasks will need to be restarted anyway. So I agree that this may not be helping us much. I think throwing an exception when the connection to the datanode fails would be fine.
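        A rough sketch of the length-prefixed buffer idea Doug mentions above, not actual Hadoop code: each buffer goes out as a 4-byte length followed by the payload, so that on reconnect the datanode could in principle report how many complete buffers it received and the client could resume from the next one.

        ```java
        import java.io.DataOutputStream;
        import java.io.IOException;

        public class LengthPrefixedWriter {
            private final DataOutputStream out;  // connection to the datanode

            public LengthPrefixedWriter(DataOutputStream out) {
                this.out = out;
            }

            public void writeBuffer(byte[] data, int off, int len) throws IOException {
                out.writeInt(len);           // 4-byte length prefix
                out.write(data, off, len);   // buffer payload
                out.flush();
            }
        }
        ```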

        Sameer Paranjpye added a comment -

        It doesn't make a lot of sense to buffer the entire block in RAM. On the other hand, an application ought to be able to control the buffering strategy to some extent. Most stream implementations have a setBufferSize() or equivalent method that allows programmers to do this. The default buffer size is reasonably small so that many files can be opened without worrying too much about buffering.

        Besides the issues with filling up /tmp (32MB is a pretty large chunk to be writing there), it's unclear that the scheme adds a lot of value; it may even be detrimental. If a connection to a Datanode fails, then why not try to recover by re-connecting and throw an exception if that fails? If it's just the connection that has failed, the client should be able to reconnect pretty easily. If the Datanode is down for the count, the odds are low (or are they?) that it'll come back by the time the client finishes writing the block, and the write will fail anyway, so why write to a temp file? If it's common for Datanodes to bounce and come back, then Datanode stability is a problem that we should be working on. In that case, the temp file is only a workaround and not a real solution; it might even be masking the problem in many cases.
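        For reference, the application-controlled buffering Sameer describes follows the usual java.io idiom of a buffer-size argument with a small default; the sketch below is illustrative only, with FileOutputStream standing in for the DFS output stream.

        ```java
        import java.io.BufferedOutputStream;
        import java.io.FileOutputStream;
        import java.io.IOException;
        import java.io.OutputStream;

        public class BufferSizeDemo {
            // Small default so that many open files stay cheap.
            static final int DEFAULT_BUFFER_SIZE = 4096;

            public static OutputStream open(String path, int bufferSize) throws IOException {
                return new BufferedOutputStream(new FileOutputStream(path), bufferSize);
            }

            public static OutputStream open(String path) throws IOException {
                return open(path, DEFAULT_BUFFER_SIZE);
            }
        }
        ```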

        Doug Cutting added a comment -

        Eric: I agree. We should probably change this to use a temp directory under dfs.data.dir instead of /tmp.
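        A sketch of what that might look like, assuming the value of dfs.data.dir (or a similar setting) has been read into dataDir; this is not the actual change, and the names are illustrative.

        ```java
        import java.io.File;
        import java.io.IOException;

        public class ConfiguredTempDir {
            // dataDir would come from configuration (e.g. dfs.data.dir) rather than /tmp.
            public static File createBlockTempFile(String dataDir) throws IOException {
                File tmpDir = new File(dataDir, "tmp");
                if (!tmpDir.exists() && !tmpDir.mkdirs()) {
                    throw new IOException("Cannot create " + tmpDir);
                }
                // The three-argument createTempFile places the file in the given directory.
                return File.createTempFile("block-", ".tmp", tmpDir);
            }
        }
        ```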

        eric baldeschwieler added a comment -

        So the problem with /tmp is that this can fill up and cause failures. This is very config / install specific. We almost never use /tmp because it gets blown out by something sometime, always when you least expect it. Maybe we should throw by default and provide some config to do something else, such as providing a file path for temp files? This could be in /tmp if you chose, or map reduce could default to its temp directory where it is storing everything else.

        Performance is clearly not an issue if this is truly an exceptional case.

        Doug Cutting added a comment -

        It looks to me like the temp file is only in fact used when the connection to the datanode fails. Normally the block is streamed to the datanode as it is written. But if the connection to the datanode fails then an application exception is not thrown, instead the temp file is used to recover, by reconnecting to a datanode and trying to write the block again.

        Data is buffered in RAM first, just in chunks much smaller than the block. I don't think we should buffer the entire block in RAM, as this would, e.g., prohibit applications which write lots of files in parallel.

        We could get rid of the temp file and simply throw an application exception when we lose a connection to a datanode while writing. What is the objection to the temp file?
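        A minimal sketch of the recovery path described above, with made-up names: when the connection drops, the client can replay the block from the temp backup into a freshly opened stream to another datanode instead of surfacing an exception.

        ```java
        import java.io.File;
        import java.io.FileInputStream;
        import java.io.IOException;
        import java.io.InputStream;
        import java.io.OutputStream;

        public class BlockRecoverySketch {
            static void replayBlock(File tempBackup, OutputStream otherDatanode) throws IOException {
                byte[] buf = new byte[4096];
                try (InputStream in = new FileInputStream(tempBackup)) {
                    int n;
                    while ((n = in.read(buf)) != -1) {
                        otherDatanode.write(buf, 0, n);  // re-send the buffered block data
                    }
                }
            }
        }
        ```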


          People

          • Assignee: Doug Cutting
          • Reporter: Sameer Paranjpye
          • Votes: 0
          • Watchers: 0
