Hadoop Common / HADOOP-4614

"Too many open files" error while processing a large gzip file

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 0.18.2
    • Fix Version/s: 0.18.3
    • Component/s: None
    • Labels: None
    • Hadoop Flags: Reviewed

      Description

      I am running a simple word count program on gzip-compressed data of size 4 GB (the uncompressed size is about 7 GB). I have a setup of 17 nodes in my Hadoop cluster. After some time, I get the following exception:

      java.io.FileNotFoundException: /usr/local/hadoop/hadoop-hadoop/mapred/local/taskTracker/jobcache/job_200811041109_0003/attempt_200811041109_0003_m_000000_0/output/spill4055.out.index (Too many open files)
      at java.io.FileInputStream.open(Native Method)
      at java.io.FileInputStream.<init>(FileInputStream.java:137)
      at org.apache.hadoop.fs.RawLocalFileSystem$TrackingFileInputStream.<init>(RawLocalFileSystem.java:62)
      at org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileInputStream.<init>(RawLocalFileSystem.java:98)
      at org.apache.hadoop.fs.RawLocalFileSystem.open(RawLocalFileSystem.java:168)
      at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:359)
      at org.apache.hadoop.mapred.IndexRecord.readIndexFile(IndexRecord.java:47)
      at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.getIndexInformation(MapTask.java:1339)
      at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:1237)
      at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:857)
      at org.apache.hadoop.mapred.MapTask.run(MapTask.java:333)
      at org.apache.hadoop.mapred.Child.main(Child.java:155)
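
      For reference, reproducing this setup should amount to running the bundled wordcount example over a single .gz input; the jar name and paths below are placeholders for a 0.18.2 install, not taken from this report:

      # Put the gzipped corpus into HDFS and run the stock wordcount example.
      # Because gzip is not splittable, the entire file is handled by one map task.
      bin/hadoop dfs -put corpus.gz /user/me/input/corpus.gz
      bin/hadoop jar hadoop-0.18.2-examples.jar wordcount /user/me/input /user/me/output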

      From a user's perspective, I know that Hadoop will use only one mapper for a gzipped file. The exception above suggests that Hadoop writes the intermediate data to many files. The question, then, is exactly how many open files are required in the worst case for a given data size and cluster size. Currently it looks as if Hadoop needs more open files as the input size or the cluster size (in terms of nodes, mappers, and reducers) increases, which does not scale. A user has to put some number in /etc/security/limits.conf to say how many open files a Hadoop node is allowed, but it is unclear what that "magic number" should be.
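
      To see how many descriptors a task actually holds while it runs, something like the following can be run on the node hosting the map task (the jps pattern assumes the child task JVM's main class is org.apache.hadoop.mapred.Child, as in the stack trace above; adjust it to whatever jps -l shows on your nodes):

      # Find the task JVM and count its open file descriptors.
      pid=$(jps -l | awk '/mapred.Child/ {print $1; exit}')
      ls /proc/$pid/fd | wc -l   # descriptors currently held by the task
      # lsof -p $pid             # full per-file listing, cf. the attached openfds.txt
      ulimit -n                  # per-process limit; tasks inherit theirs from the TaskTracker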

      So probably the best solution is to change Hadoop in such a way that it can work within some moderate limit on open files (e.g. 4 K), or to document an upper limit a user can rely on, so that for any data size and cluster size Hadoop will not run into this "too many open files" issue.
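
      As a stop-gap on the user side, the per-node limit can be raised, though that only moves the "magic number" rather than bounding it. A minimal sketch, assuming the daemons and tasks run as a dedicated hadoop user and the distribution applies pam_limits at login (the value 16384 is illustrative, not a recommendation from this issue):

      # /etc/security/limits.conf on every node:
      #   hadoop  soft  nofile  16384
      #   hadoop  hard  nofile  16384
      ulimit -Sn 16384   # or raise the soft limit for the current shell only
      ulimit -n          # verify; restart the TaskTracker so child tasks inherit the new limit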

      Attachments

      1. HADOOP-4614.patch
        4 kB
        Yuri Pradkin
      2. 4614-trunk.patch
        4 kB
        Chris Douglas
      3. HADOOP-4614.patch
        3 kB
        Yuri Pradkin
      4. HADOOP-4614-branch0.18.patch
        3 kB
        Yuri Pradkin
      5. HADOOP-4614.patch
        3 kB
        Yuri Pradkin
      6. HADOOP-4614.patch
        3 kB
        Yuri Pradkin
      7. HADOOP-4614.patch
        1.0 kB
        Yuri Pradkin
      8. openfds.txt
        575 kB
        Yuri Pradkin

        Activity

        Owen O'Malley made changes -
        Component/s mapred [ 12310690 ]
        Nigel Daley made changes -
        Status Resolved [ 5 ] Closed [ 6 ]
        Chris Douglas made changes -
        Resolution Fixed [ 1 ]
        Hadoop Flags [Reviewed]
        Status Patch Available [ 10002 ] Resolved [ 5 ]
        Yuri Pradkin made changes -
        Attachment HADOOP-4614.patch [ 12394702 ]
        Chris Douglas made changes -
        Attachment 4614-trunk.patch [ 12394699 ]
        Chris Douglas made changes -
        Priority Major [ 3 ] Blocker [ 1 ]
        Yuri Pradkin made changes -
        Attachment HADOOP-4614.patch [ 12394595 ]
        Yuri Pradkin made changes -
        Attachment HADOOP-4614-branch0.18.patch [ 12394576 ]
        Yuri Pradkin made changes -
        Status In Progress [ 3 ] Patch Available [ 10002 ]
        Yuri Pradkin made changes -
        Attachment HADOOP-4614.patch [ 12394567 ]
        Yuri Pradkin made changes -
        Status Open [ 1 ] In Progress [ 3 ]
        Yuri Pradkin made changes -
        Status Patch Available [ 10002 ] Open [ 1 ]
        Yuri Pradkin made changes -
        Status In Progress [ 3 ] Patch Available [ 10002 ]
        Yuri Pradkin made changes -
        Attachment HADOOP-4614.patch [ 12394101 ]
        Yuri Pradkin made changes -
        Status Open [ 1 ] In Progress [ 3 ]
        Yuri Pradkin made changes -
        Assignee Yuri Pradkin [ yurip ]
        Devaraj Das made changes -
        Status Patch Available [ 10002 ] Open [ 1 ]
        Yuri Pradkin made changes -
        Status Open [ 1 ] Patch Available [ 10002 ]
        Yuri Pradkin made changes -
        Attachment HADOOP-4614.patch [ 12393895 ]
        Abdul Qadeer made changes -
        Component/s mapred [ 12310690 ]
        Component/s io [ 12310687 ]
        Yuri Pradkin made changes -
        Attachment openfds.txt [ 12393542 ]
        Abdul Qadeer created issue -

          People

          • Assignee: Yuri Pradkin
          • Reporter: Abdul Qadeer
          • Votes: 0
          • Watchers: 7

            Dates

            • Created:
            • Updated:
            • Resolved:
