Chukwa / CHUKWA-323

Chukwa agent unable to stream all data source on the jobtracker node

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Invalid
    • Affects Version/s: 0.2.0
    • Fix Version/s: 0.2.0
    • Component/s: Data Collection
    • Labels:
      None
    • Environment:

      Redhat EL 5.1, Java 6

      Description

      HDFS namenode and MapReduce-related metrics appear to have stopped arriving as of 06/21/2009 00:00:00.
      The agent log contains exceptions like these:

      2009-06-21 21:28:01,165 WARN Thread-10 FileTailingAdaptor - failure reading /usr/local/hadoop/var/log/history/host.example.com_1245463671645_job_200906200207_0351_user_Chukwa-Demux_20090620_09_56
      java.io.FileNotFoundException: /usr/local/hadoop/var/log/history/host.example.com_1245463671645_job_200906200207_0351_user_Chukwa-Demux_20090620_09_56 (Too many open files)
      at java.io.RandomAccessFile.open(Native Method)
      at java.io.RandomAccessFile.<init>(RandomAccessFile.java:212)
      at org.apache.hadoop.chukwa.datacollection.adaptor.filetailer.FileTailingAdaptor.tailFile(FileTailingAdaptor.java:239)
      at org.apache.hadoop.chukwa.datacollection.adaptor.filetailer.FileTailer.run(FileTailer.java:90)
      2009-06-21 21:28:01,165 WARN Thread-10 FileTailingAdaptor - Adaptor|58fb855b5c26d36cc1e69e264ce3402c| file: /usr/local/hadoop/var/log/history/host.example.com_1245463671645_job_200906200207_0352_user_PigLatin%3AHadoop_jvm_metrics.pig, has rotated and no detection - reset counters to 0L

      It looks like the number of file handles held open for offset tracking exceeded the limit on concurrently
      open files for the JVM process. This triggers a feedback loop: FileTailingAdaptor cannot open the file, assumes
      the log file has rotated, and resets its counters, even though no rotation took place. FileTailingAdaptor was
      simply unable to open the file to track its offset.
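
      The distinction described above can be sketched in Java. This is only an illustration, not the
      FileTailingAdaptor code: the class TailOpenCheck, the enum OpenResult, and the method probe are hypothetical
      names. The sketch relies on the fact, visible in the stack trace, that RandomAccessFile reports
      "Too many open files" as a FileNotFoundException, so checking whether the path still exists helps separate a
      failed open from a file that is really gone.

```java
import java.io.File;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.RandomAccessFile;

public class TailOpenCheck {
    /** Hypothetical outcomes of trying to open a tailed file. */
    public enum OpenResult { OK, MISSING, OPEN_FAILED }

    /** Tries to open the file read-only and classifies the result. */
    public static OpenResult probe(File file) {
        try (RandomAccessFile reader = new RandomAccessFile(file, "r")) {
            return OpenResult.OK;
        } catch (FileNotFoundException e) {
            // Thrown both when the path is absent and when the open itself
            // fails (for example "Too many open files"), so check the path.
            return file.exists() ? OpenResult.OPEN_FAILED : OpenResult.MISSING;
        } catch (IOException e) {
            // Only close() can reach this branch; treat it as a failed open.
            return OpenResult.OPEN_FAILED;
        }
    }

    public static void main(String[] args) {
        for (String path : args) {
            System.out.println(path + " -> " + probe(new File(path)));
        }
    }
}
```

      Under this assumption, a tailer would keep its stored offset on OPEN_FAILED and retry later, and would
      consider a reset only on MISSING or when the file is shorter than the stored offset.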

      [root@gsbl80211 log]# /usr/sbin/lsof -p 29960|wc -l
      1084

      lsof reports 1084 open files for the agent process, which exceeds the default limit of 1024 concurrently open files.
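
      The same comparison can be made from inside the JVM. The following is a minimal sketch, assuming a
      Unix-like platform and a JVM that exposes com.sun.management.UnixOperatingSystemMXBean; the class FdUsage and
      the 0.9 warning threshold are hypothetical. Note that lsof -p output also includes a header line and
      non-descriptor entries such as mapped libraries, so the count reported here may differ from the lsof figure.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;
import com.sun.management.UnixOperatingSystemMXBean;

public class FdUsage {
    /** Returns {open, max} descriptor counts, or null if the platform bean is unavailable. */
    public static long[] descriptorCounts() {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        if (os instanceof UnixOperatingSystemMXBean) {
            UnixOperatingSystemMXBean unix = (UnixOperatingSystemMXBean) os;
            return new long[] {
                unix.getOpenFileDescriptorCount(),
                unix.getMaxFileDescriptorCount()
            };
        }
        return null;
    }

    /** True when open descriptors have reached the given fraction of the limit. */
    public static boolean nearLimit(long open, long max, double threshold) {
        return max > 0 && open >= max * threshold;
    }

    public static void main(String[] args) {
        long[] counts = descriptorCounts();
        if (counts == null) {
            System.out.println("descriptor counts not available on this platform");
            return;
        }
        System.out.println("open=" + counts[0] + " max=" + counts[1]);
        if (nearLimit(counts[0], counts[1], 0.9)) { // 0.9 is an arbitrary example threshold
            System.out.println("WARNING: close to the open-file limit");
        }
    }
}
```

      Logging these two numbers periodically would show whether the agent is approaching the limit before
      opens start to fail.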

        Attachments

        1. testForLeaks.patch
          11 kB
          Ari Rabkin

            People

            • Assignee: Jerome Boulon (jboulon)
            • Reporter: Eric Yang (eyang)