Hadoop Common
HADOOP-3585

Hardware Failure Monitoring in large clusters running Hadoop/HDFS

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.19.0
    • Component/s: metrics
    • Labels:
      None
    • Environment:

      Linux

    • Hadoop Flags:
      Reviewed
    • Release Note:
      Added FailMon as a contrib project for hardware failure monitoring and analysis, under /src/contrib/failmon. Created User Manual and Quick Start Guide.

      Description

      At IBM we're interested in identifying hardware failures on large clusters running Hadoop/HDFS. We are working on a framework that will enable nodes to identify failures on their hardware using the Hadoop log, the system log and various OS hardware diagnostic utilities. The implementation details are not yet finalized, but you can see a draft of our design in the attached document. We are especially interested in Hadoop and system logs from failed machines, so if you are in possession of such, you are very welcome to contribute them; they would be of great value for hardware failure diagnosis.

      Some details about our design can be found in the attached document failmon.doc. More details will follow in a later post.
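      The core idea, in which each node mines its own system and Hadoop logs for hardware-failure indicators, can be sketched as follows. This is a hypothetical illustration, not FailMon's actual code; the pattern list and function names are invented for the example:

```python
import re

# Illustrative only: these patterns are our own sketch of hardware-trouble
# indicators in syslog/dmesg output, not FailMon's actual rule set.
HARDWARE_ERROR_PATTERNS = [
    re.compile(r"I/O error", re.IGNORECASE),
    re.compile(r"ata\d+(\.\d+)?: .*(failed|error)", re.IGNORECASE),
    re.compile(r"machine check|ECC error|over-temperature", re.IGNORECASE),
]

def scan_log_lines(lines):
    """Return the log lines that match any hardware-error pattern."""
    return [line for line in lines
            if any(p.search(line) for p in HARDWARE_ERROR_PATTERNS)]

sample = [
    "Jun 20 11:02:03 node17 kernel: ata3.00: failed command: READ DMA",
    "Jun 20 11:02:04 node17 kernel: end_request: I/O error, dev sdb, sector 12345",
    "Jun 20 11:02:05 node17 sshd[123]: Accepted publickey for hadoop",
]
print(len(scan_log_lines(sample)))  # the two kernel lines match
```

      In the actual design such matches would be enriched with node metadata and shipped off-node for aggregation and analysis, rather than just printed.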

      1. HADOOP-3585.patch
        82 kB
        Ioannis Koltsidas
      2. HADOOP-3585.patch
        134 kB
        Ioannis Koltsidas
      3. HADOOP-3585.3.patch
        134 kB
        Ioannis Koltsidas
      4. HADOOP-3585.2.patch
        133 kB
        Ioannis Koltsidas
      5. FailMon-standalone.zip
        4.49 MB
        Ioannis Koltsidas
      6. failmon2.pdf
        6.02 MB
        Ioannis Koltsidas
      7. failmon.pdf
        28 kB
        Ioannis Koltsidas
      8. failmon.pdf
        6.02 MB
        Ioannis Koltsidas
      9. FailMon_QuickStart.html
        12 kB
        Ioannis Koltsidas
      10. FailMon_Package_descrip.html
        48 kB
        Ioannis Koltsidas

        Issue Links

          Activity

          Hudson added a comment -

          Integrated in Hadoop-trunk #581 (See http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/581/)
          Tsz Wo Nicholas Sze added a comment (edited) -

          It seems that the committed patch causes some javadoc warnings. See HADOOP-3964.
          dhruba borthakur added a comment -

          I added the license to the beginning of the log4j and other properties files.
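          For reference, the header in question is the standard ASF license notice, written with `#` comments in properties-file syntax. This is the stock ASF header text, shown here for context rather than taken from the patch itself:

```properties
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements.  See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership.  The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License.  You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
```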
          dhruba borthakur added a comment -

          I just committed this. Thanks Ioannis!
          Ioannis Koltsidas added a comment -

          Then why don't I get audit warnings for the README file?

          Do I need to add the Apache License to configuration files? Other contrib projects (e.g. Chukwa) have configuration files without the license (as does the Hadoop core itself).

          Thanks
          Arun C Murthy added a comment -

          The audit warnings are due to the fact that the patch has added new files without the Apache License. Please fix them, thanks!
          Ioannis Koltsidas added a comment -

          As pointed out by https://issues.apache.org/jira/browse/HADOOP-3949, the javadoc warnings are due to duplicated jar files under src/contrib/chukwa/lib and trunk/lib.

          As pointed out by https://issues.apache.org/jira/browse/HADOOP-3950, the failed tests are due to TestMapRed and TestMiniMRDFSSort.

          It is not clear to me yet where the release audit warnings come from. Does anyone know? They are all for non-Java configuration files under /src/contrib/failmon/conf.

          Thanks
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12387992/HADOOP-3585.3.patch
          against trunk revision 685425.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 1 new or modified tests.

          -1 javadoc. The javadoc tool appears to have generated 1 warning message.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          -1 release audit. The applied patch generated 279 release audit warnings (more than the trunk's current 274 warnings).

          -1 core tests. The patch failed core unit tests.

          -1 contrib tests. The patch failed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3051/testReport/
          Release audit warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3051/artifact/trunk/current/releaseAuditDiffWarnings.txt
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3051/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3051/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3051/console

          This message is automatically generated.

          Ioannis Koltsidas added a comment -

          Fixed Hadoop QA errors
          Ioannis Koltsidas added a comment -

          Fixed findbugs errors and unit tests
          dhruba borthakur added a comment -

          I get a compilation error:

          init-contrib:

          compile:
          [echo] contrib: failmon

          jar:

          BUILD FAILED
          /export/home/dhruba/commit/build.xml:900: The following error occurred while executing this line:
          /export/home/dhruba/commit/src/contrib/build.xml:39: The following error occurred while executing this line:
          /export/home/dhruba/commit/src/contrib/failmon/build.xml:28: The following error occurred while executing this line:
          /export/home/dhruba/commit/build.xml:251: java.lang.ExceptionInInitializerError

          Total time: 44 seconds

          The unit tests have failed for some other reason (I think): http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3035/testReport/

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12387782/HADOOP-3585.2.patch
          against trunk revision 683671.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 1 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          -1 findbugs. The patch appears to cause Findbugs to fail.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3035/testReport/
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3035/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3035/console

          This message is automatically generated.

          dhruba borthakur added a comment -

          Submit patch for HadoopQA tests
          Ioannis Koltsidas added a comment -

          Please use the HADOOP-3585.2.patch file.
          dhruba borthakur added a comment -

          Ok, the changes have to be submitted as an "svn diff" file, not a zip file.
          dhruba borthakur added a comment -

          Ok, we got lots of comments from many people and everybody seems to agree that FailMon should go into contrib. If HadoopQA tests are successful, I will check them in.
          Jerome Boulon added a comment -

          +1 for review

          Please add copyright notices on top of source files
          Please include license files parallel to the included jars that are not Apache projects (lib/**).
          Files include some hostnames like: tracker_ensis21.almaden.ibm.com, ensis21-24, ibm.com

          Chukwa-FailMon integration:
          Chukwa-FailMon integration should be very easy since we're sharing several core concepts like input plugins/parsers, and the current code is almost compatible with Chukwa.

          Regards,
          Jerome.

          Prasenjit Sarkar added a comment -

          Comment from Mac Yang:

          Mac Yang <macyang@yahoo-inc.com> wrote on 08/05/2008 09:05:28 AM:

          >
          > Hi Prasenjit,
          >
          > I completely agree that we should check in both projects to facilitate
          > getting feedback from a wider audience. And we will be happy to work
          > together with you to make that happen.
          >
          > That said, as Jerome and Ariel have pointed out, there are several areas
          > where it makes a lot of sense for FailMon and Chukwa to integrate /
          > interoperate (data source, HDFS storage and M/R based analytics for
          > example).
          >
          > While it shouldn't be a blocker for anything, I think it will be beneficial
          > for everyone if we could figure out a way to align our resources and take
          > advantage of the great synergy between FailMon and Chukwa.
          >
          > Thanks,
          > Mac
          >
          >
          >
          > On 8/4/08 2:32 PM, "Dhruba Borthakur" <dhruba@gmail.com> wrote:
          >
          > > Hi Prasenjit,
          > >
          > > All thanks to you and Ioannis for developing FailMon.
          > >
          > > It would be really nice if somebody from the Chukwa team can provide
          > > feedback on the FailMon package, especially whether it is compatible
          > > with Chukwa. It would be good to hear Mac's comments on whether these
          > > two approaches solve the same problem or how they can be complementary
          > > to one another.
          > >
          > > thanks
          > > dhruba
          > >
          > > On Fri, Aug 1, 2008 at 4:10 PM, Prasenjit Sarkar
          > > <psarkar@almaden.ibm.com> wrote:
          > >>
          > >> Hi,
          > >>
          > >> As we discussed in our last meeting, we have uploaded the latest version of
          > >> FailMon (and some documentation) to JIRA (HADOOP-3585). If you have some
          > >> time to review it, we would be very interested to hear your comments and
          > >> suggestions before it gets committed. Dhruba has agreed to commit the patch
          > >> as soon as your team gives it a positive review. In the short term,
          > >> however, we would like different people/companies to start deploying
          > >> FailMon as soon as possible; to that end we need to commit it to the
          > >> repository as soon as possible.
          > >>
          > >> We also believe that you should commit the Chukwa code and together we can
          > >> get valuable feedback that can determine the direction of Chukwa and
          > >> FailMon. In the interim, we await your support for the commit process for
          > >> FailMon.
          > >>
          > >> Regards,
          > >>
          > >> Prasenjit Sarkar
          > >> RSM and Manager, Storage Analytics and Resiliency
          > >> Master Inventor
          > >> IBM Almaden Storage Systems Research
          > >>
          > >>
          >

          Prasenjit Sarkar added a comment -

          Attached are a couple of threads of email conversation pertinent to this issue; in summary, there is strong interest in committing both the FailMon and Chukwa projects and awaiting user feedback.

          Ariel Rabkin <asrabkin@EECS.Berkeley.EDU> wrote on 08/04/2008 03:23:04 PM:

          > As near as I could gather from the failmon code –
          >
          > Ideally, the failmon data collection plugins ("monitors") would be
          > Chukwa adaptors. The abstractions are fairly close. Provided that
          > failmon isn't going to be patching away too intensively in the next
          > month, probably the best thing to do would be commit both, and merge later.
          >
          > --Ari
          >
          > ----- Original Message -----
          > From: Dhruba Borthakur <dhruba@gmail.com>
          > Date: Monday, August 4, 2008 2:32 pm
          > Subject: Re: support for FailMon commit...
          > To: Prasenjit Sarkar <psarkar@almaden.ibm.com>
          > Cc: jboulon@yahoo-inc.com, andyk@EECS.Berkeley.EDU, asrabkin@EECS.
          > Berkeley.EDU, runping@yahoo-inc.com, eyang@yahoo-inc.com,
          > macyang@yahoo-inc.com, Ioannis Koltsidas <ikoltsi@us.ibm.com>, Karan
          > Gupta <guptaka@us.ibm.com>
          >
          > > Hi Prasenjit,
          > >
          > > All thanks to you and Ioannis for developing FailMon.
          > >
          > > It would be really nice if somebody from the Chukwa team can provide
          > > feedback on the FailMon package, especially whether it is compatible
          > > with Chukwa. It would be good to hear Mac's comments on whether these
          > > two approaches solve the same problem or how they can be complimentary
          > > to one another.
          > >
          > > thanks
          > > dhruba
          > >
          > > On Fri, Aug 1, 2008 at 4:10 PM, Prasenjit Sarkar
          > > <psarkar@almaden.ibm.com> wrote:
          > > >
          > > > Hi,
          > > >
          > > > As we discussed in our last meeting, we have uploaded the latest
          > > version of
          > > > FailMon (and some documentation) to JIRA (HADOOP-3585). If you have
          > > some
          > > > time to review it, we would be very interested to hear your comments
          > > and
          > > > suggestions before it gets committed. Dhruba has agreed to commit
          > > the patch
          > > > as soon as your team gives it a positive review. In the short term,
          > > > however, we would like different people/companies to start deploying
          > > > FailMon as soon as possible; to that end we need to commit it to the
          > > > repository as soon as possible.
          > > >
          > > > We also believe that you should commit the Chukwa code and together
          > > we can
          > > > get valuable feedback that can determine the direction of Chukwa and
          > > > FailMon. In the interim, we await your support for the commit
          > > process for
          > > > FailMon.
          > > >
          > > > Regards,
          > > >
          > > > Prasenjit Sarkar
          > > > RSM and Manager, Storage Analytics and Resiliency
          > > > Master Inventor
          > > > IBM Almaden Storage Systems Research
          > > >
          > > >

          and

          Prasenjit Sarkar/Almaden/IBM wrote on 08/04/2008 03:19:45 PM:

          > Jerome,
          >
          > I appreciate your analysis of the integration scenarios. Taking a
          > step back, we think that both Chukwa and FailMon provide interesting
          > value propositions independent of each other. For example, we have
          > had requests from a few groups wanting to use FailMon independently
          > as a quick cluster health post-processor. I'm sure that Chukwa has a
          > similar user community. In that vein, I would not like the value
          > proposition of these two complementary projects be diluted by the
          > integration discussion.
          >
          > So, I would vote for a quick committal for both projects followed by
          > integration discussions moderated by Hadoop committers using feedback
          > from Chukwa/FailMon users.
          >
          > I hope this is reasonable,
          >
          > Regards,
          >
          > Prasenjit Sarkar
          > RSM and Manager, Storage Analytics and Resiliency
          > Master Inventor
          > IBM Almaden Storage Systems Research
          >
          > Jerome Boulon <jboulon@yahoo-inc.com>
          > 08/04/2008 10:11 AM
          >
          > To
          >
          > Prasenjit Sarkar <psarkar@almaden.ibm.com>, <dhruba@gmail.com>,
          > Ioannis Koltsidas/Almaden/IBM@IBMUS, Karan Gupta/Almaden/IBM@IBMUS,
          > <andyk@cs.berkeley.edu>, <asrabkin@cs.berkeley.edu>, Runping Qi
          > <runping@yahoo-inc.com>, <eyang@yahoo-inc.com>, Mac Yang
          > <macyang@yahoo-inc.com>
          >
          > cc
          >
          > Subject
          >
          > FailMon - Chukwa integration
          >
          > Hi,
          > I have taken a look at FailMon and here is how we can integrate it with Chukwa.
          > Basically there are 3 entry points in Chukwa:
          >
          > 1- At the adaptor level (inject data)
          > 2- At the Demux level (Data analysis)
          > 3- Using the archive.
          >
          > 1- Running FailMon at the adaptor level would prevent anyone from using the real
          > data. So this should not be used in the general case.
          >
          > 2- It's possible to run FailMon as a Demux processor and output exactly what
          > we want, and that would have been my suggestion, but FailMon is not intended
          > to be used directly by the company that produces the output (at least for
          > now), so I would prefer not to use FailMon there, since we're planning to run
          > critical processors and adding any latency here may become an issue.
          >
          > 3- So my recommendation is to use all of Chukwa's archives as input for
          > FailMon. The main advantage is that all the data is grouped together in one or
          > more big Sequence files that can be easily processed using M/R, and since
          > it's offline post-processing, the impact on the production cluster could
          > be easily controlled.
          >
          > /Jerome.
          >

          Ioannis Koltsidas added a comment -

          Thanks for your comment, Otis. By "decoupled" I mean that it is not started directly by a Hadoop component, as it was in the initial version (then, it was started by NameNode.java and DataNode.java). However, since FailMon not only uses Hadoop but is also tailored for Hadoop log collection, we believe it is a good idea for it to be part of the project (this will make it more visible to people running large clusters, most of whom use Hadoop).

          In order to make it more visible (and more usable in the first place), we plan to set up a website/wiki for FailMon, where we will upload all info and documentation...

          Otis Gospodnetic added a comment -

          Curious observer's comment to the following statement:
          "FailMon is now a contrib project and its code is decoupled from the Hadoop core."

          Would it then make sense to package and publish this separately? Publishing it in Hadoop's contrib may hide it from those who could use failure monitoring outside Hadoop, but do not know to look for this gem in Hadoop's contrib.

          Ioannis Koltsidas added a comment -

          Quick Start guide

          Ioannis Koltsidas added a comment -

          Failmon Description and User Manual

          Ioannis Koltsidas added a comment -

          Release of FailMon as a contrib project, with some additional features and many bug fixes. Please refer to the user manual (failmon2.pdf) for a complete description and instructions for deployment and execution of FailMon, especially Section 4. File FailMon_QuickStart.html provides a guide to quickly set up and run FailMon. Here is the summary of changes we have made since the previous patch:

          • FailMon is now a contrib project and its code is decoupled from the Hadoop core. Only the javadoc target in the hadoop-core-trunk/build.xml file has been changed to account for the FailMon javadoc. Everything else lies under src/contrib/failmon.
          • Scheduling of monitoring jobs is now done in an ad-hoc fashion by one or more "scheduler" nodes. Execution of FailMon is thus independent of Hadoop and can be started/stopped arbitrarily. Also, it can be run with arbitrary user permissions (it doesn't have to be run by the user that runs hadoop on nodes). It can also be run selectively on nodes and even at times when Hadoop is not running.
          • We have added a mechanism for concatenating all HDFS files created by FailMon into a single HDFS file (to reduce metadata overhead at the namenode). A limit on the maximum number of HDFS files created by FailMon can be set via the configuration files.
          • We use the Commons Logging API to log messages and stack traces.
          • The user can now specify entire directories with log files to be parsed. Note that FailMon will now collect log files no matter how old they are, and upload their entries into HDFS.
          • We have made some bookkeeping information about the state of log file parsing persistent locally on nodes. For each log file ever opened on a node, we store its first log entry and the byte offset of the last entry parsed. The former enables FailMon to detect log file rotation, while the latter is used to resume parsing from the last entry parsed.
          • Added the ant tar target, which packages FailMon in a jar file and inserts it into an archive (with all required libraries and configuration files), so that it can be deployed and run independently of Hadoop.
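          The parsing-state bookkeeping in the bullets above (storing a file's first entry to detect rotation, and a byte offset to resume parsing) can be illustrated with a minimal sketch. The class and method names here are illustrative, not taken from the FailMon source:

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of the persisted parsing state described above;
// the real FailMon bookkeeping differs in detail.
class LogParseState {
    String firstEntry; // first line of the file when it was last parsed
    long lastOffset;   // byte offset just past the last parsed entry

    // Return entries added since the last call; if the first line changed,
    // the log was rotated, so parse the whole file again from the start.
    List<String> newEntries(String path) throws IOException {
        try (RandomAccessFile f = new RandomAccessFile(path, "r")) {
            String head = f.readLine();
            boolean rotated = head == null || !head.equals(firstEntry);
            f.seek(rotated ? 0 : lastOffset);
            List<String> out = new ArrayList<>();
            for (String line; (line = f.readLine()) != null; ) out.add(line);
            firstEntry = head;
            lastOffset = f.getFilePointer();
            return out;
        }
    }
}
```

          In this scheme, persisting only two small values per log file is enough to make parsing both incremental and rotation-safe.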
          dhruba borthakur added a comment -

          It would be nice if we could do the following:

          1. Remove the code changes from Namenode.java and DataNode.java. Instead run this app from a bunch of shell scripts.

          2. Move failmon.properties from conf to src/contrib/failmon/conf/ or something like that.

          3. Make the code reside in src/contrib/failmon. Let it be a contrib project.

          4. Write a junit test to test some amount of functionality. It could be based on standalone class testing.

          5. Integrate with the overall build process so that "ant compile-contrib" builds FailMon too. Similarly, "ant test" should run FailMon junit test(s).

          6. Maybe some people from the chukwa project should browse this code and give a +1.

          Once these are done, we should check this in as a contrib project.

          Ioannis Koltsidas added a comment -

          Failmon description & usage manual

          Ioannis Koltsidas added a comment -

          Thanks for your comments. I think that integrating FailMon with Chukwa and making it independent of TaskTrackers/DataNodes is the way to go. Having the FailMon data collection code wrapped in a Chukwa adaptor seems a very good idea to me. We can discuss it in more detail when the initial version of Chukwa is posted...

          Steve: Thanks for the suggestion, I'm working on this...

          steve_l added a comment -

          I think the capture stuff is independent of the nodes deployed, and so shouldn't be automatically started. When you run a whole cluster in VMs during testing, you'd be deploying many duplicate monitors. Better to have some switch on the command line like -failmon to turn failure monitoring on for that process; that switch could start a failure-monitor service alongside the rest of the system.

          Ari Rabkin added a comment - - edited

          Rick: Chukwa opted to go for the separate-process design for reasons along the lines you lay out. I haven't studied the failmon code very closely, but it looks like most of what it does could be done pretty easily from a separate process.

          Runping: My sense is that having the failmon data collection code wrapped in a Chukwa adaptor [the first option you mentioned] is more convenient. That approach avoids the complications of failmon log rotation, and removes some unneeded components. Failmon was written with a fairly similar programming model, so the work involved in merging the two efforts should be quite modest.

          Rick Cox added a comment -

          This effort seems independent of providing a distributed file system (as evidenced by the availability of a standalone version). Could the implementation be decoupled from the DataNode/NameNode daemons? Many users will find this sort of hardware failure detection useful for their entire set of hosts (including nodes that are not otherwise running any hadoop daemons). Conversely, many Hadoop users will already be running software with similar functionality, and will not need or want Hadoop to provide it bundled with the DataNodes.

          In that light, it seems like this would make more sense as a piece that can evolve independently of the Hadoop core releases, either as a sub-project or incubator project (I don't know what the Apache rules regarding those are) or as a contrib module (though that has the disadvantage of coupling the release cycles).

          Runping Qi added a comment -

          Looks like this will be complementary to the Chukwa project:
          https://issues.apache.org/jira/browse/HADOOP-3719.
          Chukwa is an HDFS-based storage system for collecting and mining log data.
          Chukwa will provide simple APIs for applications to push log data (and metrics data, or any kind of semi-structured data) to the storage.
          Once the data gets to the storage, one can run map/reduce jobs or Pig jobs to mine the data.
          Currently, we are planning to implement a local agent that will collect the log files of Hadoop service processes (DataNodes, NameNodes, TaskTrackers, etc.) and push the data to Chukwa storage. This agent will be running outside of the Hadoop processes.
          This agent may also be used for collecting system and other application metrics.

          There seem to be two possible ways that the FailMon proposed in this JIRA can work with Chukwa.
          One is to push data to Chukwa directly using the Chukwa APIs.
          The other is to produce log files and let the Chukwa agent push the data to Chukwa.

          Ioannis Koltsidas added a comment -

          Thanks very much for your input!

          Regarding Steve's comments:

          • My limited experience with smartmontools on different kinds of disks shows that the smartctl output format is different for SCSI and SATA disks. I have tested it with two different IBM SCSI disks and two Hitachi SATA disks; that's why I mention them in the comments. I believe that these two formats cover many different brands, although the attributes that appear in the smartctl output will vary among brands and models. I can remove the brand names from the comments to make them clearer (or I would be happy to list all the disk models it has been tested with, if people submit their 'smartctl -A /dev/xxx' output to me for other disk models). I can extend it for other smartctl output formats fairly easily as well, provided that I get a sample of them.
          • We plan to use the Commons Logging API for logging messages. I'm working on that.
          • Since it will take a considerable amount of work to make it portable to non-Unix systems, I think it would be better to stick to Linux for now. Therefore, the Executor thread will read the system os.name property and will only start if it is a Linux system. (Currently this does not happen; a Linux system is assumed, and I'm not sure how it will behave on other types of systems.)
          • I haven't really looked into how testing of the package should be done. But considering that failures cannot easily be injected/simulated from within Java, and especially from user space, I guess that what you suggest is probably the best way to go.
          • Each monitor runs in the Executor thread. An Executor is started for each NameNode and DataNode instance (in the constructor of the NameNode and DataNode classes) and is terminated when that node is terminated (we plan to do the same for JobTrackers and TaskTrackers as well). Other than that, no startup or shutdown code is required.
          • One should not start a monitor outside of an Executor (unless he knows exactly what he's doing). An Executor thread runs for each and every NameNode and DataNode instance on a given machine. However, if more than one Executor runs on the same machine (i.e., the machine is both a NameNode and a DataNode, or a DataNode and a TaskTracker), then the Executor that was spawned first will monitor all system metrics (the system log and the output of utilities such as ifconfig, smartctl, etc.) as well as the Hadoop log for the object by which it was spawned (i.e., the NameNode log for a NameNode, the DataNode log for a DataNode, etc.). All Executors started on the same machine after this one will monitor no system-related metrics; they will only monitor the Hadoop-related logs for the objects that spawned them. Note that if more than one Hadoop/HDFS instance is running on the same machine, you have to replace "machine" with "Hadoop/HDFS instance" in the above.
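          For reference, the SATA/ATA flavour of the smartctl output mentioned above is line-oriented and easy to split on whitespace. A minimal, hypothetical parser for one attribute line (the exact columns vary by smartctl version and disk model, so this is only a sketch, not FailMon's parser):

```java
// Hypothetical sketch of parsing one ATA attribute line from `smartctl -A`
// output; the column layout varies across smartctl versions and disk models.
class SmartAttr {
    final String name;
    final long rawValue;

    SmartAttr(String name, long rawValue) {
        this.name = name;
        this.rawValue = rawValue;
    }

    // An ATA attribute line looks roughly like:
    //   5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always - 0
    // i.e. field 1 is the attribute name and the last field its raw value.
    static SmartAttr parse(String line) {
        String[] f = line.trim().split("\\s+");
        return new SmartAttr(f[1], Long.parseLong(f[f.length - 1]));
    }
}
```

          The SCSI flavour of the output uses a different, key-value-like layout, which is why the two formats need separate handling.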

          Regarding Dhruba's comments:

          1. I agree with your idea, but I'm not sure how feasible it is. Some concerns about this approach:

          • This map-reduce job would need to run on all machines, i.e., all TaskTrackers and the JobTracker. I'm not sure how easy it is to force this to happen. Furthermore, if some DataNodes are not TaskTrackers, then how would we collect the data from those? If you think that forcing a map-reduce job to run on all nodes is feasible, then we could go for it.
          • I think this is OK for parsing the logs and uploading the collected records, but I am not sure how appropriate it is for reading the output of system utilities. I suppose that an administrator would like to run the log-parsing monitors infrequently (e.g., once a day), as they might take non-negligible time to complete. On the other hand, he is more likely to want to read the output of system utilities at smaller intervals (e.g., for ifconfig, SMART attributes and temperature sensors). This interval could be an hour or less. So, if a map-reduce job needed to be created for these every hour, a substantial overhead might be introduced (especially if map-reduce jobs are to be run on all nodes).

          2. I believe we can do that. I'll look into it.

          3. I would be happy to change the name to whatever people think is more representative of the contents of the package. Maybe we can have a logcollector package and a failure-monitoring subpackage (to capture the fact that system utilities are also read, and for the failure identification code).

          4. The filename of the uploaded HDFS file has the form failmon<hostname><timestamp>.zip, so filenames are expected to be unique. In the same context, the best thing to do, in my opinion, would be to append all locally gathered records to a single HDFS file, provided that the upload can be in compressed form. I'm not very familiar with the append API yet, and I'm also not sure whether the communication can be compressed, but if it is feasible I think it would be the best way to go. In the current approach, if very small files are uploaded, a lot of space will be wasted (since the block size is large).
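          The naming scheme above can be sketched as follows; the separator characters here are an assumption for illustration, not necessarily what FailMon emits:

```java
// Sketch of the failmon<hostname><timestamp>.zip naming scheme described
// above; a millisecond timestamp makes collisions on upload very unlikely.
class UploadNames {
    static String uploadName(String hostname, long timestampMillis) {
        return "failmon-" + hostname + "-" + timestampMillis + ".zip";
    }
}
```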

          dhruba borthakur added a comment -

          Cool stuff!

          1. It would be really nice to be able to deploy this without changing the namenode/datanode. One option would be to manually start your scheduler (which looks for the next report to be collected) and then run a map-reduce job to collect the statistics. Is this possible using your current code?

          2. Regarding the format of the serialized logs, can we use an existing serialization format rather than inventing another one? One option would be to store them as Java properties (name, value pairs) and then serialize them using Java serialization. Another option would be to use Hadoop recordio (org.apache.hadoop.record.*)

          3. Instead of calling it the failmon package, a better name could be logcollector or something more general. The logs could be used to detect failures, analyze the performance of specific machines, correlate events of one machine with another, etc. In the same vein, it might make sense to rename all configurable property names to the form "logcollector.nic.list", "logcollector.sensors.interval", etc.

          4. What happens when the framework tries to upload a file into HDFS but the HDFS file already exists?
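          Suggestion 2 above, serializing records as Java properties instead of a custom format, would look roughly like this (the key names are made up for illustration):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.Properties;

// Sketch of serializing a monitoring record as name/value pairs with
// java.util.Properties, per suggestion 2; the keys are illustrative only.
class RecordCodec {
    static byte[] serialize(Properties rec) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        rec.store(out, "failmon record");
        return out.toByteArray();
    }

    static Properties deserialize(byte[] bytes) throws IOException {
        Properties rec = new Properties();
        rec.load(new ByteArrayInputStream(bytes));
        return rec;
    }
}
```

          The resulting format is plain text, so the records stay greppable; Hadoop recordio would give a more compact binary encoding instead.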

          steve_l added a comment -

          Some comments from a quick look at the code

          • Only IBM and Hitachi HDDs are logged? Is there any way to make it easily extensible for other SMART disks, since most SCSI HDDs have this facility?
          • Why isn't the Commons Logging API being used for logging messages and stack traces?
          • What happens when you deploy on non-Unix systems?
          • How do you propose to test all of this? I could imagine that test data of various OS log files could be used in unit testing, something to push out spoof
            files to test the live thread, and something else to parse the output via DFS, but I didn't see that in the patch.
          • What is the lifecycle of the monitor, the class called Executor? I see lots of shutdown code in various places.
          • What happens if you try to start more than one monitor in the same process, or on the same machine?

          This is interesting, but I'd like to see it deployable as a standalone Service under the service code I'm putting together, rather than hidden under every kind of Hadoop service that can be brought up; also, the polling worries me. Others may have different opinions.

          Ioannis Koltsidas added a comment -

          We have uploaded an initial version of our tool. Using the patch for the trunk code, one can run FailMon on every DataNode and NameNode. All data gathered are uploaded into HDFS.

          Also provided is an OfflineAnonymizer that anonymizes system and hadoop log files, so that they can be easily distributed.

          Details can be found in the attached FailMon_Package_Descrip.html.

          Our greatest concern now is to be able to identify real hardware failures from the gathered data. To that end, we need to gather as much data from real clusters as possible, so that we can see how all kinds of errors and failures are actually logged by the system and by Hadoop. By correlating them, we will be able to systematically identify actual failures.

          So, you are very welcome to use our patch and share the collected data, and/or anonymize and share any log files you may already have.

          Ioannis Koltsidas added a comment -

          Patch for running FailMon on every NameNode and DataNode.

          Ioannis Koltsidas added a comment -

          The code for running FailMon outside Hadoop (as a stand-alone monitoring tool).

          Ioannis Koltsidas added a comment -

          A description of the package and the implementation.

          Runping Qi added a comment -

          One challenging problem is to detect slow links/nodes.
          Do you have any good suggestions for https://issues.apache.org/jira/browse/HADOOP-3589

          Ioannis Koltsidas added a comment -

          Initial Design Description


            People

            • Assignee: Ioannis Koltsidas
            • Reporter: Ioannis Koltsidas
            • Votes: 1
            • Watchers: 13

            Time Tracking

            • Original Estimate: 480h
            • Remaining Estimate: 480h
            • Time Spent: Not Specified