CHUKWA-680: Pattern recognition of Hadoop generated metrics

    Details

    • Type: New Feature
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: Data Collection
    • Labels: None
    • Environment: IBM InfoSphere BigInsights Enterprise

      Description

      Charles Lin and I are working on our IBM SJSU master's project on "Pattern recognition of Hadoop generated metrics".

      The purpose of the project is to use libsvm to predict the health of the cluster.

      The scope of the project includes:
      1) gathering a large-scale data set of metrics for healthy and unhealthy clusters
      2) using #1 and libsvm to generate a training model
      3) periodically collecting metrics and comparing them against the training model using libsvm to predict cluster health
      a) if unhealthy, sending an email notification to the system administrator
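
      As a rough illustration of steps 2 and 3, below is a minimal sketch of the libsvm workflow using its Java API (svm_problem, svm_parameter, svm.svm_train, svm.svm_predict). The feature layout and the healthy/unhealthy label encoding are illustrative assumptions, not the project's actual code.

      import libsvm.*;

      public class ClusterHealthModel {
          // Train an SVM classifier from labeled metric vectors.
          // Illustrative label encoding: 1.0 = healthy, -1.0 = unhealthy.
          public static svm_model train(double[][] metrics, double[] labels) {
              svm_problem prob = new svm_problem();
              prob.l = metrics.length;
              prob.y = labels;
              prob.x = new svm_node[prob.l][];
              for (int i = 0; i < prob.l; i++) {
                  prob.x[i] = toNodes(metrics[i]);
              }

              svm_parameter param = new svm_parameter();
              param.svm_type = svm_parameter.C_SVC;
              param.kernel_type = svm_parameter.RBF;
              param.C = 1.0;
              param.gamma = 1.0 / metrics[0].length;  // common 1/num_features default
              param.cache_size = 100;                 // kernel cache, in MB
              param.eps = 1e-3;                       // stopping tolerance
              return svm.svm_train(prob, param);
          }

          // Predict the health label of a single metrics sample.
          public static double predict(svm_model model, double[] sample) {
              return svm.svm_predict(model, toNodes(sample));
          }

          // libsvm expects sparse vectors of (index, value) nodes, 1-based.
          private static svm_node[] toNodes(double[] values) {
              svm_node[] nodes = new svm_node[values.length];
              for (int i = 0; i < values.length; i++) {
                  nodes[i] = new svm_node();
                  nodes[i].index = i + 1;
                  nodes[i].value = values[i];
              }
              return nodes;
          }
      }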

        Activity

        Eric Yang added a comment -

        Could we set this up for GSoC?

        See rules: http://community.apache.org/gsoc.html

        michael yu added a comment -

        Design document

        Eric Yang added a comment -

        Could we create a subdirectory and make the related code a Maven module? This will help us refactor the Chukwa code into a set of useful utilities. The pattern recognition library is an interesting way to look at repeated metric patterns to find problems in the system. Thanks

        michael yu added a comment -

        Yes, we can do that.

        What is the process for doing this? Where do I create this subdirectory? In JIRA?

        Eric Yang added a comment -

        Yes, create a subdirectory as part of the patch for this JIRA.

        Otis Gospodnetic added a comment -

        This looks very interesting. I took a quick look at the paper to see if there are any mentions of accuracy, but I couldn't find anything that provides specific numbers around accuracy. It is possible I missed them.

        Also, can you speak about how one might be able to generalize and apply this approach/code/etc. to non-Hadoop systems? Are there any pluggable points that would allow one to apply this to a non-Hadoop system?

        michael yu added a comment -

        Hi Eric,

        How do I create a subdirectory in JIRA?

        michael yu added a comment -

        Hi Otis,

        I may not have included a screenshot of the accuracy. You can reference Chapter 6, Performance and Benchmarks. From all of my testing on the provided data set, I recall the accuracy being anywhere between 95% and 100%.

        In general, the larger the data set you feed to SVM, the better (and more accurate) the training model.

        Unfortunately, the code was implemented specifically around querying and parsing the metrics data from HBase in a Hadoop environment. The code can (and should) be refactored and generalized to process metrics from different datasource types.
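
        One hypothetical shape for such a pluggable point: isolate metric retrieval behind a small interface so the HBase-specific code becomes just one implementation among several. The names below are illustrative only, not existing Chukwa APIs.

        import java.util.List;

        // Pluggable metrics source: each backend (HBase, flat files, a
        // remote collector, ...) returns one feature vector per sampling
        // interval, in the fixed feature order the SVM model was trained on.
        public interface MetricsSource {
            List<double[]> fetch(long startMillis, long endMillis);
        }

        // e.g. the current HBase-specific code would become:
        // public class HBaseMetricsSource implements MetricsSource { ... }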

        Otis Gospodnetic added a comment -

        Thanks Michael. I see tables with 1/1 and 100% in Chapter 6, so that must be the accuracy. I have more questions:

        1. I assume each cluster is different, so one has to train a model for one's own cluster?
        2. Is this really about healthy vs. unhealthy, or is this more about typical vs. atypical cluster workload?
        3. If the cluster's workload changes, does the model need to be retrained?

        Thanks!
        (over at http://sematext.com/spm we collect lots of metrics from different types of systems, including Hadoop and HBase, so this is the angle my questions are coming from)
        michael yu added a comment -

        Sure thing.

        1. Each cluster will have its own training model.
        2. You are correct. It is more along the lines of typical vs. atypical.
        3. If the workload changes and the existing training model has never seen it (i.e., has not processed this kind of relevant data), then the SVM engine will most likely predict (indicate) that it's "atypical". At that point, a notification will be sent to any registered email addresses. The user has the ability to correct that "atypical" data point if it actually is "typical". If this is done, the model will be retrained.
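
        A hypothetical sketch of that notify/correct/retrain loop, reusing the ClusterHealthModel sketch above; the in-memory sample store and the println standing in for email notification are illustrative assumptions.

        import libsvm.svm_model;
        import java.util.ArrayList;
        import java.util.List;

        public class FeedbackLoop {
            static final double TYPICAL = 1.0, ATYPICAL = -1.0;

            private final List<double[]> samples = new ArrayList<>();
            private final List<Double> labels = new ArrayList<>();
            private svm_model model;

            public FeedbackLoop(svm_model initialModel) {
                this.model = initialModel;
            }

            // Periodic check: record the prediction and alert on "atypical".
            public void evaluate(double[] todaysMetrics) {
                double label = ClusterHealthModel.predict(model, todaysMetrics);
                samples.add(todaysMetrics);
                labels.add(label);
                if (label == ATYPICAL) {
                    notifyAdmins();  // email any registered addresses
                }
            }

            // Admin flags the latest alert as a false alarm: relabel the
            // sample as typical and rebuild the model from all data seen.
            public void correctLastAlert() {
                if (labels.isEmpty()) return;  // nothing to correct
                labels.set(labels.size() - 1, TYPICAL);
                double[][] x = samples.toArray(new double[0][]);
                double[] y = labels.stream().mapToDouble(Double::doubleValue).toArray();
                model = ClusterHealthModel.train(x, y);
            }

            private void notifyAdmins() {
                System.out.println("ALERT: atypical metrics detected");  // stand-in for email
            }
        }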
        Otis Gospodnetic added a comment -

        Thanks Michael. Re 3. I see this in the paper:

        The predictions generated for the current day along with any corrections on false
        alarms made by the administrator are fed into libsvm engine to generate an updated
        model. The updated model will be used for interpreting next day's metrics to generate
        predictions. These steps are automated.

        Does that essentially translate to:
        if an email arrives and says "cluster unhealthy" and the person "corrects" that, then take that model and use it as the healthy/typical model tomorrow?

        Or is there something more sophisticated involved that really corrects the existing model – something that feeds the human's correction into the existing model and teaches it through this correction, without either doing what I wrote above (using the latest model as the new "healthy/typical" model) or explicitly retraining and building a whole new model?

        michael yu added a comment -

        Hi Otis,

        I apologize for the late response. I must not have seen your comment.

        If you receive an email saying the cluster is unhealthy for a given set of metrics, but you believe this to be incorrect, you click on the "correction" link provided in the email, which makes a REST API call to record the correction in the metrics data file. This will be used to update (retrain) the model, with the incremental goal of having a more accurate model.

        Regards,
        Michael
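
        For illustration, a correction endpoint of that shape might look like the JAX-RS sketch below; the path, query parameter, and FeedbackLoop wiring are assumptions for this sketch, not the actual Chukwa REST API.

        import javax.ws.rs.GET;
        import javax.ws.rs.Path;
        import javax.ws.rs.QueryParam;
        import javax.ws.rs.core.Response;

        // Hypothetical endpoint behind the email's "correction" link. GET is
        // used so the link is clickable straight from the notification email.
        // Registered as a singleton instance with the JAX-RS runtime.
        @Path("/svm")
        public class CorrectionResource {
            private final FeedbackLoop loop;

            public CorrectionResource(FeedbackLoop loop) {
                this.loop = loop;
            }

            @GET
            @Path("/correct")
            public Response correct(@QueryParam("sampleId") String sampleId) {
                // sampleId would identify which data point to relabel; this
                // sketch simply corrects the most recent alert.
                loop.correctLastAlert();
                return Response.ok("Correction recorded; model retrained.").build();
            }
        }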


          People

          • Assignee: michael yu
          • Reporter: michael yu
          • Votes: 0
          • Watchers: 3

            Dates

            • Created:
            • Updated:

              Time Tracking

              Estimated: 2,760h
              Remaining: 2,760h
              Logged: Not Specified
