Details

    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None

      Description

      Weka is one of the most popular data mining packages on the planet. It's used by numerous people around the world. Since Weka is written in Java, it should be pretty straightforward to integrate Weka with Hive.

      We just need to create some GenericUDAF functions that map to Weka's classifier training process. The output of each GenericUDAF can simply be the serialized form of the trained classifier.
      We should add another GenericUDF to load the classifier to classify new instances.
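The serialize-then-load round trip described above can be sketched in plain Java. This is only an illustration under stated assumptions: `ToyLogisticModel` and `ModelCodec` are hypothetical names, and the toy model stands in for a real Weka classifier (Weka classifiers are `java.io.Serializable`, so the same byte-array approach should apply). No Hive or Weka API is used here.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

// Hypothetical stand-in for a trained classifier; a real Weka classifier
// would take its place.
class ToyLogisticModel implements Serializable {
    private static final long serialVersionUID = 1L;
    private final double[] weights; // weights[0] is the intercept

    ToyLogisticModel(double[] weights) {
        this.weights = weights.clone();
    }

    // Returns the predicted probability for one feature vector.
    double predict(double[] features) {
        double z = weights[0];
        for (int i = 0; i < features.length; i++) {
            z += weights[i + 1] * features[i];
        }
        return 1.0 / (1.0 + Math.exp(-z));
    }
}

// Hypothetical helper: the bytes are what the aggregate function could
// return, and what the scoring function could later load.
class ModelCodec {
    static byte[] serialize(Serializable model) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(model);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    static Object deserialize(byte[] data) {
        try (ObjectInputStream in =
                 new ObjectInputStream(new ByteArrayInputStream(data))) {
            return in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException("Model class not on the classpath", e);
        }
    }
}
```

In this sketch the training side would call `ModelCodec.serialize(model)` once per group, and the scoring side would call `ModelCodec.deserialize(bytes)` and cast the result before predicting. Caching the deserialized model per distinct byte array would probably be needed in practice to avoid repeating the work for every row.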

      The Hive syntax can be as simple as the example below. (Note: In the example below, most of the "table." prefixes can be omitted. I kept them only to make the query semantics easier to follow.)

      The first query builds a model (logistic regression) for predicting the CTR of each link on each page, based on user information. The second query evaluates the model on some data.

      SELECT logdata.pageid, logdata.linkid, LogisticRegression( logdata.clicked, userinfo.age, userinfo.gender, userinfo.country, userinfo.interests ) as model
      FROM logdata JOIN userinfo
      ON logdata.userid = userinfo.userid
      GROUP BY logdata.pageid, logdata.linkid;
      
      SELECT logdata.pageid, logdata.linkid, logdata.clicked, LogisticRegressionEvaluate(classifiers.model, userinfo.age, userinfo.gender, userinfo.country, userinfo.interests) AS predicted
      FROM logdata JOIN userinfo
      ON logdata.userid = userinfo.userid
      JOIN classifiers
      ON logdata.pageid = classifiers.pageid AND logdata.linkid = classifiers.linkid;
      

      References:
      Use Weka in your Java Code: http://weka.wiki.sourceforge.net/Use+Weka+in+your+Java+code

      Note:
      Weka is licensed under the GPL. We won't be able to include the code directly in Hive, but we can keep the discussion here.

      1. weka.jar
        5.09 MB
        Zheng Shao
      2. HIVE-672.2.not.to.be.included.patch
        12 kB
        Zheng Shao
      3. HIVE-672.1.not.to.be.included.patch
        20 kB
        Zheng Shao

        Activity

        Zheng Shao added a comment -

        @HIVE-672.1.not.to.be.included.patch:

        This patch successfully integrates Weka LogisticRegression with Hive. It contains an example query, which trains a model and uses the model to predict.
        It does not yet support classifier options or model evaluation such as cross-validation or ROC analysis.

        While implementing this, I found several problems:
        1. GenericUDAF/GenericUDF are not easy to use (although they have superior performance). I don't think we should ask our users to implement GenericUDAF/GenericUDF just because they need variable-length arguments. We should be able to pass Java primitive objects to a UDF through a method like Object evaluate(Object[] parameters). This is not efficient, but that is acceptable for machine learning and data mining work, since the learning process takes much longer. (HIVE-699)
        2. There is no way to "create temporary function" for a GenericUDAF (HIVE-698).
        3. There is a bug in the GroupByOperator initialization order (HIVE-697).

        I will work on these 3 items first.
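To make the first item concrete, here is a rough sketch of what a function written against the proposed `Object evaluate(Object[] parameters)` shape might look like. Everything in it is an assumption for illustration: `SimpleVarArgsUdf` and `toFeatures` are hypothetical names, not part of Hive's API, and the body simply sums its numeric arguments as a placeholder for real scoring logic.

```java
// Hypothetical class illustrating the proposed simple calling convention.
// It is not a Hive UDF and does not use any Hive interface.
class SimpleVarArgsUdf {

    // Converts plain Java objects into numeric features.
    // null is treated as a missing value and mapped to NaN.
    static double[] toFeatures(Object[] parameters) {
        double[] features = new double[parameters.length];
        for (int i = 0; i < parameters.length; i++) {
            Object p = parameters[i];
            if (p == null) {
                features[i] = Double.NaN;
            } else if (p instanceof Number) {
                features[i] = ((Number) p).doubleValue();
            } else if (p instanceof Boolean) {
                features[i] = ((Boolean) p) ? 1.0 : 0.0;
            } else {
                throw new IllegalArgumentException(
                    "Unsupported argument type: " + p.getClass().getName());
            }
        }
        return features;
    }

    // Accepts any number of arguments, as in the signature quoted above.
    // Placeholder logic: returns the sum of the non-missing features.
    public Object evaluate(Object[] parameters) {
        double sum = 0.0;
        for (double f : toFeatures(parameters)) {
            if (!Double.isNaN(f)) {
                sum += f;
            }
        }
        return Double.valueOf(sum);
    }
}
```

The trade-off is the one noted in the comment: boxing every argument and checking its type on each row is slower than the GenericUDF path, but the author only has to write a single method.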

        Zheng Shao added a comment -

        HIVE-672.2.not.to.be.included.patch is the patch.
        weka.jar should be put into contrib/lib.

        There are test cases in the patch to show how to use the new functions.


          People

          • Assignee:
            Zheng Shao
            Reporter:
            Zheng Shao
          • Votes:
            5
            Watchers:
            10
