Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version/s: None
    • Fix Version/s: 0.6
    • Component/s: Clustering
    • Labels: None

      Description

      This program implements an outlier detection algorithm called AVF (Attribute Value Frequency), a fast parallel outlier detection method for categorical datasets using MapReduce, introduced in this paper:
      http://thepublicgrid.org/papers/koufakou_wcci_08.pdf
      The following is an example of how to run this program under Hadoop:
      $hadoop jar programName.jar avfDriver inputData interTempData outputData
      The output data contains the ordered AVF value in the first column, followed by the original input data.
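      For reference, the AVF score of a record is the mean frequency of its attribute values across the dataset; records with the lowest scores are the most likely outliers. A minimal single-machine sketch in Java (illustrative only, not the MapReduce implementation in the patch; the class and method names are hypothetical):

      ```java
      import java.util.ArrayList;
      import java.util.HashMap;
      import java.util.List;
      import java.util.Map;

      public class AvfSketch {
        /**
         * Computes the AVF score of each record: the mean frequency of its
         * attribute values. Records with the lowest scores are the most
         * likely outliers.
         */
        public static double[] avfScores(List<String[]> records) {
          int numAttrs = records.get(0).length;
          // Count how often each value occurs in each attribute column.
          List<Map<String, Integer>> counts = new ArrayList<>();
          for (int j = 0; j < numAttrs; j++) {
            counts.add(new HashMap<>());
          }
          for (String[] rec : records) {
            for (int j = 0; j < numAttrs; j++) {
              counts.get(j).merge(rec[j], 1, Integer::sum);
            }
          }
          // AVF score = average frequency of the record's attribute values.
          double[] scores = new double[records.size()];
          for (int i = 0; i < records.size(); i++) {
            double sum = 0;
            for (int j = 0; j < numAttrs; j++) {
              sum += counts.get(j).get(records.get(i)[j]);
            }
            scores[i] = sum / numAttrs;
          }
          return scores;
        }
      }
      ```

      Sorting records by this score ascending yields the output ordering described above.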

        Activity

        Sean Owen added a comment -

        I think this has timed out?
        Robin Anil added a comment -

        I need some time on this one. I need to change things around, like the input/output formats; I would say it shouldn't be a blocker for this release.
        Sean Owen added a comment -

        I'd say you're welcome to polish it up and commit then, if you judge that it's essentially useful and ready and supportable. Tests are nice, but you can write them or Tony can contribute them later. Either that, or decide it's not going to be added so Tony doesn't have to bother.
        Robin Anil added a comment -

        This was missing some tests, which Tony mentioned he would fix. Otherwise it looks committable.

        Robin
        Sean Owen added a comment -

        This is another one that seems to have died? Tony, Robin, what's the latest? It seems like there was more work to do on the patch.
        Grant Ingersoll added a comment -

        Tony, can you add AVF to https://cwiki.apache.org/confluence/display/MAHOUT/algorithms.html?
        tony cui added a comment -

        Hi all,
        I uploaded a new patch file that fixes the 8 problems mentioned by Robin, except the tests.
        I will upload tests later, after I figure out how to write them.
        tony cui added a comment -

        New patch file.
        tony cui added a comment -

        Actually, I have already fixed the above problems, except the test part. I have figured out how to do unit testing within Hadoop.
        I will keep trying in the following weeks.
        Sean Owen added a comment -

        I'm willing to keep iterating on this patch – is it still live and interesting? I think the main issues were the ones identified above, making it consistent with the code base.
        Ted Dunning added a comment -

        Outlier detection is (normally) unsupervised exploratory learning. Occasionally it is used to generate a feature for supervised learning, much as clustering algorithms can be used.

        As such, I would group it as a clustering into "normal" and "outlier" clusters. It won't evaluate the same way, but it definitely has the same workflow.
        Sean Owen added a comment -

        What do others think of 'outlier' – is this a concept on the level of 'clustering' or 'classification', or can we taxonomize it better?

        You can use Hadoop 0.20.2 (I do), but for consistency with the code, compatibility with AWS, and to avoid bugs, I suggest you not use the newer Hadoop APIs.
        tony cui added a comment -

        Thanks, Robin. I will check the suggestion list one by one as soon as possible.

        Thanks, Sean. I think outlier detection is a kind of data mining algorithm like classification or clustering, which can comprise a bunch of methods; AVF is just a simple one of them. That is why I created an "outlier" folder at the same level as classification.

        Another problem, which I think may be significant for me: must I use Hadoop 0.19.x? I have not used this version before.
        Sean Owen added a comment -

        Let's also think about where it fits into the project. This is not a CF algorithm, is it? It looks more like classification. So I am not sure if a "top-level" outlier package is the right place?

        Yes, as Robin says, this ought to look a lot more like the other jobs in classification. More broadly we should be moving all jobs to work more alike (e.g. around AbstractJob), but if it looks like its neighbors, that's good. Right now we are using the older Hadoop 0.19.x APIs (i.e. not Configuration) since, well, the new APIs don't quite work in all cases and services like AWS don't support them yet.
        Robin Anil added a comment -

        Hi Tony. Nice work on the patch. But before we commit this, there are a couple of things you need to cover. I still have to read the algorithm in detail to know what's happening, but I have some queries and suggestions below, which form a kind of checklist to make this a committable patch.

        1) I am not a fan of text-based input, though it is what most of the algorithms in Mahout were first implemented in. The idea of splitting and joining text files based on commas is not very clean. Can you convert this to deal with a SequenceFile of VectorWritable or some other Writable format? What's your input schema?
        2) There is a code style we enforce in Mahout. You can use mvn checkstyle:checkstyle to see the violations. We also have an Eclipse formatter which formats code to almost match the checkstyle (there are rare manual interventions required). Take a look at https://cwiki.apache.org/MAHOUT/howtocontribute.html; you will find the Eclipse formatter file at the bottom.
        3) For parsing args, use the Apache Commons CLI2 library. Take a look at o/a/m/clustering/kmeans/KMeansDriver to see usage.
        4) What is Utils being used for?
        5) Regarding this setup code in the patch:

        @Override
        public void setup(Context context) throws IOException, InterruptedException {
          String filePath = context.getConfiguration().get("a");
          sumAttribute = Utils.readFile(filePath + "/part-r-00000");
        }

        Please use the distributed cache to read the file in a map/reduce context. See the DictionaryVectorizer Map/Reduce classes for usage.
        6) job.setNumReduceTasks(1); ? Is this necessary? Doesn't it hurt the scalability of this algorithm? Is the single reducer going to get a lot of data from the mappers? If yes, then you should think of removing this constraint and letting it use the Hadoop parameters, or parameterize it.
        7) Can this job be optimised using a Combiner? If yes, it's really worth spending time to make one.
        8) Tests!
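        On points 6 and 7 above, the map-side frequency counting is a natural fit for a combiner, since counts are associative partial sums: the combiner collapses repeated keys on the map side so far less data crosses the shuffle. A plain-Java sketch of that idea (hypothetical class and method names, not tied to the Hadoop API):

        ```java
        import java.util.HashMap;
        import java.util.Map;

        public class CombinerSketch {
          /**
           * Map side: emit one (attrIndex:value, 1) pair per attribute of each
           * record, with the combiner collapsing repeated keys into partial sums
           * before the shuffle.
           */
          public static Map<String, Integer> mapAndCombine(String[][] split) {
            Map<String, Integer> partial = new HashMap<>();
            for (String[] record : split) {
              for (int j = 0; j < record.length; j++) {
                partial.merge(j + ":" + record[j], 1, Integer::sum);
              }
            }
            return partial;
          }

          /** Reduce side: merge the partial sums produced by each map task. */
          public static Map<String, Integer> reduce(Iterable<Map<String, Integer>> partials) {
            Map<String, Integer> totals = new HashMap<>();
            for (Map<String, Integer> p : partials) {
              p.forEach((k, v) -> totals.merge(k, v, Integer::sum));
            }
            return totals;
          }
        }
        ```

        With this shape, the number of records each reducer sees is bounded by the number of distinct attribute values rather than the dataset size, which also bears on whether the single-reducer constraint in point 6 is needed.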
        tony cui added a comment -

        I mean, what am I supposed to do next?
        tony cui added a comment -

        I just submitted this patch, which implements the AVF algorithm.
        I'm sorry, I am new here and not familiar with the process of contributing to Mahout.
        Can any committer give me some suggestions?

        Thanks in advance!

          People

          • Assignee: Robin Anil
          • Reporter: tony cui
          • Votes: 0
          • Watchers: 0

          Dates

          • Created:
          • Updated:
          • Resolved: