Details

    • Type: Sub-task
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.1
    • Component/s: Classification
    • Labels:
      None

      Description

      The focus is to implement an improved text classifier based on the paper "Tackling the Poor Assumptions of Naive Bayes Text Classifiers" (Rennie et al., ICML 2003): http://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf

      Attachments

      1. twcnb.jpg
        48 kB
        Robin Anil
      2. MAHOUT-60-17082008.patch
        246 kB
        Robin Anil
      3. MAHOUT-60-15082008.patch
        238 kB
        Robin Anil
      4. MAHOUT-60-13082008.patch
        236 kB
        Robin Anil
      5. MAHOUT-60.patch
        49 kB
        Robin Anil
      6. MAHOUT-60.patch
        66 kB
        Robin Anil
      7. MAHOUT-60.patch
        127 kB
        Robin Anil
      8. MAHOUT-60.patch
        66 kB
        Robin Anil
      9. MAHOUT-60.patch
        66 kB
        Robin Anil
      10. country.txt
        2 kB
        Robin Anil

        Activity

        Robin Anil added a comment -

        Before using this patch, please apply the MAHOUT-9 (Implement MapReduce BayesianClassifier) patch and follow the instructions given there.

        The 20 Newsgroups trainer requires the collapsed version of the dataset, as produced in MAHOUT-9.

        Steps to get it running

        ant extract-20news-18828
        ant examples-job

        bin/start-all.sh // Start Hadoop
        bin/hadoop dfs -put <MAHOUT_HOME>/work/20news-18828-collapse 20newsinput
        bin/hadoop jar <MAHOUT_HOME>/build/apache-mahout-0.1-dev-ex.jar org.apache.mahout.examples.classifiers.cbayes.TwentyNewsgroups -t -i 20newsinput/20news-18828-collapse -o 20newsoutput // This will train a classifier and write the model to a file named "model" in the folder 20newsoutput

        Copy the model file from the DFS to the local filesystem

        bin/hadoop dfs -get 20newsoutput 20newsoutput

        Test on the 20 Newsgroups data to check how well it is able to classify the training set. Accuracy is around 98.4% on the training set. But the only way to check that the implementation is correct is to do some cross-validation, which is yet to be done.

        java -Xmx1024M org.apache.mahout.examples.classifiers.cbayes.Test20Newsgroups -p 20newsoutput/model -t work/20news-18828/

        TODO: add an option to split the 20 Newsgroups dataset into a train and a test set (one possible approach is sketched below). Meanwhile, if you have separate train and test sets of the 20 Newsgroups data, you can build the model on one of them and test on the other.
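
        One possible way to do such a split, purely as an illustrative sketch (not part of this patch), assuming the 20news-18828 layout of one directory per newsgroup:

          import java.io.File;
          import java.util.Random;

          /** Moves a random ~20% of the files in each category directory into a parallel
           *  test directory, leaving the rest in place as the train set. */
          public class SplitTrainTest {
            public static void main(String[] args) {
              File inputDir = new File(args[0]);  // e.g. work/20news-18828
              File testDir = new File(args[1]);   // e.g. work/20news-18828-test
              double testFraction = 0.2;
              Random random = new Random(42);     // fixed seed so the split is reproducible

              for (File category : inputDir.listFiles()) {
                if (!category.isDirectory()) {
                  continue;
                }
                File testCategory = new File(testDir, category.getName());
                testCategory.mkdirs();
                for (File doc : category.listFiles()) {
                  if (random.nextDouble() < testFraction) {
                    doc.renameTo(new File(testCategory, doc.getName()));
                  }
                }
              }
            }
          }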

        Grant Ingersoll added a comment -

        Hi Robin,

        Can you work on building up some Junit tests now? Also, can you generate the patch again against the latest trunk?

        Thanks,
        Grant

        Robin Anil added a comment - edited

        Hi, I have been working on the output statistics generation. I will update it
        shortly.

        Robin Anil added a comment - edited

        This is the latest diff against the trunk.

        Changes:
        * Added a ResultAnalyzer class to generate classification statistics.
        * Currently generates a confusion matrix and the percentage accuracy.
        * It will be extended to include precision, recall, RMSE, relative absolute error, and the kappa statistic.
        * All such classes implement a Summarizable interface (sketched below).
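
        A minimal sketch of that interface, just to make the design concrete (treat the exact signature as an assumption, not the patch's code):

          /** Implemented by statistics holders such as ResultAnalyzer and ConfusionMatrix. */
          public interface Summarizable {
            /** Renders the collected statistics as human-readable text. */
            String summarize();
          }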

        Before using this patch please use MAHOUT-9 (Implement MapReduce BayesianClassifier) patch and the instructions given above.

        The number of reducers is limited to 1 at the moment. I will need to figure out a way to read intermediate results.

        You can run TestTwentyNewsgroups directly from the DFS as follows:

         
        $bin/hadoop jar <MAHOUT_HOME>/build/apache-mahout-0.1-dev-ex.jar org.apache.mahout.examples.classifiers.cbayes.TestTwentyNewsgroups -p 20newsoutput/model -t work/20news-18828
        
        Robin Anil added a comment -

        There are a lot of changes in this patch. Most of the files have been renamed. The trainer is now a pipeline of 5 MapReduce jobs; the exact functionality of each job is shown below. The trainer can support any number of maps and any number of reduces. I am also using the Apache Commons Lang library, commons-lang-2.4.jar (which should be put on the classpath).

         
              // Read the features in each document, normalized by the length of the document
              CBayesFeatureDriver.runJob(input, output);

              // Calculate the TfIdf for each word in each label
              CBayesTfIdfDriver.runJob(input, output);

              // Calculate the sums of weights for each label, for each feature, and the overall total
              CBayesWeightSummerDriver.runJob(input, output);

              // Calculate W_ij = log(Theta) for each label and feature. This step actually generates the complement class
              CBayesThetaDriver.runJob(input, output);

              // Calculate the normalization factor Sigma_W_ij for each complement class
              CBayesThetaNormalizerDriver.runJob(input, output);
        

        I have tested it on a six-machine cluster. On the 20 Newsgroups dataset it takes around 4 minutes to train, compared with 20-30 seconds when the CNB model was built in memory. But the design is based on the assumption that the datasets are going to be too huge to fit into memory.

        There could be a lot of speed improvement if the MapReduce operations could somehow be chained.
        So instead of Map1 -> Reduce1 -> Map2 -> Reduce2 -> ...,
        if it were possible to do Map1 -> Reduce1 -> Reduce2 -> Reduce3 -> ..., then we could save a lot of time on IO. I am not sure whether such functionality exists in Hadoop.

        I will test it out on the DMOZ or Wikipedia dataset (if I can preprocess it within a reasonable amount of time).

        The other change is that there is no longer a single model file. The model is stored in multiple part files in the folders trainer-theta and trainer-thetaNormalizer.

        To Train

         
        $bin/hadoop jar <MAHOUT_HOME>/build/apache-mahout-0.1-dev-ex.jar org.apache.mahout.examples.classifiers.cbayes.TrainTwentyNewsgroups -t -i 20newsinput -o 20newsoutput
        

        To Test

         
        $bin/hadoop jar <MAHOUT_HOME>/build/apache-mahout-0.1-dev-ex.jar org.apache.mahout.examples.classifiers.cbayes.TestTwentyNewsgroups -p 20newsoutput  -t work/20news-18828
        

        Next Step, to make the Classifier and the Testing completely Map Reduce.

        Robin Anil added a comment -

        FYI: this is Transformed Weight-normalized Complement Naive Bayes, steps 1-8.
        These are being shrunk into 5 MapReduce jobs.

        Steven Handerson added a comment -

        I'm still a newbie at this, just running your code, but I noticed that I was
        able to get high utilization (30-40%) on the early map-reduce tasks,
        but only about 3% on the fourth map-reduce pass.

        Any idea why this would be? Obviously I'm still playing with the machine's
        Hadoop installation parameters (this is an 8-core, 1-chip Sun box).

        Steven Handerson added a comment -

        Maybe one thing you need is a configurable number of reduce tasks –
        mapping is high-util, but even the first reduce drops way down
        (only 1 reduce?).

        Robin Anil added a comment -

        The number of reduce tasks is automatically picked up from your Hadoop
        configuration; just set it in your hadoop-site.xml. BTW, the 5th MapReduce
        pass outputs the data and calculates the sum for length normalization. The 4th
        one ought to have high CPU utilization (this is the stage where the weight is
        calculated as per the CNB formulation).

        Steven Handerson added a comment -

        Robin,

        I can get the training working very well – I've even started working with a very
        large file (700+ Meg, and not done creating yet, the old slow way). No problem.
        But I'd say the judgment / application of a model maybe needs a better map-reduce
        treatment now – at least I think it's working (I've seen it work on smaller
        training data) but with my larger task it's getting bogged down.

        Maybe I'll think about it / try it, but I'm very new to map-reduce, but it seems
        like you should be able to do something clever with throwing the
        test data (feature|doc) and model data (feature|category, increment) together,
        reducing and emitting category increments / decrements for each
        (doc, category) pair, and then summing them up in a reduce.
        Or just emitting (doc|category,increment) for all features, and then you
        can easily also in the reduce find the maximal category.

        I don't think this is what you're doing yet – you're thinking of loading
        the model, rather than shoving it through a map/reduce sequence. I think.

        Robin Anil added a comment - edited

        I thought of that, but I wasn't sure. If classifying one document requires
        one MapReduce pass over the whole model, then it is more or less a waste of
        resources. But if I do batch classification, this is what I would
        do. I want to know whether there is some tweak that can be done.

        For the model, Map outputs (docid:label:featureid, weight), which will lead to
        (number of docs * number of labels * number of features) keys (too huge).
        For each document, Map outputs (docid:label:featureid, featureFrequency).

        Reducer: each reducer will get 1 or 2 values. If it is one value,
        ignore it.
        If it is 2 values, then multiply them and output (docid:label:featureid, weight).

        Start a second MapReduce on this: Map outputs (docid:label, weight),
        then the reducer sums up the probabilities for each docid:label pair.

        A third MapReduce can take the (doc, label) pairs and emit docid => label:weight;
        the reducer then takes the minimum-weight label and outputs the result.

        Any thoughts?

        Robin
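
        A minimal sketch of the reducer for the first join pass described above, written against the old org.apache.hadoop.mapred API; the class name and key layout are assumptions for illustration, not part of the patch:

          import java.io.IOException;
          import java.util.Iterator;

          import org.apache.hadoop.io.DoubleWritable;
          import org.apache.hadoop.io.Text;
          import org.apache.hadoop.mapred.MapReduceBase;
          import org.apache.hadoop.mapred.OutputCollector;
          import org.apache.hadoop.mapred.Reducer;
          import org.apache.hadoop.mapred.Reporter;

          /** Join pass: the model weight and the document feature frequency arrive under the
           *  same docid:label:featureid key; a key with both values multiplies them, a key
           *  with only one value is dropped. */
          public class WeightJoinReducer extends MapReduceBase
              implements Reducer<Text, DoubleWritable, Text, DoubleWritable> {

            public void reduce(Text key, Iterator<DoubleWritable> values,
                OutputCollector<Text, DoubleWritable> output, Reporter reporter) throws IOException {
              double product = 1.0;
              int count = 0;
              while (values.hasNext()) {
                product *= values.next().get();
                count++;
              }
              if (count == 2) { // both the model weight and the feature frequency were present
                output.collect(key, new DoubleWritable(product));
              }
              // count == 1: the feature appears on only one side, so it is ignored
            }
          }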

        Ted Dunning added a comment -

        Classifying a single document isn't particularly an interesting task to parallelize since it is already so fast.

        The interesting parallel tasks are training and batch classification. This is pretty much as you say. For batch classification, I would find it tempting to have each map do a single document classification and emit the result. At that point, you have trivial parallelism and no need for a reduce. You need to have a bit of a lookup table on each mapper, but this isn't usually all that big (typically only thousands of interesting term weights, possibly hundreds of thousands for some kinds of application). Not only do you not need the reduce, but you don't need three phases of map-reduce either.

        Training is a different matter since it involves data that is found in one way (terms in documents) that needs to be aggregated another way (terms for different categories). That is natural for map-reduce as well.
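
        A minimal sketch of that map-only batch classification, written against the old org.apache.hadoop.mapred API; the model loading, input format, and scoring rule are assumptions for illustration (a real CNB classifier would apply the complement decision rule), not the actual Mahout code:

          import java.io.IOException;
          import java.util.HashMap;
          import java.util.Map;

          import org.apache.hadoop.io.LongWritable;
          import org.apache.hadoop.io.Text;
          import org.apache.hadoop.mapred.JobConf;
          import org.apache.hadoop.mapred.MapReduceBase;
          import org.apache.hadoop.mapred.Mapper;
          import org.apache.hadoop.mapred.OutputCollector;
          import org.apache.hadoop.mapred.Reporter;

          /** Each mapper loads the term -> (label -> weight) table once and classifies
           *  every input document independently; no reduce phase is needed. */
          public class ClassifyMapper extends MapReduceBase
              implements Mapper<LongWritable, Text, Text, Text> {

            private final Map<String, Map<String, Double>> weights = new HashMap<String, Map<String, Double>>();

            public void configure(JobConf job) {
              // assumption: populate 'weights' here, e.g. from the DistributedCache or a
              // side file on the DFS whose path is passed in through the JobConf
            }

            public void map(LongWritable key, Text value,
                OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
              // assumed input format: one document per line, "docId<TAB>document text"
              String[] parts = value.toString().split("\t", 2);
              String docId = parts[0];
              String text = parts.length > 1 ? parts[1] : "";

              // sum the per-label weights of the document's terms
              Map<String, Double> scores = new HashMap<String, Double>();
              for (String term : text.toLowerCase().split("\\s+")) {
                Map<String, Double> labelWeights = weights.get(term);
                if (labelWeights == null) {
                  continue;
                }
                for (Map.Entry<String, Double> e : labelWeights.entrySet()) {
                  Double prev = scores.get(e.getKey());
                  scores.put(e.getKey(), (prev == null ? 0.0 : prev) + e.getValue());
                }
              }

              // emit the highest-scoring label (illustrative decision rule only)
              String best = "unknown";
              double bestScore = Double.NEGATIVE_INFINITY;
              for (Map.Entry<String, Double> e : scores.entrySet()) {
                if (e.getValue() > bestScore) {
                  bestScore = e.getValue();
                  best = e.getKey();
                }
              }
              output.collect(new Text(docId), new Text(best));
            }
          }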

        Robin Anil added a comment -

        If you remember my other post about creating a server to hold the matrix for looking up the term:label weights, such a server would let you classify a document from anywhere. The mapper just requests the weights and passes them to the reducer. Maybe HBase can help? I will have to experiment with it a bit.

        Steven Handerson added a comment -

        Well, yes, I was thinking of batch classification (like, constructing
        confusion matrices, or running the training data back through the model to test the model).
        But the problem I'm running into with the code is that the model
        is too large to load in a single process, let alone multiple mappers.
        So classifying fast doesn't help if simply loading the model is very slow
        (and I mean very slow, and then doesn't necessarily succeed anyway – out of mem).

        I also admit that "batch classification" – in the sense that there is overlap
        in the different feature sets from different documents –
        makes it more interesting / saves some work perhaps, but you can't count on that anyway.
        Yes, you might want something fast for single-document classification,
        but map-reduce isn't the right tool for that. Indexed structures are better.

        The choices are either some indexed structure (like HBase) which can
        handle large datasets / models, or just use map-reduce to join
        the model to the data. The latter is definitely not useless –
        usages similarly divide into people who have a lot of data / docs to classify,
        versus people who are building some kind of online system.
        Throughput versus round-trip time.
        Also, note that with an indexed solution, you might have contention for the indexed data –
        if there's only one copy (which should probably be the case, for large models).

        So I'd suggest implementing both, and to consider the cases where the models are very large
        (which is where map-reduce shines anyway). I might be the only person commenting
        who has tried a lot of data (800Meg input document file), and as I said it would
        be nice to have some results (confusion matrices)
        to see if the method is working for me and my particular data.

        If nobody else agrees, I might have to try it myself, but I'm new at this
        and sometimes get pulled away for other work.

        Steven Handerson added a comment -

        Hmm – well, in a sense I agree that all you really need is the model
        sorted by feature (being able to find all information about a particular feature).
        So maybe a final sorting step, and something that can find
        everything about a particular feature would make sense.
        But again without the map-reduce, there will be contention for that structure.

        Steven Handerson added a comment -

        I'll add one more thing so you see where I'm coming from.

        Map-reduce is basically a hash join (database term),
        which are actually slow in a database in my experience
        (because of the space allocation required, but also just not using an index).
        Map-reduce makes hash joins relatively fast,
        by using multiple processors and networking.
        You could do other kinds of joins in map-reduce, use indexes, use ordered data,
        things like "partitions" in database parlance,
        or you could just use a database.
        Databases have a problem that the round-trip to the database
        sometimes makes your application much slower than necessary,
        for doing lots of individual queries in sequence (randomly accessing
        but doing so using an index) –
        they are good at streaming the results of a join out (which use indexes).
        Of course, some applications (like web-based) are slower in
        aggregate, in order to answer individual queries relatively quickly
        (faster round-trip time).
        There may be similar issues with respect to map-reduce,
        but you can see there's a kind of connection between what
        databases do and map-reduce does: join data sources on some field or computed value.

        Hmm – also, it's not that the data is too large (yet) –
        my model is about 1.5 Gig, and I'm (as of now) trying to use the code
        in a single process rather than Hadoop, but maybe the on-disk model
        size isn't the required max process size (-Xmx4g didn't work), so I'm trying larger and larger -Xmx
        (and I do have 64-bit Java available – right now trying 8 Gig).

        Robin Anil added a comment -

        To split the Wikipedia XML dump into small XML chunks:

         
        hadoop jar build/apache-mahout-0.1-dev-ex.jar org.apache.mahout.examples.classifiers.cbayes.WikipediaXmlSplitter -d /home/robin/data/wikipedia/articles/enwiki-latest-pages-articles.xml -o  /home/robin/data/wikipedia/chunks/ -c 64
          

        Put the chunks into the DFS:

         
         hadoop dfs -put /home/robin/data/wikipedia/chunks/ wikipediadump
         

        Create the country-based split of the Wikipedia dataset (see the attached country.txt file):

         hadoop jar build/apache-mahout-0.1-dev-ex.jar org.apache.mahout.examples.classifiers.cbayes.WikipediaDatasetCreator -i wikipediadump -o wikipediainput -c pathto/country.txt
        

        Train the classifier on the country-based split of Wikipedia:

        $bin/hadoop jar <MAHOUT_HOME>/build/apache-mahout-0.1-dev-ex.jar org.apache.mahout.examples.classifiers.cbayes.TrainTwentyNewsgroups -t -i wikipediainput -o wikipediamodel
        

        Fetch the input files for testing:

         hadoop dfs -get wikipediainput wikipediainput 
        

        Test the classifier:

        $bin/hadoop jar <MAHOUT_HOME>/build/apache-mahout-0.1-dev-ex.jar org.apache.mahout.examples.classifiers.cbayes.TestTwentyNewsgroups -p wikipediamodel -t  wikipediainput
        
        Robin Anil added a comment -

        Fixed some bugs in the previous patch

        Grant Ingersoll added a comment -

        Hi Robin,

        What's the Summarizable interface for? I don't see any uses or implementations.

        Robin Anil added a comment -

        It is implemented by ConfusionMatrix and ResultAnalyzer.

        Robin Anil added a comment - edited

        I have merged the BayesClassifier and CBayesClassifier. Both now use some common MapReduce operations, and the classifier-specific MapReduce operations are factored out.
        The Model is also factored out.

        The new feature in this patch is an n-gram generator, enabled with the CLI parameter -ng <gram-size>.
        If a model is built with 3-grams, then you can classify using 1-, 2-, or 3-grams.

        Try increasing the n-gram size and see how the classification accuracy grows with it.

        cbayes.TestTwentyNewsgroups is renamed to bayes.TestClassifier
        cbayes.TrainTwentyNewsgroups is renamed to bayes.TrainClassifier

        The existing tests will fail when using this patch, so don't worry; new tests will be put up shortly.

         
             //To Train a Bayes Classifier using tri-grams
              hadoop jar build/apache-mahout-0.1-dev-ex.jar org.apache.mahout.examples.classifiers.bayes.TrainClassifier -t -i newstrain -o newsmodel -ng 3 -type bayes
             //To Test a Bayes Classifier using tri-grams
              hadoop jar build/apache-mahout-0.1-dev-ex.jar org.apache.mahout.examples.classifiers.bayes.TestClassifier -p newsmodel -t work/newstest -ng 3 -type bayes
        
             //To Train a CBayes Classifier using bi-grams
              hadoop jar build/apache-mahout-0.1-dev-ex.jar org.apache.mahout.examples.classifiers.bayes.TrainClassifier -t -i newstrain -o newsmodel -ng 2 -type cbayes
             //To Test a CBayes Classifier using bi-grams
              hadoop jar build/apache-mahout-0.1-dev-ex.jar org.apache.mahout.examples.classifiers.bayes.TestClassifier -p newsmodel -t work/newstest -ng 2 -type cbayes
         

        Hope you will enjoy using this patch.

        Grant Ingersoll added a comment -

        Hi Robin,

        Things are coming along. Looks like you forgot to add the Classifier class to common. I think instead of putting it under the Common package, it probably should go at the top of the classifier package. Also, not sure which version of commons-lang you are using, so either upload that jar here, or just let me know which one you are using.

        -Grant

        Grant Ingersoll added a comment -

        Also, you'll need to update the patch slightly as the examples have now been separated out.

        Robin Anil added a comment -

        Use the Apache commons-lang-2.4.jar on the lib path.

        Refactored as per the new core/examples build layout.

        Grant Ingersoll added a comment -

        I think we should move Classifier and Model from the common package to the classifier package. Classifier should also be an abstract class instead of an interface, for future "back-compatibility". The other question is whether the Classifier and Model classes are "generic" enough to be usable by any classifier that is implemented, or whether they are Bayes-specific. If it is the latter, then perhaps we should just create a single bayes package and put them all in there, including the CBayes classes. Not a big deal, but something to think about.

        I committed commons-lang to lib.

        Robin Anil added a comment -

        Right now the Model abstract class is not completely Bayes-specific. Certain
        things, like storing a score for each label/feature matrix entry, storing the
        sum score of a label and of a feature, storing the total sum score, and
        storing the label normalization factors, are reused in other kinds of
        classifiers as well. These could indeed be taken out and put in a base class
        which stores them on the DFS or, later, in a distributed matrix
        storage/retrieval system. The other Bayes-specific data structures could then
        be taken out and put in a BayesBaseModel class?
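
        A rough sketch of the split being discussed; the class and field names here are hypothetical, just to make the idea concrete:

          import java.util.HashMap;
          import java.util.Map;

          /** Classifier-agnostic score storage that any model implementation could reuse. */
          public abstract class BaseModel {
            // feature -> (label -> score), i.e. the label/feature matrix
            protected final Map<String, Map<String, Double>> featureLabelScores = new HashMap<String, Map<String, Double>>();
            protected final Map<String, Double> labelSums = new HashMap<String, Double>();        // sum score per label
            protected final Map<String, Double> featureSums = new HashMap<String, Double>();      // sum score per feature
            protected final Map<String, Double> labelNormalizers = new HashMap<String, Double>(); // label normalization factors
            protected double totalSum;                                                            // total sum score

            /** How a particular classifier turns the stored scores into a per-label weight. */
            public abstract double weight(String label, String feature);
          }

          // The Bayes/CBayes-specific data structures would then live in a subclass,
          // e.g. public class BayesBaseModel extends BaseModel { ... }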

        Grant Ingersoll added a comment -

        Sounds reasonable; besides, we are only on 0.1, so we needn't fret over it too much yet. If you can finish it up, then I can do a final review and commit.

        Robin Anil added a comment -

        Added some tests for the Model and some code tidy-ups. No functionality change.

        Grant Ingersoll added a comment -

        I'm getting failures in BayesFileFormatterTest, namely due to the change to \t, which is an easy fix. However, I wonder why the check against the "seen" CharArraySet was removed? I seem to recall that we only want unique words for training, otherwise the calculations get screwed up, at least in the NB implementation (not sure what you want in CNB).

        The loop used to look like:

        while ((token = ts.next(token)) != null) {
              char[] termBuffer = token.termBuffer();
              int termLen = token.termLength();
              if (seen.contains(termBuffer, 0, termLen) == false) {
                if (numTokens > 0) {
                  writer.write(' ');
                }
                writer.write(termBuffer, 0, termLen);
                char [] tmp = new char[termLen];
                System.arraycopy(termBuffer, 0, tmp, 0, termLen);
                seen.add(tmp);//do this b/c CharArraySet doesn't allow offsets
              }
        
        Robin Anil added a comment -

        I am generating the bigrams, so if you keep only unique words the bigrams
        don't get generated correctly.
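
        For context, a minimal sketch of sliding-window n-gram generation over the raw token sequence (illustrative only, not the patch's tokenizer):

          import java.util.ArrayList;
          import java.util.List;

          public class NGrams {
            /** Returns all n-grams of the given size, with the tokens joined by '_'. */
            public static List<String> ngrams(List<String> tokens, int n) {
              List<String> grams = new ArrayList<String>();
              for (int i = 0; i + n <= tokens.size(); i++) {
                StringBuilder gram = new StringBuilder(tokens.get(i));
                for (int j = 1; j < n; j++) {
                  gram.append('_').append(tokens.get(i + j));
                }
                grams.add(gram.toString());
              }
              return grams;
            }
          }

        For example, for the tokens "to be or not to be", de-duplicating to "to be or not" first would lose the bigram not_to and the second occurrence of to_be.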

        Grant Ingersoll added a comment -

        Do we always want to use ngrams, though? If n == 1, do we have a way of filtering out duplicates? Seems like even if n > 1, you could still have duplicates. Not sure how this is supposed to be handled, will have to look into the code more.

        Grant Ingersoll added a comment -

        OK, I committed this. I think we can leave this open for some more patches. I'd also like to see some more docs on the interplay between the various drivers, although it seems like some of them should just be package protected if they are not intended for use by the public.

        Grant Ingersoll added a comment -

        Hey Robin,

        On the country.txt, where did that come from? Is it something that can be checked in?

        Grant Ingersoll added a comment -

        Never mind, it's just a list of countries. If that isn't public domain, I don't know what is.


          People

          • Assignee:
            Grant Ingersoll
          • Reporter:
            Robin Anil
          • Votes: 0
          • Watchers: 0
