Details

    • Type: Task
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.1
    • Component/s: None
    • Labels: None

      Description

      The goal of this task is to allow Watchmaker-defined problems to be solved in Mahout.

      Attachments

      1. libs.zip
        833 kB
        Deneche A. Hakim
      2. libs.zip
        499 kB
        Deneche A. Hakim
      3. libs.zip
        531 kB
        Deneche A. Hakim
      4. tsp-screenshot-1.jpg
        55 kB
        Deneche A. Hakim
      5. watchmaker-tsp.patch
        138 kB
        Deneche A. Hakim
      6. watchmaker-tsp.patch
        67 kB
        Deneche A. Hakim
      7. watchmaker-tsp.patch
        64 kB
        Deneche A. Hakim
      8. watchmaker-tsp.patch
        50 kB
        Deneche A. Hakim
      9. watchmaker-tsp.patch
        338 kB
        Deneche A. Hakim
      10. watchmaker-tsp.patch
        332 kB
        Deneche A. Hakim
      11. watchmaker-tsp.patch
        382 kB
        Grant Ingersoll
      12. watchmaker-tsp.patch
        305 kB
        Deneche A. Hakim
      13. watchmaker-tsp.patch
        440 kB
        Deneche A. Hakim
      14. watchmaker-tsp.patch
        407 kB
        Deneche A. Hakim
      15. watchmaker-tsp.patch
        404 kB
        Deneche A. Hakim
      16. watchmaker-tsp.patch
        402 kB
        Deneche A. Hakim
      17. watchmaker-tsp.patch
        763 kB
        Deneche A. Hakim
      18. watchmaker-tsp.patch
        522 kB
        Deneche A. Hakim
      19. watchmaker-tsp.patch
        102 kB
        Deneche A. Hakim
      20. watchmaker-tsp.patch
        80 kB
        Deneche A. Hakim
      21. watchmaker-tsp.patch
        53 kB
        Deneche A. Hakim

        Activity

        Grant Ingersoll added a comment -

        Going to close this one; we can open new issues as they arise.

        Deneche A. Hakim added a comment -

        I added a small (tiny) tutorial to the wiki. And I don't remember when, but I think I accidentally removed some lines from the NOTICE.TXT file, so it would be great if a committer could add them back:

        This product includes software developed by the Indiana University
          Extreme! Lab (http://www.extreme.indiana.edu/).
        
        This product includes example code from the Watchmaker project
          https://watchmaker.dev.java.net/
        
        Deneche A. Hakim added a comment -

        This patch should now work fine. I added the wdbc dataset and modified the tests to look in the correct directory. I also corrected CDGA; it should now run with the following command:

        $ ~/hadoop-0.17.0/bin/hadoop jar apache-mahout-examples-0.1-dev.job org.apache.mahout.ga.watchmaker.cd.CDGA wdbc 1 0.9 1 0.033 0.1 0 100 10
        
        Deneche A. Hakim added a comment -

        Also, I thought we had the wdbc dataset somewhere, but now the example above doesn't work for me for the class discovery.

        wdbc was in test/resources, and now it should be in examples/test/resources. CDGA does not work anymore because the code in the repository is weird! Some of the code is not the latest version from the patch!

        I am verifying all my code and should soon post a corrected patch. In the meantime, the following command should run CDGA:

        $ hadoop-0.17.1/bin/hadoop jar apache-mahout-examples-0.1-dev.job org.apache.mahout.ga.watchmaker.cd.CDGA wdbc 0.9 1 0.033 0.1 0 100 10
        
        Grant Ingersoll added a comment -

        I'm going to commit and then move the core/test/examples over to examples/...

        Also, I thought we had the wdbc dataset somewhere, but now the example above doesn't work for me for the class discovery.

        Deneche A. Hakim added a comment -

        I tested CDGA in pseudo-distributed mode (a single PC) and discovered that I forgot to pass the dataset to the mappers. Well, it's done now, and it works in pseudo-distributed mode.

        Deneche A. Hakim added a comment -

        I made a relatively small modification to CDGA that allows it to cope with multi-class classification. You can now give it a target class, and it will (try to) discover the classification rule for that class. If you have N classes, just run it N times with a different target each time.

        This modification allowed me to run CDGA over the KDD dataset, but it's very slow. It takes more than 8 minutes to do a single iteration for one target over the 10% dataset (I didn't have the courage to run it over the whole dataset). At least now I have a good dataset to test on a cluster.

        The target class (the index of the value for the LABEL in the info file) is specified just after the dataset name. The following example runs CDGA over the WDBC dataset with target 1:

        $ ~/hadoop-0.17.0/bin/hadoop jar apache-mahout-0.1-dev-ex.jar org.apache.mahout.ga.watchmaker.cd.CDGA wdbc 1 0.9 1 0.033 0.1 0 100 10
        

        This is the last week of GSoC, so if you have any suggestions about the tests, the comments, or the code, I think now is the time for them.

        Deneche A. Hakim added a comment -

        Changes

        • org.apache.mahout.ga.watchmaker.MahoutEvaluator removes any existing input directory before storing the population
        • org.apache.mahout.ga.watchmaker.cd.FileInfosParser uses the CATEGORICAL token for symbolic (nominal) attributes. This makes it easy to identify a token using the first character.
        • org.apache.mahout.ga.watchmaker.cd.tool.CDInfosTool is used to generate the .infos file needed by CDGA for a new dataset.

        The new tool works as follows:

        • It is invoked using the following command (the dataset path is given as a parameter):
        $ ~/hadoop-0.17.0/bin/hadoop jar apache-mahout-0.1-dev-ex.jar org.apache.mahout.ga.watchmaker.cd.tool.CDInfosTool dataset_path
        
        • The tool searches for an existing infos file, in the same directory as the dataset, with the same name and the ".infos" extension, that contains the types of the attributes, one attribute per line (a hypothetical example follows this list):
          • 'N' numerical attribute
          • 'C' categorical attribute
          • 'L' label (this is also a categorical attribute)
          • 'I' to ignore the attribute
        • The tool uses a Hadoop job to parse the dataset and collect the information
        • The results are written back to the same .infos file, in a format compatible with CDGA
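
        For illustration, a hand-written .infos file for a hypothetical five-attribute dataset (an ID to ignore, two numerical measurements, one categorical attribute, and the label) would contain:

        I
        N
        N
        C
        L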

        For example, this is the .infos file generated for the KDDCup (1999) 10% training dataset:

        kddcup.data_10_percent.infos

        NUMERICAL, 0.0,58329.0
        CATEGORICAL, icmp,udp,tcp
        CATEGORICAL, rje,login,time,systat,ntp_u,mtp,uucp_path,bgp,nntp,efs,Z39_50,csnet_ns,tim_i,X11,telnet,ftp_data,finger,other,exec,uucp,netstat,klogin,ecr_i,remote_job,urh_i,netbios_dgm,pop_2,auth,private,shell,printer,kshell,urp_i,vmnet,pop_3,echo,daytime,iso_tsap,courier,tftp_u,sunrpc,red_i,ctf,supdup,gopher,ssh,sql_net,name,smtp,hostnames,netbios_ssn,ftp,IRC,imap4,netbios_ns,http,ldap,eco_i,link,http_443,domain_u,discard,nnsp,pm_dump,domain,whois
        CATEGORICAL, S2,SF,OTH,S0,S3,RSTR,RSTO,SH,S1,RSTOS0,REJ
        NUMERICAL, 0.0,6.9337562E8
        NUMERICAL, 0.0,5155468.0
        CATEGORICAL, 0,1
        NUMERICAL, 0.0,3.0
        NUMERICAL, 0.0,3.0
        NUMERICAL, 0.0,30.0
        NUMERICAL, 0.0,5.0
        CATEGORICAL, 0,1
        NUMERICAL, 0.0,884.0
        NUMERICAL, 0.0,1.0
        NUMERICAL, 0.0,2.0
        NUMERICAL, 0.0,993.0
        NUMERICAL, 0.0,28.0
        NUMERICAL, 0.0,2.0
        NUMERICAL, 0.0,8.0
        NUMERICAL, 0.0,1.4E-45
        CATEGORICAL, 0
        CATEGORICAL, 0,1
        NUMERICAL, 0.0,511.0
        NUMERICAL, 0.0,511.0
        NUMERICAL, 0.0,1.0
        NUMERICAL, 0.0,1.0
        NUMERICAL, 0.0,1.0
        NUMERICAL, 0.0,1.0
        NUMERICAL, 0.0,1.0
        NUMERICAL, 0.0,1.0
        NUMERICAL, 0.0,1.0
        NUMERICAL, 0.0,255.0
        NUMERICAL, 0.0,255.0
        NUMERICAL, 0.0,1.0
        NUMERICAL, 0.0,1.0
        NUMERICAL, 0.0,1.0
        NUMERICAL, 0.0,1.0
        NUMERICAL, 0.0,1.0
        NUMERICAL, 0.0,1.0
        NUMERICAL, 0.0,1.0
        NUMERICAL, 0.0,1.0
        LABEL, teardrop.,ipsweep.,phf.,nmap.,land.,portsweep.,warezmaster.,smurf.,guess_passwd.,ftp_write.,perl.,loadmodule.,back.,imap.,normal.,pod.,spy.,neptune.,satan.,buffer_overflow.,rootkit.,warezclient.,multihop.

        What's Next

        • I think I found a quick workaround to allow CDGA to handle multi-class classification; I should implement it and try it on the KDD dataset
        • Run the code on a small cluster and hope that it will work
        Deneche A. Hakim added a comment -

        Committed revision 681327.

        Cool, now the patches should be easier to create.

        Let's open up bugs/issues off of this, or add to this one if needed.

        It'll be easier for me if the bugs/issues are added here. I should myself add some known open issues soon.

        Grant Ingersoll added a comment -

        Committed revision 681327.

        Let's open up bugs/issues off of this, or add to this one if needed. I think the ARFF support should be done separately. Deneche, do you want to add an issue for that?

        Ted Dunning added a comment -


        Yes.

        Deneche A. Hakim added a comment -

        After many attempts to load all the information you gave me into my brain-processing-cluster-that-doesn't-work-quite-well, let's see if I understand it correctly:

        The algorithm handles any dataset in matrix format, where (in my case) the columns are the attributes (one of them being the label) and the rows are the data instances.

        Working with Hadoop, we'll need to pass the dataset as the mappers' input, so it must be a file (or many files). We'll then need a custom InputFormat to feed the mappers with the data, and here comes the lovely-named "row-wise splitting matrix input format".

        Now we want to be able to work with any given dataset file format (including ARFF and my custom format), and thus the InputFormat needs a decoder that converts the dataset lines into matrix rows.
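
        A minimal sketch of what such a pluggable decoder hook might look like with the old mapred API; the RowDecoder interface and the "matrix.decoder.class" key are hypothetical names for illustration, not an existing Mahout or Hadoop API:

        import org.apache.hadoop.mapred.JobConf;
        import org.apache.hadoop.util.ReflectionUtils;

        /** Hypothetical hook: converts one line of the dataset file into a matrix row. */
        interface RowDecoder {
          double[] decode(String line);
        }

        class RowDecoders {
          /** The InputFormat would load whatever decoder class the job configuration names. */
          static RowDecoder newDecoder(JobConf conf) {
            Class<? extends RowDecoder> clazz =
                conf.getClass("matrix.decoder.class", null, RowDecoder.class);
            return ReflectionUtils.newInstance(clazz, conf);
          }
        }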

        Ted Dunning added a comment -

        I spoke poorly.

        In-memory is a misnomer.

        It should be possible to have a large ARFF dataset in HDFS to be used as input, as well as a large dataset in your format.

        However you decide to read your data in, it should be usable by others. Likewise, symmetrically, with the ARFF input.

        How that works should depend a little on your data. My feeling is that we will need something like a "row-wise splitting matrix input format" that sends groups of rows of a matrix to different mappers. This input format should accept a configuration argument which is the class to be used to actually decode the format.

        It will probably happen that not all algorithms will be quite so happy with this, especially the groups of rows part. They may want all mappers to see the entire data set (if the data set is, say, a set of population members rather than real data). They may want the mappers to have some row-wise map input, but have some side data that is read without using an input format.

        You are really one of the first to define a real user story for this so you should feel free to define what you need in the context of what you think others might be able to use as well.

        Deneche A. Hakim added a comment -

        The whole point of the CDGA example is to show how to use Mahout to run a genetic algorithm on a very large dataset, because this is what MapReduce is about: large, distributed data.

        Now, it won't harm my program to be able to work with in-memory datasets, and I'll be more than happy to implement (a) as soon as there is a "stable" solution for (b) and (c).

        I have one question about in-memory datasets: how do we pass them to the mappers? We can't use the job input if the dataset is in memory, so I assume it is passed as a job parameter, right?
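
        For what it's worth, passing a small serialized object through the job configuration looks roughly like this with the old mapred API; this is a sketch assuming a StringUtils-style string serializer like the one in this patch, and the "cdga.dataset" key is made up:

        // Job setup: serialize the in-memory dataset into the configuration.
        JobConf conf = new JobConf(CDGA.class);
        conf.set("cdga.dataset", StringUtils.toString(dataset));

        // Mapper side: rebuild the dataset in configure().
        public void configure(JobConf job) {
          dataset = (Dataset) StringUtils.fromString(job.get("cdga.dataset"));
        }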

        Ted Dunning added a comment -

        Actually, I think what we need are three things:

        a) your program that should work from in-memory data sets (probably labeled matrices of some kind).

        b) we need a matrix reader of the kind you propose

        c) we need an arff matrix reader.

        I think that there is a jira around that could cover b & c.

        Deneche A. Hakim added a comment -

        In fact the info file is inspired by ARFF. The main differences are:

        • I need to be able to ignore some attributes (for example: ID)
        • I need to store the min and max values for the numerical attributes.
        • If the dataset is not in the ARFF format, I just need to generate its info file; I think that's much more efficient than converting it to the ARFF format (I am talking here about very large datasets)

        And you're right about the Weka and RapidMiner compatibility, so I'll add to my to-do list: support the ARFF dataset format.

        Ted Dunning added a comment -


        R handles ARFF as well.

        Andrew Purtell added a comment -

        Was Weka's ARFF insufficient? Please see http://weka.sourceforge.net/wekadoc/index.php/en:ARFF_(3.5.1) . Just a suggestion from a potential Mahout user, but ARFF is a de-facto standard in some ML circles, and being able to move from Weka or Rapidminer to Mahout and back, depending on the scale, would be highly advantageous.

        Deneche A. Hakim added a comment -

        What's new
        CDGA should be able to cope with any given dataset (with a certain file format, of course). It uses a special format file that contains enough information about the dataset. This file (called the info file) has the following format:
        each attribute is described by a corresponding line in the info file, which can be one of the following:

        • IGNORED
          if the attribute is ignored
        • LABEL val1, val2,...
          if the attribute is the label (class), and its possible values
        • NOMINAL val1, val2,...
          if the attribute is nominal (categorical), and its possible values
        • NUMERICAL min, max
          if the attribute is numerical, and its min and max values

        For now I generated the info file manually for the WDBC dataset. The info file should be in the same parent directory as the input, with the same name as the input directory followed by ".infos". For example, for a dataset

        build/examples/wdbc/

        the info file should be

        build/examples/wdbc.infos
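
        A minimal sketch of how one such line could be parsed; the Attribute holder class and its factory methods are hypothetical (the patch's real parser is FileInfosParser):

        /** Parses one attribute description line of the info file. */
        static Attribute parseLine(String line) {
          String[] tokens = line.trim().split("[, ]+");
          if (tokens[0].equals("IGNORED"))
            return Attribute.ignored();
          if (tokens[0].equals("NUMERICAL"))
            return Attribute.numerical(Double.parseDouble(tokens[1]),
                                       Double.parseDouble(tokens[2]));
          // LABEL and NOMINAL both carry the list of possible values.
          String[] values = new String[tokens.length - 1];
          System.arraycopy(tokens, 1, values, 0, values.length);
          return Attribute.categorical(tokens[0].equals("LABEL"), values);
        }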

        What's next

        • A MapReduce program to automatically generate the info file from any given dataset.
        • Run CDGA with other datasets
        • Multi-class classification
        Deneche A. Hakim added a comment -

        Updated dependencies.

        Deneche A. Hakim added a comment -

        What's new

        • Fixed some bugs that were well hidden in DummyOutputCollector and CDMutation (why are the bugs always hidden!); the latter's unit test has been improved to catch the bug if it manages to come back
        • The ClassDiscovery example should be able to handle categorical attributes now, but I still need to add a tool that generates dataset information from any given dataset.
        • The Travelling Salesman comments have been cleaned up, and a reference to the Watchmaker project has been added to the comments in place of the @author tag. I also added a readme.txt that describes where to look for the changes in the original code.

        What's next

        • A generic MapReduce program to generate dataset information from the dataset itself.
        • Multi-class classification
        Deneche A. Hakim added a comment -

        Deneche, I think you need to clean up the examples that refer to Daniel Dyer. I'm assuming this is a watchmaker example that you modified. I believe the way to handle this is to mark it as ASL and somehow link to where you got the code from. It is already ASL to begin with, but the copyright is Daniel Dyer. You probably should also put a reference in NOTICES.txt that some of the code was developed by Daniel.

        OK, this should be available in the next patch.

        Otherwise, looks pretty good. I'm no GA expert, but I like the TSP GUI! Would be interested in seeing some performance numbers as you distribute this out over multiple nodes, but that is not a requirement for committing.

        This is a very good idea, but it needs a larger TSP problem (should be able to find one), and a cluster. I'll definitely try it.

        Grant Ingersoll added a comment -

        Added ASL where needed.

        Moved StringUtils to utils package.

        Deneche, I think you need to clean up the examples that refer to Daniel Dyer. I'm assuming this is a watchmaker example that you modified. I believe the way to handle this is to mark it as ASL and somehow link to where you got the code from. It is already ASL to begin with, but the copyright is Daniel Dyer. You probably should also put a reference in NOTICES.txt that some of the code was developed by Daniel.

        Otherwise, looks pretty good. I'm no GA expert, but I like the TSP GUI! Would be interested in seeing some performance numbers as you distribute this out over multiple nodes, but that is not a requirement for committing.

        Deneche A. Hakim added a comment -

        TravellingSalesman example GUI

        Deneche A. Hakim added a comment -

        updated "zipped" dependencies

        Deneche A. Hakim added a comment -

        updated dependencies

        Deneche A. Hakim added a comment -

        Changes

        • Simplified the tests by using dummy classes instead of TSP and Sudoku
        • Added a complete Watchmaker example related to TSP; this example comes with Watchmaker, and I made some modifications that let the user choose how the result will be calculated (standalone or distributed)
        • No more need for watchmaker-examples-0.4.3.jar; the examples now need the following library: watchmaker-swing-0.4.3.jar (the new libs.zip contains the required libraries and their license files)
        • You can run the TravellingSalesman example, after generating the examples job, using the following command:
         
        <hadoop-0.17.0_HOME>/bin/hadoop jar <mahout_HOME>/core/build/apache-mahout-0.1-dev-ex.jar org.apache.mahout.ga.watchmaker.travellingsalesman.TravellingSalesman
        

        Make sure to check the "distributed" option to solve the problem using mahout.ga.

        Deneche A. Hakim added a comment -

        Added comments to the new classes. The CDGA comment describes the meaning of the parameters for the program.

        Deneche A. Hakim added a comment -

        I moved the class discovery code (org.apache.mahout.ga.watchmaker.cd) to the examples directory, until I figure out how to make it more generic.

        I made some changes to build.xml:

        • ant compile-examples will now compile all the code in src/main/examples; note that you'll need the ejb.jar library in order to compile the cf.taste.ejb example
        • ant examples-test will launch all the tests in the src/test/examples directory. It will allow us to add unit tests for the examples

        You can run the CDGA algorithm, after generating the examples job, using the following command:

        <hadoop-0.17.0_HOME>/bin/hadoop jar <mahout_HOME>/core/build/apache-mahout-0.1-dev-ex.jar org.apache.mahout.ga.watchmaker.cd.CDGA <mahout_HOME>/core/src/main/resources/wdbc/ 0.9 1 0.033 0.1 0 100 10
        

        I will explain later what all those parameters mean...

        Deneche A. Hakim added a comment -

        Phew! I found the bug; it was hidden in CDFitness and caused the GA to return weird solutions.

        Deneche A. Hakim added a comment -

        This patch should work (I tried it).

        It also contains DatasetTextOutputFormat, a TextOutputFormat that allows the input to be split into two disjoint subsets (training and testing).

        The main algorithm, CDGA, contains a bug somewhere, because the results are weird... guess I know what I have to do for the next few days (apart from hitting the keyboard with my head).
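
        The splitting idea itself is simple; here is a minimal sketch, independent of the actual DatasetTextOutputFormat code in the patch (the seed and threshold handling are assumptions):

        import java.util.Random;

        /** Routes each record to either the training or the testing subset. */
        class DatasetSplitter {
          private final Random rng;
          private final double trainingFraction;

          DatasetSplitter(long seed, double trainingFraction) {
            this.rng = new Random(seed);
            this.trainingFraction = trainingFraction;
          }

          /** True if the next record should go to the training set. */
          boolean isTraining() {
            return rng.nextDouble() < trainingFraction;
          }
        }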

        Deneche A. Hakim added a comment -

        This zip file contains the additional libraries.

        Deneche A. Hakim added a comment -

        Seems the latest patch doesn't apply all that well. Seems I'm getting double entries of each class in the same file.

        Yeah, for me too! It seems I was using a rather old version of TortoiseSVN; I've updated now and should provide a working patch soon.

        Grant Ingersoll added a comment -

        Hi Deneche,

        Seems the latest patch doesn't apply all that well. Seems I'm getting double entries of each class in the same file.

        From the top directory, do:
        svn status
        svn diff > watchmaker-tsp.patch

        Also, no need for the "gsoc" package. This is full-fledged goodness, no need to qualify. I'd suggest something like org.apache.mahout.genetic.watchmaker or org.apache.mahout.ga.watchmaker would be good.

        Also, if you can zip up the required libraries and attach them, that would save a few trips to track them down.

        Thanks,
        Grant

        Deneche A. Hakim added a comment -

        classdiscovery.ga.CDGA (the main tool) now accepts command-line parameters

        Deneche A. Hakim added a comment - edited

        What's new
        . Class discovery: based on the paper Discovering Comprehensible Classification Rules using Genetic Programming, a genetic algorithm that searches for the best binary classification rule for a given dataset. The population, which is a list of possible rules, is passed to each mapper, which handles a subset of the dataset. All the new stuff is in the package:

        org.apache.mahout.gsoc.watchmaker.classdiscovery

        . I refactored some classes from the previous patch to reuse the existing code. The main change is the class STEvolutionEngine<T>, which uses a single thread, and the corresponding STFitnessEvaluator<T>. More details will be added to the comments.

        . I added the EasyMock library needed to run the tests

        What needs to be done
        The following steps need to be done before considering this patch complete:
        . classdiscovery.ga.CDGA (the main tool) needs to become a fully functional command-line tool
        . for now CDGA uses the whole dataset for training; it should split it into a training set and a testing set
        . because classdiscovery is not generic (at least for now), I should move it to the examples along with its corresponding tests
        . clean up the comments
        . there is no need to test the code against both TSP and Sudoku; I should remove the Sudoku test to make the tests more comprehensible
        . pass the population using the DistributedCache instead of a job parameter (a sketch of that idea follows)
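
        A minimal sketch of the DistributedCache idea with the old mapred API; the file name and helper shape are assumptions for illustration, not the patch's actual code:

        import java.io.IOException;
        import java.net.URI;
        import org.apache.hadoop.filecache.DistributedCache;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.mapred.JobConf;

        class PopulationCache {
          /** Job setup: ship the serialized population file to every task node. */
          static void addPopulation(JobConf conf, URI populationFile) {
            DistributedCache.addCacheFile(populationFile, conf);
          }

          /** Mapper side (from configure()): locate the node-local copy of the file. */
          static Path localPopulation(JobConf conf) throws IOException {
            return DistributedCache.getLocalCacheFiles(conf)[0];
          }
        }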

        Deneche A. Hakim added a comment -

        I modified the NOTICE.TXT file to conform to the xpp3 license.

        I also added more tests using another Watchmaker example (a Sudoku solver) along with TSP. We probably don't need them both, but more tests are always welcome.

        I should post soon into the dev-list to talk about the next possible steps...

        Grant Ingersoll added a comment -

        "This product includes software developed by the Indiana University
        Extreme! Lab (http://www.extreme.indiana.edu/)."

        This typically goes in NOTICE.txt in the root directory. Feel free to add it, we will clean it up before release.

        I hope to look at the rest of this soon, but others should too.

        Deneche A. Hakim added a comment -

        Description of the changes
        I made the code problem-independent, and so I changed the class names to remove any reference to TSP.
        The classes are (a sketch of the StringUtils round trip follows the list):
        . StringUtils: inspired by the future Stringifier of Hadoop. Translates any given object (even a non-Serializable one)
        to a one-line XML representation, and vice versa.
        . MahoutEvolutionEngine: a generic distributed genetic algorithm engine. The constructor now takes a FitnessEvaluator
        that takes care of evaluating every candidate.
        . MahoutEvaluator: evaluates a population of individuals using a given FitnessEvaluator. Uses StringUtils to store
        the population in an input file, and the FitnessEvaluator in the JobConf.
        . EvalMapper: a Mapper that evaluates a candidate using the FitnessEvaluator passed in the JobConf.
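
        The core of such a string serializer is a plain XStream round trip; a minimal sketch (the real StringUtils in this patch additionally collapses the XML to a single line):

        import com.thoughtworks.xstream.XStream;

        /** Serializes any object to XML text and back, without requiring Serializable. */
        class ObjectStrings {
          private static final XStream XSTREAM = new XStream();

          static String toXml(Object o) {
            return XSTREAM.toXML(o);
          }

          static Object fromXml(String xml) {
            return XSTREAM.fromXML(xml);
          }
        }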

        Note that we no longer need watchmaker-examples to build the code, but we still need it in the tests to
        compare this code with the reference implementation.

        Needed libraries
        You'll need the XStream library (http://xstream.codehaus.org/); I used version 1.2.1. Add the following jars to core/lib:

        xpp3_min-*.jar
        xstream-*.jar

        I also included the licenses for all the libraries that I added. And if we plan to use xpp3_min-*.jar, we need
        to include the following lines somewhere in the Mahout documentation or in the software:

        "This product includes software developed by the Indiana University
        Extreme! Lab (http://www.extreme.indiana.edu/)."

        Next steps
        There is another Watchmaker example that I want to test with this new code, just to confirm that
        the integration is fine. Then we can talk on the mailing list about the next move, which could be one of the following:
        . meta-mutations
        . for now I'm assuming that each node contains the whole dataset needed to evaluate a candidate. But if the dataset is large enough
        to span multiple nodes, the user should have the possibility of writing the evaluation function in terms of mappers and reducers
        . ...any suggestions?

        Ted Dunning added a comment -

        What is the license on watchmaker?

        What about the other jars (uncommon-maths and uncommon-utils)?

        Deneche A. Hakim added a comment -

        I started with the Traveling Salesman Problem (TSP) because the reference implementation already exists within Watchmaker.

        You'll need to add the following jars to Mahout/core/lib/:

        watchmaker-framework-0.4.3.jar
        watchmaker-examples-0.4.3.jar (contains reference implementation of the TSP)
        uncommons-maths-1.0.2.jar
        uncommons-utils.jar

        They are all available with Watchmaker 0.4.3: https://watchmaker.dev.java.net/

        I also included some unit tests that should pass without problem.

        The code contains the following four classes (a sketch of the mapper's general shape follows the list):
        . RouteEvalMapper: a Hadoop mapper that evaluates the fitness of one candidate solution (GA individual)
        . MahoutRouteEvaluator: takes a GA population as input and launches a Hadoop job to evaluate the fitness of each individual,
        then returns the results. Takes care of storing the population in an input file, and loading the fitnesses from the job outputs
        . MahoutTspEvolutionEngine: distributed implementation of the evolution engine that uses MahoutRouteEvaluator for the evaluations
        . PopulationUtils: utility class to store the population in a given FileSystem
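
        For readers new to the old mapred API, a fitness-evaluating mapper of this kind looks roughly like the following; the candidate decoding and the fitness computation are placeholders, not the patch's actual code:

        import java.io.IOException;
        import org.apache.hadoop.io.DoubleWritable;
        import org.apache.hadoop.io.LongWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapred.MapReduceBase;
        import org.apache.hadoop.mapred.Mapper;
        import org.apache.hadoop.mapred.OutputCollector;
        import org.apache.hadoop.mapred.Reporter;

        /** Each input line is one candidate; the output is its fitness keyed by input offset. */
        class FitnessMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, LongWritable, DoubleWritable> {

          public void map(LongWritable key, Text candidate,
                          OutputCollector<LongWritable, DoubleWritable> output,
                          Reporter reporter) throws IOException {
            double fitness = evaluate(candidate.toString()); // placeholder evaluation
            output.collect(key, new DoubleWritable(fitness));
          }

          private double evaluate(String candidate) {
            return candidate.length(); // stand-in for the real route-length computation
          }
        }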

        This is the simplest possible implementation; the next steps are:
        . Use serialization to store/load any kind of individuals, not only List<String>
        . Use serialization to pass any possible FitnessEvaluator, so we can use MahoutEvolutionEngine for other problems
        . and, as suggested by Ted: use meta-mutation (but I think it will be a separate task)


          People

          • Assignee:
            Grant Ingersoll
            Reporter:
            Deneche A. Hakim
          • Votes: 0
            Watchers: 0

            Dates

            • Created:
              Updated:
              Resolved:
