Mahout
  MAHOUT-588

Benchmark Mahout's clustering performance on EC2 and publish the results

    Details

    • Type: Task
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.5
    • Fix Version/s: 0.5
    • Component/s: None
    • Labels: None

      Description

      For Taming Text, I've commissioned some benchmarking work on Mahout's clustering algorithms. I've asked the two people doing the project to do all the work in the open here. The goal is to use a publicly reusable dataset (for now, the ASF mail archives, assuming it is big enough), run on EC2, and make all resources available so others can reproduce and improve on the results.

      I'd like to add the setup code to utils (although it could possibly be done as a Vectorizer), and the results will be published on the Wiki as well as in the book. This issue is to track the patches, etc.

      1. prep_asf_mail_archives.sh
        4 kB
        Timothy Potter
      2. MAHOUT-588.patch
        35 kB
        Timothy Potter
      3. SequenceFilesFromMailArchivesTest.java
        7 kB
        Timothy Potter
      4. MailArchivesClusteringAnalyzerTest.java
        2 kB
        Timothy Potter
      5. SequenceFilesFromMailArchives.java
        12 kB
        Timothy Potter
      6. MailArchivesClusteringAnalyzer.java
        8 kB
        Timothy Potter
      7. mahout-588_canopy.pdf
        161 kB
        Szymon Chojnacki
      8. ec2_setup_notes_v2.txt
        6 kB
        Timothy Potter
      9. prep_asf_mail_archives.sh
        3 kB
        Timothy Potter
      10. ec2_setup_notes_v2.txt
        6 kB
        Timothy Potter
      11. prep_asf_mail_archives.sh
        3 kB
        Timothy Potter
      12. mahout-588_distribution.pdf
        311 kB
        Szymon Chojnacki
      13. TamingSubsetMapper.java
        0.9 kB
        Szymon Chojnacki
      14. TamingSubset.java
        2 kB
        Szymon Chojnacki
      15. 60_clusters_kmeans_10_iterations_100K_coordinates.txt
        7 kB
        Szymon Chojnacki
      16. TamingAnalyzer.java
        2 kB
        Timothy Potter
      17. TamingAnalyzerTest.java
        1 kB
        Timothy Potter
      18. ec2_setup_notes.txt
        6 kB
        Timothy Potter
      19. clusters1.txt
        203 kB
        Szymon Chojnacki
      20. TamingTFIDF.java
        0.9 kB
        Szymon Chojnacki
      21. TamingCollocMapper.java
        7 kB
        Szymon Chojnacki
      22. TamingDictionaryVectorizer.java
        14 kB
        Szymon Chojnacki
      23. TamingGramKeyGroupComparator.java
        0.7 kB
        Szymon Chojnacki
      24. TamingCollocDriver.java
        10 kB
        Szymon Chojnacki
      25. TamingDictVect.java
        1 kB
        Szymon Chojnacki
      26. TamingAnalyzer.java
        3 kB
        Szymon Chojnacki
      27. TamingTokenizer.java
        0.8 kB
        Szymon Chojnacki
      28. clusters_kMeans.txt
        11 kB
        Szymon Chojnacki
      29. Top1000Tokens_maybe_stopWords
        14 kB
        Szymon Chojnacki
      30. distcp_large_to_s3_failed.log
        47 kB
        Timothy Potter
      31. seq2sparse_small_failed.log
        118 kB
        Timothy Potter
      32. seq2sparse_xlarge_ok.log
        230 kB
        Timothy Potter
      33. SequenceFilesFromMailArchives2.java
        10 kB
        Szymon Chojnacki
      34. Uncompress.java
        4 kB
        Szymon Chojnacki
      35. SequenceFilesFromMailArchives.java
        12 kB
        Timothy Potter

        Issue Links

          Activity

          Timothy Potter added a comment -

          While working on this issue with Grant, I found a problem with using seq2sparse in EMR; see MAHOUT-598.

          Timothy Potter added a comment -

          The SequenceFilesFromMailArchives is based on SequenceFilesFromDirectory, but adds block compression to the SequenceFile.Writer and parses individual mail messages using simple regex patterns. I used ^From \S+@\S.*\d{4}$ for my message boundary pattern.
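
          As a rough illustration (not the attached code; the sample mbox lines below are made up), that boundary pattern can be compiled with java.util.regex like this:

          import java.util.regex.Matcher;
          import java.util.regex.Pattern;

          // Minimal sketch: count messages in an mbox-style string using the
          // boundary pattern above. MULTILINE lets ^ and $ match at line breaks.
          public class MailBoundaryDemo {
            private static final Pattern BOUNDARY =
                Pattern.compile("^From \\S+@\\S.*\\d{4}$", Pattern.MULTILINE);

            public static void main(String[] args) {
              String mbox = "From dev@example.org Mon Jan 03 10:15:00 2011\n"
                  + "Subject: first\n\nbody one\n"
                  + "From user@example.org Tue Jan 04 11:20:00 2011\n"
                  + "Subject: second\n\nbody two\n";
              Matcher m = BOUNDARY.matcher(mbox);
              int count = 0;
              while (m.find()) {
                count++;
              }
              System.out.println("messages found: " + count); // prints 2
            }
          }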

          Running on the asf-mail-archives, I get 6,107,076 messages (which is slightly more than Szymon's?).

          To run this, you would need to save the class to utils/src/main/java/org/apache/mahout/text and then do something like:

          $MAHOUT_HOME/bin/mahout org.apache.mahout.text.SequenceFilesFromMailArchives \
          --input /mnt/asf-mail-archives/extracted \
          --output /mnt/asf-mail-archives/sequence-files \
          -c UTF-8 -chunk 1024 -prefix TamingText

          The chunk size is rather large because it is the raw size before compression.

          Szymon Chojnacki added a comment -

          Tim,
          Uncompress.java is a short script I run at the very beginning to place all dumps in one directory. The next step is SequenceFilesFromMailArchives.java (or, for comparison, version (2), which I will upload soon; it basically uses all mail content and the MIME headers).

          Szymon

          Szymon Chojnacki added a comment -

          Tim,
          I slightly modified your parser and started it on my machine for comparison. I understand that you really parse the messages; I did simple splitting. Hence, I preserve all MIME headers (I know that most of them will become redundant after the tf-idf calculation, but at least we will not decrease the size of the dataset). I also think that Taming Text is addressed to a general audience, so in the end we should drastically simplify the code; it is currently nicely generic, but maybe too large to interpret within a single page (so we can later consider using plain SequenceFile.createWriter without Mahout's utils and wrappers, but that is just a detail).

          Regards
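
          A minimal sketch of the "plain SequenceFile.createWriter" idea mentioned above (Hadoop 0.20-era API; the output path and the key/value content are made up, and this is not the attached patch code):

          import org.apache.hadoop.conf.Configuration;
          import org.apache.hadoop.fs.FileSystem;
          import org.apache.hadoop.fs.Path;
          import org.apache.hadoop.io.SequenceFile;
          import org.apache.hadoop.io.Text;

          // Writes one block-compressed SequenceFile entry: message id -> raw mail text.
          public class PlainSequenceFileWriterSketch {
            public static void main(String[] args) throws Exception {
              Configuration conf = new Configuration();
              FileSystem fs = FileSystem.get(conf);
              Path out = new Path("asf-mail-archives.seq"); // hypothetical output path
              SequenceFile.Writer writer = SequenceFile.createWriter(
                  fs, conf, out, Text.class, Text.class,
                  SequenceFile.CompressionType.BLOCK);
              try {
                writer.append(new Text("msg-0001"),
                    new Text("raw message text, including MIME headers ..."));
              } finally {
                writer.close();
              }
            }
          }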

          Timothy Potter added a comment -

          The vectorization process using seq2sparse is complete, and the vectors are available in my S3 bucket:

          s3://thelabdude/asf-mail-archives/vectors/

          (Note: I'll move to Grant's asf-mail-archives bucket once we have some of the clustering algorithms working, as I didn't want to move all this data around if it's not correct.)

          Here are the parameters I used to create the vectors:

          org.apache.mahout.driver.MahoutDriver seq2sparse \
          -i s3n://thelabdude/asf-mail-archives/mahout-0.4/sequence-files/ \
          -o /asf-mail-archives/mahout-0.4/vectors/ \
          --weight tfidf --chunkSize 100 --minSupport 2 \
          --minDF 1 --maxDFPercent 90 --norm 2 \
          --numReducers 31 --sequentialAccessVector

          Vectorizing the sequence files took some serious horsepower: 52 minutes on a 19-node cluster of Extra Large instances in EMR with 31 reducers. The log from the successful run is attached (seq2sparse_xlarge_ok.log). Notice that I built sequential access vectors (which I've heard may help with k-means performance).

          A few lessons learned:

          • The resulting tf-vectors or tfidf-vectors files are large (~11.5GB) so you need to have at least 3 reducers if you intend to load the vectors into S3 as the max file size is 5GB. I'm storing the vectors in S3 so that we can re-use them for multiple clustering job runs.
          • The MR job has 20 steps and benefits greatly from distributing the processing; don't try to vectorize this much data on a single node, and multiple reducers are a must!
          • After failing to get this working on a development machine, I started with a cluster of 9 m1.small instances (in Amazon EMR) and the job crashed (see attached log, seq2sparse_small_failed.log). Then I used a cluster of 13 large instances and the process completed successfully after a couple of hours, but I wasn't able to "distcp" the results to S3 (real bummer! see attached log, distcp_large_to_s3_failed.log). This may be a configuration issue with Amazon's EMR large instances, as xlarge works as expected.

          Here is the ls output for the aforementioned bucket:

          $ s3cmd ls s3://thelabdude/asf-mail-archives/vectors/

          DIR s3://thelabdude/asf-mail-archives/vectors/df-count/
          DIR s3://thelabdude/asf-mail-archives/vectors/tf-vectors/
          DIR s3://thelabdude/asf-mail-archives/vectors/tfidf-vectors/
          DIR s3://thelabdude/asf-mail-archives/vectors/tokenized-documents/
          DIR s3://thelabdude/asf-mail-archives/vectors/wordcount/
          2011-01-30 15:29 0 s3://thelabdude/asf-mail-archives/vectors/df-count_$folder$
          2011-01-30 15:30 70926210 s3://thelabdude/asf-mail-archives/vectors/dictionary.file-0
          2011-01-30 15:32 70863447 s3://thelabdude/asf-mail-archives/vectors/dictionary.file-1
          2011-01-30 15:33 70892506 s3://thelabdude/asf-mail-archives/vectors/dictionary.file-2
          2011-01-30 15:31 70877571 s3://thelabdude/asf-mail-archives/vectors/dictionary.file-3
          2011-01-30 15:32 70824816 s3://thelabdude/asf-mail-archives/vectors/dictionary.file-4
          2011-01-30 15:34 70895476 s3://thelabdude/asf-mail-archives/vectors/dictionary.file-5
          2011-01-30 15:35 40982506 s3://thelabdude/asf-mail-archives/vectors/dictionary.file-6
          2011-01-30 15:36 37160153 s3://thelabdude/asf-mail-archives/vectors/frequency.file-0
          2011-01-30 15:36 37160173 s3://thelabdude/asf-mail-archives/vectors/frequency.file-1
          2011-01-30 15:30 37160173 s3://thelabdude/asf-mail-archives/vectors/frequency.file-2
          2011-01-30 15:31 37160173 s3://thelabdude/asf-mail-archives/vectors/frequency.file-3
          2011-01-30 15:32 37160173 s3://thelabdude/asf-mail-archives/vectors/frequency.file-4
          2011-01-30 15:33 37160173 s3://thelabdude/asf-mail-archives/vectors/frequency.file-5
          2011-01-30 15:34 37160173 s3://thelabdude/asf-mail-archives/vectors/frequency.file-6
          2011-01-30 15:34 1727033 s3://thelabdude/asf-mail-archives/vectors/frequency.file-7

          Now on to running the clustering algorithms!

          Szymon plans to start on the algorithms in the following order:

          canopy -> k-means -> fuzzy -> mean-shift -> dirichlet

          I'll start on the other end and begin with dirichlet and work backwards.

          Sean Owen added a comment -

          (By the way the S3 size limit is now 5TB)

          I think this would be great to put on the wiki at some point if you're so inclined.
          https://cwiki.apache.org/MAHOUT/mahout-wiki.html

          I think you will also discover some bottlenecks that aren't yet obvious, and it would be great to share what you find about where the time is spent.

          Timothy Potter added a comment -

          Hi Sean,

          Will definitely look into updating the Wiki as I work through the process.

          Thanks for the heads-up on the 5TB limit – it might be how distcp was accessing S3 as I definitely got back a "max file size exceeded" error from S3 when trying to upload files larger than 5GB. Will do some more research to find out the exact cause ...

          Grant Ingersoll added a comment -

          For the code, I think it makes sense that this goes into $MAHOUT_HOME/utils/src/main/. Scripts can go in src/main/scripts/ec2 (or whatever). Java code can go under the src/main/java path (see the benchmark package, for instance) but it can also go in other appropriate places.

          Szymon Chojnacki added a comment -

          Finally I managed to go through all the steps (up to ClusterDisplay) in the local development environment. To avoid OutOfMemory errors I changed conf/mapred-site.xml and added:

          <property>
          <name>mapred.child.java.opts</name>
          <value>-Xmx5000M</value>
          </property>

          It enabled me to reach the end of the processing, but there were a few 600-second timeout problems and a few jobs were killed. To avoid this, I also had to add:

          <property>
          <name>mapred.task.timeout</name>
          <value>0</value>
          </property>

          I'll describe all the steps in detail (they might be useful for EC2) and switch to processing in AWS. I'll also try to figure out the minimum requirement for -Xmx.

          Szymon Chojnacki added a comment -

          seq2sparse_small_failed.log does not have detailed info on why a job did not succeed, e.g.:

          2011-01-30 01:25:55,726 INFO org.apache.hadoop.mapred.JobClient (main): Task Id : attempt_201101300001_0003_m_000159_0, Status : FAILED

          I get such info locally; maybe changing the mapred.child.java.opts parameter in conf/mapred-site.xml can help. I have:

          <property>
          <name>mapred.child.java.opts</name>
          <value>-Xmx2000M -XX:-HeapDumpOnOutOfMemoryError</value>
          </property>

          Szymon Chojnacki added a comment (edited) -

          The most common tokens are in Top1000Tokens_maybe_stopWords. I will extend the stop-word list with some of them (e.g. http, from, have, mail); the default Lucene stop words are only:

          "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if", "in", "into", "is", "it", "no", "not", "of", "on", "or", "such", "that", "the", "their", "then", "there", "these", "they", "this", "to", "was", "will", "with"

          Szymon Chojnacki added a comment (edited) -

          clusters_kMeans.txt is an overview of 20 clusters. Each cluster entry has three parts:

          1) n=23019
          2) c=[aaaaaaaaaaa:0.002, ...
          3) Top Terms: software => 16.595137777966098

          where 1) is the number of emails in the cluster, 2) gives the tf-idf values for the non-empty coordinates of the centroid (sorted alphabetically), and 3) lists the coordinates with the highest tf-idf, which can be read as the cluster's labels.

          Szymon Chojnacki added a comment -

          Collocations bottle-neck.

          As far as I can see, so far both of us did the vectorization with 1-grams. I must admit that I am afraid to calculate 2-grams or 3-grams. My understanding of:
          DictionaryVectorizer
          CollocDriver
          CollocMapper
          CollocReducer

          is that first all collocations are built and counted, and only those with freq > minCount are kept. If so, I think that one more preprocessing step would make it faster in our setting (a huge number of typos and a long, long tail), i.e. building collocations only for tokens (1-grams) with freq > minCount. I'll experiment with this for a few hours today.

          Timothy Potter added a comment -

          Hi Szymon,

          I'd like to use the vectors you created with the more sophisticated analyzer. Can you post the tfidf-vectors directory from your HDFS to Grant's S3 bucket, such as: s3n://asf-mail-archives/mahout-0.4/tfidf-vectors

          The best way would be to use distcp from Hadoop:

          hadoop distcp HDFS_PATH/tfidf-vectors s3n://asf-mail-archives/mahout-0.4/

          You'll also need to set up your s3n access key and secret key in core-site.xml, such as:

          <property>
          <name>fs.s3n.awsAccessKeyId</name>
          <value>ACCESSKEY</value>
          </property>
          <property>
          <name>fs.s3n.awsSecretAccessKey</name>
          <value>SECRETKEY</value>
          </property>

          Szymon Chojnacki added a comment -

          I'll place both tf-vectors and tfidf-vectors generated for 1-grams with TamingAnalyzer.java (uploaded to MAHOUT-588). Parameters used for DictionaryVectorizer.createTermFrequencyVectors():

          int minSupport = 100;
          int maxNGramSize = 1;
          float minLLRValue = LLRReducer.DEFAULT_MIN_LLR;
          int reduceTasks = 10;
          int chunkSize = 128;
          boolean sequentialAccessOutput = false;
          // new parameters in new API
          float normPower=PartialVectorMerger.NO_NORMALIZING;
          boolean logNormalize=false;
          boolean namedVectors=false;

          parameters used in TFIDFConverter.processTfIdf():

          int reduceTasks = 10;
          int chunkSize = 128;
          boolean sequentialAccessOutput = false;
          int minDf = 1;
          int maxDFPercent = 80;
          float norm = PartialVectorMerger.NO_NORMALIZING;
          boolean logNormalize=false;
          boolean namedVectors=false;

          It is recommended in "Mahout in Action" that LDA should get tf-vectors as input. Thanks for explaining distcp. By the way, so far I have given up on 3-grams: I was getting OutOfMemory even with -Xmx2000M, and after setting -Xmx4000M Hadoop threw an IOException.

          Szymon Chojnacki added a comment -

          A custom extension of org.apache.lucene.analysis.Analyzer:

          mail-specific stop words added,
          alphanumeric tokens preserved,
          token length limited to the range [2,40]

          Timothy Potter added a comment -

          Thanks for posting the vectors Szymon. Unfortunately, I'm not able to read them due to permissions:

          $ s3cmd get s3://asf-mail-archives/tfidf-vectors/part-r-00000
          s3://asf-mail-archives/tfidf-vectors/part-r-00000 -> ./part-r-00000 [1 of 1]
          ERROR: S3 error: 403 (Forbidden)

          Grant - Can you make s3://asf-mail-archives/tfidf-vectors/ readable to the public? I tried doing it from my end with no success:

          $ s3cmd setacl --acl-public --recursive s3://asf-mail-archives/tfidf-vectors/
          ERROR: S3 error: 403 (AccessDenied): Access Denied

          Grant Ingersoll added a comment -

          I've made them public.

          Szymon Chojnacki added a comment -

          TamingTokenizer.java
          TamingAnalyzer.java

          were used to tokenize the sequence files stored in HDFS. Parameters are hard-coded (it is assumed that the sequence files are in TamingSEQ).

          Szymon Chojnacki added a comment -

          TamingDictVect invokes TamingDictionaryVectorizer. Parameters are hard-coded. It is assumed that the emails were tokenized with TamingTokenizer into TamingVEC.

          TamingDictionaryVectorizer is a modified version of the DictionaryVectorizer used in seq2sparse. The proposed implementation builds collocations only for words above a parametrized frequency, which requires one more preprocessing MR job. This optimization is based on the observation that if we wish to keep n-grams with minCount > 100, then each token in the n-gram must have count > 100. In practice we can assume that the count of a single token is a few times higher than the desired n-gram count (see the sketch after the class notes below).

          The implementation differences are as follows:

          /* Modified DictionaryVectorizer; now collocations are built only for tokens with a defined minimum support. Most changes are enclosed by a commented informative statement. The following classes were also modified:
          1. TamingCollocDriver

          • modified CollocDriver
          • changed CollocMapper to TamingCollocMapper
          • added frequent tokens to DistributedCache

          2. TamingCollocMapper

          • modified CollocMapper
          • in setup() a limited dictionary is loaded
          • only terms with all tokens in the dictionary are sent forward to reducers
          • terminology used: a term is e.g. "Coca cola"; its tokens are "coca" and "cola"

          3. TamingGramKeyGroupComparator

          • unmodified GramKeyGroupComparator
          • just moved to current package to obtain visibility for TamingCollocDriver
            */
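
          A minimal sketch of the pruning rule described above (illustrative only, not the attached TamingCollocMapper): an n-gram is kept only if every one of its tokens is in the frequent-token dictionary.

          import java.util.Arrays;
          import java.util.HashSet;
          import java.util.Set;

          // If an n-gram is to reach minCount, each of its tokens must reach minCount
          // too, so n-grams containing any infrequent token can be dropped up front.
          public class FrequentTokenNGramFilter {
            private final Set<String> frequentTokens;

            public FrequentTokenNGramFilter(Set<String> frequentTokens) {
              this.frequentTokens = frequentTokens;
            }

            public boolean keep(String ngram) {
              for (String token : ngram.split(" ")) {
                if (!frequentTokens.contains(token)) {
                  return false;
                }
              }
              return true;
            }

            public static void main(String[] args) {
              Set<String> frequent = new HashSet<String>(
                  Arrays.asList("apache", "software", "foundation"));
              FrequentTokenNGramFilter filter = new FrequentTokenNGramFilter(frequent);
              System.out.println(filter.keep("apache software"));  // true
              System.out.println(filter.keep("apache softwrae"));  // false: typo token is infrequent
            }
          }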
          Szymon Chojnacki added a comment -

          TamingTFIDF.java is used after TamingDictVect.java; it calculates tf-idf statistics for vectors that so far contain only df statistics.

          after this step we can run various clustering algorithms on vectors in
          TamingVEC/tfidf-vectors
          and
          TamingVEC/tf-vectors

          Szymon Chojnacki added a comment -

          clusters1.txt

          A first insight into clustering the data with collocations. The clusters are of poor quality, but some 2- and 3-grams appear among the top coordinates, e.g.:

          content type => 2.0620927878006285
          jira browse => 0.6988600139794434
          hi all => 1.1158225446599421
          apache software => 2.1034713022485554
          pgp signature version => 3.9832009961801447

          Overview of 60 clusters obtained after 1 iteration of k-Means with RandomSeedGeneration.

          Ted Dunning added a comment -

          Overview of 60 clusters obtained after 1 iteration of k-Means with RandomSeedGeneration.

          One iteration?

          Grant Ingersoll added a comment -

          I'm guessing he means one run of the clustering algorithm, not one iteration of the k-means algorithm, but I'll let him say for sure

          Szymon Chojnacki added a comment (edited) -

          One iteration was the most I could get after a few hours of struggling with both Mahout and Hadoop. The problem is a tricky one. I am now running a 10-iteration task. A short description of the problem:

          0. I started a typical 10-iteration job
          1. 60 'canopies' were initialized with RandomSeedGeneration successfully
          2. The first iteration ran successfully over the canopies and output /clusters-1
          3. The second iteration threw Error: Heap stack overflow

          • I suspected memory leak in KMeansDriver,
          • I set up 1 iteration of KMeansDriver with canopies in /clusters-1
          • the memory problem appeared again
          • it was surprising to me because there is virtually no difference between Iteration-1 and Iteration-2 (at least I thought so)

          4. The problem turned out to be the fact that the random seed centroids are very sparse, while the centroids we get after the first iteration are very dense. The size of the 60 random seeds is 114 KB; the size of the 60 centroids after the first iteration is >400 MB! I had mapred.tasktracker.map.tasks.maximum=40, so I ran out of memory quickly during the setup of KMeansMapper (see the rough size check after this comment).

          5. I played with various -Xmx vs. maxMappers configurations, and I don't get an error with:
          -Xmx3500 and mapred.tasktracker.map.tasks.maximum=1
          I get an error with
          -Xmx2000 and mapred.tasktracker.map.tasks.maximum=2

          I think I cannot run more than 2 mappers with more than -Xmx2000, as I have 6 GB nodes.
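
          A rough back-of-the-envelope check of point 4 (assuming 8-byte double values and the ~935K-dimensional vectors reported later in this thread): 60 centroids x 935,173 coordinates x 8 bytes comes to roughly 450 MB of dense centroid data loaded by every mapper, which matches the >400 MB observation and explains why 40 concurrent mappers exhaust a 6 GB node.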

          Ted Dunning added a comment -

          A hashed representation would help here. With collocation features, your vector size can easily reach into the millions or more for a large corpus even with a frequency cutoff.

          With a hashed representation and multiple probing you could probably get decent results with size < 10^6, possibly with size == 10^5.

          Szymon Chojnacki added a comment -

          Thank you for the advice.
          As I see it, CosineDistance is currently partially optimized in the context of k-means, and centroidLengthSquare is computed only once:

          public double distance(double centroidLengthSquare, Vector centroid, Vector v)

          which is significantly faster than the standard

          public double distance(Vector v1, Vector v2)

          however, it is assumed that v1 and v2 are sparse and that the time of dotProduct is proportional to the number of non-empty coordinates in both vectors:

          double dotProduct = v2.dot(v1);

          Your suggestion to implement v2 (the centroid vector) by means of a hashmap would definitely improve the speed of calculating the distance between points and centroids, and as a result k-means itself.

          Regards
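
          For reference, a minimal sketch of the precomputed-centroid-norm trick discussed above, written with plain dense arrays rather than Mahout's Vector API (illustrative only, not Mahout's implementation):

          // Cosine distance d(c, v) = 1 - (c . v) / (|c| * |v|), with |c|^2 passed in
          // once per centroid so it is not recomputed for every point.
          public class CosineDistanceSketch {
            static double distance(double centroidLengthSquare, double[] centroid, double[] v) {
              double dot = 0.0;
              double vLengthSquare = 0.0;
              for (int i = 0; i < v.length; i++) {
                dot += centroid[i] * v[i];
                vLengthSquare += v[i] * v[i];
              }
              double denominator = Math.sqrt(centroidLengthSquare) * Math.sqrt(vLengthSquare);
              return denominator == 0.0 ? 1.0 : 1.0 - dot / denominator;
            }

            public static void main(String[] args) {
              double[] centroid = {1.0, 2.0, 0.0};
              double[] point = {1.0, 0.0, 1.0};
              double centroidLengthSquare = 1.0 * 1.0 + 2.0 * 2.0; // precomputed once
              System.out.println(distance(centroidLengthSquare, centroid, point));
            }
          }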

          Ted Dunning added a comment -

          Your suggestion to implement v2 (the centroid vector) by means of a hashmap would definitely improve the speed of calculating the distance between points and centroids, and as a result k-means itself.

          Uh.... my suggestion was to use a hashed feature encoding of the feature vector, not to use a hashmap. Hashed feature encoding assigns multiple features to each position in the vector and multiple locations to each feature. It says nothing about the matrix representation.
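
          A minimal sketch of hashed feature encoding with multiple probes, as described above (illustrative only; Mahout also has its own hashed vector encoders):

          // Each feature is hashed into several positions of a fixed-size vector,
          // so the dimensionality stays bounded regardless of how many n-grams exist.
          public class HashedEncodingSketch {
            private final double[] vector;
            private final int probes;

            public HashedEncodingSketch(int cardinality, int probes) {
              this.vector = new double[cardinality];
              this.probes = probes;
            }

            public void addFeature(String feature, double weight) {
              for (int probe = 0; probe < probes; probe++) {
                int hash = (feature + "#" + probe).hashCode();
                int index = Math.floorMod(hash, vector.length);
                vector[index] += weight;
              }
            }

            public static void main(String[] args) {
              // 2^17 = 131072 positions, roughly the "size == 10^5" regime mentioned above.
              HashedEncodingSketch encoder = new HashedEncodingSketch(1 << 17, 2);
              encoder.addFeature("apache software", 1.0);
              encoder.addFeature("jira browse", 1.0);
              int nonZero = 0;
              for (double value : encoder.vector) {
                if (value != 0.0) {
                  nonZero++;
                }
              }
              System.out.println("non-zero positions: " + nonZero);
            }
          }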

          For speed considerations alone, if the centroid vector has more than about 10-20% of the elements non-zero, then you should avoid sparse representations and just use the dense form. If space is critical then you may want to use a sparse representation up to about 30-40% fill.

          So how sparse is your centroid vector?

          Timothy Potter added a comment -

          Improved TamingAnalyzer to use existing Lucene Analyzers, and documented the steps to run a Mahout clustering job on an EC2 cluster using the hadoop/src/contrib/ec2 scripts.

          Szymon Chojnacki added a comment (edited) -

          Thank you, Ted, for your support.
          The centroid vectors in our 3-gram set have 935,173 coordinates; on average 372,452 are non-empty (39.8%). For now we have limited the dimensionality to around 500K by preserving only 1- and 2-grams. Once we are successful with the smaller dimensionality, we will come back to the issue of hashed feature encoding with 3-grams.
          Regards

          Szymon Chojnacki added a comment (edited) -

          60_clusters_kmeans_10_iterations_100K_coordinates.txt

          contains the clusters with their 15 top terms, obtained after 10 iterations of k-means. The size of the vectors is ~100K.

          A few interesting clusters with some of the top terms:

          1. fraud, pills, watches, money, prices, you, buy, euro, price,

          3. die, der, und, nicht, ich, ist, das, mit, sie, den, auf, oder, anfragen, zu,

          6. color, 1px, border, background, padding, font, 4px, margin, 10px,

          10. von, betreff, nachricht, gesendet, mittwoch,

          52. maven, repository, jar, build, pom, project, artifact, mvn, dependencies,

          54. tomcat, servlet, web, apache, jsp, file, server, mod, webapps,

          Timothy Potter added a comment -

          Here are the steps I take to vectorize using Amazon's Elastic MapReduce.

          1. Install elastic-mapreduce-ruby tool:

          On Debian-based Linux:

          sudo apt-get install ruby1.8
          sudo apt-get install libopenssl-ruby1.8
          sudo apt-get install libruby1.8-extras

          Once these dependencies are installed, download and extract the elastic-mapreduce-ruby app:

          mkdir -p /mnt/dev/elastic-mapreduce /mnt/dev/downloads
          cd /mnt/dev/downloads
          wget http://elasticmapreduce.s3.amazonaws.com/elastic-mapreduce-ruby.zip
          cd /mnt/dev/elastic-mapreduce
          unzip /mnt/dev/downloads/elastic-mapreduce-ruby.zip

          • Create a file named credentials.json in /mnt/dev/elastic-mapreduce
          • See: http://aws.amazon.com/developertools/2264?_encoding=UTF8&jiveRedirect=1
          • credentials.json should contain the following; note that the region is significant:
          { "access-id": "ACCESS_KEY", "private-key": "SECRET_KEY", "key-pair": "gsg-keypair", "key-pair-file": "/mnt/dev/aws/gsg-keypair.pem", "region": "us-east-1", "log-uri": "s3n://BUCKET/asf-mail-archives/logs/" }

          Also, it's a good idea to add /mnt/dev/elastic-mapreduce to your PATH

          2. Once elastic-mapreduce is installed, start a cluster with no jobflow steps yet:

          elastic-mapreduce --create --alive \
          --log-uri s3n://BUCKET/asf-mail-archives/logs/ \
          --key-pair gsg-keypair \
          --slave-instance-type m1.xlarge \
          --master-instance-type m1.xlarge \
          --num-instances # \
          --name mahout-0.4-vectorize \
          --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configurations/latest/memory-intensive

          This will create an EMR Job Flow named "mahout-0.4-vectorize" in the US-East region. Take note of
          the Job ID returned as you will need it to add the "seq2sparse" step to the Job Flow.

          I'll leave it to you to decide how many instances to allocate, but keep in mind that one will be
          dedicated as the master. Also, it took about 75 minutes to run the seq2sparse job on 19 xlarge
          instances (~190 normalized instance hours – not cheap). I think you'll be safe to use about 10-13
          instances and still finish in under 2 hours.

          Also, notice I'm using Amazon's bootstrap-action for configuring the cluster to run memory-intensive
          jobs. For more information about this, see:
          http://forums.aws.amazon.com/ann.jspa?annID=834

          3. Mahout JAR

          The Mahout 0.4 Jobs JAR with our TamingAnalyzer is available at:
          s3://thelabdude/mahout-examples-0.4-job-tt.jar

          If you need to change other Mahout code, then you'll need to post your own JAR to S3.
          Remember to reference the JAR using the s3n Hadoop protocol.

          4. Schedule a jobflow step to vectorize using Mahout's seq2sparse:

          elastic-mapreduce --jar s3n://thelabdude/mahout-examples-0.4-job-tt.jar \
          --main-class org.apache.mahout.driver.MahoutDriver \
          --arg seq2sparse \
          --arg -i --arg s3n://thelabdude/asf-mail-archives/mahout-0.4/sequence-files/ \
          --arg -o --arg /asf-mail-archives/mahout-0.4/vectors/ \
          --arg --weight --arg tfidf \
          --arg --chunkSize --arg 100 \
          --arg --minSupport --arg 400 \
          --arg --minDF --arg 20 \
          --arg --maxDFPercent --arg 80 \
          --arg --norm --arg 2 \
          --arg --numReducers --arg ## \
          --arg --analyzerName --arg org.apache.mahout.text.TamingAnalyzer \
          --arg --maxNGramSize --arg 2 \
          --arg --minLLR --arg 50 \
          --enable-debugging \
          -j JOB_ID

          These settings are pretty aggressive in order to reduce the vectors
          to around 100,000 dimensions.

          IMPORTANT: Set the number of reducers to 2 x (N-1) (where N is the size of your cluster)
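
          (For example, with a 13-instance cluster as suggested above, that is 2 x (13-1) = 24 reducers.)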

          The job will send output to HDFS instead of S3 (see MAHOUT-598). Once the job
          completes, we'll copy the results to S3 from our cluster's HDFS using distcp.

          NOTE: To monitor the status of the job, use:
          elastic-mapreduce --logs -j JOB_ID

          5. Save log after completion

          Once the job completes, save the log output for further analysis:

          elastic-mapreduce --logs -j JOB_ID > seq2sparse.log

          6. SSH into the master node to run distcp:

          elastic-mapreduce --ssh -j JOB_ID

          hadoop fs -lsr /asf-mail-archives/mahout-0.4/vectors/
          hadoop distcp /asf-mail-archives/mahout-0.4/vectors/ s3n://ACCESS_KEY:SECRET_KEY@BUCKET/asf-mail-archives/mahout-0.4/sparse-2-gram-stem/ &

          Note: You will need all the output from the vectorize step in order to run Mahout's clusterdump.

          7. Shut down your cluster

          Once you've copied the seq2sparse output to S3, you can shut down your cluster.

          elastic-mapreduce --terminate -j JOB_ID

          Verify the cluster is terminated in your Amazon console.

          8. Make the vectors public in S3 using the Amazon console or s3cmd:

          s3cmd setacl --acl-public --recursive s3://BUCKET/asf-mail-archives/mahout-0.4/sparse-2-gram-stem/

          9. Dump out the size of the vectors

          bin/mahout vectordump --seqFile s3n://ACCESS_KEY:SECRET_KEY@BUCKET/asf-mail-archives/mahout-0.4/sparse-2-gram-stem/tfidf-vectors/part-r-00000 --sizeOnly | more

          Sean Owen added a comment -

          All things that would be most excellent to put into the wiki!
          https://cwiki.apache.org/MAHOUT/mahout-wiki.html

          Szymon Chojnacki added a comment -

          Two classes:

          TamingSubset.java
          TamingSubsetMapper.java

          can be used to extract a subset of vectors from, e.g., a /tfidf-vectors/ directory. This is useful for getting some results when clustering runs too long with the available resources (or throws a heap error).

          Usage:
          TamingSubset [inputPath] [outputPath] [percentage to retain (0,100)]
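
          For illustration, an invocation of this utility packaged into a job jar might look like the sketch below; the jar name and paths are assumptions, not part of the attached code:

            # Hypothetical packaging/invocation of the subset utility described above:
            # keep roughly 10% of the vectors under tfidf-vectors (jar name and paths are placeholders).
            hadoop jar taming-utils.jar TamingSubset \
              /asf-mail-archives/mahout-0.4/vectors/tfidf-vectors \
              /asf-mail-archives/mahout-0.4/vectors/tfidf-vectors-subset \
              10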

          Grant Ingersoll added a comment -

          Hey Guys,

          Really coming along nicely. For any code you have, I'd suggest we choose logical names based on what the classes do rather than on the fact that you are helping me out w/ the book (e.g. TamingAnalyzer, etc.), so that they make sense in a greater context. When this is all said and done, I think it will make for a nice, packaged example that anyone can use, regardless of whether they purchase the book (although, of course, they should buy it!)

          Timothy Potter added a comment -

          Good point about the naming conventions ... will do.

          Also, as we're nearing completion on our side, I need to update the Mahout wiki with the EC2 / EMR information we've gathered throughout this process. I see there are pages for EC2 and EMR already. Do you want me to add our EC2 / EMR material to these existing pages, or should we have our own page, such as "Benchmarking Mahout Clustering on EC2"?

          Isabel Drost-Fromm added a comment -

          I think there are really three interesting views on your implementation that should be documented:

          Anything special you found was needed to get Mahout up and running on EC2/EMR, and that is not yet included in the respective wiki pages, would be great to have integrated and updated there.

          I'd suggest adding your findings wrt. benchmarking (running times, experimental results, size of the corpus used for testing, any fancy performance comparison graphs you generated) to the Benchmark Wiki page:
          https://cwiki.apache.org/confluence/display/MAHOUT/Mahout+Benchmarks

          As for the general benchmarking setup (design of your implementation, how to install and run it, limitations and constraints), I think that would be nice to have on a separate wiki page linked from the "Implementations" section:

          https://cwiki.apache.org/confluence/display/MAHOUT/Mahout+Wiki#MahoutWiki-ImplementationBackground

          Might make sense to provide links between those pages to make discovering information easier.

          Szymon Chojnacki added a comment -

          The PDF file mahout-588_distribution.pdf contains an analysis of the relationship between the minCount cutoff set in seq2sparse and the maximum cardinality of the tfidf-vectors or centroids.
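
          For context, the cutoff being analyzed presumably corresponds to seq2sparse's --minSupport option; rerunning vectorization at different cutoffs to compare the resulting dictionary/vector cardinality might look like the sketch below (the paths and the 400/1000 values are placeholders):

            # Hypothetical runs comparing two cutoffs; -i/-o paths and cutoff values are placeholders.
            bin/mahout seq2sparse -i /path/to/sequence-files -o /path/to/vectors-min400 \
              --weight tfidf --minSupport 400 --minDF 20 --maxDFPercent 80 --norm 2
            bin/mahout seq2sparse -i /path/to/sequence-files -o /path/to/vectors-min1000 \
              --weight tfidf --minSupport 1000 --minDF 20 --maxDFPercent 80 --norm 2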

          Timothy Potter added a comment -

          Thanks for the instructions Isabel. The only problem I see is that the current EC2 wiki is primarily based around creating your own Hadoop AMI, whereas my instructions are based on using an existing Hadoop 0.20.2 AMI from bixolabs (S3 bucket: 453820947548/bixolabs-public-amis). Moreover, I think our process is much easier, but the process that is currently on the wiki is still valid.

          My updated notes are attached along with the setup script I used to create the SequenceFiles.

          Timothy Potter added a comment -

          Thanks for the feedback Isabel.

          The only caveat I see is that the current EC2 documentation is primarily based around creating your own Hadoop AMI, whereas my instructions are based on using an existing Hadoop 0.20.2 AMI from bixolabs (S3 bucket: 453820947548/bixolabs-public-amis). Moreover, I think our process is much easier, but the process that is currently on the wiki is still valid.

          My updated notes are attached along with the setup script I used to create the SequenceFiles. I'm thinking I should just create a separate section on the EC2 page with our process...

          Szymon Chojnacki added a comment -

          The file mahout-588_canopy.pdf contains a description of the steps I undertook in order to find T1 and T2 values that would give a small set of canopies within a limited time constraint.
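
          For reference, once suitable thresholds are identified they are passed to Mahout's canopy driver; a minimal sketch is below, where the distance measure, the T1/T2 values and the paths are placeholders rather than the values arrived at in the PDF:

            # Hypothetical canopy run; -t1/-t2 values, distance measure and paths are placeholders.
            bin/mahout canopy \
              -i /asf-mail-archives/mahout-0.4/vectors/tfidf-vectors \
              -o /asf-mail-archives/mahout-0.4/canopy-centroids \
              -dm org.apache.mahout.common.distance.CosineDistanceMeasure \
              -t1 0.15 -t2 0.1 -ow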

          Mat Kelcey added a comment -

          Hey Timothy!
          I'm a software engineer on the EMR project and I'm interested in finding out more about your EMR/Mahout experiences.
          Hoping that we can have a chat sometime?
          I reckon I'll be able to get you some credits too to help with your experimentation; you shouldn't be out of pocket for work that's useful for everyone.
          Cheers,
          Mat

          Timothy Potter added a comment -

          Updated the EMR page in the Mahout wiki with the steps we used to create vectors for benchmarking. Also, as requested by Grant, I've renamed the text analyzer we're using to MailArchivesClusteringAnalyzer instead of TamingAnalyzer. Added test cases for the new code.

          Timothy Potter added a comment -

          Added EC2 steps to wiki at:

          https://cwiki.apache.org/confluence/display/MAHOUT/Use+an+Existing+Hadoop+AMI

          These instructions are complementary to the existing EC2 page in that I assume an AMI already exists, whereas the existing page is valid if there is no suitable AMI available.

          Grant Ingersoll added a comment -

          Tim or Szymon,

          Can you put the code to be added to Mahout in patch form? That will make it easy to see where everything lives, etc. With that, I can review and look to commit.

          Thanks,
          Grant

          Grant Ingersoll added a comment -

          See https://cwiki.apache.org/confluence/display/MAHOUT/How+To+Contribute
          Timothy Potter added a comment -

          Szymon,

          I'll create a patch for the following files. I'm not sure what you're planning for the other Java classes (Taming*), so I'll leave those in your good hands ...

          MailArchivesClusteringAnalyzer.java
          MailArchivesClusteringAnalyzerTest.java
          SequenceFilesFromMailArchives.java
          SequenceFilesFromMailArchivesTest.java
          prep_asf_mail_archives.sh

          Cheers,
          Tim

          Timothy Potter added a comment -

          Patch file for trunk

          Szymon Chojnacki added a comment -

          Tim,

          I would have to rethink how the other utility classes could be used. If I find a good place to inject them into a broader analysis, I'll try to create a patch.

          Cheers
          PS: thank you for the LinkedIn invitation

          Grant Ingersoll added a comment -

          Hi Tim,

          For the shell script, can we parameterize that a bit more? As in pass in the prep dir and output dir?

          Also, s3cmd is GPL, so we can't include it, but we should at least document that it is required for this script to work. Perhaps we could use s3-curl which is BSD and does the same thing and could be bundled in?

          Thanks,
          Grant

          Grant Ingersoll added a comment -

          I should say, otherwise the patch looks good!

          Grant Ingersoll added a comment -

          Tim, I committed the code piece, so that leaves just the shell script to update. Thanks!

          Timothy Potter added a comment -

          Script updated to allow the user to pass the prep working directory and output path as command-line args. Also added better documentation, basic sanity checking on the values, and simple error handling.
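
          As an illustration of the kind of argument handling described above, a minimal bash sketch follows; the variable names, usage message and checks are assumptions, not the actual contents of prep_asf_mail_archives.sh:

            #!/bin/bash
            # Sketch of parameterized prep/output directories with basic sanity checking
            # and simple error handling; names and messages are hypothetical.
            if [ $# -lt 2 ]; then
              echo "Usage: $0 <prep_dir> <output_dir>" >&2
              exit 1
            fi
            PREP_DIR="$1"
            OUTPUT_DIR="$2"

            mkdir -p "$PREP_DIR" "$OUTPUT_DIR" || {
              echo "Unable to create $PREP_DIR or $OUTPUT_DIR" >&2
              exit 1
            }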

          Hudson added a comment -

          Integrated in Mahout-Quality #708 (See https://hudson.apache.org/hudson/job/Mahout-Quality/708/)

          Grant Ingersoll added a comment -

          I made some mods so the script checks whether the archives have already been downloaded and extracted.
          Committed revision 1089630.

          Thanks, Tim!
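
          A minimal sketch of that kind of idempotency check is below; the archive name, URL and directories are placeholders, not the exact logic committed in r1089630:

            # Skip the download and extract steps when they have already been done;
            # ARCHIVE_URL, ARCHIVE and PREP_DIR are placeholders.
            ARCHIVE_URL="http://example.org/path/to/archives.tar.gz"
            ARCHIVE="$PREP_DIR/archives.tar.gz"

            if [ ! -f "$ARCHIVE" ]; then
              wget -O "$ARCHIVE" "$ARCHIVE_URL"
            fi

            if [ ! -d "$PREP_DIR/extracted" ]; then
              mkdir -p "$PREP_DIR/extracted"
              tar -xzf "$ARCHIVE" -C "$PREP_DIR/extracted"
            fi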

          Hudson added a comment -

          Integrated in Mahout-Quality #725 (See https://hudson.apache.org/hudson/job/Mahout-Quality/725/)
          MAHOUT-588: Script for downloading and preparing the Apache Mail archives for clustering

          Sean Owen added a comment -

          Looks like a lot of great work. I'm not clear on whether there are additional steps here – is it "done"? In any event, it looks like something we should make sure is finished before 0.6 ships.

          Nathan Halko added a comment -

          I'm looking to do some benchmarking with the asf-mail-archives data set, mainly to compare Lanczos SVD with the new SSVD. I can't get access to the bucket. I see that Grant said they were public, but for the life of me I can't get at them. I tried as if it's truly public, like the /elasticmapreduce bucket, and got an access denied. Then with the /ACCESS_KEY:SECRET_KEY@.. form I get

          The request signature we calculated does not match the signature you provided. Check your key and signing method.

          Any ideas??

          Sean Owen added a comment -

          That isn't a permission-denied error. It means the request wasn't quite signed correctly. I use their client libraries to get the signing right; it's tricky.
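
          One hedged workaround (not something suggested in this thread) is to pass the AWS credentials as Hadoop configuration properties instead of embedding them in the s3n URI, which also avoids signature problems when the secret key contains characters such as '/'; the bucket and HDFS destination below are placeholders:

            # Pass credentials via Hadoop properties rather than in the s3n:// URI;
            # BUCKET and the HDFS destination path are placeholders.
            hadoop distcp \
              -Dfs.s3n.awsAccessKeyId=ACCESS_KEY \
              -Dfs.s3n.awsSecretAccessKey=SECRET_KEY \
              s3n://BUCKET/asf-mail-archives/mahout-0.4/sparse-2-gram-stem/tfidf-vectors \
              /asf-mail-archives/tfidf-vectors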

          Grant Ingersoll added a comment -

          I've turned off access to mine. You should now use the Amazon Public Dataset: http://aws.amazon.com/datasets/7791434387204566

          Nathan Halko added a comment -

          So is there still access to the tfidf-vectors as detailed here: https://cwiki.apache.org/MAHOUT/dimensional-reduction.html? I mounted the Public Dataset in EBS but would really like to avoid parsing the raw files into vectors. Sean, do you have any references for getting the signing right? I haven't found anything useful so far. Thanks


            People

            • Assignee:
              Grant Ingersoll
            • Reporter:
              Grant Ingersoll
            • Votes:
              0
            • Watchers:
              5
