Mahout
MAHOUT-123

Implement Latent Dirichlet Allocation

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.2
    • Fix Version/s: 0.2
    • Component/s: Clustering
    • Labels: None

      Description

      (For GSoC)

      Abstract:

      Latent Dirichlet Allocation (Blei et al., 2003) is a powerful learning
      algorithm for automatically and jointly clustering words into "topics"
      and documents into mixtures of topics, and it has been successfully
      applied to model change in scientific fields over time (Griffiths and
      Steyvers, 2004; Hall et al., 2008). In this project, I propose to
      implement a distributed variant of Latent Dirichlet Allocation using
      MapReduce, and, time permitting, to investigate extensions of LDA and
      possibly more efficient algorithms for distributed inference.

      Detailed Description:

      A topic model is, roughly, a hierarchical Bayesian model that
      associates with each document a probability distribution over
      "topics", which are in turn distributions over words. For instance, a
      topic in a collection of newswire might include words about "sports",
      such as "baseball", "home run", "player", and a document about steroid
      use in baseball might include "sports", "drugs", and "politics". Note
      that the labels "sports", "drugs", and "politics" are post-hoc labels
      assigned by a human, and that the algorithm itself only associates
      words with probabilities. The task of parameter estimation
      in these models is to learn both what these topics are, and which
      documents employ them in what proportions.
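
      For reference, the generative process that (smoothed) LDA assumes can
      be summarized as follows (after Blei et al., 2003; the notation here
      is mine): for each topic k, each document d, and each word position n,

          \phi_k \sim \mathrm{Dirichlet}(\beta), \qquad
          \theta_d \sim \mathrm{Dirichlet}(\alpha), \qquad
          z_{d,n} \sim \mathrm{Multinomial}(\theta_d), \qquad
          w_{d,n} \sim \mathrm{Multinomial}(\phi_{z_{d,n}}),

      so \theta_d gives a document's topic proportions and \phi_k a topic's
      distribution over words.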

      One of the promises of unsupervised learning algorithms like Latent
      Dirichlet Allocation (LDA; Blei et al., 2003) is the ability to take
      massive collections of documents and condense them down into a
      collection of easily understandable topics. However, no available
      open source implementation of LDA or related topic models is
      distributed, which hampers their utility. This project seeks to
      correct this shortcoming.

      In the literature, there have been several proposals for parallelizing
      LDA. Newman et al. (2007) proposed to create an "approximate" LDA in
      which each processor gets its own subset of the documents to run
      Gibbs sampling over. However, Gibbs sampling is slow and stochastic by
      its very nature, which is not advantageous for repeated runs. Instead,
      I propose to follow Nallapati et al. (2007) and use a variational
      approximation that is fast and non-random.
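
      For context, variational inference replaces sampling with
      deterministic optimization: for each document it maximizes a lower
      bound on the log likelihood (Blei et al., 2003),

          \log p(w \mid \alpha, \beta) \;\ge\;
              E_q[\log p(\theta, z, w \mid \alpha, \beta)]
              - E_q[\log q(\theta, z)],

      where q is a fully factorized distribution over the hidden variables.
      Because the per-document E-step is independent across documents, the
      computation is both deterministic and naturally parallel.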

      References:

      David M. Blei and Jon D. McAuliffe. Supervised Topic Models. NIPS,
      2007.

      David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent Dirichlet
      Allocation. Journal of Machine Learning Research, 3:993-1022, 2003.

      T. L. Griffiths and M. Steyvers. Finding Scientific Topics.
      Proceedings of the National Academy of Sciences, 101(Suppl
      1):5228-5235, April 2004.

      David L. W. Hall, Daniel Jurafsky, and Christopher D. Manning.
      Studying the History of Ideas Using Topic Models. EMNLP, Honolulu,
      2008.

      Ramesh Nallapati, William Cohen, and John Lafferty. Parallelized
      Variational EM for Latent Dirichlet Allocation: An Experimental
      Evaluation of Speed and Scalability. ICDM Workshop on High
      Performance Data Mining, 2007.

      David Newman, Arthur Asuncion, Padhraic Smyth, and Max Welling.
      Distributed Inference for Latent Dirichlet Allocation. NIPS, 2007.

      Xuerui Wang and Andrew McCallum. Topics over Time: A Non-Markov
      Continuous-Time Model of Topical Trends. KDD, 2006.

      Jason Wolfe, Aria Haghighi, and Dan Klein. Fully Distributed EM for
      Very Large Datasets. ICML, 2008.

      1. lda.patch
        27 kB
        David Hall
      2. MAHOUT-123.patch
        26 kB
        David Hall
      3. MAHOUT-123.patch
        33 kB
        David Hall
      4. MAHOUT-123.patch
        0.1 kB
        David Hall
      5. MAHOUT-123.patch
        33 kB
        David Hall
      6. MAHOUT-123.patch
        43 kB
        David Hall
      7. MAHOUT-123.patch
        46 kB
        David Hall
      8. MAHOUT-123.patch
        57 kB
        David Hall
      9. MAHOUT-123.patch
        58 kB
        David Hall
      10. MAHOUT-123.patch
        60 kB
        Grant Ingersoll
      11. MAHOUT-123.patch
        59 kB
        David Hall

        Activity

        Grant Ingersoll added a comment -

        Committed revision 804979.

        Grant Ingersoll added a comment -

        Agreed that better stopword handling should improve the results, but the brain quickly filters those out anyway.

        Grant Ingersoll added a comment -

        Success!

        I will look to commit soon. Good job, David!

        Grant Ingersoll added a comment -

        The problem was that the edited patch wrote topics to .../examples/ and not ../examples, which took a frustratingly long time to figure out.

        Ouch. I will try it out.

        David Hall added a comment -

        The problem was that the edited patch wrote topics to .../examples/ and not ../examples, which took a frustratingly long time to figure out.

        I also played with the parameters a little longer. The topics aren't as great as I'd like, but it's because I haven't figured out the right setting for getting rid of stop words. "could" and "said" are still in there. That said, they're mostly coherent topics, if kind of boring.

        – David

        Grant Ingersoll added a comment -

        Moved bin/ to the examples directory and added ASL headers to some files. Ran the test, which seems to go fine, but it didn't output any topics.

        To run, the instructions are now:

        cd <MAHOUT_HOME>/examples
        bin/build-reuters.sh

        Yanen Li added a comment -

        Another issue needs to be fixed. In the core dir, when running
        mvn install

        the build failed with this message:

        ===============================================================================
        [ERROR] BUILD FAILURE
        [INFO] ------------------------------------------------------------------------
        [INFO] Compilation failure
        /home/yaneli/workspace/java/mahout_6/core/src/test/java/org/apache/mahout/clustering/lda/TestMapReduce.java:[123,12]
        cannot find symbol
        symbol : method
        map(<nulltype>,org.apache.hadoop.io.Text,org.apache.hadoop.mapreduce.Mapper<org.apache.hadoop.io.WritableComparable<?>,org.apache.mahout.matrix.Vector,org.apache.mahout.clustering.lda.IntPairWritable,org.apache.hadoop.io.DoubleWritable>.Context)
        location: class org.apache.mahout.clustering.lda.LDAMapper

        /home/yaneli/workspace/java/mahout_6/core/src/test/java/org/apache/mahout/clustering/lda/TestMapReduce.java:[123,12]
        cannot find symbol
        symbol : method
        map(<nulltype>,org.apache.hadoop.io.Text,org.apache.hadoop.mapreduce.Mapper<org.apache.hadoop.io.WritableComparable<?>,org.apache.mahout.matrix.Vector,org.apache.mahout.clustering.lda.IntPairWritable,org.apache.hadoop.io.DoubleWritable>.Context)
        location: class org.apache.mahout.clustering.lda.LDAMapper

        [INFO] ------------------------------------------------------------------------
        [INFO] For more information, run Maven with the -e switch
        [INFO] ------------------------------------------------------------------------
        [INFO] Total time: 21 seconds
        [INFO] Finished at: Mon Aug 03 15:23:02 PDT 2009
        [INFO] Final Memory: 34M/202M
        [INFO] ------------------------------------------------------------------------

        ========================================================================================

        Yanen Li added a comment -

        Now the index is created and the vectors are also created without a
        problem, but there are still exceptions in the gson parser when
        running LDA in standalone mode:

        ====================================================================================

        [WARNING] While downloading easymock:easymockclassextension:2.2
        This artifact has been relocated to org.easymock:easymockclassextension:2.2.

        [INFO] [exec:java]
        09/08/03 15:18:35 INFO lda.LDADriver: Iteration 0
        09/08/03 15:18:35 INFO jvm.JvmMetrics: Initializing JVM Metrics with
        processName=JobTracker, sessionId=
        09/08/03 15:18:35 WARN mapred.JobClient: Use GenericOptionsParser for
        parsing the arguments. Applications should implement Tool for the
        same.
        09/08/03 15:18:35 WARN mapred.JobClient: No job jar file set. User
        classes may not be found. See JobConf(Class) or
        JobConf#setJar(String).
        09/08/03 15:18:35 INFO input.FileInputFormat: Total input paths to process : 1
        09/08/03 15:18:36 INFO input.FileInputFormat: Total input paths to process : 1
        09/08/03 15:18:36 INFO mapred.JobClient: Running job: job_local_0001
        09/08/03 15:18:36 INFO mapred.MapTask: io.sort.mb = 100
        09/08/03 15:18:36 INFO mapred.MapTask: data buffer = 79691776/99614720
        09/08/03 15:18:36 INFO mapred.MapTask: record buffer = 262144/327680
        09/08/03 15:18:37 INFO mapred.JobClient: map 0% reduce 0%
        09/08/03 15:18:37 WARN mapred.LocalJobRunner: job_local_0001
        com.google.gson.JsonParseException: Failed parsing JSON source:
        java.io.StringReader@4977fa9a to Json
        at com.google.gson.JsonParser.parse(JsonParser.java:57)
        at com.google.gson.Gson.fromJson(Gson.java:376)
        at com.google.gson.Gson.fromJson(Gson.java:329)
        at org.apache.mahout.matrix.AbstractVector.decodeVector(AbstractVector.java:358)
        at org.apache.mahout.matrix.AbstractVector.decodeVector(AbstractVector.java:342)
        at org.apache.mahout.clustering.lda.LDAMapper.map(LDAMapper.java:48)
        at org.apache.mahout.clustering.lda.LDAMapper.map(LDAMapper.java:39)
        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:518)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:303)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:176)
        Caused by: com.google.gson.ParseException: Encountered "SEQ" at line
        1, column 1.
        Was expecting one of:
        <DIGITS> ...
        "null" ...
        "NaN" ...
        "Infinity" ...
        <BOOLEAN> ...
        <SINGLE_QUOTE_LITERAL> ...
        <DOUBLE_QUOTE_LITERAL> ...
        ")]}\'\n" ...
        "{" ...
        "[" ...
        "-" ...

        ====================================================================================

        Yanen

        David Hall added a comment -

        Patch fixed for Yanen's problem. Apparently I messed up the dependencies somehow so that they'd work on my machine, but not anywhere else. Now I think it's ok. (I nuked my Maven repo.)

        – David

        Yanen Li added a comment -

        Now I can create the index using the Lucene program. Another error
        occurred when creating vectors from the index:

        (I am in the utils folder)

        ========================================================================================
        Creating vectors from index

        core-job:
        [jar] Building jar:
        /workspace/Mahout_0.2/core/target/mahout-core-0.2-SNAPSHOT.job
        + Error stacktraces are turned on.
        [INFO] Scanning for projects...
        [INFO] Searching repository for plugin with prefix: 'exec'.
        [INFO] ------------------------------------------------------------------------
        [INFO] Building Mahout utilities
        [INFO] task-segment: [exec:java]
        [INFO] ------------------------------------------------------------------------
        [INFO] Preparing exec:java
        [INFO] No goals needed for project - skipping
        [INFO] [exec:java]
        09/08/03 08:40:41 INFO vectors.Driver: Output File: ../core/work/vectors
        09/08/03 08:40:41 WARN util.NativeCodeLoader: Unable to load
        native-hadoop library for your platform... using builtin-java classes
        where applicable
        09/08/03 08:40:41 INFO compress.CodecPool: Got brand-new compressor
        [INFO] ------------------------------------------------------------------------
        [ERROR] BUILD ERROR
        [INFO] ------------------------------------------------------------------------
        [INFO] An exception occured while executing the Java class. null

        [INFO] ------------------------------------------------------------------------
        [INFO] Trace
        org.apache.maven.lifecycle.LifecycleExecutionException: An exception
        occured while executing the Java class. null
        at org.apache.maven.lifecycle.DefaultLifecycleExecutor.executeGoals(DefaultLifecycleExecutor.java:583)
        at org.apache.maven.lifecycle.DefaultLifecycleExecutor.executeStandaloneGoal(DefaultLifecycleExecutor.java:512)
        at org.apache.maven.lifecycle.DefaultLifecycleExecutor.executeGoal(DefaultLifecycleExecutor.java:482)
        at org.apache.maven.lifecycle.DefaultLifecycleExecutor.executeGoalAndHandleFailures(DefaultLifecycleExecutor.java:330)
        at org.apache.maven.lifecycle.DefaultLifecycleExecutor.executeTaskSegments(DefaultLifecycleExecutor.java:291)
        at org.apache.maven.lifecycle.DefaultLifecycleExecutor.execute(DefaultLifecycleExecutor.java:142)
        at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:336)
        at org.apache.maven.DefaultMaven.execute(DefaultMaven.java:129)
        at org.apache.maven.cli.MavenCli.main(MavenCli.java:287)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:616)
        at org.codehaus.classworlds.Launcher.launchEnhanced(Launcher.java:315)
        at org.codehaus.classworlds.Launcher.launch(Launcher.java:255)
        at org.codehaus.classworlds.Launcher.mainWithExitCode(Launcher.java:430)
        at org.codehaus.classworlds.Launcher.main(Launcher.java:375)
        Caused by: org.apache.maven.plugin.MojoExecutionException: An
        exception occured while executing the Java class. null
        at org.codehaus.mojo.exec.ExecJavaMojo.execute(ExecJavaMojo.java:338)
        at org.apache.maven.plugin.DefaultPluginManager.executeMojo(DefaultPluginManager.java:451)
        at org.apache.maven.lifecycle.DefaultLifecycleExecutor.executeGoals(DefaultLifecycleExecutor.java:558)
        ... 16 more
        Caused by: java.lang.reflect.InvocationTargetException
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:616)
        at org.codehaus.mojo.exec.ExecJavaMojo$1.run(ExecJavaMojo.java:283)
        at java.lang.Thread.run(Thread.java:636)
        Caused by: java.lang.NullPointerException
        at org.apache.mahout.utils.vectors.lucene.LuceneIteratable$TDIterator.next(LuceneIteratable.java:111)
        at org.apache.mahout.utils.vectors.lucene.LuceneIteratable$TDIterator.next(LuceneIteratable.java:82)
        at org.apache.mahout.utils.vectors.io.SequenceFileVectorWriter.write(SequenceFileVectorWriter.java:42)
        at org.apache.mahout.utils.vectors.Driver.main(Driver.java:204)
        ... 6 more

        =========================================================================================

        Any idea what is going wrong?

        Yanen

        David Hall added a comment -

        I unfortunately haven't run it on a Hadoop cluster yet. It should "just work" if you run it with the right Hadoop configuration. Shouldn't running it through the "hadoop" shell script add the configuration?

        I'll get it running on a hadoop cluster soon.

        The code actually requires Hadoop 0.20, because Mahout has decided to move in that direction.

        – David

        Yanen Li added a comment (edited) -

        David,

        Cool!

        Do you have instructions on how to run it on a Hadoop cluster? I
        don't think Maven is installed on the Hadoop cluster, so we need to
        specify all the libs and other Java options.

        The Hadoop cluster I am working on is version 0.19.0; is the LDA
        code compatible with it?

        Yanen

        David Hall added a comment -

        Ok, core/bin/build-reuters will:

        download Reuters to work/reuters(something or another), untar it,
        build an index from it using Lucene, convert that index into
        vectors, run LDA for 40 iterations (which is close enough to
        convergence) writing to work/lda, and then dump the top 100 words
        for each topic into work/topics/topic-K, where K is the topic of
        interest.

        Grant Ingersoll added a comment -

        The Mahout Examples pom.xml has a commented out version that does it for Wikipedia. I, too, don't know how to get it to run standalone, as the Antrun task doesn't let you specify the id to run. I guess I'd just have the users download it. Let's get an example working from the code out, then we'll figure out how to make it easy to run.

        David Hall added a comment -

        So it looks like the way Lucene does it is w/ an ant task. I can't figure out the maven way to do this, without my building some kind of jar from it. I'm happy to do it, but I'm not sure what the proper way to do this is.

        Thoughts?

        – David

        Grant Ingersoll added a comment -

        We've automated download of Reuters in Lucene. We are doing research in IR and NLP, IMO.

        I'd like to use Reuters for some classification work, too, so +1 on it. Otherwise, feel free to tell people they need to download it. That's what we do for the Synthetic Control stuff.

        David Hall added a comment -

        Ok, I'll add more comments.

        I had been using Reuters 21578, but I'm not convinced that it's ok to include it, and I was looking around for something better. I'll get the download automated for wikipedia chunks. Is a shell script ok to do most of it?

        – David

        Grant Ingersoll added a comment -

        Patch review:

        Still could use some more comments in the Mapper/Reducer about what is going on.
        Also still needs an example. Note, http://people.apache.org/~gsingers/wikipedia/chunks.tar.gz contains a few hundred wikipedia articles, perhaps that would be a good reference? It's pretty easy to automate the download of those. Otherwise, we can just point people at them.

        Grant Ingersoll added a comment -

        How big is the data? Either we can put it in examples somewhere (resources, likely) or we can tell people to download it. Do you have a pointer to it?

        David Hall added a comment -

        Everything fixed except adding an example.

        What's the best way to include data with Mahout? I've never had luck autogenerating data for LDA.

        Grant Ingersoll added a comment (edited) -

        Notes:

        1. LDADriver – Switch to use Commons-CLI2 for arg processing. See the other clustering algorithms.
        2. Hadoop 0.20 introduces a lot of deprecations; we should clean those up here. No need to put in new code based on deprecated APIs.
        3. Some more comments inline in the Mapper/Reducer would be great, especially explaining what is being collected

        Would be good to see some small example.

        What you have now seems ready to commit given the minor changes above. What is next?

        General note: the wiki link is http://cwiki.apache.org/MAHOUT/latent-dirichlet-allocation.html

        Grant Ingersoll added a comment -

        Patch applies and tests pass. I'll try to dig deeper soon. Keep up the good work.

        David Hall added a comment -

        Tests included, wiki page is created.

        Grant Ingersoll added a comment -

        Tests should cover basic sanity of operation. Serialization can sometimes bite things if you are implementing your own Writable stuff, but not a big deal. Also, it's reasonable to try to test boundary conditions, etc., and that bad input is properly handled (including simply throwing an exception). For a first pass, sanity checks should suffice.
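
        For example, a minimal serialization sanity check for a Writable
        (using Hadoop's IntWritable as a stand-in for a custom type such as
        IntPairWritable; the names here are illustrative, not from the
        patch):

        import java.io.ByteArrayInputStream;
        import java.io.ByteArrayOutputStream;
        import java.io.DataInputStream;
        import java.io.DataOutputStream;

        import org.apache.hadoop.io.IntWritable;

        // Round-trip check: write a Writable to bytes and read it back.
        public class WritableRoundTripCheck {
          public static void main(String[] args) throws Exception {
            IntWritable original = new IntWritable(42);

            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            original.write(new DataOutputStream(bytes));

            IntWritable copy = new IntWritable();
            copy.readFields(
                new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));

            if (original.get() != copy.get()) {
              throw new AssertionError("serialization round trip changed the value");
            }
          }
        }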

        David Hall added a comment -

        Sigh, wrong command again. Reattached an actual patch.

        David Hall added a comment -

        Ok, here's the updates for the vectors. I'll add a page to the wiki shortly.

        As for testing, this is actually something I'd like some direction on. It's never been clear to me how to test the actual implementation of clustering algorithms in any meaningful way. Looking at the Dirichlet clusterer, all that it tests is that serialization works, that things aren't null, and that it outputs the "right" number of things. Serialization in this case doesn't seem terribly necessary since my "models" are just serialized Writables. So... I should just add some basic sanity checks?

        – David

        Grant Ingersoll added a comment -

        Hey David,

        patch applies cleanly, but needs to be brought up to date for the new Vector iterators. Some tests for the various pieces are also needed. A quick few lines on the wiki about how to run it would also be good, if you haven't done that already.

        Otherwise, starting to take shape, keep up the good work.

        David Hall added a comment -

        Right issue this time.

        Mostly functional patch.

        David Hall added a comment -

        (Still in progress.)

        It seems to work, but it's much too slow because I underestimated the badness of using DenseVectors. Switching to an element-wise system now.

        David Hall added a comment -

        This is a rough-cut implementation. Not ready to go yet. I've been waiting on MAHOUT-126 because it seems like the way to create the Vectors I need. Or perhaps there's a better way.

        Basic approach follows the Dirichlet implementation. There is a driver class (LDADriver) which runs K mapreduces, and a Mapper and a Reducer. We also have an Inferencer, which is what the Mapper uses to compute expected sufficient statistics. A document is just a V-dimensional sparse vector of word counts.

        Map: Perform Inference on each document (~ E-step) and output log probabilities of p(word|topic)
        Reduce: logSum the input log probabilities (~ M-Step), and output the result.

        Loop: use the results of the reduce as the log probabilities for the map.
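
        A rough sketch of that Map/Reduce shape (the key/value types match
        the compile error quoted earlier on this page; the method bodies and
        the logSum helper are hypothetical stand-ins, not the patch's actual
        code):

        import java.io.IOException;

        import org.apache.hadoop.io.DoubleWritable;
        import org.apache.hadoop.io.WritableComparable;
        import org.apache.hadoop.mapreduce.Mapper;
        import org.apache.hadoop.mapreduce.Reducer;
        import org.apache.mahout.clustering.lda.IntPairWritable;
        import org.apache.mahout.matrix.Vector;

        class LDAMapperSketch extends
            Mapper<WritableComparable<?>, Vector, IntPairWritable, DoubleWritable> {
          @Override
          protected void map(WritableComparable<?> docId, Vector wordCounts,
                             Context ctx) throws IOException, InterruptedException {
            // ~E-step: run variational inference on this document, then emit
            // a log probability for each (topic, word) pair it touches, e.g.
            // ctx.write(new IntPairWritable(topic, word),
            //           new DoubleWritable(logProb));
          }
        }

        class LDAReducerSketch extends
            Reducer<IntPairWritable, DoubleWritable, IntPairWritable, DoubleWritable> {
          // Numerically stable log(exp(a) + exp(b)) without leaving log space.
          static double logSum(double a, double b) {
            if (a == Double.NEGATIVE_INFINITY) return b;
            if (b == Double.NEGATIVE_INFINITY) return a;
            double max = Math.max(a, b);
            return max + Math.log(Math.exp(a - max) + Math.exp(b - max));
          }

          @Override
          protected void reduce(IntPairWritable key,
                                Iterable<DoubleWritable> values, Context ctx)
              throws IOException, InterruptedException {
            // ~M-step: logSum all per-document contributions for this
            // (topic, word) pair and emit the combined log probability.
            double sum = Double.NEGATIVE_INFINITY;
            for (DoubleWritable v : values) {
              sum = logSum(sum, v.get());
            }
            ctx.write(key, new DoubleWritable(sum));
          }
        }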

        Remaining:
        1) Actually run the thing
        2) Number-of-non-zero elements in a sparse vector. Is that staying "size"?
        3) Allow for computing of likelihood to determine when we're done.
        4) What's the status of serializing as sparse vector and reading as a dense vector? Is that going to happen?
        5) Find a fun data set to bundle...
        6) Convenience method for running just inference on a set of documents and outputting MAP estimates of word probabilities.


          People

          • Assignee: Grant Ingersoll
          • Reporter: David Hall
          • Votes: 0
          • Watchers: 4

            Dates

            • Created:
              Updated:
              Resolved:

              Time Tracking

              Original Estimate: 504h
              Remaining Estimate: 504h
              Time Spent: Not Specified
