Details

    • Type: New Feature
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 0.3
    • Fix Version/s: 0.3
    • Component/s: None
    • Labels: None

      Description

      Identifies interesting Collocations in text using ngrams scored via the LogLikelihoodRatio calculation.
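
      For reference, the score here is Dunning's log-likelihood ratio (G2) computed from a 2x2 contingency table of ngram counts. A minimal, self-contained sketch of the statistic (an illustration, not necessarily the exact code in the attached project) might look like:

      // Sketch of the log-likelihood ratio (Dunning's G2) for a bigram "A B".
      // k11 = count(A B), k12 = count(A without B), k21 = count(B without A),
      // k22 = count of bigrams containing neither A nor B.
      public final class LlrSketch {

        private static double xLogX(long x) {
          return x == 0 ? 0.0 : x * Math.log(x);
        }

        // entropy-style helper: xLogX(sum) - sum of xLogX(counts)
        private static double entropy(long... counts) {
          long sum = 0;
          double parts = 0.0;
          for (long c : counts) {
            parts += xLogX(c);
            sum += c;
          }
          return xLogX(sum) - parts;
        }

        public static double logLikelihoodRatio(long k11, long k12, long k21, long k22) {
          double rowEntropy = entropy(k11 + k12, k21 + k22);
          double columnEntropy = entropy(k11 + k21, k12 + k22);
          double matrixEntropy = entropy(k11, k12, k21, k22);
          return 2.0 * (rowEntropy + columnEntropy - matrixEntropy);
        }
      }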

      As discussed in:

      Current form is a tar of a Maven project that depends on Mahout. Build as usual with 'mvn clean install'; it can be executed using:

      mvn -e exec:java  -Dexec.mainClass="org.apache.mahout.colloc.CollocDriver" -Dexec.args="--input src/test/resources/article --colloc target/colloc --output target/output -w"
      

      Output will be placed in target/output and can be viewed nicely using:

      sort -rn -k1 target/output/part-00000
      

      Includes rudimentary unit tests. Please review and comment. Needs more work to get this into patch state and integrate with Robin's document vectorizer work in MAHOUT-237.

      Some basic TODO/FIXME's include:

      • use mahout math's ObjectInt map implementation when available
      • make the analyzer configurable
      • better input validation + negative unit tests.
      • more flexible ways to generate units of analysis ((n-1)-grams).
      Attachments

      1. mahout-colloc.tar.gz
        12 kB
        Drew Farris
      2. mahout-colloc.tar.gz
        14 kB
        Drew Farris
      3. MAHOUT-242.patch
        51 kB
        Drew Farris
      4. MAHOUT-242.patch
        52 kB
        Drew Farris
      5. MAHOUT-242.patch
        49 kB
        Drew Farris
      6. MAHOUT-242.patch
        52 kB
        Drew Farris
      7. MAHOUT-242.patch
        52 kB
        Drew Farris

        Activity

        Robin Anil added a comment -

        Committed and resolved.

        Drew Farris added a comment -

        Moved to utils based on discussion on the dev list. This can be committed as-is, or I can take a closer look at Robin's suggested refactoring (counters, sequence file output) and submit a revised patch later

        Jake Mannix added a comment -

        util? I don't think this belongs in util. I kind of like making an nlp package in core, and eventually we can split it out to its own maven submodule.

        Robin Anil added a comment -
        • In the map/reduce job, try incrementing a counter when you encounter a gram and another counter when you encounter an error, instead of throwing errors (see the sketch below)
        • Make the output a SequenceFile instead of TextOutputFormat?
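
        For illustration only (class and counter names below are made up, not the patch's actual mapper): in the old mapred API the counters Robin mentions are incremented through the Reporter, so malformed input can be counted and skipped rather than failing the job.

        import java.io.IOException;
        import java.util.Arrays;
        import org.apache.hadoop.io.LongWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapred.MapReduceBase;
        import org.apache.hadoop.mapred.Mapper;
        import org.apache.hadoop.mapred.OutputCollector;
        import org.apache.hadoop.mapred.Reporter;

        // Hypothetical sketch of the counter suggestion.
        public class CountingMapperSketch extends MapReduceBase
            implements Mapper<Text, Text, Text, LongWritable> {

          public enum Count { NGRAM_TOTAL, BAD_INPUT }

          @Override
          public void map(Text docId, Text doc,
                          OutputCollector<Text, LongWritable> output,
                          Reporter reporter) throws IOException {
            try {
              // stand-in for the real gram generation
              for (String gram : Arrays.asList(doc.toString().split("\\s+"))) {
                output.collect(new Text(gram), new LongWritable(1));
                reporter.incrCounter(Count.NGRAM_TOTAL, 1);
              }
            } catch (RuntimeException e) {
              reporter.incrCounter(Count.BAD_INPUT, 1); // count and move on instead of throwing
            }
          }
        }
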
        Robin Anil added a comment -

        I will just run this and start thinking about integrating this with DictionaryVectorizer. Move it to util and upload the new patch. I will commit first and then start refactoring.

        Drew Farris added a comment -

        No problem moving this to util, I wasn't sure where stuff like this is supposed to live.

        Robin Anil added a comment -

        Mind if I move this to util along with the rest of the text stuff?

        Drew Farris added a comment -

        Updated patch now includes a combiner for pass 1.
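
        For context, a combiner in the old mapred API is just a Reducer run on the map side; a minimal sketch of a pass-1 count-summing combiner (illustrative only, not the code in the patch) would be:

        import java.io.IOException;
        import java.util.Iterator;
        import org.apache.hadoop.io.LongWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapred.MapReduceBase;
        import org.apache.hadoop.mapred.OutputCollector;
        import org.apache.hadoop.mapred.Reducer;
        import org.apache.hadoop.mapred.Reporter;

        // Sums partial gram counts before the shuffle so less data hits the reducers.
        public class GramCountCombinerSketch extends MapReduceBase
            implements Reducer<Text, LongWritable, Text, LongWritable> {

          @Override
          public void reduce(Text gram, Iterator<LongWritable> counts,
                             OutputCollector<Text, LongWritable> output,
                             Reporter reporter) throws IOException {
            long sum = 0;
            while (counts.hasNext()) {
              sum += counts.next().get();
            }
            output.collect(gram, new LongWritable(sum));
          }
        }

        It would be wired up in the driver with conf.setCombinerClass(...).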

        Drew Farris added a comment -

        Updated patch, removed pom modifications checked in as a part of MAHOUT-215. Applies and builds cleanly against r905872.

        Drew Farris added a comment -

        Hey Drew, I'm not much of a maven guy - what's the maven-foo you use to get this running?

        Jake, my mistake, I should have included updated docs when I updated the patch.

        I eliminated the code to read plain text files from a directory, so you would need to begin by producing a SequenceFile<Text,Text> (document id, document) as input. Robin's utility in mahout-examples, o.a.m.text.SequenceFilesFromDirectory, can do this. Run the following from the 'examples' directory:

        mvn -e exec:java -Dexec.mainClass="org.apache.mahout.text.SequenceFilesFromDirectory" -Dexec.args="--parent (...input directory..) --outputDir (..output directory..) --charset UTF-8"
        

        Once you have the sequence file to use as input, run the following from the 'examples' directory as well.

        mvn -e exec:java  -Dexec.mainClass="org.apache.mahout.nlp.collocations.llr.CollocDriver" -Dexec.args="--input (..path-to-input..) --output (..path-to-output..) -w"
        

        Once the driver class is run, the collocations will be in (output-directory)/colloc/part-00000 as plaintext. They can be sorted by LLR score using the same sort command I included in the previous comment above (I have a question about this below).

        FWIW, I need to re-submit the patch to clean up some of the pom changes that were applied as part of MAHOUT-215.

        A couple of follow-up questions while you're looking at this:

        Currently I just dump the results of the second pass to a file, not sorted by LLR score. I could sort by LLR by sending it through another pass with an identity mapper and a reducer but I suspect that's probably pretty inefficient. Is there a better way to sort the output of the second pass by LLR?

        If I only wanted to emit the top 1-10% of the collocs (user configurable), how would I tell the reducer to stop emitting results at a certain point (or is yet another pass needed to achieve something like this?)

        Would it be better to emit a SequenceFile<LongWritable,Text> instead of a text file as the output from the final pass?
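
        On the sorting question above, one common approach (sketched here as an assumption, not as what the patch does) is to key a follow-on identity-mapper job by the LLR score and register a descending comparator, so the shuffle itself produces highest-score-first output:

        import org.apache.hadoop.io.DoubleWritable;
        import org.apache.hadoop.io.WritableComparable;
        import org.apache.hadoop.io.WritableComparator;

        // Hypothetical descending key comparator for a (score -> ngram) sorting pass;
        // registered via conf.setOutputKeyComparatorClass(...) on the extra job.
        public class DescendingDoubleComparator extends WritableComparator {

          public DescendingDoubleComparator() {
            super(DoubleWritable.class, true);
          }

          @Override
          @SuppressWarnings("rawtypes")
          public int compare(WritableComparable a, WritableComparable b) {
            return -((DoubleWritable) a).compareTo((DoubleWritable) b);
          }
        }

        A single reducer (or a total-order partitioner) would still be needed for one globally sorted file, so the extra pass mentioned above is not entirely avoided.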

        Jake Mannix added a comment -

        Hey Drew, I'm not much of a maven guy - what's the maven-foo you use to get this running?

        mvn clean install

        followed by

        mvn e exec:java -Dexec.mainClass="org.apache.mahout.colloc.CollocDriver" -Dexec.args="-input src/test/resources/article --colloc target/colloc --output target/output -w"

        ? Returns with some classpath sadness for me, probably because I haven't set it up correctly...

        Caused by: java.lang.ClassNotFoundException: org.apache.mahout.colloc.CollocDriver
        Ted Dunning added a comment -

        Each of the existing attributes are serializable which suggests that there might be a short path to Writable for these, I'll see if Avro offers any shortcuts here. A subclass of IndexWriter may indeed be a very quick route to this.

        Perhaps. But the fact that Avro data is self describing and highly portable is probably as important as most other considerations. There are also lots of concessions to Java that are in the Lucene document that are very different from the concessions to good serialization that we need.

        I also wonder if flipping this around and implementing a LuceneIndexInputFormat may be another option.

        Yes. Or at least a conversion system for spilling an index into SequenceFiles.

        Drew Farris added a comment -

        Ted,

        Thanks for the advice, I'll take a look at doing something in line with the structure you describe. It may be useful to re-use the classes introduced with Lucene's new TokenStream API, where essentially each Token is really a collection of different types of attributes. Each of the existing attributes are serializable which suggests that there might be a short path to Writable for these, I'll see if Avro offers any shortcuts here. A subclass of IndexWriter may indeed be a very quick route to this.

        I also wonder if flipping this around and implementing a LuceneIndexInputFormat may be another option.

        Ted Dunning added a comment -

        Drew,

        I think that what we really need is a serialized form for something a lot like a Lucene document, i.e. SequenceFile<key=docId, value=map<fieldname, list<token>>> where token is either a string or a string and a token number or a string and a token number and an offset (i.e. record(string, position, offset) with nullable fields). One really clever integration point would be to build an implementation of a Lucene IndexWriter that would emit these serialized documents.

        My guess is that Avro would be a useful tool in implementing this. It should be quite easy to emit simpler forms of this from any of our current code.
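
        Purely to make the shape concrete (plain Writables with made-up names, not Ted's Avro proposal): the token part of that value could be a small record with "nullable" position/offset fields, a list of which would hang off each field name.

        import java.io.DataInput;
        import java.io.DataOutput;
        import java.io.IOException;
        import org.apache.hadoop.io.Writable;

        // Hypothetical record(string, position, offset); -1 stands in for "not set".
        public class TokenWritable implements Writable {
          private String term = "";
          private int position = -1;
          private int offset = -1;

          public void set(String term, int position, int offset) {
            this.term = term;
            this.position = position;
            this.offset = offset;
          }

          @Override
          public void write(DataOutput out) throws IOException {
            out.writeUTF(term);
            out.writeInt(position);
            out.writeInt(offset);
          }

          @Override
          public void readFields(DataInput in) throws IOException {
            term = in.readUTF();
            position = in.readInt();
            offset = in.readInt();
          }
        }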

        Drew Farris added a comment -

        Decoupling the tokenization logic from the ngrammer is a great idea. I think Ted sort of alluded to this in a recent post on another subject as well. Here, the input would really be a stream of tokens. It makes a great deal of sense to do the analysis offline.

        I'll take a shot at providing a SequenceFile<Doc ID, Document> -> SequenceFile<DocId, TokenStream> utility and update this code to take input from there. I suspect it would also be handy to be able to easily string a bunch of these processing steps (Document files -> SequenceFile<DocId, Doc> -> SequenceFile<DocId, TokenStream>) together into a chain, but that's a larger issue. What does anyone else think?
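
        A rough sketch of such a utility's mapper, assuming the Lucene 2.9-era TokenStream/TermAttribute API and (for simplicity) a space-joined Text value standing in for a real token-stream type; names are hypothetical:

        import java.io.IOException;
        import java.io.StringReader;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapred.MapReduceBase;
        import org.apache.hadoop.mapred.Mapper;
        import org.apache.hadoop.mapred.OutputCollector;
        import org.apache.hadoop.mapred.Reporter;
        import org.apache.lucene.analysis.Analyzer;
        import org.apache.lucene.analysis.TokenStream;
        import org.apache.lucene.analysis.standard.StandardAnalyzer;
        import org.apache.lucene.analysis.tokenattributes.TermAttribute;
        import org.apache.lucene.util.Version;

        // SequenceFile<docId, document> -> SequenceFile<docId, tokenized text>:
        // run the analyzer once, offline, so downstream jobs consume tokens.
        public class TokenizeDocumentsMapperSketch extends MapReduceBase
            implements Mapper<Text, Text, Text, Text> {

          private final Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_CURRENT);

          @Override
          public void map(Text docId, Text doc,
                          OutputCollector<Text, Text> output,
                          Reporter reporter) throws IOException {
            TokenStream ts = analyzer.tokenStream("text", new StringReader(doc.toString()));
            TermAttribute termAtt = ts.addAttribute(TermAttribute.class);
            StringBuilder tokens = new StringBuilder();
            while (ts.incrementToken()) {
              tokens.append(termAtt.term()).append(' ');
            }
            output.collect(docId, new Text(tokens.toString().trim()));
          }
        }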

        Also, I wanted to ask: does it make more sense to emit a SequenceFile<LongWritable, Text> (or Gram) instead of a text file? There had been some recent traffic on the list about this wrt line endings and I suspect it may be more reasonable to provide the former as output (or both!?)

        Zaki Rahaman added a comment -

        Thanks Drew. Some overdue feedback from testing this baby out...

        My use case may differ from what's more typical, but basically I had 10MM+ semi-processed strings (lowercased, no punctuation, etc.) that I wanted to do my analysis on, not a large document or a series of smaller documents. As a result, I found using the Lucene Analyzers in the NGramCollector to be horribly inefficient (the job was taking hours to complete 1% on my modest 7-node cluster). Instead I used some custom logic to generate the n-grams and pass them to the collector. After this modification, the entire set of jobs ran in a little less than an hour, even on a larger dataset.
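
        For comparison only (illustrative, not Zaki's actual code): with input that is already lowercased and punctuation-free, n-gram generation can be a simple whitespace split plus a sliding window, which avoids standing up an analyzer per record:

        import java.util.ArrayList;
        import java.util.List;

        // Sliding-window n-grams over pre-cleaned, whitespace-delimited text.
        public final class SimpleNGrams {
          public static List<String> ngrams(String line, int n) {
            String[] tokens = line.split("\\s+");
            List<String> grams = new ArrayList<String>();
            for (int i = 0; i + n <= tokens.length; i++) {
              StringBuilder gram = new StringBuilder(tokens[i]);
              for (int j = 1; j < n; j++) {
                gram.append(' ').append(tokens[i + j]);
              }
              grams.add(gram.toString());
            }
            return grams;
          }
        }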

        I know Robin mentioned something about an Ngram generator in the Bayes classifier, I should check it out.

        I guess I would advocate splitting the module as it exists into a couple of different pieces:

        1. A general purpose tool for tokenizing/analyzing a document/documents (or in my case, 'documents' consisting of strings). I know there are some existing tools for vectorizing text, and indices, so maybe this is something similar, but the basic idea would be to go from document --> analyzed text

        2+3. (These can remain as-is in the module or be split.) N-gram generator and counter, and LLR pieces.

        Maybe this is all crazy talk, but it would seem to me refactoring/extracting the pieces out like this would prove useful and allow for code reuse in other modules (other NLP-type modules, elsewhere in the project).

        Drew Farris added a comment -

        Thanks for the review Isabel, here's an updated patch.

        CollocMapper, Line 66: If I read your implementation correctly, this means that documents are always read fully into memory, right?

        Yes, you are correct. In addition to what Ted and Jake have already said, reading parts of documents is a little tricky because there would have to be some overlap to ensure that the collocations around the split were picked up properly. Perhaps there will be an opportunity to do something like this once the collocs are windowed based on sentence boundaries?

        Gram, Line 192: You can omit the "else" clauses, in case the "if" already returns its result to the caller, however this is a question of style.

        I prefer the form you describe, easier to read.

        I was wondering, why in line 177 you did not write "this.position != other.position"?

        I'm not certain about what you're referring to here, before edits, line 177 is in the middle of the write() method.

        NGramCollector, Line 47 (and a few others): Shouldn't we avoid using deprecated apis instead of suppressing deprecation warnings?

        I've removed the SuppressWarnings("deprecated"). The only reason they are there is that I've used the hadoop 0.19 api here and I tend to get distracted by the deprecation warnings.

        This brings up a question of policy, however. Should I be using the newest hadoop api for new work in Mahout such as this, or should we continue to make efforts to retain backwards compatibility with older hadoop installations? I assumed the policy was the latter, but relish the opportunity to learn the new api – I just didn't think Mahout was ready to commit to the new api yet.

        LLRReducer, Line 143: How about making the method package private if it should be used in unit tests only anyway?
        Line 106: Would be nice to have an additional counter for the skipped grams.

        Done and done; great idea to use the counter to track skipped grams.

        Isabel Drost-Fromm added a comment -

        I am not worried about them at this point.

        Also not very worried - probably should have indicated that basically everything I found could be filed as "trivial, minor or style question only"...

        Ted Dunning added a comment -

        Are we really worried about individual documents which are bigger than hundreds of MB?

        I am not worried about them at this point. Even in the most skewed distributions, this would be very, very rare. How many people on LinkedIn have 100 million friends?

        The quadratic cost of unwindowed collocation detection would make this infeasible first. For windowed collocation, it isn't that hard to scan the input, but reading the whole shebang isn't such a bad thing.

        jakes old (non-committer) account added a comment -

        I can try and take a look at this later today / tonight.

        regarding

        CollocMapper, Line 66: If I read your implementation correctly, this means that documents are always read fully into memory, right? So we would assume to only run the ngramCollector over documents that fit into main memory, and would be unable to process larger documents. I am wondering whether this is an issue at all, and if so, whether there is any way around that.

        Are we really worried about individual documents which are bigger than hundreds of MB?

        Isabel Drost-Fromm added a comment -

        First of all, thanks for the patch. The code looks good so far, patch applies cleanly and builds w/o problems. Some initial comments and questions I had when reading it:

        CollocMapper, Line 66: If I read your implementation correctly, this means that documents are always read fully into memory, right? So we would assume to only run the ngramCollector over documents that fit into main memory, and would be unable to process larger documents. I am wondering whether this is an issue at all, and if so, whether there is any way around that.

        Gram, Line 192: You can omit the "else" clauses, in case the "if" already returns its result to the caller, however this is a question of style. I was wondering, why in line 177 you did not write "this.position != other.position"?

        NGramCollector, Line 47 (and a few others): Shouldn't we avoid using deprecated apis instead of suppressing deprecation warnings?

        LLRReducer, Line 143: How about making the method package private if it should be used in unit tests only anyway?
        Line 106: Would be nice to have an additional counter for the skipped grams.

        I agree with you that things like sentence boundary detection and more sophisticated tokenization should be left as work for an additional issue.

        Jake, it would be great if you could have a closer look to verify that this is about the pipeline you had in mind in the referenced e-mail threads and mention anything that might still be missing.

        Drew Farris added a comment -

        Log-likelihood collocation identifier in patch form. This puts itself in o.a.m.nlp.collocations.llr

        I think there are some improvements that can be made, but if possible it would be nice to review and commit this version and add on to it later through additional patches. More specifically, I'd like to see this:

        • include the ability to avoid forming collocations around sentence boundaries and other boundaries, per: http://www.lucidimagination.com/search/document/d259def498803ffe/collocation_clarification#29fbb050cf5fa64
        • work for non-whitespace delimited languages, e.g. anything an analyzer can produce tokens for

        I removed the ability to read in files from a directory; Robin's document -> sequence file work fits into this well.

        Drew Farris added a comment -

        Thanks for taking a look and providing some great feedback Robin.

        Here's a new version that includes the following changes:

        • Now runs from a SequenceFile<Text,Text> of ID -> DOC by default. Tested on some medium-sized collections of 10k and 100k files using Robin's directory to sequence file util.
        • Analyzer is now configurable from the command-line via the --analyzerName option
        • Using a Writable implementation instead of strings to move data around. No more parsing, splitting, concatenating.
        • Improved the handling of the output directory: output from each pass is written to a subdirectory of this directory, so there is no need to specify multiple output directories any longer.

        After 'mvn clean install' a sample can be run like so:

        mvn -e exec:java  -Dexec.mainClass="org.apache.mahout.colloc.CollocDriver" -Dexec.args="--input src/test/resources/article --output target/output -w -t"
        

        I'd like to get this into patch form as a next step + get all of the license headers on the code here, but I'm not sure where it should live in terms of project/package names, etc. Any thoughts?

        Also, I'm looking for feedback on the algorithm implementation – this version differs from the one I presented on the list in that the implementation tracks the part of the original n-gram that the sub-part appears in (head, tail). I'm not 100% sure this is necessary or even correct.

        Also, it's a bummer to have to create an analyzer subclass just to provide an implementation with a no-argument constructor. Has anyone considered making use of a DI framework with Mahout? I know Grant has mentioned options such as Spring or Guice. Anyone have any strong objections to pulling one of those in as a dependency?
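
        For readers following along, the subclassing annoyance looks roughly like this (hypothetical class name; the wrapper exists only so the analyzer can be named on the command line and created reflectively through a no-argument constructor):

        import java.io.Reader;
        import org.apache.lucene.analysis.Analyzer;
        import org.apache.lucene.analysis.TokenStream;
        import org.apache.lucene.analysis.standard.StandardAnalyzer;
        import org.apache.lucene.util.Version;

        // Thin no-arg wrapper around an analyzer that itself needs constructor arguments.
        public class DefaultAnalyzerSketch extends Analyzer {
          private final Analyzer delegate = new StandardAnalyzer(Version.LUCENE_CURRENT);

          @Override
          public TokenStream tokenStream(String fieldName, Reader reader) {
            return delegate.tokenStream(fieldName, reader);
          }
        }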

        Robin Anil added a comment -
        • Try to stick to SequenceFile<Text,Text> (docid => content) for input and leave it to the user to generate this file. There is a SequenceFilesFromDirectory class in examples in my patch which does this conversion and writes SequenceFiles on HDFS directly.
        • Also take a look at the TermCountMapper, where I have parameterized the Lucene Analyzer through the conf (see the sketch below).
        • If you need to pass a tuple as input or output, check out the StringTupleWritable class instead of appending stuff to Text or splitting it.
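
        A minimal sketch of the conf-parameterized analyzer Robin describes (the property key and helper below are made up; the real TermCountMapper may differ):

        import org.apache.hadoop.mapred.JobConf;
        import org.apache.lucene.analysis.Analyzer;

        // The driver stores an analyzer class name in the JobConf (e.g. from an
        // --analyzerName option) and each task instantiates it reflectively.
        public final class AnalyzerFromConf {
          public static final String ANALYZER_CLASS_KEY = "analyzer.class"; // hypothetical key

          public static Analyzer create(JobConf conf) {
            String className = conf.get(ANALYZER_CLASS_KEY);
            try {
              return (Analyzer) Class.forName(className).newInstance();
            } catch (Exception e) {
              throw new IllegalStateException("Could not instantiate analyzer " + className, e);
            }
          }
        }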

          People

          • Assignee: Drew Farris
          • Reporter: Drew Farris
          • Votes: 0
          • Watchers: 0

            Dates

            • Created:
              Updated:
              Resolved:
