Mahout

Prepare document vectors from the text

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.2
    • Fix Version/s: 0.2
    • Component/s: None
    • Labels:
      None

      Description

      Clustering algorithms presently take document vectors as input. Generating these document vectors from the text can be broken into two tasks.

      1. Create a Lucene index of the input plain-text documents
      2. From the index, generate the (sparse) document vectors with weights as TF-IDF values of the terms. With a Lucene index, this value can be calculated easily.

      Presently, I have created two separate utilities, which could possibly be invoked from another class.
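      As a sketch of step 2, the per-term weight is the product of a term-frequency factor and an inverse-document-frequency factor. The formulas below follow Lucene's DefaultSimilarity (square-root tf, smoothed log idf); the class and method names are illustrative, not part of the attached patches.

```java
// Sketch of the per-term weight for a sparse document vector.
// tf and idf follow Lucene's DefaultSimilarity: tf = sqrt(freq),
// idf = 1 + log(numDocs / (docFreq + 1)). Names are illustrative only.
public class TfIdf {
    public static double weight(int termFreq, int docFreq, int numDocs) {
        double tf = Math.sqrt(termFreq);
        double idf = 1.0 + Math.log((double) numDocs / (docFreq + 1));
        return tf * idf;
    }

    public static void main(String[] args) {
        // A term occurring 4 times in one doc, present in 9 of 1000 docs.
        System.out.println(weight(4, 9, 1000));
    }
}
```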

      1. MAHOUT-126-TF.patch
        4 kB
        David Hall
      2. MAHOUT-126-null-entry.patch
        0.8 kB
        David Hall
      3. MAHOUT-126-no-normalization.patch
        2 kB
        David Hall
      4. MAHOUT-126-no-normalization.patch
        1 kB
        David Hall
      5. mahout-126-benson.patch
        11 kB
        Benson Margulies
      6. MAHOUT-126.patch
        10 kB
        Shashikant Kore
      7. MAHOUT-126.patch
        41 kB
        Grant Ingersoll
      8. MAHOUT-126.patch
        50 kB
        Grant Ingersoll
      9. MAHOUT-126.patch
        41 kB
        Grant Ingersoll

        Issue Links

          Activity

          David Hall added a comment -

          I actually need something like this as well for LDA, except that I would prefer to be able to have the vectors not TF-IDF weighted. Could I get you to add some way of configuring that?

          Shashikant Kore added a comment -

          Patch to create index and document vectors from text.

          Shashikant Kore added a comment -

          David,

          Sorry, I don't have any background in LDA. Please take a look at the patch and suggest what changes are required in the DocumentVector.getDocumentVector() method. I will do the rest of the configuration changes.

          David Hall added a comment -

          Sure, I just want to be able to have:

          double weight = similarity.tf(termFreq) * similarity.idf(docFreq, numDocs);

          be this instead:

          double weight = termFreq

          based on some configuration or another. (Maybe if I can just pass in a custom "Similarity" object? Or there could be a protected method "createSimilarity" that I could override?)

          Basically, LDA wants raw counts (or at least, some kind of integers).

          Thanks!
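          The configurability David asks for can be sketched as a small weight interface with TF-IDF and raw-count implementations. This roughly mirrors the Weight object and --weight option the thread converges on later, but the names below are illustrative, not from any attached patch.

```java
// Hypothetical pluggable weighting: TF-IDF by default, raw counts for LDA.
interface TermWeight {
    double weight(int termFreq, int docFreq, int numDocs);
}

class TfIdfWeight implements TermWeight {
    public double weight(int termFreq, int docFreq, int numDocs) {
        // DefaultSimilarity-style tf and idf factors.
        return Math.sqrt(termFreq) * (1.0 + Math.log((double) numDocs / (docFreq + 1)));
    }
}

class TfWeight implements TermWeight {
    public double weight(int termFreq, int docFreq, int numDocs) {
        return termFreq; // raw counts, as LDA wants
    }
}

public class WeightDemo {
    public static void main(String[] args) {
        TermWeight w = new TfWeight();
        System.out.println(w.weight(7, 9, 1000)); // 7.0
    }
}
```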

          Grant Ingersoll added a comment -

          Passing in a way to make a custom weight object makes sense.

          Grant Ingersoll added a comment -

          See SOLR-1193.

          Benson Margulies added a comment -

          This patch needs to explicitly manage the character set of the files it is reading. It uses FileReader without specifying a character set.

          Benson Margulies added a comment -

          Improved patch. Allows specification of the file character set. Applies cleanly inside Eclipse.

          Grant Ingersoll added a comment -

          So just kind of brainstorming here, but I think we should create a separate module for this kind of stuff, to keep it out of core and give us some more flexibility in regards to dependencies, etc.

          Also (and I realize this is just a starting patch), I think we should assume a Lucene index exists already instead of maintaining code to actually create an index. There are a lot of ways to do that and people will likely have different fields, etc. For instance, Solr can provide all of the capabilities here and it has distributed support, so it can scale. Moreover, people may have the info in a DB or in other places. I realize we need baby steps, but...

          I'll try to post a patch this afternoon that takes this effort and melds it with some of my ideas for demo purposes.

          Grant Ingersoll added a comment -

          Shashikant,

          Couple of comments on the Lucene specific stuff, though, so that you guys can speed up what you have.

          First off, have a look at Lucene's support for TermVectorMapper. Much like SAX, it gives you a callback mechanism so that you don't have to construct two different data structures (i.e. many people incorrectly use the DOM to parse XML and then extract out of the DOM into their own data structure when they should use SAX instead).

          You might have a look at the TermVectorComponent in Solr, as it pretty much does what you are looking to do in this patch and I believe it to be more efficient.

          Seems like we should be able to avoid caching the whole term list in memory. At a minimum, if you are going to, allTerms should be a Map<String, Integer> that stores the term and its DF (doc freq.), as you are currently doing the DF lookup twice, AFAICT. DF lookup is expensive in Lucene. If you don't cache the whole list, we should at least have an LRU cache for DF.
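          The LRU cache Grant suggests for DF lookups can be had almost for free from java.util.LinkedHashMap in access order. The class below is a generic sketch of that idea, not code from any attached patch.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Minimal LRU cache: a LinkedHashMap in access order evicts the
// least-recently-used entry once capacity is exceeded.
class LruCache<K, V> extends LinkedHashMap<K, V> {
    private final int capacity;

    LruCache(int capacity) {
        super(16, 0.75f, true); // accessOrder = true
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > capacity;
    }
}

public class DfCacheDemo {
    public static void main(String[] args) {
        LruCache<String, Integer> dfCache = new LruCache<>(2);
        dfCache.put("apache", 120);
        dfCache.put("mahout", 45);
        dfCache.get("apache");      // touch "apache" so "mahout" is eldest
        dfCache.put("lucene", 300); // evicts "mahout"
        System.out.println(dfCache.keySet());
    }
}
```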

          Grant Ingersoll added a comment -

          Seems like we should be able to avoid caching the whole term list in memory. At a minimum, if you are going to, allTerms should be a Map<String, Integer> that stores the term and it's DF (doc freq.), as you are currently doing the DF lookup twice, AFAICT. DF lookup is expensive in Lucene. If you don't cache the whole list, we should at least have an LRU cache for DF.

          Never mind, I see why the list is cached. Still, makes sense to cache the DF.

          Grant Ingersoll added a comment -

          Here's a first attempt at my thoughts based on the two previous patches, plus some other ideas.

          The main gist of the idea centers around the VectorIterable interface and is driven by the o.a.mahout.utils.vectors.Driver class.

          Note, I dropped the Lucene indexing part, as I don't think we need to be in the game of creating Lucene indexes. That is a well-known and well-documented process that is available elsewhere. In fact, for this particular piece, I indexed Wikipedia in Solr and then pointed the Driver class at the Lucene index.

          See http://cwiki.apache.org/confluence/display/MAHOUT/Creating+Vectors+from+Text for details on using.

          Grant Ingersoll added a comment -

          Note, I haven't actually tried clustering just yet with the output!

          Shashikant Kore added a comment -

          Grant,

          I went through the patch. Compilation failed with the following error:
          "Driver.java:[111,26] FSDirectory(java.io.File,org.apache.lucene.store.LockFactory) has protected access in org.apache.lucene.store.FSDirectory"

          So, I haven't really run the code.

          Overall, the code looks good. Now I understand TermVectorMapper.

          Should VectorMapper be taken as an option? David had commented that he wants vectors with DF as weights. He could add, say DFMapper, to get desired weights.

          I think document labelling (MAHOUT-65) also needs to go in soon because it will require changes to this code. Mostly those changes will reflect in LuceneIteratable.

          Grant Ingersoll added a comment -

          Yeah, still needs the labeling stuff.

          As for weights, you should be able to pass in a Weight object. See the TFIDF implementation. Likely still needs some work.

          As for the Lucene error, I thought I had updated the Lucene version to be 2.9-dev, which I believe makes this all right.

          Grant Ingersoll added a comment -

          We really need named Vectors to make this fly.

          Grant Ingersoll added a comment -

          Here's a version that is brought up to trunk and adds in MAHOUT-65-name.patch to allow for labeling the vectors.

          Next, I'm going to run the output through some clustering

          Grant Ingersoll added a comment -

          Updated patch since MAHOUT-65-name.patch was committed.

          Grant Ingersoll added a comment -

          Committed revision 785618.

          David Hall added a comment -

          LuceneIteratable (is that an intentional pun?) has behavior that isn't documented well. Namely, if the normless constructor is called, the norm defaults to 2.

          This has the consequence that not passing in a norm to Driver L2 normalizes the vectors. You have to specify a negative double != -1.0 to get unnormalized counts. Relatedly, -1 maps to the L2 norm. This is odd behavior to me, or it should at least be documented. (The wiki article implies there's a difference between using --norm 2 and using no norm at all.)

          Also, I'd like an option to tell Driver what weight object to use. I can do the patch for this.

          Thanks!
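          The normalization behavior David describes can be sketched as a p-norm normalizer with an explicit no-normalization sentinel, roughly the design the thread settles on after the fix. The class, sentinel value, and method names below are illustrative, not the actual LuceneIteratable code.

```java
// Sketch of p-norm normalization with a no-normalization sentinel,
// illustrating the behavior discussed above. Names are hypothetical.
public class NormDemo {
    public static final double NO_NORMALIZATION = -1.0; // illustrative sentinel

    public static double[] normalize(double[] v, double power) {
        if (power == NO_NORMALIZATION) {
            return v; // leave raw counts untouched
        }
        double sum = 0;
        for (double x : v) {
            sum += Math.pow(Math.abs(x), power);
        }
        double norm = Math.pow(sum, 1.0 / power);
        double[] out = new double[v.length];
        for (int i = 0; i < v.length; i++) {
            out[i] = v[i] / norm;
        }
        return out;
    }

    public static void main(String[] args) {
        double[] raw = {3, 4};
        System.out.println(normalize(raw, 2)[0]);                // 0.6 (L2 norm is 5)
        System.out.println(normalize(raw, NO_NORMALIZATION)[0]); // 3.0 (raw)
    }
}
```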

          Grant Ingersoll added a comment -

          Agreed about the weirdness on the default norms. Yeah, patch would be great.

          David Hall added a comment -

          Ok, here's the patch for normalization. The other one is forthcoming.

          Also, I'm getting null pointers out of VectorMapper after building an index using Lucene's demo indexer. I'm going to follow the solr instructions and see if I have better luck.

          Grant Ingersoll added a comment -

          Also, I'm getting null pointers out of VectorMapper after building an index using Lucene's demo indexer. I'm going to follow the solr instructions and see if I have better luck.

          This stuff requires Term Vectors to be enabled in the Lucene index.

          Grant Ingersoll added a comment -

          Also, I don't use git. Is there a way to produce a patch that is consumable by the patch utility, or can you provide the options needed to apply it?

          In SVN, I do:

          svn diff > ../mypatch.patch
          

          and then apply as:
          patch -p 0 -i ../mypatch.patch

          David Hall added a comment -

          My bad. git-format-patch formats an email that has a patch (sigh) and not a patch itself.

          Run the command you pasted on the new patch.

          David Hall added a comment -

          This patch contains an implementation of a TF weight, and it adds the --weight option to Driver to support its use. Default is TFIDF. An error is thrown on input besides TFIDF or TF.

          Grant Ingersoll added a comment -

          I committed the no-norm change with some slight mods: it is not enough to check only for NO_NORMALIZATION, since any value < 0 is invalid.

          David Hall added a comment -

          Ok, I'm probably misunderstanding something, or there could be a bug. I modified Lucene's demo indexer to store a term vector. It's still crashing. I added a series of printlns before TermVector.java:65 and CachedTermInfo:71, and I end up with the assertion here failing:

          @Override
          public TermEntry getTermEntry(String field, String term) {
            if (this.field.equals(field) == false) {
              return null;
            }
            TermEntry ret = termEntries.get(term);
            assert(ret != null); // This assertion is firing.
            return ret;
          }

          In my dataset, this happens after several hundred iterations. The term is a stop-word for the corpus in question, and it looks like there's an attempt at stopwording earlier in the file. Maybe these are not interacting well?

          – David

          David Hall added a comment -

          I'm going to assume that's the problem. The attached patch just skips over any null term vectors. It seems like reasonable behavior here, given the filtering.

          Grant Ingersoll added a comment -

          Hey David,

          I'm not sure what's going on here, because that value being null means the term is not in the index, yet it is in the term vector for that doc. Are you sure you're loading the same field? Can you share the indexing code?

          This fix works, though, but I'd like to know at a deeper level what's going on.

          David Hall added a comment -

          That's not the only time. This constructor clearly lets certain things slip through.

            public CachedTermInfo(IndexReader reader, String field, int minDf, int maxDfPercent) throws IOException {
              this.field = field;
              TermEnum te = reader.terms(new Term(field, ""));
              int count = 0;
              int numDocs = reader.numDocs();
              double percent = numDocs * maxDfPercent / 100.0;
              //Should we use a linked hash map so that we no terms are in order?
              termEntries = new LinkedHashMap<String, TermEntry>();
              do {
                Term term = te.term();
                if (term == null || term.field().equals(field) == false){
                  break;
                }
                int df = te.docFreq();
                if (df < minDf || df > percent){
                  continue;
                }
                TermEntry entry = new TermEntry(term.text(), count++, df);
                termEntries.put(entry.term, entry);
              } while (te.next());
              te.close();
            }

          My code is essentially Lucene's demo indexing code (IndexFiles.java and FileDocument.java: http://google.com/codesearch/p?hl=en&sa=N&cd=1&ct=rc#uGhWbO8eR20/trunk/src/demo/org/apache/lucene/demo/FileDocument.java&q=org.apache.lucene.demo.IndexFiles), except that I replaced

          doc.add(new Field("contents", new FileReader(f)));

          with

             doc.add(new Field("contents", new FileReader(f),Field.TermVector.YES));

          I then ran

           java -cp <classpath> org.apache.lucene.demo.IndexFiles /Users/dlwh/txt-reuters/ 

          and then

           java -cp <classpath> org.apache.mahout.utils.vectors.Driver --dir /Users/dlwh/src/lucene/index/ --output ~/src/vec-reuters -f contents -t /Users/dlwh/dict --weight TF 

          For what it's worth, it returns null on "reuters", which is not usually a stop word, except that every single document ends with it, and so the DF filtering above is catching it.
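          David's diagnosis can be reproduced in miniature: a term that appears in every document exceeds any maxDfPercent cap below 100, is never cached, and so a later getTermEntry-style lookup returns null. A self-contained sketch of that filtering logic (hypothetical names and data, no Lucene dependency):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Miniature of the CachedTermInfo filtering: terms whose doc frequency
// falls outside [minDf, numDocs * maxDfPercent / 100] are never cached,
// so a later lookup for them yields null.
public class DfFilterDemo {
    public static void main(String[] args) {
        int numDocs = 1000;
        int minDf = 2;
        int maxDfPercent = 99;
        double maxDf = numDocs * maxDfPercent / 100.0;

        Map<String, Integer> termDf = new LinkedHashMap<>();
        termDf.put("cocoa", 120);
        termDf.put("reuters", 1000); // appears in every document

        Map<String, Integer> cached = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> e : termDf.entrySet()) {
            int df = e.getValue();
            if (df < minDf || df > maxDf) {
                continue; // filtered out, exactly like "reuters" above
            }
            cached.put(e.getKey(), df);
        }
        System.out.println(cached.containsKey("reuters")); // false: lookup would be null
    }
}
```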

          Grant Ingersoll added a comment -

          Yep, you are right. I committed your patch anyway. We should probably add command-line support for setting minDf and maxDf.

          Grant Ingersoll added a comment -

          I think this is in pretty good shape for now; we can open new issues to deal with specific problems.


            People

            • Assignee:
              Grant Ingersoll
              Reporter:
              Shashikant Kore
            • Votes:
              0
              Watchers:
              1

              Dates

              • Created:
                Updated:
                Resolved:

                Development