Lucene - Core

ShingleMatrixFilter, a three dimensional permutating shingle filter

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 2.3.2
    • Fix Version/s: 2.4
    • Component/s: modules/analysis
    • Labels:
      None
    • Lucene Fields:
      Patch Available

      Description

      Backed by a column-focused matrix that creates all permutations of shingle tokens in three dimensions; that is, it handles multi-token synonyms.

      It could, for instance, be used in some cases to replace 0-slop phrase queries with something speedier.

      Token[][][]{
        {{hello}, {greetings, and, salutations}},
        {{world}, {earth}, {tellus}}
      }
      

      passes the following test with 2-3 grams:

      assertNext(ts, "hello_world");
      assertNext(ts, "greetings_and");
      assertNext(ts, "greetings_and_salutations");
      assertNext(ts, "and_salutations");
      assertNext(ts, "and_salutations_world");
      assertNext(ts, "salutations_world");
      assertNext(ts, "hello_earth");
      assertNext(ts, "and_salutations_earth");
      assertNext(ts, "salutations_earth");
      assertNext(ts, "hello_tellus");
      assertNext(ts, "and_salutations_tellus");
      assertNext(ts, "salutations_tellus");
      

      Contains tests of varying complexity that demonstrate offsets, position increments, payload boost calculation, and construction of a matrix from a token stream.

      The matrix attempts to hog as little memory as possible by seeking no more than maximumShingleSize columns forward in the stream and clearing up unused resources (columns and unique token sets). Can still be optimized quite a bit though.
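For reference, the twelve shingles asserted above can be reproduced by a naive sketch that enumerates every token path through the matrix (one row per column) and emits all unique 2-3 grams. This only illustrates the output, not the patch's algorithm — the actual filter streams shingles column by column without materializing paths, and the class below is hypothetical:

```java
import java.util.*;

public class MatrixShingleSketch {
    // Emit all unique min..max-gram shingles over every token path through the matrix.
    static Set<String> shingles(List<List<List<String>>> matrix, int min, int max) {
        Set<String> out = new LinkedHashSet<>();
        for (List<String> path : paths(matrix, 0)) {
            for (int start = 0; start < path.size(); start++) {
                for (int len = min; len <= max && start + len <= path.size(); len++) {
                    out.add(String.join("_", path.subList(start, start + len)));
                }
            }
        }
        return out;
    }

    // Cartesian product: pick one row per column, flattening multi-token rows.
    static List<List<String>> paths(List<List<List<String>>> matrix, int col) {
        List<List<String>> result = new ArrayList<>();
        if (col == matrix.size()) {
            result.add(new ArrayList<>());
            return result;
        }
        for (List<String> row : matrix.get(col)) {
            for (List<String> rest : paths(matrix, col + 1)) {
                List<String> path = new ArrayList<>(row);
                path.addAll(rest);
                result.add(path);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<List<List<String>>> matrix = List.of(
            List.of(List.of("hello"), List.of("greetings", "and", "salutations")),
            List.of(List.of("world"), List.of("earth"), List.of("tellus")));
        Set<String> s = shingles(matrix, 2, 3);
        System.out.println(s.size() + " shingles: " + s);
    }
}
```

Running this yields exactly the 12 shingles listed in the test, since duplicates such as greetings_and (shared by the earth/tellus paths) collapse in the set.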

      1. LUCENE-1320.txt
        50 kB
        Karl Wettin
      2. LUCENE-1320.txt
        59 kB
        Karl Wettin
      3. LUCENE-1320.txt
        70 kB
        Karl Wettin
      4. LUCENE-1320.patch
        22 kB
        Grant Ingersoll

        Activity

        Karl Wettin added a comment -

        documentation will have to come later... until then see the test cases

        Karl Wettin added a comment -

        This works pretty well, I'll commit it soon.

        • javadocs
        • improved default shingle token weights (still not that great)

        Also did some optimization and refactoring, which resulted in nicer-looking code in the tests and in:

        • PrefixAwareTokenFilter
        • PrefixAndSuffixAwareTokenFilter
        • SingleTokenTokenStream
        /**
         * Joins two token streams and leaves the last token of the prefix stream available
         * to be used when updating the token values in the second stream based on that token.
         */
        public class PrefixAwareTokenFilter extends TokenStream {
          /** The default implementation adds last prefix token end offset to the suffix token start and end offsets. */
          public Token updateSuffixToken(Token suffixToken, Token lastPrefixToken) {
        
        
        /** Links two PrefixAwareTokenFilter. */
        public class PrefixAndSuffixAwareTokenFilter extends TokenStream {
          public Token updateInputToken(Token inputToken, Token lastPrefixToken) {
          public Token updateSuffixToken(Token suffixToken, Token lastInputToken) {
        
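The javadoc above describes the default offset handling in updateSuffixToken; a minimal stand-alone sketch of that adjustment (using a stub Token class, not Lucene's org.apache.lucene.analysis.Token) might look like:

```java
// Standalone sketch of the offset adjustment described in the javadoc above.
// "Token" here is a minimal stand-in, not Lucene's Token class.
public class OffsetShiftSketch {
    static final class Token {
        final String term;
        int start, end;
        Token(String term, int start, int end) { this.term = term; this.start = start; this.end = end; }
    }

    // Default behavior: shift the suffix token's offsets by the
    // end offset of the last token in the prefix stream.
    static Token updateSuffixToken(Token suffix, Token lastPrefix) {
        suffix.start += lastPrefix.end;
        suffix.end += lastPrefix.end;
        return suffix;
    }

    public static void main(String[] args) {
        Token lastPrefix = new Token("^", 0, 1);  // prefix stream ends at offset 1
        Token suffix = new Token("hello", 0, 5);  // offsets relative to the suffix stream
        updateSuffixToken(suffix, lastPrefix);
        System.out.println(suffix.term + " [" + suffix.start + "," + suffix.end + "]"); // hello [1,6]
    }
}
```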
        Steve Rowe added a comment -

        Hi Karl,

        The classes you introduce here look interesting, but the documentation is very sparse.

        Things I think should be addressed in the documentation:

        • Where would you see this stuff being used - on the query side or the indexing side? Or both?
        • Where would matrix come from in a real-world scenario? It looks like there are (at least) three mechanisms for constructing the matrix - which one(s) make sense where?
        • What do payloads have to do with the whole thing? (Looks like weight? ShingleMatrixFilter.calculateShingleWeight() should be explained at the class level - since it's public, I assume you mean for it to be overridable?)
        • The various ShingleMatrixFilter constructors should have javadoc explaining their use.
        • This class's use of the new flags feature looks interesting - a discussion in the documentation would be useful for future implementations.

        A couple of random notes:

        • Missing Apache license declarations: PrefixAndSuffixAwareTokenFilter.java and TestPrefixAndSuffixAwareTokenFilter.java
        • Since you only use SingleTokenTokenStream in your tests, and since it likely will only ever be used in tests, I think it should be moved from src/java/ to src/test/.
        • TestShingleMatrixFilter.TokenListStream looks generally useful for testing filters - maybe this could be pulled out as a separate class, maybe into the o.a.l.analysis.miscellaneous package?
        • On line #83 of TestShingleMatrixFilter, it looks like the first assignment to ts could be removed:
        83:   ts = tls;
        84:   ts = new ShingleMatrixFilter(ts, 2, 2, null);
        
        Karl Wettin added a comment - - edited

        Thanks for the comments Steve! I'll pop a more documented patch in soon. Here are my replies:

        Where would you see this stuff being used - on the query side or the indexing side? Or both?

        Historically I've used shingles at both ends to replace phrase queries and to fix word de/composition problems. This implementation was, however, written to tokenize the 20 Newsgroups data for the clustering example in Mahout.

        Where would matrix come from in a real-world scenario? It looks like there are (at least) three mechanisms for constructing the matrix - which one(s) make sense where?

        From a token stream. You need to implement a TokenSettingsCodec to tell the shingle filter how to position an input token in the matrix: in a new column, a new row or in the same row. It is also used to define how to get and set a weight float of a token.

          /**
           * Using this codec makes a {@link ShingleMatrixFilter} act like a {@link ShingleFilter}.
           * It produces the simplest sort of shingles, ignoring token position increments, etc.
           * 
           * It adds each token as a new column.
           */
          public static class OneDimensionalNonWeightedTokenSettingsCodec extends TokenSettingsCodec {
        
          /**
           * A codec that creates a two dimensional matrix 
           * by treating tokens from the input stream with 0 position increment 
           * as new rows to the current column.
           */
          public static class TwoDimensionalNonWeightedSynonymTokenSettingsCodec extends TokenSettingsCodec {
        
          /**
           * A full-featured codec, not to be used for anything serious.
           *
           * It takes complete control of 
           * payload for weight
           * and the bit flags for positioning in the matrix.
           * 
           * Mainly exists for demonstration purposes.
           */
          public static class SimpleThreeDimensionalTokenSettingsCodec extends TokenSettingsCodec {
        

        What do payloads have to do with the whole thing? (Looks like weight?
        ShingleMatrixFilter.calculateShingleWeight() should be explained at the class level - since it's public, I assume you mean for it to be overridable?)

        Yeah, it's weights. They can be used either at query time or index time, or both for that matter. You could, for instance, want to produce a matrix with all sorts of weighted data in synonym space: stems, stems without diacritics, source tokens without diacritics, etc. Then you'd expect to see the weight difference in the shingles too.

        Weights are turned off by always returning 1f from getWeight and ignoring calls to setWeight in your TokenSettingsCodec.
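A sketch of such a weight-disabling codec, written against stand-in types since the real TokenSettingsCodec is an inner class of ShingleMatrixFilter and its exact signatures aren't shown here:

```java
// Sketch of disabling weights in a TokenSettingsCodec, using stand-in types
// rather than Lucene's real API (signatures simplified for illustration).
public class UnweightedCodecSketch {
    static class Token { float weight = 1f; }

    abstract static class TokenSettingsCodec {
        abstract float getWeight(Token token);
        abstract void setWeight(Token token, float weight);
    }

    // Always report weight 1f and ignore writes: every shingle comes out
    // with the same neutral weight, effectively turning weighting off.
    static class UnweightedCodec extends TokenSettingsCodec {
        float getWeight(Token token) { return 1f; }
        void setWeight(Token token, float weight) { /* intentionally ignored */ }
    }

    public static void main(String[] args) {
        TokenSettingsCodec codec = new UnweightedCodec();
        Token t = new Token();
        codec.setWeight(t, 42f);            // ignored
        System.out.println(codec.getWeight(t)); // 1.0
    }
}
```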

        Since you only use SingleTokenTokenStream in your tests, and since it likely will only ever be used in tests, I think it should be moved from src/java/ to src/test/.

        That's actually a real use case in the test. When replacing phrase queries with shingles you might want to boost the edges by adding (boosted) prefix and suffix tokens at index and query time:

        ts = new PrefixAndSuffixAwareTokenFilter(new SingleTokenTokenStream(tokenFactory("^", 1, 100f, 0, 0)), tls, new SingleTokenTokenStream(tokenFactory("$", 1, 50f, 0, 0)));
        
        assertNext(ts, "^_hello", 1, 10.049875f, 0, 4);
        assertNext(ts, "^_greetings", 1, 10.049875f, 0, 4);
        assertNext(ts, "hello_world", 1, 1.4142135f, 0, 10);
        assertNext(ts, "greetings_world", 1, 1.4142135f, 0, 10);
        assertNext(ts, "hello_earth", 1, 1.4142135f, 0, 10);
        assertNext(ts, "greetings_earth", 1, 1.4142135f, 0, 10);
        assertNext(ts, "hello_tellus", 1, 1.4142135f, 0, 10);
        assertNext(ts, "greetings_tellus", 1, 1.4142135f, 0, 10);
        assertNext(ts, "world_$", 1, 7.1414285f, 5, 10);
        assertNext(ts, "earth_$", 1, 7.1414285f, 5, 10);
        assertNext(ts, "tellus_$", 1, 7.1414285f, 5, 10);
        assertNull(ts.next());
        

        As you can see, the default weight calculation is sort of messed up. I'd prefer to see more impact from the weights of the prefix and suffix tokens. It's not too bad though.
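As a side note, the asserted weights above are consistent with a shingle weight of sqrt(sum of component token weights): sqrt(100 + 1) ≈ 10.049875, sqrt(1 + 1) ≈ 1.4142135, and sqrt(1 + 50) ≈ 7.1414285. This is an observation about the numbers, not necessarily the actual calculateShingleWeight() formula:

```java
public class ShingleWeightCheck {
    // Candidate formula observed in the test output:
    // shingle weight = sqrt(sum of component token weights).
    static float weight(float... tokenWeights) {
        double sum = 0;
        for (float w : tokenWeights) sum += w;
        return (float) Math.sqrt(sum);
    }

    public static void main(String[] args) {
        // Values asserted above: "^"(100)+"hello"(1), "hello"+"world", "world"+"$"(50).
        System.out.println(Math.abs(weight(100f, 1f) - 10.049875f) < 1e-5); // true
        System.out.println(Math.abs(weight(1f, 1f)   - 1.4142135f) < 1e-6); // true
        System.out.println(Math.abs(weight(1f, 50f)  - 7.1414285f) < 1e-5); // true
    }
}
```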

        The various ShingleMatrixFilter constructors should have javadoc explaining their use.

        I'll do that, but the names of the constructor parameters are rather self-explanatory. It would just be a

        This class's use of the new flags feature looks interesting - a discussion in the documentation would be useful for future implementations.

        It's rather terrible: I use the int as a state instead of at the intended bitset level. It's just for demonstration purposes though.

        TestShingleMatrixFilter.TokenListStream looks generally useful for testing filters - maybe this could be pulled out as a separate class, maybe into the o.a.l.analysis.miscellaneous package?

        Or perhaps the CachingTokenFilter could be rewritten to accept a token collection in the constructor.

        Karl Wettin added a comment -

        new patch mainly contains more javadocs

        Steve Rowe added a comment -

        Cool, the added javadocs look great.

        Karl Wettin added a comment -

        Committed

        Thanks again for the input Steve!

        Grant Ingersoll added a comment -

        Despite the fact that we allow contribs to be 1.5, I don't think the analysis package should be 1.5, at least it shouldn't be made 1.5 without some discussion on the mailing list.

        Grant Ingersoll added a comment -

        I'm marking this as a blocker for 2.4 based on the Java 1.5 incompatibilities that were introduced.

        Karl Wettin added a comment -

        OK. Either remove it or place it in some alternative contrib module? The first choice is obviously the easiest.

        Karl Wettin added a comment -

        It really is quite a bit of work to downgrade this to 1.4: lots of generics, but it also depends on enums.

        So if you don't want 1.5 in contrib/analyzers I vote for simply removing it from trunk now and reintroducing it in the 3.1-dev trunk.

        karl

        Grant Ingersoll added a comment -

        I'm almost done w/ a conversion. Regex is your friend. As is IntelliJ.

        Grant Ingersoll added a comment -

        Java 1.4 compatible. Give this a try

        Karl Wettin added a comment -

        Cool, thanks!

        The only thing I could see is that you managed to remove a couple of <pre> tags.

        I'll also leave this out of the commit:

        Index: contrib/analyzers/src/java/org/apache/lucene/analysis/compound/hyphenation/PatternParser.java
        ===================================================================
        --- contrib/analyzers/src/java/org/apache/lucene/analysis/compound/hyphenation/PatternParser.java       (revision 694390)
        +++ contrib/analyzers/src/java/org/apache/lucene/analysis/compound/hyphenation/PatternParser.java       (arbetskopia)
        @@ -267,7 +267,7 @@
           // EntityResolver methods
           //
           public InputSource resolveEntity(String publicId, String systemId)
        -  throws SAXException, IOException {
        +  throws SAXException {
             return HyphenationDTDGenerator.generateDTD();
           }
        
        Karl Wettin added a comment -

        JDK downgrade committed. Thanks for the time spent Grant!


          People

          • Assignee: Karl Wettin
          • Reporter: Karl Wettin
          • Votes: 0
          • Watchers: 0
