Lucene - Core
  LUCENE-6747

FingerprintFilter - a TokenFilter for clustering/linking purposes

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 5.4, 6.0
    • Component/s: modules/analysis
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      A TokenFilter that emits a single token which is a sorted, de-duplicated set of the input tokens.
      This approach to normalizing text is used in tools like OpenRefine[1] and elsewhere [2] to help in clustering or linking texts.
      The implementation proposed here has an upper limit on the size of the combined token that is output.

      [1] https://github.com/OpenRefine/OpenRefine/wiki/Clustering-In-Depth
      [2] https://rajmak.wordpress.com/2013/04/27/clustering-text-map-reduce-in-python/
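
      As a rough illustration of the normalization technique described above (a sketch only, not the patch itself), here is a minimal standalone example in plain Java, assuming whitespace-separated tokens and a single space as the separator:

        import java.util.SortedSet;
        import java.util.TreeSet;

        public class FingerprintDemo {
          // Builds a "fingerprint" of the input: tokens are de-duplicated,
          // sorted, and re-joined with a single separator, so texts that
          // differ only in word order or repetition normalize identically.
          static String fingerprint(String text) {
            SortedSet<String> tokens = new TreeSet<>();
            for (String token : text.split("\\s+")) {
              if (!token.isEmpty()) {
                tokens.add(token);
              }
            }
            return String.join(" ", tokens);
          }

          public static void main(String[] args) {
            // Both inputs yield the same fingerprint: "brown fox quick the"
            System.out.println(fingerprint("the quick brown fox"));
            System.out.println(fingerprint("fox the quick brown brown"));
          }
        }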

      1. fingerprintv1.patch
        10 kB
        Mark Harwood
      2. fingerprintv2.patch
        13 kB
        Mark Harwood
      3. fingerprintv3.patch
        14 kB
        Mark Harwood
      4. fingerprintv4.patch
        19 kB
        Mark Harwood

        Activity

        Mark Harwood added a comment -

        Proposed implementation and test

        Adrien Grand added a comment -

        If you could tolerate these fingerprints not being reliable identifiers of your input, I'm wondering whether we could make this more efficient by just using a hash function that doesn't depend on the order of its inputs?

        Otherwise this looks rather good to me. Instead of taking the min offset and the max offset as offsets for the final token, I'm wondering whether it might make more sense to use 0 and the final offset (the one returned after end() has been called) instead, so that we don't treat token chars differently depending on whether they appear before/after the tokens or in the middle. By the way, even with the current approach we don't need to call Math.min/max: as tokens are supposed to be emitted in order, the start offset would be the start offset of the first token and the end offset would be the end offset of the last token.
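
        To illustrate the order-independent idea (a sketch only, not something in the patch): a commutative accumulation such as summing per-token hashes produces the same value for any token order, so no sorting or buffering is needed, only de-duplication:

          import java.util.HashSet;
          import java.util.Set;

          public class OrderInsensitiveHashDemo {
            // Hashes the de-duplicated token set without sorting it.
            // Addition is commutative, so token order cannot affect the result.
            static long orderInsensitiveHash(Iterable<String> tokens) {
              Set<String> seen = new HashSet<>();
              long hash = 0;
              for (String token : tokens) {
                if (seen.add(token)) {      // de-duplicate on the fly
                  hash += token.hashCode(); // commutative accumulation
                }
              }
              return hash;
            }
          }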

        Mark Harwood added a comment -

        Thanks for taking a look, Adrien.
        Added a v2 patch with the following changes:

        1) Added a call to input.end() to get the final offset state
        2) The final state is retained using captureState()
        3) Added a FingerprintFilterFactory class

        As for the alternative hashing idea:
        For speed reasons this would be a nice idea, but it reduces the readability of results if you want to debug any collisions or otherwise display connections.

        For compactness reasons (storing in doc values etc.) it would always be possible to chain a conventional hashing algo in a TokenFilter on the end of this text-normalizing filter, as sketched below. (Do we already have a conventional hashing TokenFilter?)
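
        A hypothetical sketch of such a hashing TokenFilter (not part of the patch; the class name and hash choice are assumptions for illustration):

          import java.io.IOException;
          import org.apache.lucene.analysis.TokenFilter;
          import org.apache.lucene.analysis.TokenStream;
          import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

          // Hypothetical: replaces each term with the hex form of its hash so a
          // long fingerprint token can be stored compactly (e.g. in doc values).
          public final class HashingTokenFilter extends TokenFilter {
            private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);

            public HashingTokenFilter(TokenStream input) {
              super(input);
            }

            @Override
            public boolean incrementToken() throws IOException {
              if (!input.incrementToken()) {
                return false;
              }
              // Illustrative hash only; a real filter would likely use a
              // stronger hash such as MurmurHash3 to reduce collisions.
              String hex = Integer.toHexString(termAtt.toString().hashCode());
              termAtt.setEmpty().append(hex);
              return true;
            }
          }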

        Adrien Grand added a comment -
        +      if (item instanceof char[]) {
        +        sb.append((char[]) item);
        +      } else {
        +        sb.append(item);
        +      }
        

        Can we get anything other than a char[]? If not, I think we should just forcefully cast?

        Otherwise +1.
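
        That is, if the items are always char[] (an assumption here), the branch above collapses to a single line:

          sb.append((char[]) item); // items are always char[], so cast directly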

        Mark Harwood added a comment -

        Updated patch - removed the instanceof check and added an entry to CHANGES.txt.

        Will commit to trunk and 5.x in a day or two if there are no objections.

        Mark Harwood added a comment -

        Some final tweaks:
        1) Found a bug where the separator was not appended if the first token has length == 1
        2) Randomized testing identified an issue with input.end() not being called when IOExceptions occur
        3) Added the missing SPI entry for FingerprintFilterFactory and an associated test class (see the usage sketch below)
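
        With the SPI entry in place, the factory can be wired into an analysis chain by name. A hedged sketch using CustomAnalyzer, assuming the factory registers under the SPI name "fingerprint" (the usual *FilterFactory naming convention):

          import org.apache.lucene.analysis.Analyzer;
          import org.apache.lucene.analysis.custom.CustomAnalyzer;

          public class FingerprintAnalyzerDemo {
            public static void main(String[] args) throws Exception {
              // "whitespace" is the standard WhitespaceTokenizerFactory name;
              // "fingerprint" is assumed for the new FingerprintFilterFactory.
              Analyzer analyzer = CustomAnalyzer.builder()
                  .withTokenizer("whitespace")
                  .addTokenFilter("fingerprint")
                  .build();
            }
          }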

        ASF subversion and git services added a comment -

        Commit 1698145 from mharwood@apache.org in branch 'dev/trunk'
        [ https://svn.apache.org/r1698145 ]

        LUCENE-6747: FingerprintFilter is a new TokenFilter that outputs a single token which is a concatenation of the sorted and de-duplicated set of input tokens.

        ASF subversion and git services added a comment -

        Commit 1698188 from mharwood@apache.org in branch 'dev/branches/branch_5x'
        [ https://svn.apache.org/r1698188 ]

        LUCENE-6747: FingerprintFilter is a new TokenFilter that outputs a single token which is a concatenation of the sorted and de-duplicated set of input tokens.

        Mark Harwood added a comment -

        Committed to trunk and 5.x.

        Adrien Grand added a comment -

        Mark, I see this issue under "Lucene 6.0.0" in the CHANGES.txt file on trunk; should it be moved to 5.4.0?


          People

          • Assignee:
            Unassigned
          • Reporter:
            Mark Harwood
          • Votes:
            0
          • Watchers:
            4
