  1. Lucene - Core
  2. LUCENE-6747

FingerprintFilter - a TokenFilter for clustering/linking purposes

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 5.4, 6.0
    • Component/s: modules/analysis
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      A TokenFilter that emits a single token which is a sorted, de-duplicated set of the input tokens.
      This approach to normalizing text is used in tools like OpenRefine[1] and elsewhere [2] to help in clustering or linking texts.
      The implementation proposed here places an upper limit on the size of the combined output token.

      [1] https://github.com/OpenRefine/OpenRefine/wiki/Clustering-In-Depth
      [2] https://rajmak.wordpress.com/2013/04/27/clustering-text-map-reduce-in-python/
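
      The normalization described above can be sketched in plain Java. This is only an illustration and not the code from the attached patches: the class name FingerprintSketch, the method signature, the separator parameter, and the choice to return null when the size limit is exceeded are all assumptions made for the example.

```java
import java.util.List;
import java.util.TreeSet;

// Hypothetical helper, not Lucene's FingerprintFilter: it only mirrors the
// behaviour described in the issue text (sort, de-duplicate, join, cap size).
final class FingerprintSketch {

    // Returns the fingerprint, or null when the combined token would exceed
    // maxOutputSize characters (dropping oversized output is an assumption).
    static String fingerprint(List<String> tokens, char separator, int maxOutputSize) {
        // A TreeSet sorts the tokens and removes duplicates in one step.
        TreeSet<String> unique = new TreeSet<>(tokens);
        StringBuilder out = new StringBuilder();
        for (String token : unique) {
            if (out.length() > 0) {
                out.append(separator);
            }
            out.append(token);
            if (out.length() > maxOutputSize) {
                return null; // combined token is over the limit
            }
        }
        return out.toString();
    }
}
```

      With this sketch, the inputs "the quick the fox" and "fox the quick" both reduce to the single token "fox quick the", which is what lets texts that differ only in word order or repetition land in the same cluster.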

        Attachments

        1. fingerprintv1.patch
          10 kB
          Mark Harwood
        2. fingerprintv2.patch
          13 kB
          Mark Harwood
        3. fingerprintv3.patch
          14 kB
          Mark Harwood
        4. fingerprintv4.patch
          19 kB
          Mark Harwood

            People

            • Assignee:
              Unassigned
              Reporter:
              markh Mark Harwood
            • Votes:
              0
              Watchers:
              4
