  1. Lucene - Core
  2. LUCENE-6747

FingerprintFilter - a TokenFilter for clustering/linking purposes


Details

    • Type: New Feature
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 5.4, 6.0
    • Component/s: modules/analysis
    • Labels: None
    • Lucene Fields: New, Patch Available

    Description

      A TokenFilter that emits a single token which is a sorted, de-duplicated set of the input tokens.
      This approach to normalizing text is used in tools like OpenRefine[1] and elsewhere [2] to help in clustering or linking texts.
      The implementation proposed here has an upper limit on the size of the combined token that is output.

      [1] https://github.com/OpenRefine/OpenRefine/wiki/Clustering-In-Depth
      [2] https://rajmak.wordpress.com/2013/04/27/clustering-text-map-reduce-in-python/
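
The normalization described above can be illustrated with a small stand-alone sketch in plain Java. This is not the code from the attached patches and does not use the Lucene TokenStream API; the class name `FingerprintSketch`, the method `fingerprint`, and its parameters `separator` and `maxOutputSize` are hypothetical names chosen for illustration. Returning null when the combined token exceeds the limit is also an assumption, since the description only says that an upper limit exists.

```java
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

// Hypothetical stand-alone sketch; not the code from the attached patches.
final class FingerprintSketch {

    /**
     * Returns the sorted, de-duplicated tokens joined by the separator,
     * or null when the result would exceed maxOutputSize characters.
     */
    static String fingerprint(List<String> tokens, char separator, int maxOutputSize) {
        // A TreeSet drops duplicates and iterates in natural String order.
        TreeSet<String> unique = new TreeSet<>(tokens);
        StringBuilder out = new StringBuilder();
        for (String token : unique) {
            if (out.length() > 0) {
                out.append(separator);
            }
            out.append(token);
            if (out.length() > maxOutputSize) {
                return null; // assumed behavior: over-limit input yields no output
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Same words in a different order and with repeats.
        List<String> a = Arrays.asList("the", "quick", "fox", "the", "quick");
        List<String> b = Arrays.asList("quick", "fox", "the");
        System.out.println(fingerprint(a, ' ', 1024));
        System.out.println(fingerprint(b, ' ', 1024));
    }
}
```

Both inputs in `main` produce the same combined token, which is what makes the fingerprint usable as a clustering or linking key: word order and repetition no longer distinguish the two texts.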

      Attachments

        1. fingerprintv1.patch
          10 kB
          Mark Harwood
        2. fingerprintv2.patch
          13 kB
          Mark Harwood
        3. fingerprintv3.patch
          14 kB
          Mark Harwood
        4. fingerprintv4.patch
          19 kB
          Mark Harwood


          People

            Assignee: Unassigned
            Reporter: Mark Harwood (mharwood)
            Votes: 0
            Watchers: 4

            Dates

              Created:
              Updated:
              Resolved: