  1. Lucene - Core
  2. LUCENE-5317

Concordance/Key Word In Context (KWIC) capability

Details

    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 4.5
    • Fix Version/s: None
    • Component/s: core/search
    • Lucene Fields: New, Patch Available

    Description

      This patch enables a Lucene-powered concordance search capability.

      Concordances are extremely useful for linguists, lawyers and other analysts who perform analytic search rather than traditional snippeting or document retrieval. By "analytic search," I mean that the user wants to browse every occurrence of a term (or at least the top n) in a subset of documents and see the words before and after it.

      Concordance technology is far simpler and less interesting than IR relevance models/methods, but it can be extremely useful for some use cases.

      Traditional concordance sort orders are available: sort on words before the target, on words after the target, on the target then words before, and on the target then words after.
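
      To make the four sort orders concrete, here is a minimal stand-alone sketch in plain Java. It does not use Lucene and is not taken from the patch: the class KwicSketch, the Window type, the comparator names and the whitespace tokenization are all hypothetical. It also assumes that "sort on words before" compares the word nearest the target first, which is a common concordance convention but may differ from what the patch does.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration only; none of these names come from the patch.
class KwicSketch {

    /** One concordance line: words before the target, the target, words after. */
    static final class Window {
        final String pre;
        final String target;
        final String post;

        Window(String pre, String target, String post) {
            this.pre = pre;
            this.target = target;
            this.post = post;
        }

        @Override
        public String toString() {
            return pre + " [" + target + "] " + post;
        }
    }

    /** Collects up to `width` words on each side of every occurrence of `term`. */
    static List<Window> concordance(String text, String term, int width) {
        String[] words = text.trim().split("\\s+"); // naive whitespace tokenization
        List<Window> out = new ArrayList<>();
        for (int i = 0; i < words.length; i++) {
            if (!words[i].equalsIgnoreCase(term)) {
                continue;
            }
            int from = Math.max(0, i - width);
            int to = Math.min(words.length, i + 1 + width);
            String pre = String.join(" ", Arrays.copyOfRange(words, from, i));
            String post = String.join(" ", Arrays.copyOfRange(words, i + 1, to));
            out.add(new Window(pre, words[i], post));
        }
        return out;
    }

    /** Sort key for the left context: preceding words reversed, nearest word first (assumption). */
    static String reversedPre(Window w) {
        List<String> parts = new ArrayList<>(Arrays.asList(w.pre.split(" ")));
        Collections.reverse(parts);
        return String.join(" ", parts);
    }

    static final Comparator<Window> BY_PRE =
        Comparator.comparing((Window w) -> reversedPre(w));
    static final Comparator<Window> BY_POST =
        Comparator.comparing((Window w) -> w.post);
    static final Comparator<Window> BY_TARGET =
        Comparator.comparing((Window w) -> w.target);
    static final Comparator<Window> BY_TARGET_THEN_PRE = BY_TARGET.thenComparing(BY_PRE);
    static final Comparator<Window> BY_TARGET_THEN_POST = BY_TARGET.thenComparing(BY_POST);
}
```

      Calling KwicSketch.concordance(text, "the", 2) and then sorting the result with one of the comparators, for example hits.sort(KwicSketch.BY_POST), gives the corresponding view. The target-first orders only differ from the plain ones when a query matches more than one surface form.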

      Under the hood, this runs SpanQuery's getSpans() and reanalyzes the text to obtain character offsets. There is plenty of room for optimization and refactoring.
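
      The reanalysis step is needed because spans report token positions, not character offsets. The following plain-Java sketch shows only that mapping, under stated assumptions: a whitespace regex stands in for a real Lucene Analyzer, the span positions are passed in by hand rather than read from getSpans(), and the names OffsetSketch, tokenOffsets and spanText are hypothetical. As with Lucene spans, the start position is inclusive and the end position is exclusive.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; the patch itself works with Lucene analyzers.
class OffsetSketch {

    private static final Pattern TOKEN = Pattern.compile("\\S+");

    /** Re-tokenizes the text and returns {startOffset, endOffset} for each token position. */
    static List<int[]> tokenOffsets(String text) {
        List<int[]> offsets = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            offsets.add(new int[] {m.start(), m.end()});
        }
        return offsets;
    }

    /** Maps token positions [spanStart, spanEnd) to the characters they cover. */
    static String spanText(String text, int spanStart, int spanEnd) {
        List<int[]> offsets = tokenOffsets(text);
        int from = offsets.get(spanStart)[0];    // first character of the first token
        int to = offsets.get(spanEnd - 1)[1];    // one past the last character of the last token
        return text.substring(from, to);
    }
}
```

      Because the offsets come from the original string, irregular whitespace between tokens is preserved in the extracted text. The cost is a second tokenization pass per matching document, which is one of the places where optimization seems possible.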

      Many thanks to my colleague, Jason Robinson, for input on the design of this patch.

      Attachments

        1. concordance_v1.patch.gz
          19 kB
          Tim Allison
        2. LUCENE-5317.patch
          176 kB
          Steven Rowe
        3. LUCENE-5317.patch
          135 kB
          Steven Rowe
        4. lucene5317v1.patch
          175 kB
          Tim Allison
        5. lucene5317v2.patch
          175 kB
          Tim Allison

            People

              Assignee: Tommaso Teofili (teofili)
              Reporter: Tim Allison (tallison)
              Votes: 4
              Watchers: 14

                Time Tracking

                  Original Estimate: Not Specified
                  Remaining Estimate: 0h
                  Time Spent: 0.5h