Lucene - Core / LUCENE-5317

Concordance/Key Word In Context (KWIC) capability

    Details

    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 4.5
    • Fix Version/s: None
    • Component/s: core/search
    • Labels:
    • Lucene Fields:
      New, Patch Available

      Description

      This patch enables a Lucene-powered concordance search capability.

      Concordances are extremely useful for linguists, lawyers and other analysts who perform analytic search rather than traditional snippeting or document retrieval. By "analytic search," I mean that the user wants to browse every occurrence of a term (or at least the top n) in a subset of documents and see the words before and after each one.

      Concordance technology is far simpler and less interesting than IR relevance models/methods, but it can be extremely useful for some use cases.

      Traditional concordance sort orders are available: words before the target, words after the target, target then words before, and target then words after.
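
The four sort orders can be illustrated with plain Java comparators. This is only a sketch and not code from the patch: the `Window` type, the `BY_*` comparator names and the sample data are all hypothetical, and comparing the preceding words nearest-first (right to left) is an assumption about what "sort on words before the target" means.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;

class ConcordanceSortSketch {

    /** One concordance line (hypothetical type, not taken from the patch). */
    static final class Window {
        final List<String> pre;   // words before the target, in reading order
        final String target;      // the matched term
        final List<String> post;  // words after the target, in reading order

        Window(List<String> pre, String target, List<String> post) {
            this.pre = pre;
            this.target = target;
            this.post = post;
        }

        @Override
        public String toString() {
            return String.join(" ", pre) + " [" + target + "] " + String.join(" ", post);
        }
    }

    // Assumption: words before the target are compared nearest-first,
    // so the key lists them right to left.
    static String preKey(Window w) {
        List<String> reversed = new ArrayList<>(w.pre);
        Collections.reverse(reversed);
        return String.join(" ", reversed).toLowerCase(Locale.ROOT);
    }

    static String postKey(Window w) {
        return String.join(" ", w.post).toLowerCase(Locale.ROOT);
    }

    static String targetKey(Window w) {
        return w.target.toLowerCase(Locale.ROOT);
    }

    static final Comparator<Window> BY_PRE = Comparator.comparing(ConcordanceSortSketch::preKey);
    static final Comparator<Window> BY_POST = Comparator.comparing(ConcordanceSortSketch::postKey);
    static final Comparator<Window> BY_TARGET = Comparator.comparing(ConcordanceSortSketch::targetKey);
    static final Comparator<Window> BY_TARGET_THEN_PRE = BY_TARGET.thenComparing(BY_PRE);
    static final Comparator<Window> BY_TARGET_THEN_POST = BY_TARGET.thenComparing(BY_POST);

    /** Returns a sorted copy and leaves the input list untouched. */
    static List<Window> sorted(List<Window> windows, Comparator<Window> order) {
        List<Window> copy = new ArrayList<>(windows);
        copy.sort(order);
        return copy;
    }

    /** Made-up sample windows for demonstration. */
    static List<Window> sample() {
        return Arrays.asList(
            new Window(Arrays.asList("the", "quick"), "fox", Arrays.asList("jumps", "over")),
            new Window(Arrays.asList("a", "brown"), "fox", Arrays.asList("ran", "away")),
            new Window(Arrays.asList("one", "clever"), "foxes", Arrays.asList("hid", "here")));
    }

    public static void main(String[] args) {
        for (Window w : sorted(sample(), BY_TARGET_THEN_PRE)) {
            System.out.println(w);
        }
    }
}
```

Keeping each order as a separate `Comparator` means the two compound orders are just compositions of the simple ones via `thenComparing`.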

      Under the hood, the patch runs SpanQuery's getSpans() and then reanalyzes the text to obtain character offsets. There is plenty of room for optimizations and refactoring.
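
To make the role of character offsets concrete, here is a minimal plain-Java sketch of cutting a context window out of the original text. It deliberately does not use Lucene: a simple linear scan for one term stands in for the span matching, and a hand-written tokenizer stands in for the analyzer. The names `Token`, `tokenize` and `concordance` are hypothetical and are not taken from the patch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class KwicOffsetsSketch {

    /** A token with the character offsets it occupies in the original text. */
    static final class Token {
        final String text;
        final int start; // inclusive character offset
        final int end;   // exclusive character offset

        Token(String text, int start, int end) {
            this.text = text;
            this.start = start;
            this.end = end;
        }
    }

    // Stand-in for analysis: lowercase runs of letters/digits, recording offsets.
    static List<Token> tokenize(String text) {
        List<Token> tokens = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            while (i < text.length() && !Character.isLetterOrDigit(text.charAt(i))) {
                i++;
            }
            int start = i;
            while (i < text.length() && Character.isLetterOrDigit(text.charAt(i))) {
                i++;
            }
            if (i > start) {
                tokens.add(new Token(text.substring(start, i).toLowerCase(Locale.ROOT), start, i));
            }
        }
        return tokens;
    }

    // For every token equal to the term, return up to `width` tokens of context
    // on each side, cut from the original text by character offset.
    static List<String> concordance(String text, String term, int width) {
        List<Token> tokens = tokenize(text);
        String needle = term.toLowerCase(Locale.ROOT);
        List<String> lines = new ArrayList<>();
        for (int pos = 0; pos < tokens.size(); pos++) {
            Token hit = tokens.get(pos);
            if (!hit.text.equals(needle)) {
                continue;
            }
            int first = Math.max(0, pos - width);
            int last = Math.min(tokens.size() - 1, pos + width);
            String pre = text.substring(tokens.get(first).start, hit.start);
            String target = text.substring(hit.start, hit.end);
            String post = text.substring(hit.end, tokens.get(last).end);
            lines.add(pre + "[" + target + "]" + post);
        }
        return lines;
    }

    public static void main(String[] args) {
        String text = "The quick brown fox jumps. A lazy Fox sleeps.";
        for (String line : concordance(text, "fox", 2)) {
            System.out.println(line);
        }
    }
}
```

Because the window is cut by character offset rather than rebuilt from tokens, the original casing and punctuation survive in the output; that is the reason a token-position match has to be mapped back to offsets at all.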

      Many thanks to my colleague, Jason Robinson, for input on the design of this patch.

        Attachments

        1. concordance_v1.patch.gz
          19 kB
          Tim Allison
        2. LUCENE-5317.patch
          176 kB
          Steven Rowe
        3. LUCENE-5317.patch
          135 kB
          Steven Rowe
        4. lucene5317v1.patch
          175 kB
          Tim Allison
        5. lucene5317v2.patch
          175 kB
          Tim Allison

              People

              • Assignee:
                teofili Tommaso Teofili
                Reporter:
                tallison Tim Allison
              • Votes:
                4
                Watchers:
                14

                  Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 0.5h