Tika / TIKA-1443

Add a junk text detector to Tika

    Details

    • Type: Wish
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None

      Description

      It would be helpful to have a detector that flags documents whose extracted text is junk. This could be used as a component of TIKA-1332 or as a standalone detector. See TIKA-1332 for some initial ideas of what statistics we might use for such a detector.

      Two use cases:

      • Parser developers could quickly see whether changes in code lead to less "junky" documents or more "junky" documents. This would also aid in prioritizing manual review of output comparison (see discussion in TIKA-1419).
      • Search system integrators could use that information to set document-specific relevancy rankings or to skip indexing a document altogether.
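
      As a rough illustration of the kind of statistics such a detector might compute, here is a minimal Java sketch. Everything in it is an assumption made for illustration: the class name `JunkTextSketch`, the two statistics, and the thresholds are hypothetical, are not taken from TIKA-1332, and are not part of any Tika API. It only shows how simple ratios over extracted text could be turned into a junk flag.

```java
/**
 * Hypothetical sketch of a junk-text heuristic. Not part of Tika.
 * The statistics and thresholds below are illustrative assumptions only.
 */
public final class JunkTextSketch {

    private JunkTextSketch() {
    }

    /**
     * Fraction of code points that look like extraction damage: the Unicode
     * replacement character, non-whitespace control characters, and
     * private-use, unassigned, or lone-surrogate code points.
     */
    public static double suspiciousCharRatio(String text) {
        if (text.isEmpty()) {
            return 0.0;
        }
        int total = 0;
        int suspicious = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            total++;
            int type = Character.getType(cp);
            if (cp == 0xFFFD
                    || (Character.isISOControl(cp) && !Character.isWhitespace(cp))
                    || type == Character.PRIVATE_USE
                    || type == Character.UNASSIGNED
                    || type == Character.SURROGATE) {
                suspicious++;
            }
        }
        return (double) suspicious / total;
    }

    /**
     * Fraction of whitespace-separated tokens in which at most half of the
     * code points are letters.
     */
    public static double nonWordTokenRatio(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return 0.0;
        }
        String[] tokens = trimmed.split("\\s+");
        int nonWord = 0;
        for (String token : tokens) {
            long letters = token.codePoints().filter(Character::isLetter).count();
            long length = token.codePoints().count();
            if (letters * 2 <= length) {
                nonWord++;
            }
        }
        return (double) nonWord / tokens.length;
    }

    /** Flags text as junk when either ratio exceeds an arbitrary, untuned threshold. */
    public static boolean isLikelyJunk(String text) {
        return suspiciousCharRatio(text) > 0.1 || nonWordTokenRatio(text) > 0.5;
    }
}
```

      A parser developer could run something like `JunkTextSketch.isLikelyJunk(extractedText)` over the output of two parser versions and compare how many documents are flagged. Real thresholds would need tuning per language and document type, since the token heuristic would misjudge text that is mostly numeric, such as spreadsheets.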

            People

            • Assignee: Unassigned
            • Reporter: Tim Allison (tallison)
            • Votes: 1
            • Watchers: 8
