  1. Tika
  2. TIKA-344

Charset hint in metadata

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Duplicate
    • Affects Version/s: 0.6
    • Fix Version/s: None
    • Component/s: parser
    • Labels: None

    Description

      It would be nice if TextParser and HtmlParser supported a Metadata.CONTENT_ENCODING hint.

      In my application I always prefer that hint (if it is present) over the charset detector result, because the charset detector is often wrong on short inputs (even when match.confidence is 100), and I know that the hint, if present, is right in 99% of cases.

      More generally, the user could change the default behaviour by overriding a function F(hint, detectorResults) -> charset.
      Another solution is to provide some standard strategies and let the user choose one of them:
      a) the hint is most important
      b) the charset detector result is most important
      c) a heuristic based on detectorResult.confidence, the hint, and maybe the input length
      The last, heuristic method might be good enough for most cases.
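
The proposal above could be sketched roughly as follows. This is only an illustration of the idea, not Tika's API: DetectorMatch, CharsetStrategy, CharsetStrategies and the threshold parameters are hypothetical names invented for this example, and the hint is assumed to be the already-parsed value of Metadata.CONTENT_ENCODING (null when absent).

```java
import java.nio.charset.Charset;

/** Hypothetical stand-in for one charset detector result (not a Tika class). */
final class DetectorMatch {
    final Charset charset;
    final int confidence; // assumed range 0..100, like match.confidence in the issue

    DetectorMatch(Charset charset, int confidence) {
        this.charset = charset;
        this.confidence = confidence;
    }
}

/** The proposed F(hint, detectorResults) -> charset; hint is null when absent. */
interface CharsetStrategy {
    Charset choose(Charset hint, DetectorMatch match, int inputLength);
}

/** Hypothetical standard strategies matching options a), b) and c). */
final class CharsetStrategies {
    private CharsetStrategies() {}

    /** a) The hint wins whenever it is present. */
    static final CharsetStrategy HINT_FIRST =
            (hint, match, inputLength) -> hint != null ? hint : match.charset;

    /** b) The detector result always wins. */
    static final CharsetStrategy DETECTOR_FIRST =
            (hint, match, inputLength) -> match.charset;

    /** c) Trust the detector only on long-enough input with high-enough confidence. */
    static CharsetStrategy heuristic(int minLength, int minConfidence) {
        return (hint, match, inputLength) -> {
            boolean detectorTrusted =
                    inputLength >= minLength && match.confidence >= minConfidence;
            return (hint == null || detectorTrusted) ? match.charset : hint;
        };
    }
}
```

With something like CharsetStrategies.heuristic(200, 80), a short input with a hint would keep the hint even when the detector reports confidence 100, while a long input with a confident detector result would use the detector. The thresholds 200 and 80 are arbitrary placeholders, not tuned values.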

          People

            Assignee: Unassigned
            Reporter: Piotr Bartosiewicz (bartex)