Tika / TIKA-2940

Consider an ensemble charset detection method

Details

    • Type: Improvement
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved

    Description

      I recently ran our four charset detectors against our text-based files.

      The raw data is available here:
      http://162.242.228.174/encoding_detection/charsets_combined_201909.sql.zip (as SQL) or http://162.242.228.174/encoding_detection/charsets_combined_201909.csv.zip (as CSV).

      I've posted a preliminary/draft report here: https://github.com/tballison/share/blob/master/slides/Tika_charset_detector_study_201909.docx

      Overall, an ensemble approach would yield a ~1.4% increase in "common tokens"[0] on our corpus. For users with more homogeneous documents, the improvement could be far greater, e.g. if all of their documents come from a content management system that applies an incorrect HTML meta charset header.
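
      To make the idea concrete, here is a minimal sketch of one possible ensemble: a confidence-weighted vote over the guesses of several detectors. This is an illustration only, not the method used in the report. The `ensemble_vote` and `normalize` helpers and the sample guesses are hypothetical; the only library call is Python's standard `codecs.lookup`, used to fold charset aliases together.

```python
import codecs
from collections import defaultdict


def normalize(charset):
    """Map aliases such as 'utf8' and 'UTF-8' to one canonical codec name."""
    try:
        return codecs.lookup(charset).name
    except LookupError:
        # Unknown to Python: fall back to a simple case-insensitive key.
        return charset.strip().lower()


def ensemble_vote(guesses):
    """Return the charset with the highest summed confidence.

    `guesses` is a list of (charset, confidence) pairs, one per detector.
    The pairs are hypothetical inputs, not output from real Tika detectors.
    """
    totals = defaultdict(float)
    for charset, confidence in guesses:
        totals[normalize(charset)] += confidence
    # Iterate in sorted order so ties are broken deterministically by name.
    return max(sorted(totals), key=totals.get)


# Four hypothetical detectors; two agree on UTF-8 under different aliases.
guesses = [
    ("windows-1252", 0.6),
    ("UTF-8", 0.5),
    ("utf8", 0.4),
    ("ISO-8859-1", 0.3),
]
winner = ensemble_vote(guesses)
```

      In this sketch the two UTF-8 guesses outweigh the single most confident detector. A real ensemble would probably also need per-detector weights, since the detectors' confidence scales are unlikely to be comparable.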

      I'm opening this issue for discussion and as encouragement for others to work with the raw data and/or make recommendations on the preliminary report's methodology.

      [0] "common tokens" in tika-eval refers to the lists we developed of the top 30k most common words for each of the 118 languages covered by tika-eval. An increase in the total number of "common tokens" can be a sign of improved extraction.
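
      As a rough illustration of how a "common tokens" count can separate a good decoding from a bad one, the sketch below decodes the same bytes with each candidate charset and keeps the one that yields the most common tokens. Everything here is an assumption made for the example: `COMMON_TOKENS` is a tiny hypothetical stand-in for a 30k-word list, the regex tokenizer is not tika-eval's analyzer, and `best_decoding` is not a Tika API.

```python
import re

# Hypothetical stand-in for one language's list of common words.
COMMON_TOKENS = {"the", "café", "is", "open"}


def count_common_tokens(text, common=COMMON_TOKENS):
    """Count the tokens in `text` that appear in the common-word set."""
    tokens = re.findall(r"\w+", text.lower())
    return sum(1 for token in tokens if token in common)


def best_decoding(raw, candidates):
    """Decode `raw` with each candidate charset and keep the best scorer."""
    scored = []
    for charset in candidates:
        # Undecodable bytes become U+FFFD, which splits or breaks tokens.
        text = raw.decode(charset, errors="replace")
        scored.append((count_common_tokens(text), charset))
    # max() returns the first maximal item, so earlier candidates win ties.
    best_score, best_charset = max(scored, key=lambda pair: pair[0])
    return best_charset, best_score


raw = "The café is open".encode("cp1252")
result = best_decoding(raw, ["utf-8", "cp1252"])
```

      Decoding these bytes as UTF-8 turns "café" into "caf" followed by a replacement character, so one common token is lost and the cp1252 decoding scores higher.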

            People

              Assignee: Unassigned
              Reporter: Tim Allison (tallison)