Tika / TIKA-2038

A more accurate facility for detecting Charset Encoding of HTML documents


Details

    • Type: Improvement
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: core, detector
    • Labels: None

    Description

      Currently, Tika uses icu4j to detect the charset encoding of HTML documents as well as of other inherently textual documents. However, the accuracy of encoding detection tools, including icu4j, is noticeably lower on HTML documents than on other text documents. For this reason, I developed a library in our project that works well for HTML documents; it is available here: https://github.com/shabanali-faghani/IUST-HTMLCharDet

      Since Tika is widely used with and within other Apache projects such as Nutch, Lucene, and Solr, and these projects deal heavily with HTML documents, having such a facility in Tika should also help them become more accurate.
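
      The description does not say why detectors do worse on HTML. One plausible factor, stated here as an assumption rather than as a finding of this issue, is that ASCII markup dilutes the non-ASCII bytes a statistical detector relies on. The sketch below only illustrates that dilution. The class `MarkupStripDemo` and its methods are hypothetical; they are not part of Tika, icu4j, or IUST-HTMLCharDet, and the naive byte-level tag stripping assumes an ASCII-compatible encoding.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; not the API of Tika, icu4j, or IUST-HTMLCharDet.
final class MarkupStripDemo {

    // Drops every byte between '<' and '>' (inclusive) without decoding the input.
    // Assumption: the encoding is ASCII-compatible, so '<' and '>' are single bytes.
    // Comments, scripts, and entities are not handled specially.
    static byte[] stripTags(byte[] html) {
        ByteArrayOutputStream out = new ByteArrayOutputStream(html.length);
        boolean inTag = false;
        for (byte b : html) {
            if (b == '<') {
                inTag = true;
            } else if (b == '>' && inTag) {
                inTag = false;
            } else if (!inTag) {
                out.write(b);
            }
        }
        return out.toByteArray();
    }

    // Fraction of bytes with the high bit set, i.e. bytes outside 7-bit ASCII.
    static double nonAsciiRatio(byte[] data) {
        if (data.length == 0) {
            return 0.0;
        }
        int count = 0;
        for (byte b : data) {
            if ((b & 0x80) != 0) {
                count++;
            }
        }
        return (double) count / data.length;
    }

    public static void main(String[] args) {
        // A short Persian word wrapped in ordinary markup, encoded as UTF-8.
        String page = "<html><body><p>\u0633\u0644\u0627\u0645</p></body></html>";
        byte[] raw = page.getBytes(StandardCharsets.UTF_8);
        byte[] visible = stripTags(raw);
        System.out.println("non-ASCII ratio with markup:    " + nonAsciiRatio(raw));
        System.out.println("non-ASCII ratio without markup: " + nonAsciiRatio(visible));
    }
}
```

      In this sketch the stripped bytes, rather than the raw page, would be what gets passed to a statistical detector. Whether that helps in practice depends on the detector and the corpus.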

      Attachments

        1. tld_text_html_plus_H_column.xlsx
          14 kB
          Shabanali Faghani
        2. tld_text_html.xlsx
          13 kB
          Tim Allison
        3. proposedTLDSampling.csv
          1 kB
          Tim Allison
        4. lang-wise-eval_runnable.zip
          19.23 MB
          Shabanali Faghani
        5. lang-wise-eval_source_code.zip
          9.53 MB
          Shabanali Faghani
        6. lang-wise-eval_results.zip
          2.51 MB
          Shabanali Faghani
        7. comparisons_20160804.xlsx
          115 kB
          Tim Allison
        8. comparisons_20160803b.xlsx
          104 kB
          Tim Allison
        9. iust_encodings.zip
          7 kB
          Tim Allison
        10. tika_1_14-SNAPSHOT_encoding_detector.zip
          5 kB
          Tim Allison

          People

            Assignee: Unassigned
            Reporter: Shabanali Faghani (faghani)
