  1. Tika
  2. TIKA-2273

Enable configuration of EncodingDetectors via TikaConfig

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

      Description

      It would be nice to make encoding detectors easier to configure. It should be straightforward to follow the example of how detectors are configured... (famous last words).

        Activity

        tallison@mitre.org Tim Allison added a comment (edited)

        First draft of a patch. Not all tests are passing.

        If anyone has a chance to review, I'd appreciate it!

        Parsers that use the AutoDetectReader have to get a TikaConfig from somewhere, and I don't much like this.

        This could lead to the inefficiency of building an entire TikaConfig on every parse for TXTParser, HtmlParser and others. I've mitigated this for users of AutoDetectParser by adding a TikaConfig to the ParseContext if the user hasn't already specified one.

        Are there better options?
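
        The mitigation described above can be sketched in a self-contained form. This is only an illustration, not the patch itself: Config and Context are hypothetical stand-ins for TikaConfig and ParseContext, and setIfAbsent, resolveConfig and buildsFor are names invented for this sketch. It assumes the outer parser builds one configuration up front and adds it to each per-parse context only when the caller has not supplied one.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;

public class ContextDefaultSketch {

    /** Hypothetical stand-in for a configuration object that is costly to build. */
    static final class Config {
        Config(AtomicInteger buildCounter) {
            buildCounter.incrementAndGet(); // record every construction
        }
    }

    /** Hypothetical stand-in for a type-keyed, per-parse context. */
    static final class Context {
        private final Map<Class<?>, Object> entries = new HashMap<>();

        <T> void set(Class<T> key, T value) {
            entries.put(key, value);
        }

        <T> T get(Class<T> key) {
            return key.cast(entries.get(key)); // null when the key is absent
        }
    }

    /** Adds the fallback config only when the caller has not supplied one. */
    static void setIfAbsent(Context context, Config fallback) {
        if (context.get(Config.class) == null) {
            context.set(Config.class, fallback);
        }
    }

    /** What an inner text parser would do: reuse the context's config, else build one. */
    static Config resolveConfig(Context context, AtomicInteger buildCounter) {
        Config fromContext = context.get(Config.class);
        return fromContext != null ? fromContext : new Config(buildCounter);
    }

    /** Runs the given number of parses and returns how many configs were built. */
    static int buildsFor(int parses, boolean shareViaContext) {
        AtomicInteger buildCounter = new AtomicInteger();
        Config shared = shareViaContext ? new Config(buildCounter) : null;
        for (int i = 0; i < parses; i++) {
            Context context = new Context(); // fresh context per parse
            if (shareViaContext) {
                setIfAbsent(context, shared);
            }
            resolveConfig(context, buildCounter);
        }
        return buildCounter.get();
    }

    public static void main(String[] args) {
        System.out.println("shared via context: " + buildsFor(3, true));
        System.out.println("built per parse:    " + buildsFor(3, false));
    }
}
```

        With three parses, the shared variant builds the configuration once and the fallback variant builds it three times. Parsers called directly, without an outer parser to populate the context, still take the fallback path, which is the remaining cost the comment asks about.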

        tallison@mitre.org Tim Allison added a comment -

        Bob Paulin, help! I've broken two tests in tika-bundle. My goal was to allow configuration of the EncodingDetectors that several parsers use via AutoDetectReader. All was going well, I thought, but in the bundle the debugger shows that no parsers [parsers.size() == 0] are actually going through TikaConfig to be updated with:

                if (encodingDetector != null) {
                    for (Parser p : parsers) {
                        if (p instanceof AbstractEncodingDetectorParser) {
                            ((AbstractEncodingDetectorParser)p).setEncodingDetector(encodingDetector);
                        }
                    }
                }
        

        in DefaultParser's getDefaultParsers(ServiceLoader, EncodingDetector)

        Are the Parsers getting built outside of this code in the bundle somehow?

        hudson Hudson added a comment -

        SUCCESS: Integrated in Jenkins build Tika-trunk #1209 (See https://builds.apache.org/job/Tika-trunk/1209/)
        TIKA-2273 – two tests turned off temporarily in bundle. First draft of (tallison: rev 6d022be03b5423f6c036e1aa45e4ce02a9678462)

        • (edit) tika-core/src/main/java/org/apache/tika/detect/EncodingDetector.java
        • (edit) tika-core/src/main/java/org/apache/tika/parser/DefaultParser.java
        • (edit) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/JackcessExtractor.java
        • (edit) tika-core/src/main/java/org/apache/tika/config/TikaConfig.java
        • (edit) tika-parsers/src/main/java/org/apache/tika/parser/envi/EnviHeaderParser.java
        • (edit) tika-core/src/main/java/org/apache/tika/extractor/EmbeddedDocumentUtil.java
        • (add) tika-core/src/main/java/org/apache/tika/parser/AbstractEncodingDetectorParser.java
        • (add) tika-core/src/main/java/org/apache/tika/detect/DefaultEncodingDetector.java
        • (edit) tika-parsers/src/main/java/org/apache/tika/parser/txt/Icu4jEncodingDetector.java
        • (edit) tika-parsers/src/main/java/org/apache/tika/parser/html/HtmlParser.java
        • (add) tika-core/src/main/java/org/apache/tika/detect/CompositeEncodingDetector.java
        • (edit) tika-parsers/src/main/java/org/apache/tika/parser/isatab/ISATabUtils.java
        • (edit) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/OutlookExtractor.java
        • (add) tika-parsers/src/test/resources/org/apache/tika/config/TIKA-2273-blacklist-encoding-detector-default.xml
        • (edit) tika-parsers/src/main/java/org/apache/tika/parser/txt/TXTParser.java
        • (edit) tika-bundle/src/test/java/org/apache/tika/bundle/BundleIT.java
        • (add) tika-parsers/src/test/java/org/apache/tika/config/TikaEncodingDetectorTest.java
        • (add) tika-parsers/src/test/resources/org/apache/tika/config/TIKA-2273-parameterize-encoding-detector.xml
        • (edit) tika-core/src/main/java/org/apache/tika/detect/AutoDetectReader.java
        • (add) tika-core/src/main/java/org/apache/tika/detect/NonDetectingEncodingDetector.java
        • (edit) tika-core/src/test/java/org/apache/tika/TikaTest.java
        • (edit) tika-parsers/src/test/java/org/apache/tika/parser/txt/TXTParserTest.java
        • (add) tika-parsers/src/test/resources/org/apache/tika/config/TIKA-2273-no-icu4j-encoding-detector.xml
        • (edit) tika-parsers/src/main/java/org/apache/tika/parser/chm/ChmParser.java
        • (edit) tika-parsers/src/main/java/org/apache/tika/parser/code/SourceCodeParser.java
        hudson Hudson added a comment -

        SUCCESS: Integrated in Jenkins build Tika-trunk #1211 (See https://builds.apache.org/job/Tika-trunk/1211/)
        TIKA-2273 – fix configuration of encoding detectors when parsers are (tallison: rev e7a0c3eece98ac81b3813aeb429b24f16090201c)

        • (edit) tika-core/src/main/java/org/apache/tika/parser/AutoDetectParser.java
        • (edit) tika-parsers/src/main/java/org/apache/tika/parser/txt/TXTParser.java
        • (edit) tika-core/src/main/java/org/apache/tika/parser/DefaultParser.java
        • (add) tika-parsers/src/test/resources/org/apache/tika/config/TIKA-2273-non-detecting-params-bad-charset.xml
        • (edit) tika-core/src/main/java/org/apache/tika/parser/AbstractEncodingDetectorParser.java
        • (edit) tika-core/src/main/java/org/apache/tika/config/TikaConfig.java
        • (edit) tika-parsers/src/main/java/org/apache/tika/parser/html/HtmlParser.java
        • (edit) tika-core/src/main/java/org/apache/tika/config/ServiceLoader.java
        • (add) tika-parsers/src/test/resources/org/apache/tika/config/TIKA-2273-non-detecting-params.xml
        • (edit) tika-bundle/src/test/java/org/apache/tika/bundle/BundleIT.java
        • (add) tika-parsers/src/test/resources/org/apache/tika/config/TIKA-2273-encoding-detector-outside-static-init.xml
        • (edit) tika-parsers/src/test/java/org/apache/tika/parser/txt/TXTParserTest.java
        • (edit) tika-parsers/src/main/java/org/apache/tika/parser/code/SourceCodeParser.java
        • (edit) tika-core/src/main/java/org/apache/tika/detect/NonDetectingEncodingDetector.java
        • (edit) tika-parsers/src/test/java/org/apache/tika/config/TikaEncodingDetectorTest.java
        hudson Hudson added a comment -

        SUCCESS: Integrated in Jenkins build Tika-trunk #1212 (See https://builds.apache.org/job/Tika-trunk/1212/)
        TIKA-2273 – cleanup, update CHANGES.txt (tallison: rev 5e0f926c57649e1abfe56fb893266a501f390f1c)

        • (edit) CHANGES.txt
        • (edit) tika-core/src/main/java/org/apache/tika/parser/AbstractEncodingDetectorParser.java
        tallison@mitre.org Tim Allison added a comment -

        All appears to work in 1.x. I'm still having problems loading any encoding detector in the overall bundle in 2.x.

        This is for a separate issue, but I wanted to document it: I also found that we need some way of ordering SPI encoding detectors across packages, e.g. text vs. web. In Tika 1.x, we have one service provider file for encoding detection, and we load the detectors in the order they appear in that one file. However, if someone adds their own, or in 2.x where we have one service file in text and one in web, the order is not guaranteed. For the parsers, we rely on sorting by the parsers' class names. Should we do the same for EncodingDetectors, remove them from SPI and require configuration, or perhaps centralize them in tika-core?
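
        To make the first of those options concrete, here is a minimal sketch of sorting discovered detectors by class name. It is an assumption-laden illustration rather than Tika code: the Detector interface and the three detector classes are hypothetical stand-ins, and the input lists simply simulate two different discovery orders.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class DetectorOrderSketch {

    /** Hypothetical stand-in for an encoding detector service interface. */
    interface Detector {}

    // Hypothetical implementations, imagined as coming from different modules.
    static final class HtmlDetector implements Detector {}
    static final class IcuDetector implements Detector {}
    static final class UniversalDetector implements Detector {}

    /** Returns a copy ordered by fully qualified class name, whatever the discovery order. */
    static List<Detector> sortedByClassName(List<? extends Detector> discovered) {
        List<Detector> sorted = new ArrayList<>(discovered);
        sorted.sort(Comparator.comparing(d -> d.getClass().getName()));
        return sorted;
    }

    /** Helper for display: the simple class names in list order. */
    static List<String> simpleNames(List<? extends Detector> detectors) {
        List<String> names = new ArrayList<>();
        for (Detector d : detectors) {
            names.add(d.getClass().getSimpleName());
        }
        return names;
    }

    public static void main(String[] args) {
        // Two simulated discovery orders, e.g. from different classpath layouts.
        List<Detector> first = Arrays.asList(
                new UniversalDetector(), new HtmlDetector(), new IcuDetector());
        List<Detector> second = Arrays.asList(
                new IcuDetector(), new UniversalDetector(), new HtmlDetector());

        System.out.println(simpleNames(sortedByClassName(first)));
        System.out.println(simpleNames(sortedByClassName(second)));
    }
}
```

        Both calls print the same order, so the result no longer depends on which service file is read first. Note that sorting by name only makes the order deterministic; it does not by itself express which detector should take priority, which is where explicit configuration would still be needed.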


          People

          • Assignee: Unassigned
          • Reporter: tallison@mitre.org Tim Allison