Tika / TIKA-2041

Charset detection doesn't appear to be thread-safe

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.0, 1.14
    • Component/s: None
    • Labels: None

      Description

      On the user list, Christian Leitinger noted that his team found a potential issue with the thread safety of the encoding detector. I was able to reproduce this on the corpus of HTML files in Shabanali Faghani's encoding detector.

          @Test
          public void testMultiThreadingEncodingDetection() throws Exception {
      
              Path testDocs = Paths.get("C:/data/encodings/corpus");
              List<Path> paths = new ArrayList<>();
              Map<Path, String> encodings = new ConcurrentHashMap<>();
              for (File encodingDirs : testDocs.toFile().listFiles()) {
                  for (File file : encodingDirs.listFiles()) {
                          String encoding = getEncoding(file.toPath());
                          paths.add(file.toPath());
                          encodings.put(file.toPath(), encoding);
                  }
              }
              int numThreads = 1000;
              ExecutorService ex = Executors.newFixedThreadPool(numThreads);
              CompletionService<String> completionService =
                      new ExecutorCompletionService<>(ex);
      
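              try {
                  for (int i = 0; i < numThreads; i++) {
                      completionService.submit(new EncodingDetectorRunner(paths, encodings), "done");
                  }
                  int completed = 0;
                  while (completed < numThreads) {
                      // take() blocks until a task finishes; get() rethrows a worker's
                      // failure wrapped in an ExecutionException
                      Future<String> future = completionService.take();
                      if ("done".equals(future.get())) {
                          completed++;
                      }
                  }
              } finally {
                  // Release the worker threads even when a mismatch fails the test
                  ex.shutdownNow();
              }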
              assertTrue("success!", true);
          }
      
          private class EncodingDetectorRunner implements Runnable {
              private final List<Path> paths;
              private final Map<Path, String> encodings;
              private final Random r = new Random();
              private EncodingDetectorRunner(List<Path> paths, Map<Path, String> encodings) {
                  this.paths = paths;
                  this.encodings = encodings;
              }
      
              @Override
              public void run() {
                  for (int i = 0; i < 100; i++) {
                      int pInd = r.nextInt(paths.size());
      
                      String detectedEncoding = null;
                      try {
                          detectedEncoding = getEncoding(paths.get(pInd));
                      } catch (Exception e) {
                          throw new RuntimeException(e);
                      }
                      String trueEncoding = encodings.get(paths.get(pInd));
                      if (! detectedEncoding.equals(trueEncoding)) {
                          throw new RuntimeException("detected: " + detectedEncoding +
                                  " but should have been: "+trueEncoding + " for " + paths.get(pInd));
                      }
                  }
              }
          }
      
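          public String getEncoding(Path p) throws Exception {
              // try-with-resources closes the reader as well as the stream
              try (InputStream is = TikaInputStream.get(p);
                   AutoDetectReader reader = new AutoDetectReader(is)) {
                  // needs: import java.nio.charset.Charset;
                  Charset charset = reader.getCharset();
                  // Check for null before calling toString() to avoid a NullPointerException
                  return charset == null ? "NULL" : charset.toString();
              }
          }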
      

      yields:

      java.util.concurrent.ExecutionException: java.lang.RuntimeException: detected: ISO-8859-1 but should have been: windows-1252 for C:\data\encodings\corpus\Shift_JIS\1
      
      	at java.util.concurrent.FutureTask.report(FutureTask.java:122)
      	at java.util.concurrent.FutureTask.get(FutureTask.java:192)
      	at org.apache.tika.parser.html.HtmlParserTest.testMultiThreadingEncodingDetection(HtmlParserTest.java:1213)
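
The `ExecutionException` at the top of the trace is the wrapper that `Future.get()` puts around whatever a worker threw. Below is a minimal sketch, using only `java.util.concurrent`, of a harness that unwraps the cause, collects every failure instead of stopping at the first, and always shuts the pool down. `HarnessSketch` and `runAll` are hypothetical names chosen for illustration; they are not part of Tika or of the test above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Hypothetical helper: runs tasks on a pool and reports which ones failed. */
class HarnessSketch {

    /** Returns the messages of the exceptions thrown by failing tasks. */
    static List<String> runAll(List<Runnable> tasks, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        CompletionService<String> completion = new ExecutorCompletionService<>(pool);
        List<String> failures = new ArrayList<>();
        try {
            for (Runnable task : tasks) {
                completion.submit(task, "done");
            }
            for (int i = 0; i < tasks.size(); i++) {
                try {
                    completion.take().get();
                } catch (ExecutionException e) {
                    // getCause() is the exception the worker actually threw
                    failures.add(String.valueOf(e.getCause().getMessage()));
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            // Stop the worker threads whether or not anything failed
            pool.shutdownNow();
        }
        return failures;
    }

    public static void main(String[] args) {
        List<Runnable> tasks = new ArrayList<>();
        tasks.add(() -> { });
        tasks.add(() -> { throw new RuntimeException("mismatch in worker"); });
        System.out.println(runAll(tasks, 2));
    }
}
```

Collecting all mismatches in one run makes it easier to see whether the failures share a pattern, which is the question the comments below go on to ask.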
      

        Activity

        fnl Florian Leitner added a comment -

        Hi, I'm one of Christian's pals. In case it helps: one possible origin that I suspect of being involved in this bug is the private static class CharsetRecog_IBM420_ar in CharsetRecog_sbcs. It can edit the byte array being processed (search for a private method called "unshape"). That could change the distribution of bytes that the CharsetDetector uses to predict the final outcome if the order in which the static private CharsetRecognizer classes are used is somehow not thread-safe. I have not had time to prove this theory, but I imagine that the unshape method, which mutates the CharsetDetector.fByteStats and CharsetDetector.fInputBytes values, could be related to the cause of this bug, or at least that some similar code in a CharsetRecognizer mutates these array values.

        tallison@mitre.org Tim Allison added a comment - - edited

        From Nick Burch on the user list:

        On the whole, I think Tika is following the POI model on thread-safety as a minimum. That is, two threads working on two different documents should always be fine. Two threads trying to work on the same document may not be.

        fnl Florian Leitner added a comment - - edited

        @Tim: Regarding Nick's comment, and just to clear up any ambiguity: neither the code above nor the place where we observed the bug is a case of multiple threads working on the same input stream/byte array. [EDITS: making this reply clearer.]

        tallison@mitre.org Tim Allison added a comment -

        Sorry. I agree. Had to run to a meeting before including the important part b above. I'm able to reproduce this issue with multiple threads processing each file only once.

        Has anyone done enough googling or used pure ICU4J to figure out if this is an issue known by/fixed by them in a more recent version?

        tallison@mitre.org Tim Allison added a comment - - edited

        Out of curiosity, what type of diffs are you seeing? When I experiment with turning off loading of certain subsets of recognizers, the errors I'm getting seem to be roughly equivalent windows vs. ISO character sets: windows-1255 vs. ISO-8859-8 or windows-1254 vs. ISO-8859-9.

        The problem there, I think, is that haveC1Bytes is storing instance state, but the recognizers are being loaded statically, so every thread shares the same recognizer instances. (It looks like they've gotten rid of haveC1Bytes in trunk and in 57-rc.)

        Are you seeing differences beyond these win/ISO equivalences?
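
To make the hazard concrete, here is a minimal, single-threaded sketch of what can go wrong when one recognizer object is shared and keeps per-document state in a field. `StatefulRecognizer` and `SharedStateSketch` are hypothetical stand-ins, not the ICU4J or Tika classes; only the field name haveC1Bytes and the windows-1252 / ISO-8859-1 pairing are taken from the discussion above. The interleaving that two threads could produce is replayed by hand so that the outcome is deterministic.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical recognizer that stores per-document state in a field. */
class StatefulRecognizer {
    private boolean haveC1Bytes; // overwritten by every call to match()

    /** Scans one document and remembers whether it had bytes in 0x80..0x9F. */
    void match(byte[] input) {
        boolean sawC1 = false;
        for (byte b : input) {
            int v = b & 0xFF;
            if (v >= 0x80 && v <= 0x9F) {
                sawC1 = true;
                break;
            }
        }
        haveC1Bytes = sawC1;
    }

    /** Reports a name based on whichever document was scanned most recently. */
    String getName() {
        return haveC1Bytes ? "windows-1252" : "ISO-8859-1";
    }
}

class SharedStateSketch {
    /** Replays "A matches, B matches, A asks for the name" on one shared instance. */
    static String nameSeenByFirstCaller(byte[] docA, byte[] docB) {
        StatefulRecognizer shared = new StatefulRecognizer();
        shared.match(docA);      // caller A scans its document
        shared.match(docB);      // caller B scans a different document
        return shared.getName(); // caller A now reads state left behind by B
    }

    public static void main(String[] args) {
        byte[] ascii = "plain ascii text".getBytes(StandardCharsets.US_ASCII);
        byte[] withC1 = {(byte) 0x93, 'h', 'i', (byte) 0x94};
        System.out.println(nameSeenByFirstCaller(ascii, withC1));
    }
}
```

In this sketch the first caller's document is pure ASCII, yet the name it gets back depends on the second caller's document. That matches the kind of windows/ISO swap reported in this issue, although the real code path may differ in detail.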

        fnl Florian Leitner added a comment -

        Nope, that is exactly the same problem I was seeing, and to my best knowledge there is no other case.

        In single-threaded mode, haveC1Bytes is always false. My suspicion, therefore, is that something (unshape?) triggers an update of the various mutable state values in CharsetDetector, so that by the time the CharsetDetector instance checks the result of the CharsetRecognizer, haveC1Bytes has (wrongly) been set to true and the Windows variant is returned instead of the correct Latin encoding name. (BTW, this is doubly annoying because Latin-1 is the default result even for ASCII-only documents, so those end up being detected as the "windows-1252" variant as well.)
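
One common way to remove this kind of hazard is to keep the recognizer free of mutable fields and return the chosen name from the call itself. The sketch below illustrates that idea under stated assumptions: `StatelessRecognizer` and `StatelessSketch` are hypothetical names, and this is not a description of what the newer ICU4J code actually does. A single shared instance is exercised from several threads and every result is checked against the expected name.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical recognizer with no fields; each call returns its own result. */
final class StatelessRecognizer {
    String match(byte[] input) {
        for (byte b : input) {
            int v = b & 0xFF;
            if (v >= 0x80 && v <= 0x9F) {
                return "windows-1252";
            }
        }
        return "ISO-8859-1";
    }
}

class StatelessSketch {
    // Sharing is harmless here because the instance holds no state
    private static final StatelessRecognizer SHARED = new StatelessRecognizer();

    /** Runs concurrent detections and counts results that differ from the expected name. */
    static int countMismatches(int tasks) {
        final byte[] ascii = "plain ascii text".getBytes(StandardCharsets.US_ASCII);
        final byte[] withC1 = {(byte) 0x93, 'h', 'i', (byte) 0x94};
        ExecutorService pool = Executors.newFixedThreadPool(8);
        try {
            List<Future<Boolean>> results = new ArrayList<>();
            for (int i = 0; i < tasks; i++) {
                final boolean useC1 = (i % 2 == 0);
                Callable<Boolean> task = () -> {
                    String expected = useC1 ? "windows-1252" : "ISO-8859-1";
                    return expected.equals(SHARED.match(useC1 ? withC1 : ascii));
                };
                results.add(pool.submit(task));
            }
            int mismatches = 0;
            for (Future<Boolean> result : results) {
                if (!result.get()) {
                    mismatches++;
                }
            }
            return mismatches;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("mismatches: " + countMismatches(1000));
    }
}
```

Because nothing is written to shared memory during a match, the order in which threads run cannot change any caller's answer.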

        tallison@mitre.org Tim Allison added a comment -

        Thank you for the confirmation. Unless my colleagues object, I'll re-copy/paste the classes that we use from ICU4J from the latest release (or trunk?). That should fix this.

        fnl Florian Leitner added a comment - - edited

        Sounds great; if it resolves the above unit test, it should fix our problem, too. Thank you for looking into this! [EDIT: typo]

        tallison@mitre.org Tim Allison added a comment - - edited

        Fixes this problem, indeed. However...

        Looks like the newer version has dropped EBCDIC_500_*. I'll add those back in?

        The newer version has also decreased the number of bytes read from 12000 to 8000. This is causing one of our tests to fail, so I will increase the number of bytes read back to 12000.
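
For readers unfamiliar with why the byte limit matters: a detector of this kind only looks at a bounded prefix of the stream, so a document whose telling bytes sit beyond the limit can be classified differently. The following is a rough sketch of such a bounded peek using `mark`/`reset` from `java.io`. `PrefixPeekSketch` and `peek` are hypothetical names and this is not the Tika or ICU4J implementation; the 8000 and 12000 figures are the ones mentioned in the comment above.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.Arrays;

class PrefixPeekSketch {

    /** Reads up to limit bytes, then rewinds so a later reader sees the whole stream. */
    static byte[] peek(InputStream in, int limit) {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        try {
            in.mark(limit);
            byte[] buffer = new byte[limit];
            int total = 0;
            // read() may return fewer bytes than requested, so loop until full or EOF
            while (total < limit) {
                int n = in.read(buffer, total, limit - total);
                if (n < 0) {
                    break;
                }
                total += n;
            }
            in.reset();
            return Arrays.copyOf(buffer, total);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        ByteArrayInputStream in = new ByteArrayInputStream(new byte[20000]);
        System.out.println(peek(in, 8000).length);
        System.out.println(peek(in, 12000).length);
    }
}
```

With a 20000-byte input, the two calls in `main` hand the detector 8000 and 12000 bytes respectively, which is the difference that made the existing test sensitive to the new default.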

        hudson Hudson added a comment -

        FAILURE: Integrated in Tika-trunk #1086 (See https://builds.apache.org/job/Tika-trunk/1086/)
        TIKA-2041, step 1, upgrade icu4j components; add back ebcdic and bump (tallison: rev d698d49793bb96aee6a8bfcc4a174c922cb83cb8)

        • tika-parsers/src/main/java/org/apache/tika/parser/txt/CharsetRecog_2022.java
        • tika-parsers/src/main/java/org/apache/tika/parser/txt/CharsetRecog_sbcs.java
        • tika-parsers/src/main/java/org/apache/tika/parser/txt/CharsetRecognizer.java
        • tika-parsers/src/main/java/org/apache/tika/parser/txt/CharsetRecog_UTF8.java
        • tika-parsers/src/main/java/org/apache/tika/parser/txt/CharsetDetector.java
        • tika-parsers/src/main/java/org/apache/tika/parser/txt/CharsetRecog_mbcs.java
        • tika-parsers/src/main/java/org/apache/tika/parser/txt/CharsetRecog_Unicode.java
        • tika-parsers/src/main/java/org/apache/tika/parser/txt/CharsetMatch.java
        TIKA-2041 – add unit test in HTMLParserTest (tallison: rev 7dc5c671f892bac79d8fb2d55e9e94718ca31cbe)
        • CHANGES.txt
        • tika-parsers/src/test/java/org/apache/tika/parser/html/HtmlParserTest.java
        gagravarr Nick Burch added a comment -

        We added the EBCDIC_500_ family of detectors into our own copy of the library. We tried but failed to get it accepted upstream.

        We should keep it, but possibly also put in a note that it's Tika-custom to avoid confusion on any future upgrades!

        tallison@mitre.org Tim Allison added a comment -

        Good to know. Any other custom changes?

        tallison@mitre.org Tim Allison added a comment -

        Thank you, Christian L., Florian Leitner, Christian Aistleitner for finding this issue and helping us to find the cause.

        Christian L., thank you for taking the first (very important!) step of contacting us about this issue. Now you know how to reach us. Let us know what else you find.

        hudson Hudson added a comment -

        FAILURE: Integrated in tika-2.x-windows #29 (See https://builds.apache.org/job/tika-2.x-windows/29/)
        TIKA-2041, upgrade ICU4j's charset detector to avoid multithreading bug. (tallison: rev 9f6c71fa69eaae558aff85cfa0dce72bca08fd4e)

        • CHANGES.txt
        • tika-parser-modules/tika-parser-text-module/src/main/java/org/apache/tika/parser/txt/CharsetRecognizer.java
        • tika-parser-modules/tika-parser-web-module/src/test/java/org/apache/tika/parser/html/HtmlParserTest.java
        • tika-parser-modules/tika-parser-text-module/src/main/java/org/apache/tika/parser/txt/CharsetMatch.java
        • tika-parser-modules/tika-parser-text-module/src/main/java/org/apache/tika/parser/txt/CharsetRecog_sbcs.java
        • tika-parser-modules/tika-parser-text-module/src/main/java/org/apache/tika/parser/txt/CharsetRecog_mbcs.java
        • tika-parser-modules/tika-parser-text-module/src/main/java/org/apache/tika/parser/txt/CharsetRecog_UTF8.java
        • tika-parser-modules/tika-parser-text-module/src/main/java/org/apache/tika/parser/txt/CharsetRecog_Unicode.java
        • tika-parser-modules/tika-parser-text-module/src/main/java/org/apache/tika/parser/txt/CharsetDetector.java
        • tika-parser-modules/tika-parser-text-module/src/main/java/org/apache/tika/parser/txt/Icu4jEncodingDetector.java
        • tika-parser-modules/tika-parser-text-module/src/main/java/org/apache/tika/parser/txt/CharsetRecog_2022.java
        hudson Hudson added a comment -

        SUCCESS: Integrated in tika-2.x #125 (See https://builds.apache.org/job/tika-2.x/125/)
        TIKA-2041, upgrade ICU4j's charset detector to avoid multithreading bug. (tallison: rev 9f6c71fa69eaae558aff85cfa0dce72bca08fd4e)

        • tika-parser-modules/tika-parser-text-module/src/main/java/org/apache/tika/parser/txt/CharsetMatch.java
        • tika-parser-modules/tika-parser-web-module/src/test/java/org/apache/tika/parser/html/HtmlParserTest.java
        • tika-parser-modules/tika-parser-text-module/src/main/java/org/apache/tika/parser/txt/CharsetRecognizer.java
        • tika-parser-modules/tika-parser-text-module/src/main/java/org/apache/tika/parser/txt/CharsetRecog_mbcs.java
        • tika-parser-modules/tika-parser-text-module/src/main/java/org/apache/tika/parser/txt/Icu4jEncodingDetector.java
        • tika-parser-modules/tika-parser-text-module/src/main/java/org/apache/tika/parser/txt/CharsetRecog_sbcs.java
        • tika-parser-modules/tika-parser-text-module/src/main/java/org/apache/tika/parser/txt/CharsetRecog_UTF8.java
        • tika-parser-modules/tika-parser-text-module/src/main/java/org/apache/tika/parser/txt/CharsetRecog_Unicode.java
        • tika-parser-modules/tika-parser-text-module/src/main/java/org/apache/tika/parser/txt/CharsetDetector.java
        • tika-parser-modules/tika-parser-text-module/src/main/java/org/apache/tika/parser/txt/CharsetRecog_2022.java
        • CHANGES.txt
        c.leitinger Christian L. added a comment -

        Thanks for the fix!
        That was super-quick.
        You rock!

        gagravarr Nick Burch added a comment -

        Running "git log" and "git diff" on the file suggests other custom bits are:

        • CharsetRecog_IBM866_ru (9f5593e6990fddc3471e1b87fb1df2f00c5ed600)
        • comments in the class javadoc on how the recogniser works, and hints on adding new ones (aa944e848894e16adec24eeffe29e11b96ad29a8)
        • tweak for invalid character in the charset / ngram uncertainty (aa944e848894e16adec24eeffe29e11b96ad29a8)

        Ones we customised but now have their own "proper" fixes upstream look to be:

        • byte[] for each byte when detecting IBM420 charset (a157a6f025c4b5eda63881124919b4271f85ba0b)
        • typecasts (424c0ebd503900bdff481e059a3a326dc7add814)
        • for loops (79211cc7a400f3b0b066f0a6401dee1c2fa9c90e)
        tallison@mitre.org Tim Allison added a comment -

        Thanks to Nick Burch for pointing out some other Tika-custom bits. I'm reopening this until I have a chance to add those back.

        tallison@mitre.org Tim Allison added a comment -

        I think I got all of these. Thank you, again, Nick Burch.

        hudson Hudson added a comment -

        FAILURE: Integrated in tika-2.x-windows #33 (See https://builds.apache.org/job/tika-2.x-windows/33/)
        TIKA-2041 - add important diffs between new copy/paste from ICU4J and (tallison: rev b41c0b2a88a314db594ca1f616ae4be71676bbcc)

        • tika-parser-modules/tika-parser-text-module/src/main/java/org/apache/tika/parser/txt/CharsetDetector.java
        • tika-parser-modules/tika-parser-text-module/src/main/java/org/apache/tika/parser/txt/CharsetRecog_sbcs.java
        • tika-parser-modules/tika-parser-text-module/src/main/java/org/apache/tika/parser/txt/CharsetMatch.java
        hudson Hudson added a comment -

        SUCCESS: Integrated in tika-2.x #129 (See https://builds.apache.org/job/tika-2.x/129/)
        TIKA-2041 - add important diffs between new copy/paste from ICU4J and (tallison: rev b41c0b2a88a314db594ca1f616ae4be71676bbcc)

        • tika-parser-modules/tika-parser-text-module/src/main/java/org/apache/tika/parser/txt/CharsetRecog_sbcs.java
        • tika-parser-modules/tika-parser-text-module/src/main/java/org/apache/tika/parser/txt/CharsetDetector.java
        • tika-parser-modules/tika-parser-text-module/src/main/java/org/apache/tika/parser/txt/CharsetMatch.java
        hudson Hudson added a comment -

        SUCCESS: Integrated in Tika-trunk #1090 (See https://builds.apache.org/job/Tika-trunk/1090/)
        TIKA-2041 - add important diffs between new copy/paste from ICU4J and (tallison: rev bd9a9b911b4e0205c9dfd4527063e6e1c0fd0c44)

        • tika-parsers/src/main/java/org/apache/tika/parser/txt/CharsetDetector.java
        • tika-parsers/src/main/java/org/apache/tika/parser/txt/CharsetRecog_sbcs.java
        • tika-parsers/src/main/java/org/apache/tika/parser/txt/CharsetMatch.java

          People

          • Assignee: tallison@mitre.org Tim Allison
          • Reporter: tallison@mitre.org Tim Allison
          • Votes: 0
          • Watchers: 6
