Tika / TIKA-2484

Improve CharsetDetector to recognize UTF-16LE/BE, UTF-32LE/BE and UTF-7 with/without BOMs correctly


Details

    • Type: Improvement
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 1.16, 1.17
    • Fix Version/s: None
    • Component/s: parser
    • Labels: None

    Description

      I would like to help improve the recognition accuracy of the CharsetDetector.

      To that end, I created a test set of text/plain files to check the quality of org.apache.tika.parser.txt.CharsetDetector: charset.tar.gz
      (The test set was derived from http://source.icu-project.org/repos/icu/icu4j/tags/release-4-8/main/tests/core/src/com/ibm/icu/dev/test/charsetdet/CharsetDetectionTests.xml)

      The test set was processed with Tika 1.17 (commit 877d621, HEAD as of 2017-10-26) and with the ICU4J 59.1 CharsetDetector, in each case also with a custom UTF-7 recognizer added. Here are the results:

      All runs used charset.tar.gz (341 files):

      Tika 1.17:                       165/341 correct
      Tika 1.17 + UTF-7 recognizer:    213/341 correct
      ICU4J 59.1 + UTF-7 recognizer:   333/341 correct

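      For reference, here is a minimal sketch of the kind of harness that can produce such numbers. It assumes the expected charset is the second dot-separated token of each file name (as in the attached files, e.g. "IUC10-fr.UTF-16LE.with-BOM"); the class name and path handling are illustrative only, not the harness actually used:

      package test;

      import java.nio.file.DirectoryStream;
      import java.nio.file.Files;
      import java.nio.file.Path;
      import java.nio.file.Paths;

      import org.apache.tika.parser.txt.CharsetDetector;
      import org.apache.tika.parser.txt.CharsetMatch;

      public class DetectorHarness {
          public static void main(String[] args) throws Exception {
              int correct = 0;
              int total = 0;
              try (DirectoryStream<Path> files = Files.newDirectoryStream(Paths.get(args[0]))) {
                  for (Path file : files) {
                      // Expected charset, e.g. "UTF-16LE" from "IUC10-fr.UTF-16LE.with-BOM"
                      String expected = file.getFileName().toString().split("\\.")[1];
                      CharsetDetector detector = new CharsetDetector();
                      detector.setText(Files.readAllBytes(file));
                      CharsetMatch match = detector.detect();
                      if (match != null && match.getName().equalsIgnoreCase(expected)) {
                          correct++;
                      }
                      total++;
                  }
              }
              System.out.println("Correct recognitions: " + correct + "/" + total);
          }
      }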
      As the UTF-7 recognizer I used these two simple classes:

      package test.utils;
      
      import java.util.Arrays;
      
      /**
       * Pattern state container for the Boyer-Moore algorithm
       */
      public final class BoyerMoorePattern
      {
      
          private final byte[] pattern;
      
          private final int[] skipArray;
      
          public BoyerMoorePattern(byte[] pattern)
          {
              this.pattern = pattern;
              skipArray = new int[256];
              Arrays.fill(skipArray, -1);
              // Initialize with pattern values
              for (int i = 0; i < pattern.length; i++)
              {
                  skipArray[pattern[i] & 0xFF] = i;
              }
          }
      
          /**
           * Get the pattern length
           * 
           * @return length as int
           */
          public int getLength()
          {
              return pattern.length;
          }
      
          /**
           * Searches for the first occurrence of the pattern in the input byte array.
           * 
           * @param data - The data we want to search in
           * @param startIdx - The startindex
           * @param endIdx - The endindex
           * @return offset as int or -1 if not found at all
           */
          public final int searchPattern(byte[] data, int startIdx, int endIdx)
          {
              int patternLength = pattern.length;
              int skip = 0;
              for (int i = startIdx; i <= endIdx - patternLength; i += skip)
              {
                  skip = 0;
                  for (int j = patternLength - 1; j >= 0; j--)
                  {
                      if (pattern[j] != data[i + j])
                      {
                          skip = Math.max(1, j - skipArray[data[i + j] & 0xFF]);
                          break;
                      }
                  }
                  if (skip == 0)
                  {
                      return i;
                  }
              }
      
              return -1;
          }
      
          /**
           * Searches for the first occurrence of the pattern in the input byte array.
           * 
           * @param data - The data we want to search in
           * @param startIdx - The startindex
           * @return offset as int or -1 if not found at all
           */
          public final int searchPattern(byte[] data, int startIdx)
          {
              return searchPattern(data, startIdx, data.length);
          }
      
          /**
           * Searches for the first occurrence of the pattern in the input byte array.
           * 
           * @param data - The data we want to search in
           * @return offset as int or -1 if not found at all
           */
          public final int searchPattern(byte[] data)
          {
              return searchPattern(data, 0, data.length);
          }
      }
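
      For illustration, a short hypothetical use of this class (the input string is made up):

      byte[] data = "abc +SGVsbG8- xyz".getBytes(java.nio.charset.StandardCharsets.US_ASCII);
      BoyerMoorePattern plus = new BoyerMoorePattern("+".getBytes(java.nio.charset.StandardCharsets.US_ASCII));
      int idx = plus.searchPattern(data);  // returns 4, the index of the '+'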
      
      package test;
      
      import java.io.IOException;
      import java.io.InputStream;
      import java.nio.charset.Charset;
      import java.nio.charset.StandardCharsets;
      import java.util.regex.Matcher;
      import java.util.regex.Pattern;
      
      import org.apache.commons.io.IOUtils;
      import org.apache.tika.detect.EncodingDetector;
      import org.apache.tika.metadata.Metadata;
      import org.apache.tika.parser.txt.CharsetDetector;
      import org.apache.tika.parser.txt.CharsetMatch;
      
      import test.utils.BoyerMoorePattern;
      
      
      public class MyEncodingDetector implements EncodingDetector {

          public Charset detect(InputStream input, Metadata metadata)
                  throws IOException {

              // Read the stream once so the same bytes can be given to the
              // CharsetDetector and to the UTF-7 check below.
              byte[] data = IOUtils.toByteArray(input);

              CharsetDetector detector = new CharsetDetector();
              detector.setText(data);
              CharsetMatch match = detector.detect();

              // Determines whether the data is really UTF-7 or the match stands
              String charsetName = isItUtf7(match, data);
              if (charsetName != null) {
                  return Charset.forName(charsetName);
              }
              return null;
          }
      
      
          /**
           * Checks for a UTF-7 BOM or UTF-7-like content.
           * 
           * @param match - The default match we expect if it is not UTF-7
           * @param data - The byte array we want to check
           * 
           * @return "UTF-7", the name of the default match, or null
           */
          private String isItUtf7(CharsetMatch match, byte[] data) {
              if (isUTF7withBOM(data) || isUTF7withoutBOM(data)) {
                  return "UTF-7";
              }
              if (match != null) {
                  return match.getName();
              }
              return null;
          }
          
          private boolean isUTF7withBOM(byte[] data) {
              // Check the byte array for one of the four UTF-7 "byte order mark"
              // (BOM) variants: "+/v8", "+/v9", "+/v+", "+/v/"
              // (decimal: 43 47 118 followed by 56, 57, 43 or 47).
              if ((data.length >= 4 && data[0] == 43 && data[1] == 47 && data[2] == 118)
                      && (data[3] == 56 || data[3] == 57 || data[3] == 43 || data[3] == 47)) {
                  return true;
              }
              return false;
          }
          
          private boolean isUTF7withoutBOM(byte[] data) {
              byte[] utf7StartPattern = "+".getBytes(StandardCharsets.US_ASCII);
              byte[] utf7EndPattern = "-".getBytes(StandardCharsets.US_ASCII);
              BoyerMoorePattern bmpattern = new BoyerMoorePattern(utf7StartPattern);
              int startPosSP = bmpattern.searchPattern(data);

              BoyerMoorePattern empattern = new BoyerMoorePattern(utf7EndPattern);
              int startPosEP = empattern.searchPattern(data);

              if (startPosSP != -1 && startPosEP != -1) {
                  // Both '+' and '-' occur, so check the basic UTF-7 shape with a
                  // regular expression: '+', a letter, then at least two more
                  // word characters, terminated by '-'.
                  Pattern p = Pattern.compile("\\+[a-zA-Z]\\w{2,}\\-");
                  // ISO-8859-1 maps every byte 1:1 to a char, so the regex sees
                  // the raw bytes instead of platform-default decoding artifacts.
                  Matcher m = p.matcher(new String(data, StandardCharsets.ISO_8859_1));

                  int numberMatches = 0;
                  while (m.find()) {
                      numberMatches++;
                  }

                  System.out.println("Number of possible UTF-7 regex matches: " + numberMatches);

                  int minimumMatches = 3;

                  // If there are more than minimumMatches "+xxx-" runs,
                  // assume the encoding is UTF-7.
                  if (numberMatches > minimumMatches) {
                      return true;
                  }
              }

              return false;
          }
      }
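
      A short hypothetical usage example for the detector above (the file name is a placeholder):

      package test;

      import java.io.BufferedInputStream;
      import java.io.InputStream;
      import java.nio.charset.Charset;
      import java.nio.file.Files;
      import java.nio.file.Paths;

      import org.apache.tika.metadata.Metadata;

      public class DetectorDemo {
          public static void main(String[] args) throws Exception {
              // "sample.UTF-7.without-BOM" is a placeholder file name
              try (InputStream in = new BufferedInputStream(
                      Files.newInputStream(Paths.get("sample.UTF-7.without-BOM")))) {
                  Charset charset = new MyEncodingDetector().detect(in, new Metadata());
                  System.out.println("Detected charset: " + charset);
              }
          }
      }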
      

      There might be some false positive (FP) recognitions with the current regex and the fixed match count.
      A better approach might be to scale minimumMatches with the amount of text given to the detector, for example as sketched below.
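
      A minimal sketch of such scaling (the divisor 512 is an arbitrary assumption, not a tuned value):

      // Hypothetical scaling: require roughly one "+xxx-" run per 512 bytes of input,
      // but never fewer than 3 matches overall.
      int minimumMatches = Math.max(3, data.length / 512);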

      This is just a simple first attempt, not production-ready code. It does not even cover all possible UTF-7 strings.

      By the way:

      I am perfectly aware that the current test set only covers a few encodings. However, the attached files address the main weakness of the current CharsetDetector.

      I don't know the history that led to the creation of the CharsetDetector in Tika, or why ICU4J's detector was rebuilt with extensions such as cp866 ngram detection instead of contributing to ICU4J development.
      Wouldn't it be better to forward the CharsetDetector changes to the ICU4J developers so they can implement the missing encodings?

      Is it planned to include the newest version of ICU4J in future releases of TIKA?

      What about using neural networks to determine some or all charsets (given that there are enough test files)?

      Attachments

        1. IUC10-fr.UTF-7.without-BOM
          0.7 kB
          Andreas Meier
        2. IUC10-fr.UTF-7.with-BOM
          0.7 kB
          Andreas Meier
        3. IUC10-fr.UTF-32LE.without-BOM
          3 kB
          Andreas Meier
        4. IUC10-fr.UTF-32LE.with-BOM
          3 kB
          Andreas Meier
        5. IUC10-fr.UTF-32BE.without-BOM
          3 kB
          Andreas Meier
        6. IUC10-fr.UTF-32BE.with-BOM
          3 kB
          Andreas Meier
        7. IUC10-fr.UTF-16LE.without-BOM
          1 kB
          Andreas Meier
        8. IUC10-fr.UTF-16LE.with-BOM
          1 kB
          Andreas Meier
        9. IUC10-fr.UTF-16BE.without-BOM
          1 kB
          Andreas Meier
        10. IUC10-fr.UTF-16BE.with-BOM
          1 kB
          Andreas Meier
        11. IUC10-ar.UTF-7.without-BOM
          1 kB
          Andreas Meier
        12. IUC10-ar.UTF-7.with-BOM
          1 kB
          Andreas Meier
        13. IUC10-ar.UTF-32LE.without-BOM
          2 kB
          Andreas Meier
        14. IUC10-ar.UTF-32LE.with-BOM
          2 kB
          Andreas Meier
        15. IUC10-ar.UTF-32BE.without-BOM
          2 kB
          Andreas Meier
        16. IUC10-ar.UTF-32BE.with-BOM
          2 kB
          Andreas Meier
        17. IUC10-ar.UTF-16LE.without-BOM
          1 kB
          Andreas Meier
        18. IUC10-ar.UTF-16LE.with-BOM
          1 kB
          Andreas Meier
        19. IUC10-ar.UTF-16BE.without-BOM
          1 kB
          Andreas Meier
        20. IUC10-ar.UTF-16BE.with-BOM
          1 kB
          Andreas Meier
        21. charset.zip
          196 kB
          Tim Allison

            People

              Assignee: Unassigned
              Reporter: Andreas Meier
              Votes: 0
              Watchers: 4
