Tika / TIKA-2484

Improve CharsetDetector to recognize UTF-16LE/BE, UTF-32LE/BE and UTF-7 with/without BOMs correctly


Details

    • Type: Improvement
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 1.16, 1.17
    • Fix Version/s: None
    • Component/s: parser
    • Labels: None

    Description

      I would like to help improve the recognition accuracy of the CharsetDetector.

      To that end, I created a test set of text/plain files to check the quality of org.apache.tika.parser.txt.CharsetDetector: charset.tar.gz
      (The test set was derived from http://source.icu-project.org/repos/icu/icu4j/tags/release-4-8/main/tests/core/src/com/ibm/icu/dev/test/charsetdet/CharsetDetectionTests.xml)

      The test set was processed with Tika 1.17 (commit 877d621, HEAD as of 2017-10-26) and with the ICU4J 59.1 CharsetDetector, in each case also with a custom UTF-7 recognizer added. Here are the results:

      All runs used charset.tar.gz (341 files):

      Tika 1.17:                       165/341 correct
      Tika 1.17 + UTF-7 recognizer:    213/341 correct
      ICU4J 59.1 + UTF-7 recognizer:   333/341 correct

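      For reference, here is a minimal sketch of the kind of harness that can produce such numbers. It assumes the expected charset is the second dot-separated token of each file name (as in the attached files, e.g. "IUC10-fr.UTF-16LE.with-BOM"); the class name and path handling are illustrative only, not the harness actually used:

      package test;

      import java.nio.file.DirectoryStream;
      import java.nio.file.Files;
      import java.nio.file.Path;
      import java.nio.file.Paths;

      import org.apache.tika.parser.txt.CharsetDetector;
      import org.apache.tika.parser.txt.CharsetMatch;

      public class DetectorHarness {
          public static void main(String[] args) throws Exception {
              int correct = 0;
              int total = 0;
              try (DirectoryStream<Path> files = Files.newDirectoryStream(Paths.get(args[0]))) {
                  for (Path file : files) {
                      // Expected charset, e.g. "UTF-16LE" from "IUC10-fr.UTF-16LE.with-BOM"
                      String expected = file.getFileName().toString().split("\\.")[1];
                      CharsetDetector detector = new CharsetDetector();
                      detector.setText(Files.readAllBytes(file));
                      CharsetMatch match = detector.detect();
                      if (match != null && match.getName().equalsIgnoreCase(expected)) {
                          correct++;
                      }
                      total++;
                  }
              }
              System.out.println("Correct recognitions: " + correct + "/" + total);
          }
      }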
      As the UTF-7 recognizer I used these two simple classes:

      package test.utils;
      
      import java.util.Arrays;
      
      /**
       * Pattern state container for the Boyer-Moore algorithm
       */
      public final class BoyerMoorePattern
      {
      
          private final byte[] pattern;
      
          private final int[] skipArray;
      
          public BoyerMoorePattern(byte[] pattern)
          {
              this.pattern = pattern;
              skipArray = new int[256];
              Arrays.fill(skipArray, -1);
              // Initialize with pattern values
              for (int i = 0; i < pattern.length; i++)
              {
                  skipArray[pattern[i] & 0xFF] = i;
              }
          }
      
          /**
           * Get the pattern length
           * 
           * @return length as int
           */
          public int getLength()
          {
              return pattern.length;
          }
      
          /**
           * Searches for the first occurrence of the pattern in the input byte array.
           * 
           * @param data - The data we want to search in
           * @param startIdx - The startindex
           * @param endIdx - The endindex
           * @return offset as int or -1 if not found at all
           */
          public final int searchPattern(byte[] data, int startIdx, int endIdx)
          {
              int patternLength = pattern.length;
              int skip = 0;
              for (int i = startIdx; i <= endIdx - patternLength; i += skip)
              {
                  skip = 0;
                  for (int j = patternLength - 1; j >= 0; j--)
                  {
                      if (pattern[j] != data[i + j])
                      {
                          skip = Math.max(1, j - skipArray[data[i + j] & 0xFF]);
                          break;
                      }
                  }
                  if (skip == 0)
                  {
                      return i;
                  }
              }
      
              return -1;
          }
      
          /**
           * Searches for the first occurrence of the pattern in the input byte array.
           * 
           * @param data - The data we want to search in
           * @param startIdx - The startindex
           * @return offset as int or -1 if not found at all
           */
          public final int searchPattern(byte[] data, int startIdx)
          {
              return searchPattern(data, startIdx, data.length);
          }
      
          /**
           * Searches for the first occurrence of the pattern in the input byte array.
           * 
           * @param data - The data we want to search in
           * @return offset as int or -1 if not found at all
           */
          public final int searchPattern(byte[] data)
          {
              return searchPattern(data, 0, data.length);
          }
      }
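
      For illustration, a short hypothetical use of this class (the input string is made up):

      byte[] data = "abc +SGVsbG8- xyz".getBytes(java.nio.charset.StandardCharsets.US_ASCII);
      BoyerMoorePattern plus = new BoyerMoorePattern("+".getBytes(java.nio.charset.StandardCharsets.US_ASCII));
      int idx = plus.searchPattern(data);  // returns 4, the index of the '+'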
      
      package test;
      
      import java.io.IOException;
      import java.io.InputStream;
      import java.nio.charset.Charset;
      import java.nio.charset.StandardCharsets;
      import java.util.regex.Matcher;
      import java.util.regex.Pattern;
      
      import org.apache.commons.io.IOUtils;
      import org.apache.tika.detect.EncodingDetector;
      import org.apache.tika.metadata.Metadata;
      import org.apache.tika.parser.txt.CharsetDetector;
      import org.apache.tika.parser.txt.CharsetMatch;
      
      import test.utils.BoyerMoorePattern;
      
      
      public class MyEncodingDetector implements EncodingDetector {

          public Charset detect(InputStream input, Metadata metadata)
                  throws IOException {

              // Read the stream once so the same bytes can be given to the
              // CharsetDetector and to the UTF-7 check below.
              byte[] data = IOUtils.toByteArray(input);

              CharsetDetector detector = new CharsetDetector();
              detector.setText(data);
              CharsetMatch match = detector.detect();

              // Determines whether the data is really UTF-7 or the match stands
              String charsetName = isItUtf7(match, data);
              if (charsetName != null) {
                  return Charset.forName(charsetName);
              }
              return null;
          }
      
      
          /**
           * Checks for a UTF-7 BOM or UTF-7-like content.
           * 
           * @param match - The default match we expect if it is not UTF-7
           * @param data - The byte array we want to check
           * 
           * @return "UTF-7", the name of the default match, or null
           */
          private String isItUtf7(CharsetMatch match, byte[] data) {
              if (isUTF7withBOM(data) || isUTF7withoutBOM(data)) {
                  return "UTF-7";
              }
              if (match != null) {
                  return match.getName();
              }
              return null;
          }
          
          private boolean isUTF7withBOM(byte[] data) {
              // Check the byte array for one of the four UTF-7 "byte order mark"
              // (BOM) variants: "+/v8", "+/v9", "+/v+", "+/v/"
              // (decimal: 43 47 118 followed by 56, 57, 43 or 47).
              if ((data.length >= 4 && data[0] == 43 && data[1] == 47 && data[2] == 118)
                      && (data[3] == 56 || data[3] == 57 || data[3] == 43 || data[3] == 47)) {
                  return true;
              }
              return false;
          }
          
          private boolean isUTF7withoutBOM(byte[] data) {
              byte[] utf7StartPattern = "+".getBytes(StandardCharsets.US_ASCII);
              byte[] utf7EndPattern = "-".getBytes(StandardCharsets.US_ASCII);
              BoyerMoorePattern bmpattern = new BoyerMoorePattern(utf7StartPattern);
              int startPosSP = bmpattern.searchPattern(data);

              BoyerMoorePattern empattern = new BoyerMoorePattern(utf7EndPattern);
              int startPosEP = empattern.searchPattern(data);

              if (startPosSP != -1 && startPosEP != -1) {
                  // Both '+' and '-' occur, so check the basic UTF-7 shape with a
                  // regular expression: '+', a letter, then at least two more
                  // word characters, terminated by '-'.
                  Pattern p = Pattern.compile("\\+[a-zA-Z]\\w{2,}\\-");
                  // ISO-8859-1 maps every byte 1:1 to a char, so the regex sees
                  // the raw bytes instead of platform-default decoding artifacts.
                  Matcher m = p.matcher(new String(data, StandardCharsets.ISO_8859_1));

                  int numberMatches = 0;
                  while (m.find()) {
                      numberMatches++;
                  }

                  System.out.println("Number of possible UTF-7 regex matches: " + numberMatches);

                  int minimumMatches = 3;

                  // If there are more than minimumMatches "+xxx-" runs,
                  // assume the encoding is UTF-7.
                  if (numberMatches > minimumMatches) {
                      return true;
                  }
              }

              return false;
          }
      }
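
      A short hypothetical usage example for the detector above (the file name is a placeholder):

      package test;

      import java.io.BufferedInputStream;
      import java.io.InputStream;
      import java.nio.charset.Charset;
      import java.nio.file.Files;
      import java.nio.file.Paths;

      import org.apache.tika.metadata.Metadata;

      public class DetectorDemo {
          public static void main(String[] args) throws Exception {
              // "sample.UTF-7.without-BOM" is a placeholder file name
              try (InputStream in = new BufferedInputStream(
                      Files.newInputStream(Paths.get("sample.UTF-7.without-BOM")))) {
                  Charset charset = new MyEncodingDetector().detect(in, new Metadata());
                  System.out.println("Detected charset: " + charset);
              }
          }
      }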
      

      There might be some false positive (FP) recognitions with the current regex and the fixed match count.
      A better approach might be to scale minimumMatches with the amount of text given to the detector, for example as sketched below.
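
      A minimal sketch of such scaling (the divisor 512 is an arbitrary assumption, not a tuned value):

      // Hypothetical scaling: require roughly one "+xxx-" run per 512 bytes of input,
      // but never fewer than 3 matches overall.
      int minimumMatches = Math.max(3, data.length / 512);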

      This is just a simple first attempt, not production-ready code. It does not even cover all possible UTF-7 strings.

      By the way:

      I am perfectly aware that the current test set only covers a few encodings. However, the attached files address the main weakness of the current CharsetDetector.

      I don't know the history that led to the creation of the CharsetDetector in Tika, or why ICU4J's detector was rebuilt with extensions such as cp866 ngram detection instead of contributing to ICU4J development.
      Wouldn't it be better to forward the CharsetDetector changes to the ICU4J developers so they can implement the missing encodings?

      Is it planned to include the newest version of ICU4J in future releases of TIKA?

      What about using neural networks to determine some or all charsets (given that there are enough test files)?

      Attachments

        1. IUC10-fr.UTF-7.without-BOM
          0.7 kB
          Andreas Meier
        2. IUC10-fr.UTF-7.with-BOM
          0.7 kB
          Andreas Meier
        3. IUC10-fr.UTF-32LE.without-BOM
          3 kB
          Andreas Meier
        4. IUC10-fr.UTF-32LE.with-BOM
          3 kB
          Andreas Meier
        5. IUC10-fr.UTF-32BE.without-BOM
          3 kB
          Andreas Meier
        6. IUC10-fr.UTF-32BE.with-BOM
          3 kB
          Andreas Meier
        7. IUC10-fr.UTF-16LE.without-BOM
          1 kB
          Andreas Meier
        8. IUC10-fr.UTF-16LE.with-BOM
          1 kB
          Andreas Meier
        9. IUC10-fr.UTF-16BE.without-BOM
          1 kB
          Andreas Meier
        10. IUC10-fr.UTF-16BE.with-BOM
          1 kB
          Andreas Meier
        11. IUC10-ar.UTF-7.without-BOM
          1 kB
          Andreas Meier
        12. IUC10-ar.UTF-7.with-BOM
          1 kB
          Andreas Meier
        13. IUC10-ar.UTF-32LE.without-BOM
          2 kB
          Andreas Meier
        14. IUC10-ar.UTF-32LE.with-BOM
          2 kB
          Andreas Meier
        15. IUC10-ar.UTF-32BE.without-BOM
          2 kB
          Andreas Meier
        16. IUC10-ar.UTF-32BE.with-BOM
          2 kB
          Andreas Meier
        17. IUC10-ar.UTF-16LE.without-BOM
          1 kB
          Andreas Meier
        18. IUC10-ar.UTF-16LE.with-BOM
          1 kB
          Andreas Meier
        19. IUC10-ar.UTF-16BE.without-BOM
          1 kB
          Andreas Meier
        20. IUC10-ar.UTF-16BE.with-BOM
          1 kB
          Andreas Meier
        21. charset.zip
          196 kB
          Tim Allison

            People

              Assignee: Unassigned
              Reporter: Andreas Meier
              Votes: 0
              Watchers: 4
