Solr
SOLR-2346

Non-UTF-8 text files containing non-English text (Japanese/Hebrew) are not getting indexed correctly.

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 1.4.1, 3.1, 4.0-ALPHA
    • Fix Version/s: 3.6, 4.0-ALPHA
    • Labels:
      None
    • Environment:

      Solr 1.4.1, Packaged Jetty as servlet container, Windows XP SP1, Machine was booted in Japanese Locale.

      Description

      I am able to successfully index and search non-English files (e.g. Hebrew, Japanese) that are encoded in UTF-8. However, when I tried to index data encoded in a local encoding such as Big5, I could not see the desired results: the content looked garbled for the Big5-encoded document when I searched for all indexed documents. When I index the attached non-UTF-8 file, it is indexed as follows:

      <result name="response" numFound="1" start="0">
        <doc>
          <arr name="attr_content">
            <str>�� ������</str>
          </arr>
          <arr name="attr_content_encoding">
            <str>Big5</str>
          </arr>
          <arr name="attr_content_language">
            <str>zh</str>
          </arr>
          <arr name="attr_language">
            <str>zh</str>
          </arr>
          <arr name="attr_stream_size">
            <str>17</str>
          </arr>
          <arr name="content_type">
            <str>text/plain</str>
          </arr>
          <str name="id">doc2</str>
        </doc>
      </result>
      </response>

      You said it indexes files in UTF-8; however, it seems that the non-UTF-8 file gets indexed in its Big5 encoding.
      Here I tried fetching the indexed data as Big5 and converting it to UTF-8:

      String id = (String) resulDocument.getFirstValue("attr_content");
      byte[] bytearray = id.getBytes("Big5");
      String utf8String = new String(bytearray, "UTF-8");
      It does not give the expected results.

      When I index a UTF-8 file, it is indexed as follows:

      <doc>
        <arr name="attr_content">
          <str>マイ ネットワーク</str>
        </arr>
        <arr name="attr_content_encoding">
          <str>UTF-8</str>
        </arr>
        <arr name="attr_stream_content_type">
          <str>text/plain</str>
        </arr>
        <arr name="attr_stream_name">
          <str>sample_jap_unicode.txt</str>
        </arr>
        <arr name="attr_stream_size">
          <str>28</str>
        </arr>
        <arr name="attr_stream_source_info">
          <str>myfile</str>
        </arr>
        <arr name="content_type">
          <str>text/plain</str>
        </arr>
        <str name="id">doc2</str>
      </doc>

      So, I can index and search UTF-8 data.

      For more reference, below is the discussion with Yonik.
      Please find attached the TXT file which I was using to index and search.

      curl "http://localhost:8983/solr/update/extract?literal.id=doc1&uprefix=attr_&fmap.content=attr_content&fmap.div=foo_t&boost.foo_t=3&commit=true&charset=utf-8" -F "myfile=@sample_jap_non_UTF-8"

      One problem is that you are giving big5 encoded text to Solr and saying that it's UTF8.
      Here's one way to actually tell solr what the encoding of the text you are sending is:

      curl "http://localhost:8983/solr/update/extract?literal.id=doc1&uprefix=attr_&fmap.content=attr_content&fmap.div=foo_t&boost.foo_t=3&commit=true" --data-binary @sample_jap_non_UTF-8.txt -H 'Content-type:text/plain; charset=big5'

      Now the problem appears to be that, for some reason, this doesn't work...
      Could you open a JIRA issue and attach your two test files?

      -Yonik
      http://lucidimagination.com

      1. sample_jap_UTF-8.txt
        0.0 kB
        Prasad Deshpande
      2. sample_jap_non_UTF-8.txt
        0.0 kB
        Prasad Deshpande
      3. NormalSave.msg
        11 kB
        Prasad Deshpande
      4. UnicodeSave.msg
        11 kB
        Prasad Deshpande
      5. SOLR-2346.patch
        2 kB
        Koji Sekiguchi
      6. SOLR-2346.patch
        2 kB
        Koji Sekiguchi
      7. SOLR-2346.patch
        2 kB
        Koji Sekiguchi

        Activity

        Prasad Deshpande added a comment -

        I have verified the use case using the attached files.

        Robert Muir added a comment -
        String id = (String) resulDocument.getFirstValue("attr_content");
        byte[] bytearray = id.getBytes("Big5");
        String utf8String = new String(bytearray, "UTF-8");
        It does not give the expected results.
        

        You cannot convert character sets this way in Java (asking for the bytes as Big5 and
        then making a String as UTF-8)... this is wrong.

        Yonik Seeley added a comment -

        From the email thread:

        One problem is that you are giving big5 encoded text to Solr and saying that it's UTF8.
        Here's one way to actually tell solr what the encoding of the text you are sending is:

        curl "http://localhost:8983/solr/update/extract?literal.id=doc1&uprefix=attr_&fmap.content=attr_content&fmap.div=foo_t&boost.foo_t=3&commit=true" --data-binary @sample_jap_non_UTF-8.txt -H 'Content-type:text/plain; charset=big5'

        Now the problem appears to be that, for some reason, this doesn't work...
        Could you open a JIRA issue and attach your two test files?

        Robert Muir added a comment -

        Right, I agree Solr should work with a non-UTF-8 encoded doc in this way,
        but the code is still wrong: it's not a correct way to convert characters in Java.

        Prasad Deshpande added a comment - edited

        I agree; I was just trying to decode the garbled characters so that they would be readable to the user. Still, the problem is in indexing: all the characters get garbled during indexing.

        Prasad Deshpande added a comment -

        I hope the following issue is the same one.

        Attached above are the Hebrew *.msg files which I tried to index using the following command.
        curl "http://localhost:8983/solr/update/extract?literal.id=doc1&uprefix=attr_&fmap.content=attr_content&fmap.div=foo_t&boost.foo_t=3&commit=true" -F "myfile=@NormalSave.msg"

        The file UnicodeSave.msg was saved as "Outlook Message Format - Unicode" and NormalSave.msg was saved as "Outlook Message Format".
        When I search with *:* in Solr, it returns junk characters for NormalSave.msg, and for UnicodeSave.msg it returns an empty "attr_content".

        Koji Sekiguchi added a comment -

        I've faced the same problem. I'm trying to index a Shift_JIS encoded text file through the following request:

        http://localhost:8983/solr/update/extract?literal.id=docA0000001&stream.file=/foo/bar/sjis.txt&commit=true&stream.contentType=text%2Fplain%3B+charset%3DShift_JIS

        But Tika's AutoDetectParser doesn't respect Solr's charset (or Solr doesn't pass the content type to the Tika parser; I should dig in).

        I looked into ExtractingDocumentLoader.java and it seemed that I could select an appropriate parser by using the stream.type parameter:

        ExtractingDocumentLoader.java
        public void load(SolrQueryRequest req, SolrQueryResponse rsp, ContentStream stream) throws IOException {
          errHeader = "ExtractingDocumentLoader: " + stream.getSourceInfo();
          Parser parser = null;
          String streamType = req.getParams().get(ExtractingParams.STREAM_TYPE, null);
          if (streamType != null) {
            //Cache?  Parsers are lightweight to construct and thread-safe, so I'm told
            MediaType mt = MediaType.parse(streamType.trim().toLowerCase());
            parser = config.getParser(mt);
          } else {
            parser = autoDetectParser;
          }
          :
        }
        

        The request was:

        http://localhost:8983/solr/update/extract?literal.id=docA0000001&stream.file=/foo/bar/sjis.txt&commit=true&stream.contentType=text%2Fplain%3B+charset%3DShift_JIS&stream.type=text%2Fplain

        I could select TXTParser rather than AutoDetectParser, but the problem wasn't solved.

        And I looked at Tika Javadoc for TXTParser and it said "The text encoding of the document stream is automatically detected based on the byte patterns found at the beginning of the stream. The input metadata key HttpHeaders.CONTENT_ENCODING is used as an encoding hint if the automatic encoding detection fails.":

        http://tika.apache.org/0.8/api/org/apache/tika/parser/txt/TXTParser.html

        So I tried inserting the following hard-coded fix:

        ExtractingDocumentLoader.java
        Metadata metadata = new Metadata();
        metadata.add(ExtractingMetadataConstants.STREAM_NAME, stream.getName());
        metadata.add(ExtractingMetadataConstants.STREAM_SOURCE_INFO, stream.getSourceInfo());
        metadata.add(ExtractingMetadataConstants.STREAM_SIZE, String.valueOf(stream.getSize()));
        metadata.add(ExtractingMetadataConstants.STREAM_CONTENT_TYPE, stream.getContentType());
        metadata.add(HttpHeaders.CONTENT_ENCODING, "Shift_JIS");   // <= temporary fix
        

        and the problem was gone (no more garbled characters were indexed).

        Koji Sekiguchi added a comment -

        Looking at Tika, HtmlParser and TXTParser read the HttpHeaders.CONTENT_ENCODING value from the metadata.
        I think Solr should set it in the metadata if a charset value is provided by the user.

        Koji Sekiguchi added a comment -

        Attached is a patch that solves my problem.

        Prasad, can you try the patch with your Big5 text and see the result?

        Robert Muir added a comment -

        Bulk move 3.2 -> 3.3

        Robert Muir added a comment -

        3.4 -> 3.5

        Shinichiro Abe added a comment -

        I've faced the same problem. Tika parsed my Shift_JIS file as windows-1252, so I could not see the desired results. I can index the file correctly by applying Koji's patch, but the patch only takes effect for remote streaming, not for POST. So I changed part of the code as below.

              // Take the charset from the stream.contentType request parameter
              // rather than from the uploaded stream's own content type:
              //String charset = ContentStreamBase.getCharsetFromContentType(stream.getContentType());
              String contentType = req.getParams().get(CommonParams.STREAM_CONTENTTYPE, null);
              String charset = ContentStreamBase.getCharsetFromContentType(contentType);
        
        Koji Sekiguchi added a comment -

        I can index the file correctly by applying Koji's patch. But this patch is effective for remote streaming, not for POST.

        I don't understand what you said or your fix (I don't understand why you use CommonParams.STREAM_CONTENTTYPE to fix your POST case).

        If by POST you meant the curl command, you can set the content type via the -H parameter.

        Koji Sekiguchi added a comment -

        New patch attached. I updated it for current trunk and changed the getCharsetFromContentType() method to remove unnecessary strings after the charset value.

        I think this is ready to go.

        Koji Sekiguchi added a comment -

        getCharsetFromContentType() method to remove unnecessary strings after the charset value.

        My fault; this is not necessary. I should add the --data-binary option to curl instead.

        Koji Sekiguchi added a comment -

        Committed to trunk and 3x.

        Uwe Schindler added a comment -

        Nice fix; it is in line with the other charset handling, e.g. for XML imports using the standard request handler. I fixed the incorrect XML handling in Solr a year ago and did the same thing, passing the charset to the XML parser as a "hint".


          People

          • Assignee: Koji Sekiguchi
          • Reporter: Prasad Deshpande
          • Votes: 1
          • Watchers: 0
