SOLR-2003

report errors for wrongly-encoded files in ResourceLoader.getLines()

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.1, 4.0-ALPHA
    • Component/s: None
    • Labels:
      None

      Description

      ResourceLoader is used to load things like stopwords and synonyms files, but it passes only a Charset when opening them, so it gets the default decoding behavior.

      When you wrap an InputStream in a reader with just a Charset, the decoder you get is:

      decoder = charset.newDecoder().onMalformedInput(
          CodingErrorAction.REPLACE).onUnmappableCharacter(
          CodingErrorAction.REPLACE);
      

      For cases like wrongly encoded stopwords and synonyms files, I think it's more helpful to use CodingErrorAction.REPORT than to silently substitute a replacement character. Then the user gets an exception.
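For illustration, here is a minimal sketch of the REPORT behavior proposed above. The class `StrictLines` and the method `readLinesStrict` are hypothetical names, not Solr's actual ResourceLoader code, and UTF-8 is assumed as the expected encoding.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class StrictLines {
    // Hypothetical helper (not Solr's getLines()): reads UTF-8 lines and
    // fails on malformed input instead of substituting U+FFFD.
    static List<String> readLinesStrict(InputStream in) throws IOException {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT);
        List<String> lines = new ArrayList<>();
        // Passing the decoder (rather than a Charset) keeps the REPORT actions.
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(in, decoder))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        // An ISO-8859-1 encoded word containing a non-ASCII character (byte 0xE9).
        byte[] latin1 = "caf\u00e9\n".getBytes(StandardCharsets.ISO_8859_1);
        try {
            readLinesStrict(new ByteArrayInputStream(latin1));
            System.out.println("decoded without error");
        } catch (CharacterCodingException e) {
            System.out.println("malformed input reported: " + e);
        }
    }
}
```

The only difference from the default path is that the reader is built from a configured CharsetDecoder, so a bad byte sequence surfaces as a CharacterCodingException (a subclass of IOException) instead of a silent U+FFFD.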

      See: http://www.lucidimagination.com/search/document/1e50cb0992727fa1/foreign_characters_question

      1. SOLR-2003_friendly.patch
        2 kB
        Robert Muir
      2. SOLR-2003.patch
        3 kB
        Robert Muir

        Activity

        Robert Muir added a comment -

        Patch with the example from the users list as a test (I encoded the stopwords file as ISO-8859-1 instead of UTF-8).

        Hoss Man added a comment -

        If there's a way to tell that the file is in the "wrong" encoding, then +1 to throwing an exception

        I didn't even know that was possible.

        Robert Muir added a comment -

        > If there's a way to tell that the file is in the "wrong" encoding, then +1 to throwing an exception

        Well, technically, it's just the action to take in an exceptional case, when decoding something malformed (e.g. an illegal byte sequence).
        The default action is to silently ignore the error and substitute a replacement character (U+FFFD), but you can change this to throw an exception.

        So we can't detect all cases, only ones that are "obviously" wrong and cause the decoder to get angry.
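To make the limitation concrete, here is a small sketch (the class `DetectionLimits` and the method `decodesCleanly` are made-up names for illustration). ISO-8859-1 bytes read as UTF-8 trip the decoder, but UTF-8 bytes read as ISO-8859-1 do not, because every byte value is a valid ISO-8859-1 character.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class DetectionLimits {
    // Returns true if the bytes decode in the given charset with no
    // malformed or unmappable input under CodingErrorAction.REPORT.
    static boolean decodesCleanly(byte[] bytes, Charset charset) {
        try {
            charset.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] latin1Bytes = "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1);
        byte[] utf8Bytes = "caf\u00e9".getBytes(StandardCharsets.UTF_8);

        // The lone 0xE9 byte is an incomplete UTF-8 sequence: detected.
        System.out.println(decodesCleanly(latin1Bytes, StandardCharsets.UTF_8));      // false
        // Wrong charset, but every byte maps to some character: not detected.
        System.out.println(decodesCleanly(utf8Bytes, StandardCharsets.ISO_8859_1));   // true
    }
}
```

In the second case the text comes out garbled but no exception is possible, which is why REPORT only catches the "obviously" wrong files.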

        Robert Muir added a comment -

        Committed revision 964430 (trunk) / 964433 (3x)

        Robert Muir added a comment -

        Reopening, as Mark hit this in cloud tests and the exception could be friendlier (it comes from a low-level input stream and doesn't know the filename you were trying to load, etc.).

        Robert Muir added a comment -

        Attached is an improvement to make the error more friendly.
        It wraps the low-level exception, but provides the filename and suggests the file might be in the wrong encoding.
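A rough sketch of that wrapping idea follows. This is not the attached patch: the class `FriendlyLoader`, the `getLines(String, InputStream)` signature, the message wording, and the use of a plain IOException are all assumptions made for illustration.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class FriendlyLoader {
    // Hypothetical loader: resourceName is only used to build the error message.
    static List<String> getLines(String resourceName, InputStream in) throws IOException {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT);
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(in, decoder))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        } catch (CharacterCodingException e) {
            // Keep the low-level exception as the cause, but say which
            // resource failed and hint at the likely reason.
            throw new IOException(
                "Error loading resource (wrong encoding?): " + resourceName, e);
        }
        return lines;
    }

    public static void main(String[] args) {
        // ISO-8859-1 bytes fed to a loader that expects UTF-8.
        byte[] latin1 = "caf\u00e9\n".getBytes(StandardCharsets.ISO_8859_1);
        try {
            getLines("stopwords.txt", new ByteArrayInputStream(latin1));
        } catch (IOException e) {
            System.out.println(e.getMessage());
            System.out.println("caused by: " + e.getCause());
        }
    }
}
```

Keeping the original exception as the cause preserves the decoder's detail, while the outer message tells the user which file to re-save in the right encoding.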

        Robert Muir added a comment -

        Forgot to mark this resolved.

        Committed "friendly error" improvement revisions 964832 / 964838 (3x)

        Grant Ingersoll added a comment -

        Bulk close for 3.1.0 release


          People

          • Assignee:
            Robert Muir
            Reporter:
            Robert Muir