Solr / SOLR-1865

ignore byte-order markers in SolrResourceLoader

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.1, 4.0-ALPHA
    • Component/s: None
    • Labels: None

      Description

      If you create, say, a stopwords list with Windows Notepad or another editor and save it as UTF-8,
      some of these editors insert a byte-order mark (BOM, U+FEFF, zero-width no-break space) as the first
      character of the file.

      http://www.lucidimagination.com/search/document/5101871231fc95af/is_this_a_bug_of_the_ressourceloader

      1. SOLR-1865.patch (3 kB, Hoss Man)
      2. SOLR-1865.patch (2 kB, Robert Muir)

        Activity

        Robert Muir added a comment -

        Attached is a patch to ignore BOMs at the beginning of files loaded with getLines().
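The idea can be sketched as follows. This is only an illustration and not the code from the attached patch: the class name `BomLines` and this standalone `getLines` are hypothetical, and the input is assumed to be UTF-8. It reads lines through a Reader and drops a leading U+FEFF from the first line only.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not the SolrResourceLoader code.
public class BomLines {
  static List<String> getLines(InputStream in) {
    List<String> lines = new ArrayList<>();
    try (BufferedReader reader = new BufferedReader(
        new InputStreamReader(in, StandardCharsets.UTF_8))) {
      boolean first = true;
      String line;
      while ((line = reader.readLine()) != null) {
        // Only the very first line of the file can carry the BOM.
        if (first && line.startsWith("\uFEFF")) {
          line = line.substring(1);
        }
        first = false;
        lines.add(line);
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return lines;
  }
}
```

Without the check, the first entry of a stopwords file saved with a BOM would be "\uFEFFthe" rather than "the", so it would never match the plain token.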

        Hoss Man added a comment -

        Robert: based on my limited understanding, aren't there different BOMs for different encodings? ...

        http://unicode.org/faq/utf_bom.html#bom4

        The getLines method modified in your patch could (conceivably) be used to open files in other encodings, so do we need to worry about those possibilities as well? (Or does InputStreamReader take care of that for us?)

        Robert Muir added a comment -

        Hoss Man: it is true that, as bytes, other encodings represent the BOM differently.

        However, your last point is the important part:
        the Reader decodes it to Java characters (UTF-16) for us.

        So in a String or char context it is always going to be U+FEFF, regardless of which Unicode encoding it was originally in.
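A small demonstration of this point, as a sketch rather than anything taken from the patch (the class name `BomDecodeDemo` and the helper `firstChar` are made up for illustration): the BOM is a different byte sequence in UTF-8, UTF-16BE and UTF-16LE, but after decoding through an InputStreamReader the first char is U+FEFF in each case.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class for illustration only.
public class BomDecodeDemo {
  // Decode the bytes through a Reader and return the first char as an int.
  static int firstChar(byte[] bytes, Charset charset) {
    try (Reader reader =
        new InputStreamReader(new ByteArrayInputStream(bytes), charset)) {
      return reader.read();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    // The same text, BOM followed by "a", in three encodings.
    byte[] utf8 = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'a'};
    byte[] utf16be = {(byte) 0xFE, (byte) 0xFF, 0, 'a'};
    byte[] utf16le = {(byte) 0xFF, (byte) 0xFE, 'a', 0};

    // Each line prints FEFF.
    System.out.printf("%04X%n", firstChar(utf8, StandardCharsets.UTF_8));
    System.out.printf("%04X%n", firstChar(utf16be, StandardCharsets.UTF_16BE));
    System.out.printf("%04X%n", firstChar(utf16le, StandardCharsets.UTF_16LE));
  }
}
```

One caveat worth keeping in mind: Java's plain "UTF-16" charset uses the mark to detect byte order and consumes it while decoding, so in that case there is no U+FEFF left to strip.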

        Hoss Man added a comment -

        Robert: I updated your test to verify that the file has a BOM in it, just in case someone (or some software) inadvertently removes it.

        If this looks cool then by all means commit.
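A guard of this kind might look like the sketch below. It is an assumption about the general approach, not the check in the updated patch: the class `BomFixtureCheck` and its methods are hypothetical, and the fixture is assumed to be a UTF-8 file, whose BOM is the byte sequence EF BB BF.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper for illustration; not the test from the patch.
public class BomFixtureCheck {
  // True if the data begins with the UTF-8 encoding of U+FEFF (EF BB BF).
  static boolean startsWithUtf8Bom(byte[] data) {
    return data.length >= 3
        && (data[0] & 0xFF) == 0xEF
        && (data[1] & 0xFF) == 0xBB
        && (data[2] & 0xFF) == 0xBF;
  }

  // Reads the whole fixture file and checks its first three bytes.
  static boolean fileStartsWithUtf8Bom(Path fixture) {
    try {
      return startsWithUtf8Bom(Files.readAllBytes(fixture));
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

A test would assert that `fileStartsWithUtf8Bom` returns true for the fixture before exercising the loader, so that an editor silently stripping the BOM makes the test fail instead of pass vacuously.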

        Robert Muir added a comment -

        Thanks Hoss Man, great idea.

        I'll commit this patch in a bit.

        Robert Muir added a comment -

        Committed to trunk: revision 942288.

        Robert Muir added a comment -

        Committed revision 942289 to branch_3x

        Hoss Man added a comment -

        Correcting Fix Version based on CHANGES.txt, see this thread for more details...

        http://mail-archives.apache.org/mod_mbox/lucene-dev/201005.mbox/%3Calpine.DEB.1.10.1005251052040.24672@radix.cryptio.net%3E

        Grant Ingersoll added a comment -

        Bulk close for 3.1.0 release


          People

          • Assignee: Robert Muir
          • Reporter: Robert Muir