Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.5, 4.0-ALPHA
    • Component/s: modules/analysis
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      WordListLoader is basically used for loading stopword lists, stem dictionaries, etc.
      Unfortunately the API returns Set<String> and sometimes even HashSet<String> or HashMap<String,String>.

      I think we should break it and return CharArraySets and CharArrayMaps (but leave the declared return types as the generic Set and Map).

      If someone objects to breaking it in 3.1, then we can do this only in 4.0, but I think it would be good to fix it in both places.
      The reason is that if someone does new FooAnalyzer() a lot (probably not uncommon), I think it's doing a bunch of useless copying.

      I think we should slap @lucene.internal on this API too, since that's mostly how it's being used.
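      A minimal sketch of the proposed shape, assuming the Lucene 3.x location of CharArraySet (the loader class and method names here are illustrative, not the committed API):

          import java.io.BufferedReader;
          import java.io.IOException;
          import java.io.Reader;

          import org.apache.lucene.analysis.CharArraySet;
          import org.apache.lucene.util.Version;

          public final class WordListSketch {

            // Loads one word per line into a CharArraySet, so analyzers can use
            // the result directly. With the current HashSet<String> return type,
            // every "new FooAnalyzer()" first has to copy the set into a
            // CharArraySet (e.g. via CharArraySet.copy) before StopFilter can
            // use it — the useless copying described above.
            public static CharArraySet loadWordSet(Reader reader) throws IOException {
              CharArraySet words = new CharArraySet(Version.LUCENE_35, 16, false);
              BufferedReader br = new BufferedReader(reader);
              try {
                String line;
                while ((line = br.readLine()) != null) {
                  line = line.trim();
                  if (line.length() > 0) {
                    words.add(line);
                  }
                }
              } finally {
                br.close();
              }
              return words;
            }
          }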

      Attachments

      1. LUCENE-2564.patch
        54 kB
        Simon Willnauer
      2. LUCENE-2564.patch
        53 kB
        Simon Willnauer
      3. LUCENE-2564.patch
        51 kB
        Simon Willnauer

        Activity

        Uwe Schindler added a comment -

        Bulk close after release of 3.5

        Simon Willnauer added a comment -

        committed to trunk and backported to 3.x

        Simon Willnauer added a comment -

        final patch containing a CHANGES.txt entry - I am going to commit this in a minute

        Robert Muir added a comment -

        I figure since we backport this I should put it under lucene/contrib/CHANGES.txt?

        +1

        Simon Willnauer added a comment -

        next iteration:

        • added javadocs & removed all nocommits
        • renamed IOUtils#getReader to IOUtils#getDecodingReader
        • used a Charset instance instead of a String in all getDecodingReader variants, so we can use a cached UTF-8 Charset instance instead of looking it up for each invocation (see the sketch after this comment)

        I think it's ready... Where do we add the CHANGES.txt entry for that stuff? I figure since we backport this I should put it under lucene/contrib/CHANGES.txt?
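        A sketch of the cached-Charset idea (the field and method names follow the comment above but are not guaranteed to match the committed IOUtils):

            import java.io.InputStream;
            import java.io.InputStreamReader;
            import java.io.Reader;
            import java.nio.charset.Charset;
            import java.nio.charset.CharsetDecoder;
            import java.nio.charset.CodingErrorAction;

            public final class DecodingReaderSketch {

              // Looked up once, instead of calling Charset.forName("UTF-8")
              // on every invocation.
              public static final Charset CHARSET_UTF_8 = Charset.forName("UTF-8");

              // Wraps the stream in a reader that decodes with the given Charset
              // and reports (rather than silently replaces) malformed input.
              public static Reader getDecodingReader(InputStream stream, Charset charSet) {
                CharsetDecoder decoder = charSet.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT);
                return new InputStreamReader(stream, decoder);
              }
            }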

        Robert Muir added a comment -

        Patch looks good... I was just referring to Solr's resource loading of stopwords and such.

        But we don't have to do that here; IMO we should fix the issues here first.

        Maybe for the javadocs on getReader we should explain that, unlike the Java default, it creates
        a reader that will throw an exception if it detects the charset is wrong
        (so this is good for reading configuration files, like WordListLoader does, but not recommended
        for, say, documents crawled from the web).
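        A small self-contained demonstration of that difference (a hypothetical demo class; the byte sequence is deliberately invalid UTF-8):

            import java.io.ByteArrayInputStream;
            import java.io.InputStreamReader;
            import java.io.Reader;
            import java.nio.charset.Charset;
            import java.nio.charset.CodingErrorAction;

            public final class StrictDecodeDemo {
              public static void main(String[] args) throws Exception {
                byte[] notUtf8 = new byte[] { (byte) 0xC0, (byte) 0xAF }; // overlong form, invalid UTF-8

                // Java default: malformed input is silently replaced, no exception.
                Reader lenient = new InputStreamReader(
                    new ByteArrayInputStream(notUtf8), Charset.forName("UTF-8"));
                System.out.println(lenient.read()); // 65533, the replacement character

                // Strict decoder: the error is reported instead of hidden, which is
                // what you want for configuration files such as stopword lists.
                Reader strict = new InputStreamReader(
                    new ByteArrayInputStream(notUtf8),
                    Charset.forName("UTF-8").newDecoder()
                        .onMalformedInput(CodingErrorAction.REPORT)
                        .onUnmappableCharacter(CodingErrorAction.REPORT));
                strict.read(); // throws java.nio.charset.MalformedInputException
              }
            }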

        Simon Willnauer added a comment -

        I tried to simplify this a little without going into changing Solr (not sure what Robert meant by that anyway...). I basically moved all the File / InputStream related stuff out of WordListLoader and fixed its return types to CharArraySet/CharArrayMap. There are still nocommits in there, since I wanted to get some feedback on what others think before I add javadocs everywhere.

        Robert Muir added a comment -

        bulk move 3.2 -> 3.3

        Robert Muir added a comment -

        As much as I hate the fact that this one uses the default encoding in its File method,
        it's only used by StopFilter etc.

        Our provided analyzers and Solr treat all this stuff as UTF-8 encoded resources,
        so I think it's OK to delay until 3.2 and re-assess the best way.

        I made a prototype patch and it was complicated, mainly because I wanted to fix
        this thing so that it's coherent with Solr's resource loading.

        Robert Muir added a comment -

        There are more problems with this loader... it uses FileReader (platform-dependent encoding).
        I think we should break it to default to UTF-8, too.
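        A sketch of the FileReader problem and the proposed default (the class and method names here are hypothetical):

            import java.io.File;
            import java.io.FileInputStream;
            import java.io.FileReader;
            import java.io.IOException;
            import java.io.InputStreamReader;
            import java.io.Reader;

            public final class FileEncodingSketch {

              // Current behavior: FileReader always uses the platform default
              // encoding, so the same stopword file decodes differently on
              // different machines.
              static Reader platformDependent(File wordFile) throws IOException {
                return new FileReader(wordFile);
              }

              // Proposed behavior: decode explicitly as UTF-8 regardless of
              // the platform default.
              static Reader utf8(File wordFile) throws IOException {
                return new InputStreamReader(new FileInputStream(wordFile), "UTF-8");
              }
            }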


          People

          • Assignee:
            Robert Muir
          • Reporter:
            Robert Muir
          • Votes:
            0
          • Watchers:
            0
