Lucene - Core / LUCENE-2466

Fix some more locale problems in Lucene/Solr

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.1, 4.0-ALPHA
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      set ANT_ARGS="-Dargs=-Duser.language=tr -Duser.country=TR"
      ant clean test

      We should make sure this works across all of lucene/solr
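The Turkish locale is a classic way to expose this class of bug, because its casing rules differ from English for the letter I. The following is a minimal sketch of the failure mode, not code from the patches; the class and method names are hypothetical.

```java
import java.util.Locale;

public class TurkishCasingDemo {
    // Hypothetical helper: lowercases text using the rules of the given locale.
    static String lower(String text, Locale locale) {
        return text.toLowerCase(locale);
    }

    public static void main(String[] args) {
        Locale turkish = Locale.forLanguageTag("tr-TR");
        // Under Turkish rules, 'I' lowercases to dotless 'ı' (U+0131), not 'i',
        // so a comparison against an ASCII keyword silently fails.
        String lowered = lower("TITLE", turkish);
        System.out.println(lowered.equals("title"));      // false
        System.out.println(lowered.equals("t\u0131tle")); // true
    }
}
```

Code that calls `toLowerCase()` with no argument picks up these rules whenever the JVM default locale is Turkish, which is what the `-Duser.language=tr -Duser.country=TR` arguments simulate.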

      1. LUCENE-2466_coretests.patch
        2 kB
        Robert Muir
      2. LUCENE-2466_lucene_thai.patch
        2 kB
        Robert Muir
      3. LUCENE-2466_thai_solr.patch
        7 kB
        Robert Muir
      4. LUCENE-2466.patch
        34 kB
        Robert Muir
      5. LUCENE-2466.patch
        38 kB
        Robert Muir

          Activity

          Robert Muir added a comment -

          Attached is a patch; with it, Lucene core/contrib is OK.

          But Solr has some failures that must be investigated.

          If no one objects I would like to commit this first and backport, then investigate those.

          Robert Muir added a comment -

          Attached is a patch that fixes the tests for Solr, too.

          • I added StrUtils.ROOT_LOCALE, but we could probably use Locale.ENGLISH just fine too; this is just me being nitpicky.
          • commons-codec fixed this in their 1.4 release, so I upgraded to 1.4 (not in the patch, obviously) so that DoubleMetaphoneFilter etc. pass as well.
          • Besides lowercasing, Solr uses uppercasing in a lot of places... in my opinion we should review why it is doing this.
          • I didn't change SolrQueryParser; similar problems exist in Lucene's QueryParser (strange casing), and that's for another day.

          Someone should review the Solr stuff, as I don't think I necessarily present the best solution; I just indicate where the problems are.

          Yonik Seeley added a comment -

          Awesome! If we can get the tests to pass with these different locales, commit it! When in doubt, we should not be sensitive to locale.

          "I didn't change SolrQueryParser, similar problems exist in Lucene's QueryParser (strange casing)"

          The QP shouldn't currently be an issue for Solr: we never set the flags to do lowercasing (I've always been against it, as the right solution is field-specific, not parser-specific).

          Robert Muir added a comment -

          Here is a cleaned-up patch, using Locale.ENGLISH, that fixes the casing problems.

          • Note the use of Locale.ENGLISH is not an affront to non-English users; it just forces consistent casing behavior and is already defined as a constant.
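As a rough illustration of that point, the sketch below lowercases a keyword with `Locale.ENGLISH` while the JVM default is switched to Turkish. The helper name `normalizeKeyword` is made up for this example and does not come from the patch.

```java
import java.util.Locale;

public class ConsistentCasing {
    // Hypothetical helper: lowercases internal keywords with fixed English rules,
    // so the result does not depend on the JVM default locale.
    static String normalizeKeyword(String keyword) {
        return keyword.toLowerCase(Locale.ENGLISH);
    }

    public static void main(String[] args) {
        Locale saved = Locale.getDefault();
        try {
            // Simulate running with -Duser.language=tr -Duser.country=TR.
            Locale.setDefault(Locale.forLanguageTag("tr-TR"));
            System.out.println(normalizeKeyword("TITLE")); // title
        } finally {
            Locale.setDefault(saved); // always restore the default for other code
        }
    }
}
```

Passing the locale explicitly at each call site keeps the behavior visible in the code, which is the main argument for a constant over relying on the default.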

          I plan to commit soon (trunk/stable), and then look at the unrelated separate failures for Thai:
          set ANT_ARGS="-Dargs=-Duser.language=th -Duser.country=TH -Duser.variant=TH"

          I suspect many of these failures are due to date handling.

          We might want to devise a plan to help test this stuff: either let Hudson pick a different locale each night (maybe just from the "troublesome ones"), and/or do something similar to the LocalizedTestCase in Lucene (but this can cause tests to be very slow).
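One way the "pick a different locale per run" idea could look is sketched below. This is an assumption-laden illustration, not how Hudson or LocalizedTestCase work: the class, the method names, and the two-entry locale list are all hypothetical, with the two locales taken from the ones exercised in this issue.

```java
import java.util.Locale;
import java.util.Random;
import java.util.function.Supplier;

public class RandomLocaleRunner {
    // Assumed list of "troublesome" locales: tr_TR and th_TH_TH from this issue.
    @SuppressWarnings("deprecation") // the 3-arg constructor is deprecated on newer JDKs
    static final Locale[] TROUBLESOME = {
        Locale.forLanguageTag("tr-TR"),
        new Locale("th", "TH", "TH"),
    };

    // Picks a locale deterministically from a seed so a failing run can be repeated.
    static Locale pick(long seed) {
        return TROUBLESOME[new Random(seed).nextInt(TROUBLESOME.length)];
    }

    // Runs the body with the given default locale and restores the old one afterwards.
    static <T> T runWithLocale(Locale locale, Supplier<T> body) {
        Locale saved = Locale.getDefault();
        Locale.setDefault(locale);
        try {
            return body.get();
        } finally {
            Locale.setDefault(saved);
        }
    }

    public static void main(String[] args) {
        long seed = System.nanoTime();
        Locale locale = pick(seed);
        // Log the seed and locale so a failure can be reproduced later.
        System.out.println("seed=" + seed + " locale=" + locale);
        System.out.println(runWithLocale(locale, () -> "TITLE".toLowerCase()));
    }
}
```

Choosing one locale per run keeps the cost at a single pass over the tests, at the price of only finding a given locale's failures on the runs that happen to select it.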

          Robert Muir added a comment -

          Committed 945245 (trunk) / 945270 (3x) for the casing problems.

          Robert Muir added a comment -

          The attached patch fixes trunk for the Thai locale.

          It doesn't need to be merged, as the tests don't exist in 3x; I created this problem.

          Robert Muir added a comment -

          Committed revision 945274 for the lucene wildcard/regex tests.

          I will look at the Solr problems under this locale now; they probably need to be merged to 3x also.

          Robert Muir added a comment -

          I talked to Hoss Man about some of these date problems, and he was of the opinion that for Solr, the Locale should never be used for date parsing/formatting (only standard UTC/Locale.US). So these are easy to fix.
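A minimal sketch of what "only standard UTC/Locale.US" can mean in practice is shown below. The class name, method name, and the ISO-8601-style pattern are assumptions chosen for illustration, not code taken from Solr or the patch.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

public class CanonicalDates {
    // Formats a date with an explicit locale and time zone, so neither
    // Locale.getDefault() nor TimeZone.getDefault() can influence the output.
    static String formatUtc(Date date) {
        SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss'Z'", Locale.US);
        fmt.setTimeZone(TimeZone.getTimeZone("UTC"));
        return fmt.format(date);
    }

    public static void main(String[] args) {
        System.out.println(formatUtc(new Date(0L))); // 1970-01-01T00:00:00Z
    }
}
```

A new `SimpleDateFormat` is created per call because the class is not thread-safe; a real implementation would likely manage instances differently.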

          But there is another problem, in this case the format of floats themselves. Should they follow the same rule in Solr, or should localized numeric formats be supported?

             [junit] Caused by: java.lang.NumberFormatException: For input string: "<some thai digits here>"
             [junit]     at sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:1224)
             [junit]     at java.lang.Float.parseFloat(Float.java:422)
             [junit]     at org.apache.solr.util.NumberUtils.float2sortableStr(NumberUtils.java:79)
             [junit]     at org.apache.solr.schema.SortableFloatField.toInternal(SortableFloatField.java:49)
             [junit]     at org.apache.solr.schema.FieldType.createField(FieldType.java:236)
             [junit]     ... 38 more
             [junit] </result>)
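The parse failure in the trace can be reproduced in isolation. The sketch below only shows that `Float.parseFloat` rejects Thai digits; where the Thai-digit string came from in the failing test is not shown in the trace, so that part is left out. The class and method names are hypothetical.

```java
public class ThaiDigitParsing {
    // Returns true if Float.parseFloat accepts the text, false if it throws.
    static boolean parsesAsFloat(String text) {
        try {
            Float.parseFloat(text);
            return true;
        } catch (NumberFormatException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String ascii = "1.5";
        String thai = "\u0E51.\u0E55"; // the same value written with Thai digits
        System.out.println(parsesAsFloat(ascii)); // true
        System.out.println(parsesAsFloat(thai));  // false
    }
}
```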
          

          Furthermore, what about DataImportHandler's use of some of the same DateMathParser stuff used in other places in Solr? It tends to use TimeZone.getDefault/Locale.getDefault... should this be changed?

          Robert Muir added a comment -

          Attached is a patch with some modifications to Solr, adding missing Locale.US params etc., following Hoss Man's rule.

          I am still nervous about DIH (I didn't touch it), but this makes all the tests pass under th_TH_TH.

          Yonik Seeley added a comment -

          IMO, there's nothing in Solr that should depend on the system locale unless explicitly referenced or configured to do so. The defaults should certainly never do so.

          Hoss pointed out this in DIH:
          http://wiki.apache.org/solr/DataImportHandler#NumberFormatTransformer
          At a minimum I think this should be changed in trunk to not default to the system locale.

          Anyway, my communication will be limited over the next week starting tomorrow (Apache Lucene EuroCon)...
          so here's my standing +1 to commit all changes that remove system locale defaults.
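To illustrate why a number-parsing default tied to the system locale is risky, here is a small sketch using the JDK's `NumberFormat` with an explicit locale. It is not the NumberFormatTransformer code; the class and method names are hypothetical.

```java
import java.text.NumberFormat;
import java.text.ParseException;
import java.util.Locale;

public class ExplicitNumberLocale {
    // Parses text with the number conventions of an explicitly chosen locale.
    static double parse(String text, Locale locale) {
        try {
            return NumberFormat.getNumberInstance(locale).parse(text).doubleValue();
        } catch (ParseException e) {
            throw new IllegalArgumentException("not a number for " + locale + ": " + text, e);
        }
    }

    public static void main(String[] args) {
        // The same numeric value is written differently in each convention.
        System.out.println(parse("1,234.5", Locale.US));      // 1234.5
        System.out.println(parse("1.234,5", Locale.GERMANY)); // 1234.5
    }
}
```

If the locale argument were replaced by `Locale.getDefault()`, the same input file could import as different values depending on the machine it runs on, which is the behavior the comment above argues should never be the default.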

          Robert Muir added a comment -

          Committed LUCENE-2466_thai_solr.patch 945343 (trunk) / 945353 (3x)

          Robert Muir added a comment -

          I ran a few more locales, no more failures... I think we found the worst problems.

          Robert Muir added a comment -

          Setting fix versions correctly here.

          Happy to backport this stuff to 1.4.1 if desired.

          Grant Ingersoll added a comment -

          Bulk close for 3.1


            People

            • Assignee:
              Unassigned
              Reporter:
              Robert Muir