Lucene - Core
  1. Lucene - Core
  2. LUCENE-3983

HTMLCharacterEntities.jflex uses String.toUpperCase without Locale

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.0-ALPHA
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      Is this expected?

            "xi", "\u03BE", "yacute", "\u00FD", "yen", "\u00A5", "yuml", "\u00FF",
            "zeta", "\u03B6", "zwj", "\u200D", "zwnj", "\u200C"
          };
          for (int i = 0 ; i < entities.length ; i += 2) {
            Character value = entities[i + 1].charAt(0);
            entityValues.put(entities[i], value);
            if (upperCaseVariantsAccepted.contains(entities[i])) {
              entityValues.put(entities[i].toUpperCase(), value);
            }
          }
      

      In my opinion, this should look like:

            "xi", "\u03BE", "yacute", "\u00FD", "yen", "\u00A5", "yuml", "\u00FF",
            "zeta", "\u03B6", "zwj", "\u200D", "zwnj", "\u200C"
          };
          for (int i = 0 ; i < entities.length ; i += 2) {
            Character value = entities[i + 1].charAt(0);
            entityValues.put(entities[i], value);
            if (upperCaseVariantsAccepted.contains(entities[i])) {
              entityValues.put(entities[i].toUpperCase(Locale.ENGLISH), value);
            }
          }
      

      (otherwise in the Turkish locale, the entities containing "i" (like "xi" -> '\u03BE') will not be detected correctly).

        Activity

        Hide
        Steve Rowe added a comment -

        Since upperCaseVariantsAccepted entries don't include an "i", and this set will likely never grow, this isn't really a problem?:

        private static final Set<String> upperCaseVariantsAccepted
            = new HashSet<String>(Arrays.asList("quot","copy","gt","lt","reg","amp"));

        However, it's definitely a good idea in general.

        +1

        Show
        Steve Rowe added a comment - Since upperCaseVariantsAccepted entries don't include an "i", and this set will likely never grow, this isn't really a problem?: private static final Set< String > upperCaseVariantsAccepted = new HashSet< String >(Arrays.asList( "quot" , "copy" , "gt" , "lt" , "reg" , "amp" )); However, it's definitely a good idea in general. +1
        Hide
        Steve Rowe added a comment -

        Maybe a better idea would be to convert upperCaseVariantsAccepted into a map, so that runtime uppercasing isn't required:

          private static final Map<String,String> upperCaseVariantsAccepted
              = new HashMap<String,String>();
          static {
            upperCaseVariantsAccepted.put("quot", "QUOT");
            upperCaseVariantsAccepted.put("copy", "COPY");
            upperCaseVariantsAccepted.put("gt", "GT");
            upperCaseVariantsAccepted.put("lt", "LT");
            upperCaseVariantsAccepted.put("reg", "REG");
            upperCaseVariantsAccepted.put("amp", "AMP");
          }
        [...]
          for (int i = 0 ; i < entities.length ; i += 2) {
            Character value = entities[i + 1].charAt(0);
            entityValues.put(entities[i], value);
            String upperCaseVariant = upperCaseVariantsAccepted.get(entities[i]);
            if (upperCaseVariant != null) {
              entityValues.put(upperCaseVariant, value);
            }
          }
        
        Show
        Steve Rowe added a comment - Maybe a better idea would be to convert upperCaseVariantsAccepted into a map, so that runtime uppercasing isn't required: private static final Map< String , String > upperCaseVariantsAccepted = new HashMap< String , String >(); static { upperCaseVariantsAccepted.put( "quot" , "QUOT" ); upperCaseVariantsAccepted.put( "copy" , "COPY" ); upperCaseVariantsAccepted.put( "gt" , "GT" ); upperCaseVariantsAccepted.put( "lt" , "LT" ); upperCaseVariantsAccepted.put( "reg" , "REG" ); upperCaseVariantsAccepted.put( "amp" , "AMP" ); } [...] for ( int i = 0 ; i < entities.length ; i += 2) { Character value = entities[i + 1].charAt(0); entityValues.put(entities[i], value); String upperCaseVariant = upperCaseVariantsAccepted.get(entities[i]); if (upperCaseVariant != null ) { entityValues.put(upperCaseVariant, value); } }
        Hide
        Uwe Schindler added a comment -

        I have no preference, I just noticed the missing Locale and that alarmed me. We should really avoid that to prevent bugs from the beginning.
        I would simply add the Locale.ENGLISH, commit that and leave the rest unchanged. I just assigned it to you, as I have no uptodate jfex installed to regenerate the java files, otherwise I would have heavy committed

        Show
        Uwe Schindler added a comment - I have no preference, I just noticed the missing Locale and that alarmed me. We should really avoid that to prevent bugs from the beginning. I would simply add the Locale.ENGLISH, commit that and leave the rest unchanged. I just assigned it to you, as I have no uptodate jfex installed to regenerate the java files, otherwise I would have heavy committed
        Hide
        Steve Rowe added a comment -

        Patch removing runtime upcasing, as in my previous comment.

        Committing shortly.

        Show
        Steve Rowe added a comment - Patch removing runtime upcasing, as in my previous comment. Committing shortly.
        Hide
        Steve Rowe added a comment -

        Committed to trunk.

        I don't think it's worth it to backport to the 3.6 branch, since the only danger here was if the set of recognized uppercase variants of HTML character entities ever grew, one of them might contain an "i"; since branch 3.6 is bugfix-only, though, that set will never grow.

        Show
        Steve Rowe added a comment - Committed to trunk. I don't think it's worth it to backport to the 3.6 branch, since the only danger here was if the set of recognized uppercase variants of HTML character entities ever grew, one of them might contain an "i"; since branch 3.6 is bugfix-only, though, that set will never grow.

          People

          • Assignee:
            Steve Rowe
            Reporter:
            Uwe Schindler
          • Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development