Lucene - Core
  1. Lucene - Core
  2. LUCENE-3983

HTMLCharacterEntities.jflex uses String.toUpperCase without Locale

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.0-ALPHA
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      Is this expected?

            "xi", "\u03BE", "yacute", "\u00FD", "yen", "\u00A5", "yuml", "\u00FF",
            "zeta", "\u03B6", "zwj", "\u200D", "zwnj", "\u200C"
          };
          for (int i = 0 ; i < entities.length ; i += 2) {
            Character value = entities[i + 1].charAt(0);
            entityValues.put(entities[i], value);
            if (upperCaseVariantsAccepted.contains(entities[i])) {
              entityValues.put(entities[i].toUpperCase(), value);
            }
          }
      

      In my opinion, this should look like:

            "xi", "\u03BE", "yacute", "\u00FD", "yen", "\u00A5", "yuml", "\u00FF",
            "zeta", "\u03B6", "zwj", "\u200D", "zwnj", "\u200C"
          };
          for (int i = 0 ; i < entities.length ; i += 2) {
            Character value = entities[i + 1].charAt(0);
            entityValues.put(entities[i], value);
            if (upperCaseVariantsAccepted.contains(entities[i])) {
              entityValues.put(entities[i].toUpperCase(Locale.ENGLISH), value);
            }
          }
      

      (otherwise in the Turkish locale, the entities containing "i" (like "xi" -> '\u03BE') will not be detected correctly).

        Activity

        Uwe Schindler created issue -
        Hide
        Steve Rowe added a comment -

        Since upperCaseVariantsAccepted entries don't include an "i", and this set will likely never grow, this isn't really a problem?:

        private static final Set<String> upperCaseVariantsAccepted
            = new HashSet<String>(Arrays.asList("quot","copy","gt","lt","reg","amp"));

        However, it's definitely a good idea in general.

        +1

        Show
        Steve Rowe added a comment - Since upperCaseVariantsAccepted entries don't include an "i", and this set will likely never grow, this isn't really a problem?: private static final Set< String > upperCaseVariantsAccepted = new HashSet< String >(Arrays.asList( "quot" , "copy" , "gt" , "lt" , "reg" , "amp" )); However, it's definitely a good idea in general. +1
        Hide
        Steve Rowe added a comment -

        Maybe a better idea would be to convert upperCaseVariantsAccepted into a map, so that runtime uppercasing isn't required:

          private static final Map<String,String> upperCaseVariantsAccepted
              = new HashMap<String,String>();
          static {
            upperCaseVariantsAccepted.put("quot", "QUOT");
            upperCaseVariantsAccepted.put("copy", "COPY");
            upperCaseVariantsAccepted.put("gt", "GT");
            upperCaseVariantsAccepted.put("lt", "LT");
            upperCaseVariantsAccepted.put("reg", "REG");
            upperCaseVariantsAccepted.put("amp", "AMP");
          }
        [...]
          for (int i = 0 ; i < entities.length ; i += 2) {
            Character value = entities[i + 1].charAt(0);
            entityValues.put(entities[i], value);
            String upperCaseVariant = upperCaseVariantsAccepted.get(entities[i]);
            if (upperCaseVariant != null) {
              entityValues.put(upperCaseVariant, value);
            }
          }
        
        Show
        Steve Rowe added a comment - Maybe a better idea would be to convert upperCaseVariantsAccepted into a map, so that runtime uppercasing isn't required: private static final Map< String , String > upperCaseVariantsAccepted = new HashMap< String , String >(); static { upperCaseVariantsAccepted.put( "quot" , "QUOT" ); upperCaseVariantsAccepted.put( "copy" , "COPY" ); upperCaseVariantsAccepted.put( "gt" , "GT" ); upperCaseVariantsAccepted.put( "lt" , "LT" ); upperCaseVariantsAccepted.put( "reg" , "REG" ); upperCaseVariantsAccepted.put( "amp" , "AMP" ); } [...] for ( int i = 0 ; i < entities.length ; i += 2) { Character value = entities[i + 1].charAt(0); entityValues.put(entities[i], value); String upperCaseVariant = upperCaseVariantsAccepted.get(entities[i]); if (upperCaseVariant != null ) { entityValues.put(upperCaseVariant, value); } }
        Steve Rowe made changes -
        Field Original Value New Value
        Priority Major [ 3 ] Minor [ 4 ]
        Hide
        Uwe Schindler added a comment -

        I have no preference, I just noticed the missing Locale and that alarmed me. We should really avoid that to prevent bugs from the beginning.
        I would simply add the Locale.ENGLISH, commit that and leave the rest unchanged. I just assigned it to you, as I have no uptodate jfex installed to regenerate the java files, otherwise I would have heavy committed

        Show
        Uwe Schindler added a comment - I have no preference, I just noticed the missing Locale and that alarmed me. We should really avoid that to prevent bugs from the beginning. I would simply add the Locale.ENGLISH, commit that and leave the rest unchanged. I just assigned it to you, as I have no uptodate jfex installed to regenerate the java files, otherwise I would have heavy committed
        Hide
        Steve Rowe added a comment -

        Patch removing runtime upcasing, as in my previous comment.

        Committing shortly.

        Show
        Steve Rowe added a comment - Patch removing runtime upcasing, as in my previous comment. Committing shortly.
        Steve Rowe made changes -
        Attachment LUCENE-3983.patch [ 12528100 ]
        Hide
        Steve Rowe added a comment -

        Committed to trunk.

        I don't think it's worth it to backport to the 3.6 branch, since the only danger here was if the set of recognized uppercase variants of HTML character entities ever grew, one of them might contain an "i"; since branch 3.6 is bugfix-only, though, that set will never grow.

        Show
        Steve Rowe added a comment - Committed to trunk. I don't think it's worth it to backport to the 3.6 branch, since the only danger here was if the set of recognized uppercase variants of HTML character entities ever grew, one of them might contain an "i"; since branch 3.6 is bugfix-only, though, that set will never grow.
        Steve Rowe made changes -
        Status Open [ 1 ] Resolved [ 5 ]
        Lucene Fields New [ 10121 ] New,Patch Available [ 10121, 10120 ]
        Fix Version/s 4.0 [ 12314025 ]
        Resolution Fixed [ 1 ]
        Uwe Schindler made changes -
        Status Resolved [ 5 ] Closed [ 6 ]
        Transition Time In Source Status Execution Times Last Executer Last Execution Date
        Open Open Resolved Resolved
        34d 6h 34m 1 Steve Rowe 18/May/12 18:00
        Resolved Resolved Closed Closed
        356d 17h 43m 1 Uwe Schindler 10/May/13 11:43

          People

          • Assignee:
            Steve Rowe
            Reporter:
            Uwe Schindler
          • Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development