  1. Cocoon
  2. COCOON-2063

NekoHTMLTransformer needs to set the default-encoding of the current system to work properly with UTF-8

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 2.1.11, 2.2
    • Fix Version/s: 2.1.12, 2.2.1
    • Component/s: Blocks: HTML
    • Labels:
      None
    • Other Info:
      Patch available
    • Affects version (Component):
      Blocks: HTML - 1.0.0-M1
    • Fix version (Component):
      Blocks: HTML

      Description

      The NekoHTMLTransformer uses the CyberNeko HTMLConfiguration to tidy HTML. Unfortunately, it does not use the system's current encoding as the default; instead, you have to set a property to specify your encoding. Because the right value varies from one OS to another, the best solution is to set this property automatically in the NekoHTMLTransformer, based on what Java reports as its defaultCharset:

                  config.setProperty("http://cyberneko.org/html/properties/default-encoding", Charset.defaultCharset().name());
      1. nekohtmltransformer-encoding.patch
        3 kB
        Alexander Klimetschek
      2. NekoHTMLGenerator_BRANCH2_1_X.patch
        1.0 kB
        Ellis Pritchard

        Activity

        alexander.klimetschek Alexander Klimetschek added a comment -
        Affects cocoon-html-impl.
        alexander.klimetschek Alexander Klimetschek added a comment -
        I forgot to mention that anyone who wants to override this property via the configuration of the NekoHTMLTransformer can still do so. The manual config is applied after the dynamic setting of the encoding property, so the manual value overrides the dynamic one.
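
        The ordering described in this comment can be sketched roughly as follows. This is only an illustration, not the code from the attached patch: a plain Map stands in for the NekoHTML configuration so the example runs without NekoHTML on the classpath, and the class name EncodingOverrideSketch and the method effectiveProperties are hypothetical. Only the property key and the Charset.defaultCharset().name() call come from the issue description.

        ```java
        import java.nio.charset.Charset;
        import java.util.LinkedHashMap;
        import java.util.Map;

        public class EncodingOverrideSketch {
            // Property key quoted in the issue description.
            static final String DEFAULT_ENCODING =
                    "http://cyberneko.org/html/properties/default-encoding";

            // Hypothetical helper: a Map stands in for the parser configuration.
            static Map<String, String> effectiveProperties(Map<String, String> manualConfig) {
                Map<String, String> props = new LinkedHashMap<String, String>();
                // Step 1: dynamic default taken from the JVM (requires Java 5 or later).
                props.put(DEFAULT_ENCODING, Charset.defaultCharset().name());
                // Step 2: manual configuration is applied afterwards, so it replaces
                // the dynamic value whenever it sets the same key.
                props.putAll(manualConfig);
                return props;
            }

            public static void main(String[] args) {
                Map<String, String> manual = new LinkedHashMap<String, String>();
                manual.put(DEFAULT_ENCODING, "ISO-8859-1");
                System.out.println("with override:    " + effectiveProperties(manual));
                System.out.println("without override: "
                        + effectiveProperties(new LinkedHashMap<String, String>()));
            }
        }
        ```

        Applying the dynamic value first means an installation that sets nothing gets the platform default, while an explicit setting always takes precedence.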
        ellispritchard Ellis Pritchard added a comment -
        This has bitten us too.

        Here's a patch for Cocoon 2.1.X, rev 597695
        ellispritchard Ellis Pritchard added a comment -
        Added Affects Version 2.1.11-dev
        ellispritchard Ellis Pritchard added a comment -
        Anyone fancy applying the patches?
        joerg.heinicke@gmx.de Jörg Heinicke added a comment -
        I fixed this issue in 2.2. The fix in 2.1 does not work since it uses java.nio, which was only added in Java 1.4, and Cocoon 2.1 has to be Java 1.3 compatible. Is there a way to find out the default encoding in Java 1.3? All the classes and methods where it would be necessary, like new String(byte[]) or InputStreamReader(), only point to http://java.sun.com/j2se/1.3/docs/api/java/lang/package-summary.html#charenc, which just names some constants. On Mac OS X I also have no access to the source code of the JDK. The bytecode implies that the mentioned classes and methods use some Sun-internal class to retrieve the default encoding.
        joerg.heinicke@gmx.de Jörg Heinicke added a comment -
        Also had to revert the fix for Cocoon 2.2 since Charset.defaultCharset() is only available on Java 5.
        ellispritchard Ellis Pritchard added a comment -
        Oh, how annoying.

        Is there a possibility of starting to use the src/jdk1.x directories for this kind of patch?

        Just because some people are stuck in the dark ages doesn't mean we can't shine a light...
        joerg.heinicke@gmx.de Jörg Heinicke added a comment -
        With Ant this would be rather easy as we already have some 1.4-specific code in Cocoon 2.1. No idea about Maven.

        Is there really no other way to figure out the default encoding on pre-Java 5 JVMs?
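
        For what it is worth, two candidates that avoid java.nio are sketched below. This is not something proposed in the thread, only an illustration: the class and method names are hypothetical, OutputStreamWriter.getEncoding() has been part of java.io since JDK 1.1, and the file.encoding system property is commonly set by JVMs but has historically not been part of the documented standard property set. Note that getEncoding() may report a historical encoding name rather than the canonical one, so whether NekoHTML accepts the returned value directly is an assumption that would need checking.

        ```java
        import java.io.ByteArrayOutputStream;
        import java.io.OutputStreamWriter;

        public class LegacyDefaultEncoding {
            // A writer created without an explicit encoding uses the platform
            // default; getEncoding() reports the name of that encoding.
            static String viaWriter() {
                return new OutputStreamWriter(new ByteArrayOutputStream()).getEncoding();
            }

            // Reads the file.encoding system property; may return null on JVMs
            // that do not set it.
            static String viaSystemProperty() {
                return System.getProperty("file.encoding");
            }

            public static void main(String[] args) {
                System.out.println("OutputStreamWriter.getEncoding(): " + viaWriter());
                System.out.println("file.encoding property:           " + viaSystemProperty());
            }
        }
        ```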
        vgritsenko Vadim Gritsenko added a comment -
        I'm not sure why the generator needs to know the encoding... Can it simply always be set to UTF-8?
        joerg.heinicke@gmx.de Jörg Heinicke added a comment -
        As Vadim mentioned at http://marc.info/?l=xml-cocoon-dev&m=120905050708311&w=4, the NekoHTMLTransformer had an issue with converting a String to byte[] using the OS's default encoding rather than keeping the string, and so had the NekoHTMLGenerator when reading a request parameter value. Both of these issues are fixed in SVN and may have caused the symptoms you saw. Closing the issue for now. Feel free to reopen it if the problem still persists.
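
        The kind of String-to-byte[] mismatch described in this comment can be reproduced in isolation. The sketch below is not the Cocoon code: the class name, the roundTrip helper, and the sample text are hypothetical, and ISO-8859-1 is chosen only to stand in for a platform default that differs from what the consumer of the bytes expects.

        ```java
        import java.io.UnsupportedEncodingException;

        public class DefaultEncodingPitfall {
            // Encodes text to bytes with one encoding and decodes it with another,
            // mimicking a producer and a consumer that disagree about the encoding.
            static String roundTrip(String text, String encodeWith, String decodeWith)
                    throws UnsupportedEncodingException {
                byte[] bytes = text.getBytes(encodeWith);
                return new String(bytes, decodeWith);
            }

            public static void main(String[] args) throws UnsupportedEncodingException {
                String text = "J\u00f6rg"; // contains one non-ASCII character

                // Same encoding on both sides: the text survives.
                System.out.println(text.equals(roundTrip(text, "UTF-8", "UTF-8")));

                // Bytes written as ISO-8859-1 but read as UTF-8: the text is damaged.
                System.out.println(text.equals(roundTrip(text, "ISO-8859-1", "UTF-8")));
            }
        }
        ```

        Keeping the value as a String, or naming the encoding explicitly on both sides, avoids depending on whatever the platform default happens to be.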

          People

          • Assignee:
            joerg.heinicke@gmx.de Jörg Heinicke
            Reporter:
            alexander.klimetschek Alexander Klimetschek
          • Votes:
            1
            Watchers:
            0
