Forrest / FOR-1231

Forrest does not deal properly with UTF-8 .xml content, even with the proper XML content-type header, and generates corrupted HTML

    Details

    • Type: Bug
    • Status: Open
    • Priority: Critical
    • Resolution: Unresolved
    • Affects Version/s: 0.9, 0.10-dev
    • Fix Version/s: None
    • Labels:
      None

      Description

      We're using Forrest to generate the Apache ManifoldCF site. We've added Japanese content. The content worked fine via localhost:8888, but the generated html content does not load properly in a browser, even though the browser correctly divines that the HTML page has utf-8 encoding. It looks like many utf-8 characters in the source XML are handled correctly but some are corrupted. I've also tried the fix in FORREST-668 but this does not help. See http://incubator.apache.org/connectors and click on the tab in Japanese to see what I mean. The current source for the site can be found in: https://svn.apache.org/repos/asf/incubator/lcf/trunk/site.

      I checked out latest Forrest trunk and built and used that but there has been no improvement.
      1. FOR-1231.patch
        2 kB
        Karl Wright

        Issue Links

          Activity

          Karl Wright created issue
          Karl Wright made changes
          Link: This issue blocks CONNECTORS-385
          Hitoshi Ozawa added a comment -
          While you're at it, I would appreciate it if Japanese fonts could be installed as well, so that PDFs containing Japanese also render correctly.
          David Crossley added a comment -
          Please ask about separate usage issues on the user mailing list.

          The PDF fonts are configurable. See that plugin's docs:
          http://forrest.apache.org/docs/plugins/org.apache.forrest.plugin.output.pdf/
          Karl Wright added a comment -
          I'm told that the Japanese portion of the site is correctly generated on a system that has a default locale of ja_JP. Obviously, though, this is not a good solution to the problem since we cannot select different locales when there is more than one language involved.
          Hitoshi Ozawa added a comment -
          Sorry David, I thought the HTML pages were being dynamically generated on the Apache server.
          It seems they are not. "forrest site" works fine on my Japanese OS.

          Karl, is your system set up to use en_US.UTF-8?
          export LC_ALL=en_US.UTF-8
          export LANG=en_US.UTF-8
          export LANGUAGE=en_US.UTF-8
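
Before changing these variables, it can help to see which locale the shell would actually hand to child processes. The following is only a minimal sketch, assuming a POSIX shell; the helper names `effective_ctype` and `is_utf8_locale` are hypothetical and not part of Forrest. It relies on the standard POSIX precedence (LC_ALL, then LC_CTYPE, then LANG). LANGUAGE is, as far as I know, a GNU gettext setting for message translation and does not select a character encoding.

```shell
# Hypothetical helpers, not part of Forrest.

# Print the locale that governs character encoding, following the
# POSIX precedence: LC_ALL overrides LC_CTYPE, which overrides LANG.
effective_ctype() {
    printf '%s\n' "${LC_ALL:-${LC_CTYPE:-${LANG:-C}}}"
}

# Succeed (exit status 0) if the given locale name declares a UTF-8
# codeset, e.g. en_US.UTF-8 or ja_JP.utf8; fail otherwise.
is_utf8_locale() {
    case $(printf '%s' "$1" | tr '[:upper:]' '[:lower:]') in
        *.utf-8|*.utf8|*.utf-8@*|*.utf8@*) return 0 ;;
        *) return 1 ;;
    esac
}

if is_utf8_locale "$(effective_ctype)"; then
    echo "effective locale $(effective_ctype) declares UTF-8"
else
    echo "effective locale $(effective_ctype) does not declare UTF-8"
fi
```

This only inspects the environment variables by name; it does not verify that the locale is installed, and it says nothing about how a JVM on Windows picks its default encoding.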
          Karl Wright added a comment -
          bq. Karl, is your system setup to use en_US-UTF-8?
          bq. export LC_ALL=en_US.UTF-8
          bq. export LANG=en_US.UTF-8
          bq. export LANGUAGE=en_US.UTF-8

          I set the equivalent Windows variables, but the generated output did not change for me. So it must be something else.
          Karl Wright added a comment -
          I figured it out. What we need to do is set the Java default encoding to UTF-8. The easy way to do this is (on Windows):

          set JAVA_TOOL_OPTIONS=-Dfile.encoding=UTF8

          ... or on Linux:

          export JAVA_TOOL_OPTIONS=-Dfile.encoding=UTF8

          Doing this before invoking Forrest causes every JVM it starts to use the right encoding. (It's Cocoon that seems to be broken, by the way.)
          Karl Wright added a comment -
          This patch works, at least as far as generating Japanese correctly on an en_US Windows machine.
          Karl Wright made changes -
          Attachment FOR-1231.patch [ 12511374 ]
          David Crossley added a comment -
          Thanks. I was thinking of a similar patch. However, I wondered whether it would need to append this setting to any existing JAVA_TOOL_OPTIONS and then reset it when finished.

          I have applied your patch as-is. Thanks.
          If someone thinks that it needs more, then please go ahead.

          Regarding the Cocoon situation, I think that the doc comments refer to the fact that Cocoon/Forrest have many supporting products handling various parts of the system. Perhaps some of those treat the encoding differently. So this environment setting seems a good solution.
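
The append-then-reset idea raised above could look roughly like the following. This is a sketch only, assuming a POSIX shell; the function name `with_utf8_jvm` is hypothetical, and this is not the content of the applied FOR-1231.patch. It appends the encoding flag to whatever JAVA_TOOL_OPTIONS already holds, runs the given command, and then restores the previous state.

```shell
# Hypothetical wrapper, not the applied patch.
# Usage: with_utf8_jvm <command> [args...]
with_utf8_jvm() {
    # Remember whether JAVA_TOOL_OPTIONS was set at all, and its old value.
    if [ "${JAVA_TOOL_OPTIONS+set}" = "set" ]; then
        _had_opts=1
        _old_opts=$JAVA_TOOL_OPTIONS
    else
        _had_opts=0
    fi

    # Append the flag instead of overwriting any existing options.
    JAVA_TOOL_OPTIONS="${JAVA_TOOL_OPTIONS:+$JAVA_TOOL_OPTIONS }-Dfile.encoding=UTF8"
    export JAVA_TOOL_OPTIONS

    # Run the wrapped command and keep its exit status.
    "$@"
    _status=$?

    # Restore the previous value, or unset the variable if it was unset.
    # Note: a value that was set but not exported stays exported afterwards.
    if [ "$_had_opts" -eq 1 ]; then
        JAVA_TOOL_OPTIONS=$_old_opts
    else
        unset JAVA_TOOL_OPTIONS
    fi
    return "$_status"
}
```

With this in place, `with_utf8_jvm forrest site` would run the site build with the flag set for that one invocation only.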

            People

            • Assignee: Unassigned
            • Reporter: Karl Wright
            • Votes: 0
            • Watchers: 1
