Cocoon / COCOON-2002

HTML transformer only works with latin-1 characters

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 2.1.10, 2.1.11
    • Fix Version/s: 2.1.12
    • Component/s: Blocks: HTML
    • Labels:
      None
    • Other Info:
      Patch available

      Description

      When transforming HTML in encodings other than latin-1,
      the result is a page of question marks.

        Activity

        Cédric Damioli added a comment -
        The encoding of the incoming document is only used at parsing time.

        Here, we are already dealing with chars, which may or may not exist in a specific encoding.

        UTF-8 is probably the better choice here.
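The point about chars that "may or may not exist in a specific encoding" can be reproduced outside Cocoon. The following is a minimal sketch, not code from the transformer: the class name `EncodingLossDemo` and the sample string are made up for illustration, and it uses `java.nio.charset.StandardCharsets`, which is newer than the Java version Cocoon 2.1 targets.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class EncodingLossDemo {

    /** Encodes the text to bytes with the given charset, then decodes it again. */
    static String roundTrip(String text, Charset charset) {
        byte[] bytes = text.getBytes(charset);
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // Four Persian letters ("salam"); none of them exists in latin-1.
        String text = "\u0633\u0644\u0627\u0645";

        // getBytes() substitutes '?' for every char the charset cannot represent.
        System.out.println(roundTrip(text, StandardCharsets.ISO_8859_1)); // prints "????"

        // UTF-8 can represent every Unicode character, so nothing is lost.
        System.out.println(roundTrip(text, StandardCharsets.UTF_8).equals(text)); // prints "true"
    }
}
```

This matches the symptom in the description: every character outside latin-1 turns into a question mark at the moment the chars are converted to bytes.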
        Abbas Mousavi added a comment -
        >Just a thought - won't the encoding that needs to be used depend on what was used in the input document?
        >i.e. if the source document passed in from a file generator has <?xml version="1.0" encoding="Big5"?>, would
        >the above change cause similar problems?

        Yes, you are right: it would be better to add a configuration parameter for setting the encoding. Until such a
        parameter exists, though, UTF-8 is a better choice than latin-1.

        >Also, how is this affected by the char-encoding property in the tidy.properties configuration file?

        Setting input-encoding in tidy.properties has no effect.
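A configuration parameter of the kind suggested above could resolve to a charset roughly as follows. This is only a sketch under assumptions: the class `EncodingParameter` and its `resolve` method are hypothetical, no such parameter exists in the attached code, and the fallback to UTF-8 simply mirrors the choice argued for in this issue.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class EncodingParameter {

    /** Assumed default, following the UTF-8 suggestion in this issue. */
    static final Charset DEFAULT = StandardCharsets.UTF_8;

    /**
     * Maps a configured encoding name to a Charset.
     * Missing, blank, malformed or unsupported names fall back to DEFAULT.
     */
    static Charset resolve(String configured) {
        if (configured == null || configured.trim().isEmpty()) {
            return DEFAULT;
        }
        try {
            return Charset.forName(configured.trim());
        } catch (IllegalArgumentException e) {
            // IllegalCharsetNameException and UnsupportedCharsetException
            // are both subclasses of IllegalArgumentException.
            return DEFAULT;
        }
    }

    public static void main(String[] args) {
        System.out.println(resolve(null));          // falls back to the default
        System.out.println(resolve("ISO-8859-1"));  // explicitly configured
        System.out.println(resolve("no-such-set")); // unsupported, falls back
    }
}
```

The resolved `Charset` would then be passed to `getBytes` in place of the hard-coded name. Silently falling back is a design choice for the sketch; a real component might prefer to fail at configuration time.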

        Andrew Stevens added a comment -
        Just a thought - won't the encoding that needs to be used depend on what was used in the input document? i.e. if the source document passed in from a file generator has <?xml version="1.0" encoding="Big5"?>, would the above change cause similar problems?

        Also, how is this affected by the char-encoding property in the tidy.properties configuration file? Rather than the above change, could you have solved your problem by ensuring that property matches the source encoding being used in your documents? It may be that jtidy's default is latin-1.

        It seems to me that passing the above value in to the getBytes call assumes that the AbstractSAXTransformer's text recording code is written to always use UTF-8 for the stored text (and transcode where necessary). Is this actually the case?
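On the question of whether the recorded text is "stored as UTF-8": a Java String holds chars, not bytes in any document encoding, so once parsing has decoded the input, the declared source encoding no longer affects the recorded text. The sketch below illustrates this with plain JDK calls. The class and method names are hypothetical, and UTF-16 stands in for Big5 only because Big5 is not guaranteed to be available on every JVM.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class SourceEncodingDemo {

    /** Simulates parsing: the bytes of the source are decoded into chars once. */
    static String parse(byte[] sourceBytes, Charset declared) {
        return new String(sourceBytes, declared);
    }

    /** Simulates the patched hand-off: the recorded chars are re-encoded as UTF-8. */
    static byte[] handOff(String recorded) {
        return recorded.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "\u4E2D\u6587"; // two CJK characters

        // The same document stored in two different encodings.
        byte[] storedAsUtf16 = original.getBytes(StandardCharsets.UTF_16);
        byte[] storedAsUtf8 = original.getBytes(StandardCharsets.UTF_8);

        String a = parse(storedAsUtf16, StandardCharsets.UTF_16);
        String b = parse(storedAsUtf8, StandardCharsets.UTF_8);

        // After decoding, both are the same chars, so the UTF-8 bytes
        // handed on afterwards are identical as well.
        System.out.println(a.equals(b));                          // prints "true"
        System.out.println(Arrays.equals(handOff(a), handOff(b))); // prints "true"
    }
}
```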
        Abbas Mousavi added a comment -
        This change in org.apache.cocoon.transformation.HTMLTransformer solved the problem.

        The change is near line 173:
         new ByteArrayInputStream(text.getBytes("UTF-8"));
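Encoding the text as UTF-8 only helps if whatever reads the stream also decodes it as UTF-8. The following sketch shows both outcomes using only JDK classes; `StreamDecodeDemo` and its methods are illustrative names, and nothing here establishes which charset JTidy itself assumes for its input.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class StreamDecodeDemo {

    /** Same construction as the patched line: chars encoded to UTF-8 bytes. */
    static InputStream utf8Stream(String text) {
        return new ByteArrayInputStream(text.getBytes(StandardCharsets.UTF_8));
    }

    /** Reads the whole stream as text, decoding with the given charset. */
    static String readAll(InputStream in, Charset charset) {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(in, charset)) {
            char[] buf = new char[256];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = "caf\u00E9 \u0633\u0644\u0627\u0645";

        // Matching decoder: the original text comes back unchanged.
        System.out.println(readAll(utf8Stream(text), StandardCharsets.UTF_8).equals(text)); // prints "true"

        // Mismatched decoder: each UTF-8 byte becomes its own latin-1 char.
        System.out.println(readAll(utf8Stream(text), StandardCharsets.ISO_8859_1).equals(text)); // prints "false"
    }
}
```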

        ---------------------------------------------------------------------------------------
        /*
         * Licensed to the Apache Software Foundation (ASF) under one or more
         * contributor license agreements. See the NOTICE file distributed with
         * this work for additional information regarding copyright ownership.
         * The ASF licenses this file to You under the Apache License, Version 2.0
         * (the "License"); you may not use this file except in compliance with
         * the License. You may obtain a copy of the License at
         *
         * http://www.apache.org/licenses/LICENSE-2.0
         *
         * Unless required by applicable law or agreed to in writing, software
         * distributed under the License is distributed on an "AS IS" BASIS,
         * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
         * See the License for the specific language governing permissions and
         * limitations under the License.
         */
        package org.apache.cocoon.transformation;

        import java.io.BufferedInputStream;
        import java.io.ByteArrayInputStream;
        import java.io.IOException;
        import java.io.PrintWriter;
        import java.io.StringWriter;
        import java.util.HashMap;
        import java.util.Map;
        import java.util.Properties;
        import java.util.StringTokenizer;

        import org.apache.avalon.framework.configuration.Configurable;
        import org.apache.avalon.framework.configuration.Configuration;
        import org.apache.avalon.framework.configuration.ConfigurationException;
        import org.apache.avalon.framework.parameters.Parameters;
        import org.apache.cocoon.ProcessingException;
        import org.apache.cocoon.environment.SourceResolver;
        import org.apache.cocoon.transformation.AbstractSAXTransformer;
        import org.apache.cocoon.xml.XMLUtils;
        import org.apache.cocoon.xml.IncludeXMLConsumer;
        import org.apache.excalibur.source.Source;
        import org.w3c.tidy.Tidy;
        import org.xml.sax.Attributes;
        import org.xml.sax.SAXException;

        /**
         * Converts (escaped) HTML snippets into JTidied HTML.
         * This transformer expects a list of elements, passed as comma-separated
         * values of the "tags" parameter. It records the text enclosed in such
         * elements and passes it through JTidy to obtain valid XHTML.
         *
         * <p>TODO: Add namespace support.
         * <p><strong>WARNING:</strong> This transformer should be considered unstable.
         *
         * @author <a href="mailto:d.madama@pro-netics.com">Daniele Madama</a>
         * @author <a href="mailto:gianugo@apache.org">Gianugo Rabellino</a>
         *
         * @version CVS $Id: HTMLTransformer.java 433543 2006-08-22 06:22:54Z crossley $
         */
        public class HTMLTransformer
            extends AbstractSAXTransformer
            implements Configurable {

            /**
             * Properties for Tidy format
             */
            private Properties properties;

            /**
             * Tags that must be normalized
             */
            private Map tags;

            /**
             * React on endElement calls that contain a tag to be
             * tidied and run Jtidy on it, otherwise passthru.
             *
             * @see org.xml.sax.ContentHandler#endElement(java.lang.String, java.lang.String, java.lang.String)
             */
            public void endElement(String uri, String name, String raw)
                throws SAXException {
                if (this.tags.containsKey(name)) {
                    String toBeNormalized = this.endTextRecording();
                    try {
                        this.normalize(toBeNormalized);
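                    } catch (ProcessingException e) {
                        // Propagate the failure instead of printing it and carrying on.
                        throw new SAXException("Unable to normalize content of <" + name + ">", e);
                    }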
                }
                super.endElement(uri, name, raw);
            }

            /**
             * Start buffering text if inside a tag to be normalized,
             * passthru otherwise.
             *
             * @see org.xml.sax.ContentHandler#startElement(java.lang.String, java.lang.String, java.lang.String, org.xml.sax.Attributes)
             */
            public void startElement(
                String uri,
                String name,
                String raw,
                Attributes attr)
                throws SAXException {
                super.startElement(uri, name, raw, attr);
                if (this.tags.containsKey(name)) {
                    this.startTextRecording();
                }
            }

            /**
             * Configure this transformer, possibly passing to it
             * a jtidy configuration file location.
             */
            public void configure(Configuration config) throws ConfigurationException {
                super.configure(config);

                String configUrl = config.getChild("jtidy-config").getValue(null);
                if (configUrl != null) {
                    org.apache.excalibur.source.SourceResolver resolver = null;
                    Source configSource = null;
                    try {
                        resolver = (org.apache.excalibur.source.SourceResolver)
                                   this.manager.lookup(org.apache.excalibur.source.SourceResolver.ROLE);
                        configSource = resolver.resolveURI(configUrl);
                        if (getLogger().isDebugEnabled()) {
                            getLogger().debug(
                                "Loading configuration from " + configSource.getURI());
                        }
                        this.properties = new Properties();
                        this.properties.load(configSource.getInputStream());

                    } catch (Exception e) {
                        getLogger().warn("Cannot load configuration from " + configUrl);
                        throw new ConfigurationException(
                            "Cannot load configuration from " + configUrl,
                            e);
                    } finally {
                        if (null != resolver) {
                            // Release the source first, while the resolver is still held.
                            resolver.release(configSource);
                            this.manager.release(resolver);
                        }
                    }
                }
            }

            /**
             * The beef: run JTidy on the buffered text and stream
             * the result
             *
             * @param text the string to be tidied
             */
            private void normalize(String text) throws ProcessingException {
                try {
                    // Setup an instance of Tidy.
                    Tidy tidy = new Tidy();
                    tidy.setXmlOut(true);

                    if (this.properties == null) {
                        tidy.setXHTML(true);
                    } else {
                        tidy.setConfigurationFromProps(this.properties);
                    }

                    //Set Jtidy warnings on-off
                    tidy.setShowWarnings(getLogger().isWarnEnabled());
                    //Set Jtidy final result summary on-off
                    tidy.setQuiet(!getLogger().isInfoEnabled());
                    //Set Jtidy infos to a String (will be logged) instead of System.out
                    StringWriter stringWriter = new StringWriter();
                    PrintWriter errorWriter = new PrintWriter(stringWriter);
                    tidy.setErrout(errorWriter);

                    // Extract the document using JTidy and stream it.
                    ByteArrayInputStream bais =
                        new ByteArrayInputStream(text.getBytes("UTF-8"));
                    org.w3c.dom.Document doc =
                        tidy.parseDOM(new BufferedInputStream(bais), null);

                    // FIXME: Jtidy doesn't warn or strip duplicate attributes in same
                    // tag; stripping.
                    XMLUtils.stripDuplicateAttributes(doc, null);

                    errorWriter.flush();
                    errorWriter.close();
                    if (getLogger().isWarnEnabled()) {
                        getLogger().warn(stringWriter.toString());
                    }

                    IncludeXMLConsumer.includeNode(doc, this.contentHandler, this.lexicalHandler);
                } catch (Exception e) {
                    throw new ProcessingException(
                        "Exception in HTMLTransformer.normalize()",
                        e);
                }
            }

            /**
             * Setup this component, passing the tag names to be tidied.
             */

            public void setup(
                SourceResolver resolver,
                Map objectModel,
                String src,
                Parameters par)
                throws ProcessingException, SAXException, IOException {
                super.setup(resolver, objectModel, src, par);
                String tagsParam = par.getParameter("tags", "");
                if (getLogger().isDebugEnabled()) {
                    getLogger().debug("tags: " + tagsParam);
                }
                this.tags = new HashMap();
                StringTokenizer tokenizer = new StringTokenizer(tagsParam, ",");
                while (tokenizer.hasMoreElements()) {
                    String tok = tokenizer.nextToken().trim();
                    this.tags.put(tok, tok);
                }
            }
        }
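The handling of the comma-separated "tags" parameter in setup() can be tried in isolation. This is a sketch rather than the transformer's code: `TagListParser` is a hypothetical name, it collects into a Set instead of the Map used above, and unlike the original it skips tokens that are empty after trimming.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.StringTokenizer;

final class TagListParser {

    /** Splits a comma-separated tag list, trimming whitespace around each name. */
    static Set<String> parse(String tagsParam) {
        Set<String> tags = new LinkedHashSet<>();
        StringTokenizer tokenizer = new StringTokenizer(tagsParam, ",");
        while (tokenizer.hasMoreTokens()) {
            String tok = tokenizer.nextToken().trim();
            if (!tok.isEmpty()) {
                // Difference from setup(): whitespace-only entries are dropped.
                tags.add(tok);
            }
        }
        return tags;
    }

    public static void main(String[] args) {
        System.out.println(parse("content, body ,summary")); // prints "[content, body, summary]"
        System.out.println(parse(""));                       // prints "[]"
    }
}
```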

          People

          • Assignee: Unassigned
          • Reporter: Abbas Mousavi
          • Votes: 0
          • Watchers: 1
