Forrest / FOR-448

Faulty treatment of a-Elements in html-pipeline

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.7, 0.8
    • Fix Version/s: 0.8
    • Component/s: Core operations
    • Labels:
      None
    • Environment:
      Windows XP SP2

      Description

      After noticing that anchor elements in HTML files got lost in the Forrest default pipeline, I ran some tests with a sample document (the files before and after are included) and found that named anchors either get lost completely or come out badly garbled. Even the text within them is sometimes lost.

      The line numbers refer to the original and the translated file.

      Original  Translated  Looks  Function  Notes
      line      line
      ----------------------------------------------------------------------
      16        157         ok     gone      <a> element is completely lost
      22        162         bad    ok        there are now 2 <a> elements and, unfortunately, twice the text:
                                             <a name="anchor2"></a>Anchor 2<a href="#anchor1">Anchor 2</a>
      29        166         ok     gone      <a> element is completely lost
      35        171         bad    gone      <a> element and the text within it are completely lost!
      42        176         ok     gone      <a> element is completely lost
      49        181         ok     gone      <a> element is completely lost

      1. anchorerrortestfiles.zip
        3 kB
        David Crossley
      2. html-to-document.xml.diff
        2 kB
        Jim Dixon

        Activity

        Ferdinand Soethe added a comment -
        Here comes the zip with the two test files.
        Ferdinand Soethe added a comment -
        Forgot to add:

        The problems are already there when I request index.xml, so as Ross suggested it is likely an html2document.xsl issue.

        Ferdinand
        David Crossley added a comment -
        The attachment was retrieved from our old issue tracker.
        Contributed: 24/Feb/05 Ferdinand Soethe
        Jim Dixon added a comment -
        This patch corrects the problem reported but does not deal with other shortcomings in html-to-document.xml and other stylesheets in the same directory. The general problem is that html-to-document.xml cannot handle many common HTML constructs. Some of these will be reported as separate issues.
        Jim Dixon added a comment -
        The original problem report contains at least one minor error. There are a number of anchors in the page. The second of these has as text "Anchor 2" but the anchor element is misleadingly labelled with
          href = "#anchor1"

        The problem reported arises for two reasons. First, the template handling anchors in html-to-document.xsl allows both the name and href attributes whereas they should be alternatives. The correction is to replace two IF elements with a CHOOSE with two WHENs, with the href attribute preferred, so that badly written HTML will be silently corrected (that is, if both name and href attributes are present, the name attribute will be discarded).
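
        For illustration, the CHOOSE described above could look roughly like the following. This is a minimal sketch and not the attached patch: the match pattern, the output element name and the xsl:otherwise branch are assumptions, and the real template in html-to-document.xsl may differ.

        ```xml
        <xsl:stylesheet version="1.0"
                        xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

          <!-- Hypothetical anchor template: name and href are treated as alternatives. -->
          <xsl:template match="a">
            <xsl:choose>
              <!-- href is tested first, so it wins if both attributes are present -->
              <xsl:when test="@href">
                <a href="{@href}">
                  <xsl:apply-templates/>
                </a>
              </xsl:when>
              <!-- only reached when there is no href -->
              <xsl:when test="@name">
                <a name="{@name}">
                  <xsl:apply-templates/>
                </a>
              </xsl:when>
              <!-- assumption, not part of the described change:
                   keep the text of anchors that have neither attribute -->
              <xsl:otherwise>
                <xsl:apply-templates/>
              </xsl:otherwise>
            </xsl:choose>
          </xsl:template>

        </xsl:stylesheet>
        ```

        Because xsl:choose takes only the first matching xsl:when, exactly one output element is written per input anchor, which avoids the duplicated text seen in the report.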

        Secondly, the template attempts to add an ID attribute, apparently in order to add the named anchor to the table of contents. This is an error (Sablotron rejects the original stylesheet) and anyway appears to be confused: generally speaking, adding named anchors to the TOC will simply confuse it.
        David Crossley added a comment -
        Thanks for your help, Jim. I fixed it in a different way. The HTML 4 specification does allow @name and @href to be used together on the same element. Also, I don't know why this template was removing other attributes such as @title and @target, which some people want to use. So I simply used "xsl:copy-of" to copy all the attributes.
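
        A minimal sketch of the "xsl:copy-of" approach follows. It is not the committed change: the match pattern and the output element name are assumptions made for illustration.

        ```xml
        <xsl:stylesheet version="1.0"
                        xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

          <!-- Hypothetical anchor template: every attribute is passed through. -->
          <xsl:template match="a">
            <a>
              <!-- copies name, href, title, target and any other attribute as-is;
                   attributes must be added before any child content -->
              <xsl:copy-of select="@*"/>
              <xsl:apply-templates/>
            </a>
          </xsl:template>

        </xsl:stylesheet>
        ```

        With this shape the template does not need to know which attributes exist, so an anchor carrying both @name and @href produces a single element with both attributes and its text once.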

        As you suggested, I removed the automated generation of @id attributes from @name attributes. The HTML 4 spec indicates that this can lead to invalid IDs. Looking into the history of html-to-document.xsl, I see that this has been there since the beginning. I have no idea why the original author thought it was necessary. We can add it back if people think it is needed.

          People

          • Assignee:
            Unassigned
          • Reporter:
            Ferdinand Soethe
          • Votes:
            0
          • Watchers:
            0
