Tika / TIKA-379

Html elements and attributes not available in XHTML representation

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 0.7
    • Fix Version/s: 0.8
    • Component/s: parser
    • Labels:
      None

      Description

      The following HTML document :

      <html lang="fi"><head>document 1 title</head><body>jotain suomeksi</body></html>

      is rendered as the following xhtml by Tika :

      <?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml"><head><title/></head><body>document 1 titlejotain suomeksi</body></html>

      with the lang attribute getting lost. The lang is not stored in the metadata either.
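
      For illustration, this is roughly how a client reading the XHTML SAX events would pick up the attribute if it were preserved. It is a minimal sketch that uses only the JDK SAX parser on an XHTML string, not Tika itself; the class name LangCollector and its collect helper are hypothetical.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical client-side handler: remembers the lang attribute of each
// element that carries one (a later element with the same name overwrites
// an earlier one, which is good enough for a sketch).
class LangCollector extends DefaultHandler {
    final Map<String, String> langByElement = new LinkedHashMap<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        String lang = atts.getValue("lang");
        if (lang != null) {
            langByElement.put(qName, lang);
        }
    }

    // Parses a well-formed XHTML string and returns the collected values.
    static Map<String, String> collect(String xhtml) {
        LangCollector handler = new LangCollector();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse input", e);
        }
        return handler.langByElement;
    }
}
```

      With the output shown above the returned map is empty, because no element carries a lang attribute any more.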

      1. ASF.LICENSE.NOT.GRANTED--TIKA-379
        2 kB
        Julien Nioche
      2. ASF.LICENSE.NOT.GRANTED--TIKA-379-2.patch
        7 kB
        Julien Nioche
      3. TIKA-379-3.patch
        9 kB
        Julien Nioche

          Activity

          Ken Krugler made changes -
          Link This issue is related to TIKA-478 [ TIKA-478 ]
          Ken Krugler added a comment -

          There's a problem with this change: calling the XHTMLContentHandler with elements from inside the <head> block triggers premature emission of the <head><title/></head><body> elements. I've filed TIKA-478 and will look at how best to fix this in HtmlHandler.

          Chris A. Mattmann made changes -
          Status: In Progress [ 3 ] -> Resolved [ 5 ]
          Fix Version/s: 0.8 [ 12314877 ]
          Resolution: Fixed [ 1 ]
          Chris A. Mattmann added a comment -
          • FYI I committed this in r949635. Thanks, Julien!
          Chris A. Mattmann made changes -
          Status: Open [ 1 ] -> In Progress [ 3 ]
          Chris A. Mattmann made changes -
          Assignee Chris A. Mattmann [ chrismattmann ]
          Julien Nioche made changes -
          Attachment TIKA-379-3.patch [ 12443561 ]
          Julien Nioche added a comment -

          Modified patch which fixes the test errors. Could anyone review it?
          Thanks

          Julien

          Julien Nioche made changes -
          Link This issue blocks NUTCH-817 [ NUTCH-817 ]
          Jukka Zitting added a comment -

          Re: second patch - Seems like a good approach.

          Julien Nioche made changes -
          Attachment TIKA-379-2.patch [ 12441735 ]
          Julien Nioche added a comment -

          Attached a second patch with a suggested solution for normalising/filtering incoming attribute names. The code compiles but the tests fail. The purpose is mostly to illustrate the idea before implementing it properly.
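
          The idea of normalising and filtering attribute names can be sketched as follows. This is not the code from the patch: the AttributeFilter class, its normalize and filter methods, and the allow-list are all hypothetical, and only the JDK SAX classes are used.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.AttributesImpl;

class AttributeFilter {
    // Hypothetical allow-list (element -> permitted attributes).
    // The attributes actually kept by the patch may differ.
    private static final Map<String, Set<String>> SAFE = Map.of(
            "html", Set.of("lang"),
            "a", Set.of("href", "rel"),
            "link", Set.of("href", "rel"));

    // Returns the lower-cased attribute name if it is allowed on the
    // element, or null if the attribute should be dropped.
    static String normalize(String element, String attribute) {
        String e = element.toLowerCase(Locale.ROOT);
        String a = attribute.toLowerCase(Locale.ROOT);
        return SAFE.getOrDefault(e, Set.of()).contains(a) ? a : null;
    }

    // Copies only the allowed attributes, under their normalised names.
    static Attributes filter(String element, Attributes incoming) {
        AttributesImpl kept = new AttributesImpl();
        for (int i = 0; i < incoming.getLength(); i++) {
            String safe = normalize(element, incoming.getQName(i));
            if (safe != null) {
                kept.addAttribute("", safe, safe, incoming.getType(i), incoming.getValue(i));
            }
        }
        return kept;
    }
}
```

          A handler would call filter(...) on the attributes of each incoming startElement event before forwarding the event to the XHTML output.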

          Julien Nioche made changes -
          Attachment TIKA-379 [ 12441731 ]
          Julien Nioche added a comment -

          Adds the Base, Meta and Link elements found in the Head section to the XHTML output

          Julien Nioche added a comment -

          Thanks for your comments.
          I had seen the HtmlMapper but, as I pointed out:

          There is actually special treatment for the elements in HEAD done in the class HtmlHandler, so simply adding link to the HtmlMapper does not solve the problem.

          I will send a patch later today which modifies the HtmlMapper to make it generate LINK elements in the XHTML output. This is a reasonable thing to do as this element is allowed in the XHTML DTD.
          I will look at the HtmlMapper later to see how we could get it to keep the href attributes.

          Jukka Zitting added a comment -

          The reason for the default HTML mapping rules in Tika is to simplify and normalize the input documents so that client applications can easily process all sorts of input (HTML or not) without needing type- or source-specific heuristics. The basic idea has been that clients should directly use the underlying parser libraries when they need custom processing of specific content types.

          That said, I see the value of being able to process even complex HTML input through the Tika API, and perhaps the above original intent is too strict for many use cases. The HtmlMapper interface we added for TIKA-347 should make it possible to relax the mapping rules, and in revision 933909 I added an IdentityHtmlMapper implementation of this interface to make it even easier to use:

          ParseContext context = new ParseContext();
          context.set(HtmlMapper.class, IdentityHtmlMapper.INSTANCE);

          Note that IdentityHtmlMapper breaks the guarantee that the Tika output is valid XHTML. Also, the HtmlMapper interface currently only covers elements, so all attributes are still lost, and IdentityHtmlMapper overrides the custom <a/> tag handling in HtmlHandler, so even the href attributes are gone. It would be good if we could extend the HtmlMapper mechanism to avoid these problems.

          Jukka N committed 933909 (3 files)
          Julien Nioche added a comment -

          There is actually special treatment for the elements in HEAD done in the class HtmlHandler, so simply adding link to the HtmlMapper does not solve the problem.

          Julien Nioche added a comment - - edited

          This is indeed a more generic problem. It also affects HTML elements like link, which are commonly used in head sections to specify favicons or canonical representations. These values are not stored in the metadata either and are vital for a crawler.

          I agree with Ken that it would be better not only to store this information in the metadata but also to be able to retrieve it from the SAX events.

          It looks like this is due to the filtering done in DefaultHtmlMapper, which can be overridden in the Context, so we could simply pass a less restrictive filter. The default mapper is based on http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd, which allows link elements within the head, so we could add it to mapSafeElement(). However, as there are no restrictions on the hierarchy, this would mean that such elements would also be allowed within the body.

          Any thoughts?
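
          To make the hierarchy concern concrete, here is a minimal sketch of a filter that only lets head-level elements through while a head element is open. It is an assumption about how such a restriction could look, not how Tika implements it; the class ContextAwareFilter and both element sets are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Locale;
import java.util.Set;

class ContextAwareFilter {
    // Hypothetical element sets, for illustration only.
    private static final Set<String> HEAD_ONLY = Set.of("link", "meta", "base");
    private static final Set<String> ANYWHERE = Set.of("html", "head", "title", "body", "p", "a");

    // Currently open elements, innermost first.
    private final Deque<String> open = new ArrayDeque<>();

    // Returns true if the element may be emitted at this position.
    boolean startElement(String name) {
        String n = name.toLowerCase(Locale.ROOT);
        boolean allowed = ANYWHERE.contains(n)
                || (HEAD_ONLY.contains(n) && open.contains("head"));
        open.push(n); // track every element so endElement stays balanced
        return allowed;
    }

    void endElement(String name) {
        if (!open.isEmpty()) {
            open.pop();
        }
    }
}
```

          A name-only check such as mapSafeElement() has no access to this context, which is why adding link there would also admit it inside the body.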

          Julien Nioche made changes -
          Summary: "Lang attribute on html tag skipped" -> "Html elements and attributes not available in XHTML representation"
          Affects Version/s: 0.7 [ 12314528 ]
          Priority: Major [ 3 ] -> Critical [ 2 ]
          Ken Krugler added a comment -

          I think this is part of a bigger issue regarding attributes getting stripped. E.g. <a rel="nofollow"> is important for web crawlers.

          Since the language attribute can be applied to a variety of tags, I don't think it's an option to just store it in the metadata.
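
          As an example of why a crawler needs the attribute in the SAX events, the sketch below collects link targets and skips anchors marked nofollow. It assumes the rel and href attributes reach the handler, uses only the JDK SAX parser, and the class name FollowableLinks is hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical crawler-side handler: keeps the href of every <a> element
// unless its rel attribute contains the token "nofollow".
class FollowableLinks extends DefaultHandler {
    final List<String> hrefs = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (!"a".equals(qName)) {
            return;
        }
        String href = atts.getValue("href");
        String rel = atts.getValue("rel");
        // rel is a whitespace-separated list of tokens
        boolean noFollow = rel != null
                && Arrays.asList(rel.toLowerCase(Locale.ROOT).trim().split("\\s+")).contains("nofollow");
        if (href != null && !noFollow) {
            hrefs.add(href);
        }
    }

    // Parses a well-formed XHTML string and returns the followable links.
    static List<String> extract(String xhtml) {
        FollowableLinks handler = new FollowableLinks();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse input", e);
        }
        return handler.hrefs;
    }
}
```

          If the rel attribute is stripped before the events reach the handler, every link looks followable.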

          Julien Nioche made changes -
          Summary: "Attribute on html tag not represented in XHTML" -> "Lang attribute on html tag skipped"
          Description: the sentence "The lang is not stored in the metadata either." was appended after "with the lang attribute getting lost."; the rest of the description is unchanged.

          Julien Nioche made changes -
          Field Original Value New Value
          Link This issue blocks NUTCH-794 [ NUTCH-794 ]
          Julien Nioche created issue -

            People

            • Assignee: Chris A. Mattmann
            • Reporter: Julien Nioche
            • Votes: 0
            • Watchers: 1
