Tika / TIKA-2100

Html Parser does not keep the html tag attributes


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.13
    • Fix Version/s: 1.19, 2.0.0
    • Component/s: parser
    • Labels: None

    Description

      Parsing a very simple HTML document such as:
      <!DOCTYPE html>
      <html lang="en">
      <head>
      <title>Page Title</title>
      </head>
      <body>

      <h1 align="left">My First Heading</h1>
      <p>My first paragraph.</p>

      </body>
      </html>

      the attributes of the html tag (here lang="en") are not accessible in the ContentHandler:
      * In the method startElement(String ns, String localName, String name, Attributes atts), atts is empty for the html element.
      * Moreover, the html tag's attributes do not seem to be passed through the HtmlMapper.mapSafeAttribute method either.
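
      To make the symptom easier to observe, a small probe handler along the following lines could be used. This is only a sketch, not code from the report: the class name AttributeProbe and the helper probe are hypothetical, and to stay self-contained it is driven by the JDK's own SAX parser on a well-formed XHTML version of the sample (DOCTYPE omitted) rather than by Tika. Passing the same handler to Tika's HtmlParser.parse(stream, handler, metadata, context) is presumably where the html entry would show no attributes on the affected version.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical probe: records the attributes that reach startElement for each element.
public class AttributeProbe extends DefaultHandler {
    // element name -> (attribute name -> value), in document order
    final Map<String, Map<String, String>> seen = new LinkedHashMap<>();

    @Override
    public void startElement(String ns, String localName, String name, Attributes atts) {
        Map<String, String> attrs = new LinkedHashMap<>();
        for (int i = 0; i < atts.getLength(); i++) {
            attrs.put(atts.getQName(i), atts.getValue(i));
        }
        seen.put(name, attrs);
    }

    // Assumption: input is well-formed XHTML, parsed here with the JDK SAX parser, not Tika.
    static Map<String, Map<String, String>> probe(String xhtml) throws Exception {
        AttributeProbe handler = new AttributeProbe();
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.seen;
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<html lang=\"en\"><head><title>Page Title</title></head>"
            + "<body><h1 align=\"left\">My First Heading</h1>"
            + "<p>My first paragraph.</p></body></html>";
        // Prints the per-element attribute map collected by the handler.
        System.out.println(probe(xhtml));
    }
}
```

      With a parser that forwards attributes correctly, the html entry should contain lang and the h1 entry should contain align; an empty html entry would reproduce the behaviour described above.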

            People

              Assignee: Unassigned
              Reporter: Gerard Bouchar
              Votes: 0
              Watchers: 5
