  1. Tika
  2. TIKA-1782

XHTMLContentHandler doesn't pass attributes of html element

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.11
    • Fix Version/s: 1.12
    • Component/s: parser
    • Labels: None
    • Flags: Patch

      Description

      XHTMLContentHandler.startElement() uses lazyHead() for the html element because it's defined in the AUTO Set. As a consequence, attributes of the html element are not passed to downstream content handlers.
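The reported behaviour can be illustrated with a simplified model. The sketch below is not Tika's code: `LazyHtmlSketch`, `HtmlAttributeRecorder`, `LazyHtmlDemo`, the `keepHtmlAttributes` switch and the contents of its `AUTO` set are all hypothetical names and assumptions made for illustration. It only shows how a handler that emits the html element lazily can end up emitting it without the attributes it received, and what keeping a copy of them would change.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical, simplified model of a handler that emits html lazily. */
class LazyHtmlSketch extends DefaultHandler {
    // Assumption: elements that are generated rather than forwarded as received.
    private static final Set<String> AUTO = new HashSet<>(Arrays.asList("html", "head"));

    private final ContentHandler downstream;
    private final boolean keepHtmlAttributes;
    private Attributes htmlAttributes = new AttributesImpl();
    private boolean htmlStarted = false;

    LazyHtmlSketch(ContentHandler downstream, boolean keepHtmlAttributes) {
        this.downstream = downstream;
        this.keepHtmlAttributes = keepHtmlAttributes;
    }

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts)
            throws SAXException {
        if (AUTO.contains(local)) {
            if ("html".equals(local) && keepHtmlAttributes) {
                // Copy the attributes, since a SAX source may reuse the object.
                htmlAttributes = new AttributesImpl(atts);
            }
            lazyHtml();
            return; // the incoming event itself is not forwarded
        }
        lazyHtml();
        downstream.startElement(uri, local, qName, atts);
    }

    /** Emits the html start tag once, with whatever attributes were kept. */
    private void lazyHtml() throws SAXException {
        if (!htmlStarted) {
            htmlStarted = true;
            downstream.startElement("", "html", "html", htmlAttributes);
        }
    }
}

/** Downstream handler that records the attributes it sees on html. */
class HtmlAttributeRecorder extends DefaultHandler {
    final List<String> seen = new ArrayList<>();

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) {
        if ("html".equals(local)) {
            for (int i = 0; i < atts.getLength(); i++) {
                seen.add(atts.getLocalName(i) + "=" + atts.getValue(i));
            }
        }
    }
}

class LazyHtmlDemo {
    /** Feeds html (with an itemscope attribute) and body into the sketch. */
    static List<String> run(boolean keepHtmlAttributes) {
        try {
            HtmlAttributeRecorder recorder = new HtmlAttributeRecorder();
            LazyHtmlSketch handler = new LazyHtmlSketch(recorder, keepHtmlAttributes);
            AttributesImpl atts = new AttributesImpl();
            atts.addAttribute("", "itemscope", "itemscope", "CDATA", "itemscope");
            handler.startElement("", "html", "html", atts);
            handler.startElement("", "body", "body", new AttributesImpl());
            return recorder.seen;
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("attributes dropped: " + run(false));
        System.out.println("attributes kept:    " + run(true));
    }
}
```

In this model, `run(false)` corresponds to the behaviour described in the report (the downstream handler sees html with no attributes) and `run(true)` to the intended behaviour.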

      1. TIKA-1782.patch
        2 kB
        Markus Jelsma

          Activity

          markus17 Markus Jelsma added a comment -

          Patch for trunk. ImageParserTest fails:

          testJPEG(org.apache.tika.parser.image.ImageParserTest) Time elapsed: 0.011 sec <<< ERROR!
          java.lang.UnsatisfiedLinkError: com.sun.imageio.plugins.jpeg.JPEGImageReader.initReaderIDs(Ljava/lang/Class;Ljava/lang/Class;Ljava/lang/Class;)V
          at com.sun.imageio.plugins.jpeg.JPEGImageReader.initReaderIDs(Native Method)
          at com.sun.imageio.plugins.jpeg.JPEGImageReader.<clinit>(JPEGImageReader.java:96)
          at com.sun.imageio.plugins.jpeg.JPEGImageReaderSpi.createReaderInstance(JPEGImageReaderSpi.java:85)
          at javax.imageio.spi.ImageReaderSpi.createReaderInstance(ImageReaderSpi.java:320)
          at javax.imageio.ImageIO$ImageReaderIterator.next(ImageIO.java:529)
          at javax.imageio.ImageIO$ImageReaderIterator.next(ImageIO.java:513)
          at org.apache.tika.parser.image.ImageParser.parse(ImageParser.java:164)
          at org.apache.tika.parser.image.ImageParserTest.testJPEG(ImageParserTest.java:93)

          tallison@mitre.org Tim Allison added a comment -

          What OS and Java version? I'm not seeing problems with RHEL 6.5 and Java 1.7.0_75.

          markus17 Markus Jelsma added a comment -

          Hello - this is on Java 1.8.0_40 and Ubuntu 14.10:

          openjdk version "1.8.0_40-internal"
          OpenJDK Runtime Environment (build 1.8.0_40-internal-b09)
          OpenJDK 64-Bit Server VM (build 25.40-b13, mixed mode)

          tallison@mitre.org Tim Allison added a comment -

          Not seeing it on RHEL with 1.8.0_66 either.

          IIRC, Jenkins runs on Ubuntu; I wonder why we aren't seeing it there either... hmmm...

          tallison@mitre.org Tim Allison added a comment -

          r1710799.

          We should probably open a separate issue to handle the failed build on Ubuntu with Java 1.8.

          markus17 Markus Jelsma added a comment -

          Hello Tim, is testJPEG's failure unrelated to this change?

          tallison@mitre.org Tim Allison added a comment - - edited

          Yes, I think so. The stack trace seems to suggest a more profound issue, and my build with this patch works on RHEL and Java 1.7.

          Was your build with trunk working before this patch?

          hudson Hudson added a comment -

          SUCCESS: Integrated in tika-trunk-jdk1.7 #878 (See https://builds.apache.org/job/tika-trunk-jdk1.7/878/)
          TIKA-1782 allow XHTMLContentHandler to pass attributes of html element via Markus Jelsma (tallison: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1710799)

          • trunk/CHANGES.txt
          • trunk/tika-core/src/main/java/org/apache/tika/sax/XHTMLContentHandler.java
          • trunk/tika-core/src/test/java/org/apache/tika/sax/XHTMLContentHandlerTest.java
          markus17 Markus Jelsma added a comment -

          Ah, testJPEG() fails independently and has nothing to do with this patch. For some reason, though, this fix doesn't appear to work in my code. It is identical to TIKA-995, which did work for me. I probably made an error; I am not sure.

          tallison@mitre.org Tim Allison added a comment -

          Hmmm...should I reopen this issue and revert? Do you have a shareable test file?

          markus17 Markus Jelsma added a comment -

          Hi - I have no test file at hand, but my consumer code doesn't see the attribute on the HTML tag being passed. Perhaps it is a good idea to revert until we're sure this fix works. The unit test for this issue only passes with this fix, so I don't understand it.

          markus17 Markus Jelsma added a comment -

          Hello Tim, I think there is a test, see TIKA-980. The unit test comes with a test HTML page that has an itemscope attribute on the body tag. The test should continue to work if the itemscope is moved to the html tag, but it doesn't.
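The consumer-side check described here can be sketched as follows. This is an illustration only: `ItemscopeProbe` and its `probe` method are hypothetical names, and the JDK's own SAX parser stands in for Tika, so the sketch shows what a downstream handler would look for, not how Tika behaves. It reports the first element on which an itemscope attribute arrives.

```java
import java.io.StringReader;

import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical downstream handler: finds the first element carrying itemscope. */
class ItemscopeProbe extends DefaultHandler {
    String elementWithItemscope; // stays null if no element carries the attribute

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (elementWithItemscope == null && atts.getValue("itemscope") != null) {
            elementWithItemscope = qName;
        }
    }

    /** Parses a well-formed XHTML string with the JDK SAX parser (a stand-in for Tika). */
    static String probe(String xhtml) {
        try {
            ItemscopeProbe handler = new ItemscopeProbe();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
            return handler.elementWithItemscope;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Attribute on body, as in the existing test page.
        System.out.println(probe("<html><body itemscope=\"itemscope\"><p>x</p></body></html>"));
        // Attribute moved to html, the case this issue is about.
        System.out.println(probe("<html itemscope=\"itemscope\"><body><p>x</p></body></html>"));
    }
}
```

With Tika, a handler of this kind would be supplied as the ContentHandler argument of a parser's parse call; if the attribute on html is dropped upstream, `elementWithItemscope` would remain null for the second input.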

          sully James Sullivan added a comment -

          Is this related to this Stack Overflow question? http://stackoverflow.com/questions/30543395/how-to-get-href-attribute-from-base-tag-using-tika-sax-contenthandler Even with Tika 1.12, the base tag attributes do not seem to be passed.

          markus17 Markus Jelsma added a comment -

          Yes, unfortunately I agree. The unit test I supplied, similar to the attribute-on-body test, works very well. But for some reason I cannot read attributes on the html tag in the real world. Either the fix must be reverted, or we are doing something wrong.


            People

            • Assignee: Unassigned
            • Reporter: markus17 Markus Jelsma
            • Votes: 0
            • Watchers: 4
