Nutch / NUTCH-1918

TikaParser specifies a default namespace when generating DOM

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.10
    • Component/s: parser
    • Labels: None
    • Patch Info: Patch Available

    Description

      The DOM generated by parse-tika differs from the one produced by parse-html. Ideally we should be able to use either parser with the same XPath expressions.
      This is related to NUTCH-1592, but this time the problem comes from the namespace used rather than from upper-case element names.
      This issue has been investigated and fixed in storm-crawler https://github.com/DigitalPebble/storm-crawler/pull/58.

      Here is what Guillaume explained there:

      When parsing the content, Tika creates a properly formatted XHTML document: all elements are created within the namespace XHTML.

      However, XPath 1.0 has no concept of a default namespace, so XPath expressions such as //BODY don't match anything. To make this work we would have to use //ns1:BODY and define a NamespaceContext which associates ns1 with "http://www.w3.org/1999/xhtml".
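
      The behaviour described above can be reproduced with the standard javax.xml.xpath API. The following is only a minimal sketch, not code from the patch: the class name XhtmlNamespaceXPath, its helper methods and the hand-built two-element document are assumptions made for illustration.

```java
import java.util.Collections;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical demo class, not part of Nutch or Tika.
public class XhtmlNamespaceXPath {
  static final String XHTML = "http://www.w3.org/1999/xhtml";

  // Builds <HTML><BODY/></HTML> with every element in the XHTML namespace,
  // standing in for the kind of DOM the description attributes to Tika.
  static Document buildDoc() throws Exception {
    DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
    factory.setNamespaceAware(true);
    Document doc = factory.newDocumentBuilder().newDocument();
    Element html = doc.createElementNS(XHTML, "HTML");
    html.appendChild(doc.createElementNS(XHTML, "BODY"));
    doc.appendChild(html);
    return doc;
  }

  // Counts the nodes matched by expr; when withNs is true, the prefix
  // "ns1" is bound to the XHTML namespace through a NamespaceContext.
  static int count(Document doc, String expr, boolean withNs) throws Exception {
    XPath xpath = XPathFactory.newInstance().newXPath();
    if (withNs) {
      xpath.setNamespaceContext(new NamespaceContext() {
        @Override
        public String getNamespaceURI(String prefix) {
          return "ns1".equals(prefix) ? XHTML : XMLConstants.NULL_NS_URI;
        }

        @Override
        public String getPrefix(String namespaceURI) {
          return XHTML.equals(namespaceURI) ? "ns1" : null;
        }

        @Override
        public Iterator<String> getPrefixes(String namespaceURI) {
          String prefix = getPrefix(namespaceURI);
          return prefix == null
              ? Collections.<String>emptyIterator()
              : Collections.singletonList(prefix).iterator();
        }
      });
    }
    NodeList nodes = (NodeList) xpath.evaluate(expr, doc, XPathConstants.NODESET);
    return nodes.getLength();
  }

  public static void main(String[] args) throws Exception {
    Document doc = buildDoc();
    System.out.println("//BODY     -> " + count(doc, "//BODY", false));
    System.out.println("//ns1:BODY -> " + count(doc, "//ns1:BODY", true));
  }
}
```

      In this sketch the unprefixed expression only selects elements that are in no namespace, which is why every expression would need the ns1 prefix if the DOM kept the XHTML namespace.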

      To keep the XPath expressions simpler, I modified the DOMBuilder, the SAX handler we use to convert SAX events into a DOM tree, so that it ignores a "default namespace", and the ParserBolt initializes it with the XHTML namespace. This way //BODY matches.
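
      The idea of a SAX handler that drops a configured default namespace while building the DOM can be sketched as follows. This is not the DOMBuilder from Nutch or storm-crawler: the class NamespaceStrippingDomHandler, its constructor argument and the countMatches helper are hypothetical names, and the sample input uses lower-case element names.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.Deque;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: converts SAX events into a DOM tree and creates
// elements from the configured default namespace without any namespace.
public class NamespaceStrippingDomHandler extends DefaultHandler {
  private final Document doc;
  private final String defaultNamespaceURI; // may be null: nothing is stripped
  private final Deque<Node> stack = new ArrayDeque<>();

  public NamespaceStrippingDomHandler(Document doc, String defaultNamespaceURI) {
    this.doc = doc;
    this.defaultNamespaceURI = defaultNamespaceURI;
    stack.push(doc);
  }

  @Override
  public void startElement(String uri, String localName, String qName, Attributes atts) {
    Element el;
    if (uri == null || uri.isEmpty() || uri.equals(defaultNamespaceURI)) {
      // Element is in the ignored namespace (or in none): no namespace in the DOM.
      el = doc.createElementNS(null, localName);
    } else {
      // Elements from any other namespace keep it.
      el = doc.createElementNS(uri, qName);
    }
    for (int i = 0; i < atts.getLength(); i++) {
      String name = atts.getQName(i).isEmpty() ? atts.getLocalName(i) : atts.getQName(i);
      el.setAttribute(name, atts.getValue(i));
    }
    stack.peek().appendChild(el);
    stack.push(el);
  }

  @Override
  public void endElement(String uri, String localName, String qName) {
    stack.pop();
  }

  @Override
  public void characters(char[] ch, int start, int length) {
    Node parent = stack.peek();
    if (parent != doc) {
      parent.appendChild(doc.createTextNode(new String(ch, start, length)));
    }
  }

  // Parses xhtml through the handler and counts the nodes matched by expr.
  static int countMatches(String xhtml, String defaultNs, String expr) throws Exception {
    DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
    dbf.setNamespaceAware(true);
    Document doc = dbf.newDocumentBuilder().newDocument();
    SAXParserFactory spf = SAXParserFactory.newInstance();
    spf.setNamespaceAware(true);
    spf.newSAXParser().parse(new InputSource(new StringReader(xhtml)),
        new NamespaceStrippingDomHandler(doc, defaultNs));
    NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
        .evaluate(expr, doc, XPathConstants.NODESET);
    return nodes.getLength();
  }
}
```

      With the handler initialized with "http://www.w3.org/1999/xhtml", an unprefixed expression such as //body should match the parsed sample; initialized with null, the elements keep their namespace and the same expression should match nothing.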

      Attachments

        1. NUTCH-1918.patch
          2 kB
          Julien Nioche

          People

            Assignee: Unassigned
            Reporter: Julien Nioche (jnioche)
            Votes: 0
            Watchers: 4
