  1. Nutch
  2. NUTCH-1918

TikaParser specifies a default namespace when generating DOM


    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.10
    • Component/s: parser
    • Labels:
      None
    • Patch Info:
      Patch Available

      Description

      The DOM generated by parse-tika differs from the one generated by parse-html. Ideally we should be able to use either parser with the same XPath expressions.
      This is related to NUTCH-1592, but this time, instead of being a matter of letter case, the problem comes from the namespace used.
      This issue has been investigated and fixed in storm-crawler: https://github.com/DigitalPebble/storm-crawler/pull/58

      Here is what Guillaume explained there:

      When parsing the content, Tika creates a properly formatted XHTML document: all elements are created within the XHTML namespace.

      However, XPath 1.0 has no concept of a default namespace: an unprefixed name such as BODY only matches elements that are in no namespace, so XPath expressions such as //BODY don't match anything. To make this work we would have to use //ns1:BODY and define a NamespaceContext which associates ns1 with "http://www.w3.org/1999/xhtml".
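
      The mismatch described above can be reproduced with the JDK's own XML APIs. The following is a minimal sketch, not code from Nutch or Tika: the class name XhtmlXPathDemo, the sample document and the helper methods are assumptions made for illustration. It parses a small XHTML document with a namespace-aware parser and evaluates both forms of the expression.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class; not part of Nutch, Tika or storm-crawler.
class XhtmlXPathDemo {
    static final String XHTML_NS = "http://www.w3.org/1999/xhtml";

    // Sample input (an assumption): upper-case names, all in the XHTML namespace.
    static final String SAMPLE =
        "<HTML xmlns=\"" + XHTML_NS + "\"><BODY><P>hello</P></BODY></HTML>";

    // Parses the XML with a namespace-aware DOM parser.
    static Document parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newDocumentBuilder()
                          .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Counts the nodes matched by expr; optionally binds the prefix ns1 to XHTML.
    static int count(Document doc, String expr, boolean bindNs1) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            if (bindNs1) {
                xpath.setNamespaceContext(new NamespaceContext() {
                    @Override
                    public String getNamespaceURI(String prefix) {
                        return "ns1".equals(prefix) ? XHTML_NS : XMLConstants.NULL_NS_URI;
                    }
                    @Override
                    public String getPrefix(String uri) {
                        return XHTML_NS.equals(uri) ? "ns1" : null;
                    }
                    @Override
                    public Iterator<String> getPrefixes(String uri) {
                        String prefix = getPrefix(uri);
                        return prefix == null
                            ? Collections.<String>emptyIterator()
                            : Collections.singleton(prefix).iterator();
                    }
                });
            }
            NodeList nodes = (NodeList) xpath.evaluate(expr, doc, XPathConstants.NODESET);
            return nodes.getLength();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse(SAMPLE);
        System.out.println(count(doc, "//BODY", false));    // prints 0
        System.out.println(count(doc, "//ns1:BODY", true)); // prints 1
    }
}
```

      The unprefixed expression finds nothing because BODY lives in the XHTML namespace, while the prefixed one matches once the prefix is bound. Every XPath expression and every caller would need that extra prefix and context.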

      To keep the XPath expressions simpler, I modified the DOMBuilder, which is our SAX handler used to convert the SAX events into a DOM tree, so that it ignores a "default namespace". The ParserBolt initializes it with the XHTML namespace. This way //BODY matches.
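
      The idea of a SAX handler that ignores a default namespace can be sketched as follows. This is an illustration only, not the DOMBuilder from the attached patch or from storm-crawler: the class names DefaultNamespaceDemo and StrippingDomHandler and the helper methods are hypothetical, and attribute handling is deliberately simplified.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; the real DOMBuilder is more complete.
class DefaultNamespaceDemo {
    static final String XHTML_NS = "http://www.w3.org/1999/xhtml";
    static final String SAMPLE =
        "<HTML xmlns=\"" + XHTML_NS + "\"><BODY><P>hello</P></BODY></HTML>";

    // Builds a DOM from SAX events; elements in the default namespace are
    // created without any namespace so that unprefixed XPath names match them.
    static class StrippingDomHandler extends DefaultHandler {
        private final Document doc;
        private final String defaultNamespaceURI;
        private Node current;

        StrippingDomHandler(Document doc, String defaultNamespaceURI) {
            this.doc = doc;
            this.defaultNamespaceURI = defaultNamespaceURI;
            this.current = doc;
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            boolean strip = uri == null || uri.isEmpty() || uri.equals(defaultNamespaceURI);
            Element element = doc.createElementNS(strip ? null : uri, localName);
            // Simplification (assumption): attribute namespaces and prefixes are dropped.
            for (int i = 0; i < atts.getLength(); i++) {
                element.setAttribute(atts.getLocalName(i), atts.getValue(i));
            }
            current.appendChild(element);
            current = element;
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            current = current.getParentNode();
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            if (current != doc) {
                current.appendChild(doc.createTextNode(new String(ch, start, length)));
            }
        }
    }

    // Parses xml with a namespace-aware SAX parser and feeds the handler.
    static Document buildDom(String xml, String defaultNamespaceURI) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            factory.newSAXParser().parse(new InputSource(new StringReader(xml)),
                                         new StrippingDomHandler(doc, defaultNamespaceURI));
            return doc;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Counts the nodes matched by an XPath expression with no NamespaceContext.
    static int count(Document doc, String expr) {
        try {
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate(expr, doc, XPathConstants.NODESET);
            return nodes.getLength();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(count(buildDom(SAMPLE, XHTML_NS), "//BODY")); // prints 1
        System.out.println(count(buildDom(SAMPLE, null), "//BODY"));     // prints 0
    }
}
```

      Passing the XHTML namespace as the default makes the plain //BODY expression match, while passing null keeps the namespace on the elements and reproduces the original mismatch.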

        Attachments

        1. NUTCH-1918.patch
          2 kB
          Julien Nioche


            People

            • Assignee:
              Unassigned
              Reporter:
              jnioche Julien Nioche
