Nutch / NUTCH-185

XMLParser is a configurable XML parser plugin.

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version/s: 0.7.2, 0.8, 0.8.1
    • Fix Version/s: 1.1
    • Component/s: fetcher, indexer
    • Labels:
      None
    • Environment:

      OS Independent

      Description

      The XML parser is a configurable plugin. It uses XPath and namespaces to map XML elements to Lucene fields.
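The mapping idea can be illustrated with a minimal Python sketch. This is not the plugin's code (the plugin is a Java Nutch plugin); the sample record and the `extract_fields` helper are hypothetical, and only the Dublin Core namespace URI and the two XPath expressions come from this issue. Python's `xml.etree.ElementTree` supports only a subset of XPath and rejects absolute paths on elements, so the sketch rewrites a leading `//` to `.//`.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; only the dc namespace URI comes from the issue text.
SAMPLE = """<record xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>Crawling XML</dc:title>
  <dc:creator>Jane Doe</dc:creator>
</record>"""

# Prefix-to-URI bindings used to resolve prefixes inside the XPath expressions.
NAMESPACES = {"dc": "http://purl.org/dc/elements/1.1/"}

# Field name -> XPath, mirroring <field name="..." xpath="..."/> entries.
FIELD_XPATHS = {"dctitle": "//dc:title", "dccreator": "//dc:creator"}


def extract_fields(xml_text, field_xpaths, namespaces):
    """Return {field name: [text values]} for every XPath that matches."""
    root = ET.fromstring(xml_text)
    fields = {}
    for name, xpath in field_xpaths.items():
        # ElementTree cannot evaluate absolute paths on an element,
        # so search relative to the root instead.
        relative = "." + xpath if xpath.startswith("/") else xpath
        values = [
            el.text.strip()
            for el in root.findall(relative, namespaces)
            if el.text and el.text.strip()
        ]
        if values:
            fields[name] = values
    return fields


if __name__ == "__main__":
    print(extract_fields(SAMPLE, FIELD_XPATHS, NAMESPACES))
```

In a real indexer each extracted value would then be added to the index document under the field name, with the configured type and boost.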

      Information:

      1- Copy "xmlparser-conf.xml" to the nutch/conf directory.

      2- To index your custom XML files, modify "xmlparser-conf.xml".
      The parser uses namespaces and XPath to parse XML content.
      The config file maps XML nodes (selected with XPath) to Lucene fields.
      Example: <field name="dctitle" xpath="//dc:title" type="Text" boost="1.4" />

      3- Each xmlIndexerProperties element groups a set of fields associated with a namespace.
      If that namespace is found in the XML document, the fields defined for it are indexed.
      Example:
      <xmlIndexerProperties type="filePerDocument" namespace="http://purl.org/dc/elements/1.1/">
        <field name="dctitle" xpath="//dc:title" type="Text" boost="1.4" />
        <field name="dccreator" xpath="//dc:creator" type="keyword" boost="1.0" />
      </xmlIndexerProperties>

      4- A default namespace can be defined. It is applied when the parser finds no namespace
      in the document, or when the namespace found in the XML document does not match any namespace defined in an xmlIndexerProperties element.
      Example:
      <xmlIndexerProperties type="filePerDocument" namespace="default">
        <field name="xmlcontent" xpath="//*" type="Unstored" boost="1.0" />
      </xmlIndexerProperties>
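
The namespace selection described in steps 3 and 4 could be sketched as follows in Python. This is an illustration under stated assumptions, not the plugin's Java implementation: the `<config>` wrapper element and the function names are hypothetical, and "namespace found in the document" is assumed here to mean any namespace URI declared in the document.

```python
import io
import xml.etree.ElementTree as ET

# The two xmlIndexerProperties blocks come from the issue text;
# the <config> root element is a hypothetical wrapper.
CONFIG = """<config>
  <xmlIndexerProperties type="filePerDocument" namespace="http://purl.org/dc/elements/1.1/">
    <field name="dctitle" xpath="//dc:title" type="Text" boost="1.4" />
    <field name="dccreator" xpath="//dc:creator" type="keyword" boost="1.0" />
  </xmlIndexerProperties>
  <xmlIndexerProperties type="filePerDocument" namespace="default">
    <field name="xmlcontent" xpath="//*" type="Unstored" boost="1.0" />
  </xmlIndexerProperties>
</config>"""


def load_property_sets(config_text):
    """Map each configured namespace to its list of field attribute dicts."""
    root = ET.fromstring(config_text)
    return {
        props.get("namespace").strip(): [dict(f.attrib) for f in props.findall("field")]
        for props in root.findall("xmlIndexerProperties")
    }


def declared_namespaces(xml_text):
    """Collect every namespace URI declared anywhere in the document."""
    source = io.BytesIO(xml_text.encode("utf-8"))
    events = ET.iterparse(source, events=("start-ns",))
    return {uri for _event, (_prefix, uri) in events}


def select_fields(xml_text, property_sets):
    """Return fields for namespaces present in the document, else the default set."""
    found = declared_namespaces(xml_text)
    matched = [fields for ns, fields in property_sets.items() if ns in found]
    if matched:
        return [field for fields in matched for field in fields]
    # Assumption: fall back to the "default" block when nothing matches.
    return property_sets.get("default", [])


if __name__ == "__main__":
    sets = load_property_sets(CONFIG)
    doc = '<r xmlns:dc="http://purl.org/dc/elements/1.1/"><dc:title>T</dc:title></r>'
    print([f["name"] for f in select_fields(doc, sets)])
```

A document that declares the Dublin Core namespace gets the `dctitle` and `dccreator` fields, while a document with no namespace, or with an unrelated one, falls back to the single `xmlcontent` field.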

      1. parse-xml.patch
        0.3 kB
        nutch.newbie
      2. parse-xml.zip
        157 kB
        Rida Benjelloun
      3. parse-xml.zip
        1.52 MB
        Rida Benjelloun


            People

            • Assignee:
              Chris A. Mattmann
              Reporter:
              Rida Benjelloun
            • Votes:
              6
              Watchers:
              3
