Nutch / NUTCH-185

XMLParser is a configurable XML parser plugin.

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version/s: 0.7.2, 0.8, 0.8.1
    • Fix Version/s: 1.1
    • Component/s: fetcher, indexer
    • Labels:
      None
    • Environment:

      OS Independent

      Description

      The XML parser is a configurable plugin. It uses XPath and namespaces to map XML elements to Lucene fields.

      Information:

      1- Copy "xmlparser-conf.xml" to the nutch/conf directory.

      2- To index your custom XML files, modify "xmlparser-conf.xml".
      The parser uses namespaces and XPath to parse XML content.
      The config file maps XML nodes (selected with XPath) to Lucene fields.
      Example: <field name="dctitle" xpath="//dc:title" type="Text" boost="1.4" />

      3- An xmlIndexerProperties element groups a set of fields associated with a namespace.
      If the namespace is found in the XML document, the fields defined for that namespace are indexed.
      Example:
      <xmlIndexerProperties type="filePerDocument" namespace="http://purl.org/dc/elements/1.1/">
      <field name="dctitle" xpath="//dc:title" type="Text" boost="1.4" />
      <field name="dccreator" xpath="//dc:creator" type="keyword" boost="1.0" />
      </xmlIndexerProperties>

      4- A default namespace can be defined. It is applied when the parser finds no namespace in the document,
      or when the namespace found in the XML document does not match any namespace defined in an xmlIndexerProperties element.
      Example:
      <xmlIndexerProperties type="filePerDocument" namespace="default">
      <field name="xmlcontent" xpath="//*" type="Unstored" boost="1.0" />
      </xmlIndexerProperties>
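
The mapping described above can be illustrated with the JDK's own XPath API. This is a minimal sketch, not the plugin's code: the class name XPathFieldMappingSketch, the methods extractFields and selectNamespaceKey, and the "dc" prefix binding are assumptions made for illustration, and this issue does not show how the plugin itself detects a document's namespace.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical illustration only; none of these names come from the parse-xml plugin.
class XPathFieldMappingSketch {

    static final String DC_NS = "http://purl.org/dc/elements/1.1/";

    /** Evaluates each configured XPath against the XML and returns field name -> text value. */
    static Map<String, String> extractFields(String xml, Map<String, String> fieldToXPath) {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true); // needed for prefixed steps such as dc:title
            Document doc = dbf.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));

            XPath xpath = XPathFactory.newInstance().newXPath();
            // Bind the "dc" prefix used in the XPath expressions to the Dublin Core namespace.
            xpath.setNamespaceContext(new NamespaceContext() {
                @Override public String getNamespaceURI(String prefix) {
                    return "dc".equals(prefix) ? DC_NS : XMLConstants.NULL_NS_URI;
                }
                @Override public String getPrefix(String uri) {
                    return DC_NS.equals(uri) ? "dc" : null;
                }
                @Override public Iterator<String> getPrefixes(String uri) {
                    return DC_NS.equals(uri)
                            ? Collections.singletonList("dc").iterator()
                            : Collections.<String>emptyIterator();
                }
            });

            Map<String, String> fields = new LinkedHashMap<>();
            for (Map.Entry<String, String> e : fieldToXPath.entrySet()) {
                // evaluate(String, Object) returns the string value of the first matching node.
                fields.put(e.getKey(), xpath.evaluate(e.getValue(), doc));
            }
            return fields;
        } catch (Exception ex) {
            throw new IllegalStateException("XML or XPath could not be processed", ex);
        }
    }

    /** Returns the document's namespace if it is configured, otherwise the "default" key. */
    static String selectNamespaceKey(String documentNamespace, Set<String> configuredNamespaces) {
        return documentNamespace != null && configuredNamespaces.contains(documentNamespace)
                ? documentNamespace
                : "default";
    }
}
```

Calling extractFields with the pairs dctitle -> //dc:title and dccreator -> //dc:creator mirrors the two field entries in the example configuration. selectNamespaceKey mirrors the fallback rule in step 4.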

      1. parse-xml.zip
        1.52 MB
        Rida Benjelloun
      2. parse-xml.zip
        157 kB
        Rida Benjelloun
      3. parse-xml.patch
        0.3 kB
        nutch.newbie

        Issue Links

          Activity

          Rida Benjelloun added a comment -

          Version 1.0

          Philippe EUGENE added a comment -

          Great plugin, thanks!
          I successfully tested this plugin on Nutch 0.7.1.
          I have just one problem, with structures like this:
          <authors>
          <author>author1</author>
          <author>author2</author>
          <author>author3</author>
          </authors>

          In my Lucene index I only see the author3 value for this field.
          I'm not sure whether the problem is in the plugin.
          I don't know if it's possible to have multiple values for a field in Nutch 0.7.1.

          Chris A. Mattmann added a comment -

          I propose that either this issue be closed and the patch files moved to NUTCH-23, or that NUTCH-23 be closed, as the two are duplicate issues. Comments?

          nutch.newbie added a comment -

          Is there any update to this plugin for version 0.8.1? It's a great plugin and I use it now in 0.7. Sorry, I am not a Java guru, so I can't do this myself.
          Regards

          Rida Benjelloun added a comment -

          Hi,
          The parse-xml plugin has been updated. I have tested it with version 0.8.1. The update also fixes the bug related to multi-valued fields.

          Best regards

          Rida Benjelloun.
          rida.benjelloun@doculibre.com

          nutch.newbie added a comment -

          Thank you very much! I will be giving it a go now.

          Will this plugin be added to the Nutch trunk as a part of distribution? I would really like to see that happen.

          Thanks again.

          nutch.newbie added a comment -

          Made a small change in order to compile against latest trunk.

          Regards

          Jayant Kumar Gandhi added a comment -

          Rida has made an excellent plugin here. It is a must-have for anyone who needs a custom plugin to index and search various fields.
          I have successfully got this plugin to work on both the 0.7.2 and 0.8.1 versions of Nutch.

          I have a problem similar to Philippe Eugene's above. I have an XML structure similar to
          <tags>
          <tag>tag1</tag>
          <tag>tag2</tag>
          <tag>tag3</tag>
          </tags>

          If I have the configuration as

          <field name="tag" xpath="//Tags/Tag" type="Keyword" boost="1.3" />

          the plugin inserts just one value for 'tag': it combines all the values into a single entry. This makes the content unsearchable by individual tag, because the whole string "tag1 tag2 tag3" becomes one keyword and we must give all the tags to match it.

          If instead I have the configuration as:
          <field name="tag" xpath="//Tags/Tag[1]" type="Keyword" boost="1.3" />
          <field name="tag" xpath="//Tags/Tag[2]" type="Keyword" boost="1.3" />
          <field name="tag" xpath="//Tags/Tag[3]" type="Keyword" boost="1.3" />
          ...
          ...

          I get the same problem as Philippe: only the last value ends up in the index, so tag = tag3 is stored.
          I'm not sure whether the problem is in the plugin or whether my XPath is incorrect. I don't know much Java; I am still trying to debug the code to find the cause and a solution.

          Rida Benjelloun added a comment -

          Nutch doesn't support multi-valued fields, so I decided to merge the content into a single field. If you want to search the field, you should index it as "Text" instead of "keyword".
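
The merge described above can be sketched with the JDK's XPath API. This is an illustration under stated assumptions, not the plugin's implementation: the class MultiValueMergeSketch and its methods are hypothetical, and the single-space separator is assumed from the "tag1 tag2 tag3" value reported earlier in this thread.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical illustration only; not taken from the parse-xml plugin.
class MultiValueMergeSketch {

    /** Returns the text of every node matched by the XPath expression, in document order. */
    static List<String> selectValues(String xml, String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            // Asking for a NODESET keeps all matches instead of only the first one.
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(expression, doc, XPathConstants.NODESET);
            List<String> values = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                values.add(nodes.item(i).getTextContent());
            }
            return values;
        } catch (Exception ex) {
            throw new IllegalStateException("XML or XPath could not be processed", ex);
        }
    }

    /** Merges all values into one field value (separator is an assumption). */
    static String merge(List<String> values) {
        return String.join(" ", values);
    }
}
```

A merged value such as "tag1 tag2 tag3" is a single term when indexed untokenized, which is why a tokenized type is suggested for searching on individual values.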

          Armel Nene added a comment -

          Hi, I am running the parser and it works fine. Now, instead of setting default values as fields (i.e. <xmlcontent>), I want the parser to create index fields based on the elements in the XML document. The reason is that I will be parsing large XML documents that do not have an XPath. The XML is also generated from database tables and therefore has no XPath to validate against. Is this possible to implement in parse-xml?

          In the populateField() method of the XMLParser class, the fields are checked against those in the properties map. To work around the issue, I tried to generate another XMLIndexerProperties object and use its field setter to add the list of fields that I want. The logic was:

          1. If the properties map contains "default", create a new XMLIndexerProperties object
          2. Create a collection to hold the new XML fields (i.e. Collection xmlFields = new ArrayList())
          3. Loop over the elements in the XML document and
          4. Create a new XMLField object for each element (i.e. XMLField.setName(Element.getName()))
          5. Add the XMLField object to the new collection
          6. Set the fields of the new XMLIndexerProperties object to the newly created collection, using the XMLIndexerProperties setXMLFields(Collection fields) method
          7. Then pass the variable to the extractDataFromFields method
          8. And finally, return the populated field collection.

          When I run the code, the method parses XML files with a valid XPath, but when parsing an XML document with no XPath, the program throws a ClassCastException: java.lang.String

          I then modified the code again to make sure the XMLField objects are actually created. This time, when I run the application, the document is parsed with no errors, but the default field <xmlcontent> is the one stored in the index and not the elements from the XML document. I decided to create an XMLField object before storing it in a collection because the extractDataFromElement method looks for an object of that type when iterating over the elements. Anyway, below is the logic I implemented, which doesn't differ much from the initial implementation:

          if (xmlIndexersProperties.containsKey("default")) {
              Collection fields = new ArrayList();
              List docx = ((org.jdom.Document) xml).getContent();
              Iterator children = docx.listIterator();
              while (children.hasNext()) {
                  Object o = children.next();
                  if (o instanceof Element) {
                      XMLField xmlfield = new XMLField();
                      Element el = (Element) o;
                      xmlfield.setFieldName(el.getName());
                      //xmlfield.setFieldType(el.)
                  }

          I hope someone can help with this and let me know how to implement it in a better way.

          Armel
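
The idea of deriving index fields from element names can be sketched as below. It uses the JDK's DOM API rather than JDOM and does not touch the plugin's classes: ElementNameFieldsSketch and fieldsFromElements are hypothetical names. As far as I understand the JDOM API, Document.getContent() returns only the document's top-level content (the root element plus any comments or processing instructions), so a loop over it would not reach nested elements. The sketch walks every element instead.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical illustration only; it does not use the plugin's XMLField or XMLIndexerProperties.
class ElementNameFieldsSketch {

    /** Maps each leaf element name to the list of text values found under that name. */
    static Map<String, List<String>> fieldsFromElements(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            // "*" matches every element in the document, in document order.
            NodeList all = doc.getElementsByTagName("*");
            Map<String, List<String>> fields = new LinkedHashMap<>();
            for (int i = 0; i < all.getLength(); i++) {
                Element el = (Element) all.item(i);
                if (hasChildElement(el)) {
                    continue; // container elements carry no value of their own
                }
                fields.computeIfAbsent(el.getTagName(), k -> new ArrayList<>())
                        .add(el.getTextContent().trim());
            }
            return fields;
        } catch (Exception ex) {
            throw new IllegalStateException("XML could not be parsed", ex);
        }
    }

    private static boolean hasChildElement(Element el) {
        for (Node c = el.getFirstChild(); c != null; c = c.getNextSibling()) {
            if (c.getNodeType() == Node.ELEMENT_NODE) {
                return true;
            }
        }
        return false;
    }
}
```

Each map key could then be turned into a field name, with the list holding the values found for that element.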

          Martina Koch added a comment -

          Is there an update of this plugin available for the current trunk? Or is this kind of functionality implemented elsewhere?

          Thanks,
          Beaucarnea

          Gopikrishnan added a comment -

          Building the XMLParser plugin against the latest (1.0-dev) source throws errors for the following classes because of changes in the base interfaces:

          org\apache\nutch\parse\xml\config\XMLIndexer.java :

          org\apache\nutch\parse\xml\XMLIndexer.java:40:
          org.apache.nutch.parse.xml.XMLIndexer is not abstract and does not override abstract method
          addIndexBackendOptions(org.apache.hadoop.conf.Configuration) in org.apache.nutch.indexer.IndexingFilter

          org\apache\nutch\parse\xml\XMLParser.java :

          org\apache\nutch\parse\xml\XMLParser.java:64:
          org.apache.nutch.parse.xml.XMLParser is not abstract and does not override abstract method getParse(org.apache.nutch.protocol.Content)
          in org.apache.nutch.parse.Parser

          org\apache\nutch\parse\xml\XMLParser.java:75:
          getParse(org.apache.nutch.protocol.Content) in org.apache.nutch.parse.xml.XMLParser cannot implement
          getParse(org.apache.nutch.protocol.Content) in org.apache.nutch.parse.Parser; attempting to use incompatible return type

          Could somebody tell me what changes are needed to make this plugin compatible with the Nutch trunk (1.0)?

          Chris A. Mattmann added a comment -

          Once NUTCH-767 is handled, we get Tika's XML parser (which was the eventual home for this code anyway) for free. So, I'm going to mark this as "Won't Fix" in lieu of that.

          Chris A. Mattmann added a comment -

          See comments related to NUTCH-767 in this issue's comments section. Once we address NUTCH-767, we get this functionality for free...

          Markus Jelsma added a comment - Bulk close of resolved issues: http://www.lucidimagination.com/search/document/2738eeb014805854/clean_up_open_legacy_issues_in_jira

            People

            • Assignee: Chris A. Mattmann
            • Reporter: Rida Benjelloun
            • Votes: 6
            • Watchers: 4
