
Key: NUTCH-185
Type: New Feature
Status: Resolved
Resolution: Won't Fix
Priority: Major
Assignee: Chris A. Mattmann
Reporter: Rida Benjelloun
Votes: 6
Watchers: 4
Nutch

XMLParser is configurable xml parser plugin.

Created: 25/Jan/06 01:45 AM   Updated: 26/Nov/09 03:16 AM
Component/s: fetcher, indexer
Affects Version/s: 0.7.2, 0.8, 0.8.1
Fix Version/s: 1.1

Time Tracking:
Not Specified

File Attachments:
parse-xml.patch (Text File, 0.3 kB), 2006-10-24 06:28 AM, nutch.newbie, licensed for inclusion in ASF works
parse-xml.zip (Zip Archive, 157 kB), 2006-10-24 02:27 AM, Rida Benjelloun, licensed for inclusion in ASF works
parse-xml.zip (Zip Archive, 1.52 MB), 2006-01-25 01:49 AM, Rida Benjelloun, licensed for inclusion in ASF works
Environment: OS Independent
Issue Links:
Incorporates
 
Reference
 

Resolution Date: 26/Nov/09 03:16 AM


Description
The XML parser is a configurable plugin. It uses XPath and namespaces to map XML elements to Lucene fields.

Information:

1- Copy "xmlparser-conf.xml" to the nutch/conf directory.

2- To index your custom XML files, modify "xmlparser-conf.xml".
The parser uses namespaces and XPath to parse XML content.
The config file maps XML nodes (selected with XPath) to Lucene fields.
Example: <field name="dctitle" xpath="//dc:title" type="Text" boost="1.4" />

3- An xmlIndexerProperties element encapsulates a set of fields associated with a namespace.
If that namespace is found in the XML document, the fields defined for the namespace are indexed.
Example:
<xmlIndexerProperties type="filePerDocument" namespace="http://purl.org/dc/elements/1.1/">
  <field name="dctitle" xpath="//dc:title" type="Text" boost="1.4" />
  <field name="dccreator" xpath="//dc:creator" type="keyword" boost="1.0" />
</xmlIndexerProperties>

4- A default namespace can be defined. It is applied when the parser finds no namespace in the document,
or when the namespace found in the XML document does not match any namespace defined in an xmlIndexerProperties element.
Example:
<xmlIndexerProperties type="filePerDocument" namespace="default">
  <field name="xmlcontent" xpath="//*" type="Unstored" boost="1.0" />
</xmlIndexerProperties>
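
The selection logic described in steps 2 to 4 can be sketched roughly as follows. This is only an illustration, not the plugin's code: the plugin itself is written in Java, and the issue does not say how it detects namespaces, so the sketch assumes they are read from element tags. The names CONFIG, PREFIXES, document_namespaces and extract_fields are hypothetical, the config is held in a dictionary instead of being read from "xmlparser-conf.xml", and the type and boost attributes are ignored. Python's ElementTree supports only a subset of XPath, so a leading "//" is rewritten to ".//".

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"

# Hypothetical in-memory stand-in for xmlparser-conf.xml:
# namespace -> list of (Lucene field name, XPath expression).
CONFIG = {
    DC_NS: [("dctitle", "//dc:title"), ("dccreator", "//dc:creator")],
    "default": [("xmlcontent", "//*")],
}

# Assumed prefix bindings used to resolve "dc:" inside the XPath expressions.
PREFIXES = {"dc": DC_NS}


def document_namespaces(root):
    """Collect the namespace URIs used by element tags in the document."""
    uris = set()
    for el in root.iter():
        # ElementTree stores qualified tags as "{uri}localname".
        if isinstance(el.tag, str) and el.tag.startswith("{"):
            uris.add(el.tag[1:].split("}", 1)[0])
    return uris


def extract_fields(xml_text):
    """Return {field name: [text values]} for the matching field groups."""
    root = ET.fromstring(xml_text)
    used = document_namespaces(root)

    # Step 3: use every configured namespace that occurs in the document.
    # Step 4: fall back to the "default" group when none matches.
    matched = [ns for ns in CONFIG if ns != "default" and ns in used]
    groups = matched or ["default"]

    fields = {}
    for ns in groups:
        for name, xpath in CONFIG[ns]:
            # ElementTree cannot evaluate absolute paths on an element,
            # so search from the root element instead.
            rel = "." + xpath if xpath.startswith("//") else xpath
            values = [
                el.text.strip()
                for el in root.findall(rel, PREFIXES)
                if el.text and el.text.strip()
            ]
            if values:
                fields[name] = values
    return fields


dc_doc = (
    '<record xmlns:dc="http://purl.org/dc/elements/1.1/">'
    "<dc:title>Nutch</dc:title><dc:creator>Rida Benjelloun</dc:creator>"
    "</record>"
)
print(extract_fields(dc_doc))
# {'dctitle': ['Nutch'], 'dccreator': ['Rida Benjelloun']}

plain_doc = "<note><to>Bob</to><body>Hi</body></note>"
print(extract_fields(plain_doc))
# {'xmlcontent': ['Bob', 'Hi']}
```

The first document uses the Dublin Core namespace, so only the dctitle and dccreator fields are produced. The second has no namespace at all, so the "default" group applies and the text of every descendant element goes into xmlcontent.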



Change History
Rida Benjelloun made changes - 25/Jan/06 01:49 AM
Field Original Value New Value
Attachment parse-xml.zip [ 12322310 ]
Rida Benjelloun made changes - 02/Feb/06 05:10 AM
Description: first sentence changed from "XMLParser is configurable plugin." to "Xml parser is configurable plugin."; the remainder of the description text was not changed.
Summary: changed from "XMLParser is configurable plugin. It use XPath and namespaces to do the mapping between the XML elements and Lucene fields." to "XMLParser is configurable xml parser plugin."
Rida Benjelloun made changes - 24/Oct/06 02:27 AM
Attachment parse-xml.zip [ 12343485 ]
Rida Benjelloun made changes - 24/Oct/06 02:28 AM
Affects Version/s 0.8 [ 12310224 ]
Affects Version/s 0.8.1 [ 12312020 ]
nutch.newbie made changes - 24/Oct/06 06:28 AM
Attachment parse-xml.patch [ 12343501 ]
Chris A. Mattmann made changes - 24/Nov/06 06:28 PM
Assignee Chris A. Mattmann [ chrismattmann ]
Chris A. Mattmann made changes - 29/Sep/07 04:30 PM
Link This issue is related to NUTCH-562 [ NUTCH-562 ]
Chris A. Mattmann made changes - 26/Nov/09 03:15 AM
Link This issue is part of NUTCH-767 [ NUTCH-767 ]
Chris A. Mattmann made changes - 26/Nov/09 03:16 AM
Resolution Won't Fix [ 2 ]
Fix Version/s 1.1 [ 12313609 ]
Status Open [ 1 ] Resolved [ 5 ]