Rida Benjelloun added a comment - 25/Jan/06 01:49 AM
Version 1.0
Great plugin. Thanks!
I successfully tested this plugin on Nutch 0.7.1. I just have a problem with some structures like this: <authors> <author>author1</author> <author>author2</author> <author>author3</author> </authors> In my Lucene index I only see the author3 value for this field.

Is there any update to this plugin for version 0.8.1? It's a great plugin and I use it now in 0.7. Sorry, I am not a Java guru, so I cannot do this myself.
Regards

Hi,
The parse-xml plugin has been updated. I have tested it with version 0.8.1. The update also fixes the bug related to multi-valued fields. Best regards, Rida Benjelloun.

Thank you very much! I will give it a go now.
Will this plugin be added to the Nutch trunk as part of the distribution? I would really like to see that happen. Thanks again.

Made a small change in order to compile against the latest trunk.
Regards

Rida has made an excellent plugin here. It is a must-have for anyone who needs a custom plugin to index and search various fields.
I have successfully got this plugin to work on both Nutch 0.7.2 and 0.8.1. I have a problem similar to the one Philippe Eugene describes above. I have an XML structure similar to: If I have the configuration <field name="tag" xpath="//Tags/Tag" type="Keyword" boost="1.3" />, the plugin inserts just one value for 'tag': it combines all the values into a single entry. This makes the content unsearchable by individual tags, because we must give all the tags, as the whole string "tag1 tag2 tag3" becomes one keyword. If I instead have the configuration: I get the same problem as Philippe, with only the last value in the index, i.e. tag = tag3 is stored.

Nutch doesn't support multi-valued fields, so I decided to merge the content into the same field. If you want to search the field, you should index it as "Text" instead of "Keyword".
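For readers following the thread, here is a minimal sketch of the behaviour discussed above. It uses only the JDK's XPath API, not the plugin's own classes. The sample XML, the class name `MultiValueSketch`, and the helper names are assumptions made for illustration, and the "Keyword" versus "Text" comparison is a simplification of what Lucene actually does (exact match on the whole value versus match on a token).

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class MultiValueSketch {

    // Returns the text of every node matched by the XPath expression, in document order.
    public static List<String> allValues(String xml, String xpathExpr) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList nodes = (NodeList) xpath.evaluate(xpathExpr, doc, XPathConstants.NODESET);
        List<String> values = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            values.add(nodes.item(i).getTextContent().trim());
        }
        return values;
    }

    // Merges all values into one space-separated string, as described in the thread.
    public static String merged(List<String> values) {
        return String.join(" ", values);
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical sample document; the structure in the original comment was not preserved.
        String xml = "<Doc><Tags><Tag>tag1</Tag><Tag>tag2</Tag><Tag>tag3</Tag></Tags></Doc>";
        List<String> tags = allValues(xml, "//Tags/Tag");
        String single = merged(tags);

        // Simplified "Keyword" semantics: the whole stored value must match exactly.
        boolean keywordMatch = single.equals("tag2");
        // Simplified "Text" semantics: the value is split into tokens and one token must match.
        boolean textMatch = Arrays.asList(single.split("\\s+")).contains("tag2");

        System.out.println(single);
        System.out.println("keyword-style match for tag2: " + keywordMatch);
        System.out.println("text-style match for tag2: " + textMatch);
    }
}
```

Keeping only the last matched node instead of the whole list would reproduce the "only tag3 is stored" symptom reported above.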
Hi, I ran the parser and it works fine. Now, instead of having the parser set default values as fields (i.e. <xmlcontent>), I want it to create index fields based on the fields in the XML document. The reason is that I will be parsing large XML documents that do not have an XPath. The XML is also generated from database tables and therefore has no XPath to validate against. Is this possible to implement in parse-xml?
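One way to picture the request above is to walk the DOM and use each leaf element's name as the field name. This is only a sketch with the JDK's DOM API, not the plugin's XMLParser or XMLIndexerProperties classes. The class name `ElementNameFieldsSketch`, the method names, and the sample row are assumptions made for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class ElementNameFieldsSketch {

    // Maps each leaf element name to the list of text values found under that name.
    public static Map<String, List<String>> fieldsFromElements(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        Map<String, List<String>> fields = new LinkedHashMap<>();
        collect(doc.getDocumentElement(), fields);
        return fields;
    }

    // Recurses through child elements; an element with no child elements is treated as a field.
    private static void collect(Element element, Map<String, List<String>> fields) {
        boolean hasChildElement = false;
        NodeList children = element.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.ELEMENT_NODE) {
                hasChildElement = true;
                collect((Element) child, fields);
            }
        }
        if (!hasChildElement) {
            String text = element.getTextContent().trim();
            if (!text.isEmpty()) {
                fields.computeIfAbsent(element.getTagName(), k -> new ArrayList<>()).add(text);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical row, as it might be exported from a database table.
        String xml = "<row><id>7</id><name>Ann</name><name>Bob</name></row>";
        System.out.println(fieldsFromElements(xml));
    }
}
```

Repeated element names are kept as a list here, so the caller can decide whether to merge them into one field value or handle them separately.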
In the populateField() method of the XMLParser class, the fields are checked against the ones in the properties map. To work around the issue, I tried to create another XMLIndexerProperties object and use it to set the list of fields that I want. The logic was: 1. If the properties map contains "default", create a new XMLIndexerProperties object. When I run the code, the method parses XML files with a valid XPath, but when parsing an XML document with no XPath, the program throws a class cast exception: java.lang.String. I then modified the code again to make sure the XMLField objects are actually created. This time, when I run the application, the document is parsed with no errors, but the default field <xmlcontent> is the one stored in the index and not the elements from the XML document. The reason I decided to create an XMLField object before storing it in a collection is that the extractDataFromElement method looks for an object of that type when iterating over the elements. Anyway, below is the logic I implemented, which doesn't differ much from the initial implementation: if (xmlIndexersProperties.containsKey("default")) { Collection fields = new ArrayList(); I hope someone can help with this and let me know how to implement it in a better way. Armel

Is there an update of this plugin available for the current trunk? Or is this kind of functionality implemented elsewhere?
Thanks,

Building the XMLParser plugin against the latest (1.0-dev) source throws errors for the following classes because of changes in the base interfaces:
org\apache\nutch\parse\xml\config\XMLIndexer.java :
org\apache\nutch\parse\xml\XMLIndexer.java:40:
org\apache\nutch\parse\xml\XMLParser.java :
org\apache\nutch\parse\xml\XMLParser.java:64:
org\apache\nutch\parse\xml\XMLParser.java:75:

Could somebody tell me what changes are needed to make this plugin compatible with the Nutch trunk (1.0)?

Once NUTCH-767 is handled, we get Tika's XML parser (which was the eventual home for this code anyway) for free. So, I'm going to mark this as "Won't Fix" in lieu of that.