After applying the patch for the metadata parser (NUTCH-1478), I notice that just before the crawl ends the metadata field is populated with the correct information. However, once the crawl has completely finished, the metadata field contains 'garbage': csh followed by unprintable bytes.
In my SQL log file I see that the scoring plugin overwrites the metadata field in a final data insertion with 'csh \0\0\0\0\0'. When I remove 'scoring-opic' from the 'plugin.includes' property in nutch-site.xml, the metadata field stays intact and readable.
MYSQL LOG FILE: I did a crawl on http://nutch.apache.org. Below are fragments of my MySQL log file, showing only the moments when data is written to the METADATA field of the MySQL table.
First insertion: here I suppose scoring-opic writes its information, csh ?€\0\0\0
58 Query INSERT INTO webpage (fetchInterval,fetchTime,id,markers,metadata,score )VALUES (2592000,1357122976493,'org.apache.nutch:http/',' dist 0 injmrk y\0','
csh ?€\0\0\0',1.0) ON DUPLICATE KEY UPDATE fetchInterval=2592000,fetchTime=1357122976493,markers=' dist 0 injmrk y\0',metadata='
Second insertion: here the scraped metadata is inserted into the metadata field.
81 Query INSERT INTO webpage (id,markers,metadata,outlinks,parseStatus,signature,text,title )VALUES ('org.apache.nutch:http/',
The final insertion: please note that here the metadata field is overwritten with csh \0\0\0\0\0
90 Query INSERT INTO webpage (fetchTime,id,inlinks,markers,metadata )VALUES (1359714995075,'org.apache.nutch:http/',' 0http://nutch.apache.org/
Nutch\0',' dist 0 injmrk y updmrk*1357122982-1745626508 _prsmrk*1357122982-1745626508 _gnmrk*1357122982-1745626508 ftcmrk*1357122982-1745626508\0','
csh \0\0\0\0\0') ON DUPLICATE KEY UPDATE fetchTime=1359714995075,inlinks=' 0http://nutch.apache.org/
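
A possible reading of the bytes that follow csh in the log fragments above, offered as a sketch and not as a confirmed diagnosis: if scoring-opic stores its value as a 4-byte big-endian float (an assumption), then ?€\0\0 would be the bytes 3F 80 00 00, which is 1.0, and the four zero bytes in the final insertion would be 0.0. This also assumes the log was rendered in Windows-1252, where byte 0x80 displays as €. The class and method names below are hypothetical and are not part of Nutch.

```java
import java.nio.ByteBuffer;
import java.nio.charset.Charset;

public class CshBytes {
    // Interpret four bytes as a big-endian IEEE 754 float
    // (ByteBuffer is big-endian by default).
    // Assumption: this is how the plugin serialises its value.
    static float decode(byte[] raw) {
        return ByteBuffer.wrap(raw).getFloat();
    }

    // Render bytes the way a Windows-1252 viewer would display them.
    // Assumption: the log excerpt was displayed in that encoding.
    static String asCp1252(byte[] raw) {
        return new String(raw, Charset.forName("windows-1252"));
    }

    public static void main(String[] args) {
        byte[] first = {0x3F, (byte) 0x80, 0x00, 0x00}; // "?€\0\0" in the first insertion
        byte[] last = {0x00, 0x00, 0x00, 0x00};         // "\0\0\0\0" in the final insertion

        System.out.println(decode(first)); // 1.0
        System.out.println(decode(last));  // 0.0
    }
}
```

If this reading is right, the values written after csh are well-formed floats, and the issue is that the final insertion replaces the whole metadata column, dropping the entries written by the second insertion.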