  1. Nutch
  2. NUTCH-1511

Metadata in MYSQL updated with 'garbage'


Details

    Description

      After applying the patch for the metadata parser (NUTCH-1478), I notice that just before the crawl ends the metadata field is populated with the correct information. However, once the crawl has completely finished, the metadata field is populated with 'garbage': csh�����

      My SQL log file shows that the scoring plugin overwrites the metadata field in a final data insertion with 'csh \0\0\0\0\'. When I remove 'scoring-opic' from the 'plugin.includes' property in nutch-site.xml, the metadata field is crisp and clear.

      MySQL log file: I ran a crawl on http://nutch.apache.org. Below are fragments of my MySQL log file, showing only the moments when data is written to the METADATA field of the MySQL table.

      First insertion. Here I suppose scoring-opic writes its information, csh ?€\0\0\0:

      58 Query INSERT INTO webpage (fetchInterval,fetchTime,id,markers,metadata,score )VALUES (2592000,1357122976493,'org.apache.nutch:http/',' dist 0 injmrk y\0','
      csh ?€\0\0\0',1.0) ON DUPLICATE KEY UPDATE fetchInterval=2592000,fetchTime=1357122976493,markers=' dist 0 injmrk y\0',metadata='
      csh ?€\0\0\0',score=1.0
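
      One possible reading of the bytes after csh, offered as a sketch and not a confirmed diagnosis: if the log shows raw bytes rendered as Windows-1252, then '?' is 0x3F and '€' is 0x80, and 3F 80 00 00 is the big-endian IEEE 754 encoding of the float 1.0, the same value as score=1.0 in that statement. Both the Windows-1252 rendering and the idea that the plugin stores a 4-byte float under this key are assumptions; the report does not state either. The helper name decode_float is hypothetical.

```python
import struct


def decode_float(raw: bytes) -> float:
    """Interpret exactly 4 bytes as a big-endian IEEE 754 single-precision float."""
    return struct.unpack(">f", raw)[0]


# ASSUMPTION: the log rendered raw bytes as Windows-1252 text,
# so '?' stands for 0x3F and '€' stands for 0x80.
first_value = "?€".encode("cp1252") + b"\x00\x00"

# All-zero bytes, as in a value such as csh \0\0\0\0.
zero_value = b"\x00\x00\x00\x00"

print(first_value.hex(), decode_float(first_value))  # 3f800000 1.0
print(zero_value.hex(), decode_float(zero_value))    # 00000000 0.0
```

      Under these assumptions the value is a binary-encoded number, not corrupted text, which would explain why it looks like 'garbage' when read as a string.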

      Second insertion. Here the scraped metadata is inserted into the metadata field.

      81 Query INSERT INTO webpage (id,markers,metadata,outlinks,parseStatus,signature,text,title )VALUES ('org.apache.nutch:http/',

      The final insertion. Note that here the metadata field is overwritten with csh \0\0\0\0\0:

      90 Query INSERT INTO webpage (fetchTime,id,inlinks,markers,metadata )VALUES (1359714995075,'org.apache.nutch:http/',' 0http://nutch.apache.org/
      Nutch\0',' dist 0 injmrk y updmrk*1357122982-1745626508 _prsmrk*1357122982-1745626508 _gnmrk*1357122982-1745626508 ftcmrk*1357122982-1745626508\0','
      csh \0\0\0\0\0') ON DUPLICATE KEY UPDATE fetchTime=1359714995075,inlinks=' 0http://nutch.apache.org/
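
      The overwrite described in this report can be reproduced in miniature. The sketch below uses Python's built-in sqlite3 module as a stand-in for MySQL, with SQLite's ON CONFLICT ... DO UPDATE playing the role of ON DUPLICATE KEY UPDATE. It assumes the final statement's UPDATE clause also assigns the metadata column, which the truncated log line does not show. The two-column table and the parsed metadata bytes are hypothetical.

```python
import sqlite3

# In-memory stand-in for the webpage table (hypothetical, reduced to two columns).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE webpage (id TEXT PRIMARY KEY, metadata BLOB)")

# SQLite analogue of MySQL's INSERT ... ON DUPLICATE KEY UPDATE (needs SQLite 3.24+).
UPSERT = """
INSERT INTO webpage (id, metadata) VALUES (?, ?)
ON CONFLICT(id) DO UPDATE SET metadata = excluded.metadata
"""
SELECT = "SELECT metadata FROM webpage WHERE id = ?"

page_id = "org.apache.nutch:http/"

# Step 1: the parser writes the scraped metadata (hypothetical content).
parsed = b"description=hypothetical parsed value"
conn.execute(UPSERT, (page_id, parsed))
after_parse = conn.execute(SELECT, (page_id,)).fetchone()[0]

# Step 2: a later upsert carries only the csh entry and replaces the whole column.
conn.execute(UPSERT, (page_id, b"csh\x00\x00\x00\x00"))
after_update = conn.execute(SELECT, (page_id,)).fetchone()[0]

print(after_parse)
print(after_update)
```

      Because the column holds a single serialized value, the second write does not merge with the first; whatever the last writer serializes is all that remains.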

          People

            Unassigned
            wannabe J. Gobel
            Votes: 2
            Watchers: 5
