Nutch / NUTCH-807

JSParseFilter produces malformed URL

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version/s: 1.0.0
    • Fix Version/s: None
    • Component/s: parser
    • Labels: None
    • Environment: Redhat 2.6.18-128.1.6.el5PAE i686 i686 i386 GNU/Linux

    Description

      This was found when crawling the site http://zhidao.baidu.com/ (a Chinese-language site).

      It appears that this page contains JavaScript that confuses JSParseFilter, which then produces a URL like this:

      http://zhidao.baidu.com/){if(A===46){baidu.hide(

      I am not sure of the impact or scope of this issue in general. For this specific site, the observed effect is that far fewer pages were crawled.
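
      To illustrate the kind of check that would reject such a string, here is a minimal sketch. It is not JSParseFilter's actual code; the class and method names (OutlinkSanityCheck, looksLikeUrl) are hypothetical, and it assumes that strict parsing with java.net.URI is an acceptable test for a candidate outlink. java.net.URI rejects characters such as "{" that are not legal in a URI, so the string reported above fails to parse.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper, not part of Nutch: filters candidate outlinks
// extracted from JavaScript before they are added to the crawl.
class OutlinkSanityCheck {

    /** Returns true if the candidate parses as an absolute http(s) URI with a host. */
    static boolean looksLikeUrl(String candidate) {
        try {
            // The single-argument constructor throws on illegal characters
            // such as '{' or '}' instead of silently accepting them.
            URI uri = new URI(candidate);
            String scheme = uri.getScheme();
            return uri.getHost() != null
                && ("http".equalsIgnoreCase(scheme) || "https".equalsIgnoreCase(scheme));
        } catch (URISyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // A normal page URL from the crawled site.
        System.out.println(looksLikeUrl("http://zhidao.baidu.com/"));
        // The malformed string reported in this issue.
        System.out.println(looksLikeUrl("http://zhidao.baidu.com/){if(A===46){baidu.hide("));
    }
}
```

      A check like this only drops strings that are not syntactically valid URIs. It would not catch JavaScript fragments that happen to parse as valid URIs, so it is at best a partial mitigation.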

      Thanks.

          People

            Assignee: Unassigned
            Reporter: Minyao Zhu (minyaozhu)