  1. Nutch
  2. NUTCH-425

parse-js pollutes anchor text with base URL of source page

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.9.0
    • Fix Version/s: 0.9.0
    • Component/s: fetcher
    • Labels: None

    Description

      The parse-js plugin always adds a URL, usually the base URL of the page, as the anchor text for any link it discovers while parsing JavaScript. Anchor text is tokenized at indexing time and by default carries a heavy weighting. The upshot is that pages often rank high in search results for no reason other than that the query term appears in these URL anchors.
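
To make the behaviour concrete, here is a minimal, self-contained sketch. It is not the parse-js plugin's code and not necessarily what the attached patch does; the class `JsLinkSketch`, the nested `Link` type, the simplified URL regex, and the `useBaseAsAnchor` flag are all hypothetical names introduced for illustration. It contrasts anchoring every discovered link with the source page's URL (the reported behaviour) against leaving the anchor empty (one plausible remedy).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class JsLinkSketch {
    /** Hypothetical stand-in for an outlink: target URL plus anchor text. */
    static final class Link {
        final String toUrl;
        final String anchor;

        Link(String toUrl, String anchor) {
            this.toUrl = toUrl;
            this.anchor = anchor;
        }

        @Override
        public String toString() {
            return "toUrl: " + toUrl + " anchor: " + anchor;
        }
    }

    // Simplified assumption: links appear as absolute http(s) URLs
    // inside quoted JavaScript string literals.
    private static final Pattern URL_IN_STRING =
        Pattern.compile("[\"'](https?://[^\"'\\s]+)[\"']");

    /**
     * Extracts links from a script. When useBaseAsAnchor is true, every link
     * gets the source page URL as its anchor (the behaviour reported here);
     * when false, the anchor is left empty so nothing extra is indexed.
     */
    static List<Link> extract(String script, String baseUrl, boolean useBaseAsAnchor) {
        List<Link> links = new ArrayList<>();
        Matcher m = URL_IN_STRING.matcher(script);
        while (m.find()) {
            String anchor = useBaseAsAnchor ? baseUrl : "";
            links.add(new Link(m.group(1), anchor));
        }
        return links;
    }

    public static void main(String[] args) {
        String script =
            "window.location = 'https://www2.westpac.com.au/emarket/check_merch.cfm?id=900030';";
        String base = "http://premier.ticketek.com.au/content/buyers/buyers_step1.aspx";
        System.out.println(extract(script, base, true));   // anchor polluted with the page URL
        System.out.println(extract(script, base, false));  // anchor left empty
    }
}
```

With the flag set to true, the anchor terms are just the tokens of the source URL, which is how unrelated pages end up matching queries for those tokens.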

      See http://www.mail-archive.com/nutch-user%40lucene.apache.org/msg06935.html for related user list postings.

      Here is an extract from the linkdb exhibiting the problem:

      https://www2.westpac.com.au/emarket/check_merch.cfm?id=900030 Inlinks:
      fromUrl: http://premier.ticketek.com.au/content/buyers/buyers_step1.aspx anchor: http://premier.ticketek.com.au/content/buyers/buyers_step1.aspx
      fromUrl: http://premier.ticketek.com.au/content/outlets/agencies_qld.aspx anchor: http://premier.ticketek.com.au/content/outlets/agencies_qld.aspx
      fromUrl: http://premier.ticketek.com.au/shows/show.aspx?sh=TSSWANS05 anchor: http://premier.ticketek.com.au/shows/show.aspx?sh=TSSWANS05
      fromUrl: http://premier.ticketek.com.au/content/outlets/agencies_vic.aspx anchor: http://premier.ticketek.com.au/content/outlets/agencies_vic.aspx
      fromUrl: http://premier.ticketek.com.au/Venues/VenueDetails.aspx?v=NMO&s=6547 anchor: http://premier.ticketek.com.au/Venues/VenueDetails.aspx?v=NMO&s=6547
      fromUrl: http://premier.ticketek.com.au/content/buyers/buyers_step5.aspx anchor: http://premier.ticketek.com.au/content/buyers/buyers_step5.aspx
      fromUrl: http://premier.ticketek.com.au/content/outlets/agencies_nsw.aspx anchor: http://premier.ticketek.com.au/content/outlets/agencies_nsw.aspx
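
The pattern in the extract above (each anchor identical to its fromUrl) can be checked mechanically. The following is a rough sketch that assumes a text dump with one `fromUrl: ... anchor: ...` entry per line, as shown above; the class `PollutedAnchorCheck` and its methods are hypothetical helpers, not part of Nutch.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PollutedAnchorCheck {
    // Matches lines of the assumed form "fromUrl: <url> anchor: <text>".
    private static final Pattern INLINK =
        Pattern.compile("^\\s*fromUrl:\\s+(\\S+)\\s+anchor:\\s*(.*)$");

    /** True when the line is an inlink entry whose anchor text equals its source URL. */
    static boolean isPolluted(String line) {
        Matcher m = INLINK.matcher(line);
        return m.matches() && m.group(1).equals(m.group(2).trim());
    }

    /** Counts the polluted inlink entries in a list of dump lines. */
    static long countPolluted(List<String> lines) {
        return lines.stream().filter(PollutedAnchorCheck::isPolluted).count();
    }

    public static void main(String[] args) {
        List<String> dump = Arrays.asList(
            "https://www2.westpac.com.au/emarket/check_merch.cfm?id=900030 Inlinks:",
            "fromUrl: http://premier.ticketek.com.au/content/buyers/buyers_step1.aspx"
                + " anchor: http://premier.ticketek.com.au/content/buyers/buyers_step1.aspx");
        System.out.println("polluted inlinks: " + countPolluted(dump));
    }
}
```

A count close to the total number of inlinks for a target, as in the extract, suggests the anchors came from script parsing rather than from real link text.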

      Attachments

        1. nutch425.patch
          2 kB
          Michael Stack

          People

            ab Andrzej Bialecki
            stack Michael Stack
