Nutch / NUTCH-2769

parse-html unable to parse certain outlinks

Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.15, 1.16
    • Fix Version/s: None
    • Component/s: parser
    • Labels: None

    Description

      Nutch fails to extract certain outlinks from some pages.

      For example:

      Crawling http://d4fdot.com/pbfdot/PBC-North_index.asp does not produce the following outlinks:

      congress_avenue_lighting_improvements.asp

      blue_heron_boulevard_bridge_fender_replacement.asp

      indiantown_road_intersection_improvements.asp

       

      Crawling http://www.d4fdot.com/pbfdot/index.asp, however, does extract congress_avenue_lighting_improvements.asp correctly, even though the anchor element is structured similarly.

      The URL filters and normalizers have been relaxed so that they do almost nothing. No URLs or outlinks are being filtered out by the current config, yet the error still occurs.
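
      To confirm independently which anchors a page contains, the outlinks can be extracted outside of Nutch and compared with what the crawl reports. The following is a minimal sketch using only the Python standard library. The sample markup is hypothetical (it is not copied from the affected page), and the names AnchorCollector and extract_links are illustrative only; this is not how parse-html itself works.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class AnchorCollector(HTMLParser):
    """Collect absolute URLs from <a href="..."> elements."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names,
        # so <A HREF=...> is handled the same as <a href=...>.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative hrefs against the page URL.
                self.links.append(urljoin(self.base_url, value.strip()))


def extract_links(html, base_url):
    """Return every anchor target in `html`, resolved against `base_url`."""
    collector = AnchorCollector(base_url)
    collector.feed(html)
    collector.close()
    return collector.links


# Hypothetical markup, assumed for illustration only; the real page may differ.
SAMPLE_HTML = """
<html><body>
<a href="congress_avenue_lighting_improvements.asp">Congress Avenue</a>
<A HREF="blue_heron_boulevard_bridge_fender_replacement.asp">Blue Heron</A>
<a href="indiantown_road_intersection_improvements.asp">Indiantown Road</a>
</body></html>
"""

BASE_URL = "http://d4fdot.com/pbfdot/PBC-North_index.asp"

if __name__ == "__main__":
    for link in extract_links(SAMPLE_HTML, BASE_URL):
        print(link)
```

      Running the same function over the fetched HTML of both pages would show whether the three anchors are present in the raw markup. If they are, the loss happens during parsing rather than during fetching or filtering.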

       

       

       


          People

            Assignee: Unassigned
            Reporter: Prajeeth Emanuel (pemanuel)
            Votes: 0
            Watchers: 3
