NUTCH-2769: parse-html unable to parse certain outlinks


    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.15, 1.16
    • Fix Version/s: None
    • Component/s: parser
    • Labels:
      None

      Description

      Nutch fails to extract certain outlinks from some pages.

      For example:

      When crawling http://d4fdot.com/pbfdot/PBC-North_index.asp, the following outlinks are not extracted:

      congress_avenue_lighting_improvements.asp

      blue_heron_boulevard_bridge_fender_replacement.asp

      indiantown_road_intersection_improvements.asp
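To make the expected result concrete, here is a minimal sketch that resolves the three link targets against the page URL. It assumes the targets appear in the page as relative href values, which the report does not confirm. The class and method names (ExpectedOutlinks, resolve, resolveAll) are hypothetical and not part of Nutch. It uses only java.net.URI from the JDK and does not fetch or parse the page.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration only; not part of Nutch.
class ExpectedOutlinks {

    // Resolve one href against the URL of the page that contains it
    // (standard RFC 3986 reference resolution via java.net.URI).
    static String resolve(String pageUrl, String href) {
        return URI.create(pageUrl).resolve(href).toString();
    }

    // Resolve every href in the list, preserving order.
    static List<String> resolveAll(String pageUrl, List<String> hrefs) {
        List<String> resolved = new ArrayList<>();
        for (String href : hrefs) {
            resolved.add(resolve(pageUrl, href));
        }
        return resolved;
    }

    public static void main(String[] args) {
        String page = "http://d4fdot.com/pbfdot/PBC-North_index.asp";
        // Assumption: these are relative hrefs, as listed in the report.
        List<String> hrefs = Arrays.asList(
            "congress_avenue_lighting_improvements.asp",
            "blue_heron_boulevard_bridge_fender_replacement.asp",
            "indiantown_road_intersection_improvements.asp");
        for (String url : resolveAll(page, hrefs)) {
            System.out.println(url);
        }
    }
}
```

The printed URLs are the absolute outlinks one would expect to see for this page. Comparing them with what the parser actually reports (for example, from Nutch's parsechecker tool) shows which entries are missing.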

       

      Crawling http://www.d4fdot.com/pbfdot/index.asp, however, correctly extracts congress_avenue_lighting_improvements.asp, even though the anchor element is structured similarly.

      The URL filters and normalizers have been reduced to a minimal configuration, so no URLs or outlinks are being filtered out by the current config, and the error still occurs.


            People

            • Assignee:
              Unassigned
              Reporter:
              pemanuel Prajeeth Emanuel
            • Votes:
              0
              Watchers:
              3
