  1. Nutch
  2. NUTCH-2464

Plugin headings: Headers That Contain HTML Elements Are Not Parsed

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.13
    • Fix Version/s: 1.14
    • Component/s: plugin
    • Labels:
      None
    • Environment:

      Internal development/test environments.

      Description

      Nutch does not appear to traverse HTML elements nested inside heading elements (e.g., <h1>, <h2>, <h3>). These headings often contain anchors and/or <span> tags, and the text nodes inside those nested tags hold the actual text that should be picked up as the heading value for indexing.
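
      To illustrate the expected behavior, here is a minimal sketch of collecting heading text by recursing through every descendant of the heading element rather than reading only its direct text children. It is not the Nutch plugin code or the committed fix: the class name HeadingTextSketch and the methods collectText and headings are hypothetical, and the sketch assumes well-formed XHTML input so that the JDK's built-in XML parser can be used.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical illustration only; not part of Nutch.
class HeadingTextSketch {

  // Append the value of every text node found at or below `node`,
  // descending into nested elements such as <a> and <span>.
  static void collectText(Node node, StringBuilder out) {
    if (node.getNodeType() == Node.TEXT_NODE) {
      out.append(node.getNodeValue());
      return;
    }
    NodeList children = node.getChildNodes();
    for (int i = 0; i < children.getLength(); i++) {
      collectText(children.item(i), out);
    }
  }

  // Return the whitespace-normalized text of each <tag> element in the
  // document. Assumes `xhtml` is well-formed XML.
  static List<String> headings(String xhtml, String tag) throws Exception {
    DocumentBuilder builder =
        DocumentBuilderFactory.newInstance().newDocumentBuilder();
    Document doc = builder.parse(new InputSource(new StringReader(xhtml)));
    NodeList nodes = doc.getElementsByTagName(tag);
    List<String> result = new ArrayList<>();
    for (int i = 0; i < nodes.getLength(); i++) {
      StringBuilder sb = new StringBuilder();
      collectText(nodes.item(i), sb);
      result.add(sb.toString().trim().replaceAll("\\s+", " "));
    }
    return result;
  }

  public static void main(String[] args) throws Exception {
    String page = "<html><body>"
        + "<h1><a href=\"/\"><span>Apache</span> Nutch</a></h1>"
        + "</body></html>";
    System.out.println(headings(page, "h1")); // prints [Apache Nutch]
  }
}
```

      In this sketch the <h1> element has no direct text children at all, so a lookup limited to the heading's immediate child nodes would yield an empty value, which matches the symptom described above.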

              People

              • Assignee: Jorge Luis Betancourt Gonzalez (jorgelbg)
              • Reporter: Cass Pallansch (cpallansch)
              • Votes: 0
              • Watchers: 5
