  Nutch / NUTCH-2433

Html Parser: keep htmltag where the outlinks are found

    Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.13
    • Fix Version/s: 1.14
    • Component/s: parser
    • Labels:
    • Environment:

      Apache Nutch release 1.13.

      Description

      When parsing HTML pages, I need to know in which HTML tag each outlink was found (for example, 'a', 'script', or 'img').

      I propose adding a new configuration property, "parser.html.outlinks.htmlnode_metadata_name".
      If this property is not empty, every outlink found is assigned a metadata entry whose name is the value of this property and whose value is the name of the HTML tag in which the outlink was found.
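
      As an illustration of the proposal, the property could be set in conf/nutch-site.xml roughly as follows. This is a sketch only: the property name is the one proposed in this issue, while the value "outlink_tag" is a made-up example of a metadata name, and the description text is paraphrased from this issue rather than copied from the shipped configuration.

```xml
<!-- Hypothetical example for conf/nutch-site.xml.
     "outlink_tag" is an arbitrary metadata name chosen for illustration. -->
<property>
  <name>parser.html.outlinks.htmlnode_metadata_name</name>
  <value>outlink_tag</value>
  <description>
    If not empty, each outlink found by the HTML parser carries a metadata
    entry with this name, holding the HTML tag where the link was found.
  </description>
</property>
```

      Under that assumed setting, an outlink taken from an 'img' element would carry the metadata entry outlink_tag=img. Leaving the value empty would keep the current behaviour, with no tag metadata attached.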

      I will now submit a pull request with my implementation.

            People

            • Assignee:
              Unassigned
              Reporter:
              maborec Marcos Bori
            • Votes:
              0
              Watchers:
              3
