Nutch / NUTCH-2433

HTML Parser: keep the HTML tag where each outlink is found

Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.13
    • Fix Version/s: 1.14
    • Component/s: parser
    • Environment: Apache Nutch release 1.13.

    Description

      When parsing HTML pages, I need to know which HTML tag each outlink was found in (for example, 'a', 'script', or 'img').

      I propose adding a new configuration property, "parser.html.outlinks.htmlnode_metadata_name".
      If this property is not empty, every outlink found is given a metadata entry whose name is the property's value and whose value is the name of the HTML tag where the outlink was found.
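
      The proposed behaviour could be sketched roughly as follows. This is a minimal, self-contained illustration and not the code from the pull request: the class OutlinkTagSketch, the Link type, the collect method, and the example value "htmltag" are all hypothetical names, and none of them are Nutch classes or APIs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch only: these are NOT Nutch's parser or Outlink classes.
class OutlinkTagSketch {

    /** A found link: the target URL plus free-form metadata. */
    static final class Link {
        final String url;
        final Map<String, String> metadata = new LinkedHashMap<>();

        Link(String url) {
            this.url = url;
        }
    }

    /**
     * Builds links from {tagName, url} pairs.
     *
     * @param found        pairs of {tagName, url} in document order
     * @param metadataName stand-in for the value of
     *                     parser.html.outlinks.htmlnode_metadata_name;
     *                     null or blank disables tagging
     */
    static List<Link> collect(List<String[]> found, String metadataName) {
        boolean tagging = metadataName != null && !metadataName.trim().isEmpty();
        List<Link> links = new ArrayList<>();
        for (String[] pair : found) {
            Link link = new Link(pair[1]);
            if (tagging) {
                // Metadata name comes from the property; value is the tag name.
                link.metadata.put(metadataName, pair[0]);
            }
            links.add(link);
        }
        return links;
    }

    public static void main(String[] args) {
        List<String[]> found = Arrays.asList(
                new String[] {"a", "http://example.org/page"},
                new String[] {"img", "http://example.org/logo.png"});

        // "htmltag" is an assumed example value for the property.
        for (Link link : collect(found, "htmltag")) {
            System.out.println(link.url + " " + link.metadata);
        }
    }
}
```

      With an empty property value the sketch leaves every link's metadata untouched, which mirrors the "if this property is not empty" condition above.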

      I will send a pull request with my implementation.

          People

            Assignee: Unassigned
            Reporter: Marcos Bori (maborec)
            Votes: 0
            Watchers: 4
