Nutch / NUTCH-2389

Precise data parsing using Jsoup CSS selectors


    Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.3
    • Fix Version/s: 2.4
    • Component/s: parser
    • Labels:
      None

      Description

      As far as I know, Nutch 1.x and 2.x currently have no feature for extracting exact content from specific websites. For my current project I've developed a plugin, parse-jsoup, that uses Jsoup to extract precise content for site-specific crawling. It is driven by a detailed XML configuration (field name, CSS selector, attribute, extraction rules, data type, default value, etc.).
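To make the idea of a per-field rule concrete, here is a minimal sketch in Java. It is not the parse-jsoup code and does not reflect its actual configuration format: the `FieldRule` class, its field names, and the `lookup` function are all hypothetical. The `lookup` parameter stands in for the selector engine; with Jsoup it would roughly correspond to calling `Document.select(cssQuery)` and then reading `Element.text()` or `Element.attr(name)` from the first match. The sketch assumes every rule has a non-null default value and does not handle defaults that fail to parse.

```java
import java.util.Map;
import java.util.function.BiFunction;

class FieldRuleSketch {

    // Hypothetical rule shape: one output field described by a CSS selector,
    // an optional attribute, a data type, and a default value.
    static final class FieldRule {
        final String name;
        final String cssSelector;
        final String attribute;    // empty string means "use the element text"
        final String dataType;     // "string", "int" or "double" in this sketch
        final String defaultValue; // assumed non-null

        FieldRule(String name, String cssSelector, String attribute,
                  String dataType, String defaultValue) {
            this.name = name;
            this.cssSelector = cssSelector;
            this.attribute = attribute;
            this.dataType = dataType;
            this.defaultValue = defaultValue;
        }
    }

    // Applies one rule. `lookup` takes (selector, attribute) and returns the
    // raw matched text, or null when nothing matches.
    static Object apply(FieldRule rule, BiFunction<String, String, String> lookup) {
        String raw = lookup.apply(rule.cssSelector, rule.attribute);
        if (raw == null || raw.trim().isEmpty()) {
            raw = rule.defaultValue; // fall back when the selector finds nothing
        }
        String value = raw.trim();
        switch (rule.dataType) {
            case "int":
                return Integer.parseInt(value);
            case "double":
                return Double.parseDouble(value);
            default:
                return value;
        }
    }

    public static void main(String[] args) {
        // Stand-in for a parsed page: "selector|attribute" -> raw value.
        Map<String, String> page = Map.of(
                "h1.title|", " Example product ",
                "span.price|data-amount", "19.99");
        BiFunction<String, String, String> lookup =
                (selector, attribute) -> page.get(selector + "|" + attribute);

        FieldRule title = new FieldRule("title", "h1.title", "", "string", "untitled");
        FieldRule price = new FieldRule("price", "span.price", "data-amount", "double", "0");
        FieldRule stock = new FieldRule("stock", "span.stock", "", "int", "0");

        System.out.println(title.name + " = " + apply(title, lookup));
        System.out.println(price.name + " = " + apply(price, lookup));
        System.out.println(stock.name + " = " + apply(stock, lookup)); // no match, uses default
    }
}
```

Keeping the selector lookup behind a function keeps the sketch free of external dependencies; a real plugin would pass in a lookup backed by the parsed Jsoup document.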

      Please let me know whether this feature seems relevant and is not already present in Nutch. I also plan to port it to Nutch 1.x.


            People

            • Assignee: Kaidul Islam (kaidul)
            • Reporter: Kaidul Islam (kaidul)


              Time Tracking

              • Estimated: 0.05h
              • Remaining: 0.05h
              • Logged: Not Specified
