  1. Nutch
  2. NUTCH-2389

Precise data parsing using Jsoup CSS selectors

Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.3
    • Fix Version/s: 2.4
    • Component/s: parser
    • Labels: None

    Description

      As far as I know, neither Nutch 1.x nor 2.x currently has a feature for extracting exact content from specific websites. For my current project I have developed a plugin, parse-jsoup, that uses Jsoup to extract precise content during site-specific crawling, driven by a detailed XML configuration (field name, CSS selector, attribute, extraction rules, data type, default value, etc.).
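
      A minimal sketch of the kind of selector-driven extraction described above. It is not the plugin's code: the class `SelectorExtractor`, the `FieldRule` type, and the sample HTML are hypothetical, and the rules are built in Java rather than loaded from XML. Only the Jsoup calls (`Jsoup.parse`, `Document.select`, `Elements.first`, `Element.text`, `Element.attr`) are real API, and Jsoup is assumed to be on the classpath.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical illustration only; these are not the parse-jsoup plugin's classes.
class SelectorExtractor {

    /** One extraction rule: field name, CSS selector, optional attribute, default value. */
    static final class FieldRule {
        final String name;
        final String cssSelector;
        final String attribute;    // null means "use the element's text"
        final String defaultValue; // used when nothing matches or the value is empty

        FieldRule(String name, String cssSelector, String attribute, String defaultValue) {
            this.name = name;
            this.cssSelector = cssSelector;
            this.attribute = attribute;
            this.defaultValue = defaultValue;
        }
    }

    /** Applies each rule to the parsed page and returns field name -> extracted value. */
    static Map<String, String> extract(String html, List<FieldRule> rules) {
        Document doc = Jsoup.parse(html);
        Map<String, String> fields = new LinkedHashMap<>();
        for (FieldRule rule : rules) {
            // first() returns null when the selector matches nothing
            Element match = doc.select(rule.cssSelector).first();
            String value = null;
            if (match != null) {
                value = (rule.attribute == null) ? match.text() : match.attr(rule.attribute);
            }
            boolean missing = (value == null || value.isEmpty());
            fields.put(rule.name, missing ? rule.defaultValue : value);
        }
        return fields;
    }

    public static void main(String[] args) {
        String html = "<html><body><h1 class=\"title\">Blue Widget</h1>"
                + "<img id=\"main\" src=\"/img/widget.png\"></body></html>";
        List<FieldRule> rules = Arrays.asList(
                new FieldRule("title", "h1.title", null, ""),
                new FieldRule("image", "img#main", "src", ""),
                new FieldRule("price", "span.price", null, "N/A")); // no match: default applies
        System.out.println(extract(html, rules));
    }
}
```

      In this sketch the `price` rule matches nothing on the sample page, so its default value is returned. Data-type conversion and richer extraction rules, which the configuration also covers, are left out here.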

      Please let me know whether this feature seems relevant and is indeed missing from Nutch. I also plan to port it to Nutch 1.x.

          People

            Assignee: Kaidul Islam (kaidul)
            Reporter: Kaidul Islam (kaidul)
            Votes: 1
            Watchers: 5

              Time Tracking

                Original Estimate: 0.05h
                Remaining Estimate: 0.05h
                Time Spent: Not Specified