Nutch / NUTCH-2389

Precise data parsing using Jsoup CSS selectors

    Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.3
    • Fix Version/s: 2.4
    • Component/s: parser
    • Labels: None

      Description

      As far as I know, Nutch 1.x and 2.x currently have no feature for extracting exact content from specific websites. I've developed a plugin, parse-jsoup, that uses Jsoup to extract precise content for site-specific crawling, driven by a detailed XML configuration (field name, CSS selector, attribute, extraction rules, data type, default value, etc.).

      Please let me know whether this feature seems relevant and is not already present in Nutch. I also plan to port it to Nutch 1.x.
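
      To illustrate the kind of selector-driven extraction described here, the following is a minimal sketch and not the plugin's actual code. It assumes each configured field reduces to a name, a CSS selector, an optional attribute, and a default value. The class `SelectorExtractionSketch`, the `FieldRule` type, and the sample HTML are hypothetical; only the Jsoup calls (`Jsoup.parse`, `select`, `first`, `text`, `attr`) are real library API. In the plugin, the rules would presumably be loaded from the XML configuration rather than built in code.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class SelectorExtractionSketch {

    /** Hypothetical rule: one output field described by a CSS selector. */
    public static final class FieldRule {
        final String name;
        final String cssSelector;
        final String attribute;    // null means "use the element's text"
        final String defaultValue; // used when nothing matches or the value is empty

        public FieldRule(String name, String cssSelector,
                         String attribute, String defaultValue) {
            this.name = name;
            this.cssSelector = cssSelector;
            this.attribute = attribute;
            this.defaultValue = defaultValue;
        }
    }

    /** Applies each rule to the page and returns field name -> extracted value. */
    public static Map<String, String> extract(String html, List<FieldRule> rules) {
        Document doc = Jsoup.parse(html);
        Map<String, String> fields = new LinkedHashMap<>();
        for (FieldRule rule : rules) {
            // first() returns null when the selector matches no element
            Element match = doc.select(rule.cssSelector).first();
            String value = "";
            if (match != null) {
                value = (rule.attribute == null)
                        ? match.text()
                        : match.attr(rule.attribute);
            }
            fields.put(rule.name, value.isEmpty() ? rule.defaultValue : value);
        }
        return fields;
    }

    public static void main(String[] args) {
        // Hypothetical page and rules, for illustration only
        String html = "<html><body><h1 class=\"title\">Sample product</h1>"
                + "<img id=\"main\" src=\"/img/p.png\"></body></html>";
        List<FieldRule> rules = Arrays.asList(
                new FieldRule("title", "h1.title", null, "untitled"),
                new FieldRule("image", "img#main", "src", ""),
                new FieldRule("price", "span.price", null, "n/a"));
        System.out.println(extract(html, rules));
    }
}
```

      Here the `price` rule matches nothing in the sample page, so the field falls back to its default value. Data-type conversion and more elaborate extraction rules are left out of the sketch.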

            People

            • Assignee: Kaidul Islam (kaidul)
            • Reporter: Kaidul Islam (kaidul)
            • Votes: 1
            • Watchers: 5

                Time Tracking

                • Estimated: 0.05h
                • Remaining: 0.05h
                • Logged: Not Specified