Nutch / NUTCH-880

REST API for Nutch

Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: nutchgora
    • Fix Version/s: nutchgora
    • Component/s: None
    • Labels: None

    Description

      This issue is for discussing a REST-style API for accessing Nutch.

      Here's an initial idea:

      • I propose using org.restlet to handle requests and return JSON/XML/whatever responses.
      • Hook up all the regular tools so that they can be driven via this API. This would have to be an async API, since all Nutch operations take a long time to execute. It follows that we also need to be able to list running operations, retrieve their current status, and possibly abort/cancel/stop/suspend/resume them. This also means that we would potentially have to create and manage many threads in a servlet, which AFAIK is frowned upon by J2EE purists.
      • Package this in a webapp (including all dependencies, essentially the nutch.job content), with the Restlet servlet as the entry point.
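
      To make the async part of this idea concrete, here is a minimal sketch of the job bookkeeping such an API would need: submit a long-running operation, list jobs, query status, and cancel. It uses only java.util.concurrent and deliberately leaves out the Restlet layer. The class name JobManager, the State enum, and the job id format are all hypothetical and are not taken from the attached patches.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.Callable;
import java.util.concurrent.CancellationException;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch: tracks long-running operations so that request
// handlers can return immediately and let clients poll for status.
class JobManager {
    // RUNNING also covers jobs that are queued but not yet started.
    enum State { RUNNING, FINISHED, FAILED, CANCELLED }

    // Daemon threads so an idle pool never blocks JVM shutdown.
    private final ExecutorService pool = Executors.newFixedThreadPool(4, r -> {
        Thread t = new Thread(r);
        t.setDaemon(true);
        return t;
    });
    private final Map<String, Future<?>> jobs = new ConcurrentHashMap<>();
    private final AtomicLong nextId = new AtomicLong();

    // Starts the work asynchronously and returns a job id at once.
    String submit(String tool, Callable<?> work) {
        String id = tool + "-" + nextId.incrementAndGet();
        jobs.put(id, pool.submit(work));
        return id;
    }

    State status(String id) {
        Future<?> f = jobs.get(id);
        if (f == null) {
            throw new IllegalArgumentException("unknown job: " + id);
        }
        if (!f.isDone()) {
            return State.RUNNING;
        }
        try {
            f.get(); // does not block: the future is already done
            return State.FINISHED;
        } catch (CancellationException e) {
            return State.CANCELLED;
        } catch (ExecutionException e) {
            return State.FAILED;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    // Snapshot of all known jobs, sorted by id.
    Map<String, State> list() {
        Map<String, State> out = new TreeMap<>();
        for (String id : jobs.keySet()) {
            out.put(id, status(id));
        }
        return out;
    }

    // Interrupts the job if it is running; returns false if it had already completed.
    boolean cancel(String id) {
        Future<?> f = jobs.get(id);
        return f != null && f.cancel(true);
    }

    void shutdown() {
        pool.shutdownNow();
    }

    public static void main(String[] args) {
        JobManager manager = new JobManager();
        // Stand-in for a real Nutch tool invocation.
        String id = manager.submit("fetch", () -> { Thread.sleep(60_000); return null; });
        System.out.println(id + " " + manager.status(id)); // fetch-1 RUNNING
        manager.cancel(id);
        System.out.println(id + " " + manager.status(id)); // fetch-1 CANCELLED
        manager.shutdown();
    }
}
```

      Keeping the threads in a dedicated pool behind one object at least confines the thread management that the J2EE concern above is about. Suspend/resume is not covered here, since a plain Future only supports cancellation.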

      Open issues:

      • How to implement reading of crawl results via this API.
      • Should we manage only crawls that use a single configuration per webapp, or should we have a notion of crawl contexts (sets of crawl configs) with CRUD operations on them? The latter would be nice because it would allow managing several different crawls, with different configs, in a single webapp, but it complicates the implementation a lot.
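
      As an illustration of the second open issue, the following is a rough sketch of what CRUD operations over crawl contexts could look like if each context is treated as a named set of configuration properties. The class name CrawlContextStore, the context id, and the in-memory storage are assumptions made for this sketch only. The property names in the usage example are illustrative.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;
import java.util.Set;
import java.util.TreeSet;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: an in-memory registry of crawl contexts, where a
// context is a named, immutable snapshot of configuration properties.
class CrawlContextStore {
    private final Map<String, Map<String, String>> contexts = new ConcurrentHashMap<>();

    private static Map<String, String> snapshot(Map<String, String> config) {
        return Collections.unmodifiableMap(new HashMap<>(config));
    }

    // Create: fails (returns false) if the id is already taken.
    boolean create(String id, Map<String, String> config) {
        return contexts.putIfAbsent(id, snapshot(config)) == null;
    }

    // Read: empty if no such context exists.
    Optional<Map<String, String>> read(String id) {
        return Optional.ofNullable(contexts.get(id));
    }

    // Update: replaces the whole config; fails if the context does not exist.
    boolean update(String id, Map<String, String> config) {
        return contexts.replace(id, snapshot(config)) != null;
    }

    // Delete: returns false if there was nothing to remove.
    boolean delete(String id) {
        return contexts.remove(id) != null;
    }

    Set<String> list() {
        return new TreeSet<>(contexts.keySet());
    }

    public static void main(String[] args) {
        CrawlContextStore store = new CrawlContextStore();
        Map<String, String> config = new HashMap<>();
        config.put("http.agent.name", "example-crawler");
        config.put("fetcher.threads.fetch", "10");
        store.create("news", config);
        System.out.println(store.list()); // [news]
    }
}
```

      Each running operation would then need to carry the id of the context it was started with, which is where most of the extra implementation complexity mentioned above would come from.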

      Attachments

        1. API-2.patch
          61 kB
          Andrzej Bialecki
        2. API.patch
          40 kB
          Andrzej Bialecki

            People

              Assignee: Andrzej Bialecki
              Reporter: Andrzej Bialecki
              Votes: 0
              Watchers: 3
