Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: nutchgora
    • Fix Version/s: nutchgora
    • Component/s: None
    • Labels: None

      Description

      This issue is for discussing a REST-style API for accessing Nutch.

      Here's an initial idea:

      • I propose using org.restlet to handle requests and return JSON/XML/whatever responses.
      • Hook up all regular tools so that they can be driven via this API. This would have to be an async API, since all Nutch operations take a long time to execute. It follows that we also need to be able to list running operations, retrieve their current status, and possibly abort/cancel/stop/suspend/resume them. This also means that we would potentially have to create and manage many threads in a servlet, which AFAIK is frowned upon by J2EE purists.
      • Package this in a webapp (including all deps, essentially the nutch.job content), with the restlet servlet as the entry point.
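
The async requirement above (start an operation, return immediately, then list, poll, or abort it) can be sketched with the JDK's own executor classes. This is only an illustration under stated assumptions: the class name AsyncJobSketch, its method names, and the State values are made up for this example and are not taken from the attached patches, and the restlet layer that would expose these calls over HTTP is left out.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.concurrent.Callable;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical sketch of an async job registry; not the API from the patches. */
public class AsyncJobSketch {
    public enum State { RUNNING, FINISHED, FAILED, CANCELLED }

    private final ExecutorService pool = Executors.newCachedThreadPool();
    private final Map<String, Future<?>> jobs = new ConcurrentHashMap<>();
    private final AtomicInteger nextId = new AtomicInteger();

    /** Starts the task on a background thread and returns its id right away. */
    public String submit(Callable<?> task) {
        String id = "job-" + nextId.incrementAndGet();
        jobs.put(id, pool.submit(task));
        return id;
    }

    /** Reports the current state of a job without blocking on unfinished work. */
    public State status(String id) {
        Future<?> f = jobs.get(id);
        if (f == null) throw new IllegalArgumentException("unknown job: " + id);
        if (!f.isDone()) return State.RUNNING;
        if (f.isCancelled()) return State.CANCELLED;
        try {
            f.get(); // returns at once because the future is already done
            return State.FINISHED;
        } catch (ExecutionException e) {
            return State.FAILED;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return State.RUNNING;
        }
    }

    /** Lists the ids of all jobs submitted so far, in sorted order. */
    public Set<String> list() {
        return new TreeSet<>(jobs.keySet());
    }

    /** Requests cancellation; returns false if the job is unknown or already done. */
    public boolean abort(String id) {
        Future<?> f = jobs.get(id);
        return f != null && f.cancel(true);
    }

    public void shutdown() {
        pool.shutdownNow();
    }

    public static void main(String[] args) {
        AsyncJobSketch manager = new AsyncJobSketch();
        CountDownLatch gate = new CountDownLatch(1); // never released: simulates a long crawl step
        String id = manager.submit(() -> { gate.await(); return null; });
        System.out.println(id + " " + manager.status(id));
        manager.abort(id);
        System.out.println(id + " " + manager.status(id));
        manager.shutdown();
    }
}
```

Suspend and resume are deliberately absent: a plain Future offers only cancellation, so pausing would need cooperation from the tool itself.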

      Open issues:

      • How should reading of crawl results be implemented via this API?
      • Should we manage only crawls that use a single configuration per webapp, or should we have a notion of crawl contexts (sets of crawl configs) with CRUD ops on them? This would be nice, because it would allow managing several different crawls, with different configs, in a single webapp, but it complicates the implementation a lot.
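
To make the second open issue more concrete, here is a minimal in-memory sketch of what CRUD operations on named crawl contexts could look like. Everything in it is an assumption for illustration: the class name CrawlContextSketch, the method names, and the choice of a plain string-to-string property map are not taken from the patches or from Nutch's Configuration class.

```java
import java.util.Collections;
import java.util.Map;
import java.util.NoSuchElementException;
import java.util.Set;
import java.util.TreeSet;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical registry of crawl contexts, each a named set of config properties. */
public class CrawlContextSketch {
    private final Map<String, Map<String, String>> contexts = new ConcurrentHashMap<>();

    /** Create: registers a new context; fails if the id is already taken. */
    public void create(String id, Map<String, String> props) {
        if (contexts.putIfAbsent(id, new ConcurrentHashMap<>(props)) != null) {
            throw new IllegalArgumentException("context already exists: " + id);
        }
    }

    /** Read: returns a read-only view of one context's properties. */
    public Map<String, String> read(String id) {
        return Collections.unmodifiableMap(require(id));
    }

    /** Update: sets a single property on an existing context. */
    public void update(String id, String key, String value) {
        require(id).put(key, value);
    }

    /** Delete: removes a context; returns false if it did not exist. */
    public boolean delete(String id) {
        return contexts.remove(id) != null;
    }

    /** Lists all context ids in sorted order. */
    public Set<String> list() {
        return new TreeSet<>(contexts.keySet());
    }

    private Map<String, String> require(String id) {
        Map<String, String> props = contexts.get(id);
        if (props == null) throw new NoSuchElementException("no such context: " + id);
        return props;
    }

    public static void main(String[] args) {
        CrawlContextSketch registry = new CrawlContextSketch();
        registry.create("news", Collections.singletonMap("http.agent.name", "sketch"));
        registry.update("news", "fetcher.threads.fetch", "5");
        System.out.println(registry.list() + " " + registry.read("news").size());
    }
}
```

The extra complexity mentioned above shows up as soon as jobs enter the picture: each running job would have to record which context it was started with.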
      1. API-2.patch
        61 kB
        Andrzej Bialecki
      2. API.patch
        40 kB
        Andrzej Bialecki

          Activity

          Andrzej Bialecki added a comment -

          Thanks - this issue is already fixed in NUTCH-932, to be committed soon.

          Alexis added a comment -

          This revision introduced a bug in the nutch inject command. It now throws a NullPointerException.

          Please take a look at:
          http://svn.apache.org/viewvc/nutch/trunk/src/java/org/apache/nutch/crawl/InjectorJob.java?annotate=1028235&pathrev=1028235

          Make sure the first element in the array is not null:

          Index: src/java/org/apache/nutch/crawl/InjectorJob.java
          ===================================================================
          --- src/java/org/apache/nutch/crawl/InjectorJob.java    (revision 1031881)
          +++ src/java/org/apache/nutch/crawl/InjectorJob.java    (working copy)
          @@ -242,6 +242,7 @@
               job.setReducerClass(Reducer.class);
               job.setNumReduceTasks(0);
               job.waitForCompletion(true);
          +    jobs[0] = job;
          
               job = new NutchJob(getConf(), "inject-p2 " + args[0]);
               StorageUtils.initMapperJob(job, FIELDS, String.class,
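
The code that later reads the jobs array is not shown in this issue, so the following is only a guess at the failure mode, written with stand-in types: if a two-slot array is filled only for the second phase, any loop that touches every slot hits the empty first slot and throws. FakeJob and describe are hypothetical names, not Hadoop or Nutch classes.

```java
/** Hypothetical illustration of an unassigned array slot causing a NullPointerException. */
public class NullSlotSketch {
    /** Stand-in for a job handle; not the Hadoop Job or NutchJob class. */
    public static final class FakeJob {
        final String name;
        public FakeJob(String name) { this.name = name; }
    }

    /** Visits every slot, the way a status or cleanup loop might. */
    public static String describe(FakeJob[] jobs) {
        StringBuilder sb = new StringBuilder();
        for (FakeJob job : jobs) {
            sb.append(job.name).append(' '); // throws NullPointerException on an empty slot
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        FakeJob[] jobs = new FakeJob[2];
        jobs[1] = new FakeJob("inject-p2"); // only the second phase is recorded
        try {
            describe(jobs);
        } catch (NullPointerException e) {
            System.out.println("slot 0 was never assigned");
        }
        jobs[0] = new FakeJob("inject-p1"); // the analogue of the one-line fix in the diff
        System.out.println(describe(jobs));
    }
}
```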
          
          Andrzej Bialecki added a comment -

          Committed in rev. 1028235. The webapp part of this issue is tracked now in NUTCH-929.

          Andrzej Bialecki added a comment -

          The webapp part is tracked now in NUTCH-929.

          Andrzej Bialecki added a comment -

          An improved version, which actually works. Configuration and job management are implemented, and there is also a unit test that exercises this API.

          If there are no objections I'd like to commit this first version of the API, and continue improving it in other issues.

          Andrzej Bialecki added a comment -

          > I think we can combine the approach you outlined in NUTCH-907 with this one.

          I'm not sure... they are really not the same thing: you can execute many crawls with different seed lists, but still using the same Configuration.

          > What is "CLASS" ?

          It's the same as bin/nutch fully.qualified.class.name, only here I require that it implements NutchTool.

          > Btw, Andrzej, I will be happy to help out with the implementation if you want.

          By all means - I didn't have time so far to progress beyond this patch...
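
The CLASS job type described above (load a class by its fully qualified name, but only accept it if it implements the tool interface) can be sketched with plain reflection. The methods of NutchTool are not shown in this issue, so the example uses a made-up stand-in interface, ToolLike, with a single hypothetical run method; EchoTool and load are likewise illustrative names.

```java
/** Hypothetical sketch of loading a tool by class name and checking its interface. */
public class ClassJobSketch {
    /** Stand-in for NutchTool; the real interface's methods are not shown here. */
    public interface ToolLike {
        String run(String arg);
    }

    /** Trivial tool used to exercise the loader. */
    public static class EchoTool implements ToolLike {
        @Override
        public String run(String arg) {
            return "echo:" + arg;
        }
    }

    /** Loads the named class and instantiates it, rejecting classes that are not tools. */
    public static ToolLike load(String className) {
        try {
            Class<?> c = Class.forName(className);
            if (!ToolLike.class.isAssignableFrom(c)) {
                throw new IllegalArgumentException(
                    className + " does not implement " + ToolLike.class.getName());
            }
            return c.asSubclass(ToolLike.class).getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            // Missing class, no accessible no-arg constructor, or a constructor that threw.
            throw new IllegalStateException("cannot load " + className, e);
        }
    }

    public static void main(String[] args) {
        ToolLike tool = load(EchoTool.class.getName());
        System.out.println(tool.run("inject"));
    }
}
```

Checking the interface before instantiating means an arbitrary class named in a request is rejected without its constructor ever running.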

          Doğacan Güney added a comment -

          +1 from me.

          I think we can combine the approach you outlined in NUTCH-907 with this one. Instead of using confId-s to identify
          different confs, we can use different crawl prefixes (or whatever we will call them) to identify different crawl sets (though
          we still need a way to attach different conf-s to different crawl sets).

          I think the API overall looks good. Maybe we can change all the Map<String, Object>s to dedicated classes, though.

          A minor question:

          In JobManager.java:

          + public static enum JobType {INJECT, GENERATE, FETCH, PARSE, UPDATEDB, INDEX, CRAWL, CLASS};

          What is "CLASS" ?

          Btw, Andrzej, I will be happy to help out with the implementation if you want.

          Andrzej Bialecki added a comment -

          Initial patch for discussion. This is a work in progress, so only some functionality is implemented, and even less than that is actually working.

          I would appreciate a review and comments.


            People

            • Assignee:
              Andrzej Bialecki
              Reporter:
              Andrzej Bialecki
            • Votes:
              0
              Watchers:
              3
