Nutch / NUTCH-932

Bulk REST API to retrieve crawl results as JSON

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: nutchgora
    • Fix Version/s: nutchgora
    • Component/s: REST_api
    • Labels:
      None

      Description

      It would be useful to be able to retrieve results of a crawl as JSON. There are a few things that need to be discussed:

      • how to return bulk results using Restlet (WritableRepresentation subclass?)
      • what should be the format of results?

      I think it would make sense to provide retrieval of a single record (by primary key), of all records, and of records within a key range. Incidentally, this matches the capabilities of the Gora Query class well.
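
The three retrieval modes described above can be viewed as one range query with optional bounds. The following is a minimal sketch over an in-memory mapping, not Gora's API: the function name `select`, the inclusive treatment of both bounds, and the plain string ordering of keys are all assumptions made for illustration.

```python
from typing import Dict, Iterator, Optional, Tuple


def select(records: Dict[str, dict],
           start: Optional[str] = None,
           end: Optional[str] = None) -> Iterator[Tuple[str, dict]]:
    """Yield (key, record) pairs in key order, with both bounds inclusive.

    Hypothetical helper: it only models the idea of a key-range query.
    """
    for key in sorted(records):
        if start is not None and key < start:
            continue  # still before the requested range
        if end is not None and key > end:
            break     # past the requested range; keys are sorted, so stop
        yield key, records[key]


# Toy data; real keys would be page URLs (or whatever the store uses).
db = {
    "a": {"n": 1},
    "b": {"n": 2},
    "c": {"n": 3},
}

all_records = list(select(db))                     # no bounds: every record
key_range = list(select(db, start="b", end="c"))   # records within a range
single = list(select(db, start="b", end="b"))      # start == end: one record
```

With this model, single-record retrieval needs no separate code path: it is a range whose start and end keys coincide.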

      1. NUTCH-932-4.patch (88 kB, Andrzej Bialecki)
      2. NUTCH-932-3.patch (80 kB, Andrzej Bialecki)
      3. NUTCH-932-2.patch (66 kB, Andrzej Bialecki)
      4. NUTCH-932.patch (54 kB, Andrzej Bialecki)
      5. NUTCH-932.patch (40 kB, Andrzej Bialecki)
      6. db.formatted.gz (155 kB, Andrzej Bialecki)
      7. NUTCH-932.patch (37 kB, Andrzej Bialecki)

        Activity

        Andrzej Bialecki added a comment -

        Committed in rev. 1039014.

        Andrzej Bialecki added a comment -

        Final version of the patch.

        Andrzej Bialecki added a comment -

        NutchTool is an abstract class in this patch. This actually minimizes the amount of code throughout, though paradoxically the patch file is larger than before...

        Andrzej Bialecki added a comment -

        This patch simplifies the NutchTool API and reduces changes to implementations of NutchTool. I'd like to commit this patch soon.

        Andrzej Bialecki added a comment -

        Updated patch. This changes the NutchTool API to allow for execution steps that are not mapreduce jobs, and to pass arguments in arbitrary order, which was a side-effect of the Restlet API.

        As a proof of concept I reimplemented the Crawler class (a one-shot crawler). If there are no objections I'll commit this shortly.

        Andrzej Bialecki added a comment -

        Examples (with the db equivalent to the one in db.formatted.gz):

        $ curl -s 'http://localhost:8192/nutch/db?fields=url&end=http://www.freebsd.org/&start=http://www.egothor.org/'| ./json_pp
        [
          {
            "url": "http://www.egothor.org/"
          }, 
          {
            "url": "http://www.freebsd.org/"
          }
        ]
        
        $ curl -s 'http://localhost:8192/nutch/db?fields=url,outlinks,markers,protocolStatus,parseStatus,contentType&start=http://www.getopt.org/&end=http://www.getopt.org/'| ./json_pp
        [
          {
            "contentType": "text/html", 
            "url": "http://www.getopt.org/", 
            "markers": {
              "_updmrk_": "1288890451-1134865895"
            }, 
            "parseStatus": "success/ok (1/0), args=[]", 
            "protocolStatus": "SUCCESS, args=[]", 
            "outlinks": {
              "http://www.getopt.org/luke/": "Luke", 
              "http://www.getopt.org/ecimf/contrib/ONTO/REA": "REA Ontology page", 
              "http://www.getopt.org/CV.pdf": "CV here", 
              "http://www.getopt.org/utils/build/api": "API", 
              "http://svn.apache.org/viewvc/hadoop/hbase/trunk/src/java/org/apache/hadoop/hbase/util/JenkinsHash.java": "available here", 
              "http://www.getopt.org/murmur/MurmurHash.java": "MurmurHash.java", 
              "http://www.ebxml.org/": "ebXML / ebTWG", 
              "http://www.freebsd.org/": "FreeBSD", 
              "http://www.getopt.org/luke/webstart.html": "Launch with Java WebStart", 
              "http://www.freebsd.org/%7Epicobsd": "PicoBSD", 
              "http://home.comcast.net/~bretm/hash/6.html": "this discussion", 
              "http://protege.stanford.edu/": "Protege", 
              "http://jakarta.apache.org/lucene": "Lucene", 
              "http://www.getopt.org/ecimf/contrib/ONTO/ebxml": "ebXML Ontology", 
              "http://www.getopt.org/ecimf/": "here", 
              "http://www.isthe.com/chongo/tech/comp/fnv/": "his website", 
              "http://www.getopt.org/stempel/index.html": "Stempel", 
              "http://www.sigram.com/": "SIGRAM", 
              "http://www.egothor.org/": "Egothor", 
              "http://thinlet.sourceforge.net/": "Thinlet", 
              "http://www.getopt.org/utils/dist/utils-1.0.jar": "binary", 
              "http://www.ecimf.org/": "ECIMF"
            }
          }
        ]
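
The same requests can be issued from a script. Below is a minimal Python sketch that builds the query URL from the `fields`, `start` and `end` parameters used in the curl examples and decodes a JSON array like the one returned. The helper names `build_query_url` and `fetch_records` are hypothetical, and it is an assumption that the server accepts percent-encoded parameter values (the curl examples pass them unencoded).

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "http://localhost:8192/nutch/db"  # endpoint taken from the curl examples


def build_query_url(fields, start=None, end=None, base_url=BASE_URL):
    """Build a request URL carrying the fields, start and end parameters."""
    params = {"fields": ",".join(fields)}
    if start is not None:
        params["start"] = start
    if end is not None:
        params["end"] = end
    # urlencode percent-encodes the values (assumed to be accepted by the server).
    return base_url + "?" + urlencode(params)


def fetch_records(url):
    """Fetch the URL and decode the JSON array; needs a running server."""
    with urlopen(url) as response:
        return json.loads(response.read().decode("utf-8"))


url = build_query_url(
    ["url"],
    start="http://www.egothor.org/",
    end="http://www.freebsd.org/",
)

# Decoding a response shaped like the first example's output.
sample = '[{"url": "http://www.egothor.org/"}, {"url": "http://www.freebsd.org/"}]'
urls = [record["url"] for record in json.loads(sample)]
```

Calling `fetch_records(url)` against a live instance should return a list of dictionaries, one per record, with only the requested fields present.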
        
        Andrzej Bialecki added a comment -

        Updated patch. It now recognizes URL parameters such as fields, start/end keys, batch and crawl id.

        Andrzej Bialecki added a comment -

        Example DB content (this was passed through a JSON pretty-printer, otherwise it's just one giant line...).

        Andrzej Bialecki added a comment -

        This patch adds bulk retrieval of crawl results. This is still very rough, e.g. there's no way to select crawlId or limit the fields... but it returns proper JSON.

        This patch also includes other enhancements and bugfixes - with this patch I was able to perform a complete crawl cycle via REST.


          People

          • Assignee:
            Andrzej Bialecki
            Reporter:
            Andrzej Bialecki
          • Votes:
            0
            Watchers:
            0
