  1. Nutch
  2. NUTCH-1286

Refactoring/reimplementing crawling API (NutchApp)

    Details

      Description

      This issue tracks the changes we (Mathijs and I) have planned for the API and webapp in Nutchgora. We have a pretty good idea of how we want to use the crawl API. It may involve some major refactoring, or perhaps a side implementation next to the current NutchApp functionality, depending on how much of the existing components we can reuse. The bottom line is that there will be a strictly defined Java API that provides everything from crawling/indexing to job control (listing jobs, tracking progress and aborting jobs being part of it). There will be no server or service for tracking crawl state; everything will be persisted one way or another and be queryable from the API. The REST server will be a very thin layer on top of the Java implementation. A rich web interface will also be an easy layer to add once we have a cleanly (but extensively) defined API. But we will start by making the API usable from a simple command-line interface.

      More details will be provided later on. Feel free to comment if you have suggestions/questions.
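
      To make the job-control part of the description concrete, here is a minimal sketch of what such an API could look like. Every name in it (CrawlApiSketch, JobControl, JobStatus, InMemoryJobControl) is hypothetical and does not come from Nutch. A plain map stands in for the persistent store, so the sketch only illustrates the shape of "no tracking server, all state read back from storage", not a real implementation.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: none of these types exist in Nutch.
public class CrawlApiSketch {

    enum JobState { RUNNING, FINISHED, ABORTED }

    /** Immutable snapshot of a job, as it would be read back from storage. */
    static final class JobStatus {
        final String id;
        final String type;
        final JobState state;
        final double progress; // 0.0 .. 1.0

        JobStatus(String id, String type, JobState state, double progress) {
            this.id = id;
            this.type = type;
            this.state = state;
            this.progress = progress;
        }

        @Override
        public String toString() {
            return id + " " + type + " " + state + " " + Math.round(progress * 100) + "%";
        }
    }

    /** Job control as named in the description: list, track progress, abort. */
    interface JobControl {
        String submit(String type);
        List<JobStatus> listJobs();
        JobStatus status(String id);
        void reportProgress(String id, double progress);
        void abort(String id);
    }

    /** Assumption: a map plays the role of the persistent backend. */
    static class InMemoryJobControl implements JobControl {
        private final Map<String, JobStatus> store = new LinkedHashMap<>();
        private int nextId = 1;

        public String submit(String type) {
            String id = "job-" + nextId++;
            store.put(id, new JobStatus(id, type, JobState.RUNNING, 0.0));
            return id;
        }

        public List<JobStatus> listJobs() {
            return new ArrayList<>(store.values());
        }

        public JobStatus status(String id) {
            JobStatus s = store.get(id);
            if (s == null) {
                throw new IllegalArgumentException("Unknown job: " + id);
            }
            return s;
        }

        public void reportProgress(String id, double progress) {
            JobStatus s = status(id);
            if (s.state != JobState.RUNNING) {
                return; // finished or aborted jobs no longer change
            }
            JobState next = progress >= 1.0 ? JobState.FINISHED : JobState.RUNNING;
            store.put(id, new JobStatus(id, s.type, next, Math.min(progress, 1.0)));
        }

        public void abort(String id) {
            JobStatus s = status(id);
            if (s.state == JobState.RUNNING) {
                store.put(id, new JobStatus(id, s.type, JobState.ABORTED, s.progress));
            }
        }
    }

    public static void main(String[] args) {
        JobControl jobs = new InMemoryJobControl();
        jobs.submit("inject");
        String fetch = jobs.submit("fetch");
        jobs.reportProgress(fetch, 0.4);
        jobs.abort(fetch);
        for (JobStatus s : jobs.listJobs()) {
            System.out.println(s);
        }
    }
}
```

      Because every caller only sees snapshots read from the store, a command-line tool, a REST layer and a web interface could all sit on the same interface without a long-running tracking service in between.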

          Activity

          ferdy.g Ferdy Galema added a comment - Useful wiki http://wiki.apache.org/nutch/NutchAdministrationUserInterface
          lewismc Lewis John McGibbney added a comment -

          For reference, a brief description from Marko regarding the UI that was designed here [0].

          + a new extension point that describes a ui component
          + a ui component is a plugin that uses backend classes from nutch to provide functionality (e.g. inject, fetch, configuration or whatever)
          + a ui component can be deployed to a webserver as a new webapp
          + an application that starts a webserver (e.g. jetty) and deploys all implemented ui components to the webserver

          the goal was to use the plugin api to develop ui components separately that can be deployed to the webserver as a new context.

          + every ui component can have more than one instance
          + with this approach we were able to create different types of crawls (e.g. fast crawl, long running crawl ...)
          + every type has one instance of a ui component

          + an important ui component we implemented was a component to configure the Configuration object
          + with that you can configure your crawl instance with different plugins or different configurations for a fetcher or whatever

          our ui components were directly using the nutch backend.

          It would be nice to compile a diff list describing changes between implementations.

          [0] https://github.com/101tec/nutch
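
          As a rough illustration of the design Marko describes, the sketch below models a UI component as an interface and deploys one instance per crawl type under its own context path. All names (UiComponent, FetchComponent, WebServerStub, the "threads" key) are hypothetical; it uses neither the real Nutch plugin API nor Jetty, and a map stands in for the webserver.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: not the Nutch plugin API and not the 101tec code.
public class UiComponentSketch {

    /** Stand-in for the extension point: one UI component wrapping backend functionality. */
    interface UiComponent {
        String name(); // e.g. "inject", "fetch", "configuration"
        String handle(Map<String, String> crawlConfig);
    }

    /** Example component; "threads" is an illustrative key, not a Nutch property. */
    static class FetchComponent implements UiComponent {
        public String name() {
            return "fetch";
        }

        public String handle(Map<String, String> crawlConfig) {
            return "fetch with threads=" + crawlConfig.getOrDefault("threads", "10");
        }
    }

    /** Stand-in for the webserver: maps context paths to deployed component instances. */
    static class WebServerStub {
        private final Map<String, UiComponent> contexts = new LinkedHashMap<>();
        private final Map<String, Map<String, String>> configs = new LinkedHashMap<>();

        /** Deploys one component instance for one crawl type and returns its context path. */
        String deploy(String crawlType, Map<String, String> config, UiComponent component) {
            String path = "/" + crawlType + "/" + component.name();
            if (contexts.containsKey(path)) {
                throw new IllegalStateException("Already deployed: " + path);
            }
            contexts.put(path, component);
            configs.put(path, config);
            return path;
        }

        /** Routes a request to the component deployed at the given context path. */
        String request(String path) {
            UiComponent component = contexts.get(path);
            if (component == null) {
                throw new IllegalArgumentException("No context at " + path);
            }
            return component.handle(configs.get(path));
        }
    }

    public static void main(String[] args) {
        WebServerStub server = new WebServerStub();
        // Two crawl types, each with its own instance and configuration.
        String fast = server.deploy("fast-crawl", Map.of("threads", "50"), new FetchComponent());
        String slow = server.deploy("long-crawl", Map.of(), new FetchComponent());
        System.out.println(fast + " -> " + server.request(fast));
        System.out.println(slow + " -> " + server.request(slow));
    }
}
```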

          ferdy.g Ferdy Galema added a comment -

          Thanks for updating the list.

          As a side note, I am almost finished with a command-line interface to the new API. I will post it here (including example usage) when it is done. For now it is a separate implementation with almost all code in a new package, so that it doesn't break existing parts. This way it can also be easily compared to the existing API.

          ferdy.g Ferdy Galema added a comment -

          Hmm, I wasn't aware of the existing Jira issues about a new webapp. I just added NUTCH-841 as "related to".

          However, since I'm implementing this issue as non-intrusively as possible, it should not collide with any existing attempts to build a new webapp.

          lewismc Lewis John McGibbney added a comment -

          Resolving as won't fix as Ferdy Galema is not around and we have no context.


            People

            • Assignee: Unassigned
            • Reporter: ferdy.g Ferdy Galema
            • Votes: 0
            • Watchers: 2
