
Key: NUTCH-251
Type: Improvement
Status: Open
Priority: Minor
Assignee: Unassigned
Reporter: Stefan Groschupf
Votes: 10
Watchers: 6
Nutch

Administration GUI

Created: 22/Apr/06 04:43 AM   Updated: 15/Oct/09 10:44 AM
Component/s: None
Affects Version/s: 0.8
Fix Version/s: 1.1

Time Tracking:
Not Specified

File Attachments (all licensed for inclusion in ASF works):
hadoop_nutch_gui_v1.patch (Text File, 13 kB), 2006-04-22 05:00 AM, Stefan Groschupf
Nutch-251-AdminGUI.tar.gz (GZip Archive, 3.96 MB), 2006-11-23 02:33 PM, Enis Soztutar
nutch_gui_plugins_v1.zip (Zip Archive, 474 kB), 2006-04-22 05:00 AM, Stefan Groschupf
nutch_gui_v1.patch (Text File, 32 kB), 2006-04-22 05:00 AM, Stefan Groschupf


Description
A web-based administration interface would make Nutch administration and management much more user friendly.

Stefan Groschupf added a comment - 22/Apr/06 05:00 AM
This is an early preview patch of the Nutch GUI.
There are known issues; however, it is a starting point from which we can continue building a solid administration user interface.

This patch introduces the following functionality:

+ web-based administration GUI via an embedded web container
+ the GUI is fully based on the plugin system, so it is customizable and extendable using plugins
+ all plugins can be internationalized
+ introduces the concept of Nutch instances, a mechanism for having separately configurable Nutch deployments that use the same code base (e.g. intranet search, webpage search)
+ pluggable authentication; currently it comes with a default user-password tuple based on the configuration, but LDAP integration, for example, can be easily realized
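
To illustrate the pluggable authentication idea, here is a minimal sketch of how a configuration-backed default could sit behind a small interface. The names `Authenticator`, `ConfigAuthenticator`, and the property keys `admin.gui.user` / `admin.gui.password` are assumptions made for this sketch and are not taken from the patch; only the "admin"/"admin" default comes from this issue.

```java
import java.util.Properties;

/** Sketch only: all type and property names are hypothetical, not from the patch. */
public class AuthSketch {

    /** Extension point: any login backend (config, LDAP, ...) would implement this. */
    public interface Authenticator {
        boolean authenticate(String user, String password);
    }

    /** Default backend: a single user/password pair read from configuration. */
    public static final class ConfigAuthenticator implements Authenticator {
        private final String user;
        private final String password;

        public ConfigAuthenticator(Properties conf) {
            // Property keys are made up for this sketch; both values default to "admin".
            this.user = conf.getProperty("admin.gui.user", "admin");
            this.password = conf.getProperty("admin.gui.password", "admin");
        }

        @Override
        public boolean authenticate(String user, String password) {
            // equals(null) is false, so null input is rejected without extra checks.
            return this.user.equals(user) && this.password.equals(password);
        }
    }

    public static void main(String[] args) {
        Authenticator auth = new ConfigAuthenticator(new Properties());
        System.out.println(auth.authenticate("admin", "admin")); // prints true
        System.out.println(auth.authenticate("admin", "wrong")); // prints false
    }
}
```

An LDAP-backed implementation would then only need to provide another `Authenticator`, leaving the login page untouched.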

The patch comes with the following plugins:
+ admin-listing
++ required by the web UI to show all deployed plugins as tabs on a webpage

+ admin-instance
++ lists all instances and allows creating a new instance

+ admin-configuration
++ configures a Nutch instance (the configuration is written to disk as nutch-site.xml)

+ admin-inject
++ injects URLs into a crawlDb

+ admin-system
++ shows the status of the system

+ admin-job
++ shows the status of jobs

+ admin-crawldb-status
++ shows crawldb entries filtered by status, or shows the status of a given URL (useful for checking whether a page was already fetched)

+ admin-management
++ generate segment
++ fetch segment
++ parse segment (if required)
++ update crawldb
++ invert links
++ index segment
++ delete segment, parse, index etc.

+ admin-scheduling
++ Quartz-based cron job management to run a time-driven "generate - fetch - updatedb - invertlinks - index" job
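
As a rough illustration of the time-driven job described for admin-scheduling, the sketch below runs the five steps in the listed order on a timer. It deliberately swaps Quartz for the JDK's `ScheduledExecutorService` so that it stays self-contained; the class name, the `runCycle` helper, and the rule of stopping at the first failing step are assumptions for this sketch, not behavior confirmed by the patch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Predicate;

/** Sketch only: uses the JDK scheduler as a stand-in for Quartz. */
public class CrawlCycleSketch {

    /** Step order as listed for the admin-scheduling plugin. */
    public static final List<String> STEPS =
        Arrays.asList("generate", "fetch", "updatedb", "invertlinks", "index");

    /**
     * Runs one cycle. The predicate stands in for the real work of a step and
     * returns false on failure; the cycle stops at the first failing step
     * (an assumption of this sketch). Returns the steps that completed.
     */
    public static List<String> runCycle(Predicate<String> step) {
        List<String> done = new ArrayList<>();
        for (String name : STEPS) {
            if (!step.test(name)) {
                break;
            }
            done.add(name);
        }
        return done;
    }

    public static void main(String[] args) throws InterruptedException {
        ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
        // Every step "succeeds" here; a real job would launch the matching Nutch tool.
        timer.scheduleAtFixedRate(
            () -> System.out.println(runCycle(name -> true)), 0, 1, TimeUnit.SECONDS);
        Thread.sleep(2500);
        timer.shutdownNow();
    }
}
```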

Known issues
+ requires Hadoop changes
+ locally running jobs cannot be stopped, but distributed jobs can
+ unlike in Nutch 0.7, the index searcher does not use index folders inside segment folders, but the GUI places the index folder in the segment folder
++ as a result, the searcher is unable to find indices
+ "put to search" does not work because the searcher does not support adding index folders dynamically
+ the linkdb inverter does not update a linkdb but overwrites it; this is a general Nutch bug, but it affects the GUI as well
+ the Nutch GUI introduces locking by storing lock files in folders; this mechanism is ignored by the Nutch command line tools
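
The lock-file issue can be made concrete with a small sketch of folder-level locking. This is an assumption about how such a mechanism typically works, not the patch's code: the file name `.lock` and the class `FolderLockSketch` are hypothetical. The lock is purely advisory, which is why a tool that never checks for the file (such as the command line tools mentioned above) is not stopped by it.

```java
import java.io.IOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Sketch only: advisory folder lock based on a marker file (names are hypothetical). */
public class FolderLockSketch {

    private static final String LOCK_NAME = ".lock"; // made-up file name

    /** Tries to take the lock for a folder; returns false if it is already held. */
    public static boolean tryLock(Path folder) throws IOException {
        try {
            // createFile checks for existence and creates the file in one atomic step.
            Files.createFile(folder.resolve(LOCK_NAME));
            return true;
        } catch (FileAlreadyExistsException e) {
            return false;
        }
    }

    /** Releases the lock; does nothing if the folder was not locked. */
    public static void unlock(Path folder) throws IOException {
        Files.deleteIfExists(folder.resolve(LOCK_NAME));
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("crawldb");
        System.out.println(tryLock(dir)); // prints true
        System.out.println(tryLock(dir)); // prints false, the lock is already held
        unlock(dir);
    }
}
```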

It would be great if users could test the GUI, report bugs, and help to improve the patch.
This is a very complex patch and it is difficult to stay in sync with the latest changes, so if we missed something
while generating this patch and it does not work as expected, please don't blame us, but give us some time and hints to fix the problems.

Help is welcome with the following tasks:
+ fixing language issues in Javadoc, API and bundle files
+ translating bundles into more languages (currently it comes with English and German bundles)
+ testing heavily, finding bugs and providing fixes
+ writing help texts and documentation

How to:

+ checkout latest nutch sources

+ checkout hadoop sources
+ patch hadoop with the hadoop patch
+ build hadoop jar
+ remove old hadoop jar from nutch/lib
+ place new hadoop jar in nutch/lib

+ uncompress plugin zip file
+ place plugins in nutch/src/plugins (patch not possible since svn does not support binary patches)
+ patch nutch with nutch patch
+ start the GUI with bin/nutch gui <folderWhereYourInstanceDataWillBeStored>
+ point your browser to: http://localhost:50060/general/
+ username and password are both "admin" (can be changed in nutch-default.xml)
+ select the "default" instance or create a new instance.

Thanks to everybody who helped to get this implemented and ran the first beta tests, and especially to Marko for hacking all the JSPs!
I suggest adding this patch to a Nutch 0.9 branch and adding a GUI component in JIRA to go from there.
I really hope I didn't miss anything or upload the wrong files now. :-O


Zaheed Haque added a comment - 22/Apr/06 07:07 PM
Great job, Stefan and Marko! I just tried it and it looks very good. I am currently running 3 search sites with 0.8-dev. I can do most things using automated shell scripts, but this patch will allow me to run things from home, since some of my clients don't allow ssh access. So I hope this patch gets included; I have just voted for it.

Thanks again Stefan and Marko for your hard work!

Cheers


Thomas Delnoij added a comment - 15/May/06 06:08 PM
I am getting compilation errors when I apply the patches and recompile Nutch. I think the Xalan Jars are missing. After I downloaded them, the admin plugins compile without errors. BTW: The jars are also missing from the binary distribution.

nutch.newbie added a comment - 20/Nov/06 09:13 PM
Some random thoughts...

I am a strong supporter of XML. Can we not re-think about this like SOLR-58 or plain/jsp like the way hadoop does it?

http://issues.apache.org/jira/browse/SOLR-58

Do we really need to use the Nutch plugin architecture? The patch is currently outdated, so I think it would be a good idea to give it another round of discussion.


Sami Siren added a comment - 21/Nov/06 05:13 AM
>I am a strong supporter of XML. Can we not re-think about this like SOLR-58 or plain/jsp like the way hadoop does it?

I would say neither of those. We should concentrate on building a good Java admin API. Everything after that is an implementation detail, as the API can then easily be exposed as XML or something else remotely usable. By doing it this way, the admin functionality can easily be integrated into various places and technologies.

Some kind of extension mechanism needs to be used because Nutch is extendable in general (you could plug additions into the admin GUI as you plug functionality into Nutch). IMO that is not the first priority. I would propose to put in the basic functionality for configuring, scheduling and generally managing crawls first, then add more functionality on top of that.


nutch.newbie added a comment - 21/Nov/06 08:49 PM
Are you thinking of something like UI extension point like in contrib/web2 ? I completely agree with you in terms of a solid admin API.

Enis Soztutar added a comment - 23/Nov/06 02:33 PM
I have updated the patch written by Stefan.
This version works with Nutch-0.9-dev and hadoop-0.7.1 (the current version used by Nutch so far).

First extract the tar.gz file into the root of Nutch. It should copy:
src/plugin/admin-*
lib/xalan.jar lib/serializer.jar and lib/hadoop-0.7.2-dev.jar
hadoop_0.7.1_nutch_gui_v2.patch
nutch_0.9-dev_gui_v2.patch

Then patch Nutch with
patch -p0 <nutch_0.9-dev_gui_v2.patch
(you can test the patch first by running: patch -p0 --dry-run <nutch_0.9-dev_gui_v2.patch)

A patched Hadoop is included in the archive, but if you wish you can patch Hadoop yourself using
patch -p0 <hadoop_0.7.1_nutch_gui_v2.patch

I have:
converted the necessary java.io.File fields and arguments to org.apache.hadoop.fs.Path
replaced the deprecated LogFormatter usages with LogFactory
used generics with collections (changed only where I noticed them)
written PathSerializable, which implements the Serializable interface (needed for scheduling)
made some Hadoop changes and some changes due to Hadoop conflicts
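
For readers wondering what a serializable path wrapper might look like, here is a minimal sketch of one common approach: store the path as a string during serialization and rebuild it on read. It is not the PathSerializable class from the archive. `FsPath` is a made-up stand-in for a non-serializable path type, used so the example compiles without Hadoop on the classpath, and `SerializablePath` and `roundTrip` are likewise hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

/** Sketch only: shows the wrapper technique, not the class shipped in the archive. */
public class PathSerializableSketch {

    /** Stand-in for a path type that does not implement Serializable (hypothetical). */
    public static final class FsPath {
        private final String value;
        public FsPath(String value) { this.value = value; }
        @Override public String toString() { return value; }
    }

    /** Serializable wrapper: writes the path as a string and rebuilds it on read. */
    public static final class SerializablePath implements Serializable {
        private static final long serialVersionUID = 1L;
        private transient FsPath path; // transient: skipped by default serialization

        public SerializablePath(FsPath path) { this.path = path; }
        public FsPath get() { return path; }

        private void writeObject(ObjectOutputStream out) throws IOException {
            out.defaultWriteObject();
            out.writeUTF(path.toString());
        }

        private void readObject(ObjectInputStream in)
                throws IOException, ClassNotFoundException {
            in.defaultReadObject();
            path = new FsPath(in.readUTF());
        }
    }

    /** Round-trips a wrapper through Java serialization and returns the copy. */
    public static SerializablePath roundTrip(SerializablePath p)
            throws IOException, ClassNotFoundException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(p);
        }
        try (ObjectInputStream in =
                 new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
            return (SerializablePath) in.readObject();
        }
    }
}
```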

I have not tested every feature of this plugin, so there may still be some bugs.


Sami Siren added a comment - 23/Nov/06 08:08 PM
> Are you thinking of something like UI extension point like in contrib/web2 ?
Not necessarily; that was also a quick hack I put together. It does, however, allow you to plug in new functionality or layout via a plugin (from inside a jar). But I guess Stefan has also implemented something like that in his patch.

Marc Brette added a comment - 05/Sep/07 04:33 PM
Is there any new version of the patch, or any plan to put it in the nutch repository ?

Doğacan Güney added a comment - 05/Sep/07 06:08 PM
> Is there any new version of the patch, or any plan to put it in the nutch repository ?

This issue has not been updated for quite some time, and Stefan Groschupf is not working on this issue (or, I think, on Nutch) anymore. Everyone would like to see this feature in Nutch, but it needs someone to pick up the latest patch (or start from scratch), make a design and implement it.


Andrzej Bialecki added a comment - 06/Feb/09 01:21 PM
Move to 1.1 - needs a significant update.

Marko Bauhardt added a comment - 09/Aug/09 06:34 PM - edited
On github.com we have forked the branch "tags/release-1.0" from the nutch project http://github.com/apache/nutch.
This fork is available at http://github.com/101tec/nutch/tree/nutch-gui.

wiki:
http://wiki.github.com/101tec/nutch

Downloads:
http://github.com/101tec/nutch/downloads


Andrzej Bialecki added a comment - 09/Oct/09 02:00 PM
Marko - this is a very nice frontend, and I think that many beginners would find it useful. We could add this to Nutch distribution in contrib/. However, we need this submitted as a patch under Apache license - we can't accept contributions directly from external SCMs.

Marko Bauhardt added a comment - 15/Oct/09 08:22 AM
Thanks Andrzej. Sounds good.
OK great, I will create a patch. The "problem" is that we use some libraries, so I can't create a plain patch file. But I can create a combination of patch files and shell scripts that copy the jar files into the specified lib folders.

We are using the following libraries:
cron4j-1.1.6.jar
commons-fileupload-1.2.1.jar
jstl-1.1.2.jar
spring.jar
commons-io-1.4.jar
mockito-all-1.8.0.jar
standard-1.1.2.jar
commons-logging.jar
spring-webmvc.jar
Yahoo UI for javascript, css etc: http://developer.yahoo.com/yui/

I'm not a license expert, so I'm not sure whether we can submit all of these libraries and YUI under the Apache 2 license.

YUI is licensed under BSD.
Mockito is only a test mock framework. It is licensed under the MIT License.

What do you think? Can we submit the Nutch GUI with all these libraries under the Apache license?


Andrzej Bialecki added a comment - 15/Oct/09 08:30 AM
You can create a tar of everything and attach here, plus a patch if you need to patch anything in existing classes. Although with such a large component I think we will probably need a software grant (see http://www.apache.org/licenses/software-grant.txt).

Re: licensing - generally speaking anything under BSD or MIT is ok, anything under *GPL cannot be added to SVN, though it can still be downloaded during build.


Marko Bauhardt added a comment - 15/Oct/09 10:44 AM
OK, I have checked the licenses of all jars. Here is a summary:

+ it looks like the 'cron4j' jar is licensed under LGPL, so I can't submit it to SVN. It should be downloaded during the build or something like that.
+ the standard-1.1.2.jar should be licensed under the Apache license
+ all Spring jars are under the Apache license
+ all Commons jars are under the Apache license
+ Mockito is under the MIT license
+ YUI is under the BSD license

but
+ jstl-1.2.2 is maybe under the "Sun Binary Code License", but I'm not sure. How can we handle this jar? Can we deliver a Nutch binary with this jar?

PS:
I copied the jstl and standard jars from the Spring binary download package.