Great job, Stefan and Marko! I just tried it and it looks very good. I am currently running 3 search sites with 0.8-dev. Using automated shell scripts I can do most things, but this patch will allow me to run things from home, since some of my clients don't allow ssh access. So I hope this patch gets included; I have just voted for it.
Thanks again Stefan and Marko for your hard work! Cheers
I am getting compilation errors when I apply the patches and recompile Nutch. I think the Xalan jars are missing. After I downloaded them, the admin plugins compile without errors. BTW: the jars are also missing from the binary distribution.
Some random thoughts...
I am a strong supporter of XML. Can we not re-think this, like http://issues.apache.org/jira/browse/SOLR-58 ? Do we really need to use the Nutch plugin architecture? The patch is currently outdated, so I think it would be a good idea to give it another round of discussion.
> I am a strong supporter of XML. Can we not re-think about this like
I would say neither of those. We should concentrate on building a good Java admin API. Everything after that is an implementation detail, as the API can then easily be exposed as XML or something else remotely usable. By doing it this way the admin functionality can easily be integrated into various places and technologies. Some kind of extension mechanism needs to be used because nutch is extendable in general (you could plug additions into the admin gui as you plug functionality into nutch). IMO that is not the first priority. I would propose to put in the basic functionality first for configuring, scheduling and generally managing crawls, then add more functionality on top of that.
Are you thinking of something like a UI extension point like in contrib/web2? I completely agree with you in terms of a solid admin API.
I have updated the patch written by Stefan.
This version works with Nutch-0.9-dev and hadoop-0.7.1 (current version of nutch so far). First extract the tar.gz file into the root of nutch. It should copy
then patch nutch with
Patched hadoop is included in the archive, but if you wish you can patch hadoop using
I have not tested every feature of this plugin, so there can still be some bugs.
> Are you thinking of something like UI extension point like in contrib/web2 ?
Not necessarily, that was also a quick hack I put together. It does, however, allow you to plug in new functionality or layout via a plugin (from inside a jar). But I guess Stefan has also implemented something like that in his patch.
Is there any new version of the patch, or any plan to put it in the nutch repository?
> Is there any new version of the patch, or any plan to put it in the nutch repository ?
This issue has not been updated for quite some time, and Stefan Groschupf is not working on this issue (or [I think] on nutch) anymore. Everyone would like to see this feature in nutch, but it needs someone to pick up the latest patch (or start from scratch), make a design and implement it.
Move to 1.1 - needs a significant update.
On github.com we have forked the branch "tags/release-1.0" from the nutch project http://github.com/apache/nutch
This fork is available at http://github.com/101tec/nutch/tree/nutch-gui
wiki:
Downloads:
Marko - this is a very nice frontend, and I think that many beginners would find it useful. We could add this to the Nutch distribution in contrib/. However, we need this submitted as a patch under the Apache license - we can't accept contributions directly from external SCMs.
Thanks Andrzej. Sounds good.
Ok great, I will create a patch. The "problem" is that we use some libraries, so I can't create a plain patch file. But I can create a combination of patch files and shell scripts that copy the jar files into the specified lib folders. We are using the following libraries
I'm not a license expert, so I'm not sure whether we can submit all those libs and yui under apache2. yui is licensed under BSD. What do you think? Can we submit the nutch gui with all these libraries under the apache license?
You can create a tar of everything and attach it here, plus a patch if you need to patch anything in existing classes. Although with such a large component I think we will probably need a software grant (see http://www.apache.org/licenses/software-grant.txt ).
Re: licensing - generally speaking anything under BSD or MIT is ok, anything under *GPL cannot be added to SVN, though it can still be downloaded during build.
Ok, I have checked the licenses of all jars. Here is a summary
+ it looks like the jar 'cron4j' is licensed under lgpl, so I can't submit it to svn. It should be downloaded while building or something like that.
but ps.
There are known issues; however, it is a starting point from which we can continue building a solid administration user interface.
This patch introduces the following functionality:
+ web based administration gui via an embedded web container
+ gui is fully based on the plugin system, so it is customizable and extendable using plugins
+ all plugins can be internationalized
+ introduces the concept of nutch instances, a mechanism to have separately configurable nutch deployments using the same code base (e.g. intranet search, webpage search)
+ pluggable authentication; currently it comes with a default user - password tuple based on the configuration, but for example LDAP integration can easily be realized
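To illustrate the pluggable authentication point, here is a minimal sketch of how a configuration-based default could sit behind an interface that an LDAP plugin might implement instead. This is not the code from the patch: the names `GuiAuthenticator`, `ConfigAuthenticator` and the property keys `admin.gui.user` / `admin.gui.password` are invented for this example, and a plain `Map` stands in for the nutch configuration.

```java
import java.util.Map;

/** Hypothetical extension point: a plugin could supply its own implementation (e.g. LDAP). */
interface GuiAuthenticator {
    boolean authenticate(String user, String password);
}

/** Default variant: a single user/password tuple read from configuration. */
final class ConfigAuthenticator implements GuiAuthenticator {
    private final String user;
    private final String password;

    ConfigAuthenticator(Map<String, String> conf) {
        // The property names are made up for this sketch, not taken from the patch.
        this.user = conf.getOrDefault("admin.gui.user", "admin");
        this.password = conf.getOrDefault("admin.gui.password", "admin");
    }

    @Override
    public boolean authenticate(String givenUser, String givenPassword) {
        // Both values must match the configured tuple.
        return user.equals(givenUser) && password.equals(givenPassword);
    }
}

class AuthSketch {
    public static void main(String[] args) {
        // An empty configuration falls back to the "admin"/"admin" default.
        GuiAuthenticator auth = new ConfigAuthenticator(Map.of());
        System.out.println(auth.authenticate("admin", "admin"));
        System.out.println(auth.authenticate("admin", "wrong"));
    }
}
```

The gui would only ever talk to the interface, so swapping the default for another backend would not touch the calling code.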
The patch comes with the following plugins:
+ admin-listing
++ required by the web ui to show all deployed plugins as tabs on a webpage
+ admin-instance
++ lists all instances and allows creating a new instance
+ admin-configuration
++ configure a nutch instance (the configuration will be written to disk as nutch-site.xml)
+ admin-inject
++ inject urls into a crawlDb
+ admin-system
++ shows status of system
+ admin-job
++ shows status of jobs
+ admin-crawldb-status
++ shows crawldb entries filtered by status or shows the status of a given url (useful to check if a page was already fetched)
+ admin-management
++ generate segment
++ fetch segment
++ parse segment (if required)
++ update crawldb
++ invert links
++ index segment
++ delete segment, parse, index etc.
+ admin-scheduling
++ Quartz-based cron job management to run a time driven "generate - fetch - updatedb - invertlinks - index" job
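As a rough picture of what a time driven "generate - fetch - updatedb - invertlinks - index" job amounts to, the sketch below runs named steps strictly in order, stops at the first failure, and repeats the cycle on a timer. It is only an illustration: the patch uses Quartz, whereas this example uses the JDK's `ScheduledExecutorService`, the class `CrawlCycle` is invented here, and the step bodies are placeholders rather than real nutch calls.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Hypothetical helper: runs named steps in insertion order and stops at the first failure. */
final class CrawlCycle {
    private final Map<String, Callable<Boolean>> steps = new LinkedHashMap<>();

    CrawlCycle step(String name, Callable<Boolean> action) {
        steps.put(name, action);
        return this;
    }

    /** Returns the names of the steps that completed successfully. */
    List<String> runOnce() {
        List<String> done = new ArrayList<>();
        for (Map.Entry<String, Callable<Boolean>> entry : steps.entrySet()) {
            try {
                if (!entry.getValue().call()) {
                    break; // a failed step aborts the rest of the cycle
                }
            } catch (Exception e) {
                break; // an exception is treated like a failed step
            }
            done.add(entry.getKey());
        }
        return done;
    }
}

class CrawlCycleSketch {
    public static void main(String[] args) throws InterruptedException {
        // Placeholder steps; real ones would start the corresponding nutch jobs.
        CrawlCycle cycle = new CrawlCycle()
                .step("generate", () -> true)
                .step("fetch", () -> true)
                .step("updatedb", () -> true)
                .step("invertlinks", () -> true)
                .step("index", () -> true);

        ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
        // Fixed delay: the next cycle starts only after the previous one has finished.
        timer.scheduleWithFixedDelay(
                () -> System.out.println("completed: " + cycle.runOnce()),
                0, 1, TimeUnit.SECONDS);
        Thread.sleep(2500);
        timer.shutdownNow();
    }
}
```

Stopping at the first failed step matters here because each stage consumes the output of the previous one; indexing a segment that was never fetched would be pointless.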
Known issues
+ requires hadoop changes
+ locally running jobs cannot be stopped, but distributed jobs can be stopped
+ the index searcher does not use index folders inside of segment folders as in nutch 0.7, but the gui places the index folder in the segment folder
++ searcher is unable to find indices
+ put to search does not work since the searcher does not support dynamically adding index folders
+ the linkdb inverter does not update but overwrites a linkdb - this is a general nutch bug but affects the gui as well.
+ the nutch gui introduces locking by storing lock files in folders; this mechanism is ignored by the nutch command line tools.
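The last issue follows from how lock files work in general: they are advisory, so only code that checks for the marker is held back. The sketch below shows the general technique, not the patch's actual implementation; the class `FolderLock` and the marker name `.locked` are assumptions made for this example.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical advisory lock: a marker file inside the folder being worked on. */
final class FolderLock {
    private final Path lockFile;

    FolderLock(Path folder) {
        // ".locked" is a made-up marker name for this sketch.
        this.lockFile = folder.resolve(".locked");
    }

    /** Returns true if this caller created the marker, false if it already existed. */
    boolean tryLock() {
        try {
            Files.createFile(lockFile); // fails if the file already exists
            return true;
        } catch (FileAlreadyExistsException e) {
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    void unlock() {
        try {
            Files.deleteIfExists(lockFile);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}

class FolderLockSketch {
    public static void main(String[] args) throws IOException {
        Path segment = Files.createTempDirectory("segment");
        FolderLock gui = new FolderLock(segment);
        FolderLock secondJob = new FolderLock(segment);
        System.out.println(gui.tryLock());       // first caller creates the marker
        System.out.println(secondJob.tryLock()); // marker exists, so this caller backs off
        gui.unlock();
    }
}
```

A tool that never calls something like `tryLock()` simply writes into the folder regardless, which is why the command line tools are unaffected by the gui's locks.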
It would be great if users could test the gui, report bugs and help to improve the patch.
This is a very complex patch and it is difficult to stay in sync with the latest changes, so if we missed something
before generating this patch and the patch does not work as expected, please don't blame us but give us some time and hints to fix the problems.
Help is welcome with the following tasks:
+ fixing language issues in javadoc, api and bundle files
+ translating bundles into more languages (currently it comes with english and german bundles)
+ testing heavily, finding bugs and providing fixes
+ writing help texts and documentation
How to:
+ checkout latest nutch sources
+ checkout hadoop sources
+ patch hadoop with the hadoop patch
+ build hadoop jar
+ remove old hadoop jar from nutch/lib
+ place new hadoop jar in nutch/lib
+ uncompress plugin zip file
+ place plugins in nutch/src/plugins (a patch is not possible since svn does not support binary patches)
+ patch nutch with the nutch patch
+ start the gui with bin/nutch gui <folderWhereYourInstanceDataWillBeStored>
+ point your browser to: http://localhost:50060/general/
+ username and password are "admin" (can be changed in nutch-default.xml)
+ select the "default" instance or create a new instance.
Thanks to everybody who helped to get this implemented and did the first beta tests, but especially to Marko for hacking all the JSPs!
I suggest adding this patch to a nutch 0.9 branch and adding a gui component in the jira to go from there.
I really hope I didn't miss anything or upload the wrong files now. :-O