Nutch / NUTCH-837

Remove search servers and Lucene dependencies

    Details

    • Type: Task
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.1
    • Fix Version/s: nutchgora
    • Component/s: web gui
    • Labels:
      None

      Description

      One of the main aspects of 2.0 is the delegation of indexing and search to external resources like SOLR. We can simplify the code a lot by getting rid of:

      • search servers
      • indexing and analysis with Lucene
      • search-side functionalities: ontologies, clustering, etc.

      In the short term only SOLR / SOLRCloud will be supported, but the plan is to add other systems as well.
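To illustrate what delegating search to SOLR means for a client, here is a minimal sketch that builds a request URL for Solr's standard select handler. The base URL `http://localhost:8983/solr`, the helper name `build_solr_query_url`, and the `content` field in the usage note are assumptions for illustration only; they are not taken from the patch or from any Nutch configuration.

```python
from urllib.parse import urlencode


def build_solr_query_url(base_url, query, rows=10, start=0):
    """Build a URL for Solr's select handler (hypothetical helper).

    base_url is assumed to point at a Solr instance or core,
    e.g. http://localhost:8983/solr
    """
    params = {
        "q": query,      # the query string, in Solr query syntax
        "start": start,  # offset of the first result to return
        "rows": rows,    # number of results per page
        "wt": "json",    # ask Solr for a JSON response
    }
    # urlencode percent-escapes the values, so "content:nutch"
    # becomes "content%3Anutch" in the final URL.
    return f"{base_url.rstrip('/')}/select?{urlencode(params)}"


if __name__ == "__main__":
    print(build_solr_query_url("http://localhost:8983/solr", "content:nutch"))
```

Fetching the resulting URL with any HTTP client would return the matching documents, which is the job the removed search servers used to do inside Nutch.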
      1. NUTCH-837.patch
        1.10 MB
        Andrzej Bialecki

        Activity

        ab Andrzej Bialecki added a comment -

        Warning - Nutch veterans may want to sit down before reading, because it looks like half of Nutch code is deleted in this patch...

        This patch implements the changes. All tests (that remain) pass, and a full crawl cycle plus Solr indexing works as before. There is no single entry point in Nutch at this moment for searching - we may want to add a minimal test search setup based on Solr in another patch.

        chrismattmann Chris A. Mattmann added a comment -

        hahah uh oh!

        I'll try and take a look before next Tuesday...

        jnioche Julien Nioche added a comment -

        I think we can also get rid of:

        • docs/
        • WAR related tasks in ANT
        • src/web/
        • src/xmlcatalog/
        • src/engines/
        chrismattmann Chris A. Mattmann added a comment -

        Hey Julien:

        How are we going to replace the Nutch webapp?

        Cheers,
        Chris

        jnioche Julien Nioche added a comment -

        Hi Chris,

        My position on this is that we simply wouldn't replace it. We delegate the search to SOLR and expect people to reuse existing front ends for SOLR or build custom ones (as I expect real world deployments of Nutch would do anyway). Maintaining the webapps takes some effort that I doubt we can afford given the limited number of active committers that we have. I'd rather we focused on crawl-related functionalities.

        WDYT?

        J.

        chrismattmann Chris A. Mattmann added a comment -

        I'm not sure I agree.

        The Nutch webapp is just a set of web pages that let someone know that Search is working. They are decent web pages, have a great look and feel and are something I've seen nearly every newbie Nutch user I've been around leverage to tell whether or not Nutch installed correctly.

        I'm also a fan of the "let's not lose functionality on a technology upgrade task" mantra. That is, we are reorganizing the architecture of Nutch to improve it, not to take away functionality. We should at least support the baseline of functionality that was present in 1.x.

        That said, I'm not sure the existing webapp should be maintained in its current form. Maybe we should take a pass at updating the webapp to work with the Nutch 2.0 architecture underneath. I'm happy to pick up a shovel and dig on that one.

        Cheers,
        Chris

        jnioche Julien Nioche added a comment -

        Thanks for your comments, Chris.

        > The Nutch webapp is just a set of web pages that let someone know that Search is working. They are decent web pages, have a great look and feel and are something I've seen nearly every newbie Nutch user I've been around leverage to tell whether or not Nutch installed correctly.

        Well, the SOLR webapps would be just as good, if not better, for debugging. You get all sorts of stats and can debug your queries, etc. The front end and its configuration are also a common source of trouble for beginners.

        > I'm also a fan of the "let's not lose functionality on a technology upgrade task" mantra. That is, we are reorganizing the architecture of Nutch to improve it, not to take away functionality. We should at least support the baseline of functionality that was present in 1.x.

        I don't think it is completely lost: we still have the webapps from SOLR.
        Regardless of the debug aspect mentioned earlier, I really think that any real application based on Nutch would customise the front end anyway.

        > That said, I'm not sure the existing webapp should be maintained in its current form. Maybe we should take a pass at updating the webapp to work with the Nutch 2.0 architecture underneath. I'm happy to pick up a shovel and dig on that one.

        This would indeed need doing, i.e. getting the cached data or inlinks straight from the webtable via GORA. Speaking of which, we should probably think in terms of "what functionalities do we have in Nutch that are currently missing in SOLR", one of them being the ability to get the cache from HDFS/GORA/etc. without having to store the content in the index.

        chrismattmann Chris A. Mattmann added a comment -

        Hey Julien,

        Yep that's the point. Solr != Nutch, so Solr's Webapp can't be expected to be = Nutch's webapp. The example you cited about cached data is a great one, because Solr's webapp doesn't really support that (nor should it IMHO).

        So, I think we should still have a Nutch webapp, and in my mind it's a must-have for a 2.0 release... not to worry though, I'm volunteering to help do it!

        Cheers,
        Chris

        ab Andrzej Bialecki added a comment -

        Updated patch against r959954 (after NUTCH-836).

        ab Andrzej Bialecki added a comment - - edited

        > So, I think we should still have a Nutch webapp and in my mind it's a must-have for a 2.0 release...

        I agree. But for the moment it's better to delete the old webapp stuff that we know for sure doesn't work with the current Nutch, and it will be completely reimplemented anyway. Refactoring it to work with the new Solr-based app is likely not worth it - we can achieve a similar effect to the current webapp by just tweaking the styling of the Solritas handler.

        chrismattmann Chris A. Mattmann added a comment -

        Okey dok, I created NUTCH-841 to track it. Julien, Andrzej, you have my +1 to take your axe to the old one.

        jnioche Julien Nioche added a comment -

        Comments on the latest patch:

        • default.properties: some entries can be removed:
          docs.dir = ./docs
          docs.src = ${basedir}/src/web
          xmlcatalog.dir = ${basedir}/src/xmlcatalog
          build.webapps = ${build.dir}/webapps
          web.src.dir = ./src/web
          src.webapps = ./src/webapps
        • docs/: still there
        • src/web/: ditto

        Apart from that, +1.
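The cleanup of default.properties suggested above could be scripted roughly as follows. This is only a sketch: the helper name `strip_obsolete` is hypothetical, and it assumes every entry uses the simple `key = value` form shown in the comment (the full Java properties syntax also allows `:` separators and line continuations, which are not handled here).

```python
# Keys listed in the review comment as no longer needed.
OBSOLETE_KEYS = {
    "docs.dir",
    "docs.src",
    "xmlcatalog.dir",
    "build.webapps",
    "web.src.dir",
    "src.webapps",
}


def strip_obsolete(lines, obsolete=OBSOLETE_KEYS):
    """Return the lines with obsolete `key = value` entries dropped.

    Hypothetical helper: blank lines and comment lines (starting
    with '#' or '!') are kept untouched.
    """
    kept = []
    for line in lines:
        stripped = line.strip()
        if stripped and not stripped.startswith(("#", "!")):
            # Everything before the first '=' is treated as the key.
            key = stripped.split("=", 1)[0].strip()
            if key in obsolete:
                continue
        kept.append(line)
    return kept


if __name__ == "__main__":
    sample = [
        "name = nutch",
        "docs.dir = ./docs",
        "# build settings",
        "web.src.dir = ./src/web",
    ]
    for entry in strip_obsolete(sample):
        print(entry)
```

Reading the real file with `open(...).read().splitlines()` and writing the filtered lines back would complete the cleanup; reviewing the diff before committing is advisable, since other build targets may still reference these keys.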

        ab Andrzej Bialecki added a comment -

        Committed in r960064. Thanks for review!

        hudson Hudson added a comment -

        Integrated in Nutch-trunk #1197 (See http://hudson.zones.apache.org/hudson/job/Nutch-trunk/1197/)

          People

          • Assignee: ab Andrzej Bialecki
          • Reporter: jnioche Julien Nioche