Nutch / NUTCH-902

Add all necessary files and configuration so that nutch can be used with different backends out-of-the-box

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: nutchbase
    • Fix Version/s: nutchgora
    • Component/s: documentation, storage
    • Labels:
      None
    • Patch Info:
      Patch Available

      Description

      As per the discussion in the mailing list and http://wiki.apache.org/nutch/GORA_HBase, it will be good to include all the necessary files and configuration. I propose that we maintain configuration for at least SQL, HBase and Cassandra.

      The following changes are needed:
      conf/gora-sql-mapping.xml
      conf/gora-hbase-mapping.xml
      conf/gora-cassandra-mapping.xml
      comments on nutch-default and ivy.xml

      Shall we also include jars from gora-hbase, gora-cassandra and their dependencies?

      1. NUTCH-902.patch
        6 kB
        Lewis John McGibbney
      2. NUTCH-902-v2.patch
        7 kB
        Lewis John McGibbney
      3. NUTCH-902-v3.patch
        0.7 kB
        Lewis John McGibbney


          Activity

          Enis Soztutar added a comment -

          Nice work guys. I'm closing the issue per discussion. It seems we have everything commented out and ready to be set free.
          One suggestion would be to add some documentation in the wiki or site showing how to use nutchgora with other stores, if we don't have it already.

          Lewis John McGibbney added a comment -

          Hey Enis. When you get a min, can you please check and close? We are getting ever closer to a 2.0 RC here :0)
          Thank you, great work Ferdy.

          Hudson added a comment -

          Integrated in Nutch-nutchgora #240 (See https://builds.apache.org/job/Nutch-nutchgora/240/)
          NUTCH-902 (merge different "storage.data.store.class" entries into one) (Revision 1330807)

          Result = SUCCESS
          ferdy :
          Files :

          • /nutch/branches/nutchgora/conf/nutch-default.xml
          Ferdy Galema added a comment -

          Ok done. (Note that I did not actually check the stores, I simply merged the nutch-default.xml entries)

          Ferdy Galema added a comment -

          Alright I'll change and commit the "storage.data.store.class" property description.

          Aside from that I think we can close this issue. Effort can be put into NUTCH-1205 and, after that, actual testing of the stores to see if the current configuration is sufficient for out-of-the-box usage. If this is not the case for some stores, we can always create new issues for those (to prevent too much clutter in this issue).

          Lewis John McGibbney added a comment -

          Yeah +1.
          Is there anything else you find we require from Enis's initial comments on this issue?

          Ferdy Galema added a comment -

          I think nutch-default.xml does not correctly use the description field of the "storage.data.store.class" property. The description should describe what the property is about, not what the value is about. So instead of the various entries:

          <!--
          <property>
          <name>storage.data.store.class</name>
          <value>org.apache.gora.cassandra.store.CassandraStore</value>
          <description>Gora class for storing data in Apache Cassandra</description>
          </property>
          -->

          <!--
          <property>
          <name>storage.data.store.class</name>
          <value>org.apache.gora.hbase.store.HBaseStore</value>
          <description>Gora class for storing data in Apache HBase</description>
          </property>
          -->

          and so on.

          I propose adding a single property entry with a description like this:

          <property>
          <name>storage.data.store.class</name>
          <value>org.apache.gora.sql.store.SqlStore</value>
          <description>The Gora DataStore class for storing/retrieving data.
          Currently the following stores are available:

          org.apache.gora.sql.store.SqlStore
          A DataStore implementation for RDBMS with a SQL interface.
          SqlStore uses JDBC drivers to communicate with the DB.

          org.apache.gora.hbase.store.HBaseStore
          DataStore implementation for Hadoop HBase.

          etcetera

          </description>
          </property>

          This has the additional benefit to make the nutch-default.xml look cleaner, imho.
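
          For illustration (not part of the original comment): with a single entry like the one above in nutch-default.xml, switching to another backend would presumably come down to overriding the value in conf/nutch-site.xml. A minimal sketch, assuming the HBase store and the usual Hadoop-style <configuration> wrapper:

          ```xml
          <?xml version="1.0"?>
          <configuration>
            <!-- Overrides the default from nutch-default.xml; the class name is the one quoted in this thread. -->
            <property>
              <name>storage.data.store.class</name>
              <value>org.apache.gora.hbase.store.HBaseStore</value>
            </property>
          </configuration>
          ```

          The matching gora-hbase dependency in ivy/ivy.xml and a conf/gora-hbase-mapping.xml would still have to be in place for this to work.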

          Lewis John McGibbney added a comment -

          I made some commits on this to include the memory store, AvroStore, DataFileAvroStore and Accumulo properties in nutch-site, and some rough properties in gora.properties. I'm not clued up on the Accumulo mappings and we have no mappings for the *AvroStore implementations, therefore this one really should stay open. That being said, I do feel that what is currently committed in Nutchgora is enough for anyone to work with. wdygt?

          Ferdy Galema added a comment -

          Just made a second commit regarding gora-hbase:
          -Removed the comment that the hbase jar should be put in the lib folder manually as this is not needed anymore.
          -Removed the explicit zookeeper dependency, it is already included transitively: gora-hbase --> hbase --> zookeeper.
          -Excluded hsqldb because this is already explicitly included elsewhere in the ivy file.
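
          A rough sketch of what the gora-hbase entry in ivy/ivy.xml might look like after these changes (not copied from the commit). The rev value is borrowed from the gora-cassandra snippet elsewhere in this issue, and the exact form of the hsqldb exclusion is an assumption; check the committed ivy.xml for the real line.

          ```xml
          <!-- Sketch only: rev and the exclude rule are assumptions. -->
          <dependency org="org.apache.gora" name="gora-hbase" rev="0.1.1-incubating" conf="*->default">
            <!-- hsqldb is declared explicitly elsewhere in ivy.xml, so keep it out of the transitive set -->
            <exclude module="hsqldb" />
            <!-- no explicit zookeeper entry: it arrives transitively via gora-hbase, then hbase -->
          </dependency>
          ```

          The "*->default" mapping is what causes the gora-hbase module itself to be copied into the lib target folder.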

          Ferdy Galema added a comment -

          Committed change to the gora-hbase line in ivy: use the 'default' dependency configuration instead of 'compile'. This way the gora-hbase module is actually placed in the lib target folder.

          Hudson added a comment -

          Integrated in Nutch-nutchgora #185 (See https://builds.apache.org/job/Nutch-nutchgora/185/)
          NUTCH-902 Add all necessary files and configuration so that nutch can be used with different backends out-of-the-box (Revision 1297401)

          Result = SUCCESS
          ferdy :
          Files :

          • /nutch/branches/nutchgora/conf/gora-sql-mapping.xml
          Ferdy Galema added a comment -

          I just committed a minor change to the sql mapping. (The content field should have the length that is the default max content in nutch-default.xml, namely 65536.) Tested this and it works.
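
          A sketch of how the two values line up (not taken from the commit). The default max content is presumably the http.content.limit property in nutch-default.xml; the mapping element and attribute names below are written from memory of the Gora SQL mapping format and should be treated as assumptions rather than a copy of the committed file.

          ```xml
          <!-- nutch-default.xml: default cap on fetched content, in bytes -->
          <property>
            <name>http.content.limit</name>
            <value>65536</value>
          </property>

          <!-- gora-sql-mapping.xml (fragment, attribute names assumed): content column sized to match -->
          <field name="content" column="content" length="65536"/>
          ```

          If the content limit is raised in nutch-site.xml, the column length would need to be raised with it.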

          Hudson added a comment -

          Integrated in Nutch-nutchgora #180 (See https://builds.apache.org/job/Nutch-nutchgora/180/)
          NUTCH-902 subcommit: change maxVersions to 1 of families in gora-hbase-mapping.xml (Revision 1295613)

          Result = SUCCESS
          ferdy :
          Files :

          • /nutch/branches/nutchgora/conf/gora-hbase-mapping.xml
          Ferdy Galema added a comment - - edited

          Note I changed gora-hbase-mapping.xml slightly: I added maxVersions=1 for each column family, since currently the HBaseStore for Gora is not prepared at all for multiple versions. It makes no sense to have it set to more than 1, for now.
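
          For readers unfamiliar with the file, a minimal sketch of the shape of such a mapping with maxVersions set per column family. The root element, family names and the single field shown are illustrative assumptions, not a copy of the committed gora-hbase-mapping.xml.

          ```xml
          <gora-orm>
            <table name="webpage">
              <!-- maxVersions="1": HBase keeps a single cell version per column in each family -->
              <family name="f" maxVersions="1"/>
              <family name="p" maxVersions="1"/>
            </table>
            <class name="org.apache.nutch.storage.WebPage" keyClass="java.lang.String" table="webpage">
              <!-- one example field; the real file maps every WebPage field -->
              <field name="baseUrl" family="f" qualifier="bas"/>
            </class>
          </gora-orm>
          ```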

          Hudson added a comment -

          Integrated in Nutch-nutchgora-ant #9 (See https://builds.apache.org/job/Nutch-nutchgora-ant/9/)
          commit to address NUTCH-902 and update to changes.txt

          lewismc : http://svn.apache.org/viewvc/nutch/branches/nutchgora/viewvc/?view=rev&root=&revision=1195403
          Files :

          • /nutch/branches/nutchgora/CHANGES.txt
          • /nutch/branches/nutchgora/build.xml
          • /nutch/branches/nutchgora/conf/gora-cassandra-mapping.xml
          • /nutch/branches/nutchgora/conf/gora-hbase-mapping.xml
          • /nutch/branches/nutchgora/conf/gora-sql-mapping.xml
          • /nutch/branches/nutchgora/conf/nutch-default.xml
          • /nutch/branches/nutchgora/ivy/ivy.xml
          Hudson added a comment -

          Integrated in Nutch-nutchgora #55 (See https://builds.apache.org/job/Nutch-nutchgora/55/)
          commit to address NUTCH-902 and update to changes.txt

          lewismc : http://svn.apache.org/viewvc/nutch/branches/nutchgora/viewvc/?view=rev&root=&revision=1195403
          Files :

          • /nutch/branches/nutchgora/CHANGES.txt
          • /nutch/branches/nutchgora/build.xml
          • /nutch/branches/nutchgora/conf/gora-cassandra-mapping.xml
          • /nutch/branches/nutchgora/conf/gora-hbase-mapping.xml
          • /nutch/branches/nutchgora/conf/gora-sql-mapping.xml
          • /nutch/branches/nutchgora/conf/nutch-default.xml
          • /nutch/branches/nutchgora/ivy/ivy.xml
          Lewis John McGibbney added a comment -

          patch to include previous config changes to NUTCHGORA/ivy/ivy.xml

          Lewis John McGibbney added a comment -

          There are some slight problems here. I was getting unresolved dependencies when building Nutchgora with Cassandra as the backend, so you need to add the following to ivy/ivy.xml:

          <!-- Uncomment this to use Cassandra as Gora backend. -->
          <dependency org="org.apache.gora" name="gora-cassandra" rev="0.1.1-incubating" conf="*->default">
            <exclude org="org.apache.thrift" />
            <exclude org="org.apache.cassandra" />
          </dependency>

          Enis, I completely agree with your comment that every dependency should be managed from within Gora. However, if the changes are not in the above Gora dependency on the Maven repo then we cannot use them, therefore the exclusions need to be added in Nutchgora_home/ivy/ivy.xml prior to using the ant runtime target. I will attach a patch for the above in due course.

          Lewis John McGibbney added a comment -

          Reopened as Cassandra configurations in ivy/ivy.xml are not complete.

          Enis Soztutar added a comment -

          Patch looks good, but can you please test with cassandra. Theoretically, gora-cassandra should contain all dependencies itself, so I don't think we need to add other dependencies there.

          Lewis John McGibbney added a comment -

          Committed @ revision 1195403 in Nutchgora branch.

          I would ask if Enis could now do a final check and then close. Thank you for the pointers.

          Lewis John McGibbney added a comment -

          Revised patch to incorporate additional comments.

          Lewis John McGibbney added a comment -

          Hi Enis,

          1) This was to disambiguate the nutch build files within my Eclipse IDE. The 1.4 trunk and the Nutchgora branch are both called Nutch, which adds overhead to cleaning, testing, building etc. from within the dev environment.
          2) Yes this is correct. I will substantiate the annotations, and will also determine whether or not we need some additional dependency targets when I fire this into a Cassandra instance. Thanks for commenting.

          Enis Soztutar added a comment -

          By all means Lewis.

          Patch looks good, I have just two comments:

          • Why do we change the project name to be nutchgora in build.xml?
          • Can you add some comments at nutch-default.xml for property "storage.data.store.class". We already have values for HBase and Cassandra, but I think if we can add a brief comment there, this would be great.
          Lewis John McGibbney added a comment -

          This is the beginning of a patch to address the ticket. It smartens up some files here and there, however as I've not been able to test recently on cassandra I don't know which additional dependencies are required to be added to ivy/ivy.xml (hector???).

          Finally, I've just used 'other' implementations from various resources for both cassandra and hbase xml mapping files. Obviously this is up for debate so please comment.

          Lewis John McGibbney added a comment -

          Hi Enis, is it OK if I start work on this? Currently we support a gora-sql-mapping.xml and gora.properties file, but I'm happy to begin working on your suggestions as above. Over time, I've already committed the comments and configuration properties to nutch-default.xml, therefore a patch would contain the remaining parts from above that you have highlighted.

          One last thing, does the mapping document from the GORA_HBase tutorial suit as default options for gora-hbase-mapping?


            People

            • Assignee:
              Lewis John McGibbney
              Reporter:
              Enis Soztutar
            • Votes:
              0
              Watchers:
              1
