Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: nutchgora
    • Fix Version/s: nutchgora
    • Component/s: None
    • Labels:
      None
    • Patch Info:
      Patch Available

      Description

      Having a separate GORA table for storing information about hosts (and domains?) would be very useful for:

      • customising the behaviour of the fetching on a host basis e.g. number of threads, min time between threads etc...
      • storing stats
      • keeping metadata and possibly propagate them to the webpages
      • keeping a copy of the robots.txt and possibly use that later to filter the webtable
      • store sitemaps files and update the webtable accordingly

      I'll try to come up with a GORA schema for such a host table, but any comments are of course already welcome.
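      For reference, a minimal sketch of what such an Avro schema for a host record might look like. The field names here are illustrative only, not the committed host.avsc:

```json
{
  "name": "Host",
  "type": "record",
  "namespace": "org.apache.nutch.storage",
  "fields": [
    {"name": "metadata", "type": {"type": "map", "values": "bytes"}},
    {"name": "inlinks",  "type": {"type": "map", "values": "string"}},
    {"name": "outlinks", "type": {"type": "map", "values": "string"}}
  ]
}
```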

      1. hostdb.patch
        25 kB
        Doğacan Güney
      2. NUTCH-882-v1.patch
        20 kB
        Julien Nioche
      3. NUTCH-882-v3.txt
        45 kB
        Ferdy Galema

        Issue Links

          Activity

          Julien Nioche added a comment -

          Here is an initial version of the Host table. Very minimalistic for now but we'll make it evolve as we go.

          • src/gora/host.avsc : avro schema for the host table
          • modified gora mapping for sql backend
          • src/java/org/apache/nutch/host/HostDBReader.java : displays the info about an entry of the host table (or all of them)
          • src/java/org/apache/nutch/host/HostInjectorJob.java : mapreduce job which takes a seed host list and populates the host table with it
          • src/java/org/apache/nutch/host/HostMDApplierJob.java : mapreduce job which reads through the webtable and adds metadata taken from the host table

          The key for the host table is reversed - just like the WebPage keys in the webtable. The hosts are represented with a full URL in order to know about the protocol as well. The metadata are similar to the ones in the WebPages.
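          The key reversal can be sketched as follows. This standalone helper is illustrative only; Nutch's actual implementation lives in its table utilities:

```java
// Illustrative sketch: reverse the dotted host name so that related
// hosts sort next to each other in the table, e.g. "www.example.com"
// becomes "com.example.www".
class HostKey {
    public static String reverseHost(String host) {
        String[] parts = host.split("\\.");
        StringBuilder sb = new StringBuilder(host.length());
        for (int i = parts.length - 1; i >= 0; i--) {
            sb.append(parts[i]);
            if (i > 0) sb.append('.');
        }
        return sb.toString();
    }
}
```

          With keys in this form, all subdomains of a domain occupy a contiguous key range, which is what makes the per-host range scans discussed later cheap.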

          Obviously we'll need to add more things to it but this is already useful for projecting metadata by host onto the webtable. You can do that by:

          1. Injecting the host metadata into the host table
          ./nutch org.apache.nutch.host.HostInjectorJob hostlist
          
          2. Applying the metadata to the WebTable
          ./nutch org.apache.nutch.host.HostMDApplierJob
          
          3. Checking the webtable using
          ./nutch org.apache.nutch.crawl.WebTableReader
          

          You can of course do the same thing by putting the metadata in the URL seedlist and then propagating them using custom URL filters, but that won't work if, for instance, a page is discovered from a page which belongs to a different host. The approach described here is a bit cleaner and can be used in a larger number of situations, e.g. when the metadata values are not known at the time of URL seeding. As an example: we have developed a plugin for the detection of adult content which works quite well, but found that the results were better after aggregating the stats at the host level, marking hosts as adult via a metadatum, projecting that back onto the webtable, and letting a custom indexer use the value coming from the host to override the detection at the page level.

          I am planning to add more to this code in the short term but would like to hear your comments on it. In particular I am planning to :

          • add a class which populates the host table given a webtable and (possibly) adds statistics to it at the same time
          • maybe create a new plugin endpoint so that such statistics could be achieved using custom user functions
          • write an example of such an endpoint which would add stats per status
          • look into the management of robots.txt and sitemaps
          • see how we could leverage these host-related metadata to give specific instructions for the fetching (number of threads, time between calls) etc...

          Julien

          Andrzej Bialecki added a comment -

          This functionality is very useful for larger crawls. Some comments about the design:

          • the table can be populated by injection, as in the patch, or from webtable. Since keys are from different spaces (url-s vs. hosts) I think it would be very tricky to try to do this on the fly in one of the existing jobs... so this means an additional step in the workflow.
          • I'm worried about the scalability of the approach taken by HostMDApplierJob - per-host data will be multiplied by the number of urls from a host and put into webtable, which will in turn balloon the size of webtable...

          A little background: what we see here is a design issue typical for mapreduce, where you have to merge data keyed by keys from different spaces (with different granularity). Possible solutions involve:

          • first converting the data to a common key space and then submit both data as mapreduce inputs, or
          • submitting only the finer-grained input to mapreduce and dynamically converting the keys on the fly (and reading data directly from the coarser-grained source, accessing it randomly).

          A similar situation is described in HADOOP-3063 together with a solution, namely, to use random access and use Bloom filters to quickly discover missing keys.

          So I propose that instead of statically merging the data (HostMDApplierJob) we could merge it dynamically on the fly, by implementing a high-performance reader of host table, and then use this reader directly in the context of map()/reduce() tasks as needed. This reader should use a Bloom filter to quickly determine nonexistent keys, and it may use a limited amount of in-memory cache for existing records. The bloom filter data should be re-computed on updates and stored/retrieved, to avoid lengthy initialization.
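          The proposal above can be sketched roughly as follows. All class and method names are hypothetical, a plain HashMap stands in for the Gora-backed host table, and the hand-rolled Bloom filter is only for illustration (Hadoop ships its own implementations):

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

// Hedged sketch of the proposed reader: a Bloom filter answers
// "definitely absent" without touching the backing store, and a small
// in-memory cache holds recently fetched host records.
class HostTableReader {
    private final BitSet bits;
    private final int size;
    private final int hashes;
    private final Map<String, String> store;           // stands in for the host table
    private final Map<String, String> cache = new HashMap<>();

    HostTableReader(Map<String, String> store, int size, int hashes) {
        this.store = store;
        this.size = size;
        this.hashes = hashes;
        this.bits = new BitSet(size);
        for (String key : store.keySet()) add(key);    // filter is (re)built on updates
    }

    private int bucket(String key, int i) {
        return Math.floorMod(key.hashCode() * 31 + i * 0x9E3779B9, size);
    }

    private void add(String key) {
        for (int i = 0; i < hashes; i++) bits.set(bucket(key, i));
    }

    private boolean mightContain(String key) {
        for (int i = 0; i < hashes; i++)
            if (!bits.get(bucket(key, i))) return false;
        return true;
    }

    /** Returns the host record or null, doing no store IO for definite misses. */
    public String get(String host) {
        if (!mightContain(host)) return null;          // fast path: no IO at all
        String cached = cache.get(host);
        if (cached != null) return cached;
        String value = store.get(host);                // one random read against the store
        if (value != null) cache.put(host, value);
        return value;
    }
}
```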

          The cost of using this approach is IMHO much smaller than the cost of statically joining this data. The static join costs both space and time to execute an additional job. Let's consider the dynamic join cost, e.g. in Fetcher - HostDBReader would be used only when initializing host queues, so the number of IO-s would be at most the number of unique hosts on the fetchlist (at most, because some of the host data may be missing - here's the Bloom filter to the rescue, to quickly discover this without doing any IO). During updatedb we would likely want to access this data in DbUpdateReducer. Keys are URLs here, and they are ordered in ascending order - but they are in host-reversed format, which means that URLs from similar hosts and domains are close together. This is beneficial, because when we read data from HostDBReader we will read records that are close together, thus avoiding seeks. We can also cache the retrieved per-host data in DbUpdateReducer.

          Julien Nioche added a comment -

          Thanks for your comments, Andrzej.

          the table can be populated by injection, as in the patch, or from webtable. Since keys are from different spaces (url-s vs. hosts) I think it would be very tricky to try to do this on the fly in one of
          the existing jobs... so this means an additional step in the workflow.

          Yes, that's what I meant by "add a class which populates the host table given a webtable and (possibly) add statistics to it at the same time".

          • I'm worried about the scalability of the approach taken by HostMDApplierJob - per-host data will be multiplied by the number of urls from a host and put into webtable, which will in turn balloon the size of webtable...

          There would be a duplication of information indeed. In most cases that would be just a few bytes for a metadatum, so no big deal compared to the overall size of a WebPage object.

          Re: the high-performance reader, this is a nice idea and the example you gave for the Fetcher is very relevant. In terms of performance, it should not be that different from what I've implemented in the HostMDApplierJob, at least if most hosts are present in the host table. I suppose we could keep the HostMDApplierJob at least for the time being and open a separate JIRA for the Bloom-filtered reader. Shall we put that in GORA?

          I'll start working on some code to populate/update the hostDB from an existing webtable

          Doğacan Güney added a comment -

          I would like to start implementing the idea proposed by Andrzej. I have one question: I would like Host information to be accessible to plugins as well. Unfortunately, this will mean yet another API break for plugins. Should we do it like MR and introduce a Context object? So a typical plugin would look like this:

          public void filter(String url, WebPage page, Context context); // context object will have a host in it for now. In the future, it may have other objects.
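          One possible shape for such a context, with all names hypothetical and the WebPage argument omitted for brevity:

```java
// Illustrative sketch of the proposed Context idea: plugins receive a
// context object instead of a growing argument list, so new shared
// resources can be added later without another API break. The host
// record is typed as Object here only to keep the sketch self-contained.
class NutchContext {
    private final Object host;   // host record for the current page's host, if any

    NutchContext(Object host) { this.host = host; }

    /** Host record looked up for the current URL; may be null. */
    public Object getHost() { return host; }

    // Future resources (e.g. domain stats) would be added as further
    // getters here, leaving plugin method signatures untouched.
}

// Hypothetical plugin interface following the signature proposed above.
interface UrlFilter {
    String filter(String url, NutchContext context);
}
```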

          What do you think?

          Chris A. Mattmann added a comment -

          Hey Doğacan:

          +1 to introducing a NutchContext object. We're starting to get enough information and enough of a need to build out our own specific property set.

          Cheers,
          Chris

          Andrzej Bialecki added a comment -

          +1 to NutchContext. See also NUTCH-907, because the changes required in the Gora API will likely make this task easier (once implemented).

          Doğacan Güney added a comment -

          I have implemented a NutchContext object (which only has a Host in it for now). I also added a fast Host reader using bloom filters, as Andrzej suggested. For now, I extended InjectorJob to have a NutchContext object and extended scoring filters to accept NutchContext as an argument (only scoring filters for now, but I will extend this to all plugins). The fast host reader uses a new table (called metatable... yeah, not very creative) to read and write bloom filter data. The idea is, obviously, that the metatable stores information about other tables.

          Unfortunately, there is a huge problem that I need help with. I will try to explain it with an example. Let's say a ParserJob has 6 maps. We extended parse plugins so they can also use NutchContext objects. The problem is each map will update its OWN bloom filter and try to write its OWN bloom filter back to the metatable. This, of course, breaks the HostDb implementation as one map task overwrites another's bloom filter data. As a fix, I thought each task could write its own bloom filter to a temporary location using its task id. Once a job finishes, we can then read all tasks' filters and write out a single bloom filter using data from all of them. This is a very HACKISH solution though.

          What do you guys think? Any better solutions?

          Doğacan Güney added a comment -

          Here is an initial version. This doesn't have a bloom filter due to reasons I outlined in my previous comment. It adds a NutchContext object and adds NutchContext to a single method in ScoringFilters to demonstrate how it will look. I made a random change in OPICScoringFilter to demonstrate example usage.

          Andrzej Bialecki added a comment -

          Doğacan, I missed your previous comment... the issue with partial bloom filters is usually solved by having each task store its own filter - this worked well for MapFile-s because they consisted of multiple parts, so a Reader would open a part and a corresponding bloom filter.

          Here it's more complicated, I agree... though this reminds me of the situation that is handled by DynamicBloomFilter: it's basically a set of Bloom filters with a facade that hides this fact from the user. Here we could construct something similar, i.e. don't merge partial filters after closing the output, but instead when opening a Reader read all partial filters and pretend they are one.
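          The facade idea can be sketched like this. All names are hypothetical, and the toy filter only stands in for Hadoop's DynamicBloomFilter: each task builds its own partial filter, and at read time the facade reports a possible hit if any partial filter does, so the bit sets never have to be merged.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Sketch of "read all partial filters and pretend they are one":
// partial filters are registered as-is, and membership is the OR of
// the individual membership tests.
class CompositeBloomFilter {
    private final List<BitSet> parts = new ArrayList<>();
    private final int size;
    private final int hashes;

    CompositeBloomFilter(int size, int hashes) {
        this.size = size;
        this.hashes = hashes;
    }

    /** Build a partial filter over the keys one task saw. */
    public BitSet buildPart(Iterable<String> keys) {
        BitSet bits = new BitSet(size);
        for (String key : keys)
            for (int i = 0; i < hashes; i++)
                bits.set(bucket(key, i));
        return bits;
    }

    /** Register one task's partial filter without merging bit sets. */
    public void addPart(BitSet part) { parts.add(part); }

    /** A key might exist if ANY partial filter admits it. */
    public boolean mightContain(String key) {
        for (BitSet part : parts) {
            boolean hit = true;
            for (int i = 0; i < hashes; i++)
                if (!part.get(bucket(key, i))) { hit = false; break; }
            if (hit) return true;
        }
        return false;
    }

    private int bucket(String key, int i) {
        return Math.floorMod(key.hashCode() * 31 + i * 0x9E3779B9, size);
    }
}
```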

          Doğacan Güney added a comment -

          Thanks for the comments Andrzej. I think I can implement that solution as well, but first, I have a suggestion:

          I was thinking... Since we already store URLs in reverse-url form, they are ordered by host names. So instead of a bloom filter, we can write a hostdb that scans the host table for x rows starting from the requested host... This will probably make more sense with an example

          So, let's say we have URLs from domains a.com, b.com, c.com, etc... So webtable keys will look like this:

          com.a/....
          com.a.www/....
          com.a.www/....
          com.b/...
          com.b.pages/....
          com.b.pages/...
          ...
          ....
          .....
          etc...

          So, if we ask for, say, "com.b" from the host table, then we can run a scan (say, for a hundred rows) starting from com.b and cache all results. During MapReduce, the next host we request will almost certainly be from com.b.* or com.{c,d,e} etc, and thus it will be cached.

          The downside of this approach is quite obvious: Nutch will be reading a lot of hosts even if jobs do not need it (if you want to store host info for only 1 host per 100, this approach will read all hosts). Still, if we assume NutchContext (and thus host) will exist for most URLs, this should not be a problem.
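          A toy version of this scan-ahead reader, with a TreeMap standing in for the sorted host table (all names hypothetical):

```java
import java.util.Map;
import java.util.TreeMap;

// Sketch of the scan-ahead alternative: because host keys are stored
// in reversed form, a range scan starting at the requested key
// prefetches exactly the hosts that the sorted MapReduce input will
// most likely ask for next.
class ScanningHostReader {
    private final TreeMap<String, String> table;       // sorted host table stand-in
    private final Map<String, String> cache = new TreeMap<>();
    private final int scanRows;

    ScanningHostReader(TreeMap<String, String> table, int scanRows) {
        this.table = table;
        this.scanRows = scanRows;
    }

    public String get(String host) {
        String cached = cache.get(host);
        if (cached != null) return cached;
        // One scan of scanRows rows starting at the requested key; the
        // extra rows are cached for the (sorted) requests that follow.
        int n = 0;
        for (Map.Entry<String, String> e : table.tailMap(host).entrySet()) {
            cache.put(e.getKey(), e.getValue());
            if (++n >= scanRows) break;
        }
        return cache.get(host);
    }
}
```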

          What do you think?

          Mathijs Homminga added a comment -

          Last activity on this issue was more than a year ago. I'd like to get it rolling again.
          I suggest that I start with updating the patches to work with the latest nutchgora branch.

          Mathijs Homminga added a comment -

          Julien, did you make a start with "I'll start working on some code to populate/update the hostDB from an existing webtable"? If not, I'll start on that one too.

          Julien Nioche added a comment -

          nope, go ahead

          Mathijs Homminga added a comment -

          Status:
          I have updated the patches to match the current HEAD (nutchgora). Also added a HostDbUpdateJob which populates the host db from an existing web table (needed to fix an issue in GORA for this: https://issues.apache.org/jira/browse/GORA-105).

          I'm currently finishing some work on the NutchContext and will post the patch somewhere next week.

          Mathijs Homminga added a comment - edited

          Hi guys,

          I have second thoughts on implementing the NutchContext concept at this stage.

          All Nutch processes are centered around the concept of a WebPage. And I agree, many of these processes and their plugins might benefit from additional input which is related to, but not directly part of a WebPage. Like host statistics, metadata or domain information.

          The proposed NutchContext solution is elegant in the way that it makes this additional information available to plugins, in an extensible way.
          However, it indeed requires a big API break for plugins (since we don't use abstract base classes for all the plugins, we can't fix it there to keep them compatible).

          I'm afraid that a patch that tries to implement the Host table and the NutchContext at the same time will have a hard time making it into the repository.

          I propose to move the NutchContext approach to a new issue.
          Plugins and other components can still use Host information by using the HostDB class directly to perform efficient host lookups when needed. We can then decide later to make this part of the NutchContext.

          Agree?

          Lewis John McGibbney added a comment -

          Mathijs, my opinion is that you have a clean sheet of paper to begin with certain aspects of this one (simply because you've stepped up to take it on). You obviously have your own ideas about how you would like to see the new host table design, and also have justification behind the eventual implementation (and API break/redesign) of NutchContext. I think it's wise to think sensibly about NOT breaking the plugin API at this stage, and that an incremental approach to addressing this one is a suitable strategy. Feel free to open another issue for the NutchContext, as quite rightly this appears to have now morphed into its own subdomain of the umbrella issue.

          Patrick Hennig added a comment -

          Hi,

          you wrote that you have updated the patches for the Host table to populate it from the webpages.

          But the files are from August and September. Where can I find your updated patch?

          Ferdy Galema added a comment -

          Hey Patrick,

          We are currently finishing the work for this issue. There is still one minor issue that is not fully working yet (namely, host inlinks/outlinks are not populated), but we are still trying to make that work. If this does not succeed in a few days, we will submit the patches anyhow.

          Thanks for your interest.

          Ferdy Galema added a comment -

          New version of the patch. (On behalf of Mathijs I am finishing this issue. Nevertheless, he has done much of the hard work!)

          Building hostdb links (inlinks and outlinks at the host level) works now too. Use:
          org.apache.nutch.host.HostDbUpdateJob -linkDb
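
          For illustration, the job would typically be launched through the Nutch runtime script; the `bin/nutch` wrapper style shown here is an assumption (only the class name and the `-linkDb` flag come from the comment above):

          ```shell
          # Sketch: build host-level inlinks/outlinks from the webtable.
          # Run from the Nutch runtime directory; invoking the job via
          # 'bin/nutch' with a fully qualified class name is an assumed
          # invocation style, not confirmed by this issue.
          bin/nutch org.apache.nutch.host.HostDbUpdateJob -linkDb
          ```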

          This patch adds Host store definitions to the gora mapping for HBase only. (Other stores can be added easily later on.) It needs GORA-105, so you can only use the added functionality when using a trunk version of Gora, or wait until nutchgora updates to Gora 0.2 (should be soon).
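
          As an illustration of what such a store definition looks like, a Host entry in `conf/gora-hbase-mapping.xml` could be sketched roughly as below. The table, column-family, and field names here are assumptions for illustration only; the committed mapping may differ:

          ```xml
          <!-- Hypothetical sketch of a Host store definition for the
               gora-hbase mapping; names are illustrative assumptions. -->
          <gora-orm>
            <table name="host">
              <family name="mtdt" maxVersions="1"/>
            </table>
            <class table="host" keyClass="java.lang.String"
                   name="org.apache.nutch.storage.Host">
              <field name="metadata" family="mtdt"/>
            </class>
          </gora-orm>
          ```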

          No tests are included yet. For now this is okay, because by default this patch does not change existing functionality. (It's also a bit of a pain to add tests, because the current tests depend on a valid SQLStore, but updating Gora results in a dropped SQLStore, so there is an issue that needs to be solved first. In another issue, that is.)

          Will commit this in a few days.

          Ferdy Galema added a comment -

          Committed. I realize that the current state is far from finished; however, I figured it is enough to close off this longstanding issue. This makes room for people to easily play around with it and make improvements where necessary (adding definitions for other stores, new features such as storing stats, etc.).

          I'll leave the final closing to Julien, since he is the original reporter.

          Please let me know if any of you disagree.

          Julien Nioche added a comment -

          Ferdy, I'll let you close it. I don't have time to give nutchgora a try and can't confirm that the patch does what it is supposed to do. Thanks

          Ferdy Galema added a comment -

          Ok.

          Thanks to anyone who was involved.

          Hudson added a comment -

          Integrated in Nutch-nutchgora #240 (See https://builds.apache.org/job/Nutch-nutchgora/240/)
          NUTCH-882 Design a Host table in GORA (Revision 1330728)

          Result = SUCCESS
          ferdy :
          Files :

          • /nutch/branches/nutchgora/CHANGES.txt
          • /nutch/branches/nutchgora/build.xml
          • /nutch/branches/nutchgora/conf/gora-hbase-mapping.xml
          • /nutch/branches/nutchgora/default.properties
          • /nutch/branches/nutchgora/ivy/ivy.xml
          • /nutch/branches/nutchgora/src/gora/host.avsc
          • /nutch/branches/nutchgora/src/java/org/apache/nutch/fetcher/FetcherReducer.java
          • /nutch/branches/nutchgora/src/java/org/apache/nutch/host
          • /nutch/branches/nutchgora/src/java/org/apache/nutch/host/HostDb.java
          • /nutch/branches/nutchgora/src/java/org/apache/nutch/host/HostDbReader.java
          • /nutch/branches/nutchgora/src/java/org/apache/nutch/host/HostDbUpdateJob.java
          • /nutch/branches/nutchgora/src/java/org/apache/nutch/host/HostDbUpdateReducer.java
          • /nutch/branches/nutchgora/src/java/org/apache/nutch/host/HostInjectorJob.java
          • /nutch/branches/nutchgora/src/java/org/apache/nutch/indexer/IndexerReducer.java
          • /nutch/branches/nutchgora/src/java/org/apache/nutch/storage/Host.java
          • /nutch/branches/nutchgora/src/java/org/apache/nutch/storage/StorageUtils.java
          • /nutch/branches/nutchgora/src/java/org/apache/nutch/storage/WebTableCreator.java
          • /nutch/branches/nutchgora/src/java/org/apache/nutch/util/Histogram.java
          • /nutch/branches/nutchgora/src/java/org/apache/nutch/util/TableUtil.java
          • /nutch/branches/nutchgora/src/java/org/apache/nutch/util/domain/DomainStatistics.java

            People

            • Assignee: Unassigned
            • Reporter: Julien Nioche
            • Votes: 0
            • Watchers: 3