Details

    • New Feature
    • Status: Closed
    • Major
    • Resolution: Fixed
    • 1.0.0
    • 1.0.0
    • None
    • None
    • All

    • Patch Available

    Description

      This is a basic PageRank-type link analysis tool for Nutch that simulates a sparse matrix using inlinks and outlinks and converges after a given number of iterations. This tool is meant to replace the current scoring system in Nutch with one that converges instead of exponentially increasing scores. It also includes a tool to create an outlinkdb.
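      To make the idea concrete, here is a minimal in-memory sketch of an iterated link score that settles instead of growing. It is not the code from the patch: the class and method names are hypothetical, the graph is a plain Map rather than Nutch's inlink/outlink databases, the damping factor of 0.85 is an assumption, and the update formula is the one quoted later in this thread.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not taken from the NUTCH-635 patch.
final class LinkScoreSketch {

  /** Applies score = (1 - d) + d * (sum of inlink contributions) a fixed number of times. */
  static Map<String, Double> iterate(Map<String, List<String>> outlinks,
                                     int iterations, double damping) {
    Map<String, Double> scores = new HashMap<>();
    for (String url : outlinks.keySet()) {
      scores.put(url, 1.0); // every page starts with the same score
    }
    for (int i = 0; i < iterations; i++) {
      // Invert the outlinks: each page passes score / outlinkCount to each target.
      Map<String, Double> inlinkTotals = new HashMap<>();
      for (Map.Entry<String, List<String>> e : outlinks.entrySet()) {
        List<String> targets = e.getValue();
        if (targets.isEmpty()) {
          continue; // a page without outlinks contributes nothing in this sketch
        }
        double share = scores.get(e.getKey()) / targets.size();
        for (String to : targets) {
          inlinkTotals.merge(to, share, Double::sum);
        }
      }
      Map<String, Double> next = new HashMap<>();
      for (String url : outlinks.keySet()) {
        next.put(url, (1 - damping) + damping * inlinkTotals.getOrDefault(url, 0.0));
      }
      scores = next;
    }
    return scores;
  }

  public static void main(String[] args) {
    Map<String, List<String>> graph = new HashMap<>();
    graph.put("a", Arrays.asList("b", "c"));
    graph.put("b", Arrays.asList("c"));
    graph.put("c", Arrays.asList("a"));
    System.out.println(iterate(graph, 10, 0.85));
    System.out.println(iterate(graph, 50, 0.85));
  }
}
```

      Because each pass multiplies the previous scores by the damping factor before adding a constant, the difference between successive passes shrinks geometrically, which is the convergence behaviour the description refers to.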

      Attachments

        1. NUTCH-635-1-20080612.patch
          35 kB
          Dennis Kubes
        2. NUTCH-635-2-20080613.patch
          47 kB
          Dennis Kubes
        3. NUTCH-635-3-20080614.patch
          41 kB
          Dennis Kubes
        4. NUTCH-635-4-20080615.patch
          45 kB
          Dennis Kubes
        5. NUTCH-635-5-20080620.patch
          63 kB
          Dennis Kubes
        6. NUTCH-635-6-20080725.patch
          182 kB
          Dennis Kubes
        7. NUTCH-635-7-20080808.patch
          225 kB
          Dennis Kubes
        8. NUTCH-635-9-20081126.patch
          141 kB
          Dennis Kubes

        Issue Links

          Activity

            musepwizard Dennis Kubes added a comment -

            Basic patch. It doesn't include unit tests, but it has been tested. Includes the LinkAnalysis tool and the Outlink tool. Still needs to handle cases such as teleportation and rank sinks, but here it is as a first pass for people to see.

            musepwizard Dennis Kubes added a comment -

            Updated patch. Contains a score updater for the crawl db and a scoring filter to work with the link analysis tool. Updated the LinkAnalysis tool to handle reciprocal links, links from the same domain/subdomains, rank sinks, and link loops. Also included a display tool to view inlinks/outlinks and scores for a given URL. Should be ready for large-scale testing. Tested on a dataset of 25K pages and the results were promising.


            ab Andrzej Bialecki added a comment -

            This patch looks great! A few comments:

            • in OutlinksDb.reduce() you use a simple assignment mostRecent = next. This doesn't work as expected, because Hadoop iterator reuses the same single instance of Outlinks under the hood, so if you keep a reference to it its value will mysteriously change under your feet as you call values.next(). This should be replaced with a deep copy (or clone) of the instance, either through a dedicated method of Outlinks or WritableUtils.copy().
            • you should avoid spurious whitespace changes to existing classes, this makes the reading more difficult ... (e.g. Outlink.java)
            • in Outlinks.write() I think there's a bug - you write out System.currentTimeMillis() instead of this.timestamp, is this intentional?
            • in LinkAnalysis.Counter.map() , since you output static values, you should avoid creating new instances and use a pair of static instances.
            • by the way, in an implementation of similar algo I used Hadoop Counters to count the totals, this way you avoid storing magic numbers in the db itself (although you still need to preserve them somewhere, so I'd create an additional file with this value ... well, perhaps not so elegant either after all ).
            • LinkAnalysis.Analyzer.reduce() - you should retrieve config parameters in configure(Job), otherwise you pay the price of getting floats from Configuration (which involves repeated creation of Float via Float.parseFloat()). Also, HashPartitioner should be created once. Well, this is a general comment to this patch - it creates a lot of objects unnecessarily. We can optimize it now or later, whatever you prefer.

            I didn't go into the algorithm itself yet to give any useful comments ... But I have a dataset of ~4mln pages I can test it on.

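            The first point in the review above (keeping a reference to a value that the iterator reuses) can be reproduced without Hadoop. The following is a stand-alone simulation under stated assumptions: Stamp is a hypothetical stand-in for the Outlinks writable, and reusingIterator imitates the instance reuse the reviewer describes. No Hadoop classes are used and none of these names come from the patch.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Stand-alone simulation of the aliasing pitfall; all names are hypothetical.
final class ReusedValueSketch {

  /** Mutable value standing in for a writable such as Outlinks. */
  static final class Stamp {
    long timestamp;

    Stamp copy() {
      Stamp s = new Stamp();
      s.timestamp = timestamp;
      return s;
    }
  }

  /** Iterator that refills one shared instance on every next() call. */
  static Iterator<Stamp> reusingIterator(List<Long> timestamps) {
    final Stamp shared = new Stamp();
    final Iterator<Long> it = timestamps.iterator();
    return new Iterator<Stamp>() {
      @Override public boolean hasNext() { return it.hasNext(); }
      @Override public Stamp next() {
        shared.timestamp = it.next();
        return shared;
      }
    };
  }

  /** Buggy: stores a reference to the shared instance (assumes a non-empty input). */
  static long mostRecentByReference(Iterator<Stamp> values) {
    Stamp mostRecent = null;
    while (values.hasNext()) {
      Stamp next = values.next();
      if (mostRecent == null || next.timestamp > mostRecent.timestamp) {
        mostRecent = next; // aliasing: later next() calls overwrite this value
      }
    }
    return mostRecent.timestamp;
  }

  /** Fixed: copies the value before keeping it (assumes a non-empty input). */
  static long mostRecentByCopy(Iterator<Stamp> values) {
    Stamp mostRecent = null;
    while (values.hasNext()) {
      Stamp next = values.next();
      if (mostRecent == null || next.timestamp > mostRecent.timestamp) {
        mostRecent = next.copy();
      }
    }
    return mostRecent.timestamp;
  }

  public static void main(String[] args) {
    List<Long> input = Arrays.asList(3L, 9L, 5L);
    System.out.println(mostRecentByReference(reusingIterator(input)));
    System.out.println(mostRecentByCopy(reusingIterator(input)));
  }
}
```

            With the timestamps 3, 9, 5 the reference-keeping version returns 5 (whatever the iterator produced last), while the copying version returns 9.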
            musepwizard Dennis Kubes added a comment -

            Andrzej Bialecki wrote:

            • in OutlinksDb.reduce() you use a simple assignment mostRecent = next. This doesn't work as expected, because Hadoop iterator reuses the same single instance of Outlinks under the hood, so if you keep a reference to it its value will mysteriously change under your feet as you call values.next(). This should be replaced with a deep copy (or clone) of the instance, either through a dedicated method of Outlinks or WritableUtils.copy().

            Fixed this. Thanks. I knew it happened for writables but wasn't aware that it was implemented the same way in the iterators.

            • you should avoid spurious whitespace changes to existing classes, this makes the reading more difficult ... (e.g. Outlink.java)

            That was a mistake, fixed it.

            • in Outlinks.write() I think there's a bug - you write out System.currentTimeMillis() instead of this.timestamp, is this intentional?

            Nope, that was a bug from an earlier version of it. Fixed.

            • in LinkAnalysis.Counter.map() , since you output static values, you should avoid creating new instances and use a pair of static instances.
            • by the way, in an implementation of similar algo I used Hadoop Counters to count the totals, this way you avoid storing magic numbers in the db itself (although you still need to preserve them somewhere, so I'd create an additional file with this value ... well, perhaps not so elegant either after all ).

            This is really just a temp file. I count the URLs, put the count into a file using a single reduce task, then read it back in the update method of LinkAnalysis and pass it into the jobs through the conf. Once it is read, I delete the file.

            • LinkAnalysis.Analyzer.reduce() - you should retrieve config parameters in configure(Job), otherwise you pay the price of getting floats from Configuration (which involves repeated creation of Float via Float.parseFloat()). Also, HashPartitioner should be created once. Well, this is a general comment to this patch - it creates a lot of objects unnecessarily. We can optimize it now or later, whatever you prefer.

            I think a bit of both. I fixed the HashPartitioner one. My intention with this first version is to get a workable tool that converges the scores and to provide workarounds for the common types of link spam, such as reciprocal links and link farms / tightly knit communities. Once it is working we can always optimize the speed later. That being said, the current version is faster than I thought it would be. The current patch does converge, and it handles reciprocal links and some cases of link farms, but it is currently being overinfluenced by link loops of three or more sites. Once I have that taken care of I will post a new patch.
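            The reviewer's point about reading config parameters once, rather than on every reduce() call, can be sketched as follows. This is an illustration under assumptions, not the patch code: a plain Map stands in for the Hadoop configuration, the property name sketch.damping.factor is made up, and the parseCalls counter exists only to make the difference visible.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of caching a parsed setting once per task.
final class ConfigureOnceSketch {
  private final Map<String, String> conf;
  private int parseCalls;      // instrumentation for the demo only
  private float dampingFactor; // cached by configure()

  ConfigureOnceSketch(Map<String, String> conf) {
    this.conf = conf;
  }

  private float readDamping() {
    parseCalls++;
    return Float.parseFloat(conf.getOrDefault("sketch.damping.factor", "0.85"));
  }

  /** Intended to be called once, before any records are processed. */
  void configure() {
    dampingFactor = readDamping();
  }

  /** Per-record work that re-parses the setting on every call. */
  float scoreReparsing(float totalInlinkScore) {
    float d = readDamping();
    return (1 - d) + d * totalInlinkScore;
  }

  /** Per-record work that uses the cached field. */
  float scoreCached(float totalInlinkScore) {
    return (1 - dampingFactor) + dampingFactor * totalInlinkScore;
  }

  int parseCalls() {
    return parseCalls;
  }

  public static void main(String[] args) {
    ConfigureOnceSketch task = new ConfigureOnceSketch(new HashMap<>());
    task.configure();
    for (int i = 0; i < 1000; i++) {
      task.scoreCached(1.0f);
    }
    System.out.println("parses with caching: " + task.parseCalls());
  }
}
```

            Both methods compute the same score; the cached version simply moves the string parsing out of the per-record path.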

            musepwizard Dennis Kubes added a comment -

            Stable patch that fixes some of the issues commented on and mentioned previously. This patch converges well on a dataset of over 100K pages and handles reciprocal linking. As of yet link farms don't seem to be a problem but we shall see.


            ab Andrzej Bialecki added a comment -

            One more question: you said the algorithm converges, but do you have a reference set of values from this dataset, calculated using some other pagerank impl? It would be worthwhile to make sure that the values are indeed the PageRank, as described, and not yet another subtle variation such as our OPIC.

            There are a few Java packages for computing PageRank, we could adapt one of those to serve as a baseline:

            http://law.dsi.unimi.it/
            http://webla.sourceforge.net/javadocs/pt/tumba/links/PageRank.html

            musepwizard Dennis Kubes added a comment -

            Adds normalization for many links from a single domain and a penalty threshold for very few inlinks. Also adds the ability to alter the boost into the index to compensate for front end query boosts.

            musepwizard Dennis Kubes added a comment -

            Andrzej Bialecki wrote:

            > One more question: you said the algorithm converges, but do you have a reference set of values from this dataset, calculated using some other pagerank impl? It would be worthwhile to make sure that the values are indeed the PageRank, as described, and not yet another subtle variation such as our OPIC

            I was doing it low tech: turn on the debug logging (warning, it is a large output) and use grep, and you can see the scores converge after a few iterations.

            > There are a few Java packages for computing PageRank, we could adapt one of those to serve as a baseline:
            >
            > http://law.dsi.unimi.it/
            > http://webla.sourceforge.net/javadocs/pt/tumba/links/PageRank.html

            I agree it would be a good comparison. Strictly speaking, though, it is not just PageRank. There are optimizations for multiple links from a given domain, penalties for very few inlinks, and a minimum score value, all of which can be changed through the configuration. Beyond that, it does follow the original PageRank algorithm closely.
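            As an alternative to grepping debug logs, convergence can be checked numerically by comparing the scores from two successive iterations. The sketch below is an assumption about how one might do that, not something in the patch: the class name, method names, and tolerance parameter are hypothetical. The same comparison could in principle be pointed at a baseline set of scores from another implementation, as suggested earlier in the thread.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper for comparing two score snapshots keyed by URL.
final class ConvergenceSketch {

  /** Largest absolute per-URL change; URLs missing from 'previous' count as 0.0. */
  static double maxDelta(Map<String, Double> previous, Map<String, Double> current) {
    double max = 0.0;
    for (Map.Entry<String, Double> e : current.entrySet()) {
      double before = previous.getOrDefault(e.getKey(), 0.0);
      max = Math.max(max, Math.abs(e.getValue() - before));
    }
    return max;
  }

  /** True when no score moved by 'tolerance' or more between the two snapshots. */
  static boolean converged(Map<String, Double> previous, Map<String, Double> current,
                           double tolerance) {
    return maxDelta(previous, current) < tolerance;
  }

  public static void main(String[] args) {
    Map<String, Double> previous = new HashMap<>();
    previous.put("http://example.org/a", 1.00);
    previous.put("http://example.org/b", 2.00);
    Map<String, Double> current = new HashMap<>();
    current.put("http://example.org/a", 1.05);
    current.put("http://example.org/b", 1.80);
    System.out.println(maxDelta(previous, current));
    System.out.println(converged(previous, current, 0.5));
  }
}
```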

            musepwizard Dennis Kubes added a comment -

            Refactored patch that removes network calls using MapFile.Readers and better simulates a row matrix through inverting and merging inlink scores. This patch works in the general sort-merge-process structure of MapReduce and as such should be significantly faster. The previous jobs were taking far too long to process on a large dataset. This patch includes the link analysis tool, a tool for updating the crawl db with a new score and clearing scores of URLs with no score, an outlink database tool, a new inlink database tool that will keep inlinks consistent with outlinks, and a new scoring plugin which replaces the opic plugin.

            The order of tool runs should now be: Inject, Generate, Fetch, UpdateDb, OutlinkDb, InlinkDb, LinkAnalysis, ScoreUpdater, Indexer

            musepwizard Dennis Kubes added a comment -

            Finished link analysis and indexer framework along with tools.


            ab Andrzej Bialecki added a comment -

            A few comments to the latest patch:

            • some crucial javadoc is missing, such as the comments on class level (at least), especially if they are cmd-line utilities or classes that support a major functionality.
            • perhaps we don't need a separate Node db, this information can be added directly to the CrawlDb, which could save us the trouble with running the ScoreUpdater.
            • minor thing, but in many classes you use a repeating pattern of creating instances of List, HashSet, ObjWritable, etc, etc inside the map()/reduce() methods, while they should be created once and reused.
            • LinkDatum:
              • linkType should be byte, not int - this saves 3 bytes on each entry.
            • LinkRank:
              • I wonder if we couldn't skip the Counter job, and instead collect the total number of links via Hadoop job counters. I.e. define counters in Mapper/Reducer of the analysis job, and then after the job is done you can retrieve them from a RunningJob instance. We could then maintain this value on each update of the db in a well-known location, as you do this already, except we could skip this additional runCounter(..) job ...
            • Loops:
              • Loops.Route.readFields(): I think it's better to use Text.readString() instead of DataInput.readUTF(). Or for that matter, replace the plain Strings with Text, since many times in other places in Loops you need to create a Text object anyway, out of one of Route's fields.
            • LinkUpdater:
              • I don't understand why clearScore is set to 0.00001f. What's with the magic number?
            • ReprUrlFixer should go into tools.compat
            • ResolveUrls uses ReprUrlFixer log, it should use its own. Besides, this tool is not relevant to this patch, so I think it should be submitted separately.
            • the new indexing framework: I like the added flexibility, but the cost for that seems high. Previously we only had to run a single map-red job to create an index, now we have to run at least 6 jobs, each with a large dataset. I vote for splitting the patch and creating a separate issue for this framework, so that we can discuss it further.
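            The "saves 3 bytes on each entry" remark about linkType can be checked with the standard java.io classes alone. The sketch below only measures how many bytes DataOutputStream emits for an int versus a byte; the class and method names are hypothetical, and it does not reproduce LinkDatum's actual serialization.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical measurement of serialized field width; not LinkDatum itself.
final class LinkTypeSizeSketch {

  /** Number of bytes written when the link type is serialized as an int. */
  static int sizeWithInt(int linkType) {
    ByteArrayOutputStream buf = new ByteArrayOutputStream();
    try (DataOutputStream out = new DataOutputStream(buf)) {
      out.writeInt(linkType);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return buf.size();
  }

  /** Number of bytes written when the link type is serialized as a byte. */
  static int sizeWithByte(byte linkType) {
    ByteArrayOutputStream buf = new ByteArrayOutputStream();
    try (DataOutputStream out = new DataOutputStream(buf)) {
      out.writeByte(linkType);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return buf.size();
  }

  public static void main(String[] args) {
    System.out.println("int:  " + sizeWithInt(1) + " bytes");
    System.out.println("byte: " + sizeWithByte((byte) 1) + " bytes");
  }
}
```

            An int is written as 4 bytes and a byte as 1, so a field that only distinguishes a handful of link types costs 3 fewer bytes per entry when stored as a byte.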
            musepwizard Dennis Kubes added a comment -

            > some crucial javadoc is missing, such as the comments on class level (at least), especially if they are cmd-line utilities or classes that support a major functionality.

            Yup. Still going through and doing the javadoc. I will have all that done before any final commit.

            > perhaps we don't need a separate Node db, this information can be added directly to the CrawlDb, which could save us the trouble with running the ScoreUpdater.

            The crawldb can get huge, as you know. The information could be stored in the crawldb, but then we would be stuck using the crawldb everywhere we currently use the nodedb, which is a lot of places in both the analysis and the indexing. The current approach works much faster and lets us see scores and the number of links per URL at a glance using the NodeReader tool.

            > linkType should be byte, not int - this saves 3 bytes on each entry.

            Done

            > Loops.Route.readFields(): I think it's better to use Text.readString() instead of DataInput.readUTF(). Or for that matter, replace the plain Strings with Text, since many times in other places in Loops you need to create a Text object anyway, out of one of Route's fields.

            This has been fixed in a more recent patch

            > I don't understand why clearScore is set to 0.00001f. What's with the magic number?

            Leftovers. It is now set to 0.0f.

            > ReprUrlFixer should go into tools.compat

            Done

            > the new indexing framework: I like the added flexibility, but the cost for that seems high. Previously we only had to run a single map-red job to create an index, now we have to run at least 6 jobs, each with a large dataset. I vote for splitting the patch and creating a separate issue for this framework, so that we can discuss it further.

            I agree. This patch is getting big and the indexing stuff should go into a separate issue. I will create one. Also I have reworked the indexer to allow for field filters. I will post the new patch on the new issue.

            I agree that it is more jobs, but I don't see a way around that. The new analysis is also more jobs. I am not afraid of running more jobs on the system, as that can be automated. I am afraid of not having the flexibility that I need and the ability to apply a type of analysis. The current indexer locks in the databases that can be used, and we need more flexibility than that, not just in what is indexed but also in how. With this approach we can create fields from any MR job and then integrate and index all of those fields. New fields and analysis scores can be added without changing the indexing code. The newer patch also creates an extension point for field filters that allow manipulation of the fields and document in the index once the fields are aggregated together. This allows a great deal of flexibility in indexing fields, aggregating and manipulating document boosts, and taking other actions such as blacklisting. Again, I will post the new patch soon.

            musepwizard Dennis Kubes added a comment -

            Final patch, includes comments, change suggestions, the new scoring and link analysis tools, and the new indexing framework.

            musepwizard Dennis Kubes added a comment -

            Final patch. Includes comment and code change suggestions. Includes new scoring, link analysis, and indexing frameworks and tools.


            ab Andrzej Bialecki added a comment -

            Dennis, please split this patch into the link analysis and indexing parts, and move the part related to the new indexing framework to a separate issue, so that we deal only with the link analysis patch here. Thank you!

            musepwizard Dennis Kubes added a comment -

            Breaks out the new indexing framework into its own patch, NUTCH-646. Moves the ResolveURLs tool into its own patch. Makes the patch Java 5 compatible.

            dogacan Dogacan Guney added a comment -

            I have skimmed through the last patches in this issue and NUTCH-646, but I am confused. Are the patches swapped? This one seems to be about indexing, while NUTCH-646 has loops, link analysis, and web graphs.

            musepwizard Dennis Kubes added a comment -

            Ooops, yeah crud. I must have switched them. I have a little cleanup to do on the indexing one, then I will repost. Sorry about that.

            dogacan Dogacan Guney added a comment -

            Sorry for the late review....

            Patch looks great, and since this is very self-contained I see no reason why we should not commit this immediately.

            Some notes:

            • Can we also commit a small (5K-6K nodes maybe) test graph, so that future changes can be tested against it?
            • There are many WritableUtils.clone calls in the code. I don't think that they are necessary.
            • Instead of ObjectWritable, I would suggest using NutchWritable. NutchWritable is lighter.
            • There are a couple of new warnings. Mostly with unused JobConf-s and with OptionBuilder.
            • It may be a good idea to create some plugins for the webgraph package to give users some control over which outlinks they want to filter and which to keep (obviously for later)
            • Can you explain your score formula?
                  // calculate linkRank score formula
                  float linkRankScore = (1 - this.dampingFactor)
                    + (this.dampingFactor * totalInlinkScore);
            
            

            I may be mistaken, but you only seem to have the use case where the random surfer clicks a link on a page and not the he-types-a-new-url-to-start-over use case. Also, why do you add 0.15 (as default value) to every score?
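            One way to read the formula quoted above, offered as an interpretation rather than the author's confirmed answer: assume totalInlinkScore is the sum, over the pages q linking to p, of each page's score divided by its outlink count. Writing d for dampingFactor, and introducing s for the score, out(q) for the outlink count, N for the number of pages, and r = s/N (all symbols chosen here for illustration), the update and its normalized counterpart are:

```latex
s(p) = (1 - d) + d \sum_{q \to p} \frac{s(q)}{\mathrm{out}(q)}
\qquad\Longrightarrow\qquad
r(p) = \frac{s(p)}{N} = \frac{1 - d}{N} + d \sum_{q \to p} \frac{r(q)}{\mathrm{out}(q)}
```

            In the normalized form, (1 - d)/N is the probability of arriving at p by starting over at a random page, so under this reading the constant 1 - d (0.15 when d = 0.85) is the start-over case scaled by N. When every page has at least one outlink, the converged s values sum to N instead of 1.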

            musepwizard Dennis Kubes added a comment -

            Updated final patch for new link analysis framework. I am also going to write up some documentation on the wiki for how this new process works.

            musepwizard Dennis Kubes added a comment -

            Committed with revision 723441

            hudson Hudson added a comment -

            Integrated in Nutch-trunk #667 (See http://hudson.zones.apache.org/hudson/job/Nutch-trunk/667/ )

            People

              musepwizard Dennis Kubes
              musepwizard Dennis Kubes
              Votes: 4
              Watchers: 2
