Nutch / NUTCH-1300

Indexer to filter and normalize URLs

    Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.6
    • Component/s: indexer
    • Labels: None
    • Patch Info: Patch Available

      Description

      Indexers should be able to normalize URLs. This is useful when a new normalizer is applied to the entire CrawlDB. Without index-time normalization, some or all records in a segment cannot be indexed at all.
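
The problem described above can be sketched in a few lines of plain Java. This is only an illustration under stated assumptions, not the code from the patch: the class name, the sample URLs, and the single "strip a trailing index.html" rule are all hypothetical, and the two maps merely stand in for a CrawlDB that has already been re-normalized and a segment that was fetched before the new rule existed.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.UnaryOperator;

// Toy model of the URL mismatch between a re-normalized CrawlDB and an older segment.
class IndexTimeNormalization {

    // Hypothetical normalizer rule (assumption): drop a trailing "index.html".
    static String normalize(String url) {
        return url.endsWith("/index.html")
            ? url.substring(0, url.length() - "index.html".length())
            : url;
    }

    // CrawlDB keys after the new normalizer was applied to the whole database.
    static Map<String, String> sampleCrawlDb() {
        Map<String, String> db = new LinkedHashMap<>();
        db.put("http://example.org/", "db_fetched");
        db.put("http://example.org/about.html", "db_fetched");
        return db;
    }

    // Segment keys as they were written before the new rule existed.
    static Map<String, String> sampleSegment() {
        Map<String, String> segment = new LinkedHashMap<>();
        segment.put("http://example.org/index.html", "content of the home page");
        segment.put("http://example.org/about.html", "content of the about page");
        return segment;
    }

    // Count segment records that find a CrawlDB entry once keyFn is applied to their URL.
    static int countIndexable(Map<String, String> crawlDb,
                              Map<String, String> segment,
                              UnaryOperator<String> keyFn) {
        int indexable = 0;
        for (String url : segment.keySet()) {
            if (crawlDb.containsKey(keyFn.apply(url))) {
                indexable++;
            }
        }
        return indexable;
    }

    public static void main(String[] args) {
        Map<String, String> db = sampleCrawlDb();
        Map<String, String> segment = sampleSegment();
        // Segment URLs used as-is: the old home page URL has no CrawlDB match.
        System.out.println("without normalization: "
            + countIndexable(db, segment, UnaryOperator.identity()) + " of " + segment.size());
        // Segment URLs normalized at index time: both records match again.
        System.out.println("with normalization: "
            + countIndexable(db, segment, IndexTimeNormalization::normalize) + " of " + segment.size());
    }
}
```

In this toy setup one of the two segment records is lost when the segment URLs are used unchanged, and both are matched once the same rule is applied at index time.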

          Activity

          Markus Jelsma added a comment -

          Renamed issue for clarity.

          Hudson added a comment -

          Integrated in Nutch-trunk #1869 (See https://builds.apache.org/job/Nutch-trunk/1869/)
          NUTCH-1300 Indexer to filter normalize URL's (Revision 1349262)

          Result = SUCCESS
          markus : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1349262
          Files :

          • /nutch/trunk/CHANGES.txt
          • /nutch/trunk/src/java/org/apache/nutch/indexer/IndexerMapReduce.java
          • /nutch/trunk/src/java/org/apache/nutch/indexer/solr/SolrIndexer.java
          • /nutch/trunk/src/java/org/apache/nutch/net/URLNormalizers.java
          Hudson added a comment -

          Integrated in nutch-trunk-maven #310 (See https://builds.apache.org/job/nutch-trunk-maven/310/)
          NUTCH-1300 Indexer to filter normalize URL's (Revision 1349262)

          Result = SUCCESS
          markus :
          Files :

          • /nutch/trunk/CHANGES.txt
          • /nutch/trunk/src/java/org/apache/nutch/indexer/IndexerMapReduce.java
          • /nutch/trunk/src/java/org/apache/nutch/indexer/solr/SolrIndexer.java
          • /nutch/trunk/src/java/org/apache/nutch/net/URLNormalizers.java
          Markus Jelsma added a comment -

          Committed for 1.6 in rev. 1349262.

          The -filter and -normalize options are now available and a new scope SCOPE_NORMALIZE was added. Thanks Sebastian and Lewis.

          Markus Jelsma added a comment -

          Sure! I'll add a command line option and update the tool description on the wiki. Will upload the patch and commit when trunk is 1.6.

          Lewis John McGibbney added a comment -

          Hi Markus. Before commenting on NUTCH-1323, I would also agree with Sebastian w.r.t. command-line options. As with NUTCH-1139, there was a clear-cut decision made to support command-line options, so this patch would also need them added to work correctly? Additionally, we know that many people find this convenient and comprehensive, especially if they are documented well on the wiki :0) Apart from this I'm also +1

          Markus Jelsma added a comment -

          20120304-push-1.6

          Markus Jelsma added a comment -

          I think a scope "index" makes sense. It would make building a two-way normalizer a bit easier. Command-line options can be added, but you can use the -D option as well.

          Sebastian Nagel added a comment -

          +1

          • effective fix for a serious problem: long-running continuous crawls require adjustments of the normalization rules quite often
          • tested (with 1.4): costs (time spent for extra normalization) are ok compared to the benefit

          Two suggestions:

          1. Does a URLNormalizer scope "index" make sense? E.g., if only outlinks are normalized and default rules are empty, the scope "index" may use the same rules as scope "outlink".
          2. Wouldn't command-line options for solrindex be nice? Most other tools (generate, updatedb, invertlinks) have options such as -filter / -norm / -noNorm.
          Markus Jelsma added a comment -

          Patch for 1.5.


            People

            • Assignee: Markus Jelsma
            • Reporter: Markus Jelsma
            • Votes: 0
            • Watchers: 2
