NUTCH-1300: Indexer to filter and normalize URL's

    Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.6
    • Component/s: indexer
    • Labels: None
    • Patch Info: Patch Available

      Description

      Indexers should be able to normalize URLs. This is useful when a new normalizer is applied to the entire CrawlDB. Without it, some or all records in a segment cannot be indexed at all.
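
      As a rough illustration of the idea (not the code from the attached patch), the sketch below normalizes and then filters a URL before it is used as an index key, and skips the record when either step rejects it. The class `IndexUrlSketch`, its methods, and the specific rules (lowercase scheme and host, resolve dot segments, strip the fragment, accept only http/https) are assumptions made for this example; Nutch's real normalizers and filters are configurable plugins.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;

public class IndexUrlSketch {

    /** Hypothetical normalizer: lowercase scheme and host, resolve dot segments, drop the fragment. */
    public static String normalize(String url) {
        try {
            URI u = new URI(url).normalize();
            if (u.getScheme() == null || u.getHost() == null) {
                return null; // not an absolute, host-based URL
            }
            String scheme = u.getScheme().toLowerCase(Locale.ROOT);
            String host = u.getHost().toLowerCase(Locale.ROOT);
            // Rebuild the URL without the fragment (last argument is null).
            return new URI(scheme, u.getUserInfo(), host, u.getPort(),
                    u.getPath(), u.getQuery(), null).toString();
        } catch (URISyntaxException e) {
            return null; // unparsable URLs are rejected
        }
    }

    /** Hypothetical filter: accept only http and https URLs; null means "rejected". */
    public static String filter(String url) {
        return (url.startsWith("http://") || url.startsWith("https://")) ? url : null;
    }

    /** Returns the key to index under, or null if the record should be skipped. */
    public static String prepareKey(String url, boolean doNormalize, boolean doFilter) {
        String key = url;
        if (doNormalize) {
            key = normalize(key);
        }
        if (key != null && doFilter) {
            key = filter(key);
        }
        return key;
    }

    public static void main(String[] args) {
        System.out.println(prepareKey("HTTP://Example.COM/a/../b#top", true, true));
        System.out.println(prepareKey("ftp://example.com/file", true, true));
    }
}
```

      Passing `false` for both flags returns the URL unchanged; a null result means the record would be skipped rather than indexed.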

          Activity

          Markus Jelsma created issue
          Markus Jelsma added a comment:

          Patch for 1.5.

          Markus Jelsma made changes:
          Attachment: NUTCH-1300-1.5-1.patch [ 12517385 ]
          Sebastian Nagel added a comment:

          +1

          • effective fix for a serious problem: long-running continuous crawls require adjustments of the normalization rules quite often
          • tested (with 1.4): the cost (time spent on extra normalization) is acceptable compared to the benefit

          Two suggestions:

          1. Does a URLNormalizer scope "index" make sense? E.g., if only outlinks are normalized and default rules are empty, the scope "index" may use the same rules as scope "outlink".
          2. Wouldn't command-line options for solrindex be nice? Most other tools (generate, updatedb, invertlinks) have options such as -filter / -norm / -noNorm.
          Markus Jelsma added a comment:

          I think a scope "index" makes sense. It would make building a two-way normalizer a bit easier. Command-line options can be added, but you can use the -D option as well.

          Markus Jelsma made changes:
          Link: This issue blocks NUTCH-1323 [ NUTCH-1323 ]
          Markus Jelsma added a comment:

          20120304-push-1.6

          Markus Jelsma made changes:
          Fix Version/s: 1.6 [ 12319941 ]
          Fix Version/s: 1.5 [ 12318246 ]
          Markus Jelsma made changes:
          Patch Info: Patch Available [ 10042 ]
          Lewis John McGibbney added a comment:

          Hi Markus. Before commenting on NUTCH-1323, I would also agree with Sebastian w.r.t. command-line options. As with NUTCH-1139, there was a clear-cut decision made to support command-line options, so this patch would also need them added to work correctly? Additionally, we know that many people find this convenient and comprehensive, especially if they are documented well on the wiki :0) Apart from this I'm also +1

          Markus Jelsma added a comment:

          Sure! I'll add a command line option and update the tool description on the wiki. Will upload the patch and commit when trunk is 1.6.

          Markus Jelsma added a comment:

          Committed for 1.6 in rev. 1349262.

          The -filter and -normalize options are now available and a new scope SCOPE_NORMALIZE was added. Thanks Sebastian and Lewis.

          Markus Jelsma made changes:
          Status: Open [ 1 ] → Resolved [ 5 ]
          Resolution: Fixed [ 1 ]
          Hudson added a comment:

          Integrated in nutch-trunk-maven #310 (See https://builds.apache.org/job/nutch-trunk-maven/310/)
          NUTCH-1300 Indexer to filter normalize URL's (Revision 1349262)

          Result = SUCCESS
          markus :
          Files :

          • /nutch/trunk/CHANGES.txt
          • /nutch/trunk/src/java/org/apache/nutch/indexer/IndexerMapReduce.java
          • /nutch/trunk/src/java/org/apache/nutch/indexer/solr/SolrIndexer.java
          • /nutch/trunk/src/java/org/apache/nutch/net/URLNormalizers.java
          Hudson added a comment:

          Integrated in Nutch-trunk #1869 (See https://builds.apache.org/job/Nutch-trunk/1869/)
          NUTCH-1300 Indexer to filter normalize URL's (Revision 1349262)

          Result = SUCCESS
          markus : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1349262
          Files :

          • /nutch/trunk/CHANGES.txt
          • /nutch/trunk/src/java/org/apache/nutch/indexer/IndexerMapReduce.java
          • /nutch/trunk/src/java/org/apache/nutch/indexer/solr/SolrIndexer.java
          • /nutch/trunk/src/java/org/apache/nutch/net/URLNormalizers.java
          Gavin made changes:
          Link: This issue blocks NUTCH-1323 [ NUTCH-1323 ]
          Gavin made changes:
          Link: This issue is depended upon by NUTCH-1323 [ NUTCH-1323 ]
          Lewis John McGibbney made changes:
          Status: Resolved [ 5 ] → Closed [ 6 ]
          Markus Jelsma made changes:
          Link: This issue is duplicated by NUTCH-1614 [ NUTCH-1614 ]
          Markus Jelsma made changes:
          Resolution: Fixed [ 1 ]
          Status: Closed [ 6 ] → Reopened [ 4 ]
          Markus Jelsma made changes:
          Summary: Indexer to normalize URL's → Indexer to filter and normalize URL's
          Markus Jelsma added a comment:

          Renamed issue for clarity.

          Markus Jelsma made changes:
          Status: Reopened [ 4 ] → Resolved [ 5 ]
          Resolution: Fixed [ 1 ]
          Transition            Time In Source Status   Execution Times   Last Executer          Last Execution Date
          Open → Resolved       98d 10h 44m             1                 Markus Jelsma          12/Jun/12 12:27
          Resolved → Closed     343d 16h 26m            1                 Lewis John McGibbney   22/May/13 04:53
          Closed → Reopened     56d 14h 44m             1                 Markus Jelsma          17/Jul/13 19:38
          Reopened → Resolved   27s                     1                 Markus Jelsma          17/Jul/13 19:38

            People

            • Assignee: Markus Jelsma
            • Reporter: Markus Jelsma
            • Votes: 0
            • Watchers: 2
