Nutch / NUTCH-1320

IndexChecker and ParseChecker choke on IDN's

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.4
    • Fix Version/s: 1.6
    • Component/s: None
    • Labels:
      None
    • Patch Info:
      Patch Available

      Description

      These handy debug tools do not handle IDNs (internationalized domain names) and throw a NullPointerException:

      bin/nutch parsechecker http://例子.測試/%E9%A6%96%E9%A0%81

      Exception in thread "main" java.lang.NullPointerException
              at org.apache.nutch.indexer.IndexingFiltersChecker.run(IndexingFiltersChecker.java:71)
              at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
              at org.apache.nutch.indexer.IndexingFiltersChecker.main(IndexingFiltersChecker.java:116)
      

          Activity

          Hudson added a comment -

          Integrated in Nutch-trunk #1865 (See https://builds.apache.org/job/Nutch-trunk/1865/)
          NUTCH-1320 IndexChecker and ParseChecker choke on IDN's (Revision 1347755)

          Result = SUCCESS
          markus : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1347755
          Files :

          • /nutch/trunk/CHANGES.txt
          • /nutch/trunk/src/java/org/apache/nutch/indexer/IndexingFiltersChecker.java
          • /nutch/trunk/src/java/org/apache/nutch/parse/ParserChecker.java
          • /nutch/trunk/src/java/org/apache/nutch/util/URLUtil.java
          Hudson added a comment -

          Integrated in nutch-trunk-maven #299 (See https://builds.apache.org/job/nutch-trunk-maven/299/)
          NUTCH-1320 IndexChecker and ParseChecker choke on IDN's (Revision 1347755)

          Result = SUCCESS
          markus :
          Files :

          • /nutch/trunk/CHANGES.txt
          • /nutch/trunk/src/java/org/apache/nutch/indexer/IndexingFiltersChecker.java
          • /nutch/trunk/src/java/org/apache/nutch/parse/ParserChecker.java
          • /nutch/trunk/src/java/org/apache/nutch/util/URLUtil.java
          Markus Jelsma added a comment -

          Committed for 1.6 in rev. 1347755.
          Thanks Lewis

          Markus Jelsma added a comment -

          20120304-push-1.6

          Markus Jelsma added a comment -

          Somewhere down the line, IDNs enter the CrawlDB in ASCII form, so there is no problem there, but these tools lack the conversion. The filter and normalizer checker tools would also benefit. This also suggests the need for an IDNNormalizer that does toUnicode when indexing; you don't want http://xn--*/ URLs in your index.
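
To illustrate the two forms mentioned above, here is a minimal sketch using only the standard java.net.IDN class. The class name IdnFormsSketch, the helper names crawlForm and indexForm, and the host bücher.example are hypothetical and chosen for illustration; this is not the proposed IDNNormalizer, only the host-level conversion it would presumably rely on.

```java
import java.net.IDN;

// Hypothetical sketch: the same host in its ASCII (Punycode) and Unicode forms.
public class IdnFormsSketch {

  /** ASCII-compatible form, as an IDN host would appear in the CrawlDB. */
  static String crawlForm(String host) {
    return IDN.toASCII(host);
  }

  /** Unicode form, which is what one would likely prefer to see in an index. */
  static String indexForm(String host) {
    return IDN.toUnicode(host);
  }

  public static void main(String[] args) {
    // "b\u00fccher.example" is "bücher.example"; escaped to stay source-encoding neutral.
    String unicodeHost = "b\u00fccher.example";

    String asciiHost = crawlForm(unicodeHost);
    System.out.println(asciiHost);                                 // xn--bcher-kva.example
    System.out.println(indexForm(asciiHost).equals(unicodeHost));  // true

    // A plain ASCII host passes through both conversions unchanged.
    System.out.println(crawlForm("example.org"));                  // example.org
  }
}
```

Both conversions work on a host name only, so a normalizer would still have to split the URL and reassemble it around the converted host.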

          Lewis John McGibbney added a comment -

          Nice Markus. +1. Is there scope for this to be applied elsewhere, or is parserchecker the only instance (so far) where you've encountered the problem?

          Markus Jelsma added a comment -

          Patch for 1.5. URLUtil now has toASCII and toUnicode methods wrapping the java.net.IDN methods. These take a URL and return a normalized one.
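
A rough sketch of what such URL-level wrappers could look like follows. It is not the committed patch: the class name IdnUrlSketch, the rebuild helper, and the choice to return null for a malformed URL are assumptions made for illustration. Only java.net.IDN and java.net.URL from the JDK are used.

```java
import java.net.IDN;
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical sketch of URL-level toASCII/toUnicode helpers; not the Nutch source.
public class IdnUrlSketch {

  /** Reassembles the URL around a converted host, leaving path, query and fragment as-is. */
  private static String rebuild(URL u, String host) {
    StringBuilder sb = new StringBuilder();
    sb.append(u.getProtocol()).append("://");
    if (u.getUserInfo() != null) {
      sb.append(u.getUserInfo()).append('@');
    }
    sb.append(host);
    if (u.getPort() != -1) {
      sb.append(':').append(u.getPort());
    }
    sb.append(u.getFile()); // path plus query string, still percent-encoded
    if (u.getRef() != null) {
      sb.append('#').append(u.getRef());
    }
    return sb.toString();
  }

  /** Returns the URL with its host in ASCII (Punycode) form, or null if it cannot be converted. */
  public static String toASCII(String url) {
    try {
      URL u = new URL(url);
      String host = u.getHost();
      if (host == null || host.isEmpty()) {
        return url; // no host, nothing to convert
      }
      return rebuild(u, IDN.toASCII(host));
    } catch (MalformedURLException | IllegalArgumentException e) {
      return null; // assumption: callers check for null
    }
  }

  /** Returns the URL with its host in Unicode form, or null if the URL is malformed. */
  public static String toUnicode(String url) {
    try {
      URL u = new URL(url);
      String host = u.getHost();
      if (host == null || host.isEmpty()) {
        return url;
      }
      return rebuild(u, IDN.toUnicode(host));
    } catch (MalformedURLException e) {
      return null;
    }
  }

  public static void main(String[] args) {
    // "b\u00fccher.example" is "bücher.example".
    System.out.println(toASCII("http://b\u00fccher.example/%E9%A6%96?q=1#top"));
    System.out.println(toUnicode("http://xn--bcher-kva.example:8080/a"));
  }
}
```

The URL is rebuilt by string concatenation rather than through the multi-argument java.net.URI constructors because those constructors always quote the percent character, which would double-encode an already escaped path such as /%E9%A6%96.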


            People

            • Assignee: Markus Jelsma
            • Reporter: Markus Jelsma
            • Votes: 0
            • Watchers: 2