Details

    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: 1.10
    • Component/s: None
    • Labels:
      None

      Description

      Right now, IDNs are indexed as ASCII. An IDNNormalizer is to be used with an indexer so that it decodes ASCII (Punycode) URLs to their proper Unicode equivalent.
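
To illustrate the conversion the description asks for, here is a minimal sketch using the JDK's java.net.IDN class. The class name IdnRoundTrip and the host www.bücher.example are made up for illustration; this is not code from Nutch or from any patch attached to this issue.

```java
import java.net.IDN;

class IdnRoundTrip {
    /** Encodes a Unicode host name to its ASCII (Punycode) form. */
    static String toAsciiHost(String host) {
        return IDN.toASCII(host);
    }

    /** Decodes an ASCII (Punycode) host name back to Unicode. */
    static String toUnicodeHost(String host) {
        return IDN.toUnicode(host);
    }

    public static void main(String[] args) {
        // Hypothetical host "www.bücher.example", written with a Unicode escape.
        String unicodeHost = "www.b\u00fccher.example";
        String asciiHost = toAsciiHost(unicodeHost);
        System.out.println(asciiHost);                // www.xn--bcher-kva.example
        System.out.println(toUnicodeHost(asciiHost)); // www.bücher.example
    }
}
```

Both methods work on a host name only, so a full URL would first have to be split into its parts.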

        Issue Links

          Activity

          Markus Jelsma created issue -
          Markus Jelsma made changes -
          Field Original Value New Value
          Link This issue relates to NUTCH-1320 [ NUTCH-1320 ]
          Markus Jelsma added a comment -

          ...or, we could do a toUnicode for outlinks or directly in the fetcher. This also makes sense because, as ASCII, these URLs are longer, sometimes much longer. This can cause trouble for filters that rely partly on string length. If both conversions are implemented in the fetcher or protocol library, then we don't have to worry about it, and we get better logging in the fetcher!

          Markus Jelsma added a comment -

          Any comments?

          Lewis John McGibbney made changes -
          Fix Version/s 1.7 [ 12323281 ]
          Lewis John McGibbney made changes -
          Fix Version/s 1.9 [ 12324611 ]
          Fix Version/s 1.7 [ 12323281 ]
          İlhami KALKAN added a comment -

          I patched for Nutch 2.x.

          İlhami KALKAN made changes -
          Attachment Nutch-1321.patch [ 12598397 ]
          İlhami KALKAN made changes -
          Attachment Nutch-1321.patch [ 12598397 ]
          İlhami KALKAN made changes -
          Attachment Nutch-1321.patch [ 12598421 ]
          Markus Jelsma added a comment -

          Check URLUtils, we already have methods for IDN there. They should be reused.

          İlhami KALKAN added a comment -

          Hi Markus.
          I have a question about IDN. I have URLs like http://www.çevir.com. In the inject and parse phases, they are rejected by URLFilters. Besides, I want to index them as Unicode, like www.çevir.com. How can I do this? By fixing the IDNNormalizer plugin patch? My idea is to convert them to Punycode in the inject and parse phases, then convert them back to Unicode URLs for indexing. Am I wrong, or do you have any recommendation?
          In addition, the toUNICODE method does not work correctly. When I try to convert http://www.çevir.com with this method, it fails with java.net.URISyntaxException. If we don't use a URI object, there is no problem. Why do we need a URI object in the toUNICODE method?
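
The comment above reports a URISyntaxException when the Unicode host goes through java.net.URI. One possible way to sidestep that, sketched here under the assumption that java.net.URL is good enough for splitting the URL, is to decode only the host and reassemble the string by hand. The class and method names are hypothetical, and this is not the code from URLUtil or from the attached patches.

```java
import java.net.IDN;
import java.net.MalformedURLException;
import java.net.URL;

class UnicodeUrlSketch {
    /** Returns the URL with its host decoded from Punycode to Unicode. */
    static String toUnicodeUrl(String url) {
        URL u;
        try {
            u = new URL(url);
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException("Not a valid URL: " + url, e);
        }
        StringBuilder sb = new StringBuilder();
        sb.append(u.getProtocol()).append("://");
        if (u.getUserInfo() != null) {
            sb.append(u.getUserInfo()).append('@');
        }
        // IDN.toUnicode leaves hosts that are not Punycode-encoded as they are.
        sb.append(IDN.toUnicode(u.getHost()));
        if (u.getPort() != -1) {
            sb.append(':').append(u.getPort());
        }
        sb.append(u.getFile()); // path plus "?query", if present
        if (u.getRef() != null) {
            sb.append('#').append(u.getRef());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Hypothetical Punycode URL, and the Unicode URL from the comment above.
        System.out.println(toUnicodeUrl("http://www.xn--bcher-kva.example/path?q=1"));
        System.out.println(toUnicodeUrl("http://www.\u00e7evir.com"));
    }
}
```

Because no java.net.URI object is constructed, the result is not validated against URI syntax; whether that is acceptable depends on what consumes the string afterwards.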

          Markus Jelsma added a comment -

          Hi - you can attach a patch with the fix.

          İlhami KALKAN added a comment -

          I added a patch file. Non-ASCII URLs are converted to Punycode by BasicURLNormalizer.java in the inject phase and also in the parse phase while extracting outlinks. In the index phase, Punycode URLs are converted to Unicode.

          İlhami KALKAN made changes -
          Attachment idnNormalizer.patch [ 12618931 ]
          Sebastian Nagel added a comment -

          Hi İlhami KALKAN,
          great! Thanks! The patch looks good (not tested yet). A few comments:

          1. method isPunycode(url)
            String[] arr = url.split("\\.");
            if (arr[1].startsWith("xn--"))
            

            fails for URLs like http://www.medizin.xn--uni-tbingen-xhb.de/

          2. maybe we should make the decoding from Punycode to Unicode in scope indexer configurable by some property "urlnormalizer.idn.indexer.decode" or similar. URLs are used as ordinary content (tokenized field "url") and unique ID (field "id") for updating and deleting indexed documents. Some indexer back-ends may require the id field to be pure ASCII or Punycode.
          3. cosmetics: code should be formatted with eclipse-codeformat.xml, and patches generated as described in 1, 2.
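
Regarding point 1, a check that does not depend on the position of the label could look like the following sketch. It is an illustration only, not the isPunycode() from either patch; the class and method names are hypothetical.

```java
class PunycodeLabelCheck {
    private static final String ACE_PREFIX = "xn--";

    /** Returns true if any dot-separated label of the host starts with the ACE prefix. */
    static boolean hasPunycodeLabel(String host) {
        for (String label : host.split("\\.")) {
            // Case-insensitive prefix test; labels shorter than the prefix never match.
            if (label.regionMatches(true, 0, ACE_PREFIX, 0, ACE_PREFIX.length())) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(hasPunycodeLabel("www.medizin.xn--uni-tbingen-xhb.de")); // true
        System.out.println(hasPunycodeLabel("www.example.com"));                    // false
    }
}
```

The method expects a host name, not a full URL, so the caller would extract the host first.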
          İlhami KALKAN added a comment -

          Hi Sebastian,
          1-) This code block belongs to the old patch version, Nutch-1321.patch. Sorry for not removing it. The new version of isPunycode(url) is in idnNormalizer.patch.
          2-) While indexing, this patch reverts only the Punycode 'url' to Unicode. 'id' is not reverted to Unicode; it keeps the Punycode value.
          Is this enough for updating and deleting indexed documents? Or, if we need the Punycode URL, can you explain a little more why we need it?

          İlhami KALKAN made changes -
          Attachment Nutch-1321.patch [ 12598421 ]
          Sebastian Nagel added a comment -

          Sorry, I should have checked the date of patches to get the latest one. The right patch is correctly formatted and applies well. Thanks!

          You are right regarding point 2: in 2.x 'id' is the reversed (and punycoded) URL. In 1.x the situation is different. But for 2.x there is definitely no problem. For 1.x this should be discussed.

          Testing the patch failed because URLUtil.toUNICODE() returned null for punycoded URLs (opened NUTCH-1685).

          Is there really a need for isPunycode()? At least in the current patch, it checks for Punycode by converting to Unicode and comparing the result with the original URL. It would be more efficient to convert unconditionally (the URL is left unchanged if it's not an internationalized domain name).

          Sebastian Nagel made changes -
          Link This issue depends upon NUTCH-1681 [ NUTCH-1681 ]
          İlhami KALKAN made changes -
          Attachment idnNormalizer.patch [ 12618931 ]
          İlhami KALKAN added a comment -

          Hi Sebastian,
          I don't know enough about 1.x, so I patched it for 2.x.
          You are right. The isPunycode() method is inefficient and not necessary. Instead of converting every URL that will be indexed, the patch now converts only URLs that contain "xn--", the ACE prefix for IDNA. I fixed it. Thanks!

          İlhami KALKAN made changes -
          Attachment idnNormalizer.patch [ 12620205 ]
          Sebastian Nagel added a comment -

          +1 That's reasonable: checking for "xn--" will avoid useless conversion of most non-IDNA URLs. Short and effective patch. Thanks!
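
The guard discussed here can be sketched roughly as follows. This is an illustration under the assumption that the host has already been extracted; the class and method names are hypothetical, and it is not the code from idnNormalizer.patch.

```java
import java.net.IDN;

class GuardedDecode {
    /** Decodes the host to Unicode only when it contains the ACE prefix "xn--". */
    static String decodeHostIfIdn(String host) {
        if (!host.contains("xn--")) {
            return host; // cheap exit for the common, non-IDN case
        }
        return IDN.toUnicode(host);
    }

    public static void main(String[] args) {
        System.out.println(decodeHostIfIdn("www.example.com"));
        System.out.println(decodeHostIfIdn("www.xn--bcher-kva.example"));
    }
}
```

The substring test is only a pre-filter: a host that merely contains "xn--" in the middle of a label still reaches IDN.toUnicode, which leaves such labels unchanged.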

          Sebastian Nagel added a comment -

          In BasicURLNormalizer URLs are already split into parts (protocol, host, etc.): we could call IDN.toASCII(host) directly, which would be more efficient than using URLUtil.toASCII(url) and doing the split and concatenation twice.
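
As a rough sketch of the host-only encoding suggested here, assuming the normalizer already holds the URL parts as separate values (the class, method, and parameter names are hypothetical and do not reflect the actual BasicURLNormalizer code):

```java
import java.net.IDN;

class HostOnlyEncode {
    /**
     * Reassembles a URL from parts that are already split, encoding only the
     * host to ASCII. A port of -1 means "no explicit port"; file is the path
     * plus any query string.
     */
    static String normalize(String protocol, String host, int port, String file) {
        StringBuilder sb = new StringBuilder(protocol).append("://");
        sb.append(IDN.toASCII(host)); // throws IllegalArgumentException for invalid hosts
        if (port != -1) {
            sb.append(':').append(port);
        }
        return sb.append(file).toString();
    }

    public static void main(String[] args) {
        // Hypothetical host "www.bücher.example".
        System.out.println(normalize("http", "www.b\u00fccher.example", -1, "/index.html"));
        // http://www.xn--bcher-kva.example/index.html
    }
}
```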

          Maybe we should move the decoding of the punycoded URLs from IndexUtil to index-basic / BasicIndexingFilter: field "url" is filled here. In case of redirects it's filled with reprUrl which should be decoded as well.

          Regarding a port to 1.x: trunk currently does not differentiate between 'id' and 'url'. IDN-decoding the URL in NutchDocument may cause documents not to be deleted properly, cf. NUTCH-1708 for a similar problem and discussions.
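
The 'id' versus 'url' concern can be made concrete with a small sketch. It assumes a plain Map as a stand-in for the index document and java.util.Properties as a stand-in for the Nutch configuration. The property name is the one floated earlier in this thread, not an existing Nutch setting, and all class and method names are hypothetical.

```java
import java.net.IDN;
import java.net.MalformedURLException;
import java.net.URL;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Properties;

class IndexFieldSketch {
    // Property name proposed in the discussion above; an assumption, not a real setting.
    static final String DECODE_PROPERTY = "urlnormalizer.idn.indexer.decode";

    /** Builds the two fields: 'id' stays ASCII, 'url' is decoded only if enabled. */
    static Map<String, String> buildFields(String asciiUrl, Properties conf) {
        boolean decode = Boolean.parseBoolean(conf.getProperty(DECODE_PROPERTY, "false"));
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("id", asciiUrl); // unique key used for updates and deletes
        doc.put("url", decode ? decodeHost(asciiUrl) : asciiUrl);
        return doc;
    }

    /** Decodes the host of the URL to Unicode and reassembles the string. */
    static String decodeHost(String url) {
        URL u;
        try {
            u = new URL(url);
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException("Not a valid URL: " + url, e);
        }
        StringBuilder sb = new StringBuilder(u.getProtocol()).append("://");
        sb.append(IDN.toUnicode(u.getHost()));
        if (u.getPort() != -1) {
            sb.append(':').append(u.getPort());
        }
        sb.append(u.getFile());
        if (u.getRef() != null) {
            sb.append('#').append(u.getRef());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        conf.setProperty(DECODE_PROPERTY, "true");
        System.out.println(buildFields("http://www.xn--bcher-kva.example/a", conf));
    }
}
```

Keeping 'id' in its ASCII form means the key that back-ends use for updates and deletes never changes, whatever the display form of 'url' is.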

          Julien Nioche made changes -
          Fix Version/s 1.10 [ 12327187 ]
          Fix Version/s 1.9 [ 12324611 ]

            People

            • Assignee:
              Markus Jelsma
              Reporter:
              Markus Jelsma
            • Votes:
              0
              Watchers:
              3

              Dates

              • Created:
                Updated:

                Development