Nutch / NUTCH-363

Fetcher normalizes everything at least twice

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Not A Problem
    • Affects Version/s: 0.8
    • Fix Version/s: nutchgora
    • Component/s: fetcher
    • Labels:
      None
    • Environment:

      OS X 10.4.7

      Description

      New links are normalized twice by the fetcher:

      First in DOMContentUtils.getOutlinks, where the constructor Outlink(url.toString(), linkText.toString().trim(), conf) normalizes the URL.

      The second time is in ParseOutputFormat.write().

      For some URLs (e.g. those repeated on a page) a given URL may be normalized a number of times, but it is always normalized at least twice.

      For those of us with expensive normalizations, this is probably burning some CPU.
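One way to bound the cost described above is to memoize the normalizer. The following is only a sketch under stated assumptions: `MemoizingNormalizer` is a hypothetical helper and not part of Nutch, the normalizer is modeled as a plain `UnaryOperator<String>`, and it is assumed to be a pure, idempotent function of the URL string (so an already-normalized URL maps to itself).

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.UnaryOperator;

/** Hypothetical helper (not Nutch code): caches results of an expensive URL normalizer. */
class MemoizingNormalizer implements UnaryOperator<String> {
    private final UnaryOperator<String> delegate;          // the expensive normalizer
    private final Map<String, String> cache = new HashMap<>();
    private int delegateCalls = 0;                          // counts real normalizations

    MemoizingNormalizer(UnaryOperator<String> delegate) {
        this.delegate = delegate;
    }

    @Override
    public String apply(String url) {
        String cached = cache.get(url);
        if (cached != null) {
            return cached;                                  // repeated URL: no extra work
        }
        delegateCalls++;
        String normalized = delegate.apply(url);
        if (normalized == null) {
            return null;                                    // rejected URLs are not cached
        }
        cache.put(url, normalized);
        // Assumption: the normalizer is idempotent, so a second pass over the
        // already-normalized form can be answered from the cache as well.
        cache.put(normalized, normalized);
        return normalized;
    }

    int delegateCalls() {
        return delegateCalls;
    }
}
```

With this wrapper, a second normalization pass over the same outlink hits the cache instead of the delegate. The cache here is unbounded and not thread-safe; a real fetcher would need to limit its size and guard concurrent access.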

      I'd gladly fix this, but I'm not yet familiar enough with the code to know if there are some hidden assumptions which rely on this behavior.

      [A related note is that URLs are normalized *before* filtering; this is causing a lot of extra normalization as well. In general, filters may not be safe to run before normalization, but there is likely a class of them which are (filtering out .gif/.jpg etc). Perhaps the notion of a "pre-normalizer filter" would be a useful one?]
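The "pre-normalizer filter" idea in the note above could look roughly like the sketch below. Everything in it is an assumption for illustration: `PreNormalizerFilter`, its suffix list, and the `UnaryOperator<String>` normalizer are hypothetical and do not correspond to Nutch's actual filter or normalizer interfaces.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.function.UnaryOperator;

/** Hypothetical sketch (not Nutch code): a cheap filter that runs before normalization. */
class PreNormalizerFilter {
    // Assumed suffixes that are safe to reject without normalizing first.
    private static final List<String> SKIPPED_SUFFIXES =
            Arrays.asList(".gif", ".jpg", ".jpeg", ".png");

    /** Returns false when the URL path ends in one of the skipped suffixes. */
    static boolean accepts(String url) {
        // Ignore any query string or fragment when looking at the suffix.
        int end = url.length();
        int query = url.indexOf('?');
        if (query >= 0) {
            end = query;
        }
        int fragment = url.indexOf('#');
        if (fragment >= 0 && fragment < end) {
            end = fragment;
        }
        String path = url.substring(0, end).toLowerCase(Locale.ROOT);
        for (String suffix : SKIPPED_SUFFIXES) {
            if (path.endsWith(suffix)) {
                return false;
            }
        }
        return true;
    }

    /** Applies the cheap filter first, so rejected URLs never reach the normalizer. */
    static List<String> filterThenNormalize(List<String> urls,
                                            UnaryOperator<String> normalizer) {
        List<String> out = new ArrayList<>();
        for (String url : urls) {
            if (!accepts(url)) {
                continue;
            }
            String normalized = normalizer.apply(url);
            if (normalized != null) {
                out.add(normalized);
            }
        }
        return out;
    }
}
```

Only filters whose verdict cannot change under normalization belong in this stage; a suffix check like this one is a plausible candidate, while host- or scheme-based rules may not be.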

        Activity

        Doug Cook created issue -
        iwan cornelius added a comment -

        Has this been resolved or a workaround found? I'd like to use the normalizer to add a URL to the existing URL and this 'feature' is creating problems.

        Cheers

        Emmanuel Joke added a comment -

        FYI, the link normalization inside the Outlink object has been removed.

        Alex McLintock added a comment -

        So this issue can be closed, right? Any objections?

        Chris A. Mattmann added a comment -
        • per the comments, this issue is no longer a problem. If you feel it is, or you still see this behavior, please file a new issue and let us know.
        Chris A. Mattmann made changes -
        Field | Original Value | New Value
        Status | Open [ 1 ] | Resolved [ 5 ]
        Fix Version/s | | 1.2 [ 12315152 ]
        Resolution | | Not A Problem [ 8 ]
        Chris A. Mattmann made changes -
        Field | Original Value | New Value
        Fix Version/s | | 2.0 [ 12314893 ]
        Fix Version/s | 1.2 [ 12315152 ] |
        Markus Jelsma added a comment - Bulk close of resolved issues: http://www.lucidimagination.com/search/document/2738eeb014805854/clean_up_open_legacy_issues_in_jira
        Markus Jelsma made changes -
        Field | Original Value | New Value
        Status | Resolved [ 5 ] | Closed [ 6 ]
        Transition | Time In Source Status | Execution Times | Last Executer | Last Execution Date
        Open -> Resolved | 1386d 11h 10m | 1 | Chris A. Mattmann | 26/Jun/10 06:57
        Resolved -> Closed | 279d 9h 10m | 1 | Markus Jelsma | 01/Apr/11 16:07

          People

          • Assignee:
            Unassigned
            Reporter:
            Doug Cook
          • Votes:
            0
            Watchers:
            0
