Nutch / NUTCH-1212

ParseOutputFormat has redundant code

Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version: 1.4
    • Fix Version: 1.5
    • Component: parser
    • Labels: None

    Description

      In ParseOutputFormat, I see a code block:

               // collect outlinks for subsequent db update
               Outlink[] links = parseData.getOutlinks();
               int outlinksToStore = Math.min(maxOutlinks, links.length);
               if (ignoreExternalLinks) {
                 try {
                   fromHost = new URL(fromUrl).getHost().toLowerCase();
                 } catch (MalformedURLException e) {
                   fromHost = null;
                 }
               } else {
                 fromHost = null;
               }
      

      The same if (ignoreExternalLinks) check is then repeated on every
      iteration of the ensuing for loop:

               int validCount = 0;
               CrawlDatum adjust = null;
               List<Entry<Text, CrawlDatum>> targets = new ArrayList<Entry<Text, CrawlDatum>>(outlinksToStore);
               List<Outlink> outlinkList = new ArrayList<Outlink>(outlinksToStore);
               for (int i = 0; i < links.length && validCount < outlinksToStore; i++) {
                 String toUrl = links[i].getToUrl();
                 // ignore links to self (or anchors within the page)
                 if (fromUrl.equals(toUrl)) {
                   continue;
                 }
                 if (ignoreExternalLinks) {
                   try {
                     toHost = new URL(toUrl).getHost().toLowerCase();
                   } catch (MalformedURLException e) {
                     toHost = null;
                   }
                   if (toHost == null || !toHost.equals(fromHost)) { // external links
                     continue; // skip it
                   }
                 }
      

      Isn't that redundant? I don't think the first if block is needed.
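
      To make the question concrete, here is a minimal, self-contained sketch of one way the two blocks could be tidied up. It is not the change that was committed for this issue. The class OutlinkFilterSketch and the methods hostOrNull and selectOutlinks are hypothetical names, and plain strings stand in for Nutch's Outlink and CrawlDatum types. The sketch resolves fromHost once before the loop and each toHost inside it, sharing the URL parsing in a single helper.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; this class is not part of Nutch.
class OutlinkFilterSketch {

  // Returns the lower-cased host of the URL, or null if it cannot be parsed.
  static String hostOrNull(String url) {
    try {
      return new URL(url).getHost().toLowerCase();
    } catch (MalformedURLException e) {
      return null;
    }
  }

  // Selects up to maxOutlinks target URLs, skipping self links and,
  // when ignoreExternalLinks is set, links that point to another host.
  static List<String> selectOutlinks(String fromUrl, String[] toUrls,
      boolean ignoreExternalLinks, int maxOutlinks) {
    int outlinksToStore = Math.min(maxOutlinks, toUrls.length);
    // Resolved once per page; stays null when external links are allowed.
    String fromHost = ignoreExternalLinks ? hostOrNull(fromUrl) : null;
    List<String> kept = new ArrayList<String>(outlinksToStore);
    for (int i = 0; i < toUrls.length && kept.size() < outlinksToStore; i++) {
      String toUrl = toUrls[i];
      // ignore links to self (or anchors within the page)
      if (fromUrl.equals(toUrl)) {
        continue;
      }
      if (ignoreExternalLinks) {
        String toHost = hostOrNull(toUrl);
        if (toHost == null || !toHost.equals(fromHost)) { // external link
          continue; // skip it
        }
      }
      kept.add(toUrl);
    }
    return kept;
  }
}
```

      Written this way, the block before the loop only resolves the host of the page being parsed, while the check inside the loop resolves the host of each outlink, so the duplicated try/catch collapses into one helper.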

            People

              Assignee: Chris A. Mattmann (chrismattmann)
              Reporter: Chris A. Mattmann (chrismattmann)