Nutch / NUTCH-1212

ParseOutputFormat has redundant code

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 1.4
    • Fix Version/s: 1.5
    • Component/s: parser
    • Labels: None

      Description

      In ParseOutputFormat, I see a code block:

               // collect outlinks for subsequent db update
               Outlink[] links = parseData.getOutlinks();
               int outlinksToStore = Math.min(maxOutlinks, links.length);
               if (ignoreExternalLinks) {
                 try {
                   fromHost = new URL(fromUrl).getHost().toLowerCase();
                 } catch (MalformedURLException e) {
                   fromHost = null;
                 }
               } else {
                 fromHost = null;
               }
      

      The same if (ignoreExternalLinks) check is then repeated on every
      iteration of the ensuing for loop:

               int validCount = 0;
               CrawlDatum adjust = null;
               List<Entry<Text, CrawlDatum>> targets = new ArrayList<Entry<Text, CrawlDatum>>(outlinksToStore);
               List<Outlink> outlinkList = new ArrayList<Outlink>(outlinksToStore);
               for (int i = 0; i < links.length && validCount < outlinksToStore; i++) {
                 String toUrl = links[i].getToUrl();
                 // ignore links to self (or anchors within the page)
                 if (fromUrl.equals(toUrl)) {
                   continue;
                 }
                 if (ignoreExternalLinks) {
                   try {
                     toHost = new URL(toUrl).getHost().toLowerCase();
                   } catch (MalformedURLException e) {
                     toHost = null;
                   }
                   if (toHost == null || !toHost.equals(fromHost)) { // external links
                     continue; // skip it
                   }
                 }
      

      Isn't that redundant? I don't think the first if block is needed.
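
To make the question easier to reason about, here is a minimal, self-contained sketch of the host comparison that the two quoted blocks perform between them. It is not the Nutch source: the class name InternalLinkFilter and the helpers hostOf and filter are hypothetical, and plain strings stand in for Outlink objects. In the sketch the source host is resolved once before the loop, and each target host is resolved inside it.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration only; not part of Nutch.
public class InternalLinkFilter {

  /** Returns the lower-cased host of a URL, or null if the URL is malformed. */
  static String hostOf(String url) {
    try {
      return new URL(url).getHost().toLowerCase(Locale.ROOT);
    } catch (MalformedURLException e) {
      return null;
    }
  }

  /** Keeps at most maxOutlinks targets, skipping self links and, optionally, external hosts. */
  static List<String> filter(String fromUrl, String[] toUrls,
                             boolean ignoreExternalLinks, int maxOutlinks) {
    // Resolved once: the source host does not change while iterating.
    String fromHost = ignoreExternalLinks ? hostOf(fromUrl) : null;
    int limit = Math.min(maxOutlinks, toUrls.length);
    List<String> kept = new ArrayList<>(limit);

    for (int i = 0; i < toUrls.length && kept.size() < limit; i++) {
      String toUrl = toUrls[i];
      // Ignore links to self.
      if (fromUrl.equals(toUrl)) {
        continue;
      }
      if (ignoreExternalLinks) {
        // Resolved per link: each target may point at a different host.
        String toHost = hostOf(toUrl);
        if (toHost == null || !toHost.equals(fromHost)) {
          continue; // external or unparsable link, skip it
        }
      }
      kept.add(toUrl);
    }
    return kept;
  }
}
```

Calling filter with ignoreExternalLinks set to false keeps every target except the self link, up to the limit; setting it to true additionally drops targets whose host differs from the source host or cannot be parsed.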

              People

              • Assignee: Chris A. Mattmann (chrismattmann)
              • Reporter: Chris A. Mattmann (chrismattmann)