  1. Nutch
  2. NUTCH-1483 Can't crawl filesystem with protocol-file plugin
  3. NUTCH-1885

Protocol-file should treat symbolic links as redirects

Details

    • Type: Sub-task
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 1.9, 2.2.1
    • Fix Version/s: 2.3, 1.10
    • Component/s: protocol
    • Labels: None
    • Patch Info: Patch Available

    Description

      (reported by angela_wang, see NUTCH-1884, [1] and [2])

      If a file is a symbolic link, or its path contains one, protocol-file follows the link immediately and returns a Content object with the canonical path (all symbolic links resolved) in the field "Location". This may cause:

      • the Parse object not being available under its expected URL (see NUTCH-1884)
      • dubious CrawlDatums (status fetched!) in the CrawlDb (the first URL is a symbolic link to the second item):
        file:/var/www/redir_test.html   Version: 7
        Status: 2 (db_fetched)
        ...
        Signature: null
        Metadata: 
                Content-Type=text/html
                _pst_=success(1), lastModified=0
        
        file:/var/www/test.html Version: 7
        Status: 2 (db_fetched)
        ...
        Signature: 50fa8436398f0ecb6b15eaba0574ef23
        Metadata: 
                Content-Type=text/html
                _pst_=success(1), lastModified=0
        

        Because the signature is null, these entries will never result in duplicates in the index.

      Protocol-file should instead explicitly redirect to the link target. This should be the default; optionally, we could add a property to restore the old behavior.

      This should not be difficult to resolve: FileResponse already has the status "redirect" for symlinks, but File.getProtocolOutput() then resolves the links internally. So we just need to return a redirect response before the links are resolved or followed.
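
The idea of returning a redirect before any link is followed can be sketched outside of Nutch with plain java.nio. This is only an illustration, not the attached patch: the class SymlinkRedirectSketch, the FetchResult type, the fetch() method and the status constants are all hypothetical names, and the sketch assumes a POSIX-like file system where symbolic links can be created. It treats any path that differs from its canonical form as a redirect to the canonical path.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class SymlinkRedirectSketch {

    // Hypothetical status codes, chosen for illustration only (not Nutch constants).
    static final int OK = 200;
    static final int REDIRECT = 300;

    /** Hypothetical result type; it is not part of the Nutch API. */
    static final class FetchResult {
        final int status;
        final String location; // redirect target as a file: URL, or null

        FetchResult(int status, String location) {
            this.status = status;
            this.location = location;
        }
    }

    /**
     * Compares the requested path with its canonical form before reading anything.
     * If they differ (a symlink as the file itself or as a path element, or "." / ".."
     * segments), a redirect to the canonical path is returned instead of the content.
     */
    static FetchResult fetch(Path requested) throws IOException {
        Path absolute = requested.toAbsolutePath();
        Path canonical = absolute.toRealPath(); // resolves all links; throws if the file is missing
        if (!canonical.equals(absolute)) {
            return new FetchResult(REDIRECT, canonical.toUri().toString());
        }
        return new FetchResult(OK, null);
    }

    public static void main(String[] args) throws IOException {
        // Resolve the temp directory first, because the temp location itself may be a symlink.
        Path dir = Files.createTempDirectory("redir_test").toRealPath();
        Path target = Files.write(dir.resolve("test.html"), "<html></html>".getBytes());
        Path link = Files.createSymbolicLink(dir.resolve("redir_test.html"), target);

        FetchResult viaLink = fetch(link);
        FetchResult direct = fetch(target);
        System.out.println("link:   status=" + viaLink.status + " location=" + viaLink.location);
        System.out.println("target: status=" + direct.status + " location=" + direct.location);
    }
}
```

With this approach the link URL would be recorded as a redirect rather than as a fetched page with a null signature, and only the canonical URL would carry content.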

      Attachments

        1. NUTCH-1885-2x-v1.patch
          4 kB
          Sebastian Nagel
        2. NUTCH-1885-trunk-v1.patch
          4 kB
          Sebastian Nagel

          People

            Assignee: Unassigned
            Reporter: Sebastian Nagel (snagel)
            Votes: 0
            Watchers: 3