  1. Nutch
  2. NUTCH-1957

FileDumper output file name collisions

Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 1.10
    • Fix Version/s: 1.10
    • Component/s: tool
    • Patch Info: Patch Available

    Description

      The FileDumper extracts the file base name and extension from each URL and uses <basename>.<extension> as the name of the file it dumps. For example, given the URL https://www.aoncadis.org/contact/Yarrow%20Axford/project.html, <basename>.<extension> is project.html.

      Code from FileDumper.java:

      String url = key.toString();
      String baseName = FilenameUtils.getBaseName(url);
      String extension = FilenameUtils.getExtension(url);
      ...
      String filename = baseName + "." + extension;

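      This introduces file name collisions and leads to data loss when using bin/nutch dump: different URLs that share the same base name and extension map to the same output file, and every file after the first is skipped.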
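      The collision can be reproduced outside Nutch with a small sketch. The class and method names below (FileNameCollisionDemo, fileNameFor, groupByFileName) are hypothetical, and fileNameFor is only a simplified, JDK-only stand-in for the FilenameUtils calls quoted above, not the FileDumper code itself.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical demo class; not part of Nutch.
class FileNameCollisionDemo {

    // Simplified stand-in for getBaseName/getExtension applied to a URL:
    // keep only the text after the last '/', then split it at the last '.'.
    static String fileNameFor(String url) {
        String name = url.substring(url.lastIndexOf('/') + 1);
        int dot = name.lastIndexOf('.');
        String baseName = dot >= 0 ? name.substring(0, dot) : name;
        String extension = dot >= 0 ? name.substring(dot + 1) : "";
        return baseName + "." + extension;
    }

    // Groups URLs by the file name they would be dumped to.
    // Any group with more than one URL is a collision.
    static Map<String, List<String>> groupByFileName(List<String> urls) {
        Map<String, List<String>> groups = new LinkedHashMap<>();
        for (String url : urls) {
            groups.computeIfAbsent(fileNameFor(url), k -> new ArrayList<>()).add(url);
        }
        return groups;
    }

    public static void main(String[] args) {
        List<String> urls = Arrays.asList(
            "https://www.aoncadis.org/contact/Carin%20Ashjian/project.html",
            "https://www.aoncadis.org/contact/Mary%20Albert/project.html",
            "https://www.aoncadis.org/contact/Yarrow%20Axford/project.html");
        groupByFileName(urls).forEach((name, sources) ->
            System.out.println(name + " <- " + sources.size() + " URLs"));
    }
}
```

      All three URLs reduce to project.html because the directory part of the path, which is the only part that differs, is discarded before the name is built.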

      Sample logs:
      2015-03-10 23:38:01,192 INFO tools.FileDumper - Dumping URL: http://beringsea.eol.ucar.edu/data/
      2015-03-10 23:38:01,193 INFO tools.FileDumper - Skipping writing: [testFileName/.html]: file already exists
      2015-03-10 23:38:16,717 INFO tools.FileDumper - Dumping URL: http://catalog.eol.ucar.edu/
      2015-03-10 23:38:16,719 INFO tools.FileDumper - Skipping writing: [testFileName/.html]: file already exists

      2015-03-10 23:38:46,411 INFO tools.FileDumper - Dumping URL: https://www.aoncadis.org/contact/Carin%20Ashjian/project.html
      2015-03-10 23:38:46,411 INFO tools.FileDumper - Skipping writing: [testFileName/project.html]: file already exists
      2015-03-10 23:38:46,411 INFO tools.FileDumper - Dumping URL: https://www.aoncadis.org/contact/Christopher%20Arp/project.html
      2015-03-10 23:38:46,411 INFO tools.FileDumper - Skipping writing: [testFileName/project.html]: file already exists
      2015-03-10 23:38:46,411 INFO tools.FileDumper - Dumping URL: https://www.aoncadis.org/contact/Dr.%20Knut%20Aagaard/project.html
      2015-03-10 23:38:46,412 INFO tools.FileDumper - Skipping writing: [testFileName/project.html]: file already exists
      2015-03-10 23:38:46,412 INFO tools.FileDumper - Dumping URL: https://www.aoncadis.org/contact/Eric%20C.%20Apel/project.html
      2015-03-10 23:38:46,412 INFO tools.FileDumper - Skipping writing: [testFileName/project.html]: file already exists
      2015-03-10 23:38:46,412 INFO tools.FileDumper - Dumping URL: https://www.aoncadis.org/contact/John%20T.%20Andrews/project.html
      2015-03-10 23:38:46,412 INFO tools.FileDumper - Skipping writing: [testFileName/project.html]: file already exists
      2015-03-10 23:38:46,412 INFO tools.FileDumper - Dumping URL: https://www.aoncadis.org/contact/Juha%20Alatalo/project.html
      2015-03-10 23:38:46,413 INFO tools.FileDumper - Skipping writing: [testFileName/project.html]: file already exists
      2015-03-10 23:38:46,413 INFO tools.FileDumper - Dumping URL: https://www.aoncadis.org/contact/Kerim%20Aydin/project.html
      2015-03-10 23:38:46,413 INFO tools.FileDumper - Skipping writing: [testFileName/project.html]: file already exists
      2015-03-10 23:38:46,413 INFO tools.FileDumper - Dumping URL: https://www.aoncadis.org/contact/Knut%20Aagaard/project.html
      2015-03-10 23:38:46,413 INFO tools.FileDumper - Skipping writing: [testFileName/project.html]: file already exists
      2015-03-10 23:38:46,413 INFO tools.FileDumper - Dumping URL: https://www.aoncadis.org/contact/Mary%20Albert/project.html
      2015-03-10 23:38:46,414 INFO tools.FileDumper - Skipping writing: [testFileName/project.html]: file already exists
      2015-03-10 23:38:46,414 INFO tools.FileDumper - Dumping URL: https://www.aoncadis.org/contact/Yarrow%20Axford/project.html
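
      One possible way to avoid the skipped writes shown in the logs above is to make the full URL part of the output name, for example by prefixing the name with a hash of the URL. The following is a minimal sketch under that assumption; the class UniqueDumpName and its methods are hypothetical, and this is not necessarily the approach taken in the attached patch.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper; not part of Nutch and not taken from the patch.
class UniqueDumpName {

    // Returns the MD5 digest of the text as 32 lowercase hex characters.
    static String md5Hex(String text) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(text.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is required of every Java platform implementation.
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    // Builds a file name that differs whenever the URL differs
    // (up to hash collisions), while keeping the readable base name.
    static String uniqueFileName(String url, String baseName, String extension) {
        return md5Hex(url) + "_" + baseName + "." + extension;
    }
}
```

      With this scheme, two contact pages that both end in project.html get different prefixes, so neither write would be skipped as a duplicate. The hash is used here only to tell URLs apart, not for security.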

      Attachments

        1. NUTCH-1957.patch
          10 kB
          Renxia Wang

          People

            Assignee: Chris A. Mattmann (chrismattmann)
            Reporter: Renxia Wang (zhique)
            Votes: 0
            Watchers: 7
