Nutch / NUTCH-2183

Improvement to SegmentChecker for skipping non-segments present in segments directory

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.11
    • Fix Version/s: 1.12
    • Component/s: indexer, segment
    • Labels:
      None

      Description

      The scenario is that you have a bunch of Nutch data which has been gathered over some period of time. Some of the data structures are present, some are not. In the segments directory, for example, there are .zip files (don't ask why), and in other directories there are .tar.gz files, etc.
      This patch improves the SegmentChecker to skip directories or files within the segments directory whose names are not 14 characters in length, as ALL segment names are. The IndexingJob also applies this check to individual segments. This prevents the Indexer from blowing up when it is run on one segment (e.g. without the -dir option) and encounters some arbitrary directory within segments/ which turns out not to be a segment after all.
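
To illustrate the idea, here is a minimal sketch of the name-length check described above. It is not the code from NUTCH-2183.patch: the class and method names (`SegmentNameFilter`, `hasSegmentNameLength`, `listSegments`) are hypothetical, and it uses `java.nio.file` on a local directory rather than the Hadoop FileSystem API that Nutch itself works with.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper, for illustration only; not part of Nutch.
class SegmentNameFilter {

  // Per the issue description, every segment name is 14 characters long.
  static final int SEGMENT_NAME_LENGTH = 14;

  // True if the given file or directory name has the length of a segment name.
  static boolean hasSegmentNameLength(String name) {
    return name != null && name.length() == SEGMENT_NAME_LENGTH;
  }

  // Lists the entries of segmentsDir, skipping anything whose name is not
  // 14 characters long (e.g. stray .zip or .tar.gz files).
  static List<Path> listSegments(Path segmentsDir) throws IOException {
    try (Stream<Path> entries = Files.list(segmentsDir)) {
      return entries
          .filter(p -> hasSegmentNameLength(p.getFileName().toString()))
          .sorted()
          .collect(Collectors.toList());
    }
  }

  public static void main(String[] args) throws IOException {
    if (args.length != 1) {
      System.err.println("usage: SegmentNameFilter <segments-dir>");
      return;
    }
    // Print only the entries that pass the length check.
    for (Path segment : listSegments(Paths.get(args[0]))) {
      System.out.println(segment);
    }
  }
}
```

A length check alone is deliberately cheap, but it would also accept any unrelated entry whose name happens to be 14 characters long, so a caller that needs more certainty would have to inspect the candidate's contents as well.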

        Attachments

        1. NUTCH-2183.patch
          2 kB
          Lewis John McGibbney

            People

            • Assignee:
              lewismc Lewis John McGibbney
              Reporter:
              lewismc Lewis John McGibbney
            • Votes:
              0
              Watchers:
              3
