Crunch (Retired) / CRUNCH-369

Crunch doesn't use custom getSplits functions of FileInputFormat subclasses

Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Duplicate
    • Affects Version/s: 0.9.0
    • Fix Version/s: None
    • Component/s: IO
    • Labels: None

    Description

      Suppose I create a source for a custom InputFormat which is a subclass of FileInputFormat; e.g.

      TableSource<LongWritable, FastQWritable> source = From.formattedFile(
      inputFile, FastQInputFormatNew.class, LongWritable.class,
      FastQWritable.class);

      where FastQInputFormatNew is a subclass of FileInputFormat.

      This won't work as expected because, by default, CrunchInputFormat.getSplits ends up using CrunchCombineFileInputFormat to split the file, which bypasses the custom file splitter that my FileInputFormat subclass relies on.

      I can work around this by explicitly disabling the combining, e.g.

      source.inputConf(RuntimeParameters.DISABLE_COMBINE_FILE, Boolean.TRUE.toString());

      but this doesn't strike me as the best solution. If I tell Crunch to use a custom InputFormat, I shouldn't have to set a second config option just to make Crunch respect the getSplits function of that InputFormat.

      I think CrunchInputFormat.getSplits should check that the format class is exactly FileInputFormat, i.e. not a subclass. For subclasses, Crunch should use the getSplits function of the custom InputFormat class. I think changing the check to the following might work:

      if (format.getClass().equals(FileInputFormat.class) &&
          !conf.getBoolean(RuntimeParameters.DISABLE_COMBINE_FILE, true))
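
      To illustrate the difference the proposed check would make, here is a minimal, self-contained sketch. It assumes the existing check is subclass-inclusive (for example an instanceof test), which this report implies but does not quote. The classes FileInputFormat and FastQInputFormat below are hypothetical stand-ins defined inside the sketch, not the Hadoop or Crunch classes, and the boolean parameter stands in for the value read from RuntimeParameters.DISABLE_COMBINE_FILE.

```java
// Stand-in types only: these are NOT the Hadoop or Crunch classes.
class SplitCheckSketch {

    /** Hypothetical stand-in for Hadoop's FileInputFormat. */
    static class FileInputFormat {}

    /** Hypothetical stand-in for a custom subclass with its own getSplits. */
    static class FastQInputFormat extends FileInputFormat {}

    /** Subclass-inclusive check (assumed current behaviour): matches every subclass too. */
    static boolean combineWithInstanceof(Object format, boolean disableCombine) {
        return format instanceof FileInputFormat && !disableCombine;
    }

    /** Exact-class check as proposed above: matches only FileInputFormat itself. */
    static boolean combineWithExactClass(Object format, boolean disableCombine) {
        return format.getClass().equals(FileInputFormat.class) && !disableCombine;
    }

    public static void main(String[] args) {
        Object plain = new FileInputFormat();
        Object custom = new FastQInputFormat();

        // The subclass is combined under instanceof, so its own splitter is bypassed.
        System.out.println(combineWithInstanceof(custom, false)); // true
        // The exact-class check leaves the subclass alone.
        System.out.println(combineWithExactClass(custom, false)); // false
        // The base class itself still qualifies for combining.
        System.out.println(combineWithExactClass(plain, false));  // true
    }
}
```

      One caveat: Hadoop's real FileInputFormat is an abstract class, so no concrete format instance can have exactly that class, and an exact-class comparison against it would in practice never select the combine path. The stand-in is made concrete here only so that both branches can be shown.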

            People

              Assignee: Unassigned
              Reporter: Jeremy Lewi
              Votes: 0
              Watchers: 3