Uploaded image for project: 'Crunch'
  1. Crunch
  2. CRUNCH-690

materialize() directly from Source skips compression and glob handling

    XMLWordPrintableJSON

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 0.14.0
    • Fix Version/s: None
    • Component/s: Core
    • Labels:
      None

      Description

      I recently came to notice some oddities that can occur with PCollections materialized without any DoFns performed.

      For example:

      PCollection<String> data = pipeline.read(From.textFile("/data/*.txt"));
      Iterable<String> materializedData = data.materialize();
      

      Results in:

      org.apache.crunch.CrunchRuntimeException: java.io.IOException: No files found to materialize at: /data/*.txt
      

      The globbing in the input path is not evaluated.

      Another issue is that decompression is not handled. For example, suppose I do:

      PCollection<String> data = pipeline.read(From.textFile("/data/file1.txt.gz"));
      Iterable<String> materializedData = data.materialize();
      

      This will succeed, but the Strings in materializedData will be garbled junk as the data was never decompressed.

      If I run parallelDo(IdentityFn.getInstance()) on the data before materializing, both problems go away. Globbing and decompression are both handled.

      The code path to read data materialized directly is different than the code path for data that passes through a DoFn. When the data passes through a DoFn, all the magic in FileInputFormat happens. If it doesn't pass through a DoFn, that is skipped.

      The latter problem is probably the worse of the two. Potentially there could be a pipeline that succeeds but silently processes data incorrectly.

      I only bumped into this with some perhaps weird stuff I was trying in a test and I have the workaround of running IdentityFn so a fix is not urgent for me but I wonder about possibly patching in something to avoid this silent problem. I'm not going to jump on it immediately but rather at least leave this issue out here for visibility.

        Attachments

          Activity

            People

            • Assignee:
              jwills Josh Wills
              Reporter:
              ben.roling Ben Roling
            • Votes:
              0 Vote for this issue
              Watchers:
              1 Start watching this issue

              Dates

              • Created:
                Updated: