Flink / FLINK-30314

Unable to read all records from compressed delimited file input format

Details

    Description

      I am reading gzipped, line-delimited JSON files in batch mode using the FileSystem connector. To read the files, a table is created with the following configuration:

      CREATE TEMPORARY TABLE `my_database`.`my_table` (
        `my_field1` BIGINT,
        `my_field2` INT,
        `my_field3` VARCHAR(2147483647)
      ) WITH (
        'connector' = 'filesystem',
        'path' = 'path-to-input-dir',
        'format' = 'json',
        'json.ignore-parse-errors' = 'false',
        'json.fail-on-missing-field' = 'true'
      ) 

      The input directory contains two files: input-00000.json.gz and input-00001.json.gz. As the filenames indicate, the files are compressed with GZIP. Each file contains 10 records. The issue is that only 2 records are read from each file (4 in total). If decompressed versions of the same data files are used, all 20 records are read.

      As far as I understand, the problem may be that the split length used when reading the files is actually the length of the compressed file. The files are therefore closed before all records have been read, because the read position in the decompressed stream exceeds the split length.
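
      The suspected mechanism can be illustrated with a small standalone sketch. It does not use any Flink classes. It only assumes that a reader stops consuming the decompressed stream once its read position reaches a limit equal to the compressed file size. The class and method names (SplitLengthDemo, gzipLines, countRecords) and the generated records are hypothetical and exist only for this illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class SplitLengthDemo {

    // Builds an in-memory gzip "file" holding n line-delimited JSON records.
    // The record contents are made up for this sketch.
    static byte[] gzipLines(int n) {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            for (int i = 0; i < n; i++) {
                String line = "{\"my_field1\":" + i + ",\"my_field2\":" + i
                        + ",\"my_field3\":\"value-" + i + "\"}\n";
                gz.write(line.getBytes(StandardCharsets.UTF_8));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bos.toByteArray();
    }

    // Counts newline-terminated records in the decompressed stream, but stops
    // as soon as `limit` decompressed bytes have been consumed. This models a
    // reader that treats `limit` as the split length (an assumption).
    static int countRecords(byte[] gzipped, long limit) {
        int records = 0;
        long position = 0;
        try (InputStream in = new GZIPInputStream(new ByteArrayInputStream(gzipped))) {
            int b;
            while (position < limit && (b = in.read()) != -1) {
                position++;
                if (b == '\n') {
                    records++;
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return records;
    }

    public static void main(String[] args) {
        byte[] gz = gzipLines(10);
        System.out.println("compressed bytes: " + gz.length);
        // Limit taken from the compressed size: reading stops early.
        System.out.println("records, limit = compressed length: " + countRecords(gz, gz.length));
        // No effective limit: every record is seen.
        System.out.println("records, no limit: " + countRecords(gz, Long.MAX_VALUE));
    }
}
```

      Because the repetitive JSON lines compress well, the compressed size is much smaller than the decompressed size, so the bounded read returns fewer than 10 records while the unbounded read returns all 10.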

      It probably makes sense to add a flag to FSDataInputStream that indicates whether the file is compressed. The flag could be set to true in InputStreamFSInputWrapper, because it is used for wrapping compressed file streams. With such a flag, it would be possible to identify non-splittable compressed files and rely only on the end of the stream for them.
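
      A minimal sketch of the proposed idea, again without Flink classes: the boolean `compressed` parameter stands in for the suggested flag, and splitExhausted is a hypothetical end-of-split rule that ignores the split length when the flag is set. None of these names exist in Flink. They only model the proposal.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class CompressedFlagSketch {

    // Builds an in-memory gzip "file" with n made-up JSON lines.
    static byte[] gzipLines(int n) {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            for (int i = 0; i < n; i++) {
                String line = "{\"my_field1\":" + i + ",\"my_field2\":" + i
                        + ",\"my_field3\":\"value-" + i + "\"}\n";
                gz.write(line.getBytes(StandardCharsets.UTF_8));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bos.toByteArray();
    }

    // Opens a decompressing stream over the gzipped bytes.
    static InputStream open(byte[] gzipped) {
        try {
            return new GZIPInputStream(new ByteArrayInputStream(gzipped));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Hypothetical rule: a compressed (non-splittable) stream is never cut off
    // by the split length; only an uncompressed one is.
    static boolean splitExhausted(boolean compressed, long position, long splitLength) {
        return !compressed && position >= splitLength;
    }

    // Counts newline-terminated records until the split is exhausted or the
    // stream ends, whichever comes first.
    static int countRecords(InputStream in, boolean compressed, long splitLength) {
        int records = 0;
        long position = 0;
        try (InputStream stream = in) {
            int b;
            while (!splitExhausted(compressed, position, splitLength)
                    && (b = stream.read()) != -1) {
                position++;
                if (b == '\n') {
                    records++;
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return records;
    }

    public static void main(String[] args) {
        byte[] gz = gzipLines(10);
        // Flag set: the split length (compressed size) is ignored.
        System.out.println("flag = true:  " + countRecords(open(gz), true, gz.length));
        // Flag unset: reading stops at the split length, as in the report.
        System.out.println("flag = false: " + countRecords(open(gz), false, gz.length));
    }
}
```

      With the flag set, the reader relies only on the end of the stream and sees all 10 records, even though the split length passed in is still the compressed size.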

      Attachments

        1. input.json.gz
          0.2 kB
          Dmitry Yaraev
        2. input.json
          2 kB
          Dmitry Yaraev
        3. input.json.zip
          0.3 kB
          Dmitry Yaraev

            People

              echauchot Etienne Chauchot
              dyaraev Dmitry Yaraev
              Votes: 2
              Watchers: 6
