Details
- Type: Bug
- Status: Open
- Priority: Major
- Resolution: Unresolved
- Affects Version: 1.14.2
- Fix Version: None
Description
This was reported on the user mailing list. Run the following test to reproduce the bug.
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;
import org.apache.flink.table.api.internal.TableEnvironmentImpl;
import org.junit.Test;

public class MyTest {

    @Test
    public void myTest() throws Exception {
        EnvironmentSettings settings = EnvironmentSettings.inBatchMode();
        TableEnvironment tEnv = TableEnvironmentImpl.create(settings);
        tEnv.executeSql(
                        "create table T1 ( a INT ) with ( 'connector' = 'filesystem', 'format' = 'json', 'path' = '/tmp/gao.json' )")
                .await();
        tEnv.executeSql(
                        "create table T2 ( a INT ) with ( 'connector' = 'filesystem', 'format' = 'json', 'path' = '/tmp/gao.gz' )")
                .await();
        tEnv.executeSql("select count(*) from T1 UNION ALL select count(*) from T2").print();
    }
}
The data files used are attached to this issue.
The result is
+----------------------+
|               EXPR$0 |
+----------------------+
|                  100 |
|                   24 |
+----------------------+
which is obviously incorrect.
This happens because DelimitedInputFormat#fillBuffer does not handle compressed files correctly: it caps the number of (uncompressed) bytes read at splitLength, but splitLength is the length of the compressed file, so the two quantities cannot match and the reader stops before the whole file has been decompressed.
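The mismatch can be illustrated with a small standalone sketch (this is not Flink code; the class name, record layout, and record count are made up for the demo): gzip a batch of newline-delimited records, then read the decompressed stream but stop once splitLength uncompressed bytes have been consumed, where splitLength is the compressed file size, mirroring what fillBuffer effectively does. The reader stops early and silently drops records.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class SplitLengthDemo {

    // Gzips n newline-delimited records, then counts how many complete records
    // a reader sees if it stops after splitLength uncompressed bytes, where
    // splitLength is the size of the COMPRESSED file (the mismatch in question).
    static int recordsReadWithSplitLengthCap(int n) throws Exception {
        ByteArrayOutputStream compressed = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(compressed)) {
            for (int i = 0; i < n; i++) {
                gz.write(("{\"a\": " + i + "}\n").getBytes(StandardCharsets.UTF_8));
            }
        }
        long splitLength = compressed.size(); // compressed bytes, not uncompressed

        int records = 0;
        long bytesRead = 0;
        try (InputStream in =
                new GZIPInputStream(new ByteArrayInputStream(compressed.toByteArray()))) {
            int b;
            // Stop once splitLength (uncompressed) bytes are consumed,
            // as DelimitedInputFormat#fillBuffer effectively does.
            while (bytesRead < splitLength && (b = in.read()) != -1) {
                bytesRead++;
                if (b == '\n') {
                    records++;
                }
            }
        }
        return records;
    }

    public static void main(String[] args) throws Exception {
        // Fewer than 100 records are seen, because the uncompressed data is
        // longer than splitLength.
        System.out.println("records seen: " + recordsReadWithSplitLengthCap(100));
    }
}
```

A fix would need to stop reading based on the end of the decompressed stream (for non-splittable compressed files, the whole file is one split) rather than on the compressed splitLength.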