[HADOOP-15206] BZip2 drops and duplicates records when input split size is small - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 2.8.3, 3.0.0
Fix Version/s: 3.1.0, 2.10.0, 2.9.1, 2.8.4, 2.7.6, 3.0.3
Component/s: None
Labels: None
Hadoop Flags: Reviewed

Description

BZip2 can drop and duplicate records when the input split size is small. I confirmed that the issue occurs when the input split size is between 1 byte and 4 bytes.

I am seeing the following two problem behaviors.

1. Dropped record:

BZip2 skips the first record in the input file when the input split size is small.

With the split size set to 3, I tested loading 100 records (0, 1, 2, ..., 99):

2018-02-01 10:52:33,502 INFO  [Thread-17] mapred.TestTextInputFormat (TestTextInputFormat.java:verifyPartitions(317)) - splits[1]=file:/work/count-mismatch2/hadoop/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient/target/test-dir/TestTextInputFormat/test.bz2:3+3 count=99

> The input format read only 99 records instead of 100.
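
The `3+3` suffix in the log line above is the split's start offset and length in bytes. As a minimal sketch of how such ranges line up with a given split size, assuming the file is cut at fixed byte offsets with no minimum split size or block-boundary adjustment (the `SplitRanges` class and its `describe` method are hypothetical helpers, not Hadoop API, and the 10-byte file length is an arbitrary illustration value):

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Hadoop: lists fixed-size byte ranges
// in the "start+length" form that appears in the log line.
public class SplitRanges {
    public static List<String> describe(long fileLength, long splitSize) {
        List<String> ranges = new ArrayList<>();
        for (long start = 0; start < fileLength; start += splitSize) {
            // The last range may be shorter than splitSize.
            long length = Math.min(splitSize, fileLength - start);
            ranges.add(start + "+" + length);
        }
        return ranges;
    }

    public static void main(String[] args) {
        // Assumed 10-byte file, split size 3.
        System.out.println(describe(10, 3)); // prints [0+3, 3+3, 6+3, 9+1]
    }
}
```

Under these assumptions, index 1 of the result for split size 3 is `3+3`, matching `splits[1]` in the log, and index 3 for split size 1 is `3+1`.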

2. Duplicate record:

Two input splits contain the same BZip2 records when the input split size is small.

With the split size set to 1, I tested loading 100 records (0, 1, 2, ..., 99):

2018-02-01 11:18:49,309 INFO [Thread-17] mapred.TestTextInputFormat (TestTextInputFormat.java:verifyPartitions(318)) - splits[3]=file /work/count-mismatch2/hadoop/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient/target/test-dir/TestTextInputFormat/test.bz2:3+1 count=99
2018-02-01 11:18:49,310 WARN [Thread-17] mapred.TestTextInputFormat (TestTextInputFormat.java:verifyPartitions(308)) - conflict with 1 in split 4 at position 8
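
Both symptoms can be detected by tracking which record values have been seen across all splits. The following is a sketch of that kind of check, assuming the records are the integers 0 to n-1. `RecordCoverage` and `check` are hypothetical names, and this is not the actual `verifyPartitions` code from `TestTextInputFormat`; the sample data in `main` is made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.BitSet;
import java.util.List;

// Hypothetical checker: reports records read more than once or not at all.
public class RecordCoverage {
    // Returns a list of problems; an empty list means every record
    // in 0..expectedCount-1 was read exactly once.
    public static List<String> check(List<List<Integer>> recordsPerSplit, int expectedCount) {
        BitSet seen = new BitSet(expectedCount);
        List<String> problems = new ArrayList<>();
        for (int split = 0; split < recordsPerSplit.size(); split++) {
            for (int value : recordsPerSplit.get(split)) {
                if (seen.get(value)) {
                    problems.add("duplicate " + value + " in split " + split);
                }
                seen.set(value);
            }
        }
        // Any bit still clear below expectedCount is a dropped record.
        for (int v = seen.nextClearBit(0); v < expectedCount; v = seen.nextClearBit(v + 1)) {
            problems.add("missing " + v);
        }
        return problems;
    }

    public static void main(String[] args) {
        // Made-up data: record 0 is never read, record 1 is read by two splits.
        List<List<Integer>> splits = Arrays.asList(Arrays.asList(1, 2), Arrays.asList(1, 3));
        System.out.println(check(splits, 4)); // prints [duplicate 1 in split 1, missing 0]
    }
}
```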

I experienced this error when I executed a Spark (SparkSQL) job under the following conditions:

- The input files are small (around 1 KB).
- The Hadoop cluster has many slave nodes (able to launch many executor tasks).

Attachments

- HADOOP-15206-test.patch, 01/Feb/18 20:44, 1 kB, Aki Tanaka
- HADOOP-15206.001.patch, 06/Feb/18 07:23, 2 kB, Aki Tanaka
- HADOOP-15206.002.patch, 06/Feb/18 18:08, 3 kB, Aki Tanaka
- HADOOP-15206.003.patch, 08/Feb/18 20:10, 3 kB, Aki Tanaka
- HADOOP-15206.004.patch, 11/Feb/18 06:06, 6 kB, Aki Tanaka
- HADOOP-15206.005.patch, 13/Feb/18 16:33, 6 kB, Aki Tanaka
- HADOOP-15206.006.patch, 14/Feb/18 18:57, 4 kB, Aki Tanaka
- HADOOP-15206.007.patch, 15/Feb/18 04:13, 3 kB, Aki Tanaka
- HADOOP-15206.008.patch, 15/Feb/18 18:26, 4 kB, Aki Tanaka

People

Assignee: Aki Tanaka
Reporter: Aki Tanaka
Votes: 0
Watchers: 14

Dates

Created: 01/Feb/18 20:43
Updated: 05/Apr/18 17:16
Resolved: 16/Feb/18 21:30