BZip2 can drop and duplicate records when the input split size is small. I confirmed that this issue happens when the input split size is between 1 byte and 4 bytes.
I am seeing the following two problems.
1. Dropped record:
BZip2 skips the first record in the input file when the input split size is small.
I set the split size to 3 bytes and tried to load 100 records (0, 1, 2, ..., 99).
> The input format read only 99 records instead of 100.
2. Duplicated record:
Two input splits contain the same BZip2 records when the input split size is small.
I set the split size to 1 byte and tried to load 100 records (0, 1, 2, ..., 99).
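For anyone who wants to reproduce the two cases, the test input can be generated with a short script. This is only a sketch: it assumes one record per line, and the file name `records.bz2` is hypothetical. It creates the compressed file and checks it outside Hadoop; it does not run the input format itself.

```python
import bz2
import os
import tempfile


def write_test_input(path, n_records=100):
    """Write the integers 0..n_records-1, one per line, as a bzip2 file.

    Returns the size of the compressed file in bytes.
    """
    payload = "".join(f"{i}\n" for i in range(n_records)).encode("ascii")
    with bz2.open(path, "wb") as f:
        f.write(payload)
    return os.path.getsize(path)


def read_records(path):
    """Read the file back without any splitting, as a reference result."""
    with bz2.open(path, "rt", encoding="ascii") as f:
        return [line.rstrip("\n") for line in f]


# Hypothetical location; any path readable by the job would do.
path = os.path.join(tempfile.mkdtemp(), "records.bz2")
compressed_size = write_test_input(path)
records = read_records(path)
print(f"compressed size: {compressed_size} bytes, records: {len(records)}")
```

Reading the file back without splitting gives the reference result (100 distinct records) to compare against what the input format returns with a split size of 1 or 3 bytes.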
I encountered this error when executing a Spark (SparkSQL) job under the following conditions:
- The input files are small (around 1 KB)
- The Hadoop cluster has many slave nodes (able to launch many executor tasks)
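One plausible way these two conditions lead to a split size of only a few bytes is sketched below. This is an assumption, not something confirmed in this report: it supposes the job goes through the old `org.apache.hadoop.mapred.FileInputFormat`, which derives the split size from the total input size divided by the requested number of splits. The function name, the default values, and the split counts are illustrative only.

```python
def compute_split_size(total_size, num_splits, min_size=1,
                       block_size=128 * 1024 * 1024):
    """Mirror of the old mapred split-size rule (assumed to apply here):
    goal = total_size / num_splits, then clamp between min_size and block_size.
    """
    goal_size = total_size // max(num_splits, 1)
    return max(min_size, min(goal_size, block_size))


# A 1 KB input with increasingly many requested splits (hypothetical counts).
for requested in (2, 400, 1000):
    size = compute_split_size(total_size=1024, num_splits=requested)
    print(f"requested splits={requested}: split size={size} bytes")
```

Under this assumption, a 1 KB file combined with a request for several hundred splits already falls into the 1 to 4 byte range where the problem was observed.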