  Flume / FLUME-2967

Corrupted gzip files generated when writing to S3


Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.6.0
    • Fix Version/s: None
    • Component/s: Sinks+Sources
    • Labels: None
    • Environment: Java(TM) SE Runtime Environment (build 1.8.0_72-b15)
      Java HotSpot(TM) 64-Bit Server VM (build 25.72-b15, mixed mode)

      Amazon Linux AMI release 2016.03
      4.1.17-22.30.amzn1.x86_64

    Description

      A Flume agent configured with the following parameters writes corrupt gzip files to AWS S3.

      Configuration

      #### SINKS ####
      #sink to write to S3
      a1.sinks.khdfs.type = hdfs
      a1.sinks.khdfs.hdfs.path = s3n://@logs.tigo.com/useractivity/%Y/%m/%d/p6-v2/
      a1.sinks.khdfs.hdfs.fileType = CompressedStream
      a1.sinks.khdfs.hdfs.codeC = gzip
      a1.sinks.khdfs.hdfs.filePrefix = useractivity
      a1.sinks.khdfs.hdfs.fileSuffix = .json.gz
      a1.sinks.khdfs.hdfs.writeFormat = Writable
      a1.sinks.khdfs.hdfs.rollCount = 100
      a1.sinks.khdfs.hdfs.rollSize = 0
      a1.sinks.khdfs.hdfs.callTimeout = 120000
      a1.sinks.khdfs.hdfs.batchSize = 1000
      a1.sinks.khdfs.hdfs.threadsPoolSize = 40
      a1.sinks.khdfs.hdfs.rollTimerPoolSize = 1
      a1.sinks.khdfs.channel = chdfs
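
      For reference, this is a minimal sketch (plain JDK code, not Flume code; the file name and record payload are placeholders) of what each rolled file should contain: a single gzip member whose decompressed payload is exactly 100 newline-delimited JSON records, matching hdfs.rollCount.

        import java.io.FileOutputStream;
        import java.io.OutputStreamWriter;
        import java.io.Writer;
        import java.nio.charset.StandardCharsets;
        import java.util.zip.GZIPOutputStream;

        // Sketch only: writes 100 newline-delimited JSON events into one gzip member.
        // Closing the writer finishes the gzip stream, so the trailer (CRC + length)
        // is written and the file decompresses cleanly.
        public class ExpectedRoll {
            public static void main(String[] args) throws Exception {
                try (Writer w = new OutputStreamWriter(
                        new GZIPOutputStream(new FileOutputStream("useractivity.expected.json.gz")),
                        StandardCharsets.UTF_8)) {
                    for (int i = 0; i < 100; i++) {   // matches hdfs.rollCount = 100
                        w.write("{\"event_id\":\"mta_login\",\"seq\":" + i + "}\n");
                    }
                }
            }
        }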
      

      The input is a simple JSON structure:

      {
        "origin": "Mi Tigo App sv",
        "date": "2016-08-05T14:26:10.859Z",
        "country": "SV",
        "action": "MI-TIGO-APP Header Enrichment",
        "msisdn": "76821107",
        "ip": "181.189.178.89",
        "useragent": "Mi Tigo  samsung zerofltedv SM-G920I 5.1.1 22 V: 31 (1.503.0.73)",
        "data": {
          "variables": "{\"!msisdn\":\"76821107\"}"
        },
        "event_id": "mta_login"
      }
      

      I use a combination of the HDFS sink and the following libraries in the plugins.d/hdfs/libext folder (laid out as sketched after the list):

        hdfs group: 'com.amazonaws', name: 'aws-java-sdk-s3', version: '1.10.72'
        hdfs group: 'org.apache.hadoop', name: 'hadoop-common', version: '2.5.2'
        hdfs group: 'org.apache.hadoop', name: 'hadoop-hdfs', version: '2.5.2'
        hdfs group: 'org.apache.hadoop', name: 'hadoop-annotations', version: '2.5.2'
        hdfs group: 'org.apache.hadoop', name: 'hadoop-auth', version: '2.5.2'
        hdfs group: 'org.apache.hadoop', name: 'hadoop-mapreduce-client-core', version: '2.5.2'
        hdfs group: 'org.apache.hadoop', name: 'hadoop-mapreduce-client-common', version: '2.5.2'
        hdfs group: 'org.apache.hadoop', name: 'hadoop-mapreduce-client-jobclient', version: '2.5.2'
        hdfs group: 'commons-configuration', name: 'commons-configuration', version: '1.10'
        hdfs group: 'net.java.dev.jets3t', name: 'jets3t', version: '0.9.4'
        hdfs group: 'org.apache.httpcomponents', name: 'httpclient', version: '4.5.2'
        hdfs group: 'org.apache.httpcomponents', name: 'httpcore', version: '4.4.5'
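
      Roughly, the jars are laid out following Flume's plugins.d convention (lib/, libext/ and native/ subdirectories per the Flume User Guide); the jar file names below are inferred from the coordinates above and the list is abbreviated:

        plugins.d/
          hdfs/
            lib/        (plugin jars; unused here)
            libext/     (the dependency jars listed above)
              aws-java-sdk-s3-1.10.72.jar
              hadoop-common-2.5.2.jar
              hadoop-hdfs-2.5.2.jar
              ...
              httpcore-4.4.5.jar
            native/     (native libraries; unused here)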
      

      I expect a file with 100 events, compressed in gzip format, to end up on S3, but the generated file is damaged:

      • the compressed file is larger than the content it contains
      • most tools fail to decompress the file, reporting that it is damaged
      • gzip -d forcefully decompresses it, though not without complaining about
        trailing garbage:
        gzip -d useractivity.1470407170478.json.gz
        gzip: useractivity.1470407170478.json.gz: decompression OK, trailing garbage ignored

      • last but not least, the file resulting from the forced decompression contains only one or two lines, where 100 are expected (a small check that reproduces this is sketched below).
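
      A minimal sketch of such a check, using the JDK's GZIPInputStream (file name taken from the example above; depending on how a file is damaged this may throw a ZipException instead of finishing):

        import java.io.BufferedReader;
        import java.io.FileInputStream;
        import java.io.InputStreamReader;
        import java.nio.charset.StandardCharsets;
        import java.util.zip.GZIPInputStream;

        // Sketch of a corruption check: count the newline-delimited records that
        // survive decompression. A healthy roll should yield 100 records; the
        // damaged files yield only one or two.
        public class CountGzipRecords {
            public static void main(String[] args) throws Exception {
                String path = args.length > 0 ? args[0] : "useractivity.1470407170478.json.gz";
                long records = 0;
                try (BufferedReader r = new BufferedReader(new InputStreamReader(
                        new GZIPInputStream(new FileInputStream(path)), StandardCharsets.UTF_8))) {
                    while (r.readLine() != null) {
                        records++;
                    }
                }
                System.out.println(path + ": " + records + " record(s)");
            }
        }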

      We tried the following, to no avail:

      • both Writable and Text file types
      • all options on controlling the file content by rolling: time, events, size
      • all combinations of recipes for writing to S3, including more than one set of libraries
      • all schemas (s3n, s3a)
      • not compressing; this generates the expected JSON files just fine.
      • vanilla Flume libraries
      • heavily replacing the Flume libraries with newer or different versions (just in case)
      • reading all available documentation

      We haven't tried:

      • installing Hadoop and referencing its libraries on the classpath (we want to avoid this; we are not running Hadoop on the Flume nodes)

      Attachments

        1. useractivity.1470406765436.json.gz (11 kB, Alberto Sarubbi)

            People

              Assignee: Unassigned
              Reporter: Alberto Sarubbi (asarubbi@gmail.com)
              Votes: 0
              Watchers: 2
