Details
Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 1.6.0
Fix Version/s: None
Component/s: None
Environment:
Java(TM) SE Runtime Environment (build 1.8.0_72-b15)
Java HotSpot(TM) 64-Bit Server VM (build 25.72-b15, mixed mode)
Amazon Linux AMI release 2016.03
4.1.17-22.30.amzn1.x86_64
Description
A Flume process configured with the following parameters writes corrupt gzip files to AWS S3.
Configuration
#### SINKS ####
#sink to write to S3
a1.sinks.khdfs.type = hdfs
a1.sinks.khdfs.hdfs.path = s3n://@logs.tigo.com/useractivity/%Y/%m/%d/p6-v2/
a1.sinks.khdfs.hdfs.fileType = CompressedStream
a1.sinks.khdfs.hdfs.codeC = gzip
a1.sinks.khdfs.hdfs.filePrefix = useractivity
a1.sinks.khdfs.hdfs.fileSuffix = .json.gz
a1.sinks.khdfs.hdfs.writeFormat = Writable
a1.sinks.khdfs.hdfs.rollCount = 100
a1.sinks.khdfs.hdfs.rollSize = 0
a1.sinks.khdfs.hdfs.callTimeout = 120000
a1.sinks.khdfs.hdfs.batchSize = 1000
a1.sinks.khdfs.hdfs.threadsPoolSize = 40
a1.sinks.khdfs.hdfs.rollTimerPoolSize = 1
a1.sinks.khdfs.channel = chdfs
The input is a simple JSON structure:
{ "origin": "Mi Tigo App sv", "date": "2016-08-05T14:26:10.859Z", "country": "SV", "action": "MI-TIGO-APP Header Enrichment", "msisdn": "76821107", "ip": "181.189.178.89", "useragent": "Mi Tigo samsung zerofltedv SM-G920I 5.1.1 22 V: 31 (1.503.0.73)", "data": { "variables": "{\"!msisdn\":\"76821107\"}" }, "event_id": "mta_login" }
I use the HDFS sink in combination with the following libraries in the plugins.d/hdfs/libext folder:
hdfs group: 'com.amazonaws', name: 'aws-java-sdk-s3', version: '1.10.72'
hdfs group: 'org.apache.hadoop', name: 'hadoop-common', version: '2.5.2'
hdfs group: 'org.apache.hadoop', name: 'hadoop-hdfs', version: '2.5.2'
hdfs group: 'org.apache.hadoop', name: 'hadoop-annotations', version: '2.5.2'
hdfs group: 'org.apache.hadoop', name: 'hadoop-auth', version: '2.5.2'
hdfs group: 'org.apache.hadoop', name: 'hadoop-mapreduce-client-core', version: '2.5.2'
hdfs group: 'org.apache.hadoop', name: 'hadoop-mapreduce-client-common', version: '2.5.2'
hdfs group: 'org.apache.hadoop', name: 'hadoop-mapreduce-client-jobclient', version: '2.5.2'
hdfs group: 'commons-configuration', name: 'commons-configuration', version: '1.10'
hdfs group: 'net.java.dev.jets3t', name: 'jets3t', version: '0.9.4'
hdfs group: 'org.apache.httpcomponents', name: 'httpclient', version: '4.5.2'
hdfs group: 'org.apache.httpcomponents', name: 'httpcore', version: '4.4.5'
I expect a file with 100 events, compressed in gzip format, to land on S3, but the generated file is damaged (see the verification sketch after this list):
- the compressed file is larger than the content it contains
- most tools fail to decompress the file, reporting that it is damaged
- gzip -d decompresses it when forced, but complains about extra trailing garbage:
  gzip -d useractivity.1470407170478.json.gz
  gzip: useractivity.1470407170478.json.gz: decompression OK, trailing garbage ignored
- last but not least, the file resulting from the forced decompression contains only one or two lines, where 100 are expected.
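The trailing garbage plus a partial batch being recovered would be consistent with the object containing several gzip members, of which standard tools only honour the first. Below is a minimal sketch (my own helper, not part of Flume or Hadoop) to inspect a file pulled back from S3: it counts occurrences of the gzip magic bytes and the number of lines java.util.zip can recover before stopping. The magic-byte scan is crude and can hit false positives inside compressed data.

import java.io.*;
import java.nio.file.*;
import java.util.zip.GZIPInputStream;

// Hypothetical helper to inspect a .gz file downloaded from S3.
public class GzipMemberCheck {
    public static void main(String[] args) throws IOException {
        byte[] raw = Files.readAllBytes(Paths.get(args[0]));

        // Count occurrences of the gzip magic bytes 0x1f 0x8b. More than one
        // suggests several gzip members were written into the same object.
        int headers = 0;
        for (int i = 0; i + 1 < raw.length; i++) {
            if ((raw[i] & 0xff) == 0x1f && (raw[i + 1] & 0xff) == 0x8b) {
                headers++;
            }
        }
        System.out.println("gzip headers found: " + headers);

        // Count how many lines java.util.zip recovers before it reaches the
        // point where gzip -d reports trailing garbage.
        long lines = 0;
        try (BufferedReader in = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(new ByteArrayInputStream(raw)), "UTF-8"))) {
            while (in.readLine() != null) {
                lines++;
            }
        } catch (IOException e) {
            System.out.println("stopped early: " + e.getMessage());
        }
        System.out.println("lines recovered: " + lines);
    }
}

Run against one of the corrupt objects, this should report a single-digit number of recovered lines while finding more than one gzip header, matching the gzip -d behaviour above.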
We tried, to no avail:
- both Writable and Text file types
- all options for controlling the file content by rolling: time, events, size
- all combinations of recipes for writing to S3, including more than one set of libraries
- all schemes (s3n, s3a)
- not compressing; this generates the expected JSON files just fine
- vanilla Flume libraries
- heavily replacing the Flume libraries with newer or different versions (just in case)
- reading all available documentation
We haven't tried:
- installing Hadoop and referencing its libraries on the classpath (we want to avoid this, since we are not running Hadoop on the Flume nodes)
Attachments
Issue Links
- is blocked by HADOOP-8522: ResetableGzipOutputStream creates invalid gzip files when finish() and resetState() are used (Resolved; see the sketch below)
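For context, the linked HADOOP-8522 describes the gzip codec's output stream producing an invalid file when finish() and resetState() are called on it, which appears to be the sequence the HDFS sink's compressed stream goes through between flushed batches. The following is my own repro sketch, assuming hadoop-common 2.5.2 on the classpath; on Hadoop versions affected by HADOOP-8522 the resulting file is expected to show the same trailing-garbage symptom.

import java.io.FileOutputStream;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionOutputStream;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.util.ReflectionUtils;

// Hypothetical repro of the finish()/resetState() call pattern from HADOOP-8522.
public class GzipResetRepro {
    public static void main(String[] args) throws Exception {
        GzipCodec codec = ReflectionUtils.newInstance(GzipCodec.class, new Configuration());
        try (FileOutputStream file = new FileOutputStream("repro.json.gz")) {
            CompressionOutputStream out = codec.createOutputStream(file);

            byte[] batch1 = "{\"event\":1}\n".getBytes(StandardCharsets.UTF_8);
            byte[] batch2 = "{\"event\":2}\n".getBytes(StandardCharsets.UTF_8);

            // First batch, followed by the flush sequence used between batches.
            out.write(batch1, 0, batch1.length);
            out.finish();
            out.resetState();

            // Second batch written after the reset.
            out.write(batch2, 0, batch2.length);
            out.finish();
            out.close();
        }
        // On affected Hadoop versions, "gzip -d repro.json.gz" is expected to
        // recover only the first batch and report trailing garbage, matching
        // the behaviour described in this issue.
    }
}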