  Flume / FLUME-2967

Corrupted gzip files generated when writing to S3


Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.6.0
    • Fix Version/s: None
    • Component/s: Sinks+Sources
    • Labels: None
    • Environment: Java(TM) SE Runtime Environment (build 1.8.0_72-b15)
      Java HotSpot(TM) 64-Bit Server VM (build 25.72-b15, mixed mode)

      Amazon Linux AMI release 2016.03
      4.1.17-22.30.amzn1.x86_64

    Description

      A Flume agent configured with the following parameters writes corrupt gzip files to AWS S3.

      Configuration

      #### SINKS ####
      #sink to write to S3
      a1.sinks.khdfs.type = hdfs
      a1.sinks.khdfs.hdfs.path = s3n://@logs.tigo.com/useractivity/%Y/%m/%d/p6-v2/
      a1.sinks.khdfs.hdfs.fileType = CompressedStream
      a1.sinks.khdfs.hdfs.codeC = gzip
      a1.sinks.khdfs.hdfs.filePrefix = useractivity
      a1.sinks.khdfs.hdfs.fileSuffix = .json.gz
      a1.sinks.khdfs.hdfs.writeFormat = Writable
      a1.sinks.khdfs.hdfs.rollCount = 100
      a1.sinks.khdfs.hdfs.rollSize = 0
      a1.sinks.khdfs.hdfs.callTimeout = 120000
      a1.sinks.khdfs.hdfs.batchSize = 1000
      a1.sinks.khdfs.hdfs.threadsPoolSize = 40
      a1.sinks.khdfs.hdfs.rollTimerPoolSize = 1
      a1.sinks.khdfs.channel = chdfs
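
      For reference, this is a minimal sketch (plain JDK code, not Flume code; the file name and record payload are placeholders) of what each rolled file should contain: a single gzip member whose decompressed payload is exactly 100 newline-delimited JSON records, matching hdfs.rollCount.

        import java.io.FileOutputStream;
        import java.io.OutputStreamWriter;
        import java.io.Writer;
        import java.nio.charset.StandardCharsets;
        import java.util.zip.GZIPOutputStream;

        // Sketch only: writes 100 newline-delimited JSON events into one gzip member.
        // Closing the writer finishes the gzip stream, so the trailer (CRC + length)
        // is written and the file decompresses cleanly.
        public class ExpectedRoll {
            public static void main(String[] args) throws Exception {
                try (Writer w = new OutputStreamWriter(
                        new GZIPOutputStream(new FileOutputStream("useractivity.expected.json.gz")),
                        StandardCharsets.UTF_8)) {
                    for (int i = 0; i < 100; i++) {   // matches hdfs.rollCount = 100
                        w.write("{\"event_id\":\"mta_login\",\"seq\":" + i + "}\n");
                    }
                }
            }
        }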
      

      The input is a simple JSON structure:

      {
        "origin": "Mi Tigo App sv",
        "date": "2016-08-05T14:26:10.859Z",
        "country": "SV",
        "action": "MI-TIGO-APP Header Enrichment",
        "msisdn": "76821107",
        "ip": "181.189.178.89",
        "useragent": "Mi Tigo  samsung zerofltedv SM-G920I 5.1.1 22 V: 31 (1.503.0.73)",
        "data": {
          "variables": "{\"!msisdn\":\"76821107\"}"
        },
        "event_id": "mta_login"
      }
      

      I use a combination of the HDFS sink and the following libraries in the plugins.d/hdfs/libext folder (laid out as sketched after the list):

        hdfs group: 'com.amazonaws', name: 'aws-java-sdk-s3', version: '1.10.72'
        hdfs group: 'org.apache.hadoop', name: 'hadoop-common', version: '2.5.2'
        hdfs group: 'org.apache.hadoop', name: 'hadoop-hdfs', version: '2.5.2'
        hdfs group: 'org.apache.hadoop', name: 'hadoop-annotations', version: '2.5.2'
        hdfs group: 'org.apache.hadoop', name: 'hadoop-auth', version: '2.5.2'
        hdfs group: 'org.apache.hadoop', name: 'hadoop-mapreduce-client-core', version: '2.5.2'
        hdfs group: 'org.apache.hadoop', name: 'hadoop-mapreduce-client-common', version: '2.5.2'
        hdfs group: 'org.apache.hadoop', name: 'hadoop-mapreduce-client-jobclient', version: '2.5.2'
        hdfs group: 'commons-configuration', name: 'commons-configuration', version: '1.10'
        hdfs group: 'net.java.dev.jets3t', name: 'jets3t', version: '0.9.4'
        hdfs group: 'org.apache.httpcomponents', name: 'httpclient', version: '4.5.2'
        hdfs group: 'org.apache.httpcomponents', name: 'httpcore', version: '4.4.5'
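
      Roughly, the jars are laid out following Flume's plugins.d convention (lib/, libext/ and native/ subdirectories per the Flume User Guide); the jar file names below are inferred from the coordinates above and the list is abbreviated:

        plugins.d/
          hdfs/
            lib/        (plugin jars; unused here)
            libext/     (the dependency jars listed above)
              aws-java-sdk-s3-1.10.72.jar
              hadoop-common-2.5.2.jar
              hadoop-hdfs-2.5.2.jar
              ...
              httpcore-4.4.5.jar
            native/     (native libraries; unused here)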
      

      I expect a file with 100 events, compressed in gzip format, to end up on S3, but the generated file is damaged:

      • the compressed file is larger than the content it contains
      • most tools fail to decompress the file, reporting that it is damaged
      • gzip -d forcefully decompresses it, though not without complaining about
        trailing garbage:
        gzip -d useractivity.1470407170478.json.gz
        gzip: useractivity.1470407170478.json.gz: decompression OK, trailing garbage ignored

      • last but not least, the file resulting from the forced decompression contains only one or two lines, where 100 are expected (a small check that reproduces this is sketched below).
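
      A minimal sketch of such a check, using the JDK's GZIPInputStream (file name taken from the example above; depending on how a file is damaged this may throw a ZipException instead of finishing):

        import java.io.BufferedReader;
        import java.io.FileInputStream;
        import java.io.InputStreamReader;
        import java.nio.charset.StandardCharsets;
        import java.util.zip.GZIPInputStream;

        // Sketch of a corruption check: count the newline-delimited records that
        // survive decompression. A healthy roll should yield 100 records; the
        // damaged files yield only one or two.
        public class CountGzipRecords {
            public static void main(String[] args) throws Exception {
                String path = args.length > 0 ? args[0] : "useractivity.1470407170478.json.gz";
                long records = 0;
                try (BufferedReader r = new BufferedReader(new InputStreamReader(
                        new GZIPInputStream(new FileInputStream(path)), StandardCharsets.UTF_8))) {
                    while (r.readLine() != null) {
                        records++;
                    }
                }
                System.out.println(path + ": " + records + " record(s)");
            }
        }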

      We tried the following, to no avail:

      • both Writable and Text file types
      • all options on controlling the file content by rolling: time, events, size
      • all combinations of recipes for writing to S3, including more than one set of libraries
      • all schemas (s3n, s3a)
      • not compressing; this generates the expected JSON files just fine.
      • vanilla Flume libraries
      • heavily replacing the Flume libraries with newer or different versions (just in case)
      • reading all available documentation

      We haven't tried:

      • installing Hadoop and referencing its libraries on the classpath (we want to avoid this; we are not running Hadoop on the Flume nodes)

      Attachments

        1. useractivity.1470406765436.json.gz (11 kB, Alberto Sarubbi)

            People

              Assignee: Unassigned
              Reporter: Alberto Sarubbi (asarubbi@gmail.com)
              Votes: 0
              Watchers: 2
