Apache Avro / AVRO-2203

Python avro module generates different bytes when writing a file to local storage and to S3


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Cannot Reproduce
    • Affects Version/s: 1.8.0
    • Fix Version/s: None
    • Component/s: python
    • Labels: None
    • Environment: S3, UNIX, HDFS, python

    Description

      Hi, 

      I am trying to convert a CSV file to Avro format and store it on S3 using Python. During this process, data is lost in the file written to S3. I confirmed this by converting both the local Avro file and the S3 Avro file to JSON and comparing their content and total line counts.
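
      The comparison described above can be sketched roughly as follows. This is a minimal illustration, assuming each Avro file has already been dumped to JSON with one record per line; the helper names (load_records, compare_dumps) and the sample data are hypothetical and are not taken from the linked test script.

```python
import json
from collections import Counter


def load_records(json_lines):
    """Parse one JSON record per non-empty line into a canonical string form."""
    records = []
    for line in json_lines.splitlines():
        line = line.strip()
        if line:
            # sort_keys makes the comparison independent of key order
            records.append(json.dumps(json.loads(line), sort_keys=True))
    return records


def compare_dumps(local_dump, s3_dump):
    """Return record counts and the records missing from the S3 dump."""
    local_counts = Counter(load_records(local_dump))
    s3_counts = Counter(load_records(s3_dump))
    # Counter subtraction keeps only records that occur more often locally
    missing = local_counts - s3_counts
    return {
        "local_total": sum(local_counts.values()),
        "s3_total": sum(s3_counts.values()),
        "missing_from_s3": sorted(missing.elements()),
    }


# Hypothetical sample data standing in for the two JSON dumps
local_dump = '{"id": 1, "name": "a"}\n{"id": 2, "name": "b"}\n{"id": 3, "name": "c"}\n'
s3_dump = '{"name": "a", "id": 1}\n{"id": 2, "name": "b"}\n'

report = compare_dumps(local_dump, s3_dump)
print(report["local_total"], report["s3_total"], report["missing_from_s3"])
```

      Comparing parsed records rather than raw text lines avoids false mismatches caused by key ordering or whitespace in the JSON output.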

      Further investigation shows that the Avro data generated when writing to local storage is not byte-for-byte identical to the Avro data generated when writing to S3.
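
      One way to narrow down where the two outputs diverge is sketched below. It assumes both files have been read fully into memory as bytes; the function names and the sample payloads are hypothetical. Note that each Avro container file carries a randomly generated 16-byte sync marker, so two files written separately are not expected to be byte-identical even when they hold the same records; a size mismatch or a truncated tail is the more telling signal.

```python
import hashlib


def first_difference(a, b):
    """Return the offset of the first differing byte, or None if identical."""
    for offset, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return offset
    if len(a) != len(b):
        # One input is a strict prefix of the other (e.g. a truncated file)
        return min(len(a), len(b))
    return None


def summarize(local_bytes, s3_bytes):
    """Collect sizes, digests, and the first point of divergence."""
    return {
        "local_size": len(local_bytes),
        "s3_size": len(s3_bytes),
        "local_sha256": hashlib.sha256(local_bytes).hexdigest(),
        "s3_sha256": hashlib.sha256(s3_bytes).hexdigest(),
        "first_difference": first_difference(local_bytes, s3_bytes),
    }


# Hypothetical payloads; b"Obj\x01" is the magic prefix of an Avro container file
local_bytes = b"Obj\x01" + b"header" + b"block-1" + b"block-2"
s3_bytes = b"Obj\x01" + b"header" + b"block-1"

summary = summarize(local_bytes, s3_bytes)
print(summary["local_size"], summary["s3_size"], summary["first_difference"])
```

      If the first difference falls exactly at the end of the shorter file, the shorter file is a truncated copy, which points toward buffered data that was never flushed rather than toward differently encoded records.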

      I suspect the issue lies in obtaining the writer object via DatumWriter:

      writer = avro.datafile.DataFileWriter(<fileobject>, avro.io.DatumWriter(), schema)

      The exact code is available on GitHub at the link below:

      https://github.com/mpenkov/smart_open/blob/209/integration-tests/test_209.py

      Could you please help solve this issue?

       

      Thanks

      Vinuthna

       


          People

            Assignee: Unassigned
            Reporter: Vinuthna (vinuthna91)
            Votes: 0
            Watchers: 3
