Apache Gobblin / GOBBLIN-106

HiveSerDeConverter with serde.serializer.type=ORC requires fork.record.queue.capacity=1

Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved

    Description

      When using the `HiveSerDeConverter` to convert data to ORC format, `fork.record.queue.capacity` must be set to `1` to avoid dropping some records and writing others more than once.
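As a minimal fragment, the workaround comes down to the two job properties named in the issue title. The rest of the job configuration (source, converter, and writer settings) is omitted here and would need to be filled in for a real job.

```properties
# Convert records to ORC through the Hive SerDe converter
serde.serializer.type=ORC
# Workaround from this issue: keep at most one record in the fork's buffer
fork.record.queue.capacity=1
```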

      The cause is that Hive's `OrcSerde` reuses its output object: the SerDe holds a single `OrcSerdeRow` (basically an ORC row) and returns that same instance from every call to `serialize()`. A few other SerDes do this too, such as `AvroSerDe`. The documentation for `AbstractSerDe.serialize()` states that this is the expected behavior, and that callers who need to keep the returned object should make a copy of it.
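The effect of a reused output object can be sketched without any Hive dependency. In the example below, `Row`, `ReusingSerializer`, and `bufferThenWrite` are hypothetical stand-ins invented for illustration; they are not Hive or Gobblin classes. The sketch only assumes the behavior described above: one cached row, overwritten on every `serialize()` call.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Queue;

/** Hypothetical model of a serializer that reuses its output object. */
class ReusedRowDemo {

  /** Mutable row, standing in for the single cached row object. */
  static final class Row {
    String value;

    Row copy() {
      Row r = new Row();
      r.value = value;
      return r;
    }
  }

  /** Returns the same Row instance from every call, overwritten in place. */
  static final class ReusingSerializer {
    private final Row cached = new Row();

    Row serialize(String record) {
      cached.value = record;
      return cached;
    }
  }

  /** Serializes every record into a buffer first, then "writes" the buffer. */
  static List<String> bufferThenWrite(List<String> records, boolean copyBeforeBuffering) {
    ReusingSerializer serde = new ReusingSerializer();
    Queue<Row> buffer = new ArrayDeque<>();
    for (String record : records) {
      Row row = serde.serialize(record);
      // Without a copy, every buffer slot points at the same object.
      buffer.add(copyBeforeBuffering ? row.copy() : row);
    }
    List<String> written = new ArrayList<>();
    for (Row row : buffer) {
      written.add(row.value);
    }
    return written;
  }

  public static void main(String[] args) {
    List<String> input = Arrays.asList("a", "b", "c");
    System.out.println(bufferThenWrite(input, false)); // [c, c, c]
    System.out.println(bufferThenWrite(input, true));  // [a, b, c]
  }
}
```

Without the copy, "a" and "b" are lost and "c" is written three times, which matches the dropped-plus-duplicated symptom. The copying variant is only possible here because the hypothetical `Row` exposes `copy()`.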

      The complication is that `OrcSerdeRow` is package-private and has no public constructor, so no copy can be made. This would be fine if the `OrcSerdeRow` were written out immediately, but every Gobblin job has a buffer that the `Writer` reads from. While a row sits in that buffer, the next `serialize()` call can overwrite it, which is the race that produces dropped and duplicated records. The only workaround is to set the buffer queue size to 1 so that each record is written immediately.
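Why the buffer depth matters can be illustrated with a deterministic, single-threaded model. This is a sketch under simplifying assumptions, not Gobblin's actual fork logic: the hypothetical `run` method fills a queue up to `capacity` with references to one shared row and only then drains it to the "writer". A real job is concurrent, so its interleavings are not this tidy.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Queue;

/** Hypothetical model: one shared row, a bounded buffer, and a writer. */
class QueueCapacityDemo {

  static final class Row {
    String value;
  }

  /** Buffers up to `capacity` references to the shared row before writing. */
  static List<String> run(List<String> records, int capacity) {
    Row cached = new Row();
    Queue<Row> queue = new ArrayDeque<>();
    List<String> written = new ArrayList<>();
    for (String record : records) {
      cached.value = record; // "serialize": overwrite the shared row
      queue.add(cached);     // buffer a reference, not a copy
      if (queue.size() == capacity) {
        drain(queue, written);
      }
    }
    drain(queue, written);   // flush whatever is left
    return written;
  }

  /** "Writes" every buffered row by recording its current value. */
  static void drain(Queue<Row> queue, List<String> written) {
    Row row;
    while ((row = queue.poll()) != null) {
      written.add(row.value);
    }
  }

  public static void main(String[] args) {
    List<String> input = Arrays.asList("a", "b", "c", "d");
    System.out.println(run(input, 1)); // [a, b, c, d]
    System.out.println(run(input, 2)); // [b, b, d, d]
  }
}
```

In this model a capacity of 1 forces each row to be written before the next overwrite, while a capacity of 2 lets one overwrite land before each write. The model only shows why a deeper buffer widens the window; it does not show that a capacity of 1 closes it in a multi-threaded job.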

      For ORC, data is written out by Hive's `OrcOutputFormat`. It occurred to me that the ORC `Writer` might also buffer records in memory, but I traced that logic, and for reasons too involved to explain in a GitHub issue, this isn't a concern.

      I confirmed this bug locally: I hit cases where writing to ORC wrote some records more than once and dropped others.

      This is documented in the Writing to ORC Guide: http://gobblin.readthedocs.io/en/latest/case-studies/Writing-ORC-Data/

      Github Url : https://github.com/linkedin/gobblin/issues/1007
      Github Reporter : stakiar
      Github Created At : 2016-05-20T19:20:34Z
      Github Updated At : 2016-08-18T00:34:52Z

      Comments


      jeffwang66 wrote on 2016-08-18T00:34:52Z : Hi, Sahil:

      I used `HiveSerDeConverter` and `HiveWritableHdfsDataWriterBuilder` to write records to ORC files. I noticed that some records are duplicated and others are missing even though I set `fork.record.queue.capacity=1`. Is there any plan to fix this issue?

      Thanks

      Github Url : https://github.com/linkedin/gobblin/issues/1007#issuecomment-240591042

          People

            Assignee: Unassigned
            Reporter: stakiar (Sahil Takiar)