Details
Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Description
When using the `HiveSerDeConverter` to convert data to ORC format, `fork.record.queue.capacity` must be set to `1` to avoid dropping some records and writing others more than once.
The root cause is that Hive's `OrcSerde` caches converted records: the SerDe holds a single `OrcSerdeRow` (basically an ORC row) and re-uses that object every time `serialize()` is called. A few other SerDes do this too, such as `AvroSerDe`. The `AbstractSerDe.serialize()` method actually states that this is the expected behavior, and that users should make a copy of the object returned by `serialize()`.
However, `OrcSerdeRow` is package-private and has no public constructor, so no copy can be made. This would be fine if we wrote out the `OrcSerdeRow` immediately, but all Gobblin jobs have a buffer that the `Writer` reads from. Because every buffered entry refers to the same re-used object, this buffering can cause race conditions where records get dropped. The only workaround is to set the buffer queue size to 1 so that records are written immediately.
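The aliasing described above can be reproduced without Hive or Gobblin. The following is a minimal single-threaded sketch, not the actual Gobblin or Hive code: `Row`, `ReusingSerializer`, `writeBuffered`, and `writeImmediately` are hypothetical stand-ins for `OrcSerdeRow`, `OrcSerde`, and the buffered and unbuffered write paths. The real failure is a race between threads, so its output is less predictable than what this shows.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Queue;

public class ReusedRowDemo {

    // Hypothetical stand-in for OrcSerdeRow: a mutable holder that cannot be copied by callers.
    static final class Row {
        String value;
    }

    // Hypothetical stand-in for a SerDe that re-uses one output object across calls.
    static final class ReusingSerializer {
        private final Row cached = new Row();

        Row serialize(String record) {
            cached.value = record; // overwrites the previous record in place
            return cached;         // the same reference is returned every time
        }
    }

    // Enqueue every record before the writer drains the buffer (queue capacity > 1).
    static List<String> writeBuffered(List<String> records) {
        ReusingSerializer serde = new ReusingSerializer();
        Queue<Row> buffer = new ArrayDeque<>();
        for (String record : records) {
            buffer.add(serde.serialize(record));
        }
        List<String> written = new ArrayList<>();
        while (!buffer.isEmpty()) {
            written.add(buffer.remove().value);
        }
        return written;
    }

    // Drain after each record, which is what a queue capacity of 1 approximates.
    static List<String> writeImmediately(List<String> records) {
        ReusingSerializer serde = new ReusingSerializer();
        Queue<Row> buffer = new ArrayDeque<>();
        List<String> written = new ArrayList<>();
        for (String record : records) {
            buffer.add(serde.serialize(record));
            written.add(buffer.remove().value);
        }
        return written;
    }

    public static void main(String[] args) {
        List<String> records = Arrays.asList("a", "b", "c");
        System.out.println(writeBuffered(records));    // [c, c, c]
        System.out.println(writeImmediately(records)); // [a, b, c]
    }
}
```

In the buffered case all three queue entries point at the same object, so the last record is written three times and the first two are lost. With a concurrent reader the mix of duplicated and dropped records depends on timing.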
For ORC, data is written out by Hive's `OrcOutputFormat`. It occurred to me that the ORC `Writer` might also buffer records in memory, but I traced that logic and it isn't a concern, for reasons too involved to explain in a GitHub issue.
I confirmed this bug locally: I hit cases where writing to ORC wrote duplicate records and dropped others.
This is documented in the Writing to ORC Guide: http://gobblin.readthedocs.io/en/latest/case-studies/Writing-ORC-Data/
Github Url : https://github.com/linkedin/gobblin/issues/1007
Github Reporter : stakiar
Github Created At : 2016-05-20T19:20:34Z
Github Updated At : 2016-08-18T00:34:52Z
Comments
jeffwang66 wrote on 2016-08-18T00:34:52Z : Hi, Sahil:
I used HiveSerDeConverter and HiveWritableHdfsDataWriterBuilder to write records to ORC files. I noticed that some records are duplicated and some others are missing even though I set fork.record.queue.capacity=1. Is there any plan to fix this issue?
Thanks
Github Url : https://github.com/linkedin/gobblin/issues/1007#issuecomment-240591042