Hadoop Common / HADOOP-10669

Avro serialization does not flush buffered serialized values, causing data loss

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 2.4.0
    • Fix Version/s: None
    • Component/s: io
    • Labels: None
    • Tags: avro, serialization

    Description

      Found this while debugging Nutch.

      MapTask serializes keys and values to the same stream, in pairs:

      keySerializer.serialize(key);
      // ...
      valSerializer.serialize(value);
      // ...
      bb.write(b0, 0, 0);  // marks the record as complete

      AvroSerializer does not flush its buffer after each serialization. So if it is used as the valSerializer, values are only partially written, or not written at all, to the output stream before the record is marked as complete (the last line above).
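
      To make the failure mode concrete, here is a minimal sketch of the buffered pattern and the flush that would fix it, assuming a serializer built on Avro's EncoderFactory (the class name BufferedAvroSerializer is illustrative, not the actual Hadoop class):

      import java.io.IOException;
      import java.io.OutputStream;

      import org.apache.avro.io.BinaryEncoder;
      import org.apache.avro.io.DatumWriter;
      import org.apache.avro.io.EncoderFactory;

      // Illustrative sketch. binaryEncoder() returns a buffered encoder,
      // so serialized bytes sit in its internal buffer until flush() is
      // called. Without the flush below, MapTask can mark the record
      // complete before the value bytes reach the underlying stream.
      class BufferedAvroSerializer<T> {
        private final DatumWriter<T> writer;
        private BinaryEncoder encoder;

        BufferedAvroSerializer(DatumWriter<T> writer) {
          this.writer = writer;
        }

        void open(OutputStream out) {
          encoder = EncoderFactory.get().binaryEncoder(out, encoder);
        }

        void serialize(T t) throws IOException {
          writer.write(t, encoder);
          encoder.flush(); // the step the buffered serializer is missing
        }
      }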

      <EDIT> Added HADOOP-10669_alt.patch. This is a less intrusive fix, as it does not try to flush the MapTask stream. Instead, serialized values are written directly to the MapTask stream, avoiding the buffer on the Avro side.
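
      A minimal sketch of that alternative, assuming it amounts to using Avro's directBinaryEncoder(), which performs no internal buffering (the class name DirectAvroSerializer is again illustrative):

      import java.io.IOException;
      import java.io.OutputStream;

      import org.apache.avro.io.BinaryEncoder;
      import org.apache.avro.io.DatumWriter;
      import org.apache.avro.io.EncoderFactory;

      // Illustrative sketch. A direct (unbuffered) encoder writes each
      // serialized value straight to the MapTask stream, so there is no
      // Avro-side buffer left to flush when the record is finalized.
      class DirectAvroSerializer<T> {
        private final DatumWriter<T> writer;
        private BinaryEncoder encoder;

        DirectAvroSerializer(DatumWriter<T> writer) {
          this.writer = writer;
        }

        void open(OutputStream out) {
          encoder = EncoderFactory.get().directBinaryEncoder(out, encoder);
        }

        void serialize(T t) throws IOException {
          writer.write(t, encoder); // bytes reach the stream immediately
        }
      }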

    Attachments

      1. HADOOP-10669.patch
        0.7 kB
        Mikhail Bernadsky
      2. HADOOP-10669_alt.patch
        0.8 kB
        Mikhail Bernadsky

    Activity

      • Mikhail Bernadsky created issue
      • Mikhail Bernadsky made changes: added attachment HADOOP-10669.patch [ 12648854 ]
      • Mikhail Bernadsky made changes: added attachment HADOOP-10669_alt.patch [ 12648898 ]
      • Mikhail Bernadsky made changes: edited the description (current version shown above)

    People

      • Assignee: Unassigned
      • Reporter: Mikhail Bernadsky
      • Votes: 0
      • Watchers: 3
