  Pig / PIG-3655

BinStorage and InterStorage approach to record markers is broken

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.2.0, 0.3.0, 0.4.0, 0.5.0, 0.6.0, 0.7.0, 0.8.0, 0.8.1, 0.9.0, 0.9.1, 0.9.2, 0.10.0, 0.11, 0.10.1, 0.12.0, 0.11.1
    • Fix Version/s: 0.18.0
    • Component/s: None
    • Labels: None

    Description

      The record readers for these storage formats seek to the first record in an input split by scanning for the byte sequence 1 2 3 110 (BinStorage) or 1 2 3 followed by a byte in the ranges 19-21, 28-30, or 36-45 (InterStorage). If this sequence occurs in the data for any reason other than to mark the start of a tuple (for example, the integer 16909166 stored big endian encodes to exactly the BinStorage sequence), it can cause mysterious failures in Pig jobs, because the record reader will try to decode garbage and fail.
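
      The collision in the example can be checked directly. The following is a minimal sketch, not Pig code: the class and method names are made up for illustration, and the marker bytes are simply the ones quoted in this description.

```java
import java.nio.ByteBuffer;

final class MarkerCollisionDemo {
    // BinStorage marker bytes as quoted in the description (not read from Pig's source).
    static final byte[] BIN_STORAGE_MARKER = {1, 2, 3, 110};

    // Encode an int the way a big-endian writer would (ByteBuffer is big endian by default).
    static byte[] bigEndian(int value) {
        return ByteBuffer.allocate(4).putInt(value).array();
    }

    // Return the first offset at which needle occurs in haystack, or -1 if absent.
    static int indexOf(byte[] haystack, byte[] needle) {
        for (int i = 0; i + needle.length <= haystack.length; i++) {
            boolean match = true;
            for (int j = 0; j < needle.length; j++) {
                if (haystack[i + j] != needle[j]) {
                    match = false;
                    break;
                }
            }
            if (match) {
                return i;
            }
        }
        return -1;
    }
}
```

      Here bigEndian(16909166) yields the bytes 1 2 3 110 (0x01 0x02 0x03 0x6E), so a scanner that looks only for those four bytes cannot tell this ordinary field value apart from a record boundary.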

      When an unlikely sequence is used to mark record boundaries, it is important to reduce the probability of the sequence occurring naturally in the data by making the record marker sufficiently long. Hadoop SequenceFile uses 128 bits for this and randomly generates the sequence for each file (a fixed, predetermined value opens up the possibility of a malicious person intentionally sending you that value). This makes collisions extremely unlikely. In the long run I think that Pig should also be doing this.
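
      The per-file random marker idea could look roughly like the sketch below. It is an illustration only: the file layout (marker stored once as a header, then repeated before each record) and all names are assumptions, and it is neither Hadoop's SequenceFile format nor the fix that was eventually committed.

```java
import java.io.ByteArrayOutputStream;
import java.security.SecureRandom;

final class RandomSyncMarkerSketch {
    static final int MARKER_LENGTH = 16; // 128 bits

    // Generate a fresh marker for each file so no fixed value can be targeted.
    static byte[] newMarker() {
        byte[] marker = new byte[MARKER_LENGTH];
        new SecureRandom().nextBytes(marker);
        return marker;
    }

    // Assumed layout: [marker as header] then [marker][record bytes] per record.
    static byte[] writeFile(byte[] marker, byte[][] records) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(marker, 0, marker.length);
        for (byte[] record : records) {
            out.write(marker, 0, marker.length);
            out.write(record, 0, record.length);
        }
        return out.toByteArray();
    }

    // Scan forward from 'from' for the marker stored in the header and return
    // the offset of the first byte after it, or -1 if no marker is found.
    static int nextRecordStart(byte[] file, int from) {
        int start = Math.max(from, MARKER_LENGTH); // never match the header itself
        for (int i = start; i + MARKER_LENGTH <= file.length; i++) {
            boolean match = true;
            for (int j = 0; j < MARKER_LENGTH; j++) {
                if (file[i + j] != file[j]) {
                    match = false;
                    break;
                }
            }
            if (match) {
                return i + MARKER_LENGTH;
            }
        }
        return -1;
    }
}
```

      With a uniformly random 128-bit marker, a given 16-byte window of data matches with probability 2^-128, compared with 2^-32 for a four-byte marker, and the writer of the data cannot know the marker in advance.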

      As a quick fix, it might be good to save the current position in the file before entering readDatum and, if an exception is thrown, seek back to the saved position and resume searching for the next record marker.
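
      A rough sketch of that quick fix over an in-memory buffer follows. The readDatum here is a hypothetical stand-in with an invented record format (a tag byte 0x7F, a length byte, then the payload), not Pig's real decoder, and the marker bytes are the ones quoted above.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

final class ResyncOnFailureSketch {
    static final byte[] MARKER = {1, 2, 3, 110};

    // Hypothetical decoder: tag byte 0x7F, one length byte, then the payload.
    // Throws a RuntimeException on anything it cannot decode.
    static byte[] readDatum(ByteBuffer in) {
        if (in.get() != 0x7F) {
            throw new IllegalStateException("unknown type tag");
        }
        int length = in.get();
        if (length < 0) {
            throw new IllegalStateException("negative length");
        }
        byte[] payload = new byte[length];
        in.get(payload); // BufferUnderflowException if the record is truncated
        return payload;
    }

    // Advance to just past the next marker; return false if none remains.
    static boolean seekPastMarker(ByteBuffer in) {
        while (in.remaining() >= MARKER.length) {
            int p = in.position();
            boolean match = true;
            for (int j = 0; j < MARKER.length; j++) {
                if (in.get(p + j) != MARKER[j]) {
                    match = false;
                    break;
                }
            }
            if (match) {
                in.position(p + MARKER.length);
                return true;
            }
            in.position(p + 1);
        }
        return false;
    }

    static List<byte[]> readAll(ByteBuffer in) {
        List<byte[]> records = new ArrayList<>();
        while (seekPastMarker(in)) {
            int saved = in.position(); // remember where decoding started
            try {
                records.add(readDatum(in));
            } catch (RuntimeException e) {
                // Treat the marker as a false positive: rewind and keep scanning.
                in.position(saved);
            }
        }
        return records;
    }
}
```

      This only recovers from false markers whose following bytes fail to decode. A false marker followed by bytes that happen to decode cleanly would still produce a bogus record, which is why the longer random marker is the better long-term fix.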

      Attachments

        1. PIG-3655.0.patch
          12 kB
          Ádám Szita
        2. PIG-3655.1.patch
          17 kB
          Ádám Szita
        3. PIG-3655.2.patch
          16 kB
          Ádám Szita
        4. PIG-3655.3.patch
          23 kB
          Ádám Szita
        5. PIG-3655.4.patch
          26 kB
          Ádám Szita
        6. PIG-3655.5.patch
          27 kB
          Ádám Szita
        7. PIG-3655.sparkNulls.2.patch
          7 kB
          Ádám Szita
        8. PIG-3655.sparkNulls.patch
          0.8 kB
          Ádám Szita

            People

              Assignee: Ádám Szita (szita)
              Reporter: Jeff Plaisance (jeffplaisance)
              Votes: 0
              Watchers: 11