Avro / AVRO-1530

Java DataFileStream does not allow distinguishing between empty files and corrupt files

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

      Description

      When writing data to HDFS, especially with Flume, it is possible to end up with empty (zero-length) files. Hive queries over such data then fail with "Not a data file.", which is raised here: https://github.com/apache/avro/blob/trunk/lang/java/avro/src/main/java/org/apache/avro/file/DataFileStream.java#L102

          Activity

          Transition: Open -> Resolved
          Time in source status: 7d 37m
          Execution times: 1
          Last executer: Brock Noland
          Last execution date: 30/Jun/14 21:31
          Brock Noland made changes -
          Status: Open [ 1 ] -> Resolved [ 5 ]
          Resolution: Won't Fix [ 2 ]
          Brock Noland added a comment -

          Let's take this forward in HIVE-7316. We'll re-open this if needed.

          Brock Noland made changes -
          Link (added): This issue is related to HIVE-7316
          Brock Noland added a comment -

          I was thinking that we'd throw a subclass of IOException so that'd be backwards compatible. Though, I do agree that clients could ignore zero length files.
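The subclass idea in this comment can be illustrated with a small JDK-only sketch. `EmptyDataFileException`, `CompatDemo`, `open`, `legacyCaller` and `awareCaller` are hypothetical names made up for this example; nothing here is Avro's actual API, and `open` only simulates a reader that is handed a file of a given length.

```java
import java.io.IOException;

// Hypothetical exception type for illustration; not a class known to exist in Avro.
class EmptyDataFileException extends IOException {
    EmptyDataFileException(String message) {
        super(message);
    }
}

final class CompatDemo {
    // Simulated reader entry point: a zero-length input raises the subclass.
    static void open(long length) throws IOException {
        if (length == 0) {
            throw new EmptyDataFileException("Not a data file.");
        }
        // A real reader would go on to parse the header here.
    }

    // Existing-style caller: catches IOException only, so it still sees a failure.
    static String legacyCaller(long length) {
        try {
            open(length);
            return "opened";
        } catch (IOException e) {
            return "failed: " + e.getMessage();
        }
    }

    // Updated caller: handles the more specific type first and skips empty input.
    static String awareCaller(long length) {
        try {
            open(length);
            return "opened";
        } catch (EmptyDataFileException e) {
            return "skipped empty file";
        } catch (IOException e) {
            return "failed: " + e.getMessage();
        }
    }
}
```

Code that catches `IOException` keeps working unchanged, while code that inspects the exact exception class would observe a difference, so whether this counts as compatible depends on how callers use the exception.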

          Doug Cutting added a comment -

          That would be an incompatible change. Some folks might rely on the current behaviour.

          One can detect an empty file by looking at its length. No valid avro data file will ever be empty.
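A client-side length check of the kind described here might look like the sketch below. It uses the local filesystem through `java.nio.file` so that it is self-contained; `EmptyFileFilter` and `isEmpty` are hypothetical names. On HDFS the same idea would presumably go through Hadoop's `FileStatus.getLen()` instead.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper: decide from the length alone whether a file should be
// skipped before it is handed to an Avro reader.
final class EmptyFileFilter {
    /** Returns true when the file has zero bytes. */
    static boolean isEmpty(Path file) {
        try {
            return Files.size(file) == 0L;
        } catch (IOException e) {
            // Surface filesystem errors unchanged in meaning, just unchecked.
            throw new UncheckedIOException(e);
        }
    }
}
```

A caller would skip any file for which `isEmpty` returns true and treat a "Not a data file." failure on a non-empty file as corruption.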

          Brock Noland added a comment -

          I'd propose that, when reading the header, EOF and IOException be treated differently. For example, the EOFException should be propagated so that upstream users, e.g. Hive, can distinguish between the two errors.
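As a rough illustration of this proposal, the sketch below reads the four magic bytes of an Avro container file ('O', 'b', 'j', 1) with plain JDK classes and lets `EOFException` propagate instead of rewrapping it. `HeaderProbe`, `checkMagic` and `classify` are hypothetical names; this is not how `DataFileStream` is implemented, only one way the two failure modes could be kept apart.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

// Hypothetical helper, not part of Avro.
final class HeaderProbe {
    // Avro object container files start with 'O', 'b', 'j', 1.
    private static final byte[] MAGIC = {(byte) 'O', (byte) 'b', (byte) 'j', (byte) 1};

    /**
     * Throws EOFException if the stream ends before the magic is complete,
     * and a plain IOException if four bytes are present but do not match.
     */
    static void checkMagic(InputStream in) throws IOException {
        byte[] magic = new byte[MAGIC.length];
        // readFully throws EOFException on a short read; it is deliberately
        // not caught here so that callers can see it.
        new DataInputStream(in).readFully(magic);
        if (!Arrays.equals(MAGIC, magic)) {
            throw new IOException("Not a data file.");
        }
    }

    /** Convenience wrapper that names the case a byte array falls into. */
    static String classify(byte[] content) {
        try {
            checkMagic(new ByteArrayInputStream(content));
            return "ok";
        } catch (EOFException e) {
            return "truncated-or-empty";
        } catch (IOException e) {
            return "bad-magic";
        }
    }
}
```

Because `EOFException` is itself a subclass of `IOException`, the order of the catch clauses in `classify` matters: the more specific type has to come first.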

          Brock Noland created issue -

            People

            • Assignee: Unassigned
            • Reporter: Brock Noland
            • Votes: 0
            • Watchers: 2
