Apache Avro / AVRO-3482

DataFileReader should reuse MAGIC data read from inputstream

Details

    Description

      https://github.com/apache/avro/blob/master/lang/java/avro/src/main/java/org/apache/avro/file/DataFileReader.java#L60-L72

       

      byte[] magic = new byte[MAGIC.length];
      in.seek(0);
      int offset = 0;
      int length = magic.length;
      while (length > 0) {
        int bytesRead = in.read(magic, offset, length);
        if (bytesRead < 0)
          throw new EOFException("Unexpected EOF with " + length + " bytes remaining to read");

        length -= bytesRead;
        offset += bytesRead;
      }
      // NOTE: this seek forces the input stream to switch to the "random" IO policy
      // on the next read in cloud connectors!
      in.seek(0);

      if (Arrays.equals(MAGIC, magic)) // current format
        return new DataFileReader<>(in, reader);
      if (Arrays.equals(DataFileReader12.MAGIC, magic)) // 1.2 format
        return new DataFileReader12<>(in, reader);
      
       
      

       

      With cloud stores, this can be expensive, because the cloud connector (e.g. S3) has to close and reopen the stream.

      It would be helpful to reuse the MAGIC bytes already read from the input stream and pass them on to DataFileReader / DataFileReader12. This would allow the file to be read sequentially in cloud stores and reduce the number of IO calls.
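
      A minimal sketch of the idea, not the actual Avro change: the magic bytes are read once from the start of the stream, the format is chosen from that buffer, and the caller keeps reading from the current position instead of calling seek(0) again. The class MagicReuseSketch, the Format enum, and the readMagic / detect helpers are hypothetical names used only for illustration. A plain java.io.InputStream stands in for Avro's SeekableInput, and the 1.2 magic value is an assumption.

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

// Hypothetical illustration only; none of these names exist in Avro.
final class MagicReuseSketch {
  enum Format { CURRENT, V1_2, UNKNOWN }

  // Current container magic per the Avro spec: 'O', 'b', 'j', 1.
  static final byte[] MAGIC = { 'O', 'b', 'j', 1 };
  // Assumed value for the legacy 1.2 format, used here for illustration.
  static final byte[] MAGIC_12 = { 'O', 'b', 'j', 0 };

  // Reads exactly MAGIC.length bytes from the current position.
  // The stream is left positioned right after the magic; there is no seek back.
  static byte[] readMagic(InputStream in) throws IOException {
    byte[] magic = new byte[MAGIC.length];
    int offset = 0;
    int length = magic.length;
    while (length > 0) {
      int bytesRead = in.read(magic, offset, length);
      if (bytesRead < 0)
        throw new EOFException("Unexpected EOF with " + length + " bytes remaining to read");
      length -= bytesRead;
      offset += bytesRead;
    }
    return magic;
  }

  // Chooses the format from the bytes already read, so the reader that is
  // constructed next can be handed the buffer instead of re-reading the header.
  static Format detect(byte[] magic) {
    if (Arrays.equals(MAGIC, magic))
      return Format.CURRENT;
    if (Arrays.equals(MAGIC_12, magic))
      return Format.V1_2;
    return Format.UNKNOWN;
  }
}
```

      In this sketch the reader selected by detect would receive both the stream and the magic buffer, validate the buffer, and continue with the header metadata from the current offset, so the stream is only ever read forward.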

            People

              rajesh.balamohan Rajesh Balamohan

                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 3.5h