  1. ORC
  2. ORC-1075

Support reading ORC files with no column statistics


Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.8.0
    • Fix Version/s: 1.8.0
    • Component/s: Java, Reader
    • Labels: None

    Description

      I have attached an ORC file whose RowIndex entries do not seem to include ColumnStatistics.

      According to the ORC spec, the statistics field of RowIndexEntry appears to be optional rather than required:

       

      message RowIndexEntry {
        repeated uint64 positions = 1 [packed=true];
        optional ColumnStatistics statistics = 2;
      }
      message RowIndex {
        repeated RowIndexEntry entry = 1;
      }
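
      Because the statistics field is optional, a reader that prunes row groups presumably has to treat a missing value as "unknown" rather than as "no matching rows". The following is a minimal sketch of that idea in plain Java, not ORC's actual implementation: IntStats and mightContain are hypothetical names, and null handling is left out for brevity.

```java
import java.util.Optional;

public class RowGroupPruningSketch {
    /** Hypothetical stand-in for the integer min/max held in a ColumnStatistics message. */
    static final class IntStats {
        final long min;
        final long max;

        IntStats(long min, long max) {
            this.min = min;
            this.max = max;
        }
    }

    /**
     * Decides whether a row group might contain rows where column == literal.
     * Returning true means the row group has to be read.
     */
    static boolean mightContain(Optional<IntStats> stats, long literal) {
        if (!stats.isPresent()) {
            // No statistics recorded: nothing can be ruled out, so read the row group.
            return true;
        }
        IntStats s = stats.get();
        // With statistics, the row group can be skipped only when the literal
        // lies outside the recorded [min, max] range.
        return literal >= s.min && literal <= s.max;
    }

    public static void main(String[] args) {
        // Row group without statistics, as in the attached file.
        System.out.println(mightContain(Optional.empty(), 2L));
        // Row group whose range covers the literal.
        System.out.println(mightContain(Optional.of(new IntStats(1, 3)), 2L));
        // Row group whose range excludes the literal.
        System.out.println(mightContain(Optional.of(new IntStats(5, 9)), 2L));
    }
}
```

      Under these assumptions, only the third call reports that the row group can be skipped; the row group without statistics is always kept.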
      

      The metadata of the ORC file:

       

      $ orctools meta none.orc 
      log4j:WARN No appenders could be found for logger (org.apache.hadoop.util.Shell).
      log4j:WARN Please initialize the log4j system properly.
      log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
      Processing data file none.orc [length: 124]
      Structure for none.orc
      File Version: 0.12 with ORIGINAL
      Rows: 3
      Compression: NONE
      Calendar: Julian/Gregorian
      Type: struct<INT:int>
      Stripe Statistics:
        Stripe 1:
          Column 0: count: 3 hasNull: true
          Column 1: count: 3 hasNull: true min: 1 max: 3 sum: 6
      File Statistics:
      Stripes:
        Stripe: offset: 3 data: 4 rows: 3 tail: 32 index: 10
          Stream: column 0 section ROW_INDEX start: 3 length 4
          Stream: column 1 section ROW_INDEX start: 7 length 6
          Stream: column 1 section DATA start: 13 length 4
          Encoding column 0: DIRECT
          Encoding column 1: DIRECT_V2
      File length: 124 bytes
      Padding length: 0 bytes
      Padding ratio: 0%
      

       

      The data of the ORC file:

      $ orctools data none.orc 
      log4j:WARN No appenders could be found for logger (org.apache.hadoop.util.Shell).
      log4j:WARN Please initialize the log4j system properly.
      log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
      Processing data file none.orc [length: 124]
      {"INT":1}
      {"INT":2}
      {"INT":3}

      I used the code below to try to read each row of the ORC file:

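      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.hive.ql.exec.vector.LongColumnVector;
      import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
      import org.apache.hadoop.hive.ql.io.sarg.PredicateLeaf;
      import org.apache.hadoop.hive.ql.io.sarg.SearchArgumentFactory;
      import org.apache.orc.OrcFile;
      import org.apache.orc.Reader;
      import org.apache.orc.RecordReader;
      import org.apache.orc.TypeDescription;

      // Pick the schema we want to read using schema evolution
      TypeDescription readSchema = TypeDescription.fromString("struct<INT:int>");

      // Get the information from the file footer
      Reader reader = OrcFile.createReader(new Path("none.orc"),
          OrcFile.readerOptions(new Configuration()));

      System.out.println("File schema: " + reader.getSchema());
      System.out.println("Row count: " + reader.getNumberOfRows());

      // Predicate pushdown: INT = 2
      RecordReader rowIterator = reader.rows(
          reader.options()
              .schema(readSchema)
              .searchArgument(
                  SearchArgumentFactory.newBuilder()
                      .equals("INT", PredicateLeaf.Type.LONG, 2L)
                      .build(),
                  new String[]{"INT"}));

      // Read the row data
      VectorizedRowBatch batch = readSchema.createRowBatch();
      LongColumnVector x = (LongColumnVector) batch.cols[0];

      while (rowIterator.nextBatch(batch)) {
        System.out.println(batch.size);
        for (int row = 0; row < batch.size; ++row) {
          // A repeating vector stores its single value at index 0
          int xRow = x.isRepeating ? 0 : row;
          System.out.println("INT: " + (x.noNulls || !x.isNull[xRow] ? x.vector[xRow] : null));
        }
      }
      rowIterator.close();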

       

      Output from ORC 1.6.11 (no batches are returned):

      File schema: struct<INT:int>
      Row count: 3

      Output from ORC 1.5.10:

      File schema: struct<INT:int>
      Row count: 3
      3
      INT: 1
      INT: 2
      INT: 3
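
      The 1.5.10 output lists all three rows even though the search argument asks for INT = 2. That is consistent with search arguments only pruning whole stripes and row groups, so a caller that wants just the matching rows would still filter each batch itself. A minimal sketch of such a filter follows; the plain arrays are hypothetical stand-ins for the vector, isNull, noNulls and isRepeating fields of LongColumnVector, so that the example runs without the ORC libraries.

```java
import java.util.ArrayList;
import java.util.List;

public class BatchFilterSketch {
    /**
     * Returns the non-null values among the first `size` rows that equal `literal`.
     * The parameters mirror LongColumnVector fields but are plain arrays here.
     */
    static List<Long> matchingValues(long[] vector, boolean[] isNull,
                                     boolean noNulls, boolean isRepeating,
                                     int size, long literal) {
        List<Long> matches = new ArrayList<>();
        for (int row = 0; row < size; ++row) {
            // A repeating vector stores its single value at index 0.
            int i = isRepeating ? 0 : row;
            boolean valueIsNull = !noNulls && isNull[i];
            if (!valueIsNull && vector[i] == literal) {
                matches.add(vector[i]);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        // The three rows of none.orc, none of them null.
        long[] vector = {1L, 2L, 3L};
        boolean[] isNull = new boolean[3];
        System.out.println(matchingValues(vector, isNull, true, false, 3, 2L)); // prints [2]
    }
}
```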

       

      I originally found this issue on Spark 3.2, which depends on ORC 1.6.11. It does not occur on Spark 3.0.x, which depends on ORC 1.5.10.

       

       

      Attachments

        1. none-1.orc
          0.1 kB
          Bobby Wang

            People

              Assignee: Yiqun Zhang (Guiyankuang)
              Reporter: Bobby Wang (wbo4958)
              Votes: 0
              Watchers: 7
