Description
I have attached an ORC file that seems not to include ColumnStatistics in RowIndex.
From the ORC spec, it appears that RowIndexEntry.statistics is an optional field, not a required one:
message RowIndexEntry {
  repeated uint64 positions = 1 [packed=true];
  optional ColumnStatistics statistics = 2;
}

message RowIndex {
  repeated RowIndexEntry entry = 1;
}
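Since the statistics field is optional, a file produced by a third-party writer may omit it entirely. Below is a minimal sketch (not part of the original report) of how one could verify this on the attached file: it parses the raw ROW_INDEX stream of column 1 with the generated OrcProto classes, using the stream offset/length and Compression: NONE shown in the orctools meta output further down; the class name CheckRowIndex is made up for illustration.

import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;

import org.apache.orc.OrcProto;

// Sketch only: parses the raw ROW_INDEX stream of column 1 (start 7, length 6,
// uncompressed, per the meta output below) and reports whether each entry
// actually carries the optional statistics field.
public class CheckRowIndex {
  public static void main(String[] args) throws Exception {
    byte[] file = Files.readAllBytes(Paths.get("none.orc"));
    byte[] stream = Arrays.copyOfRange(file, 7, 7 + 6);
    OrcProto.RowIndex index = OrcProto.RowIndex.parseFrom(stream);
    for (OrcProto.RowIndexEntry entry : index.getEntryList()) {
      // hasStatistics() is false when the optional field was never written
      System.out.println("positions=" + entry.getPositionsList()
          + " hasStatistics=" + entry.hasStatistics());
    }
  }
}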
The metadata of the ORC file:
$ orctools meta none.orc
log4j:WARN No appenders could be found for logger (org.apache.hadoop.util.Shell).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Processing data file none.orc [length: 124]
Structure for none.orc
File Version: 0.12 with ORIGINAL
Rows: 3
Compression: NONE
Calendar: Julian/Gregorian
Type: struct<INT:int>

Stripe Statistics:
  Stripe 1:
    Column 0: count: 3 hasNull: true
    Column 1: count: 3 hasNull: true min: 1 max: 3 sum: 6

File Statistics:

Stripes:
  Stripe: offset: 3 data: 4 rows: 3 tail: 32 index: 10
    Stream: column 0 section ROW_INDEX start: 3 length 4
    Stream: column 1 section ROW_INDEX start: 7 length 6
    Stream: column 1 section DATA start: 13 length 4
    Encoding column 0: DIRECT
    Encoding column 1: DIRECT_V2

File length: 124 bytes
Padding length: 0 bytes
Padding ratio: 0%
The data of the ORC file:
$ orctools data none.orc
log4j:WARN No appenders could be found for logger (org.apache.hadoop.util.Shell).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Processing data file none.orc [length: 124]
{"INT":1}
{"INT":2}
{"INT":3}
I have the code below, which tries to read each row of the ORC file with a search argument (predicate pushdown):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.ql.exec.vector.LongColumnVector;
import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
import org.apache.hadoop.hive.ql.io.sarg.PredicateLeaf;
import org.apache.hadoop.hive.ql.io.sarg.SearchArgumentFactory;
import org.apache.orc.OrcFile;
import org.apache.orc.Reader;
import org.apache.orc.RecordReader;
import org.apache.orc.TypeDescription;

// Pick the schema we want to read using schema evolution
TypeDescription readSchema = TypeDescription.fromString("struct<INT:int>");
// Get the information from the file footer
Reader reader = OrcFile.createReader(new Path("none.orc"),
    OrcFile.readerOptions(new Configuration()));
System.out.println("File schema: " + reader.getSchema());
System.out.println("Row count: " + reader.getNumberOfRows());
RecordReader rowIterator = reader.rows(
    reader.options()
        .schema(readSchema)
        .searchArgument(SearchArgumentFactory.newBuilder()
            .equals("INT", PredicateLeaf.Type.LONG, 2L)
            .build(), new String[]{"INT"})); // predicate push down
// Read the row data
VectorizedRowBatch batch = readSchema.createRowBatch();
LongColumnVector x = (LongColumnVector) batch.cols[0];
while (rowIterator.nextBatch(batch)) {
  System.out.println(batch.size);
  for (int row = 0; row < batch.size; ++row) {
    int xRow = x.isRepeating ? 0 : row;
    System.out.println("INT: " + (x.noNulls || !x.isNull[xRow] ? x.vector[xRow] : null));
  }
}
rowIterator.close();
Output from ORC 1.6.11 (no batches are returned):
File schema: struct<INT:int>
Row count: 3
Output from ORC 1.5.10 (all rows in the matching row group are returned):
File schema: struct<INT:int>
Row count: 3
3
INT: 1
INT: 2
INT: 3
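For comparison, the same reader without a search argument never consults the row-index statistics, so it should behave like the orctools data output above on both versions. The variant below is my addition for illustration, not part of the original report:

// Same reader, but with no SearchArgument, so the row-index statistics
// are never evaluated during the read.
RecordReader plainIterator = reader.rows(reader.options().schema(readSchema));
VectorizedRowBatch plainBatch = readSchema.createRowBatch();
LongColumnVector col = (LongColumnVector) plainBatch.cols[0];
while (plainIterator.nextBatch(plainBatch)) {
  for (int row = 0; row < plainBatch.size; ++row) {
    int r = col.isRepeating ? 0 : row;
    System.out.println("INT: " + (col.noNulls || !col.isNull[r] ? col.vector[r] : null));
  }
}
plainIterator.close();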
I originally found this issue on Spark 3.2, which depends on ORC 1.6.11; there is no such issue on Spark 3.0.x, which depends on ORC 1.5.10.
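A minimal sketch of how this could be reproduced through Spark (illustrative only; the file path, app name, and settings are assumptions, not the exact job where the issue was observed):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

// Sketch only: with spark.sql.orc.filterPushdown enabled, the filter below is
// converted into an ORC SearchArgument, which is the code path that differs
// between ORC 1.5.x (Spark 3.0.x) and ORC 1.6.x (Spark 3.2).
public class ReproduceOnSpark {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("orc-sarg-repro")
        .master("local[1]")
        .config("spark.sql.orc.filterPushdown", "true")
        .getOrCreate();
    Dataset<Row> df = spark.read().orc("none.orc").filter("INT = 2");
    df.show();
    spark.stop();
  }
}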
Attachments
Issue Links
- relates to ORC-1553: Reading information from Row group, where there are 0 records of SArg column (Closed)