Source changes - FishEye

Shows the 20 most recent commits for Parquet.

Vitalii Diravka <> committed e9065b55ea560e7f737d6fcb4948f9e945b9b14f (20 files)
Reviews: none

DRILL-5660: Parquet metadata caching improvements
1. Bumped up metadata file version to v3_1.
2. Introduced MetadataVersion comparable class.
3. Added support to ignore unknown metadata versions (for example, metadata generated by future versions of Drill).
4. Added support to ignore corrupted or missing metadata files.
5. Removed "%20" symbols from path if present.
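
Item 2's comparable version class could look roughly like the following sketch. The class name matches the commit message, but the parsing logic and the `v<major>_<minor>` string scheme are assumptions inferred from versions like v3_1, not Drill's actual implementation:

```java
// Hypothetical sketch of a comparable metadata version for strings such
// as "v3" or "v3_1". Not Drill's actual code; the format is an assumption.
public class MetadataVersion implements Comparable<MetadataVersion> {
    private final int major;
    private final int minor;

    public MetadataVersion(int major, int minor) {
        this.major = major;
        this.minor = minor;
    }

    /** Parses "v3_1" or "v3" (minor defaults to 0); rejects anything else. */
    public static MetadataVersion parse(String s) {
        if (s == null || !s.matches("v\\d+(_\\d+)?")) {
            throw new IllegalArgumentException("Unknown metadata version: " + s);
        }
        String[] parts = s.substring(1).split("_");
        int major = Integer.parseInt(parts[0]);
        int minor = parts.length > 1 ? Integer.parseInt(parts[1]) : 0;
        return new MetadataVersion(major, minor);
    }

    @Override
    public int compareTo(MetadataVersion o) {
        int cmp = Integer.compare(major, o.major);
        return cmp != 0 ? cmp : Integer.compare(minor, o.minor);
    }
}
```

A comparable type like this lets the reader decide in one place whether a cache file is older, equal, or newer than the version the engine understands.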

closes #877

drill master
Padma Penumarthy <> committed 9cf6faa7aa834c7ba654ce956c8b523ff3464658 (1 file)
Reviews: none

DRILL-5587: Validate Parquet blockSize and pageSize configured with SYSTEM/SESSION option
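
The validation described can be sketched as a simple range check. The class name and the bounds below are illustrative assumptions, not Drill's actual option validator:

```java
// Illustrative range validation for Parquet blockSize/pageSize options.
// The minimum page size and the use of Integer.MAX_VALUE as the block
// size ceiling are assumptions for the sketch, not Drill's constants.
public class ParquetOptionValidator {
    static final long MIN_PAGE_SIZE = 1024;                // 1 KiB, illustrative
    static final long MAX_BLOCK_SIZE = Integer.MAX_VALUE;  // illustrative ceiling

    public static void validate(long blockSize, long pageSize) {
        if (blockSize > MAX_BLOCK_SIZE) {
            throw new IllegalArgumentException(
                "blockSize " + blockSize + " exceeds " + MAX_BLOCK_SIZE);
        }
        if (pageSize < MIN_PAGE_SIZE || pageSize > blockSize) {
            throw new IllegalArgumentException(
                "pageSize must be between " + MIN_PAGE_SIZE
                + " and blockSize (" + blockSize + ")");
        }
    }
}
```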
close #852

Vitalii Diravka <> committed b714b2d74f80755d02de87b3151d94cb9cfc6794 (4 files)
Padma Penumarthy <> committed 9ab91ff2640a8e89b92869d7dbb15acb9b602cd3 (4 files)
Reviews: none

DRILL-5379: Set Hdfs Block Size based on Parquet Block Size
Provide an option to specify blocksize during file creation.
This helps create parquet files with a single block on HDFS, which improves performance when we read those files.

See DRILL-5379 for details.
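
One way to derive the HDFS block size from the Parquet block size is sketched below; the helper name is an assumption, and the 512-byte rounding reflects HDFS's checksum-chunk granularity as I understand it, not a constant taken from this commit:

```java
// Illustrative helper: request an HDFS block size large enough that one
// Parquet row group ("block") fits in a single HDFS block. Rounds up to
// a 512-byte multiple, which HDFS requires for block sizes (assumption).
public class BlockSizeHelper {
    public static long hdfsBlockSizeFor(long parquetBlockSize) {
        final long unit = 512;
        return ((parquetBlockSize + unit - 1) / unit) * unit;
    }
}
```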

closes #826

Vitalii Diravka <> committed 964a947315c31571954c1f4f56ac3336ac7bbcda (9 files)
Volodymyr Vysotskyi <> committed 5df49ab9c250114d93aa90507b029fad77f4e6bd (6 files)
Reviews: none

DRILL-4139: Add missing Interval, VarBinary and Varchar with nulls partition pruning support.
- Fix metadata serialization for fixed_len_byte_array types.
- Fix partition pruning for decimal type.
- Fix loss of scale value for DECIMAL in parquet partition pruning.
- Fix partition pruning for primitive types with null values.
- Update parquet table metadata version to v3_3.
- Fix wrong parquet metadata cache version after resolving conflicts with DRILL-4264.
closes #805

Paul Rogers <> committed 676ea889bb69e9e0a733cab29665236d066bd1ab (17 files)
Reviews: none

DRILL-5356: Refactor Parquet Record Reader
The Parquet reader is Drill's premier data source and has worked very well
for many years. As with any piece of code, it has grown in complexity over
that time and has become hard to understand and maintain.

While working on another project, we found that Parquet is accidentally creating
"low density" batches: record batches with little actual data compared to
the amount of memory allocated. We'd like to fix that.
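
The "density" notion here can be made concrete as a simple ratio; this is an illustrative metric, not Drill's actual batch-size accounting:

```java
// Illustrative batch "density" metric: actual data bytes divided by the
// memory allocated for the batch. Low values mean most of the allocated
// memory is wasted. Names are assumptions, not Drill code.
public class BatchDensity {
    /** Returns a value in (0, 1]; low values indicate wasted allocation. */
    public static double density(long dataBytes, long allocatedBytes) {
        if (allocatedBytes <= 0) {
            throw new IllegalArgumentException("allocatedBytes must be positive");
        }
        return (double) dataBytes / allocatedBytes;
    }
}
```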

However, the current complexity of the reader code creates a barrier to
making improvements: the code is so complex that it is often better to
leave bugs unfixed, or risk spending large amounts of time struggling to
make small changes.

This commit offers to help revitalize the Parquet reader. Functionality is
identical to the code in master; but code has been pulled apart into
various classes each of which focuses on one part of the task: building
up a schema, keeping track of read state, a strategy for reading various
combinations of records, etc. The idea is that it is easier to understand
several small, focused classes than one huge, complex class. Indeed, the
idea of small, focused classes is common in the industry; it is nothing new.

Unit tests pass with the change. Since no logic has changed and we only moved
lines of code, that is a good indication that everything still works.

Also includes fixes based on review comments.

closes #789

drill master
Parth Chandra <> committed 1766ffc4960e8f7c1efc981a9302688a8c6cd427 (1 file)
Reviews: none

DRILL-5349: Fix TestParquetWriter unit tests when synchronous parquet reader is used.
close apache/drill#780

Paul Rogers <> committed 79811db5aa8c7f2cdbe6f74c0a40124bea9fb1fd (23 files)
Reviews: none

DRILL-5284: Roll-up of final fixes for managed sort
See subtasks for details.

* Provide detailed, accurate estimate of size consumed by a record batch
* Managed external sort spills too often with Parquet data
* Managed External Sort fails with OOM
* External sort refers to the deprecated HDFS param
* Config param drill.exec.sort.external.batch.size is not used
* NPE in managed external sort while spilling to disk
* External Sort BatchGroup leaks memory if an OOM occurs during read
* DRILL-5294: Under certain low-memory conditions, the sort must be forced to
merge two batches to make progress, even though this is a bit more than
comfortably fits into memory.

close #761

drill 1.10.0
Parth Chandra <> committed 152c87aa6cad84c8752b4a87967c7826cc90dbaa (2 files)
Reviews: none

DRILL-5351: Minimize bounds checking in var len vectors for Parquet reader
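
The usual shape of this optimization is to hoist the bounds check out of the per-value hot loop; the sketch below shows that pattern with illustrative names, not Drill's actual vector classes:

```java
// Illustrative sketch of minimizing bounds checks: verify capacity once
// per batch of variable-length values instead of on every element write.
// Buffer and class names are assumptions, not Drill's vector code.
public class VarLenWriter {
    private final byte[] buffer;
    private int writeIndex;

    public VarLenWriter(int capacity) {
        this.buffer = new byte[capacity];
    }

    /** Copies all values after a single capacity check for the whole batch. */
    public void writeBatch(byte[][] values, int totalBytes) {
        if (writeIndex + totalBytes > buffer.length) {  // one check per batch
            throw new IndexOutOfBoundsException("batch does not fit");
        }
        for (byte[] v : values) {
            System.arraycopy(v, 0, buffer, writeIndex, v.length);
            writeIndex += v.length;
        }
    }

    public int bytesWritten() {
        return writeIndex;
    }
}
```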
close #781

Arina Ielchiieva <> committed f80d77e6bb60afc562331192ac5d08545c1f1c02 (1 file)
Reviews: none

DRILL-5040: Parquet writer unable to delete table folder on abort
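
The shape of such a fix is a recursive delete of the partially written table folder on abort. Drill's writer targets a Hadoop FileSystem, so this java.nio sketch only illustrates the idea:

```java
// Hypothetical sketch of abort-time cleanup: recursively delete the
// partially written table folder, children before parents. Drill's
// actual fix operates on a Hadoop FileSystem, not java.nio.
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.stream.Stream;

public class AbortCleanup {
    public static void deleteTableFolder(Path folder) throws IOException {
        if (!Files.exists(folder)) {
            return;
        }
        try (Stream<Path> walk = Files.walk(folder)) {
            // Reverse order deletes files before their parent directories.
            walk.sorted(Comparator.reverseOrder())
                .forEach(p -> p.toFile().delete());
        }
    }
}
```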
close apache/drill#744

Parth Chandra <> committed ddcf89548bd33c0cd3e062f1f6d5027eed822372 (1 file)
Reviews: none

DRILL-5240: Parquet - fix unnecessary object creation while checking for null values in nullable var length columns
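
The general pattern behind this kind of fix is to test the raw Parquet definition level as a primitive int rather than boxing each value into an object just to check for null. A sketch under that assumption (names are illustrative, not Drill's reader code):

```java
// Illustrative null check on raw definition levels: for a non-nested
// nullable column, level 0 means null and level 1 means present, so no
// per-value object needs to be created. Names are assumptions.
public class NullCheck {
    /** Counts non-null values directly from definition levels. */
    public static int countNonNulls(int[] definitionLevels) {
        int count = 0;
        for (int level : definitionLevels) {
            if (level > 0) {  // primitive comparison, no allocation
                count++;
            }
        }
        return count;
    }
}
```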
This closes #740

Paul Rogers <> committed 38f816a45924654efd085bf7f1da7d97a4a51e38 (2 files)
Reviews: none

DRILL-5157: Multiple Snappy versions on class path
Multiple Snappy versions on the class path cause unit test failures.

This fix updates the Snappy library and adds dependency management to
exclude older versions brought in by Avro and Parquet.

Parth Chandra <> committed 052010108a47856f9b1a3c0c470b6572948dc749 (12 files)
Reviews: none

DRILL-5207: Improve Parquet Scan pipelining.
- Add a configurable AsyncPageReader Queue.
- Enforce total size of parquet row group.
- Do not initialize the BufferedDirectBufInputStream buffer in init; wait for the first read.
- Change the default size of BufferedDirectBufInputStream.
- Do not invoke getOptions too many times in the Parquet reader.
- Add metrics for processing time and decoding time for varlen and fixedlen columns.
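
A configurable page queue of the kind described typically wraps a bounded blocking queue: the capacity caps read-ahead, and the queue decouples the disk-read thread from the decode thread. This is a sketch with assumed names, not Drill's AsyncPageReader:

```java
// Hypothetical sketch of a bounded page queue between a reader thread and
// a decode thread. Capacity limits read-ahead; put/take block as needed.
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class PageQueue {
    private final BlockingQueue<byte[]> pages;

    public PageQueue(int capacity) {
        this.pages = new ArrayBlockingQueue<>(capacity);
    }

    /** Reader thread: blocks when the read-ahead limit is reached. */
    public void offerPage(byte[] page) {
        try {
            pages.put(page);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while queueing page", e);
        }
    }

    /** Decode thread: blocks until the reader has produced a page. */
    public byte[] nextPage() {
        try {
            return pages.take();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while awaiting page", e);
        }
    }
}
```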
This closes #723

Vitalii Diravka <> committed eef3b3fb6f4e76e95510253d155d0659e387fc99 (3 files)
Reviews: none

DRILL-4996: Parquet Date auto-correction is not working in auto-partitioned parquet files generated by drill-1.6
- Changed the detection approach for corrupted date values when parquet files are generated by Drill:
  the corruption status is determined by looking at the min/max values in the metadata;
- Appropriate refactoring of TestCorruptParquetDateCorrection.
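
Min/max-based detection of this kind rests on the observation that dates written with the wrong epoch offset land implausibly far in the future. The sketch below assumes dates are stored as days since the Unix epoch; the class name and threshold are illustrative, not Drill's actual constants:

```java
// Hypothetical sketch of min/max-based date corruption detection: a
// statistic far beyond any plausible date marks the file as corrupted.
// The sanity threshold is an assumption, not Drill's constant.
public class DateCorruptionDetector {
    // Roughly year 10000 expressed as days since the Unix epoch.
    static final int SANITY_THRESHOLD_DAYS = 2_932_896;

    /** Returns true if the column's min/max statistics look corrupted. */
    public static boolean looksCorrupt(int minDays, int maxDays) {
        return minDays > SANITY_THRESHOLD_DAYS || maxDays > SANITY_THRESHOLD_DAYS;
    }
}
```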

This closes #687

Vitalii Diravka <> committed 4a0fd56c106550eee26ca68eaed6108f0dbad798 (7 files)