Details
- Type: Task
- Status: Open
- Priority: Major
- Resolution: Unresolved
Description
We should build a corpus of Parquet files that client implementations can use for validation. In addition to the input files, it should contain a description or a verbatim copy of the data in each file, so that readers can validate their results.
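For illustration, here is a minimal sketch of how a reader implementation could check itself against such a corpus. It assumes, hypothetically, that each .parquet file is paired with a .json file holding a verbatim copy of its rows, and it uses pyarrow as the reference reader; none of this layout is decided yet.

{code:python}
# Hypothetical corpus check: compare the rows read from a Parquet file
# against a paired JSON file containing a verbatim copy of the data.
# The .parquet/.json pairing is an assumption for this sketch.
import json
import sys

import pyarrow.parquet as pq


def validate(parquet_path, expected_json_path):
    """Return True if the Parquet file's rows match the expected JSON copy."""
    # Read the file and convert it to a list of {column: value} row dicts.
    actual = pq.read_table(parquet_path).to_pylist()
    with open(expected_json_path) as f:
        expected = json.load(f)
    # Plain equality; a real harness would need type-aware comparison
    # (timestamps, decimals, binary) rather than raw JSON equality.
    return actual == expected


if __name__ == "__main__":
    ok = validate(sys.argv[1], sys.argv[2])
    print("OK" if ok else "MISMATCH")
    sys.exit(0 if ok else 1)
{code}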
As a starting point, we can look at the old parquet-compatibility repo and Impala's test data, in particular the Parquet files it contains:
$ find testdata | grep -i parq
testdata/workloads/tpch/queries/insert_parquet.test
testdata/workloads/functional-planner/queries/PlannerTest/parquet-filtering.test
testdata/workloads/functional-planner/queries/PlannerTest/parquet-stats-agg.test
testdata/workloads/functional-query/queries/QueryTest/parquet-filtering.test
testdata/workloads/functional-query/queries/QueryTest/parquet-zero-rows.test
testdata/workloads/functional-query/queries/QueryTest/insert_parquet_invalid_codec.test
testdata/workloads/functional-query/queries/QueryTest/parquet-corrupt-rle-counts-abort.test
testdata/workloads/functional-query/queries/QueryTest/parquet-ambiguous-list-legacy.test
testdata/workloads/functional-query/queries/QueryTest/parquet-stats-agg.test
testdata/workloads/functional-query/queries/QueryTest/parquet-deprecated-stats.test
testdata/workloads/functional-query/queries/QueryTest/nested-types-parquet-stats.test
testdata/workloads/functional-query/queries/QueryTest/parquet-resolution-by-name.test
testdata/workloads/functional-query/queries/QueryTest/parquet-abort-on-error.test
testdata/workloads/functional-query/queries/QueryTest/mt-dop-parquet.test
testdata/workloads/functional-query/queries/QueryTest/parquet.test
testdata/workloads/functional-query/queries/QueryTest/parquet-corrupt-rle-counts.test
testdata/workloads/functional-query/queries/QueryTest/parquet-continue-on-error.test
testdata/workloads/functional-query/queries/QueryTest/mt-dop-parquet-nested.test
testdata/workloads/functional-query/queries/QueryTest/parquet-ambiguous-list-modern.test
testdata/workloads/functional-query/queries/QueryTest/parquet-stats.test
testdata/max_nesting_depth/int_map/file.parq
testdata/max_nesting_depth/struct/file.parq
testdata/max_nesting_depth/struct_map/file.parq
testdata/max_nesting_depth/int_array/file.parq
testdata/max_nesting_depth/struct_array/file.parq
testdata/parquet_nested_types_encodings
testdata/parquet_nested_types_encodings/README
testdata/parquet_nested_types_encodings/UnannotatedListOfGroups.parquet
testdata/parquet_nested_types_encodings/AmbiguousList_Modern.parquet
testdata/parquet_nested_types_encodings/UnannotatedListOfPrimitives.parquet
testdata/parquet_nested_types_encodings/AmbiguousList.json
testdata/parquet_nested_types_encodings/AvroPrimitiveInList.parquet
testdata/parquet_nested_types_encodings/ThriftPrimitiveInList.parquet
testdata/parquet_nested_types_encodings/bad-avro.parquet
testdata/parquet_nested_types_encodings/AmbiguousList.avsc
testdata/parquet_nested_types_encodings/SingleFieldGroupInList.parquet
testdata/parquet_nested_types_encodings/ThriftSingleFieldGroupInList.parquet
testdata/parquet_nested_types_encodings/AvroSingleFieldGroupInList.parquet
testdata/parquet_nested_types_encodings/AmbiguousList_Legacy.parquet
testdata/parquet_nested_types_encodings/bad-thrift.parquet
testdata/ComplexTypesTbl/nonnullable.parq
testdata/ComplexTypesTbl/nullable.parq
testdata/bad_parquet_data
testdata/bad_parquet_data/README
testdata/bad_parquet_data/dict-encoded-out-of-bounds.parq
testdata/bad_parquet_data/plain-encoded-negative-len.parq
testdata/bad_parquet_data/plain-encoded-out-of-bounds.parq
testdata/bad_parquet_data/dict-encoded-negative-len.parq
testdata/parquet_schema_resolution
testdata/parquet_schema_resolution/README
testdata/parquet_schema_resolution/switched_map.json
testdata/parquet_schema_resolution/switched_map.avsc
testdata/parquet_schema_resolution/switched_map.parq
testdata/src/main/java/org/apache/impala/datagenerator/JsonToParquetConverter.java
testdata/LineItemMultiBlock/lineitem_one_row_group.parquet
testdata/LineItemMultiBlock/lineitem_sixblocks.parquet
testdata/data/zero_rows_zero_row_groups.parquet
testdata/data/chars-formats.parquet
testdata/data/multiple_rowgroups.parquet
testdata/data/bad_parquet_data.parquet
testdata/data/bad_metadata_len.parquet
testdata/data/huge_num_rows.parquet
testdata/data/bad_compressed_size.parquet
testdata/data/zero_rows_one_row_group.parquet
testdata/data/bad_rle_repeat_count.parquet
testdata/data/bad_column_metadata.parquet
testdata/data/alltypesagg_hive_13_1.parquet
testdata/data/bad_dict_page_offset.parquet
testdata/data/bad_rle_literal_count.parquet
testdata/data/bad_magic_number.parquet
testdata/data/repeated_values.parquet
testdata/data/schemas/malformed_decimal_tiny.parquet
testdata/data/schemas/alltypestiny.parquet
testdata/data/schemas/nested/modern_nested.parquet
testdata/data/schemas/nested/legacy_nested.parquet
testdata/data/schemas/enum/enum.parquet
testdata/data/schemas/decimal.parquet
testdata/data/schemas/zipcode_incomes.parquet
testdata/data/repeated_root_schema.parquet
testdata/data/long_page_header.parquet
testdata/data/deprecated_statistics.parquet
testdata/data/kite_required_fields.parquet
testdata/data/out_of_range_timestamp.parquet
Impala also has a tool to generate Parquet files from JSON files: https://github.com/apache/incubator-impala/blob/master/testdata/src/main/java/org/apache/impala/datagenerator/JsonToParquetConverter.java
Arrow has a similar tool: https://github.com/apache/arrow/blob/master/integration/integration_test.py
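For comparison, a minimal sketch of the same idea using pyarrow follows. The input layout (a single JSON array of row objects) and the schema inference are assumptions made for illustration; this is not how the Impala or Arrow tools actually work.

{code:python}
# Hypothetical JSON-to-Parquet generator in the spirit of Impala's
# JsonToParquetConverter. Input format (a JSON array of row objects)
# and schema inference are assumptions for this sketch.
import json
import sys

import pyarrow as pa
import pyarrow.parquet as pq


def json_to_parquet(json_path, parquet_path):
    """Write a JSON array of row objects out as a Parquet file."""
    with open(json_path) as f:
        rows = json.load(f)              # e.g. [{"id": 1, "name": "a"}, ...]
    table = pa.Table.from_pylist(rows)   # schema is inferred from the values
    pq.write_table(table, parquet_path)


if __name__ == "__main__":
    json_to_parquet(sys.argv[1], sys.argv[2])
{code}

A real corpus generator would likely take an explicit schema (as the Impala converter does with Avro schemas) rather than inferring one, so that specific physical types and encodings can be exercised deliberately.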
Issue Links
- is related to: PARQUET-1241 [C++] Use LZ4 frame format (Resolved)