Parquet / PARQUET-1118

Build a corpus of Parquet files that client implementations can use for validation

Details

    • Type: Task
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: parquet-format
    • Labels: None

    Description

      We should build a corpus of Parquet files that client implementations can use for validation. In addition to the input files, it should contain a description or a verbatim copy of the data in each file, so that readers can validate their results.
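
One possible shape for such a check, as a minimal sketch only: each Parquet file is paired with a JSON file holding a verbatim copy of its rows, and a client plugs in its own reader. The `<name>.expected.json` naming convention, the `expected_path` and `validate_file` helpers, and the list-of-objects row format are all assumptions made for illustration; none of them is an agreed layout for the corpus.

```python
import json
from pathlib import Path
from typing import Callable, Iterable, List

# A reader is whatever the client under test provides: it takes a path to a
# Parquet file and yields one dict per row. (Hypothetical interface.)
Reader = Callable[[Path], Iterable[dict]]


def expected_path(parquet_file: Path) -> Path:
    """Hypothetical convention: foo.parquet is described by foo.expected.json."""
    return parquet_file.with_name(parquet_file.stem + ".expected.json")


def validate_file(parquet_file: Path, read_rows: Reader) -> List[str]:
    """Compare the rows a reader returns with the verbatim copy in the corpus.

    Returns a list of human-readable problems; an empty list means the
    reader's output matched the expected data exactly.
    """
    expected = json.loads(expected_path(parquet_file).read_text(encoding="utf-8"))
    actual = list(read_rows(parquet_file))

    problems = []
    if len(actual) != len(expected):
        problems.append(f"row count: expected {len(expected)}, got {len(actual)}")
    # Compare row by row over the common prefix so one mismatch does not
    # hide the others.
    for i, (want, got) in enumerate(zip(expected, actual)):
        if want != got:
            problems.append(f"row {i}: expected {want!r}, got {got!r}")
    return problems
```

A client would call `validate_file(path, my_reader)` for every file in the corpus and fail its test run if any list is non-empty. JSON cannot represent every Parquet type faithfully (binary, decimals, timestamps), so a real corpus would likely need a richer description format than this sketch assumes.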

      As a starting point we can look at the old parquet-compatibility repo and Impala's test data, in particular the Parquet files it contains.

      $ find testdata | grep -i parq
      testdata/workloads/tpch/queries/insert_parquet.test
      testdata/workloads/functional-planner/queries/PlannerTest/parquet-filtering.test
      testdata/workloads/functional-planner/queries/PlannerTest/parquet-stats-agg.test
      testdata/workloads/functional-query/queries/QueryTest/parquet-filtering.test
      testdata/workloads/functional-query/queries/QueryTest/parquet-zero-rows.test
      testdata/workloads/functional-query/queries/QueryTest/insert_parquet_invalid_codec.test
      testdata/workloads/functional-query/queries/QueryTest/parquet-corrupt-rle-counts-abort.test
      testdata/workloads/functional-query/queries/QueryTest/parquet-ambiguous-list-legacy.test
      testdata/workloads/functional-query/queries/QueryTest/parquet-stats-agg.test
      testdata/workloads/functional-query/queries/QueryTest/parquet-deprecated-stats.test
      testdata/workloads/functional-query/queries/QueryTest/nested-types-parquet-stats.test
      testdata/workloads/functional-query/queries/QueryTest/parquet-resolution-by-name.test
      testdata/workloads/functional-query/queries/QueryTest/parquet-abort-on-error.test
      testdata/workloads/functional-query/queries/QueryTest/mt-dop-parquet.test
      testdata/workloads/functional-query/queries/QueryTest/parquet.test
      testdata/workloads/functional-query/queries/QueryTest/parquet-corrupt-rle-counts.test
      testdata/workloads/functional-query/queries/QueryTest/parquet-continue-on-error.test
      testdata/workloads/functional-query/queries/QueryTest/mt-dop-parquet-nested.test
      testdata/workloads/functional-query/queries/QueryTest/parquet-ambiguous-list-modern.test
      testdata/workloads/functional-query/queries/QueryTest/parquet-stats.test
      testdata/max_nesting_depth/int_map/file.parq
      testdata/max_nesting_depth/struct/file.parq
      testdata/max_nesting_depth/struct_map/file.parq
      testdata/max_nesting_depth/int_array/file.parq
      testdata/max_nesting_depth/struct_array/file.parq
      testdata/parquet_nested_types_encodings
      testdata/parquet_nested_types_encodings/README
      testdata/parquet_nested_types_encodings/UnannotatedListOfGroups.parquet
      testdata/parquet_nested_types_encodings/AmbiguousList_Modern.parquet
      testdata/parquet_nested_types_encodings/UnannotatedListOfPrimitives.parquet
      testdata/parquet_nested_types_encodings/AmbiguousList.json
      testdata/parquet_nested_types_encodings/AvroPrimitiveInList.parquet
      testdata/parquet_nested_types_encodings/ThriftPrimitiveInList.parquet
      testdata/parquet_nested_types_encodings/bad-avro.parquet
      testdata/parquet_nested_types_encodings/AmbiguousList.avsc
      testdata/parquet_nested_types_encodings/SingleFieldGroupInList.parquet
      testdata/parquet_nested_types_encodings/ThriftSingleFieldGroupInList.parquet
      testdata/parquet_nested_types_encodings/AvroSingleFieldGroupInList.parquet
      testdata/parquet_nested_types_encodings/AmbiguousList_Legacy.parquet
      testdata/parquet_nested_types_encodings/bad-thrift.parquet
      testdata/ComplexTypesTbl/nonnullable.parq
      testdata/ComplexTypesTbl/nullable.parq
      testdata/bad_parquet_data
      testdata/bad_parquet_data/README
      testdata/bad_parquet_data/dict-encoded-out-of-bounds.parq
      testdata/bad_parquet_data/plain-encoded-negative-len.parq
      testdata/bad_parquet_data/plain-encoded-out-of-bounds.parq
      testdata/bad_parquet_data/dict-encoded-negative-len.parq
      testdata/parquet_schema_resolution
      testdata/parquet_schema_resolution/README
      testdata/parquet_schema_resolution/switched_map.json
      testdata/parquet_schema_resolution/switched_map.avsc
      testdata/parquet_schema_resolution/switched_map.parq
      testdata/src/main/java/org/apache/impala/datagenerator/JsonToParquetConverter.java
      testdata/LineItemMultiBlock/lineitem_one_row_group.parquet
      testdata/LineItemMultiBlock/lineitem_sixblocks.parquet
      testdata/data/zero_rows_zero_row_groups.parquet
      testdata/data/chars-formats.parquet
      testdata/data/multiple_rowgroups.parquet
      testdata/data/bad_parquet_data.parquet
      testdata/data/bad_metadata_len.parquet
      testdata/data/huge_num_rows.parquet
      testdata/data/bad_compressed_size.parquet
      testdata/data/zero_rows_one_row_group.parquet
      testdata/data/bad_rle_repeat_count.parquet
      testdata/data/bad_column_metadata.parquet
      testdata/data/alltypesagg_hive_13_1.parquet
      testdata/data/bad_dict_page_offset.parquet
      testdata/data/bad_rle_literal_count.parquet
      testdata/data/bad_magic_number.parquet
      testdata/data/repeated_values.parquet
      testdata/data/schemas/malformed_decimal_tiny.parquet
      testdata/data/schemas/alltypestiny.parquet
      testdata/data/schemas/nested/modern_nested.parquet
      testdata/data/schemas/nested/legacy_nested.parquet
      testdata/data/schemas/enum/enum.parquet
      testdata/data/schemas/decimal.parquet
      testdata/data/schemas/zipcode_incomes.parquet
      testdata/data/repeated_root_schema.parquet
      testdata/data/long_page_header.parquet
      testdata/data/deprecated_statistics.parquet
      testdata/data/kite_required_fields.parquet
      testdata/data/out_of_range_timestamp.parquet
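
To sort a listing like the one above into candidate inputs, a rough Python sketch could walk the tree and check each data file for the 4-byte `PAR1` magic that a Parquet file carries at its start and end. The helper names (`find_parquet_candidates`, `has_parquet_magic`, `inventory`) are made up for this example. Unlike `find | grep`, the sketch lists regular files only, and the magic check says nothing about whether the rest of the file is well-formed.

```python
from pathlib import Path
from typing import Dict, List

MAGIC = b"PAR1"  # 4-byte marker at the start and end of a Parquet file


def find_parquet_candidates(root: Path) -> List[Path]:
    """Roughly `find root | grep -i parq`, restricted to regular files.

    The match is done on the path relative to `root`, case-insensitively.
    """
    root = Path(root)
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and "parq" in p.relative_to(root).as_posix().lower()
    )


def has_parquet_magic(path: Path) -> bool:
    """True if the file both starts and ends with the PAR1 magic bytes."""
    if path.stat().st_size < 2 * len(MAGIC):
        return False  # too small to hold both markers
    with path.open("rb") as f:
        head = f.read(len(MAGIC))
        f.seek(-len(MAGIC), 2)  # 2 = seek relative to end of file
        tail = f.read(len(MAGIC))
    return head == MAGIC and tail == MAGIC


def inventory(root: Path) -> Dict[str, bool]:
    """Map each *.parquet / *.parq file (relative path) to its magic check."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): has_parquet_magic(p)
        for p in find_parquet_candidates(root)
        if p.suffix.lower() in (".parquet", ".parq")
    }
```

Files that fail the check are not necessarily useless: deliberately corrupt inputs such as `bad_magic_number.parquet` presumably belong in the corpus as negative test cases, so the result is better used to label files than to filter them out.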
      

      Impala also has a tool to generate Parquet files from JSON files: https://github.com/apache/incubator-impala/blob/master/testdata/src/main/java/org/apache/impala/datagenerator/JsonToParquetConverter.java

      Arrow has a similar tool: https://github.com/apache/arrow/blob/master/integration/integration_test.py

            People

              Assignee: Unassigned
              Reporter: Lars Volker
              Votes: 0
              Watchers: 3
