

    • Type: Sub-task
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.5.0
    • Fix Version/s: 1.6.0
    • Component/s: SQL, Tests
    • Labels: None


      For historical reasons, it's relatively hard to achieve full interoperability among the various Parquet data models. To improve this, Spark 1.5 implemented all backwards-compatibility rules defined in the parquet-format spec on the read path (SPARK-6774). However, testing all those corner cases is challenging. Currently, we test Parquet compatibility/interoperability in two ways:

      1. Generate Parquet files with other systems, bundle them into the Spark source tree as test resources, and write test cases against them to ensure that we can interpret them correctly. Currently, we test parquet-thrift and parquet-protobuf compatibility this way.
        • Pros: Easy to write test cases, easy to test against multiple versions of a given external system/library (by generating Parquet files with these versions)
        • Cons: Hard to track how the test Parquet files were generated
      2. Add external libraries as test dependencies, and call their APIs directly to write Parquet files and verify them. Currently, parquet-avro compatibility is tested using this approach.
        • Pros: Easy to track how testing Parquet files are generated
        • Cons:
          • Often requires code generation (Avro/Thrift/ProtoBuf/...), which either complicates the build system through build-time code generation or bloats the code base with checked-in generated Java files. The former is especially annoying because Spark has two build systems and requires two sets of plugins to do code generation (e.g., for Avro, we need both sbt-avro and avro-maven-plugin).
          • Can only test a single version of a given target library

      Inspired by the writeDirect method in the parquet-avro test code, a direct write API would be a good complement for testing Parquet compatibility. Ideally, this API should

      1. make it easy to construct arbitrarily complex Parquet records
      2. have a DSL that reflects the nested nature of Parquet records

      This way, it would be easy both to track how Parquet files are generated and to cover various versions of external libraries. However, test case authors must be careful to ensure that the constructed Parquet structures are identical to those generated by the target systems/libraries. We're probably not going to replace the above two approaches with this API, but just add it as a complement.
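
      To make the idea of a nested DSL more concrete, here is a minimal, language-agnostic sketch written in Python for brevity (an actual Spark test helper would be written in Scala against parquet-mr). Everything in it is hypothetical: RecordingConsumer, its message/field/group/value helpers, and write_legacy_list_record are illustration-only names, not an existing API. The event names are modeled on the start/end callbacks of parquet-mr's RecordConsumer, with a single addValue event standing in for the typed add methods. The stand-in only logs events instead of writing a file, so the nesting of the DSL can be compared directly with the event stream it produces.

```python
from contextlib import contextmanager


class RecordingConsumer:
    """Hypothetical stand-in for a Parquet record consumer; it only logs events."""

    def __init__(self):
        self.events = []

    def emit(self, *event):
        self.events.append(event)

    @contextmanager
    def message(self):
        # One message corresponds to one top-level Parquet record.
        self.emit("startMessage")
        yield self
        self.emit("endMessage")

    @contextmanager
    def field(self, name, index):
        # `index` is the position of the field within its enclosing group.
        self.emit("startField", name, index)
        yield self
        self.emit("endField", name, index)

    @contextmanager
    def group(self):
        self.emit("startGroup")
        yield self
        self.emit("endGroup")

    def value(self, v):
        # Simplification: one untyped event instead of per-type add methods.
        self.emit("addValue", v)


def write_legacy_list_record(consumer):
    # Assumed target schema (a two-level list layout, used here as an example):
    #   message m { required group f (LIST) { repeated int32 array; } }
    with consumer.message():
        with consumer.field("f", 0):
            with consumer.group():
                with consumer.field("array", 0):
                    consumer.value(1)
                    consumer.value(2)


consumer = RecordingConsumer()
write_legacy_list_record(consumer)
for event in consumer.events:
    print(event)
```

      Because the indentation of the with blocks mirrors the nesting of the record, the structure a test writes stays visible in the test source. The caveat above still applies: the test author has to make sure the chosen schema and event order match what the target library would actually produce.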




            Assignee: Cheng Lian
            Reporter: Cheng Lian

