Spark / SPARK-33149

Why does ArrayType schema change between read/write for parquet files?


    Details

    • Type: Question
    • Status: Resolved
    • Priority: Major
    • Resolution: Invalid
    • Affects Version/s: 3.0.0
    • Fix Version/s: None
    • Component/s: Input/Output
    • Labels:
      None

      Description

      I have parquet files that have been produced with org.apache.parquet Java library (not Spark). The schema has a list of authors that's reported by https://github.com/apache/parquet-mr/tree/master/parquet-tools like this:

      repeated binary authors (STRING);

      If I do spark.read.parquet(input_dir).write.parquet(output_dir) and do the same for the output files, the authors column has been changed into:

optional group authors (LIST) {
  repeated group list {
    optional binary element (STRING);
  }
}

It seems to mean the same thing, but from a schema perspective these are different.

I have another set of tools that reads the output of this step (with real logic), but I fail to match the schemas. The original data works fine. Also, df.printSchema() shows the same schema for both (except for a possible nullability change, which we can ignore in this case).

Any thoughts on whether this is intentional, and whether I have any control over this from within Spark?

       


People

• Assignee: Unassigned
• Reporter: Juha Iso-Sipilä (juhai)
• Votes: 0
• Watchers: 2
