Apache Arrow / ARROW-12762

[Python] ListType doesn't preserve field name after pickle and unpickle

Details

    Description

      Here is a small reproducer:

      import pandas as pd
      from pyspark.sql import SparkSession
      import pyarrow.parquet as pq
      import pickle
      
      df = pd.DataFrame(
          {
              "A": [
                  ["aa", "bb "],
                  ["c"],
                  ["d", "ee", "", "f"],
                  ["ggg", "H"],
                  [""],
              ]
          }
      )
      
      # Write the DataFrame to Parquet through Spark.
      spark = SparkSession.builder.appName("GenSparkData").getOrCreate()
      spark_df = spark.createDataFrame(df)
      spark_df.write.parquet("list_str.pq", "overwrite")
      
      # Read the schema back with pyarrow and round-trip it through pickle.
      ds = pq.ParquetDataset("list_str.pq")
      arrow_schema = ds.schema.to_arrow_schema()
      assert pickle.loads(pickle.dumps(ds.schema)) == ds.schema  # PASSES
      assert pickle.loads(pickle.dumps(arrow_schema)) == arrow_schema  # FAILS
      

            People

              Alessandro Molina (amol)
              Juan Galvez (jjgalvez)
              Votes: 1
              Watchers: 5

                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 0.5h