SPARK-36696

spark.read.parquet loads empty dataset


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 3.2.0
    • Fix Version/s: 3.2.0
    • Component/s: SQL
    • Labels: None

    Description

      Here is a Parquet file that Spark 3.2/master cannot read properly.

      The file was written by pandas and should contain 3650 rows, but Spark 3.2/master returns an empty dataset.

      >>> import pandas as pd
      >>> len(pd.read_parquet('/path/to/example.parquet'))
      3650
      
      >>> spark.read.parquet('/path/to/example.parquet').count()
      0
      

      I suspect this is caused by Parquet 1.12.0.

      When I reverted the two commits related to Parquet 1.12.0 from branch-3.2, Spark read the data successfully.

      We need to either add a workaround or revert those commits.
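
      To check whether a workaround or a revert restores the expected behavior, the comparison from the description can be scripted. The following is only a minimal sketch: the helper names `counts_match` and `compare_row_counts` are hypothetical, and it assumes pandas (with a Parquet engine) and PySpark are installed and that the path to the file is passed on the command line.

```python
def counts_match(pandas_rows: int, spark_rows: int) -> bool:
    """Return True when both readers report the same number of rows."""
    return pandas_rows == spark_rows


def compare_row_counts(path: str) -> bool:
    """Read `path` with pandas and with Spark, then compare the row counts."""
    # Imported lazily so the pure helper above works without either library.
    import pandas as pd
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    pandas_rows = len(pd.read_parquet(path))
    spark_rows = spark.read.parquet(path).count()
    print(f"pandas: {pandas_rows} rows, Spark: {spark_rows} rows")
    return counts_match(pandas_rows, spark_rows)


if __name__ == "__main__":
    import sys

    # Only run the comparison when a file path is supplied.
    if len(sys.argv) > 1:
        sys.exit(0 if compare_row_counts(sys.argv[1]) else 1)
```

      With the behavior reported above (3650 rows from pandas, 0 from Spark), the script would exit with a non-zero status; once the counts agree, it exits with 0.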


          People

            Assignee: Unassigned
            Reporter: Takuya Ueshin (ueshin)
