Apache Arrow / ARROW-1357

[Python] Data corruption in reading multi-file parquet dataset


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.5.0, 0.6.0
    • Fix Version/s: 0.7.0
    • Component/s: Python
    • Labels: None
    • Environment: python 3.5.3

    Description

      I generated a Parquet dataset in Spark that consists of two files. PyArrow corrupts the data from the second file when I read both files at once via pyarrow's Parquet directory loading mode.

      $ ls -l data
      total 28608
      -rw-rw-r-- 1 jarno jarno 14651449 Aug 15 09:30 part-00000-fd2ff92f-201b-4ffc-b0b4-275e06a7fa02.snappy.parquet
      -rw-rw-r-- 1 jarno jarno 14636502 Aug 15 09:30 part-00001-fd2ff92f-201b-4ffc-b0b4-275e06a7fa02.snappy.parquet

      import pyarrow.parquet as pq

      tab1 = pq.read_table('data')
      df1 = tab1.to_pandas()
      df1[df1.account_id == 38658373328].legal_labels.tolist()

      [array([ 2,  3,  5,  8, 10, 11, 13, 14, 17, 18, 19, 21, 22, 31, 60, 61, 63,
              64, 65, 66, 69, 70, 74, 75, 77, 82,  0,  1,  2,  3,  5,  8, 10, 11,
              13, 14, 17, 18, 19, 21, 22])]

      tab2 = pq.read_table('data/part-00001-fd2ff92f-201b-4ffc-b0b4-275e06a7fa02.snappy.parquet')
      df2 = tab2.to_pandas()
      df2[df2.account_id == 38658373328].legal_labels.tolist()

      [array([ 0,  1,  2,  3,  5,  8, 10, 11, 13, 14, 17, 18, 19, 21, 22, 24, 28,
              30, 31, 36, 38, 39, 40, 41, 43, 49, 60, 61, 62, 63, 64, 65, 66, 67,
              69, 70, 74, 75, 77, 82, 90])]

      Unfortunately I cannot share the data files, and I was not able to create a dummy data file pair that would have triggered the bug. I'm sending this bug report in the hope that it is still useful without a minimal repro example.
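
      For completeness, a minimal sketch that collapses the comparison above into one script (unverified against the real data, since it cannot be shared; it assumes pandas is installed and that the part files match data/part-*.parquet):

      import glob

      import pandas as pd
      import pyarrow.parquet as pq

      # Read the whole directory at once (the code path that shows the corruption),
      # then read each part file individually and concatenate the results.
      df_dataset = pq.read_table('data').to_pandas()

      part_files = sorted(glob.glob('data/part-*.parquet'))
      df_parts = pd.concat(
          [pq.read_table(path).to_pandas() for path in part_files],
          ignore_index=True,
      )

      # account_id 38658373328 lives in the second file; its legal_labels array
      # should be identical in both reads, but the directory read returns
      # different values.
      key = 38658373328
      print(df_dataset[df_dataset.account_id == key].legal_labels.tolist())
      print(df_parts[df_parts.account_id == key].legal_labels.tolist())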

      Attachments

        Activity

          People

            Assignee:
            wesm Wes McKinney
            Reporter:
            jseppanen Jarno Seppanen
            Votes:
            0
            Watchers:
            3

            Dates

              Created:
              Updated:
              Resolved: