Apache Arrow / ARROW-5666

[Python] Underscores in partition (string) values are dropped when reading dataset

Details

    Description

      When reading a partitioned dataset whose partition column contains string values with underscores, pyarrow appears to drop the underscores from the resulting values.

      For example, if I write and then read a dataset as follows:

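      import pandas as pd
      import pyarrow as pa
      import pyarrow.parquet as pq  # needed for write_to_dataset / ParquetDataset
      
      df = pd.DataFrame({
          "year_week": ["2019_2", "2019_3"],  # string values containing underscores
          "value": [1, 2]
      })
      
      # Write the table partitioned on the string column
      table = pa.Table.from_pandas(df.head())
      pq.write_to_dataset(table, 'test', partition_cols=["year_week"])
      
      # Read the partitioned dataset back
      table2 = pq.ParquetDataset('test').read()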
      

      The resulting 'year_week' column in table2 has lost the underscores, and its dictionary values come back as int64 rather than strings:

      table2[1] # Gives:
      
      <Column name='year_week' type=DictionaryType(dictionary<values=int64, indices=int32, ordered=0>)>
      [
      
        -- dictionary:
          [
            20192,
            20193
          ]
        -- indices:
          [
            0
          ],
      
        -- dictionary:
          [
            20192,
            20193
          ]
        -- indices:
          [
            1
          ]
      ]
      

      Is this intentional behaviour or is this a bug in arrow?
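
      One possible explanation, offered here only as an assumption, is that the partition values are first tried as integers and only kept as strings if that fails. Python's built-in int() accepts underscores as digit separators (PEP 515), so "2019_2" would parse as 20192, which matches the int64 dictionary shown above. The sketch below illustrates that mechanism with a hypothetical helper, infer_partition_value; it is not taken from pyarrow's source.

```python
# Minimal sketch of "try int, fall back to string" type inference.
# infer_partition_value is a hypothetical name used for illustration only;
# whether pyarrow infers partition types this way is an assumption.
def infer_partition_value(raw):
    try:
        # int() treats single underscores between digits as separators (PEP 515)
        return int(raw)
    except ValueError:
        # Anything that is not a valid integer literal stays a string
        return raw


values = [infer_partition_value(v) for v in ["2019_2", "2019_3", "2019-W2"]]
print(values)  # [20192, 20193, '2019-W2']
```

      If this is indeed the cause, values that int() rejects, such as "2019-W2", would round-trip unchanged as strings.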

            People

              jorisvandenbossche Joris Van den Bossche
              jrderuiter Julian de Ruiter