Apache Arrow / ARROW-10635

[C++] ORC reader issue with bool column


Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Not A Bug
    • Affects Version/s: 1.0.1
    • Fix Version/s: None
    • Component/s: C++

    Description

      The ORC file contains a single column of boolean type. Starting at row number `20000`, the values read back do not match the expected values.

       

      From what I have observed, the writer used for this ORC file assumes that the RLE-encoded boolean stream is aligned with row index boundaries. That is, no two row groups share a byte, and a row group never starts at a bit offset within a byte. I think pyarrow instead treats the leftover bits of the partial byte at the end of a row group as data, which shifts the values that follow.

       

      I have attached a Parquet file with the same data for reference. You will notice that the ORC reader treats the last two bits of the partial byte as data, which shifts the values by two rows.
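
      To make the suspected mechanism concrete, here is a minimal pure-Python sketch. It is not the ORC reader or writer code: the helper names (`pack_groups_padded`, `read_continuous`, `read_group_aligned`) are hypothetical, the row group size of 6 is chosen only so that each group ends with two padding bits, and the run-length layer is left out entirely. It assumes bits are packed most-significant-bit first.

```python
from typing import List


def pack_groups_padded(values: List[bool], group_size: int) -> bytes:
    """Pack booleans MSB-first, starting every row group on a fresh byte.

    Models the writer behaviour assumed in this report: unused bits at the
    end of a row group's last byte are zero padding, not data.
    """
    out = bytearray()
    for start in range(0, len(values), group_size):
        group = values[start:start + group_size]
        for i in range(0, len(group), 8):
            byte = 0
            for bit_pos, v in enumerate(group[i:i + 8]):
                if v:
                    byte |= 0x80 >> bit_pos
            out.append(byte)
    return bytes(out)


def unpack_bits(data: bytes) -> List[bool]:
    """Expand bytes into a flat list of bits, MSB-first."""
    return [bool(b & (0x80 >> p)) for b in data for p in range(8)]


def read_continuous(data: bytes, n_rows: int) -> List[bool]:
    """Reader that treats the stream as one continuous run of bits,
    so padding bits are consumed as if they were rows."""
    return unpack_bits(data)[:n_rows]


def read_group_aligned(data: bytes, n_rows: int, group_size: int) -> List[bool]:
    """Reader that skips to the next byte boundary after every row group."""
    bits = unpack_bits(data)
    bits_per_group = ((group_size + 7) // 8) * 8
    rows: List[bool] = []
    pos = 0
    while len(rows) < n_rows:
        take = min(group_size, n_rows - len(rows))
        rows.extend(bits[pos:pos + take])
        pos += bits_per_group
    return rows


# 12 rows in groups of 6: each group leaves 2 padding bits in its byte.
values = [i % 3 == 0 for i in range(12)]
data = pack_groups_padded(values, group_size=6)
print(read_group_aligned(data, 12, 6) == values)  # aligned reader round-trips
print(read_continuous(data, 12)[8:] == values[6:10])  # continuous reader is two rows late
```

      In this toy setup the first row group reads back identically under both readers, and the values after the first group boundary appear two rows late under the continuous reader, which matches the shape of the symptom described above.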

       

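      import pandas as pd
      from pyarrow import orc

      # Read the same boolean column from the ORC file and the reference Parquet file.
      f = orc.ORCFile('broken_bool.orc')
      pdf_orc = f.read().to_pandas()
      pdf_pq = pd.read_parquet("bool_pq.parquet")

      # Show the non-null rows where the ORC values differ from the Parquet values.
      orc_col = pdf_orc.col_bool.dropna()
      orc_col[orc_col != pdf_pq.col_bool.dropna()]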
      
      20002 False 
      20004 False 
      20005 True 
      20007 False 
      20014 True 
      ... 
      21973 False 
      21974 False 
      21985 True 
      21988 True 
      21993 False
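
      The mismatch listing above shows where the values differ, but not whether they are shifted. A rough way to test the "shifted by two rows" hypothesis is sketched below on plain Python lists. The helper names `first_mismatch` and `detect_shift` are hypothetical, and the sketch assumes a single constant shift after the first mismatch and ignores nulls, which may not hold for the attached files.

```python
from typing import Optional, Sequence


def first_mismatch(expected: Sequence[bool], actual: Sequence[bool]) -> Optional[int]:
    """Return the index of the first differing row, or None if none differ."""
    for i, (e, a) in enumerate(zip(expected, actual)):
        if e != a:
            return i
    return None


def detect_shift(
    expected: Sequence[bool],
    actual: Sequence[bool],
    start: int,
    max_shift: int = 8,
) -> Optional[int]:
    """Return the smallest k in 1..max_shift such that `actual`, read k rows
    later, reproduces `expected` from `start` onward (over the overlap)."""
    for k in range(1, max_shift + 1):
        tail = list(actual[start + k:])
        if tail and tail == list(expected[start:start + len(tail)]):
            return k
    return None


# Toy data: `actual` has two spurious rows inserted at index 6.
expected = [i % 3 == 0 for i in range(12)]
actual = expected[:6] + [False, False] + expected[6:10]

start = first_mismatch(expected, actual)
print(start, detect_shift(expected, actual, start))
```

      With the frames from the snippet above, the inputs could be built with something like `pdf_pq.col_bool.tolist()` and `pdf_orc.col_bool.tolist()`, keeping in mind that null rows would need separate handling.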
      

       

       

      Attachments

        1. bool_pq.parquet
          5 kB
          Ramakrishna Prabhu
        2. broken_bool.zip
          4 kB
          Ramakrishna Prabhu

          People

            Assignee: Ian Alexander Joiner (yingzhou474)
            Reporter: Ramakrishna Prabhu (rgsl888)
            Votes: 0
            Watchers: 6
