Parquet / PARQUET-1033

Mismatched Read and Write

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: cpp-1.1.0
    • Fix Version/s: cpp-1.2.0
    • Component/s: parquet-cpp
    • Labels: None
    • Environment: Rstudio

    Description

      ReadBatchSpaced reads back more rows than the file actually contains when the data includes nulls.
      I've been trying to write something like [bla.csv], which has nulls mixed in with the values.

      The problem appears when I use WriteBatchSpaced to write the data and ReadBatchSpaced to read it back.

      Instead of getting the correct values, I get fewer values than I originally wrote, plus additional nulls in the middle. A brief example follows:

      written

      -2147483648
      -2147483648
      30
      40
      50
      60
      70
      80
      90
      -2147483648
      -2147483648
      

      actual read

      -2147483648
      -2147483648
      -2147483648
      -2147483648
      30
      40
      50
      60
      70
      -2147483648
      9
      80
      90
      -2147483648
      -2147483648
      

      My code for the reader:

      int64_t rows_read = _c_reader->ReadBatchSpaced(
          arraysize, definition_level.data(), repetition_level.data(),
          ivalues.data(), valid_bits.data(), 0,
          &levels_read, &values_read, &null_count);
      for (int tmp = 0; tmp < rows_read; tmp++)
      {
        if (definition_level[tmp] < col_rep_type[__c])
        {
          ivalues[tmp] = NA_INTEGER;
        }
        // simply set value
        if (fsize != 1 && filter[tmp + offset + cur_offset])
        {
          //rvec[__c].set(fcnt[__c],0,values[tmp]);
          dff.set_value(fcnt[__c], 0, __c, ivalues[tmp]);
          fcnt[__c]++;
        }
        else if (fsize == 1)
        {
          //rvec[__c].set(tmp,offset+cur_offset,values[tmp]);
          dff.set_value(tmp, offset + cur_offset, __c, ivalues[tmp]);
        }
      }
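
      For comparison, here is a minimal sketch of deciding nulls from the validity bitmap that ReadBatchSpaced fills in, rather than from the definition levels. It does not depend on parquet-cpp. It assumes the bitmap uses the Arrow convention (least-significant bit first, a set bit meaning "value present"). The names kNaInteger, SlotIsValid and MarkNullSlots are hypothetical helpers, and kNaInteger simply mirrors the -2147483648 sentinel shown in the listings above.

```cpp
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Assumed null sentinel: INT32_MIN, matching the -2147483648 rows in the report.
constexpr int32_t kNaInteger = std::numeric_limits<int32_t>::min();

// Returns true when bit i of an LSB-first bitmap is set (assumed Arrow layout).
inline bool SlotIsValid(const std::vector<uint8_t>& valid_bits, std::size_t i) {
  return ((valid_bits[i / 8] >> (i % 8)) & 1u) != 0;
}

// Copies the spaced values and overwrites every slot whose validity bit is
// clear with the sentinel. The caller supplies the number of slots that were
// actually populated by the read.
std::vector<int32_t> MarkNullSlots(const std::vector<int32_t>& spaced_values,
                                   const std::vector<uint8_t>& valid_bits,
                                   std::size_t num_slots) {
  std::vector<int32_t> out(num_slots);
  for (std::size_t i = 0; i < num_slots; ++i) {
    out[i] = SlotIsValid(valid_bits, i) ? spaced_values[i] : kNaInteger;
  }
  return out;
}
```

      With this approach the loop bound and the null test both come from the same source (the slots and bits produced by one read), which may make a mismatch between the two easier to spot.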
      

      My code for the writer:

      parquet::Int64Writer* int64_writer =
          static_cast<parquet::Int64Writer*>(rg_writer->NextColumn());
      IntegerVector tmpvec = df[__c];
      for (int tmp = 0; tmp < rows_to_write; tmp++)
      {
        ivec[tmp] = tmpvec[tmp + offset];
        if (tmpvec[tmp + offset] == NA_INTEGER)
        {
          def_level[tmp] = 0;
        }
      }
      int64_writer->WriteBatchSpaced(rows_to_write, def_level.data(), rep_level.data(),
                                     valid_bits.data(), 0, ivec.data());
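
      The writer snippet passes valid_bits to WriteBatchSpaced but does not show how that buffer is filled. As an illustration only, the following sketch derives a validity bitmap from the definition levels so that the two stay consistent. It assumes a flat optional column (a slot is non-null exactly when its definition level equals the maximum) and the Arrow bitmap convention (least-significant bit first, set bit meaning "value present"). BuildValidBits is a hypothetical helper, not part of parquet-cpp.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Builds an LSB-first validity bitmap from definition levels and reports how
// many slots are null. Assumes a flat (non-repeated) optional column, so a
// slot is valid only when its definition level equals max_def_level.
std::vector<uint8_t> BuildValidBits(const std::vector<int16_t>& def_levels,
                                    int16_t max_def_level,
                                    int64_t* null_count) {
  std::vector<uint8_t> bits((def_levels.size() + 7) / 8, 0);
  *null_count = 0;
  for (std::size_t i = 0; i < def_levels.size(); ++i) {
    if (def_levels[i] == max_def_level) {
      bits[i / 8] |= static_cast<uint8_t>(1u << (i % 8));  // mark slot i as present
    } else {
      ++*null_count;
    }
  }
  return bits;
}
```

      For the eleven "written" rows above (nulls in the first two and last two positions), the definition levels would be 0,0,1,1,1,1,1,1,1,0,0 under these assumptions, and the helper yields the two bytes 0xFC and 0x01 with a null count of 4.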
      

      Attachments

        1. wrong.csv
          29 kB
          yugu
        2. bla.csv
          23 kB
          yugu

          People

            Assignee: Uwe Korn
            Reporter: yugu