  1. Apache Arrow
  2. ARROW-13438

[C++] Can't use StreamWriter with ToParquetSchema schema

Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 4.0.1
    • Fix Version/s: None
    • Component/s: C++
    • Labels: None

    Description

      Hi there,

      First of all, I'm not sure if I'm doing this correctly, as it took a bit of reverse engineering to figure this out. 

      I'm using Arrow 4.0.1 on Ubuntu with C++.

      I followed the streaming example and created:

      #include <cassert>
      #include <chrono>
      #include <cstdint>
      #include <cstring>
      #include <ctime>
      #include <iomanip>
      #include <iostream>
      #include <memory>
      #include <utility>
      
      #include "arrow/io/file.h"
      #include "parquet/exception.h"
      #include "parquet/stream_reader.h"
      #include "parquet/stream_writer.h"
      
      std::shared_ptr<parquet::schema::GroupNode> GetSchema() {
        parquet::schema::NodeVector fields;
        fields.push_back(parquet::schema::PrimitiveNode::Make(
            "int64_field", parquet::Repetition::OPTIONAL, parquet::Type::INT64,
            parquet::ConvertedType::NONE));
      
        return std::static_pointer_cast<parquet::schema::GroupNode>(
            parquet::schema::GroupNode::Make("schema", parquet::Repetition::REQUIRED, fields));
      }
      
      int main() {
        std::shared_ptr<arrow::io::FileOutputStream> outfile;
      
        PARQUET_ASSIGN_OR_THROW(
            outfile,
            arrow::io::FileOutputStream::Open("parquet-stream-api-example.parquet"));
      
        parquet::WriterProperties::Builder builder;
        parquet::StreamWriter os{parquet::ParquetFileWriter::Open(outfile, GetSchema(), builder.build())};
      
        os << int64_t(10);
      
        return 0;
      }
      

      The code terminates with:

      terminate called after throwing an instance of 'parquet::ParquetException'
        what():  Column converted type mismatch.  Column 'int64_field' has converted type[NONE] not 'INT_64' 

      What I'm not sure about is the parquet::ConvertedType::NONE part. The example provides a converted type even for primitives, while it's my understanding that it isn't necessary? If I do provide it, the code works.
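To make the failure mode concrete, here is a minimal standard-library sketch of a strict column check. It is a hypothetical model inferred only from the error message above, not Arrow's actual code: `PhysicalType`, `ConvertedType`, `ColumnInfo`, and this `CheckColumn` are stand-ins invented for illustration. It assumes the writer expects `INT_64` whenever an `int64_t` is streamed.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical stand-ins for parquet::Type and parquet::ConvertedType.
enum class PhysicalType { INT32, INT64 };
enum class ConvertedType { NONE, INT_32, INT_64 };

// Hypothetical description of one schema column.
struct ColumnInfo {
  std::string name;
  PhysicalType physical;
  ConvertedType converted;
};

std::string ToString(ConvertedType t) {
  switch (t) {
    case ConvertedType::NONE: return "NONE";
    case ConvertedType::INT_32: return "INT_32";
    case ConvertedType::INT_64: return "INT_64";
  }
  return "UNKNOWN";
}

// Strict check (assumed behavior): the column's converted type must equal
// the one implied by the C++ value being streamed. A column declared with
// NONE is rejected even though its physical type matches.
void CheckColumn(const ColumnInfo& col, PhysicalType physical,
                 ConvertedType converted) {
  if (col.physical != physical) {
    throw std::runtime_error("Column physical type mismatch.");
  }
  if (col.converted != converted) {
    throw std::runtime_error(
        "Column converted type mismatch.  Column '" + col.name +
        "' has converted type[" + ToString(col.converted) + "] not '" +
        ToString(converted) + "'");
  }
}
```

Under this model, a column declared as INT64 with `NONE` fails the check for an `int64_t` value, while the same column declared with `INT_64` passes, which matches the behavior described above.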

      Now, to the reverse engineering part. I'm trying to write to Parquet using StreamWriter. StreamWriter requires a parquet::schema::GroupNode as the schema, but I begin with an arrow::Schema. I found that it can be converted to a parquet::SchemaDescriptor using the parquet::arrow::ToParquetSchema utility. Looking at the utility's implementation, I can see that logical_type is set to None, which corresponds to parquet::ConvertedType::NONE, and hence the converted schema can't be used because of the issue I described above.

      1. Do we need to provide ConvertedType even for primitives?
      2. Is it a bug in the schema conversion utility or ColumnCheck assert?
      3. Or is it expected behavior, in this case, what's a suggested approach? Build Parquet schema instead of Arrow Schema?

      Thank you,

      Vasily.

          People

            Assignee: Unassigned
            Reporter: Vasily Fomin (vasily.fomin)