Apache Arrow / ARROW-1167

[Python] Create chunked BinaryArray in Table.from_pandas when a column's data exceeds 2GB


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.5.0
    • Component/s: Python
    • Labels: None

    Description

      When writing a pyarrow Table (created via Table.from_pandas from a pandas DataFrame read from a ~5GB CSV file) to a Parquet file, the interpreter dumps core with the following stack trace from gdb:

      #0  __memmove_avx_unaligned () at ../sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S:181
      #1  0x00007fbaa5c779f1 in parquet::InMemoryOutputStream::Write(unsigned char const*, long) () from /home/ubuntu/.local/lib/python3.5/site-packages/pyarrow/libparquet.so.1
      #2  0x00007fbaa5c0ce97 in parquet::PlainEncoder<parquet::DataType<(parquet::Type::type)6> >::Put(parquet::ByteArray const*, int) ()
         from /home/ubuntu/.local/lib/python3.5/site-packages/pyarrow/libparquet.so.1
      #3  0x00007fbaa5c18855 in parquet::TypedColumnWriter<parquet::DataType<(parquet::Type::type)6> >::WriteMiniBatch(long, short const*, short const*, parquet::ByteArray const*) ()
         from /home/ubuntu/.local/lib/python3.5/site-packages/pyarrow/libparquet.so.1
      #4  0x00007fbaa5c189d5 in parquet::TypedColumnWriter<parquet::DataType<(parquet::Type::type)6> >::WriteBatch(long, short const*, short const*, parquet::ByteArray const*) ()
         from /home/ubuntu/.local/lib/python3.5/site-packages/pyarrow/libparquet.so.1
      #5  0x00007fbaa5be0900 in arrow::Status parquet::arrow::FileWriter::Impl::TypedWriteBatch<parquet::DataType<(parquet::Type::type)6>, arrow::BinaryType>(parquet::ColumnWriter*, std::shared_ptr<arrow::Array> const&, long, short const*, short const*) () from /home/ubuntu/.local/lib/python3.5/site-packages/pyarrow/libparquet.so.1
      #6  0x00007fbaa5be171d in parquet::arrow::FileWriter::Impl::WriteColumnChunk(arrow::Array const&) () from /home/ubuntu/.local/lib/python3.5/site-packages/pyarrow/libparquet.so.1
      #7  0x00007fbaa5be1dad in parquet::arrow::FileWriter::WriteColumnChunk(arrow::Array const&) () from /home/ubuntu/.local/lib/python3.5/site-packages/pyarrow/libparquet.so.1
      #8  0x00007fbaa5be2047 in parquet::arrow::FileWriter::WriteTable(arrow::Table const&, long) () from /home/ubuntu/.local/lib/python3.5/site-packages/pyarrow/libparquet.so.1
      #9  0x00007fbaa51e1f53 in __pyx_pw_7pyarrow_8_parquet_13ParquetWriter_5write_table(_object*, _object*, _object*) ()
         from /home/ubuntu/.local/lib/python3.5/site-packages/pyarrow/_parquet.cpython-35m-x86_64-linux-gnu.so
      #10 0x00000000004e9bc7 in PyCFunction_Call () at ../Objects/methodobject.c:98
      #11 0x0000000000529885 in do_call (nk=<optimized out>, na=<optimized out>, pp_stack=0x7ffe6510a6c0, func=<optimized out>) at ../Python/ceval.c:4933
      #12 call_function (oparg=<optimized out>, pp_stack=0x7ffe6510a6c0) at ../Python/ceval.c:4732
      #13 PyEval_EvalFrameEx () at ../Python/ceval.c:3236
      #14 0x000000000052d2e3 in _PyEval_EvalCodeWithName () at ../Python/ceval.c:4018
      #15 0x0000000000528eee in fast_function (nk=<optimized out>, na=<optimized out>, n=<optimized out>, pp_stack=0x7ffe6510a8d0, func=<optimized out>) at ../Python/ceval.c:4813
      #16 call_function (oparg=<optimized out>, pp_stack=0x7ffe6510a8d0) at ../Python/ceval.c:4730
      #17 PyEval_EvalFrameEx () at ../Python/ceval.c:3236
      #18 0x000000000052d2e3 in _PyEval_EvalCodeWithName () at ../Python/ceval.c:4018
      #19 0x0000000000528eee in fast_function (nk=<optimized out>, na=<optimized out>, n=<optimized out>, pp_stack=0x7ffe6510aae0, func=<optimized out>) at ../Python/ceval.c:4813
      #20 call_function (oparg=<optimized out>, pp_stack=0x7ffe6510aae0) at ../Python/ceval.c:4730
      #21 PyEval_EvalFrameEx () at ../Python/ceval.c:3236
      #22 0x0000000000528814 in fast_function (nk=<optimized out>, na=<optimized out>, n=<optimized out>, pp_stack=0x7ffe6510ac10, func=<optimized out>) at ../Python/ceval.c:4803
      #23 call_function (oparg=<optimized out>, pp_stack=0x7ffe6510ac10) at ../Python/ceval.c:4730
      #24 PyEval_EvalFrameEx () at ../Python/ceval.c:3236
      #25 0x0000000000528814 in fast_function (nk=<optimized out>, na=<optimized out>, n=<optimized out>, pp_stack=0x7ffe6510ad40, func=<optimized out>) at ../Python/ceval.c:4803
      #26 call_function (oparg=<optimized out>, pp_stack=0x7ffe6510ad40) at ../Python/ceval.c:4730
      #27 PyEval_EvalFrameEx () at ../Python/ceval.c:3236
      #28 0x000000000052d2e3 in _PyEval_EvalCodeWithName () at ../Python/ceval.c:4018
      #29 0x000000000052dfdf in PyEval_EvalCodeEx () at ../Python/ceval.c:4039
      #30 PyEval_EvalCode (co=<optimized out>, globals=<optimized out>, locals=<optimized out>) at ../Python/ceval.c:777
      #31 0x00000000005fd2c2 in run_mod () at ../Python/pythonrun.c:976
      #32 0x00000000005ff76a in PyRun_FileExFlags () at ../Python/pythonrun.c:929
      #33 0x00000000005ff95c in PyRun_SimpleFileExFlags () at ../Python/pythonrun.c:396
      #34 0x000000000063e7d6 in run_file (p_cf=0x7ffe6510afb0, filename=0x2161260 L"scripts/parquet_export.py", fp=0x226fde0) at ../Modules/main.c:318
      #35 Py_Main () at ../Modules/main.c:768
      #36 0x00000000004cfe41 in main () at ../Programs/python.c:65
      #37 0x00007fbadf0db830 in __libc_start_main (main=0x4cfd60 <main>, argc=2, argv=0x7ffe6510b1c8, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7ffe6510b1b8)
          at ../csu/libc-start.c:291
      #38 0x00000000005d5f29 in _start ()
      

      This is occurring in a fairly vanilla call to `pq.write_table(table, output)`. Before the crash, I'm able to print out the table's schema, and it looks a little odd given that all columns are explicitly specified in pandas.read_csv() to be strings:

      _id: string
      ref_id: string
      ref_no: string
      stage: string
      stage2_ref_id: string
      org_id: string
      classification: string
      solicitation_no: string
      notice_type: string
      business_category: string
      procurement_mode: string
      funding_instrument: string
      funding_source: string
      approved_budget: string
      publish_date: string
      closing_date: string
      contract_duration: string
      calendar_type: string
      trade_agreement: string
      pre_bid_date: string
      pre_bid_venue: string
      procuring_entity_org_id: string
      procuring_entity_org: string
      client_agency_org_id: string
      client_agency_org: string
      contact_person: string
      contact_person_address: string
      tender_title: string
      description: string
      other_info: string
      reason: string
      created_by: string
      creation_date: string
      modified_date: string
      special_instruction: string
      collection_contact: string
      tender_status: string
      collection_point: string
      date_available: string
      serialid: string
      __index_level_0__: int64
      -- metadata --
      pandas: {"index_columns": ["__index_level_0__"], "columns": [{"pandas_type": "unicode", "numpy_type": "object", "metadata": null, "name": "_id"}, {"pandas_type": "unicode", "numpy_type": "object", "metadata": null, "name": "ref_id"}, {"pandas_type": "unicode", "numpy_type": "object", "metadata": null, "name": "ref_no"}, {"pandas_type": "unicode", "numpy_type": "object", "metadata": null, "name": "stage"}, {"pandas_type": "mixed", "numpy_type": "object", "metadata": null, "name": "stage2_ref_id"}, {"pandas_type": "unicode", "numpy_type": "object", "metadata": null, "name": "org_id"}, {"pandas_type": "unicode", "numpy_type": "object", "metadata": null, "name": "classification"}, {"pandas_type": "mixed", "numpy_type": "object", "metadata": null, "name": "solicitation_no"}, {"pandas_type": "unicode", "numpy_type": "object", "metadata": null, "name": "notice_type"}, {"pandas_type": "unicode", "numpy_type": "object", "metadata": null, "name": "business_category"}, {"pandas_type": "unicode", "numpy_type": "object", "metadata": null, "name": "procurement_mode"}, {"pandas_type": "mixed", "numpy_type": "object", "metadata": null, "name": "funding_instrument"}, {"pandas_type": "unicode", "numpy_type": "object", "metadata": null, "name": "funding_source"}, {"pandas_type": "unicode", "numpy_type": "object", "metadata": null, "name": "approved_budget"}, {"pandas_type": "mixed", "numpy_type": "object", "metadata": null, "name": "publish_date"}, {"pandas_type": "mixed", "numpy_type": "object", "metadata": null, "name": "closing_date"}, {"pandas_type": "unicode", "numpy_type": "object", "metadata": null, "name": "contract_duration"}, {"pandas_type": "mixed", "numpy_type": "object", "metadata": null, "name": "calendar_type"}, {"pandas_type": "unicode", "numpy_type": "object", "metadata": null, "name": "trade_agreement"}, {"pandas_type": "mixed", "numpy_type": "object", "metadata": null, "name": "pre_bid_date"}, {"pandas_type": "mixed", "numpy_type": "object", "metadata": null, "name": "pre_bid_venue"}, {"pandas_type": "unicode", "numpy_type": "object", "metadata": null, "name": "procuring_entity_org_id"}, {"pandas_type": "unicode", "numpy_type": "object", "metadata": null, "name": "procuring_entity_org"}, {"pandas_type": "unicode", "numpy_type": "object", "metadata": null, "name": "client_agency_org_id"}, {"pandas_type": "mixed", "numpy_type": "object", "metadata": null, "name": "client_agency_org"}, {"pandas_type": "mixed", "numpy_type": "object", "metadata": null, "name": "contact_person"}, {"pandas_type": "unicode", "numpy_type": "object", "metadata": null, "name": "contact_person_address"}, {"pandas_type": "mixed", "numpy_type": "object", "metadata": null, "name": "tender_title"}, {"pandas_type": "mixed", "numpy_type": "object", "metadata": null, "name": "description"}, {"pandas_type": "mixed", "numpy_type": "object", "metadata": null, "name": "other_info"}, {"pandas_type": "mixed", "numpy_type": "object", "metadata": null, "name": "reason"}, {"pandas_type": "unicode", "numpy_type": "object", "metadata": null, "name": "created_by"}, {"pandas_type": "unicode", "numpy_type": "object", "metadata": null, "name": "creation_date"}, {"pandas_type": "unicode", "numpy_type": "object", "metadata": null, "name": "modified_date"}, {"pandas_type": "mixed", "numpy_type": "object", "metadata": null, "name": "special_instruction"}, {"pandas_type": "mixed", "numpy_type": "object", "metadata": null, "name": "collection_contact"}, {"pandas_type": "mixed", "numpy_type": "object", "metadata": null, "name": 
"tender_status"}, {"pandas_type": "mixed", "numpy_type": "object", "metadata": null, "name": "collection_point"}, {"pandas_type": "mixed", "numpy_type": "object", "metadata": null, "name": "date_available"}, {"pandas_type": "unicode", "numpy_type": "object", "metadata": null, "name": "serialid"}, {"pandas_type": "int64", "numpy_type": "int64", "metadata": null, "name": "__index_level_0__"}], "pandas_version": "0.19.2"}
      Segmentation fault (core dumped)
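
      A minimal reproduction sketch of the workflow described above, assuming a large (~5GB) CSV whose columns are all read as strings; the file paths and the dtype=str mapping are illustrative, not taken from the report:

          import pandas as pd
          import pyarrow as pa
          import pyarrow.parquet as pq

          # Hypothetical paths; the report only names the driver script scripts/parquet_export.py.
          df = pd.read_csv("input.csv", dtype=str)      # every column read as Python strings
          table = pa.Table.from_pandas(df)              # pre-fix: one contiguous string array per column
          print(table.schema)                           # prints fine before the crash
          pq.write_table(table, "output.parquet")       # segfaults once a column holds >2GB of string data

      The 2GB threshold in the issue title comes from the 32-bit value offsets of Arrow's binary/string arrays: a single non-chunked array can address at most 2^31 - 1 bytes of values. The fix named in the title makes Table.from_pandas split such a column into multiple chunks. A rough sketch of how the chunking can be observed with a recent pyarrow (column name and sizes are illustrative, and building the frame needs several GB of RAM):

          import pandas as pd
          import pyarrow as pa

          big = "x" * (64 * 1024 * 1024)                # 64 MiB string
          df = pd.DataFrame({"payload": [big] * 40})    # ~2.5 GiB of character data in one column
          table = pa.Table.from_pandas(df)
          # With the fix, the oversized column is split into several chunks instead of
          # overflowing the 32-bit offsets of a single array.
          print(table.column(0).num_chunks)             # expected to report more than one chunk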
      

            People

              Assignee: Wes McKinney (wesm)
              Reporter: Jeff Knupp (jeffknupp)
              Votes: 0
              Watchers: 5
