Apache Arrow / ARROW-1873

[Python] Segmentation fault when loading total 2GB of parquet files




      We are trying to load 100 Parquet files, each of which is around 20MB. Since we have not yet ported ARROW-1830 into our pyarrow distribution, we use glob to list all the files and then load them into a pandas DataFrame through pyarrow.

      The schema of the Parquet files is as follows:

       |-- dateint: integer (nullable = true)
       |-- profileid: long (nullable = true)
       |-- time: long (nullable = true)
       |-- label: double (nullable = true)
       |-- weight: double (nullable = true)
       |-- features: array (nullable = true)
       |    |-- element: double (containsNull = true)
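
      For reference, the Spark schema printed above corresponds roughly to the following pyarrow schema. This is a sketch, not taken from the report; the integer widths (int32 for Spark "integer", int64 for "long") follow the usual Spark-to-Arrow mapping:

      ```python
      import pyarrow as pa

      # Hedged sketch of the Spark-printed schema in pyarrow terms.
      schema = pa.schema([
          ("dateint", pa.int32()),
          ("profileid", pa.int64()),
          ("time", pa.int64()),
          ("label", pa.float64()),
          ("weight", pa.float64()),
          ("features", pa.list_(pa.float64())),  # array<double>, nullable elements
      ])
      print(schema)
      ```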

      If we load only a couple of them, it works without any issue. However, when loading all 100, we get a segmentation fault, shown below. For what it's worth, if we flatten features: array[double] into top-level columns, the file sizes stay roughly the same and loading works fine.

      Is there anything we can try to eliminate this issue? Thanks.
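
      The flattening mentioned above can be sketched as follows. This is a toy illustration, not the reporter's actual code: the column names (features_0, ...) are made up, and it assumes every features list has the same length:

      ```python
      import pyarrow as pa

      # Toy table mimicking the reported schema; real data has far more rows.
      table = pa.table({
          "profileid": pa.array([10, 20], type=pa.int64()),
          "features": pa.array([[0.1, 0.2], [0.3, 0.4]], type=pa.list_(pa.float64())),
      })

      # Flatten the list<double> column into top-level double columns.
      rows = table.column("features").to_pylist()
      width = len(rows[0])  # assumes fixed-length feature lists
      flat = pa.table({
          "profileid": table.column("profileid"),
          **{f"features_{i}": pa.array([r[i] for r in rows], type=pa.float64())
             for i in range(width)},
      })
      print(flat.column_names)
      ```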

      >>> import glob
      >>> import pyarrow.parquet as pq
      >>> files = glob.glob("/home/dbt/data/*")
      >>> data = pq.ParquetDataset(files).read().to_pandas()
      [New Thread 0x7fffe8f84700 (LWP 23769)]
      [New Thread 0x7fffe3b93700 (LWP 23770)]
      [New Thread 0x7fffe3392700 (LWP 23771)]
      [New Thread 0x7fffe2b91700 (LWP 23772)]
      [Thread 0x7fffe2b91700 (LWP 23772) exited]
      [Thread 0x7fffe3b93700 (LWP 23770) exited]
      Thread 4 "python" received signal SIGSEGV, Segmentation fault.
      [Switching to Thread 0x7fffe3392700 (LWP 23771)]
      0x00007ffff270fc94 in arrow::Status arrow::VisitTypeInline<arrow::py::ArrowDeserializer>(arrow::DataType const&, arrow::py::ArrowDeserializer*) ()
         from /home/dbt/miniconda3/lib/python3.6/site-packages/pyarrow/../../../libarrow_python.so.0
      (gdb) backtrace
      #0  0x00007ffff270fc94 in arrow::Status arrow::VisitTypeInline<arrow::py::ArrowDeserializer>(arrow::DataType const&, arrow::py::ArrowDeserializer*) ()
         from /home/dbt/miniconda3/lib/python3.6/site-packages/pyarrow/../../../libarrow_python.so.0
      #1  0x00007ffff2700b5a in arrow::py::ConvertColumnToPandas(arrow::py::PandasOptions, std::shared_ptr<arrow::Column> const&, _object*, _object**) ()
         from /home/dbt/miniconda3/lib/python3.6/site-packages/pyarrow/../../../libarrow_python.so.0
      #2  0x00007ffff2714985 in arrow::Status arrow::py::ConvertListsLike<arrow::DoubleType>(arrow::py::PandasOptions, std::shared_ptr<arrow::Column> const&, _object**) () from /home/dbt/miniconda3/lib/python3.6/site-packages/pyarrow/../../../libarrow_python.so.0
      #3  0x00007ffff2716b92 in arrow::py::ObjectBlock::Write(std::shared_ptr<arrow::Column> const&, long, long) ()
         from /home/dbt/miniconda3/lib/python3.6/site-packages/pyarrow/../../../libarrow_python.so.0
      #4  0x00007ffff270a489 in arrow::py::DataFrameBlockCreator::WriteTableToBlocks(int)::{lambda(int)#1}::operator()(int) const ()
         from /home/dbt/miniconda3/lib/python3.6/site-packages/pyarrow/../../../libarrow_python.so.0
      #5  0x00007ffff270a67c in std::thread::_Impl<std::_Bind_simple<arrow::Status arrow::ParallelFor<arrow::py::DataFrameBlockCreator::WriteTableToBlocks(int)::{lambda(int)#1}&>(int, int, arrow::py::DataFrameBlockCreator::WriteTableToBlocks(int)::{lambda(int)#1}&)::{lambda()#1} ()> >::_M_run() ()
         from /home/dbt/miniconda3/lib/python3.6/site-packages/pyarrow/../../../libarrow_python.so.0
      #6  0x00007ffff1e30c5c in std::execute_native_thread_routine_compat (__p=<optimized out>)
          at /opt/conda/conda-bld/compilers_linux-64_1505664199673/work/.build/src/gcc-7.2.0/libstdc++-v3/src/c++11/thread.cc:110
      #7  0x00007ffff7bc16ba in start_thread (arg=0x7fffe3392700) at pthread_create.c:333
      #8  0x00007ffff78f73dd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
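
      One way to narrow the failure down, since a couple of files load fine while 100 crash, is to read the files in growing batches and note where the crash first appears. A hedged sketch, not from the original report; the temp-directory files below are stand-ins for /home/dbt/data/*:

      ```python
      import glob
      import os
      import tempfile

      import pyarrow as pa
      import pyarrow.parquet as pq

      # Stand-in data: write a few tiny Parquet files to a temp directory.
      tmpdir = tempfile.mkdtemp()
      table = pa.table({"label": pa.array([1.0, 2.0], type=pa.float64())})
      for i in range(4):
          pq.write_table(table, os.path.join(tmpdir, f"part-{i}.parquet"))

      # Read in growing batches; under the reported crash, the last count
      # printed before the segfault localizes the problematic batch size.
      files = sorted(glob.glob(os.path.join(tmpdir, "*.parquet")))
      for n in range(1, len(files) + 1):
          batch = pq.ParquetDataset(files[:n]).read()
          print(n, batch.num_rows)
      ```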


              • Assignee: Wes McKinney (wesm)
              • Reporter: DB Tsai (dbtsai)