Apache Arrow / ARROW-1873

[Python] Segmentation fault when loading total 2GB of parquet files

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.8.0
    • Component/s: None

      Description

      We are trying to load 100 Parquet files, each around 20 MB. Until we port ARROW-1830 into our pyarrow distribution, we use glob to list all the files and then load them into a pandas DataFrame through pyarrow.

      The schema of the Parquet files is:

      root
       |-- dateint: integer (nullable = true)
       |-- profileid: long (nullable = true)
       |-- time: long (nullable = true)
       |-- label: double (nullable = true)
       |-- weight: double (nullable = true)
       |-- features: array (nullable = true)
       |    |-- element: double (containsNull = true)
      

      If we load only a couple of the files, everything works. However, when loading all 100 of them, we get the segmentation fault shown in the gdb session below. FYI, if we flatten features: array[double] into top-level columns, the files are about the same size and load without problems.
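
To illustrate the flattened layout mentioned above, here is a minimal sketch in plain Python, assuming each row is a dict. The helper name `flatten_features` and the `features_0`, `features_1`, ... column names are hypothetical; the report does not say how the flattened columns were actually named or produced.

```python
from typing import Any, Dict


def flatten_features(row: Dict[str, Any], prefix: str = "features") -> Dict[str, Any]:
    """Return a copy of `row` with the list column expanded into scalar columns.

    Hypothetical helper: the `<prefix>_<index>` naming scheme is an assumption.
    """
    # Copy every column except the nested list column.
    flat = {key: value for key, value in row.items() if key != prefix}
    # Emit one top-level column per list element; a missing or null list adds nothing.
    for index, value in enumerate(row.get(prefix) or []):
        flat[f"{prefix}_{index}"] = value
    return flat


# Made-up sample values that follow the schema in this report.
row = {
    "dateint": 20171129,
    "profileid": 1,
    "time": 0,
    "label": 1.0,
    "weight": 0.5,
    "features": [0.1, None, 0.3],  # elements may be null (containsNull = true)
}
flat_row = flatten_features(row)
```

After flattening, the row holds only scalar columns, so converting it no longer involves a list column.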

      Is there anything we can try to eliminate this issue? Thanks.

      >>> import glob
      >>> import pyarrow.parquet as pq
      >>> files = glob.glob("/home/dbt/data/*")
      >>> data = pq.ParquetDataset(files).read().to_pandas()
      [New Thread 0x7fffe8f84700 (LWP 23769)]
      [New Thread 0x7fffe3b93700 (LWP 23770)]
      [New Thread 0x7fffe3392700 (LWP 23771)]
      [New Thread 0x7fffe2b91700 (LWP 23772)]
      [Thread 0x7fffe2b91700 (LWP 23772) exited]
      [Thread 0x7fffe3b93700 (LWP 23770) exited]
      
      Thread 4 "python" received signal SIGSEGV, Segmentation fault.
      [Switching to Thread 0x7fffe3392700 (LWP 23771)]
      0x00007ffff270fc94 in arrow::Status arrow::VisitTypeInline<arrow::py::ArrowDeserializer>(arrow::DataType const&, arrow::py::ArrowDeserializer*) ()
         from /home/dbt/miniconda3/lib/python3.6/site-packages/pyarrow/../../../libarrow_python.so.0
      (gdb) backtrace
      #0  0x00007ffff270fc94 in arrow::Status arrow::VisitTypeInline<arrow::py::ArrowDeserializer>(arrow::DataType const&, arrow::py::ArrowDeserializer*) ()
         from /home/dbt/miniconda3/lib/python3.6/site-packages/pyarrow/../../../libarrow_python.so.0
      #1  0x00007ffff2700b5a in arrow::py::ConvertColumnToPandas(arrow::py::PandasOptions, std::shared_ptr<arrow::Column> const&, _object*, _object**) ()
         from /home/dbt/miniconda3/lib/python3.6/site-packages/pyarrow/../../../libarrow_python.so.0
      #2  0x00007ffff2714985 in arrow::Status arrow::py::ConvertListsLike<arrow::DoubleType>(arrow::py::PandasOptions, std::shared_ptr<arrow::Column> const&, _object**) () from /home/dbt/miniconda3/lib/python3.6/site-packages/pyarrow/../../../libarrow_python.so.0
      #3  0x00007ffff2716b92 in arrow::py::ObjectBlock::Write(std::shared_ptr<arrow::Column> const&, long, long) ()
         from /home/dbt/miniconda3/lib/python3.6/site-packages/pyarrow/../../../libarrow_python.so.0
      #4  0x00007ffff270a489 in arrow::py::DataFrameBlockCreator::WriteTableToBlocks(int)::{lambda(int)#1}::operator()(int) const ()
         from /home/dbt/miniconda3/lib/python3.6/site-packages/pyarrow/../../../libarrow_python.so.0
      #5  0x00007ffff270a67c in std::thread::_Impl<std::_Bind_simple<arrow::Status arrow::ParallelFor<arrow::py::DataFrameBlockCreator::WriteTableToBlocks(int)::{lambda(int)#1}&>(int, int, arrow::py::DataFrameBlockCreator::WriteTableToBlocks(int)::{lambda(int)#1}&)::{lambda()#1} ()> >::_M_run() ()
         from /home/dbt/miniconda3/lib/python3.6/site-packages/pyarrow/../../../libarrow_python.so.0
      #6  0x00007ffff1e30c5c in std::execute_native_thread_routine_compat (__p=<optimized out>)
          at /opt/conda/conda-bld/compilers_linux-64_1505664199673/work/.build/src/gcc-7.2.0/libstdc++-v3/src/c++11/thread.cc:110
      #7  0x00007ffff7bc16ba in start_thread (arg=0x7fffe3392700) at pthread_create.c:333
      #8  0x00007ffff78f73dd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
      
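One thing that could be tried while investigating is to convert the files in smaller groups instead of all 100 at once. This is not a confirmed fix. The sketch below assumes the crash depends on how much data is converted in a single call, which this report does not establish. `batched` and `load_in_batches` are hypothetical helpers, and `read_batch` is supplied by the caller.

```python
from typing import Callable, Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def batched(paths: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive groups of at most `size` paths (hypothetical helper)."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(paths), size):
        yield list(paths[start:start + size])


def load_in_batches(
    paths: Sequence[str],
    size: int,
    read_batch: Callable[[List[str]], T],
) -> List[T]:
    """Call `read_batch` once per group and collect the results in order."""
    return [read_batch(batch) for batch in batched(paths, size)]


# Stand-in reader (`len`) so the sketch runs without pyarrow installed.
demo = load_in_batches([f"part-{i}" for i in range(5)], 2, read_batch=len)
```

With pyarrow and pandas available, `read_batch` could wrap the call from the session above, for example `lambda batch: pq.ParquetDataset(batch).read().to_pandas()`, and the resulting frames could then be combined with `pandas.concat`.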

              People

              • Assignee: Wes McKinney (wesmckinn)
              • Reporter: DB Tsai (dbtsai)
              • Votes: 0
              • Watchers: 4