Apache Arrow / ARROW-12065

[C++][Python] Segfault reading JSON file

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version: 3.0.0
    • Fix Version: 4.0.0
    • Components: C++, Python
    • Environment: Arch Linux, 31 GB RAM

    Description

      I hit this while analysing a JSON file that is not very complex but reasonably large. I have reduced it to a fairly minimal reproduction:

      ```
      import pyarrow.json

      # Crashes with a segfault instead of returning a table or raising an error
      pyarrow.json.read_json('test.json')
      ```

      where `test.json` contains:

      ```

      {"A":"<0 repeated 1.6 million times>"} {"B":[]}

      ```
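
For anyone without the attachment, a file of the shape shown above can be generated with a short script. This is only a sketch: the helper name `write_repro_file` and its parameters are made up for illustration, and the separator between the two objects is an assumption (the report shows them on one line; a newline is used here because `pyarrow.json` reads line-delimited JSON).

```python
import json
import os


def write_repro_file(path, n_zeros=1_600_000, separator="\n"):
    """Write two JSON objects: a long string of zeros under "A",
    followed by an empty list under "B". Returns the file size in bytes.

    The function name, n_zeros and separator are illustrative assumptions,
    not taken from the attached segfault.sh.
    """
    first = json.dumps({"A": "0" * n_zeros}, separators=(",", ":"))
    second = json.dumps({"B": []}, separators=(",", ":"))
    # newline="" keeps "\n" from being translated on platforms that would do so
    with open(path, "w", newline="") as f:
        f.write(first + separator + second + "\n")
    return os.path.getsize(path)


# Usage (writes roughly 1.6 MB):
# write_repro_file("test.json")
```

Passing `separator=" "` gives the single-line layout exactly as it is displayed in the description.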

      A file of this size should easily fit in memory all at once, so I am surprised to see a segfault.

      Running under gdb and taking a backtrace gives:

      ```

      (gdb) bt
      #0 0x00007ffff5c1965d in std::__shared_ptr<arrow::Buffer, (__gnu_cxx::_Lock_policy)2>::__shared_ptr(std::__shared_ptr<arrow::Buffer, (__gnu_cxx::_Lock_policy)2> const&) () from /home/patrick/.local/lib/python3.9/site-packages/pyarrow/libarrow.so.300
      #1 0x00007ffff5ca8d9e in arrow::json::ChunkedListArrayBuilder::Insert(long, std::shared_ptr<arrow::Field> const&, std::shared_ptr<arrow::Array> const&) () from /home/patrick/.local/lib/python3.9/site-packages/pyarrow/libarrow.so.300
      #2 0x00007ffff5cabcc8 in arrow::json::ChunkedStructArrayBuilder::Finish(std::shared_ptr<arrow::ChunkedArray>*) () from /home/patrick/.local/lib/python3.9/site-packages/pyarrow/libarrow.so.300
      #3 0x00007ffff5c1fc16 in arrow::json::TableReaderImpl::Read() () from /home/patrick/.local/lib/python3.9/site-packages/pyarrow/libarrow.so.300
      #4 0x00007fffcf73da69 in __pyx_pw_7pyarrow_5_json_1read_json(_object*, _object*, _object*) () from /home/patrick/.local/lib/python3.9/site-packages/pyarrow/_json.cpython-39-x86_64-linux-gnu.so
      #5 0x00007ffff7d35a43 in ?? () from /usr/lib/libpython3.9.so.1.0
      #6 0x00007ffff7d1be6d in _PyObject_MakeTpCall () from /usr/lib/libpython3.9.so.1.0
      #7 0x00007ffff7d17b3a in _PyEval_EvalFrameDefault () from /usr/lib/libpython3.9.so.1.0
      #8 0x00007ffff7d119ad in ?? () from /usr/lib/libpython3.9.so.1.0
      #9 0x00007ffff7d11371 in _PyEval_EvalCodeWithName () from /usr/lib/libpython3.9.so.1.0
      #10 0x00007ffff7dd3f83 in PyEval_EvalCode () from /usr/lib/libpython3.9.so.1.0
      #11 0x00007ffff7de43dd in ?? () from /usr/lib/libpython3.9.so.1.0
      #12 0x00007ffff7ddfc7b in ?? () from /usr/lib/libpython3.9.so.1.0
      #13 0x00007ffff7cf38ab in ?? () from /usr/lib/libpython3.9.so.1.0
      #14 0x00007ffff7cf3a63 in PyRun_InteractiveLoopFlags () from /usr/lib/libpython3.9.so.1.0
      #15 0x00007ffff7c81f6b in PyRun_AnyFileExFlags () from /usr/lib/libpython3.9.so.1.0
      #16 0x00007ffff7c7665c in ?? () from /usr/lib/libpython3.9.so.1.0
      #17 0x00007ffff7dc6fa9 in Py_BytesMain () from /usr/lib/libpython3.9.so.1.0
      #18 0x00007ffff7a43b25 in __libc_start_main () from /usr/lib/libc.so.6
      #19 0x000055555555504e in _start ()
      (gdb)

      ```
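
When checking whether a given pyarrow build still crashes, it can help to run the reproduction in a child interpreter so the segfault does not take down the session doing the check. The sketch below assumes a POSIX system, where `subprocess` reports a child killed by a signal as a negative return code; the helper name `run_isolated` is hypothetical.

```python
import signal
import subprocess
import sys


def run_isolated(code):
    """Run `code` in a fresh Python interpreter.

    Returns (returncode, segfaulted). On POSIX, a child terminated by
    SIGSEGV has returncode == -signal.SIGSEGV.
    """
    proc = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,
        text=True,
    )
    return proc.returncode, proc.returncode == -signal.SIGSEGV


# Usage, assuming test.json is in the current directory:
# rc, segfaulted = run_isolated(
#     "import pyarrow.json; pyarrow.json.read_json('test.json')"
# )
```

A second value of `False` together with a return code of 0 means the read completed without crashing.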

       

      Attachments

        1. image-2021-03-23-15-43-29-139.png
          10 kB
          Patrick
        2. segfault.sh
          0.1 kB
          Patrick
        3. test.tgz
          2 kB
          Patrick

            People

              apitrou Antoine Pitrou
              paddygord Patrick

                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 0.5h