Apache Arrow / ARROW-17943

[Python] Coredump when joining big large_strings


Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 9.0.0
    • Fix Version/s: None
    • Component/s: C++, Python

    Description

      Joining large strings in pyarrow aborts with this error:

      terminate called after throwing an instance of 'std::length_error'
        what():  vector::_M_default_append
      Aborted (core dumped)

      (libstdc++ throws std::length_error from vector::_M_default_append when
      std::vector::resize() is asked to grow beyond max_size(), which suggests
      an oversized or overflowing buffer-length computation somewhere in the
      join path.)

      Example code (note that this needs quite a lot of RAM; it was run on a
      machine with 128 GB):

      import pyarrow as pa

      # 2**24 rows, each Text value is a 1 KiB string
      ids = [x for x in range(2**24)]
      text = ['a' * 2**10] * 2**24
      schema = pa.schema([
          ('Id', pa.int32()),
          ('Text', pa.large_string()),
      ])

      tab1 = pa.Table.from_arrays([ids, text], schema=schema)
      tab2 = pa.Table.from_arrays([ids, text], schema=schema)

      joined = tab1.join(tab2, keys='Id', right_keys='Id', left_suffix='tab1')
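
      For scale (rough arithmetic, not stated in the original report): each
      table holds 2**24 strings of 1 KiB, and a self-join on the unique 'Id'
      column matches every row exactly once, so the output carries two such
      Text columns.

      rows = 2**24               # 16,777,216 rows per table
      str_bytes = 2**10          # 1 KiB per string
      per_table = rows * str_bytes       # 2**34 bytes
      print(per_table / 2**30)           # 16.0 -> 16 GiB of string data per table
      # The joined result holds both Text columns, i.e. roughly 32 GiB of
      # string data. For comparison, a string() column uses 32-bit offsets and
      # can address at most ~2 GiB per array, which fits the segfault seen
      # with the plain string schema below.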

      The same code segfaults if I use this schema instead:

      schema = pa.schema([
          ('Id', pa.int32()),
          ('Text', pa.string()),
          ])
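
      A possible workaround, assuming the failure happens while the join
      kernel materializes the huge Text payload columns (a guess, not
      confirmed): join only the key column plus row indices, then gather the
      payload columns afterwards with Table.take(). This is an untested
      sketch; join_without_payload, the __lidx/__ridx columns, and the
      Texttab1/Texttab2 output names are hypothetical helpers, not pyarrow
      API, and I have not verified that take() copes with payloads of this
      size.

      import numpy as np
      import pyarrow as pa

      def join_without_payload(tab1, tab2, key='Id'):
          # Join (key, row-index) tables so the hash join never touches the
          # multi-GiB string columns.
          left = pa.table({
              key: tab1[key],
              '__lidx': np.arange(tab1.num_rows, dtype=np.int64),
          })
          right = pa.table({
              key: tab2[key],
              '__ridx': np.arange(tab2.num_rows, dtype=np.int64),
          })
          matched = left.join(right, keys=key, join_type='inner')

          # Gather the payload rows for each side with take(), bypassing the
          # join kernel for the string data.
          left_rows = tab1.take(matched['__lidx'])
          right_rows = tab2.take(matched['__ridx'])
          return pa.table({
              key: matched[key],
              'Texttab1': left_rows['Text'],
              'Texttab2': right_rows['Text'],
          })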


People

    Assignee: Unassigned
    Reporter: flowpoint
