Apache Arrow / ARROW-1695

[Serialization] Fix reference counting of numpy arrays created in custom serializer

Details

    Description

      The problem happens with the following code:

      import numpy as np
      import pyarrow
      import sys
      
      class Bar(object):
          pass
      
      def bar_custom_serializer(obj):
          x = np.zeros(4)
          return x
      
      def bar_custom_deserializer(serialized_obj):
          return serialized_obj
      
      pyarrow._default_serialization_context.register_type(
          Bar, "Bar", pickle=False,
          custom_serializer=bar_custom_serializer,
          custom_deserializer=bar_custom_deserializer)
      
      pyarrow.serialize(Bar())
      

      After execution of pyarrow.serialize, the interpreter crashes in the garbage collection routine.

      This happens when the custom serializer returns a numpy array to which nothing else holds a reference. The current code has not run into this because, so far, no custom serializer has created a new numpy array.

      I think the problem is that the numpy array's reference count drops to zero between the end of SerializeSequences in python_to_arrow.cc and the call to NdarrayToTensor. I'll push a fix later today that increments and decrements the reference counts at the appropriate places.
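To illustrate the lifetime issue without touching pyarrow, here is a minimal pure-Python sketch. It assumes CPython's immediate reference-counting behavior, and the `keep_alive` list is a hypothetical stand-in for the extra reference the fix would hold; it is not the actual change to python_to_arrow.cc.

```python
import weakref

import numpy as np


def bar_custom_serializer(obj):
    # Same shape as the serializer in the report: the array is created
    # here and the caller receives the only reference to it.
    return np.zeros(4)


# Case 1: nobody keeps the returned array. Under CPython the temporary is
# freed as soon as weakref.ref() returns, so the weak reference is dead.
ref_unowned = weakref.ref(bar_custom_serializer(None))
collected_without_owner = ref_unowned() is None

# Case 2: something holds a reference until the array has been consumed.
# keep_alive is a hypothetical stand-in for the extra reference that the
# fix would take (incremented) and later release (decremented).
keep_alive = []
arr = bar_custom_serializer(None)
keep_alive.append(arr)          # analogous to incrementing the count
ref_owned = weakref.ref(arr)
del arr                         # the local name goes away
alive_with_owner = ref_owned() is not None

keep_alive.clear()              # analogous to decrementing the count
collected_after_release = ref_owned() is None

print(collected_without_owner, alive_with_owner, collected_after_release)
```

In the first case the array disappears before anything could convert it, which matches the window described above. In the second case it stays alive exactly as long as the extra reference is held.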

            People

              pcmoritz Philipp Moritz