  1. Apache Arrow
  2. ARROW-1695

[Serialization] Fix reference counting of numpy arrays created in custom serializer

    Details

      Description

      The problem happens with the following code:

      import numpy as np
      import pyarrow
      import sys
      
      class Bar(object):
          pass
      
      def bar_custom_serializer(obj):
          # Create a fresh array; nothing else holds a reference to it.
          x = np.zeros(4)
          return x
      
      def bar_custom_deserializer(serialized_obj):
          return serialized_obj
      
      pyarrow._default_serialization_context.register_type(
          Bar, "Bar", pickle=False,
          custom_serializer=bar_custom_serializer,
          custom_deserializer=bar_custom_deserializer)
      
      # Crashes the interpreter during garbage collection.
      pyarrow.serialize(Bar())
      

      After execution of pyarrow.serialize, the interpreter crashes in the garbage collection routine.

      This happens when the custom serializer returns a NumPy array to which nothing else holds a reference. It has gone unnoticed in the current code because, so far, we have not created new NumPy arrays in a custom serializer.

      I think the problem here is that the numpy array hits reference count zero between the end of SerializeSequences in python_to_arrow.cc and the call to NdarrayToTensor. I'll push a fix later today, which just increases and decreases the reference counts at the appropriate places.
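The lifetime issue described above can be illustrated in pure Python. This is only an analogy under stated assumptions, not the pyarrow C++ code: `Payload`, `custom_serializer`, `serialize_without_hold`, and `serialize_with_hold` are hypothetical names, a weak reference stands in for a borrowed (non-owning) pointer, and appending to a list stands in for incrementing the reference count.

```python
import gc
import weakref


class Payload:
    """Hypothetical stand-in for the array created inside the custom serializer."""

    def __init__(self):
        self.data = [0.0] * 4


def custom_serializer(obj):
    # Like bar_custom_serializer: returns a fresh object nothing else refers to.
    return Payload()


def serialize_without_hold(obj):
    # Keep only a non-owning handle. The temporary result loses its last
    # strong reference as soon as this expression has been evaluated.
    return weakref.ref(custom_serializer(obj))


def serialize_with_hold(obj, keep_alive):
    result = custom_serializer(obj)
    keep_alive.append(result)  # analogous to incrementing the reference count
    return weakref.ref(result)


dangling = serialize_without_hold(object())
gc.collect()  # not needed on CPython; makes the sketch robust elsewhere
object_was_lost = dangling() is None

held = []
handle = serialize_with_hold(object(), held)
gc.collect()
object_survived = handle() is not None

held.clear()  # analogous to decrementing the reference count once done
gc.collect()
object_released = handle() is None

print(object_was_lost, object_survived, object_released)
```

In Python the lost object simply becomes unreachable. In C++ code that keeps a raw pointer to such an object, the same sequence leaves a dangling pointer, which would be consistent with the crash reported here.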

              People

              • Assignee: Philipp Moritz (pcmoritz)
              • Reporter: Philipp Moritz (pcmoritz)