Apache Arrow / ARROW-17641

[python] Deserializing ParseOptions does not set up invalid row handler correctly


Details

    Description

      Serializing and then deserializing a csv.ParseOptions object that has an invalid_row_handler renders the handler unusable: the first invalid row makes the read fail instead of being passed to the handler. This is likely because the setter is not called correctly in the __setstate__ method.
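
The suspected mechanism can be illustrated with a pure-Python sketch. This is an analogy, not pyarrow's implementation: the `Options` class, its `_handler` and `_active_handler` fields, and the `round_trip` helper are all hypothetical names made up for illustration. The sketch assumes a property setter that does extra wiring and a `__setstate__` that restores the private field without going through that setter.

```python
class Options:
    """Hypothetical stand-in for an options object (not pyarrow code)."""

    def __init__(self, handler=None):
        self._handler = None
        self._active_handler = None  # what the "reader" actually calls
        self.handler = handler  # goes through the property setter

    @property
    def handler(self):
        return self._handler

    @handler.setter
    def handler(self, value):
        # The setter stores the callable AND wires it up for the reader.
        self._handler = value
        self._active_handler = value

    def __getstate__(self):
        # Only the user-visible handler is serialized.
        return {"handler": self._handler}

    def __setstate__(self, state):
        # Bug being illustrated: the private field is restored directly,
        # so the wiring step in the setter never runs.
        self._handler = state["handler"]
        self._active_handler = None

    def handle(self, row):
        if self._active_handler is None:
            raise ValueError(f"invalid row: {row!r}")
        return self._active_handler(row)


def skip(row):
    return "skip"


def round_trip(options):
    # Mirrors what unpickling does: allocate without calling __init__,
    # then hand the saved state to __setstate__.
    restored = Options.__new__(Options)
    restored.__setstate__(options.__getstate__())
    return restored


original = Options(handler=skip)
restored = round_trip(original)

# The handler is still visible on the restored object, but it is not wired up.
# Re-assigning it through the setter restores the wiring.
restored.handler = restored.handler
```

In this sketch, re-assigning the property to itself is enough to repair the restored object, which matches the workaround shown in the reproduction script. A fix along the same lines would have `__setstate__` assign through the setter rather than writing the private field.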

      Reproduction script:

       

      import cloudpickle
      from pyarrow import csv
      
      
      invalid_csv = """f1,f2
      3,4
      5,6
      \x00\x00
      7,8"""
      
      source = "test.csv"
      with open(source, "w") as f:
          f.write(invalid_csv)
      
      
      def read_file(path, parse_options):
          # Workaround: uncomment the next line to re-run the setter after unpickling.
          # parse_options.invalid_row_handler = parse_options.invalid_row_handler
      
          with open(path, "rb") as f:
              return csv.read_csv(f, parse_options=parse_options)
      
      
      parse_options = csv.ParseOptions(delimiter=",", invalid_row_handler=lambda i: "skip")
      
      # Will succeed
      print(read_file(source, parse_options=parse_options))
      
      parse_options = cloudpickle.loads(cloudpickle.dumps(parse_options))
      
      # Will fail
      print(read_file(source, parse_options=parse_options))
      
      
      

       

            People

              kaifricke Kai Fricke
