ARROW-5419: [C++] CSV strings_can_be_null option doesn't respect all null_values

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.14.0
    • Component/s: C++, Python
    • Environment: Python 3.6.8
      PyArrow 0.13.1.dev225+g184b8deb
      NumPy 1.16.3
      Pandas 0.24.2

    Description

      Relates to ARROW-5195 and https://github.com/apache/arrow/issues/4184

      I was testing the new strings_can_be_null ConvertOption (built from git 184b8deb651c6f6308c0fa2a595f5a40f5da8ce8) with the CSV reader. With the option enabled, an empty string is not returned as NULL, even though '' is in the default null_values list (https://github.com/apache/arrow/blob/f7ef65e5fc367f1f5649dfcea0754e413fcca394/cpp/src/arrow/csv/options.cc#L28):

      options.null_values = {"", "#N/A", "#N/A N/A", "#NA", "-1.#IND", "-1.#QNAN",
      "-NaN", "-nan", "1.#IND", "1.#QNAN", "N/A", "NA",
      "NULL", "NaN", "n/a", "nan", "null"};
      

      The strings_can_be_null option was added to expose the same NULL handling for strings as pandas.read_csv, so I believe it should also treat empty strings as NULL.

      In Pandas:

      import io
      import pandas as pd

      content = b"a,b\n1,null\n2,\n3,test"
      df = pd.read_csv(io.BytesIO(content))
      print(df)
      # Output:
      #    a     b
      # 0  1   NaN
      # 1  2   NaN
      # 2  3  test
      

      In PyArrow:

      import io
      import pyarrow.csv as pc

      # `content` is the same bytes object as in the Pandas example
      convert_options = pc.ConvertOptions(strings_can_be_null=True)
      table = pc.read_csv(io.BytesIO(content), convert_options=convert_options)
      print(table.to_pydict())
      # Output: OrderedDict([('a', [1, 2, 3]), ('b', [None, '', 'test'])])
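
      For reference, the behavior expected here can be modelled in plain Python. This is only a sketch using the standard csv module, not Arrow's implementation: read_string_columns and DEFAULT_NULL_VALUES are hypothetical names introduced for illustration, and the helper does no type inference, so column a stays as strings. It assumes that every entry of the quoted null_values list, including the empty string, should map to None when strings_can_be_null is enabled.

```python
import csv
import io

# Default null spellings, copied from the null_values list quoted above.
DEFAULT_NULL_VALUES = frozenset({
    "", "#N/A", "#N/A N/A", "#NA", "-1.#IND", "-1.#QNAN",
    "-NaN", "-nan", "1.#IND", "1.#QNAN", "N/A", "NA",
    "NULL", "NaN", "n/a", "nan", "null",
})


def read_string_columns(data, null_values=DEFAULT_NULL_VALUES,
                        strings_can_be_null=True):
    """Hypothetical helper: parse CSV bytes into a dict of string columns.

    Not Arrow's implementation. It only models the behavior this report
    expects, where every entry of null_values (including "") becomes None.
    """
    reader = csv.reader(io.StringIO(data.decode("utf-8")))
    header = next(reader)
    columns = {name: [] for name in header}
    for row in reader:
        for name, cell in zip(header, row):
            if strings_can_be_null and cell in null_values:
                columns[name].append(None)  # null spelling -> None
            else:
                columns[name].append(cell)  # keep the raw string
    return columns


content = b"a,b\n1,null\n2,\n3,test"
print(read_string_columns(content)["b"])
# [None, None, 'test']
```

      With strings_can_be_null=False the same helper returns ['null', '', 'test'] for column b, so the flag alone decides whether null spellings are converted.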
      

       

       

            People

              Assignee: Antoine Pitrou (apitrou)
              Reporter: Dennis Waldron (dennis.waldron)
              Votes: 0
              Watchers: 6

                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 0.5h