  Apache Arrow / ARROW-10247

[C++][Dataset] Cannot write dataset with dictionary column as partition field

Details

    Description

      When the column to use for partitioning is dictionary encoded, we get this error:

      In [9]: import pyarrow as pa
         ...: import pyarrow.dataset as ds
      
      In [10]: part = ["xxx"] * 3 + ["yyy"] * 3
          ...: table = pa.table([
          ...:     pa.array(range(len(part))),
          ...:     pa.array(part).dictionary_encode(),
          ...: ], names=['col', 'part'])
      
      In [11]: part = ds.partitioning(table.select(["part"]).schema)
      
      In [12]: ds.write_dataset(table, "test_dataset_dict_part", format="parquet", partitioning=part)
      ---------------------------------------------------------------------------
      ArrowTypeError                            Traceback (most recent call last)
      <ipython-input-12-c7b81c9b0bda> in <module>
      ----> 1 ds.write_dataset(table, "test_dataset_dict_part", format="parquet", partitioning=part)
      
      ~/scipy/repos/arrow/python/pyarrow/dataset.py in write_dataset(data, base_dir, basename_template, format, partitioning, schema, filesystem, file_options, use_threads)
          773     _filesystemdataset_write(
          774         data, base_dir, basename_template, schema,
      --> 775         filesystem, partitioning, file_options, use_threads,
          776     )
      
      ~/scipy/repos/arrow/python/pyarrow/_dataset.pyx in pyarrow._dataset._filesystemdataset_write()
      
      ~/scipy/repos/arrow/python/pyarrow/error.pxi in pyarrow.lib.check_status()
      
      ArrowTypeError: scalar xxx (of type string) is invalid for part: dictionary<values=string, indices=int32, ordered=0>
      In ../src/arrow/dataset/filter.cc, line 1082, code: VisitConjunctionMembers(*and_.left_operand(), visitor)
      In ../src/arrow/dataset/partition.cc, line 257, code: VisitKeys(expr, [&](const std::string& name, const std::shared_ptr<Scalar>& value) { auto&& _error_or_value28 = (FieldRef(name).FindOneOrNone(*schema_)); do { ::arrow::Status __s = ::arrow::internal::GenericToStatus((_error_or_value28).status()); do { if ((__builtin_expect(!!(!__s.ok()), 0))) { ::arrow::Status _st = (__s); _st.AddContextLine("../src/arrow/dataset/partition.cc", 257, "(_error_or_value28).status()"); return _st; } } while (0); } while (false); auto match = std::move(_error_or_value28).ValueUnsafe();;; if (match) { const auto& field = schema_->field(match[0]); if (!value->type->Equals(field->type())) { return Status::TypeError("scalar ", value->ToString(), " (of type ", *value->type, ") is invalid for ", field->ToString()); } values[match[0]] = value.get(); } return Status::OK(); })
      In ../src/arrow/dataset/file_base.cc, line 321, code: (_error_or_value24).status()
      In ../src/arrow/dataset/file_base.cc, line 354, code: task_group->Finish()
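
The check quoted in the partition.cc frame of the traceback compares the type of the partition key scalar with the type of the schema field, and rejects any mismatch. A plain-Python model of that comparison may make the failure easier to see. Everything below is a sketch under assumptions: `StringType`, `DictionaryType`, `check_partition_key` and `check_partition_key_relaxed` are hypothetical stand-ins, not Arrow APIs, and the relaxed variant is only one conceivable way to loosen the check, not necessarily how the issue was resolved.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StringType:
    """Hypothetical stand-in for Arrow's plain string type."""

    def __str__(self) -> str:
        return "string"


@dataclass(frozen=True)
class DictionaryType:
    """Hypothetical stand-in for Arrow's dictionary type."""

    value_type: StringType
    index_type: str = "int32"

    def __str__(self) -> str:
        return f"dictionary<values={self.value_type}, indices={self.index_type}>"


def check_partition_key(field_name, field_type, scalar_value, scalar_type):
    """Model of the strict check: the scalar type must equal the field type."""
    if scalar_type != field_type:
        raise TypeError(
            f"scalar {scalar_value} (of type {scalar_type}) "
            f"is invalid for {field_name}: {field_type}"
        )
    return scalar_value


def check_partition_key_relaxed(field_name, field_type, scalar_value, scalar_type):
    """Assumed relaxation: also accept a scalar of the dictionary's value type."""
    if isinstance(field_type, DictionaryType) and scalar_type == field_type.value_type:
        return scalar_value
    return check_partition_key(field_name, field_type, scalar_value, scalar_type)
```

With a dictionary-typed `part` field and a string scalar `"xxx"`, the strict model raises a `TypeError` with the same wording as the message in the traceback, while the relaxed model accepts the key.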
      

      This seems like quite a normal use case, since a partition column typically repeats the same values many times. We also support reading partition fields back as dictionary type, so a roundtrip is currently not possible in that case.
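
To illustrate why dictionary encoding suits a partition column, here is a plain-Python sketch of the idea using the `part` values from the reproducer. It is not pyarrow's implementation; `dictionary_encode` and `dictionary_decode` are hypothetical helpers written only for this example.

```python
def dictionary_encode(values):
    """Return (dictionary, indices), storing each distinct value once in first-seen order."""
    dictionary = []
    positions = {}  # value -> index into `dictionary`
    indices = []
    for value in values:
        if value not in positions:
            positions[value] = len(dictionary)
            dictionary.append(value)
        indices.append(positions[value])
    return dictionary, indices


def dictionary_decode(dictionary, indices):
    """Rebuild the original values from the dictionary and the indices."""
    return [dictionary[i] for i in indices]


part = ["xxx"] * 3 + ["yyy"] * 3
dictionary, indices = dictionary_encode(part)
```

Here the six values reduce to a two-entry dictionary plus six small integer indices, which is the kind of saving that makes dictionary type attractive for partition keys.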

      I tagged it for 2.0.0 for now in case a fix is possible today, but I haven't yet looked into how easy it would be to fix.

            People

              bkietz Ben Kietzman
              jorisvandenbossche Joris Van den Bossche

                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 6h