Apache Arrow / ARROW-15310

[C++][Python][Dataset] Detect (and warn?) when DirectoryPartitioning is parsing an actually hive-style file path?



    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: 9.0.0
    • Component/s: C++, Python


      When you have a hive-style partitioned dataset, with our current dataset(..) API, it's relatively easy to mess up the inferred partitioning and get confusing results.

      For example, if you specify the partitioning field names with partitioning=[...] (which is not needed for hive style since those are inferred), we actually assume you want directory partitioning. This DirectoryPartitioning will then parse the hive-style file paths and take the full "key=value" as the data values for the field.
      And then, doing a filter can result in a confusing empty result (because "value" doesn't match "key=value").

      I am wondering whether we can relatively cheaply detect this case and, e.g., give an informative warning to the user.

      Basically what happens is this:

      >>> part = ds.DirectoryPartitioning(pa.schema([("part", "string")]))
      >>> part.parse("part=a")
      <pyarrow.dataset.Expression (part == "part=a")>

      If the parsed value is a string that contains a "=" (and, in this case, also contains the field name), that is, I think, a clear sign that in the large majority of cases the user is doing something wrong.
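A minimal sketch of what such a check could look like, in plain Python (the helper name and the warning idea are assumptions for illustration, not an existing pyarrow API): a string value produced by DirectoryPartitioning that starts with "<field_name>=" is almost certainly a misparsed hive-style path segment.

```python
def looks_like_misparsed_hive(field_name, parsed_value):
    # Hypothetical heuristic, not an existing pyarrow API: a value that
    # DirectoryPartitioning produced which itself starts with "<field_name>="
    # is very likely a hive-style path segment that should have been split
    # into a key and a value.
    return isinstance(parsed_value, str) and parsed_value.startswith(field_name + "=")

# DirectoryPartitioning on a hive-style path yields the full "key=value" string:
print(looks_like_misparsed_hive("part", "part=a"))  # True  -> worth a warning
print(looks_like_misparsed_hive("part", "a"))       # False -> a normal directory value
```

To keep the cost down, such a check could be run on just the first discovered path rather than on every path in the dataset.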

      I am not fully sure where and at what stage the check could be done though. Doing it for every path in the dataset might be too costly.

      Illustrative code example:

      import pyarrow as pa
      import pyarrow.parquet as pq
      import pyarrow.dataset as ds
      import pathlib
      ## constructing a small dataset with 1 hive-style partitioning level
      basedir = pathlib.Path(".") / "dataset_wrong_partitioning"
      (basedir / "part=a").mkdir(exist_ok=True)
      (basedir / "part=b").mkdir(exist_ok=True)
      table1 = pa.table({'a': [1, 2, 3], 'b': [1, 2, 3]})
      pq.write_table(table1, basedir / "part=a" / "data.parquet")
      table2 = pa.table({'a': [4, 5, 6], 'b': [1, 2, 3]})
      pq.write_table(table2, basedir / "part=b" / "data.parquet")

      Reading as is (not specifying a partitioning, so it defaults to no partitioning) will at least give an error about a missing field:

      >>> dataset = ds.dataset(basedir)
      >>> dataset.to_table(filter=ds.field("part") == "a")
      ArrowInvalid: No match for FieldRef.Name(part) in a: int64

      But specifying the partitioning field name (which currently gets (silently) interpreted as directory partitioning) gives a confusing empty result:

      >>> dataset = ds.dataset(basedir, partitioning=["part"])
      >>> dataset.to_table(filter=ds.field("part") == "a")
      a: int64
      b: int64
      part: string
      a: []
      b: []
      part: []

      This filter doesn't work because the values in the "part" column are not "a" but "part=a":

      >>> dataset.to_table().to_pandas()
         a  b    part
      0  1  1  part=a
      1  2  2  part=a
      2  3  3  part=a
      3  4  1  part=b
      4  5  2  part=b
      5  6  3  part=b





              Assignee: Unassigned
              Reporter: Joris Van den Bossche (jorisvandenbossche)