Apache Arrow / ARROW-12988

[CI] The kartothek nightly integration build is failing (test_update_dataset_from_ddf_empty)

Details

    Description

      The nightly "kartothek" integration builds are failing.

      More specifically, test_update_dataset_from_ddf_empty fails with:

      =================================== FAILURES ===================================
      ___________________ test_update_dataset_from_ddf_empty[True] ___________________
      
      store_factory = functools.partial(<function get_store_from_url at 0x7f1434733050>, 'hfs:///tmp/pytest-of-root/pytest-0/test_update_dataset_from_ddf_e0/store')
      shuffle = True
      
          @pytest.mark.parametrize("shuffle", [True, False])
          def test_update_dataset_from_ddf_empty(store_factory, shuffle):
              with pytest.raises(ValueError, match="Cannot store empty datasets"):
                  update_dataset_from_ddf(
      >               dask.dataframe.from_delayed([], meta=(("a", int),)),
                      store_factory,
                      dataset_uuid="output_dataset_uuid",
                      table="core",
                      shuffle=shuffle,
                      partition_on=["a"],
                  ).compute()
      
      tests/io/dask/dataframe/test_update.py:57: 
      _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
      
      dfs = [], meta = (('a', <class 'int'>),), divisions = None
      prefix = 'from-delayed', verify_meta = True
      
          @insert_meta_param_description
          def from_delayed(
              dfs, meta=None, divisions=None, prefix="from-delayed", verify_meta=True
          ):
              """Create Dask DataFrame from many Dask Delayed objects
          
              Parameters
              ----------
              dfs : list of Delayed
                  An iterable of ``dask.delayed.Delayed`` objects, such as come from
                  ``dask.delayed`` These comprise the individual partitions of the
                  resulting dataframe.
              $META
              divisions : tuple, str, optional
                  Partition boundaries along the index.
                  For tuple, see https://docs.dask.org/en/latest/dataframe-design.html#partitions
                  For string 'sorted' will compute the delayed values to find index
                  values.  Assumes that the indexes are mutually sorted.
                  If None, then won't use index information
              prefix : str, optional
                  Prefix to prepend to the keys.
              verify_meta : bool, optional
                  If True check that the partitions have consistent metadata, defaults to True.
              """
              from dask.delayed import Delayed
          
              if isinstance(dfs, Delayed):
                  dfs = [dfs]
              dfs = [
                  delayed(df) if not isinstance(df, Delayed) and hasattr(df, "key") else df
                  for df in dfs
              ]
              for df in dfs:
                  if not isinstance(df, Delayed):
                      raise TypeError("Expected Delayed object, got %s" % type(df).__name__)
          
      >       parent_meta = delayed(make_meta)(dfs[0]).compute()
      E       IndexError: list index out of range
      
      /opt/conda/envs/arrow/lib/python3.7/site-packages/dask/dataframe/io/io.py:591: IndexError
      

      (from https://github.com/ursacomputing/crossbow/runs/2756067090)

      It is not yet clear whether this is a kartothek issue or a pyarrow issue, so I have also opened an issue on the kartothek side: https://github.com/JDASoftwareGroup/kartothek/issues/475
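
Reading the traceback, the IndexError appears to come from `dfs[0]` being evaluated on an empty list inside `from_delayed`. Because that call is an argument to `update_dataset_from_ddf`, it runs before kartothek gets a chance to raise the `ValueError` the test expects. Below is a minimal, dependency-free sketch of that pattern. The function names are hypothetical stand-ins, not dask or kartothek code, and the guard is only an illustration of one way an empty input could be handled, not a proposed fix.

```python
# Hypothetical stand-ins: these are NOT dask functions, only a model of the
# control flow visible in the traceback.


def infer_parent_meta_unguarded(dfs):
    """Mirror the failing line: unconditionally read the first partition."""
    return type(dfs[0])  # raises IndexError when dfs is empty


def infer_parent_meta_guarded(dfs, meta):
    """Fall back to the caller-supplied meta when there are no partitions."""
    if not dfs:
        return type(meta)
    return type(dfs[0])


def demo():
    """Run both variants on an empty partition list and record what happens."""
    results = {}
    try:
        infer_parent_meta_unguarded([])
    except IndexError as exc:
        results["unguarded"] = type(exc).__name__
    guarded = infer_parent_meta_guarded([], meta=(("a", int),))
    results["guarded"] = guarded.__name__
    return results


if __name__ == "__main__":
    print(demo())
```

If this reading is right, the test can only pass when the empty-list case is handled before the first partition is indexed, whichever project ends up owning that check.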

            People

              jorisvandenbossche Joris Van den Bossche