  Beam / BEAM-13335

DataFrame sources produce excessively large index

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: P2
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.36.0
    • Component/s: dsl-dataframe
    • Labels: None

    Description

      DataFrame reads try to match user expectations by giving every element, across
      all shards, a unique index. This is currently done by embedding the file path
      itself in the index, so the (often quite long) path is duplicated for every
      element, and the index sometimes ends up larger than the data itself.

      We should instead generate a guaranteed unique numeric index.
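
      One possible shape for such an index, as a rough sketch only: number the
      shards and pack the shard number and the row offset into a single integer.
      This is not taken from the Beam codebase; the helper names, the ROW_BITS
      constant, and the example paths are all hypothetical, and the sketch assumes
      each shard can be assigned a distinct integer id.

```python
from typing import List, Tuple

# Assumption: no single shard holds more than 2**40 rows.
ROW_BITS = 40


def path_index(path: str, n_rows: int) -> List[Tuple[str, int]]:
    """Path-based scheme from the description: the path repeats in every entry."""
    return [(path, row) for row in range(n_rows)]


def numeric_index(shard_id: int, n_rows: int) -> List[int]:
    """Hypothetical alternative: shard id in the high bits, row offset in the low bits."""
    if n_rows > (1 << ROW_BITS):
        raise ValueError("shard has more rows than ROW_BITS can address")
    base = shard_id << ROW_BITS
    return [base + row for row in range(n_rows)]


# Hypothetical shards: path -> number of rows read from that file.
shards = {
    "gs://bucket/some/long/prefix/part-00000-of-00002.csv": 3,
    "gs://bucket/some/long/prefix/part-00001-of-00002.csv": 2,
}

old = [entry for path, n in shards.items() for entry in path_index(path, n)]
new = [
    value
    for shard_id, n in enumerate(shards.values())
    for value in numeric_index(shard_id, n)
]

# Both schemes give every element a distinct index value.
assert len(set(old)) == len(old)
assert len(set(new)) == len(new)

# Characters of path text carried by the path-based index, summed over all elements.
path_chars = sum(len(path) for path, _ in old)
print(f"path-based index repeats {path_chars} characters of path text")
print(f"numeric index stores {len(new)} integers")
```

      The path-based cost grows with both the row count and the path length,
      while the numeric form stays at one fixed-width integer per element.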

          People

            robertwb Robert Bradshaw
            bhulette Brian Hulette

              Time Tracking

                Estimated: Not Specified
                Remaining: 0h
                Logged: 5h