[BEAM-13335] DataFrame sources produce excessively large index - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: P2
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 2.36.0
Component/s: dsl-dataframe
Labels: None

Description

DataFrame reads try to match user expectations by giving every element a
unique index across all shards. They do this by embedding the file path
itself in the index, which duplicates the (often quite long) path for
every element. As a result, the index sometimes exceeds the size of the
data itself.

We should instead generate a guaranteed unique numeric index.
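
To make the proposal concrete, here is a minimal sketch in plain pandas of one way such an index could be built. It is not the implementation from the linked pull requests. The `path:row` label format, the helper names `path_index` and `numeric_index`, and the `ROW_BITS` constant are all assumptions made for illustration. The idea is to pack a per-shard integer id into the high bits of an int64 and the row offset into the low bits, so that labels from different shards never collide.

```python
import pandas as pd

# Hypothetical bit budget: each shard may hold fewer than 2**ROW_BITS rows.
ROW_BITS = 40
MAX_SHARDS = 1 << (63 - ROW_BITS)


def path_index(path: str, n_rows: int) -> pd.Index:
    """Path-based labels (assumed format): the path is repeated for every row."""
    return pd.Index([f"{path}:{i}" for i in range(n_rows)])


def numeric_index(shard_id: int, n_rows: int) -> pd.Index:
    """Unique int64 labels: shard id in the high bits, row offset in the low bits."""
    if not 0 <= shard_id < MAX_SHARDS:
        raise ValueError("shard_id does not fit in the high bits")
    if n_rows >= 1 << ROW_BITS:
        raise ValueError("shard has too many rows for ROW_BITS")
    base = shard_id << ROW_BITS
    return pd.RangeIndex(base, base + n_rows)


# Two shards read independently, then combined.
shard_a = pd.DataFrame({"x": [1, 2, 3]}, index=numeric_index(0, 3))
shard_b = pd.DataFrame({"x": [4, 5]}, index=numeric_index(1, 2))
combined = pd.concat([shard_a, shard_b])

# Rough size comparison for 100 rows under each scheme.
long_path = "gs://bucket/some/long/directory/part-00000-of-01000.csv"
path_bytes = path_index(long_path, 100).memory_usage(deep=True)
numeric_bytes = pd.Index(list(numeric_index(0, 100))).memory_usage(deep=True)
```

In this sketch the numeric labels cost 8 bytes per row regardless of path length, whereas the path-based labels grow with the length of the path. The scheme only guarantees uniqueness if every shard is assigned a distinct `shard_id`, which is left to the caller here.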

Issue Links

links to

GitHub Pull Request #16066

GitHub Pull Request #16089

People

Assignee: Robert Bradshaw

Reporter: Brian Hulette

Votes: 0

Watchers: 1

Dates

Created: 29/Nov/21 13:37

Updated: 01/Dec/21 15:40

Resolved: 29/Nov/21 22:31