In typical development settings, multiple tables with very different concepts are mapped to the same `DataFrame` class. Inheriting from the pyspark `DataFrame` class is cumbersome because of its chainable methods, and it also makes it difficult to abstract regularly used queries. The proposal is to provide a `DynamicDataFrame` that allows easy inheritance, retaining `DataFrame` methods without losing chainability, either for the newly defined queries or for the usual dataframe ones.
In our experience, this allowed us to iterate much faster, generating business-centric classes in a couple of lines of code. Here's an example of what the application code would look like. Attached at the end is a summary of the different strategies that are usually pursued when trying to abstract queries.
The PR linked to this ticket is an implementation of the `DynamicDataFrame` used in this snippet.
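The actual implementation lives in the linked PR; as a rough illustration of the idea, here is a minimal sketch of one way such a wrapper could work, using attribute delegation. `StubDataFrame` stands in for `pyspark.sql.DataFrame` so the pattern can be shown without a running Spark session, and `SalesDataFrame`/`positive_amounts` are made-up names:

```python
class StubDataFrame:
    """Stand-in for pyspark.sql.DataFrame; its chainable method returns
    the base class, mimicking PySpark's behavior."""
    def __init__(self, rows):
        self.rows = rows

    def filter(self, predicate):
        return StubDataFrame([r for r in self.rows if predicate(r)])


class DynamicDataFrame:
    """Hypothetical delegating wrapper: forwards attribute access to the
    wrapped dataframe and re-wraps chainable results in the subclass, so
    business methods and dataframe methods both stay chainable."""
    def __init__(self, df):
        self._df = df

    def __getattr__(self, name):
        attr = getattr(self._df, name)
        if not callable(attr):
            return attr

        def wrapped(*args, **kwargs):
            result = attr(*args, **kwargs)
            # Re-wrap dataframe results so the business class survives.
            if isinstance(result, type(self._df)):
                return type(self)(result)
            return result
        return wrapped


class SalesDataFrame(DynamicDataFrame):
    """Business-centric class in a couple of lines of code."""
    def positive_amounts(self):
        return self.filter(lambda amount: amount > 0)


sales = SalesDataFrame(StubDataFrame([10, -5, 3]))
# Business method and plain dataframe method chain freely:
chained = sales.positive_amounts().filter(lambda amount: amount > 5)
print(type(chained).__name__)  # SalesDataFrame
```

The key point is the re-wrapping step: results that would normally be downgraded to the base dataframe type come back as the business class, so chainability is preserved in both directions.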
Other strategies we found for handling query abstraction:
1. Functions: using functions that take dataframes and return them transformed. This had a couple of pitfalls: we had to manage the namespaces carefully, there is no clear new object, and the "chainability" didn't feel very pyspark-y.
2. Monkeypatching `DataFrame`: we monkeypatched (https://stackoverflow.com/questions/5626193/what-is-monkey-patching) the `DataFrame` class with methods for the regularly used queries. This kept it pyspark-y, but there was no easy way to handle segregated namespaces.
3. Inheritance: create a class `MyBusinessDataFrame`, inherit from `DataFrame`, and implement the methods there. This solves all the issues, but with a caveat: the chainable methods cast their results explicitly to `DataFrame` (see e.g. https://github.com/apache/spark/blob/master/python/pyspark/sql/dataframe.py#L1910). Therefore, every time you use one of the parent's methods you have to re-cast the result to `MyBusinessDataFrame`, making the code cumbersome.
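The namespace problem with strategy 2 can be sketched as follows; `FakeDataFrame` stands in for the real PySpark class (the real code patches `pyspark.sql.DataFrame` itself), and `positive_amounts` is a made-up query name:

```python
class FakeDataFrame:
    """Stand-in for pyspark.sql.DataFrame."""
    def __init__(self, rows):
        self.rows = rows

    def filter(self, predicate):
        return FakeDataFrame([r for r in self.rows if predicate(r)])


def positive_amounts(self):
    """A regularly used query, attached to the class after the fact."""
    return self.filter(lambda amount: amount > 0)


# The monkeypatch: every dataframe everywhere now carries this method,
# which is why segregating namespaces per business domain is hard.
FakeDataFrame.positive_amounts = positive_amounts

df = FakeDataFrame([10, -5, 3])
print(df.positive_amounts().rows)  # chainable and pyspark-y: [10, 3]
```

The chaining works exactly as with built-in methods, but the patched method is global: two business domains with clashing query names would step on each other.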
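The caveat in strategy 3 can be reproduced with a toy base class whose chainable method, like PySpark's, returns the base type explicitly (`Frame` and `positive_amounts` are illustrative names, not PySpark API):

```python
class Frame:
    """Toy stand-in for pyspark.sql.DataFrame."""
    def __init__(self, rows):
        self.rows = rows

    def filter(self, predicate):
        # Like DataFrame.filter, the result is cast explicitly to the
        # base class, not to type(self).
        return Frame([r for r in self.rows if predicate(r)])


class MyBusinessDataFrame(Frame):
    def positive_amounts(self):
        # Each business method must re-cast to keep the subclass alive.
        return MyBusinessDataFrame(self.filter(lambda a: a > 0).rows)


df = MyBusinessDataFrame([10, -5, 3])
result = df.filter(lambda a: a > 0)
# The parent's chainable method downgraded the type, so the business
# methods are no longer reachable without another re-cast:
print(type(result) is Frame)                    # True
print(isinstance(result, MyBusinessDataFrame))  # False
```

This is the cumbersome part: every call into a parent method silently loses the business class, and the re-cast has to be repeated at each step of a chain.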
(see https://mail-archives.apache.org/mod_mbox/spark-dev/202111.mbox/browser for the link to the original mail in which we proposed this feature)