
Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.4.0
    • Component/s: SQL
    • Labels: None
    • Spark 1.5 release

    Description

      1. Renaming FSBasedRelation to HadoopFsRelation
        Since it's all coupled with the Hadoop FileSystem and job APIs.
      2. HadoopFsRelation should have a no-arg constructor
        paths and partitionColumns should just be methods to be overridden rather than constructor arguments. Having a no-arg constructor makes data source developers' lives easier and keeps the relation serialization friendly (see the first sketch after this list).
      3. Renaming HadoopFsRelation.prepareForWrite to HadoopFsRelation.prepareJobForWrite
        The new name explicitly suggests that developers should only touch the Job instance for preparation work (this is also documented in the Scaladoc).
      4. Allowing serialization while creating {{OutputWriter}}s
        To be more precise, {{OutputWriter}}s themselves are never created on the driver side and serialized to the executor side; instead, the factory that creates {{OutputWriter}}s is created on the driver side and serialized.
        The reason behind this is that passing everything an {{OutputWriter}} needs via the Hadoop Configuration is doable, but sometimes neither intuitive nor convenient; resorting to serialization makes data source developers' lives easier. This actually came up while migrating the Parquet data source: I wanted to pass the final output path (instead of the temporary work path) to the output writer (see here), and had to put a property into the Configuration object to do so. The second sketch after this list illustrates this pattern.
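      As a rough illustration of points 2 and 3, here is a minimal sketch of the intended shape of the API. The type and method names ({{HadoopFsRelation}}, {{paths}}, {{partitionColumns}}, {{prepareJobForWrite}}, {{OutputWriter}} and its factory) follow this description, but the signatures are simplified stand-ins, not the exact Spark 1.4 interfaces.

{code:scala}
import org.apache.hadoop.mapreduce.Job

// Simplified stand-ins for the real Spark SQL types; signatures are illustrative only.
abstract class OutputWriter {
  def write(row: Seq[Any]): Unit   // the real API writes Row objects
  def close(): Unit
}

// The factory is created on the driver side, serialized, and shipped to executors,
// which call newInstance() to build per-task OutputWriters (point 4).
abstract class OutputWriterFactory extends Serializable {
  def newInstance(path: String): OutputWriter
}

// No-arg constructor (point 2): configuration comes from overridable methods rather
// than constructor arguments, keeping concrete relations serialization friendly.
abstract class HadoopFsRelation {
  def paths: Array[String]
  def partitionColumns: Seq[String]   // simplified; the real API uses a schema type

  // prepareJobForWrite (point 3): only the passed-in Job should be mutated here;
  // everything executors need travels in the returned, serializable factory.
  def prepareJobForWrite(job: Job): OutputWriterFactory
}
{code}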
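      To make point 4 concrete, here is a hypothetical sketch (not taken from the actual Parquet migration) building on the simplified types above: the factory carries the final output path as a plain constructor argument, so it reaches executors through serialization rather than through an extra Configuration entry.

{code:scala}
import java.io.{File, PrintWriter}
import org.apache.hadoop.mapreduce.Job

// Hypothetical text-file relation. The final output path is captured as a field of
// the serializable factory instead of being smuggled through the Configuration.
class TextOutputWriter(path: String) extends OutputWriter {
  private val out = new PrintWriter(new File(path))
  def write(row: Seq[Any]): Unit = out.println(row.mkString(","))
  def close(): Unit = out.close()
}

class TextOutputWriterFactory(finalOutputPath: String) extends OutputWriterFactory {
  def newInstance(fileName: String): OutputWriter =
    new TextOutputWriter(s"$finalOutputPath/$fileName")
}

class TextRelation extends HadoopFsRelation {   // no-arg constructor (point 2)
  def paths: Array[String] = Array("/tmp/text-relation")
  def partitionColumns: Seq[String] = Nil

  def prepareJobForWrite(job: Job): OutputWriterFactory = {
    // Touch only the Job for Hadoop-level setup (point 3); application state such as
    // the final output path goes into the serializable factory instead.
    new TextOutputWriterFactory(paths.head)
  }
}
{code}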

People

    Assignee: Cheng Lian
    Reporter: Reynold Xin
    Votes: 0
    Watchers: 3
