Details
- Type: Improvement
- Status: Resolved
- Priority: Major
- Resolution: Incomplete
- Affects Version/s: 1.6.3, 2.1.1, 2.2.0
- Fix Version/s: None
Description
Many people are using Kinesis Firehose to ingest data into an S3-based data lake. Kinesis Firehose produces a directory layout like this:
s3://data-lake-bucket/my-prefix/2017/08/11/10/my-stream-2017-08-11-11-10-10
s3://data-lake-bucket/my-prefix/2017/08/11/11/my-stream-2017-08-11-11-11-10
. . .
s3://data-lake-bucket/my-prefix/2017/08/12/00/my-stream-2017-08-12-00-01-01
Spark (like Hive) does not support this kind of partitioning. It would therefore be great if one could configure a CustomPartitionDiscoverer or PartitionSpecProvider to supply a custom partition mapping and then easily select a date range of files. Sadly, partition discovery is deeply integrated into DataSource.
Could this be encapsulated more cleanly, so that the default behaviour can be intercepted?
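As a workaround until something like this exists, one can enumerate the hourly Firehose prefixes for a date range by hand and pass the resulting list to the reader (e.g. `spark.read.json(paths)`). A minimal sketch, assuming the `yyyy/MM/dd/HH` layout shown above; `firehose_prefixes` is a hypothetical helper, not a Spark API:

```python
from datetime import datetime, timedelta

def firehose_prefixes(bucket, prefix, start, end):
    """Enumerate the hourly yyyy/MM/dd/HH prefixes that Kinesis Firehose
    writes under s3://bucket/prefix/ between start and end (inclusive)."""
    out = []
    t = start.replace(minute=0, second=0, microsecond=0)
    while t <= end:
        out.append(f"s3://{bucket}/{prefix}/{t:%Y/%m/%d/%H}/")
        t += timedelta(hours=1)
    return out

paths = firehose_prefixes(
    "data-lake-bucket", "my-prefix",
    datetime(2017, 8, 11, 10), datetime(2017, 8, 11, 12))
# paths now holds one prefix per hour and can be fed to spark.read.json(paths)
```

This sidesteps partition discovery entirely, but of course loses partition pruning and the derived partition columns that a real PartitionSpecProvider would give you.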
Another partition scheme that I've seen a lot in this context is:
s3://data-lake-bucket/prefix/2017-08-11/file.1.json
s3://data-lake-bucket/prefix/2017-08-11/file.2.json
. . .
s3://data-lake-bucket/prefix/2017-08-12/file.1.json