SPARK-21706

Support Custom PartitionSpec Provider for Kinesis Firehose or similar

      Description

      Many people use Kinesis Firehose to ingest data into an S3-based data lake. Kinesis Firehose produces a directory layout like this:

      s3://data-lake-bucket/my-prefix/2017/08/11/10/my-stream-2017-08-11-11-10-10
      s3://data-lake-bucket/my-prefix/2017/08/11/11/my-stream-2017-08-11-11-11-10
        .
        .
        .
      s3://data-lake-bucket/my-prefix/2017/08/12/00/my-stream-2017-08-12-00-01-01
      

      Spark (like Hive) does not support this kind of partitioning. It would therefore be great if one could configure a CustomPartitionDiscoverer or PartitionSpecProvider that supplies a custom partition mapping, so that a date range of files can be selected easily afterwards. Sadly, partition discovery is deeply integrated into DataSource.
      Could this be encapsulated more cleanly, so that the default behaviour can be intercepted?
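      As a workaround today, the hourly Firehose prefixes for a date range can be enumerated outside of Spark and passed to the reader. A minimal sketch, assuming the yyyy/MM/dd/HH layout shown above (the function name and signature are mine, not a Spark API):

```python
from datetime import datetime, timedelta

def firehose_prefixes(bucket, prefix, start, end):
    """Enumerate hourly Firehose-style prefixes (yyyy/MM/dd/HH) in [start, end]."""
    paths = []
    t = start.replace(minute=0, second=0, microsecond=0)
    while t <= end:
        paths.append(f"s3://{bucket}/{prefix}/{t:%Y/%m/%d/%H}/")
        t += timedelta(hours=1)
    return paths

# The resulting list can be passed directly to a DataFrameReader, e.g.
# spark.read.json(firehose_prefixes("data-lake-bucket", "my-prefix",
#                                   datetime(2017, 8, 11, 10),
#                                   datetime(2017, 8, 12, 0)))
```

      This avoids listing the whole bucket, but it does not give Spark real partition columns the way a pluggable partition spec provider would.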

      Another partition schema that I've seen a lot in this context is:

      s3://data-lake-bucket/prefix/2017-08-11/file.1.json
      s3://data-lake-bucket/prefix/2017-08-11/file.2.json
        .
        .
        .
      s3://data-lake-bucket/prefix/2017-08-12/file.1.json
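      For this daily layout, the mapping a hypothetical PartitionSpecProvider would perform can be sketched as a regex over the file path; all names here are illustrative, not part of Spark:

```python
import re
from datetime import date

# Matches the daily yyyy-MM-dd layout shown above.
DATE_RE = re.compile(r"/(\d{4})-(\d{2})-(\d{2})/")

def partition_date(path):
    """Derive a 'date' partition value from a path like .../2017-08-11/file.1.json."""
    m = DATE_RE.search(path)
    return date(int(m.group(1)), int(m.group(2)), int(m.group(3))) if m else None

def in_range(paths, start, end):
    """Keep only the paths whose derived date falls in [start, end]."""
    return [p for p in paths if (d := partition_date(p)) and start <= d <= end]
```

      With a pluggable provider, this extraction could instead surface `date` as a first-class partition column that predicate pushdown prunes automatically.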
      

            People

            • Assignee: Unassigned
            • Reporter: Sebastian Herold (sebastianherold)