  1. Beam
  2. BEAM-8470

Create a new Spark runner based on Spark Structured streaming framework


    Details

      Description

      Why is it worth creating a new runner based on Structured Streaming?

      Because this new framework brings:

      • Unified batch and streaming semantics:
        • No more RDD/DStream distinction, just as Beam has only PCollection
      • Better state management:
        • Incremental state instead of saving the whole state each time
        • No more synchronous saves delaying computation: a delta file per batch and partition is saved asynchronously, while put/get go synchronously to an in-memory hashmap
      • Schemas in datasets:
        • The dataset knows the structure of the data (fields) and can optimize later on
        • Comparable to schemas in PCollection in Beam
      • New Source API:
        • Very close to Beam bounded and unbounded sources
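
      The state management point above can be illustrated with a small toy sketch. It is not Spark's state store code: the class name DeltaStateStore, the file naming and the JSON format are all made up for illustration, and partitions and snapshots are ignored. It only shows the idea of synchronous put/get on an in-memory map combined with one delta file per batch written in the background.

```python
import json
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


class DeltaStateStore:
    """Toy store (hypothetical, not Spark's API): in-memory map plus async deltas."""

    def __init__(self, directory):
        self.directory = directory
        self.state = {}    # full state, always served from memory
        self.delta = {}    # only the keys changed in the current batch
        self.pending = []  # futures of delta files still being written
        self.writer = ThreadPoolExecutor(max_workers=1)

    def get(self, key, default=None):
        # Synchronous read from the in-memory map.
        return self.state.get(key, default)

    def put(self, key, value):
        # Synchronous write to memory; the change is also recorded in the delta.
        self.state[key] = value
        self.delta[key] = value

    def commit(self, batch_id):
        """Hand this batch's delta to the background writer and return at once."""
        delta, self.delta = self.delta, {}
        path = os.path.join(self.directory, f"{batch_id}.delta.json")
        self.pending.append(self.writer.submit(self._write, path, delta))
        return path

    @staticmethod
    def _write(path, delta):
        with open(path, "w") as f:
            json.dump(delta, f)

    def close(self):
        """Wait for every submitted delta file, then stop the writer."""
        for future in self.pending:
            future.result()
        self.pending.clear()
        self.writer.shutdown()


def restore(directory):
    """Rebuild the full state by replaying the delta files in batch order."""
    state = {}
    names = [n for n in os.listdir(directory) if n.endswith(".delta.json")]
    for name in sorted(names, key=lambda n: int(n.split(".")[0])):
        with open(os.path.join(directory, name)) as f:
            state.update(json.load(f))
    return state


def run_word_count(batches, directory=None):
    """Count words across batches, committing one delta file per batch."""
    directory = directory or tempfile.mkdtemp()
    store = DeltaStateStore(directory)
    for batch_id, words in enumerate(batches):
        for word in words:
            store.put(word, store.get(word, 0) + 1)
        store.commit(batch_id)
    store.close()
    return store.state, directory
```

      In this sketch each delta file holds only the keys touched by its batch, so the cost of a commit follows the size of the change rather than the size of the whole state, and the processing loop never waits on disk.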

      Why make a new runner from scratch?

      • The Structured Streaming framework is very different from the RDD/DStream framework

      We hope to gain:

      • A more up-to-date runner in terms of libraries, able to leverage new features
      • The benefit of practices learnt from the previous runners
      • Better performance thanks to the DAG optimizer (Catalyst) and to simpler code
      • Simpler code that is easier to maintain

       


    People

      • Assignee: Etienne Chauchot (echauchot)
      • Reporter: Etienne Chauchot (echauchot)

    Time Tracking

      • Original Estimate: Not Specified
      • Remaining Estimate: 0h
      • Time Spent: 15h 50m