Uploaded image for project: 'MRQL'
  1. MRQL
  2. MRQL-63

Add support for MRQL streaming in spark streaming mode

    XMLWordPrintableJSON

Details

    • New Feature
    • Status: Resolved
    • Major
    • Resolution: Implemented
    • 0.9.4
    • None
    • Streaming
    • None

    Description

      This patch introduces a major extension to MRQL, called MRQL streaming.
      We can now run continuous MRQL queries on streams of data.
      Currently, it works on Spark Streaming only but we may add support for Flink Streaming and/or Storm in the future.
      It has been tested in Spark local mode and in Spark distributed mode on a Yarn cluster.

      MRQL now supports window-based streaming based on a sliding window during a certain time interval. To support MRQL streaming, you need to add the parameter "-stream t" to the mrql command, where t is the time interval in milliseconds. Then MRQL will processes the new batch of data in the input streams every t milliseconds.
      A stream source in MRQL takes the form stream(...), which has the same parameters as the source(...) form. For example:

      select (k,avg(p.Y))
      from p in stream(binary,"tmp/points.bin")
      group by k: p.X;
      

      This query process all sequence files in the directory tmp/points.bin and then checks this directory every t milliseconds for new files. When a new file is inserted in the directory (or if the modification time of an existing file changes), it processes the new files. One may work on multiple files and the query may contain both stream and regular data sources. If there is at least one stream source, the query becomes continuous (never stops). One may dump the output stream to binary or CVS files using the existing MRQL syntax:

      store "tmp/out" from e
      

      This dumps the output of the continuous query e into tmp/out/f1, tmp/out/f2, ... etc.

      Example for testing:
      First create data:

      mrql.spark -local queries/points.mrql 100

      Then run the continuous query:

      mrql.spark -local -stream 1000 queries/streaming.mrql

      On a separate terminal, you can type:

      touch tmp/points.bin/part-00000

      to process a new batch of data.

      Attachments

        1. MRQL-63.patch
          41 kB
          Leonidas Fegaras

        Activity

          People

            fegaras Leonidas Fegaras
            fegaras Leonidas Fegaras
            Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: