Pig / PIG-2586

A better plan/data flow visualizer

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: impl
    • Labels:

      Description

      Pig supports a dot-graph-style plan to visualize the logical/physical/mapreduce plan (explain with the -dot option, see http://ofps.oreilly.com/titles/9781449302641/developing_and_testing.html). However, the dot graph takes an extra step to generate the plan graph, and the quality of the output is not good. It would be better to implement a new visualizer for Pig. It should:
      1. show operator type and alias
      2. turn on/off output schema
      3. dive into foreach inner plan on demand
      4. provide a way to show operator source code, e.g., as a tooltip on an operator (plans don't currently have this information, but you can assume it is in place)
      5. besides visualizing the logical/physical/mapreduce plans, visualize the script itself, which is also useful
      6. may rely on some java graphic library such as Swing
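
Requirements 1 to 3 above can be sketched as a small plan model that renders to DOT text. This is only an illustration in Python, not Pig code: `PlanNode`, `label`, and `to_dot` are hypothetical names, and the operator names in the usage lines are sample values.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class PlanNode:
    """One operator in a plan. All field names are hypothetical."""
    op_type: str                                             # operator type, e.g. "LOLoad"
    alias: str                                               # alias from the script
    schema: Optional[str] = None                             # output schema as display text
    inner: List["PlanNode"] = field(default_factory=list)    # nested foreach plan
    inputs: List["PlanNode"] = field(default_factory=list)   # upstream operators


def label(node, show_schema):
    """Requirement 1 (type and alias) plus requirement 2 (optional schema)."""
    text = f"{node.alias}: {node.op_type}"
    if show_schema and node.schema:
        text += f"\\n{node.schema}"          # "\n" is a line break inside a DOT label
    return text.replace('"', '\\"')          # keep the DOT string literal valid


def to_dot(sinks, show_schema=False, expanded=frozenset()):
    """Render the plan reachable from `sinks`; inner plans are drawn only
    for aliases listed in `expanded` (requirement 3)."""
    lines = ["digraph plan {"]
    seen = {}

    def visit(node):
        if id(node) in seen:
            return seen[id(node)]
        name = f"n{len(seen)}"
        seen[id(node)] = name
        lines.append(f'  {name} [label="{label(node, show_schema)}"];')
        if node.alias in expanded:
            for child in node.inner:
                lines.append(f"  {name} -> {visit(child)} [style=dashed];")
        for parent in node.inputs:
            lines.append(f"  {visit(parent)} -> {name};")
        return name

    for sink in sinks:
        visit(sink)
    lines.append("}")
    return "\n".join(lines)


# Sample plan: A = load ...; B = foreach A generate name;
load = PlanNode("LOLoad", "A", schema="(name: chararray, age: int)")
gen = PlanNode("LOGenerate", "B")
each = PlanNode("LOForEach", "B", schema="(name: chararray)", inner=[gen], inputs=[load])
print(to_dot([each], show_schema=True, expanded={"B"}))
```

Calling `to_dot([each])` with the defaults hides both the schemas and the inner plan, so the same model can serve the collapsed and the detailed view.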

      This is a candidate project for Google Summer of Code 2013. More information about the program can be found at https://cwiki.apache.org/confluence/display/PIG/GSoc2013

      The functionality implemented so far is available at
      https://reviews.apache.org/r/12077/

      1. graph.zip
        117 kB
        Allan Avendaño
      2. visualize.zip
        119 kB
        Allan Avendaño
      3. patch04
        35 kB
        Allan Avendaño

        Activity

        Russell Jurney added a comment -

        I would like to mentor this. The way to go here is with a web interface to this data.

        For instance, using these libraries:

        https://github.com/glejeune/Ruby-Graphviz
        http://www.sinatrarb.com/
        http://neyric.github.com/wireit/

        With these, there would be enough time for a GSoC participant to make serious progress on this.

        Daniel Dai added a comment -

        Great, thanks Russell!

        Dimitris Bousis added a comment -

        Hi all,

        My name is Dimitris Bousis, and I am currently doing my Master's in Computer Engineering & Informatics at the University of Patras, Greece. My research interests include cloud and distributed computing with related technologies such as Hadoop, HBase, Cassandra, Pig, and Hive. Though I have not started any research activity with these technologies (I plan to do so after the summer), I took an elective course on Hadoop, HDFS, HBase, and Cassandra during my undergraduate studies.

        I am interested in applying for this GSoC 2012 project. Flow visualization is really useful when it comes to debugging and breaking down any form of structured query. From the mentor's comment above, I assume that there should be a web interface that parses the DOT format in order to present the plans produced by explain. Furthermore, I'd like to suggest D3.js, a JS library that lets you bind arbitrary data to a Document Object Model (DOM) and then apply data-driven transformations to the document. This library uses HTML5, CSS3, and SVG to represent data within a page.

        Please comment on this post with anything you consider necessary. Looking forward to working with you this summer.

        Dimitris Bousis

        Dmitriy V. Ryaboy added a comment -

        I know at least one of the students who applied to GSOC for this project was accepted, with Russ and Daniel co-mentoring. Could the proposal be posted in this ticket?

        Bill Graham added a comment -

        As part of Twitter's Hackweek we developed a first pass at a visualization tool for Pig that focused on visualizing the run-time execution of jobs in a pig script. This helps our developers when running scripts with very large DAGs. We're in the process of open sourcing it, but I'll describe it here to see if parts of it might be leveraged, built upon or learned from.

        • Design
          When executing a pig script from the command line, we insert a PigProgressNotificationListener per PIG-2525. The PPNL launches an embedded Jetty server that exposes a JSON API of dag/script/job/progress info. Also embedded is the HTML/js/css content for a single page that renders the DAG, polls for updates, and shows progress.
        • Viz
          We use d3.js to render a chord diagram of the script (see http://mbostock.github.com/d3/ex/chord.html), where each arc in the circle is a job and each chord is a dependency. This requires PIG-2660. We also render a table view of all jobs where we show alias and feature initially, but then add jobName, #reducers, #mappers and progress percentages once we have them. Other related patches required are PIG-2663 and PIG-2664.
        • Future work
          • Better visualization. The chord diagram is OK, but we'd like to find a good JS library for DAG rendering (à la GraphViz) and include that as an option too.
          • Non-embedded mode. The Jetty server should be deployable as a standalone app server. Clients can push their state to it and the server has a persistent data store. Embedded mode is still useful during development.
          • Better script bindings. Being able to reference a pop-up of the script with highlighting of certain parts (see PIG-2659) would be useful.
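
The chord view described above needs the job DAG as a square matrix, since d3's chord layout takes an n×n matrix as input. The following is a minimal sketch of that conversion in Python, not the tool's actual code: `chord_matrix`, the job ids, and the `symmetric` option are all assumptions made for illustration.

```python
import json


def chord_matrix(jobs, deps, symmetric=True):
    """Build the square matrix for a chord diagram.

    jobs: list of job ids, one arc per job.
    deps: iterable of (upstream, downstream) pairs, one chord per pair.
    symmetric: assumption of this sketch. Each end of a chord is sized by
    its own matrix cell, so filling both [i][j] and [j][i] gives both ends
    a visible width.
    """
    index = {job: i for i, job in enumerate(jobs)}
    n = len(jobs)
    matrix = [[0] * n for _ in range(n)]
    for upstream, downstream in deps:
        i, j = index[upstream], index[downstream]
        matrix[i][j] = 1
        if symmetric:
            matrix[j][i] = 1
    return matrix


# Hypothetical script with two map-reduce jobs feeding a third.
jobs = ["job_1", "job_2", "job_3"]
deps = [("job_1", "job_3"), ("job_2", "job_3")]
payload = json.dumps({"jobs": jobs, "matrix": chord_matrix(jobs, deps)})
print(payload)
```

A page polling a JSON endpoint could pass `matrix` straight to the chord layout and use `jobs` for the arc labels. The payload shape shown here is hypothetical.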
        Daniel Dai added a comment -

        @Dmitriy
        Sorry, I just noticed this. We got one student accepted. The proposal is http://www.google-melange.com/gsoc/proposal/review/google/gsoc2012/manuranga/5002.

        @Bill
        Can you post some screenshots?

        Also we need to make use of the source location (PIG-2659) in the visualizer.

        Bill Graham added a comment -

        Yes, I should be able to post some sanitized snapshots later this week. We also have plans to integrate PIG-2659.

        Manuranga Perera added a comment -

        Hi, I am currently working on creating (client-side) SVGs for a given plan (in JSON format).
        The source code is available in the following GitHub repo: https://github.com/manuranga/svg-graph .
        Currently it is possible to generate simple plans with some nested plans.
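
As a rough illustration of turning a JSON plan into SVG, the sketch below places each operator in a row given by its longest path from a source and draws boxes and connecting lines. It is written in Python for brevity and is not taken from the svg-graph repository; the JSON shape (`nodes`, `edges`, `id`, `label`, `from`, `to`) and the function names are assumptions, and nested plans are not handled.

```python
import json
from xml.sax.saxutils import escape


def layers(nodes, edges):
    """Row of each node = longest path from any source (plan must be acyclic)."""
    parents = {n: [] for n in nodes}
    for src, dst in edges:
        parents[dst].append(src)
    depth = {}

    def depth_of(n):
        if n not in depth:
            depth[n] = 1 + max((depth_of(p) for p in parents[n]), default=-1)
        return depth[n]

    for n in nodes:
        depth_of(n)
    return depth


def plan_to_svg(plan_json, box_w=120, box_h=30, gap=40):
    """Render a non-empty plan as an SVG string, one box per operator."""
    plan = json.loads(plan_json)
    labels = {n["id"]: n["label"] for n in plan["nodes"]}
    edges = [(e["from"], e["to"]) for e in plan["edges"]]
    depth = layers(list(labels), edges)

    pos, per_row = {}, {}
    for node_id in labels:
        row = depth[node_id]
        col = per_row.get(row, 0)        # next free column in this row
        per_row[row] = col + 1
        pos[node_id] = (gap + col * (box_w + gap), gap + row * (box_h + gap))

    width = gap + max(per_row.values()) * (box_w + gap)
    height = gap + len(per_row) * (box_h + gap)
    parts = [f'<svg xmlns="http://www.w3.org/2000/svg" width="{width}" height="{height}">']
    for src, dst in edges:
        (x1, y1), (x2, y2) = pos[src], pos[dst]
        # Line from the bottom centre of the source to the top centre of the target.
        parts.append(f'<line x1="{x1 + box_w // 2}" y1="{y1 + box_h}" '
                     f'x2="{x2 + box_w // 2}" y2="{y2}" stroke="black"/>')
    for node_id, (x, y) in pos.items():
        parts.append(f'<rect x="{x}" y="{y}" width="{box_w}" height="{box_h}" '
                     f'fill="white" stroke="black"/>')
        parts.append(f'<text x="{x + 8}" y="{y + 20}">{escape(labels[node_id])}</text>')
    parts.append("</svg>")
    return "\n".join(parts)


sample = json.dumps({
    "nodes": [{"id": "A", "label": "A: load"},
              {"id": "B", "label": "B: load"},
              {"id": "C", "label": "C: join A, B"}],
    "edges": [{"from": "A", "to": "C"}, {"from": "B", "to": "C"}],
})
print(plan_to_svg(sample))
```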

        Aniket Mokashi added a comment -

        Do we have a patch for this?

        Daniel Dai added a comment -

        There is a partial patch in https://github.com/manuranga/svg-graph. It has not been linked to Pig yet.

        Aniket Mokashi added a comment -

        Is there any work/patch for explain -script 111.pig -graphics? This is a very useful feature.

        Daniel Dai added a comment -

        Unfortunately no.

        Dmitriy V. Ryaboy added a comment -

        Do we need this given Ambrose (and from what I hear, Ambari)?

        What is the difference between what this proposes and what Ambrose does?

        https://github.com/twitter/ambrose

        There is an Ambrose patch to add inner plans, too:
        https://github.com/twitter/ambrose/issues/62

        Daniel Dai added a comment -

        The goal here is to visualize the plan (logical/mapreduce plan) rather than jobs. Does Ambrose have that?

        Dmitriy V. Ryaboy added a comment -

        It does with the linked patch (it also visualizes the MR plan, without details of what's happening inside the map or reduce stage, without the patch).

        Daniel Dai added a comment -

        But no logical plan, right?

        Dmitriy V. Ryaboy added a comment -

        Hm I guess we can add logical plan if we want – just need to feed it to the PPNL somehow. Ambrose is pretty separate from Pig specifics, if you give it a dag, it'll draw it.

        Do people use the logical plan to diagnose issues? I don't think I have had to do that yet.

        Daniel Dai added a comment -

        Probably not for end users, but for us, when figuring out what's wrong with a script, this is useful.

        Allan Avendaño added a comment -

        Hi to everyone,

        I am Allan Avendaño. I contributed the Rank operator (PIG-2353) during GSoC 2012. As visualization is also part of my research interests, I am excited to work on this idea this year.
        I had a look at Ambrose, and it could be a starting point for sketches of the visualizer. We should also consider some other visualization options.

        Daniel Dai added a comment -

        It would be awesome if you could work on it this summer. I am definitely open to other visualization options. I would be happy to mentor.

        Allan Avendaño added a comment -

        Hi Daniel!

        I finished the first draft of my proposal, and I would like to know if it is possible to get some feedback from the community. I made it available here: https://docs.google.com/file/d/0B3SX2UYQ8_1sRGF1aklNeGRkWGs/edit?usp=sharing

        Thanks in advance for your comments.

        Daniel Dai added a comment -

        Thanks, I will take a look.

        Daniel Dai added a comment -

        Looks good. In the schedule, can we get some sample pages first before actual implementation?

        Allan Avendaño added a comment -

        Sure, my idea is to do it in the first week (June 17 – June 23), and again whenever it is needed. I will add a comment about this.

        Allan Avendaño added a comment -

        Patch to add "visualize" command.
        visualize -out <out_folder> [-script <folder_to_script>] <alias>

        Allan Avendaño added a comment -

        This file must be unzipped and placed inside the pig folder.
        It serves as the templates folder for the "visualize" command.

        Allan Avendaño added a comment -

        Example output folder for a script.

        Allan Avendaño added a comment -

        Patch for visualization

        Allan Avendaño added a comment -

        Add it at the root level inside the pig folder.

        Allan Avendaño added a comment -

        Output file example

        Allan Avendaño added a comment -

        In order to use the "visualize" command, first apply the patch and put the folder at the root level inside pig.
        The command for visualization is the following:

        visualize [-script <script_file>] -out <output_folder> <alias>


          People

          • Assignee:
            Allan Avendaño
            Reporter:
            Daniel Dai
          • Votes:
            1
            Watchers:
            12
