Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.12.0
    • Component/s: None
    • Labels:
      None

      Description

      Write an adapter that uses Pig to do processing. (This is different than Piglet, which is a Pig-Latin-like front-end to Calcite.)

          Activity

          elilevine Eli Levine added a comment -

          I have cleaned up my work on a Pig adapter for Calcite. There is more to do, of course, but it is in good enough shape to get feedback. I created a new pull request here: https://github.com/apache/calcite/pull/365

          My work to date has concentrated on Pig-specific RelNodes and associated factories, the ability to have RelBuilder construct relational expressions, and generating Pig Latin scripts from those. Everything works, is fairly well tested, and conforms to Calcite's checkstyle rules. I plan to expand the breadth of functionality supported by this adapter over time. Supporting more data types, more aggregate functions and more complicated filters is certainly high on my priority list.

          Julian Hyde, would be great to get your and Daniel Dai's feedback on the PR. Thanks!
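
To make the flow described in this comment concrete (a tree of Pig-specific relational nodes that is turned into a Pig Latin script), here is a minimal plain-Java sketch. It does not use Calcite's classes: `PigScriptSketch`, `PigRel`, `Scan`, `Filter` and `toScript` are hypothetical names invented for illustration, and the file name `data.txt` and column names are assumptions. Only the emitted `LOAD` and `FILTER` statements are standard Pig Latin.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model only: these classes are hypothetical stand-ins, not Calcite's. */
class PigScriptSketch {

  /** A relational node that can emit Pig Latin for itself and its inputs. */
  interface PigRel {
    /** Appends this node's statements; returns the alias that holds its rows. */
    String implement(List<String> statements);
  }

  /** Leaf node: LOAD a file into an alias named after the table. */
  static final class Scan implements PigRel {
    private final String alias;
    private final String file;
    private final String[] columns;

    Scan(String alias, String file, String... columns) {
      this.alias = alias;
      this.file = file;
      this.columns = columns;
    }

    @Override public String implement(List<String> statements) {
      List<String> fields = new ArrayList<>();
      for (String column : columns) {
        fields.add(column + ":chararray"); // every column treated as a string
      }
      statements.add(alias + " = LOAD '" + file + "' USING PigStorage() AS ("
          + String.join(", ", fields) + ");");
      return alias;
    }
  }

  /** Filter node: emits its input first, then a FILTER over the same alias. */
  static final class Filter implements PigRel {
    private final PigRel input;
    private final String condition;

    Filter(PigRel input, String condition) {
      this.input = input;
      this.condition = condition;
    }

    @Override public String implement(List<String> statements) {
      String alias = input.implement(statements);
      statements.add(alias + " = FILTER " + alias + " BY (" + condition + ");");
      return alias;
    }
  }

  /** Walks the tree bottom-up and joins the statements into one script. */
  static String toScript(PigRel root) {
    List<String> statements = new ArrayList<>();
    root.implement(statements);
    return String.join("\n", statements);
  }

  public static void main(String[] args) {
    PigRel plan = new Filter(new Scan("t", "data.txt", "tc0", "tc1"), "tc0 == 'abc'");
    System.out.println(toScript(plan));
  }
}
```

Each node returns the alias it wrote to, so a parent can refer to its input without knowing how that input was produced.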

          elilevine Eli Levine added a comment -

          Julian Hyde, mind adding me as a contributor in Jira so I can assign this work to myself?

          julianhyde Julian Hyde added a comment -

          Eli Levine, Looks like a great start.

          It's not really an adapter yet, because there isn't a SchemaFactory. Do you think it will ever make sense to have a schema factory? (Pig is like Spark in that it doesn't have its own catalog, and therefore doesn't "own" any data. Calcite's "adapter" concept has always been an imperfect fit for these kinds of systems.)

          Before check in, let's come up with a "hello world" application. If we decide to make this an adapter, then that will simply be a model.json file and an example JDBC connect string.

          Am I correct that PigTest is executing queries using Pig? (Presumably in local mode.) If so that's awesome.

          I notice that PigTableScanRule's constructor takes a Schema. Let's see if we can fix that, and make the rule a singleton. It will make a lot of things simpler.

          I see that you're initializing RelBuilder with Pig-specific factories. No one's done that before; I'm pleased that it worked. Usually people will translate to logical RelNodes (e.g. LogicalFilter) and rules such as PigFilterRule will kick in and convert them to the corresponding Pig RelNode. One downside with your approach is that you won't be able to do hybrid queries (e.g. joining a JDBC table to a Pig table). But we should keep your approach for now.

          A couple of things will need to be done before commit, but not now:

          • Fill out class comments
          • Add a section to adapter.md
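
The two ideas in this comment, converting logical nodes to Pig nodes via rules and keeping each rule a stateless singleton, can be illustrated with a toy in plain Java. This is not Calcite's planner or rule API: `ConverterRuleSketch`, `Convention`, `Rel`, `ToPigRule` and `apply` are hypothetical names, and the single bottom-up pass stands in for what a real planner does when it fires rules.

```java
/** Toy model only: hypothetical classes, not Calcite's planner or rules. */
class ConverterRuleSketch {

  /** Tags a node with the engine expected to execute it. */
  enum Convention { LOGICAL, PIG }

  /** Immutable node with at most one input, enough for Scan and Filter. */
  static final class Rel {
    final Convention convention;
    final String kind;
    final Rel input; // null for leaf nodes

    Rel(Convention convention, String kind, Rel input) {
      this.convention = convention;
      this.kind = kind;
      this.input = input;
    }

    /** Renders e.g. "PigFilter(PigTableScan)". */
    String describe() {
      String prefix = convention == Convention.PIG ? "Pig" : "Logical";
      return input == null ? prefix + kind : prefix + kind + "(" + input.describe() + ")";
    }
  }

  /**
   * Stateless rule. Nothing per-query (such as a schema) is passed to the
   * constructor, so a single shared instance is sufficient.
   */
  static final class ToPigRule {
    static final ToPigRule INSTANCE = new ToPigRule();

    private ToPigRule() {}

    boolean matches(Rel rel) {
      return rel.convention == Convention.LOGICAL;
    }

    Rel convert(Rel rel, Rel convertedInput) {
      return new Rel(Convention.PIG, rel.kind, convertedInput);
    }
  }

  /** Converts inputs first, then the node itself if the rule matches. */
  static Rel apply(Rel rel) {
    Rel input = rel.input == null ? null : apply(rel.input);
    ToPigRule rule = ToPigRule.INSTANCE;
    return rule.matches(rel)
        ? rule.convert(rel, input)
        : new Rel(rel.convention, rel.kind, input);
  }

  public static void main(String[] args) {
    Rel logical = new Rel(Convention.LOGICAL, "Filter",
        new Rel(Convention.LOGICAL, "TableScan", null));
    System.out.println(logical.describe() + " -> " + apply(logical).describe());
  }
}
```

Because the rule only inspects the node it is handed, nodes that already belong to another convention pass through untouched, which is the property that hybrid plans rely on.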
          julianhyde Julian Hyde added a comment -

          Eli Levine, I've added you as a contributor and assigned this case to you.

          elilevine Eli Levine added a comment -

          Thanks for the comments, Julian. I think adding a SchemaFactory is definitely an option. What will make it difficult for the Pig adapter to support the full SQL to results path that other adapters (e.g. Cassandra) perform is the fact that Pig is a batch system that produces results asynchronously.

          Certainly for my use-case at Salesforce I am planning on taking Pig scripts produced by this Pig-Calcite code and executing them outside of Calcite. The results of running these scripts will never make it back to the client. I wonder if this is a common pattern for batch compute engines in general, such as Pig or Spark. Maybe all Calcite needs to do for these types of engines is to produce physical execution plans (e.g. Pig scripts) that can be then executed separately from Calcite... I guess I'm thinking out loud here. Would be great to hear your thoughts about Calcite working with batch compute systems in general.

          julianhyde Julian Hyde added a comment -

          I take your point that a batch tool is not doing SELECT. It is more likely doing

          CREATE TEMPORARY TABLE ... AS SELECT ...;
          CREATE TEMPORARY TABLE ... AS SELECT ...;
          CREATE TABLE AS SELECT ...;
          or 
          INSERT INTO ...;
          

          One of the earliest Calcite feature requests, CALCITE-28, comes from the days when Calcite (then called Optiq) was providing a SQL interface to Cascading. As you can see, it is trying to combine the benefits of an interactive REPL and writing the results straight to disk.

          So, a Pig job expressed as SQL is likely to be a few CREATE TABLE AS SELECT statements. Fewer than the corresponding Pig Latin statements, but a similar pattern. So, maybe Calcite should support CREATE TABLE AS SELECT.

          I don't quite see that batch necessarily has to be asynchronous. It still makes sense to execute a Pig script in the foreground. Even if it takes a long time and doesn't print anything to the screen (except maybe row counts) at least you know when it has finished.

          Regarding using Calcite to generate a script that will be run in another environment: yes, that makes sense. Calcite is basically functioning as a compiler. We should support that. Maybe you can get what you need from EXPLAIN (and by the way, you get a lot of that stuff printed to STDOUT if you run with -Dcalcite.debug).

          Still, we should support Calcite SQL on a Pig Adapter in an interactive (and synchronous) environment, if we can find a way to make Pig synchronous, and can hook up the DUMP command. A REPL is a beautiful thing.
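
One way to picture the distinction drawn in this comment: the same generated script can end with a `DUMP` for an interactive query or a `STORE` for a CREATE TABLE AS SELECT style statement. The plain-Java sketch below only builds strings; `BatchSinkSketch` and its methods are hypothetical, the output path is an assumption, and nothing here reflects how Calcite actually handles CREATE TABLE AS SELECT. `DUMP` and `STORE ... INTO ... USING PigStorage()` are standard Pig Latin.

```java
/** Toy model only: hypothetical helper, not part of Calcite or Pig. */
class BatchSinkSketch {

  /** Interactive case: print the rows held by the alias. */
  static String dump(String alias) {
    return "DUMP " + alias + ";";
  }

  /** Batch case: write the rows held by the alias to a directory. */
  static String store(String alias, String path) {
    return "STORE " + alias + " INTO '" + path + "' USING PigStorage();";
  }

  /**
   * Appends the terminal statement to a script. A null targetPath means an
   * interactive SELECT; a non-null one models CREATE TABLE ... AS SELECT.
   */
  static String finish(String script, String alias, String targetPath) {
    String last = targetPath == null ? dump(alias) : store(alias, targetPath);
    return script + "\n" + last;
  }

  public static void main(String[] args) {
    String body = "t = LOAD 'data.txt' USING PigStorage() AS (tc0:chararray);";
    System.out.println(finish(body, "t", null));
    System.out.println();
    System.out.println(finish(body, "t", "out/t_copy"));
  }
}
```

The body of the script is identical in both cases; only the last statement decides whether results come back to the caller or go straight to disk.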

          elilevine Eli Levine added a comment - - edited

          +1 on the value of REPL. It is useful for debugging and testing, and also potentially for production use. I have added a PigSchema, PigToEnumerableConverter and associated classes in order to support Calcite SQL by the Pig adapter. Test cases in PigAdapterTest show that the flow from SQL to Pig-specific relational expression trees to Pig scripts is now supported.

          The big piece that is currently missing is actually executing Pig scripts and returning results via Calcite. I think this can be done, at least using Pig's local execution mode. This is basically what PigTest does, which I use in PigRelBuilderStyleTest. The place to hook that functionality in is probably in PigToEnumerableConverter.implement().

          Other limitations of the existing Pig adapter impl:

          • Only supports VARCHAR columns. Need more data type support.
          • Filters are not pushed down below joins. To be investigated.
          • Need expansion of supported functionality and test coverage for more permutations of filters, aggregations and joins.

          Julian Hyde, I would appreciate you taking another look at the PR and sharing your thoughts on the current progress. If you think there is value in this code being part of Calcite, what pieces are still missing before it can be committed? Does it make sense to commit without the code actually executing Pig, and to work on that incrementally?
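
The missing piece described above, running the generated script and handing rows back, might be isolated behind a small interface so that the converter does not depend on how Pig is launched. The sketch below is plain Java with a canned stand-in for Pig: `ExecutionHookSketch`, `PigRunner`, `CannedRunner` and `execute` are hypothetical names, and this is not the code in PigToEnumerableConverter. A real runner could presumably delegate to Pig's local mode (for instance via PigServer's registerQuery and openIterator), but that wiring is an assumption and is not shown.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

/** Toy model only: hypothetical hook, not the actual converter code. */
class ExecutionHookSketch {

  /** Runs a script and iterates over the rows of one alias. */
  interface PigRunner {
    Iterator<String[]> run(String script, String alias);
  }

  /** Stand-in that ignores the script and returns fixed rows. */
  static final class CannedRunner implements PigRunner {
    private final List<String[]> rows;

    CannedRunner(List<String[]> rows) {
      this.rows = rows;
    }

    @Override public Iterator<String[]> run(String script, String alias) {
      return rows.iterator();
    }
  }

  /** What an execution step could do: run the script, collect the rows. */
  static List<String[]> execute(PigRunner runner, String script, String alias) {
    List<String[]> result = new ArrayList<>();
    Iterator<String[]> it = runner.run(script, alias);
    while (it.hasNext()) {
      result.add(it.next());
    }
    return result;
  }

  public static void main(String[] args) {
    List<String[]> canned =
        Arrays.asList(new String[] {"a", "1"}, new String[] {"b", "2"});
    String script = "t = LOAD 'data.txt' USING PigStorage() AS (tc0:chararray, tc1:chararray);";
    for (String[] row : execute(new CannedRunner(canned), script, "t")) {
      System.out.println(Arrays.toString(row));
    }
  }
}
```

Keeping the runner behind an interface also lets tests exercise the SQL-to-script path without starting Pig at all.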

          julianhyde Julian Hyde added a comment -

          I have reviewed, and am almost ready to check in. I made a few fix-ups (mainly for code style to appease javadoc) and added a documentation page, pig.md. Please review (and improve the documentation if you like). See https://github.com/julianhyde/calcite/tree/1598-pig.

          elilevine Eli Levine added a comment -

          Julian Hyde, the documentation page looks great. Good call on adding that. Thanks!

          julianhyde Julian Hyde added a comment -

          I get one failure under JDK 1.7. I don't know whether it is intermittent. Can you take a look, please?

          2017-02-28 02:15:07,687 [pool-1-thread-7] INFO  - Total input paths to process : 1
          Tests run: 8, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 4.668 sec <<< FAILURE! - in org.apache.calcite.test.PigRelBuilderStyleTest
          testImplWithGroupByCountDistinct(org.apache.calcite.test.PigRelBuilderStyleTest)  Time elapsed: 0.085 sec  <<< ERROR!
          java.lang.RuntimeException: org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1066: Unable to open iterator for alias t
                  at org.apache.calcite.test.PigRelBuilderStyleTest.assertScriptAndResults(PigRelBuilderStyleTest.java:252)
                  at org.apache.calcite.test.PigRelBuilderStyleTest.testImplWithGroupByCountDistinct(PigRelBuilderStyleTest.java:148)
          Caused by: org.apache.pig.impl.logicalLayer.FrontendException: Unable to open iterator for alias t
                  at org.apache.calcite.test.PigRelBuilderStyleTest.assertScriptAndResults(PigRelBuilderStyleTest.java:250)
                  at org.apache.calcite.test.PigRelBuilderStyleTest.testImplWithGroupByCountDistinct(PigRelBuilderStyleTest.java:148)
          Caused by: java.lang.NullPointerException
                  at org.apache.calcite.test.PigRelBuilderStyleTest.assertScriptAndResults(PigRelBuilderStyleTest.java:250)
                  at org.apache.calcite.test.PigRelBuilderStyleTest.testImplWithGroupByCountDistinct(PigRelBuilderStyleTest.java:148)
          
          
          Results :
          
          Tests in error:
            PigRelBuilderStyleTest.testImplWithGroupByCountDistinct:148->assertScriptAndResults:252 Runtime
          
          Tests run: 15, Failures: 0, Errors: 1, Skipped: 0
          
          elilevine Eli Levine added a comment -

          Julian Hyde, just looked. Seems to be intermittent and related to PigTest. Mind trying with the attached patch?

          julianhyde Julian Hyde added a comment - - edited

          Fixed in http://git-wip-us.apache.org/repos/asf/calcite/commit/fdbb81cf. Thanks for the PR, Eli Levine, and carry on the good work! Maybe log a follow-up JIRA with what you plan to do next?

          Documentation here: http://calcite.apache.org/docs/pig_adapter.html

          elilevine Eli Levine added a comment -

          Awesome! Thanks for committing, Julian Hyde. Happy to see there is value in this work.

          For now, I have created https://issues.apache.org/jira/browse/CALCITE-1669 to make this a true Calcite adapter that is able to go from SQL to results (as you suggested earlier). Later I will be adding JIRAs for incremental additions to supported functionality, such as other agg functions, more complex queries, and more data types.

          julianhyde Julian Hyde added a comment -

          Resolved in release 1.12.0 (2017-03-24).


            People

            • Assignee:
              elilevine Eli Levine
              Reporter:
              julianhyde Julian Hyde