Derby
  1. Derby
  2. DERBY-6036

If you wrap a SELECT * view around a table, all of the columns are read from the base row even when you SELECT only a subset of the view columns.

    Details

    • Type: Bug Bug
    • Status: Open
    • Priority: Major Major
    • Resolution: Unresolved
    • Affects Version/s: 10.10.1.1
    • Fix Version/s: None
    • Component/s: SQL
    • Urgency:
      Normal
    • Issue & fix info:
      Repro attached, Workaround attached
    • Bug behavior facts:
      Performance, Seen in production

      Description

      This also affects SELECTs from views wrapping RestrictedVTIs. Restrictions are pushed into a restricted VTI if you wrap it in a view. However, projections are not. I will attach a script showing this problem.

      1. derby-6036.sql
        1 kB
        Rick Hillegas
      2. derbyAST.xml
        267 kB
        Rick Hillegas
      3. derby-6036-01-aa-testForRestrictionPushing.diff
        3 kB
        Rick Hillegas

        Issue Links

          Activity

          Hide
          Richard Huddleston added a comment -

          unfortunately we cannot use the foreign views / ForeignTableVTI as we are writing our own adapter to greenplum.

          we use derby as a front end to greenplum. we use derby to parse the sql and then we submit the query to the greenplum master and instead of getting a result in the jdbc connection, we do an ETL write to an external servlet which is running in derby. ETLs in greenplum are VERY fast and prevent the bandwidth crunch on the master greenplum node. In the ETL, EACH of our greenplum slaves reaches out individually to the derby instance to post back result sets, and does NOT use any resources on the master. Data gets pushed back to our derby instance at greater than 1 gbit/sec, and our resultset merges the multiple http posts back into a single result. People love our derby frontend to greenplum, and now are thinking about other systems we should expose through derby. but our system is not as fast as it could be.

          Show
          Richard Huddleston added a comment - unfortunately we cannot use the foreign views / ForeignTableVTI as we are writing our own adapter to greenplum. we use derby as a front end to greenplum. we use derby to parse the sql and then we submit the query to the greenplum master and instead of getting a result in the jdbc connection, we do an ETL write to an external servlet which is running in derby. ETLs in greenplum are VERY fast and prevent the bandwidth crunch on the master greenplum node. In the ETL, EACH of our greenplum slaves reaches out individually to the derby instance to post back result sets, and does NOT use any resources on the master. Data gets pushed back to our derby instance at greater than 1 gbit/sec, and our resultset merges the multiple http posts back into a single result. People love our derby frontend to greenplum, and now are thinking about other systems we should expose through derby. but our system is not as fast as it could be.
          Hide
          Rick Hillegas added a comment -

          Hi Richard,

          These are all interesting suggestions for improving the api for table functions. I think this discussion deserves its own, separate JIRA.

          I understand your comments about the tedium of declaring table functions. Much of this tedium may be addressed by the optional foreignViews tool introduced in release 10.10.1: http://db.apache.org/derby/docs/10.10/tools/rtoolsoptforeignviews.html

          If you're interested in contributing to Derby, a good place to start is here: http://db.apache.org/derby/derby_comm.html. Once you subscribe to the derby-dev mailing list, the community can guide you through the process.

          Welcome!
          -Rick

          Show
          Rick Hillegas added a comment - Hi Richard, These are all interesting suggestions for improving the api for table functions. I think this discussion deserves its own, separate JIRA. I understand your comments about the tedium of declaring table functions. Much of this tedium may be addressed by the optional foreignViews tool introduced in release 10.10.1: http://db.apache.org/derby/docs/10.10/tools/rtoolsoptforeignviews.html If you're interested in contributing to Derby, a good place to start is here: http://db.apache.org/derby/derby_comm.html . Once you subscribe to the derby-dev mailing list, the community can guide you through the process. Welcome! -Rick
          Hide
          Richard Huddleston added a comment - - edited

          We are trying to work around this issue by calling the function directly instead of the view, but have noticed that it would be much simpler if Derby VTI exposed the function name to the VTI implementation ResultSet . i don't believe there currently is a way to access the function name in the VTI ResultSet class.

          We are accessing n different remote data tables, each with its own table definition. Thus in Derby, we must register n different functions. Note, we only need 1 implementation class of the function to actually get the data and return our custom ResultSet implementation. We'd prefer to NOT write a separate class and redeploy our application each time someone adds a new remote data table, as we have many actually derby instances running this codebase.

          pseudo code / sql

          we have to create register n function definitions in , for i in 1 .. n, where tableName_i is our ith table name, (tableName_i is like customers)
          register derby function selectFromRemote_tableName_i('tableName_i') java method is selectFromRemote.read ,

          then to call any specific function we have to write (pseudo sql)

          select s.* from table ( selectFromRemote_customers('customers') ) s where custId = 2

          it would be MUCH cleaner if we could just

          register a derby function with different names and no parameters
          so instead of selectFromRemote_customers, we would just register function "customers"
          and derby function customers is java method selectFromRemote.read ,

          and then be able to call
          select c.* from table ( customers() ) c where custId = 2

          then our ResultSet implementation could access some variable about the function name, and we would use that in lieu of a function parameters (it would also save us having to declare function params when we register functions, which is tedious )

          i think in the future you may want to add more information about VTI function context,
          would it be logical to add more stuff to VTIEnvironment and make VTIEnvironment available / applicable to all VTI resultsets, not just those that implement VTICosting ? side question, if we did implement VTICosting , is it guaranteed that its methods will be called ?

          perhaps there should be a new VTIContextAware interface that has something like
          init(VTIContext context)
          that VTIContext for now could have just getFunctionName, but it would be also nice to have things like getFunctionComment so that instead of params to a function, we could also dynamically change a function's behaviour by altering JUST the comments of a regstered function, and NOT having to change all the queries which call the function and pass arguments.

          because that VTIContext is an interface you could continue to add methods to in the future you expose more information to the function., note it would have been nice if in RestrictedVTI, the initScan method took in an interface object like InitScanParams which had currently getColumnNames and getRestriction, that would probably have allowed this bug to be solved rather easily by just adding "getFunctionName" on that InitScanParams interface.

          Perhaps I should stop just being a writing critic and start writing something myself how easy would it be for me to start hacking up and contributing to the derby project myself ?

          Show
          Richard Huddleston added a comment - - edited We are trying to work around this issue by calling the function directly instead of the view, but have noticed that it would be much simpler if Derby VTI exposed the function name to the VTI implementation ResultSet . i don't believe there currently is a way to access the function name in the VTI ResultSet class. We are accessing n different remote data tables, each with its own table definition. Thus in Derby, we must register n different functions. Note, we only need 1 implementation class of the function to actually get the data and return our custom ResultSet implementation. We'd prefer to NOT write a separate class and redeploy our application each time someone adds a new remote data table, as we have many actually derby instances running this codebase. pseudo code / sql we have to create register n function definitions in , for i in 1 .. n, where tableName_i is our ith table name, (tableName_i is like customers) register derby function selectFromRemote_tableName_i('tableName_i') java method is selectFromRemote.read , then to call any specific function we have to write (pseudo sql) select s.* from table ( selectFromRemote_customers('customers') ) s where custId = 2 it would be MUCH cleaner if we could just register a derby function with different names and no parameters so instead of selectFromRemote_customers, we would just register function "customers" and derby function customers is java method selectFromRemote.read , and then be able to call select c.* from table ( customers() ) c where custId = 2 then our ResultSet implementation could access some variable about the function name, and we would use that in lieu of a function parameters (it would also save us having to declare function params when we register functions, which is tedious ) i think in the future you may want to add more information about VTI function context, would it be logical to add more stuff to VTIEnvironment and make VTIEnvironment available / applicable to all VTI resultsets, not just those that implement VTICosting ? side question, if we did implement VTICosting , is it guaranteed that its methods will be called ? perhaps there should be a new VTIContextAware interface that has something like init(VTIContext context) that VTIContext for now could have just getFunctionName, but it would be also nice to have things like getFunctionComment so that instead of params to a function, we could also dynamically change a function's behaviour by altering JUST the comments of a regstered function, and NOT having to change all the queries which call the function and pass arguments. because that VTIContext is an interface you could continue to add methods to in the future you expose more information to the function., note it would have been nice if in RestrictedVTI, the initScan method took in an interface object like InitScanParams which had currently getColumnNames and getRestriction, that would probably have allowed this bug to be solved rather easily by just adding "getFunctionName" on that InitScanParams interface. Perhaps I should stop just being a writing critic and start writing something myself how easy would it be for me to start hacking up and contributing to the derby project myself ?
          Hide
          Rick Hillegas added a comment -

          Hi Richard,

          Thanks for your detailed description of how this problem affects your production app. I have marked the issue as "Seen in production." That will boost its urgency slightly. Voting for the issue will boost its urgency too.

          One workaround would be to define a family of table functions and views for each table, one table function/view for each SELECT list. Then use those views in your queries. Unfortunately, that would involve changing the queries in your app.

          Thanks,
          -Rick

          Show
          Rick Hillegas added a comment - Hi Richard, Thanks for your detailed description of how this problem affects your production app. I have marked the issue as "Seen in production." That will boost its urgency slightly. Voting for the issue will boost its urgency too. One workaround would be to define a family of table functions and views for each table, one table function/view for each SELECT list. Then use those views in your queries. Unfortunately, that would involve changing the queries in your app. Thanks, -Rick
          Hide
          Richard Huddleston added a comment -

          i'm working on a project that uses 1 derby function and n views of that 1 function to access data in a foreign database. for each foreign database table, we have a corresponding view in derby with the EXACT name of the table in the foreign database.

          we were able to port our code over to derby by simply changing the jdbc connection params to derby instead of our foreign database, and all the code magically worked. if we had to change to functions instead of views, we'd lose the flexibility of our code running against the foreign database or derby. i have 30+ developers who have existing sql in tons of different places and development languages that reference the table names / matching derby view name, it's not practical for us to change all the table.

          why performance matters. i created a derby facade to our foreign database out there that supports bulk extractions . that database shards data into 16 or more different slave machines. i use derby to issue a query to the "master" machine instructing all the slaves to send data back to derby. those slave machines in aggregate pump data to our derby facade machine at greater than 1 gigabit/sec, where the data flow lasts more than 30 seconds and we are network bandwidth constrained. so, it's LOTS of data. we have more than 8 fields, and this is a common thing in our facade:
          select some_id from my_derby_function_view where theDate = '12/11/2013'
          select count(some_id) from my_derby_function_view where theDate = '12/11/2013'
          because derby doesn't pass down the field names from the view into our function via initscan, our data flow is more than 8x larger than needed. if this bug were fixed, instead of say 32 seconds, our query would in 4 seconds if we knew which columns derby really needed.

          is there any way to motivate a fix for this bug ?

          Show
          Richard Huddleston added a comment - i'm working on a project that uses 1 derby function and n views of that 1 function to access data in a foreign database. for each foreign database table, we have a corresponding view in derby with the EXACT name of the table in the foreign database. we were able to port our code over to derby by simply changing the jdbc connection params to derby instead of our foreign database, and all the code magically worked. if we had to change to functions instead of views, we'd lose the flexibility of our code running against the foreign database or derby. i have 30+ developers who have existing sql in tons of different places and development languages that reference the table names / matching derby view name, it's not practical for us to change all the table. why performance matters. i created a derby facade to our foreign database out there that supports bulk extractions . that database shards data into 16 or more different slave machines. i use derby to issue a query to the "master" machine instructing all the slaves to send data back to derby. those slave machines in aggregate pump data to our derby facade machine at greater than 1 gigabit/sec, where the data flow lasts more than 30 seconds and we are network bandwidth constrained. so, it's LOTS of data. we have more than 8 fields, and this is a common thing in our facade: select some_id from my_derby_function_view where theDate = '12/11/2013' select count(some_id) from my_derby_function_view where theDate = '12/11/2013' because derby doesn't pass down the field names from the view into our function via initscan, our data flow is more than 8x larger than needed. if this bug were fixed, instead of say 32 seconds, our query would in 4 seconds if we knew which columns derby really needed. is there any way to motivate a fix for this bug ?
          Hide
          Dag H. Wanvik added a comment -

          The "Affects version" field could probably include all older versions here, but I didn't mark then now them since it's a fair amount of work to verify.

          Show
          Dag H. Wanvik added a comment - The "Affects version" field could probably include all older versions here, but I didn't mark then now them since it's a fair amount of work to verify.
          Hide
          Dag H. Wanvik added a comment -

          Triaged for 10.11.

          Show
          Dag H. Wanvik added a comment - Triaged for 10.11.
          Hide
          Rick Hillegas added a comment -

          Attaching derby-6036-01-aa-testForRestrictionPushing.diff. This adds a test case to verify that restrictions are passed through the view but not projections. Committed at subversion revision 1430957.

          When we address this issue, we should adjust RestrictedVTITest.test_12_6036() to verify that projections are passed through also.

          Touches the following file:

          M java/testing/org/apache/derbyTesting/functionTests/tests/lang/RestrictedVTITest.java

          Show
          Rick Hillegas added a comment - Attaching derby-6036-01-aa-testForRestrictionPushing.diff. This adds a test case to verify that restrictions are passed through the view but not projections. Committed at subversion revision 1430957. When we address this issue, we should adjust RestrictedVTITest.test_12_6036() to verify that projections are passed through also. Touches the following file: M java/testing/org/apache/derbyTesting/functionTests/tests/lang/RestrictedVTITest.java
          Hide
          Rick Hillegas added a comment -

          I don't think I'm ready to propose a solution to this issue. My feeling is that this requires more thought and experiment than I have time for right now. Before putting this issue aside, I want to record a couple notes:

          1) There is a workaround for this issue, viz., rewrite the query to eliminate the view.

          2) What queries are really affected by this problem? Affected queries are those in which the cost of materializing the unneeded columns is significant. For instance,

          a) SELECTs of base tables with large rows which span multiple pages.

          b) SELECTs where the needed columns live in a covering index and a base table probe is needed to fetch the unneeded columns.

          3) Here are some problem cases which a solution needs to address:

          a) SELECT * views. This is the original test case discussed by this issue:

          create view vt as select * from t;
          select s_nr from vt where ns_r = 3000;

          b) Views which explicitly SELECT columns not needed by the outer query:

          create view vt1 as select s_r, s_nr from t where ns_r = 3000;
          select s_nr from vt1;

          Show
          Rick Hillegas added a comment - I don't think I'm ready to propose a solution to this issue. My feeling is that this requires more thought and experiment than I have time for right now. Before putting this issue aside, I want to record a couple notes: 1) There is a workaround for this issue, viz., rewrite the query to eliminate the view. 2) What queries are really affected by this problem? Affected queries are those in which the cost of materializing the unneeded columns is significant. For instance, a) SELECTs of base tables with large rows which span multiple pages. b) SELECTs where the needed columns live in a covering index and a base table probe is needed to fetch the unneeded columns. 3) Here are some problem cases which a solution needs to address: a) SELECT * views. This is the original test case discussed by this issue: create view vt as select * from t; select s_nr from vt where ns_r = 3000; b) Views which explicitly SELECT columns not needed by the outer query: create view vt1 as select s_r, s_nr from t where ns_r = 3000; select s_nr from vt1;
          Hide
          Rick Hillegas added a comment -

          I have verified that there is no special jiggery-pokery to reduce the number of columns read from the base table when it is wrapped in a view. For the SELECT from the view (select s_nr from vt where ns_r = 3000), the whole row is read from the Store.

          Fixing this performance problem may be a little tricky. The ProjectRestrict nodes are added by the optimizer AFTER the bind-phase has expanded the * in the view and concluded that ALL of the columns are being referenced.

          Show
          Rick Hillegas added a comment - I have verified that there is no special jiggery-pokery to reduce the number of columns read from the base table when it is wrapped in a view. For the SELECT from the view (select s_nr from vt where ns_r = 3000), the whole row is read from the Store. Fixing this performance problem may be a little tricky. The ProjectRestrict nodes are added by the optimizer AFTER the bind-phase has expanded the * in the view and concluded that ALL of the columns are being referenced.
          Hide
          Rick Hillegas added a comment - - edited

          If you wrap a base table in a view, it appears that the projection is not passed through to the base table. I am attaching derbyAST.xml, which is the printout of parse(), bind(), and optimize() ASTs produced by the following script when you use the XmlASTPrinter attached to DERBY-4415. From the printout, you can see the following:

          1) For the SELECT from a base table (select s_nr from t where ns_r = 3000), the optimizer wraps the base table in a ProjectRestrict node which projects out only two columns.

          2) However, for the SELECT from a view on that base table (select s_nr from vt where ns_r = 3000), the optimizer wraps the base table in a ProjectRestrict node which projects out ALL four columns in the table.

          Here is the script:

          connect 'jdbc:derby:memory:db;create=true';

          create function integerList()
          returns table( s_r int, s_nr int, ns_r int, ns_nr int )
          language java
          parameter style derby_jdbc_result_set
          no sql
          external name 'org.apache.derbyTesting.functionTests.tests.lang.RestrictedVTITest.integerList';

          create view vf as select * from table( integerList() ) s;

          create table t ( s_r int, s_nr int, ns_r int, ns_nr int );
          insert into t values ( 1, 2, 3, 4 ), ( 100, 200, 300, 400 ), ( 1000, 2000, 3000, 4000 ), ( 10000, 20000, 30000, 40000 );
          create view vt as select * from t;

          create procedure setInspector( visitorClassName varchar( 32672 ) )
          language java
          parameter style java
          modifies sql data
          external name 'ASTInspector.setInspector'
          ;

          call setInspector( 'XmlASTPrinter' );

          select s_nr from t where ns_r = 3000;
          select s_nr from table( integerList() ) s where ns_r = 3000;

          select s_nr from vt where ns_r = 3000;
          select s_nr from vf where ns_r = 3000;

          call setInspector( null );

          Show
          Rick Hillegas added a comment - - edited If you wrap a base table in a view, it appears that the projection is not passed through to the base table. I am attaching derbyAST.xml, which is the printout of parse(), bind(), and optimize() ASTs produced by the following script when you use the XmlASTPrinter attached to DERBY-4415 . From the printout, you can see the following: 1) For the SELECT from a base table (select s_nr from t where ns_r = 3000), the optimizer wraps the base table in a ProjectRestrict node which projects out only two columns. 2) However, for the SELECT from a view on that base table (select s_nr from vt where ns_r = 3000), the optimizer wraps the base table in a ProjectRestrict node which projects out ALL four columns in the table. Here is the script: connect 'jdbc:derby:memory:db;create=true'; create function integerList() returns table( s_r int, s_nr int, ns_r int, ns_nr int ) language java parameter style derby_jdbc_result_set no sql external name 'org.apache.derbyTesting.functionTests.tests.lang.RestrictedVTITest.integerList'; create view vf as select * from table( integerList() ) s; create table t ( s_r int, s_nr int, ns_r int, ns_nr int ); insert into t values ( 1, 2, 3, 4 ), ( 100, 200, 300, 400 ), ( 1000, 2000, 3000, 4000 ), ( 10000, 20000, 30000, 40000 ); create view vt as select * from t; create procedure setInspector( visitorClassName varchar( 32672 ) ) language java parameter style java modifies sql data external name 'ASTInspector.setInspector' ; call setInspector( 'XmlASTPrinter' ); select s_nr from t where ns_r = 3000; select s_nr from table( integerList() ) s where ns_r = 3000; select s_nr from vt where ns_r = 3000; select s_nr from vf where ns_r = 3000; call setInspector( null );
          Hide
          Rick Hillegas added a comment -

          Attaching a repro, derby-6036.sql.

          Show
          Rick Hillegas added a comment - Attaching a repro, derby-6036.sql.

            People

            • Assignee:
              Unassigned
              Reporter:
              Rick Hillegas
            • Votes:
              1 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

              • Created:
                Updated:

                Development