Issue Details (XML | Word | Printable)

Key: DERBY-3946
Type: Improvement Improvement
Status: Open Open
Priority: Major Major
Assignee: Unassigned
Reporter: Rick Hillegas
Votes: 3
Watchers: 1
Operations

If you were logged in you would be able to see more operations.
Derby

Provide support for using the Derby parser to generate Abstract Syntax Trees

Created: 12/Nov/08 07:26 PM   Updated: Yesterday 02:31 PM
Component/s: SQL
Affects Version/s: 10.5.1.1
Fix Version/s: None

Time Tracking:
Not Specified

File Attachments:
  Size
Java Source File Licensed for inclusion in ASF works ASTParser.java 2009-08-25 04:00 PM Rick Hillegas 4 kB
Java Source File Licensed for inclusion in ASF works ASTParser.java 2009-05-28 01:32 PM Rick Hillegas 4 kB
Java Source File Licensed for inclusion in ASF works ASTParser.java 2008-11-12 09:30 PM Rick Hillegas 4 kB
File Licensed for inclusion in ASF works derby-3946-01-aa-standaloneParser.diff 2008-11-12 09:30 PM Rick Hillegas 2 kB
Java Source File Licensed for inclusion in ASF works TreeWalker.java 2009-11-06 02:31 PM Rick Hillegas 6 kB
Java Source File Licensed for inclusion in ASF works TreeWalker.java 2008-11-14 03:23 PM Rick Hillegas 5 kB
Issue Links:
Reference
 


 Description  « Hide
Users would like to be able to use the Derby parser to produce query trees without actually running the queries on Derby.

 All   Comments   Work Log   Change History   Subversion Commits      Sort Order: Ascending order - Click to sort in descending order
Rick Hillegas added a comment - 12/Nov/08 09:30 PM
Attaching derby-3946-01-aa-standaloneParser.diff and ASTParser.java. To test-drive these changes, you need to do the following:

A) Apply the patch to the Derby trunk and build Derby.

B) Compile ASTParser

C) Run ASTParser. The program takes one argument, which is a query string which you would like to parse. The program then retrieves the parsed query tree and prints it out. For example:

  java ASTParser "select a from t, s where t.a = s.a"

  
Note that this technique for using the parser standalone involves invoking non-public APIs which are subject to change from release to release. Note, however, that these APIs have remained stable since Derby was open-sourced.


The patch touches the following files. I will run tests. I am inclined to check in these changes if the tests pass.


M java/engine/org/apache/derby/impl/sql/conn/GenericLanguageConnectionContext.java
M java/engine/org/apache/derby/iapi/sql/conn/LanguageConnectionContext.java

Adds some accessors to the compiler's state variable so that query trees can be stored and retrieved.


M java/engine/org/apache/derby/impl/sql/GenericStatement.java

Pokes the query tree into the state variable when the user sets the StopAfterParsing tracepoint.

Dag H. Wanvik added a comment - 13/Nov/08 12:13 AM
Tried the program for this example

java ASTParser "select * from table(foo(4))t"

but it seems the binding is done early here so it crashed:
Exception in thread "main" java.sql.SQLSyntaxErrorException: 'APP'.FOO' does not identify a table function.
Does this mean only a subset of the syntax may be used? If so, what parts?

Knut Anders Hatlen added a comment - 13/Nov/08 12:31 PM
Dag, this is because the initialization of NewInvocationNode (invoked by the parser -- see vtiTableConstruct() in sqlgrammar.jj) looks up the class name associated with the table function in the data dictionary. That sounds like the wrong place to do it, it should probably be moved from init() to bindExpression(). The actual verification that the class exists is (correctly) done in bindExpression(). I wouldn't be surprised if there's more misplaced code, since until now it hasn't really made any difference if some of the binding has been performed in the parser.

Rick Hillegas added a comment - 13/Nov/08 01:33 PM
I agree with Knut's analysis. The parse-time constructor for NewInvocationNode contains a bind-time call to the data dictionary. I regard this as a bug. The bug could probably be fixed by faulting in the metadata only when it is needed rather than pro-actively reading the catalogs when the AST is created.

I also agree with Knut that the devil is in the details. Since a standalone parser hasn't been tested widely, there are likely to be several cases where bind-time logic has leaked into the parsing phase. Again, as they come up, I think we should clean them up. It might be worth figuring out how we could record all of the SQL that is issued by our regression tests and then pump that SQL through the ASTParser logic. That could give us some confidence that we have found the central bulge of problems--though it would not be a systematic way to catch all of the outliers.

Rick Hillegas added a comment - 13/Nov/08 01:35 PM
Tests ran cleanly for me on derby-3946-01-aa-standaloneParser.diff. Committed at subversion revision 713721.

Rick Hillegas added a comment - 13/Nov/08 08:07 PM
Additional commentary on this issue can be found in this email thread: http://www.nabble.com/Using-derby-to-parse-an-SQL-statement-td20461127.html#a20461127

Christian Riedel added a comment - 14/Nov/08 12:23 PM
Rick,

your patch has helped us a lot so far ... since I cannot post any messages to the list for some reason, I want to give my feedback here.

getting hold of the QueryTreeNode is a big step in the right direction ... however for actually analyzing the tree it would be necessary to access the tree's elements. I.e. traverse the tree sequentially, getting information how many childnodes exist etc. The problem that I still see is that some of the classes I'd need for this are not visible so that I can use the from the outside.

The QueryTreeNode in your example code seems to be an instance of CursorNode ... how can I go on from here?

The things we'd like to know is: which tables are involved in the statement (also in inner selects) which fields from which tables are accessed, which aliases are given for the tables that are accessed etc.

Could you give me a further hint?

Thanks

Christian

Rick Hillegas added a comment - 14/Nov/08 03:23 PM
Hi Christian,

I have posted a reply to a similar question on your original email thread. Hopefully you will be able to see that soon.

1) It's useful to build the engine javadoc (ant -quiet javadoc) and browse the javadoc for the package org.apache.derby.impl.sql.compile. In particular, you will see that the AST nodes are the classes indented under QueryTreeNode in the tree view.

2) The nodes themselves implement Visitable so you can write you own Visitor to explore the AST graph. Visitable has one method, accept(), and by looking at the implementations of that method, you will understand how the nodes snap together into a graph.

I've attached a simple Visitor (TreeWalker), which shows you some classes in the graph. That may help explain the AST a bit more. When I run

  java TreeWalker "select a from t, s where t.a = s.a"

I get the following graph:

    org.apache.derby.impl.sql.compile.CursorNode
        org.apache.derby.impl.sql.compile.SelectNode
            org.apache.derby.impl.sql.compile.ResultColumnList
                org.apache.derby.impl.sql.compile.ResultColumn
                    org.apache.derby.impl.sql.compile.ColumnReference
            org.apache.derby.impl.sql.compile.FromBaseTable
            org.apache.derby.impl.sql.compile.FromBaseTable
            org.apache.derby.impl.sql.compile.BinaryRelationalOperatorNode
                org.apache.derby.impl.sql.compile.ColumnReference
                org.apache.derby.impl.sql.compile.ColumnReference

Let me try to explain this tree a bit:

The SelectNode has the following children:

i) The columns in the SELECT list. The whole SELECT list is represented by a ResultColumnList and there is only one ResultColumn (representing "a") in that list.

ii) The tables in the FROM list. There are two of these, each represented by its own FromBaseTable.

iii) The WHERE clause. This is a BinaryRelationalOperatorNode (representing the "=" operator). This operator node has a left child and a right child, "t.a" and "s.a" respectively.

So to answer your specific questions:

Q) What tables are in the query?
A) Look for FromBaseTables in the graph.

Q) Which fields in the tables are accessed?
A) The column references can appear in the SELECT list (the ResultColumnList) or in the WHERE clause (under the BinaryRelationalOperatorNode). Note, however, that the columns are not matched up to tables yet. This isn't done by the parser. That kind of name resolution happens during Derby's bind() phase and it requires metadata so that Derby knows the structure of the tables. Unless you create the tables in Derby, there will be no way to move forward to the bind() phase. If you don't provide this information to Derby, then you will have to write your own name-resolution phase.

Q) What are the table aliases, if any?
A) FromBaseTable.getExposedName() will give you the name of the table (or the alias name if you specified an alias)

Hope this helps,
-Rick

Christian Riedel added a comment - 14/Nov/08 04:49 PM
Hi Rick,

thanks a lot for your examples ... that makes things a lot clearer. I think I got the idea now.

Now, I have to try and write my own code around it to make it "easier" to use for us.

This was really great support!

Regards

Christian

Rick Hillegas added a comment - 30/Jan/09 09:00 PM
Linking this issue to DERBY-791. The XmlTreeWalker attached to that issue can be used to print out a query tree in xml.

Myrna van Lunteren added a comment - 18/Mar/09 03:59 AM
Can this be closed? If so, should I put it as a 'new feature' in the Release notes for 10.5?

Rick Hillegas added a comment - 18/Mar/09 06:52 AM
Hi Myrna,

I think there's more work to be done on this issue--I just haven't found a rainy day for that yet. If we say something about this issue in the release notes, then we'll need to warn users that the technique relies on non-public apis which can change between releases. Thanks.

Rick Hillegas added a comment - 28/May/09 01:32 PM
Attaching a version of the ASTParser which has been modified to use a transient in-memory database rather than creating an unnecessary on-disk database.

Knut Anders Hatlen added a comment - 31/May/09 02:40 AM
Good idea to use an in-memory database. The storeless engine (DERBY-2164) is probably an even better fit, as this is exactly the kind of use it was intended for (it skips database creation entirely). The down-side is of course that the storeless classes have not been added to the production jars yet, so you would need to add the classes.storeless directory to the classpath for it to work.

Rick Hillegas added a comment - 25/Aug/09 04:00 PM
Attaching a third rev of the ASTParser. This rev hard-codes the context id for LanguageConnectionContext. This lets you compile ASTParser against the debug jars--the ContextId class itself does not appear in the Derby jar files because the Derby build eliminates classes which merely contain constants.

I have successfully compiled and run this ASTParser against the debug derby.jar that is part of the 10.5.3.0 distribution.

Rick Hillegas added a comment - 06/Nov/09 02:31 PM
Attaching a new version of TreeWalker which implements the visitChildrenFirst() method which was recently added to the Visitor interface.