Uploaded image for project: 'Calcite'
  1. Calcite
  2. CALCITE-1742

Create a read-consistent view of CalciteSchema for each statement compilation

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.12.0
    • Fix Version/s: 1.13.0
    • Component/s: core
    • Labels:
      None

      Description

      CalciteSchema is designed to have two sets of objects, the explicit and the implicit. A explicit object means an object we add through explicit CalciteSchema#addTable (or CalciteSchema.addFunction, etc) calls, while an implicit object means an object we get from the underlying Schema object's getXXX methods.

      However, in CalciteCatalogReader#getTableFrom, after a table is resolved through CalciteSchema.getTable method, it will be added to the CalciteSchema again as an explicit object regardless of whether it is originally implicit or explicit. So if it happens to be an implicit table, any change about that table later on will be shadowed by the newly added explicit object and thus cannot be accessed.

        Issue Links

          Activity

          Hide
          jni Jinfeng Ni added a comment -

          Can you elaborate a bit more on this? Which objects will be cached in parent class CalciteSchema?

          Show
          jni Jinfeng Ni added a comment - Can you elaborate a bit more on this? Which objects will be cached in parent class CalciteSchema?
          Hide
          maryannxue Maryann Xue added a comment -

          Sorry, the description was a little inaccurate. In fact, I think the issue exists in CalciteCatalogReader#getTableFrom which after calling schema.getTable, it again calls schema.addTable. I suppose schema.addTable is for adding so-called "explicit" tables but not for caching table objects.

          Show
          maryannxue Maryann Xue added a comment - Sorry, the description was a little inaccurate. In fact, I think the issue exists in CalciteCatalogReader#getTableFrom which after calling schema.getTable , it again calls schema.addTable . I suppose schema.addTable is for adding so-called "explicit" tables but not for caching table objects.
          Hide
          jni Jinfeng Ni added a comment -

          From the code, schema.addTable is only called when the located TableEntry was not successfully unwrapped into a PreparingTable instance. If that happens, it will replace the old one with a new properly created TableEntry in schema. Seems such addTable call has nothing to do with cache?

          1. https://github.com/apache/calcite/blob/master/core/src/main/java/org/apache/calcite/prepare/CalciteCatalogReader.java#L148-L149

          Show
          jni Jinfeng Ni added a comment - From the code, schema.addTable is only called when the located TableEntry was not successfully unwrapped into a PreparingTable instance. If that happens, it will replace the old one with a new properly created TableEntry in schema. Seems such addTable call has nothing to do with cache? 1. https://github.com/apache/calcite/blob/master/core/src/main/java/org/apache/calcite/prepare/CalciteCatalogReader.java#L148-L149
          Hide
          maryannxue Maryann Xue added a comment - - edited

          From the code, schema.addTable is only called when the located TableEntry was not successfully unwrapped into a PreparingTable instance. If that happens, it will replace the old one with a new properly created TableEntry in schema.

          I don't think calling addSchema will actually do anything about the Table instance other than create a new TableEntry object with exactly the same name and Table objects, i.e. it won't wrap the Table instance with a RelOptTableImpl at this point. The side effect of calling addSchema though, is adding this Table into the tableMap which is targeted for explicit tables only (sorry I shouldn't have called it cache). Note that the Table instance here can be an "explicit" or an "implicit" table, so it should NOT be added to the tableMap. Otherwise if it's an implicit Table, the change of this Table from the underlying Schema provider object can be shadowed.

          Show
          maryannxue Maryann Xue added a comment - - edited From the code, schema.addTable is only called when the located TableEntry was not successfully unwrapped into a PreparingTable instance. If that happens, it will replace the old one with a new properly created TableEntry in schema. I don't think calling addSchema will actually do anything about the Table instance other than create a new TableEntry object with exactly the same name and Table objects, i.e. it won't wrap the Table instance with a RelOptTableImpl at this point. The side effect of calling addSchema though, is adding this Table into the tableMap which is targeted for explicit tables only (sorry I shouldn't have called it cache). Note that the Table instance here can be an "explicit" or an "implicit" table, so it should NOT be added to the tableMap . Otherwise if it's an implicit Table, the change of this Table from the underlying Schema provider object can be shadowed.
          Hide
          maryannxue Maryann Xue added a comment -

          Thank you, Jinfeng Ni, for pressing me on the details! I first thought the tableMap was more like another layer of cache and then when I started to fix the issue I realized it was for another purpose. I had decided to update the description together with my PR, but it's definitely better that I do it now.

          Show
          maryannxue Maryann Xue added a comment - Thank you, Jinfeng Ni , for pressing me on the details! I first thought the tableMap was more like another layer of cache and then when I started to fix the issue I realized it was for another purpose. I had decided to update the description together with my PR, but it's definitely better that I do it now.
          Hide
          jni Jinfeng Ni added a comment - - edited

          Do we know why the TableEntry found in schema could not be unwrapped into a PreparingTable instance? Sounds like we will only hit the code of schema.add on L149 when the existing TableEntry could not be unwrapped. Is it because the table is an implicit table?

          I agreed with your argument of side effect of getTableFrom() call; we probably should not call schema.add(), either if we could not find a TableEntry, or the TableEntry could not be unwrapped. The current code means the call getTableFrom() would have side effect. Julian Hyde, what's your thoughts regarding this?

          Show
          jni Jinfeng Ni added a comment - - edited Do we know why the TableEntry found in schema could not be unwrapped into a PreparingTable instance? Sounds like we will only hit the code of schema.add on L149 when the existing TableEntry could not be unwrapped. Is it because the table is an implicit table? I agreed with your argument of side effect of getTableFrom() call; we probably should not call schema.add(), either if we could not find a TableEntry , or the TableEntry could not be unwrapped. The current code means the call getTableFrom() would have side effect. Julian Hyde , what's your thoughts regarding this?
          Hide
          maryannxue Maryann Xue added a comment - - edited

          Do we know why the TableEntry found in schema could not be unwrapped into a PreparingTable instance?

          The instance returned, whether being an implicit table or an explicit one, can be anything that implements the Table interface or its sub-interface, as specified by the schema SPI. And I believe in most cases, the instance does not implement PreparingTable, so the code hits L149 more often than not. Right now MockTable is only thing I can find that will go down the "unwrap" path. And the unwrap logic here is to avoid extra wrapping of "RelOptTableImpl".

          Show
          maryannxue Maryann Xue added a comment - - edited Do we know why the TableEntry found in schema could not be unwrapped into a PreparingTable instance? The instance returned, whether being an implicit table or an explicit one, can be anything that implements the Table interface or its sub-interface, as specified by the schema SPI. And I believe in most cases, the instance does not implement PreparingTable , so the code hits L149 more often than not. Right now MockTable is only thing I can find that will go down the "unwrap" path. And the unwrap logic here is to avoid extra wrapping of "RelOptTableImpl".
          Hide
          maryannxue Maryann Xue added a comment -

          Jinfeng Ni and Julian Hyde, could you please take a look at https://github.com/apache/calcite/pull/421?
          After removing the two occurrences of Schema.addTable that I thought were unnecessary and had side effects, I got two failures from the existing calcite tests. That was because there were two places which required that for each table, the same Table object should be returned within the execution of a Statement:
          1) RelOptTableImpl compares its underlying Table object for hash code generation and equality test.
          2) CsvTranslatableTable will use the fieldTypes initiated during compile time.
          So I guess CalciteCatalogReader and EmptyScope called Schema.addTable deliberately for this exact reason. As it is good enough for a single Statement execution, it does not take into account the fact that table definitions can change between different Statement executions.

          I figured there can be at least two solutions:
          1. Require implementations of Schema SPI always return the same Table object for the same table. This is what I did in the PR. I changed ReflectiveSchema and CsvSchema to let them initiate the "tableMap" just once and return this same tableMap each time.
          2. Keep the Schema.addSchema calls, but clear the "tableMap" between each Statement. This will make sure that Table objects do not change within the same Statement but can change over different Statements.

          Any thoughts?

          Show
          maryannxue Maryann Xue added a comment - Jinfeng Ni and Julian Hyde , could you please take a look at https://github.com/apache/calcite/pull/421? After removing the two occurrences of Schema.addTable that I thought were unnecessary and had side effects, I got two failures from the existing calcite tests. That was because there were two places which required that for each table, the same Table object should be returned within the execution of a Statement: 1) RelOptTableImpl compares its underlying Table object for hash code generation and equality test. 2) CsvTranslatableTable will use the fieldTypes initiated during compile time. So I guess CalciteCatalogReader and EmptyScope called Schema.addTable deliberately for this exact reason. As it is good enough for a single Statement execution, it does not take into account the fact that table definitions can change between different Statement executions. I figured there can be at least two solutions: 1. Require implementations of Schema SPI always return the same Table object for the same table. This is what I did in the PR. I changed ReflectiveSchema and CsvSchema to let them initiate the "tableMap" just once and return this same tableMap each time. 2. Keep the Schema.addSchema calls, but clear the "tableMap" between each Statement. This will make sure that Table objects do not change within the same Statement but can change over different Statements. Any thoughts?
          Hide
          maryannxue Maryann Xue added a comment -

          I just realized that solution 2 was not right. If we had to use a map to keep each Table object we got from calling Schema.getTable, we would need an extra map, different from the current tableMap in CalciteSchema, so that we could reset this extra map only before the execution of a new Statement.

          Show
          maryannxue Maryann Xue added a comment - I just realized that solution 2 was not right. If we had to use a map to keep each Table object we got from calling Schema.getTable , we would need an extra map, different from the current tableMap in CalciteSchema , so that we could reset this extra map only before the execution of a new Statement.
          Hide
          jni Jinfeng Ni added a comment - - edited

          Maryann Xue, it makes sense to remove the side effect of two places you identified, by removing schema.add() function call.

          Make sense to me that Schema should always return same Table object for the same table, regardless of single statement or multiple statement. The same Table instance could be wrapped in different RelOptTableImpl objects.

          I have one question. Will it cause problem if CsvSchema only initializes "tableMap" just once? What if the files in one the directory have been added / deleted between multiple statement execution?

          Show
          jni Jinfeng Ni added a comment - - edited Maryann Xue , it makes sense to remove the side effect of two places you identified, by removing schema.add() function call. Make sense to me that Schema should always return same Table object for the same table, regardless of single statement or multiple statement. The same Table instance could be wrapped in different RelOptTableImpl objects. I have one question. Will it cause problem if CsvSchema only initializes "tableMap" just once? What if the files in one the directory have been added / deleted between multiple statement execution?
          Hide
          maryannxue Maryann Xue added a comment - - edited

          Thank you, Jinfeng Ni, for the feedback!

          I have one question. Will it cause problem if CsvSchema only initializes "tableMap" just once? What if the files in one the directory have been added / deleted between multiple statement execution?

          Yes, if we go down solution 1, it may make Schema implementations more complicated. They have to make sure to return always the same Table object for a specific table name within each Statement, and meanwhile to return a new Table object if its definition has been updated. So in this sense, I am now more inclined towards option 2, in which CalciteSchema should maintain a separate map for tables that have been queried since the start of each statement and this map should be reset before a new Statement starts.

          Show
          maryannxue Maryann Xue added a comment - - edited Thank you, Jinfeng Ni , for the feedback! I have one question. Will it cause problem if CsvSchema only initializes "tableMap" just once? What if the files in one the directory have been added / deleted between multiple statement execution? Yes, if we go down solution 1, it may make Schema implementations more complicated. They have to make sure to return always the same Table object for a specific table name within each Statement, and meanwhile to return a new Table object if its definition has been updated. So in this sense, I am now more inclined towards option 2, in which CalciteSchema should maintain a separate map for tables that have been queried since the start of each statement and this map should be reset before a new Statement starts.
          Hide
          julianhyde Julian Hyde added a comment -

          It's reasonable for the client of the schema SPI (i.e. the preparing statement) to expect read consistency. During preparation, and also if the statement is re-executed.

          So, then the question is: How burdensome is for the implementor of the schema SPI to achieve this? Should we introduce a wrapper to ensure that if the same question is asked twice, we always get the same answer? (CalciteSchema is such a wrapper.)

          There is an existing mechanism for a Schema to indicate whether its contents have changed: Schema.contentsHaveChangedSince(long, long). Is it adequate?

          Show
          julianhyde Julian Hyde added a comment - It's reasonable for the client of the schema SPI (i.e. the preparing statement) to expect read consistency. During preparation, and also if the statement is re-executed. So, then the question is: How burdensome is for the implementor of the schema SPI to achieve this? Should we introduce a wrapper to ensure that if the same question is asked twice, we always get the same answer? ( CalciteSchema is such a wrapper.) There is an existing mechanism for a Schema to indicate whether its contents have changed: Schema.contentsHaveChangedSince(long, long) . Is it adequate?
          Hide
          maryannxue Maryann Xue added a comment -

          So, then the question is: How burdensome is for the implementor of the schema SPI to achieve this? Should we introduce a wrapper to ensure that if the same question is asked twice, we always get the same answer? (CalciteSchema is such a wrapper.)

          The question is: Can the implementor of schema SPI know the boundary between different Statements? If the question is asked twice within the same Statement, the answer should always be the same. And when it comes to the next Statement, it should be able to answer what it is like as of the new Statement time. So in order to allow the implementor to do that, we can either 1) provide a reset callback/mechanism in schema SPI or 2) provide an extra parameter "timestamp" or "query id" each time we call getTable from the schema implementor. Either way it would make the SPI unnecessarily more complicated I think. Doing this in the wrapper, i.e., CalciteSchema is certainly much simpler.
          Right now Phoenix/Calcite integration maintains a map in PhoenixSchema (originally not for this purpose), but has to take advantage of the Hooks to clear the map right before a Statement starts, which is definitely not preferred.

          There is an existing mechanism for a Schema to indicate whether its contents have changed: Schema.contentsHaveChangedSince(long, long). Is it adequate?

          According to javadoc, the Schema.contentsHaveChangedSince(long, long) method is designed for schema caching and the thing we are talking about right now is a different issue and it should work regardless of whether caching is enabled. And even if the application of this method could be extended for this more general purpose, we still need some other mechanism to help it work, and we will be back at the above question again. For example, if "contentsHaveChanged" returns true for our next Statement, we'll have to call a method like "update" on the schema implementor, and this sounds like a combined approach of 1) and 2) mentioned above, coz we have both "reset/update" method and a "timestamp" (from contentsHaveChanged).

          Show
          maryannxue Maryann Xue added a comment - So, then the question is: How burdensome is for the implementor of the schema SPI to achieve this? Should we introduce a wrapper to ensure that if the same question is asked twice, we always get the same answer? (CalciteSchema is such a wrapper.) The question is: Can the implementor of schema SPI know the boundary between different Statements? If the question is asked twice within the same Statement, the answer should always be the same. And when it comes to the next Statement, it should be able to answer what it is like as of the new Statement time. So in order to allow the implementor to do that, we can either 1) provide a reset callback/mechanism in schema SPI or 2) provide an extra parameter "timestamp" or "query id" each time we call getTable from the schema implementor. Either way it would make the SPI unnecessarily more complicated I think. Doing this in the wrapper, i.e., CalciteSchema is certainly much simpler. Right now Phoenix/Calcite integration maintains a map in PhoenixSchema (originally not for this purpose), but has to take advantage of the Hooks to clear the map right before a Statement starts, which is definitely not preferred. There is an existing mechanism for a Schema to indicate whether its contents have changed: Schema.contentsHaveChangedSince(long, long). Is it adequate? According to javadoc, the Schema.contentsHaveChangedSince(long, long) method is designed for schema caching and the thing we are talking about right now is a different issue and it should work regardless of whether caching is enabled. And even if the application of this method could be extended for this more general purpose, we still need some other mechanism to help it work, and we will be back at the above question again. For example, if "contentsHaveChanged" returns true for our next Statement, we'll have to call a method like "update" on the schema implementor, and this sounds like a combined approach of 1) and 2) mentioned above, coz we have both "reset/update" method and a "timestamp" (from contentsHaveChanged ).
          Hide
          julianhyde Julian Hyde added a comment -

          Jinfeng Ni, Can you please take a look at CALCITE-1748 (related to DRILL-5089)? I'd welcome your opinion whether it could be solved in the same way as this case.

          Show
          julianhyde Julian Hyde added a comment - Jinfeng Ni , Can you please take a look at CALCITE-1748 (related to DRILL-5089 )? I'd welcome your opinion whether it could be solved in the same way as this case.
          Hide
          jni Jinfeng Ni added a comment -

          Thanks for the reminder, Julian Hyde. I'll take a look at CALCITE-1748 and DRILL-5089. I overlooked your jira message last week.

          Show
          jni Jinfeng Ni added a comment - Thanks for the reminder, Julian Hyde . I'll take a look at CALCITE-1748 and DRILL-5089 . I overlooked your jira message last week.
          Hide
          maryannxue Maryann Xue added a comment -
          Show
          maryannxue Maryann Xue added a comment - Julian Hyde , Jinfeng Ni , Could you please take a look at https://github.com/apache/calcite/pull/421?
          Hide
          julianhyde Julian Hyde added a comment -

          The weekend is just about to hit, and I don't have time to review this properly. Even though I know you leave on Sunday, Maryann Xue. I'll take a look on Monday.

          Show
          julianhyde Julian Hyde added a comment - The weekend is just about to hit, and I don't have time to review this properly. Even though I know you leave on Sunday, Maryann Xue . I'll take a look on Monday.
          Hide
          julianhyde Julian Hyde added a comment -

          +1 Looks good

          Jinfeng Ni, Can you please review?

          Some ideas for the future:

          • Change long timestamp to a more data type that could accommodate how other systems represent snapshots (e.g. snapshot names)
          • Allow schemas to declare whether they are (a) capable of doing snapshots for themselves, or (b) not capable of doing snapshots and would like CalciteSchema to add a caching wrapper.
          Show
          julianhyde Julian Hyde added a comment - +1 Looks good Jinfeng Ni , Can you please review? Some ideas for the future: Change long timestamp to a more data type that could accommodate how other systems represent snapshots (e.g. snapshot names) Allow schemas to declare whether they are (a) capable of doing snapshots for themselves, or (b) not capable of doing snapshots and would like CalciteSchema to add a caching wrapper.
          Hide
          jni Jinfeng Ni added a comment -

          +1. LGTM.

          Show
          jni Jinfeng Ni added a comment - +1. LGTM.
          Show
          julianhyde Julian Hyde added a comment - Fixed in http://git-wip-us.apache.org/repos/asf/calcite/commit/4519ef6e .
          Hide
          jcamachorodriguez Jesus Camacho Rodriguez added a comment -

          Resolved in release 1.13.0 (2017-06-26).

          Show
          jcamachorodriguez Jesus Camacho Rodriguez added a comment - Resolved in release 1.13.0 (2017-06-26).

            People

            • Assignee:
              maryannxue Maryann Xue
              Reporter:
              maryannxue Maryann Xue
            • Votes:
              0 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development