IMPALA-1702: "invalidate metadata" can cause duplicate TableIds


      Description

      CatalogServiceCatalog.reset() sets the table ID counter back to 0 while holding the catalog lock. However, TableLoadingMgr.load() calls catalog_.getNextTableId() without taking that lock. This means that if reset() and multiple load()s are running concurrently, two tables created by load() can end up with the same ID (I'm not sure exactly what happens to these loaded tables since they'll get cleared from the catalog soon, but it appears they still show up in queries).

      This means that if "invalidate metadata" is called concurrently with queries that touch at least two tables (e.g. inserts), there is a chance that operators in the BE will see the wrong table descriptor.


          Activity

          mgrund_impala_bb91 Martin Grund added a comment -

           If I understand the code correctly, nextTableId is an AtomicInteger and thus the mentioned case couldn't happen. A similar problem was fixed in:

          4b1f1965ee3ba2ae4efe9d55a22c917cdaf91c30
          Fix race when invalidating catalog metadata and loading a new table

          alex.behm Alexander Behm added a comment -

          Dimitris, can you confirm that this is indeed fixed?

          skye Skye Wanderman-Milne added a comment -

           The problem isn't torn reads/writes to nextTableId; it's that we appear to be able to call reset() in between loading the metadata for multiple tables in a single query, yielding two tables with the same ID.

          We haven't been able to reproduce this and I didn't trace through the code enough to guarantee this is possible, but I think we should leave the JIRA open until we're confident it's impossible.

          dhecht Dan Hecht added a comment -

           Perhaps we should stop resetting nextTableId and instead remember its value when reset() is called. Then add an assert that the tableIds referenced by a query are either all less than the reset value or all greater than or equal to it (i.e. a query does not mix old and new metadata). That might reproduce more readily, since we'd check that condition on all queries rather than only noticing when a tableId happens to be duplicated exactly within a single query.
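
           A rough sketch of that check, assuming a hypothetical helper that is handed the table IDs referenced by one query plus the counter value remembered at the last reset() (these names are illustrative, not actual Impala code):

           import java.util.List;

           // Hypothetical helper illustrating the suggested assertion; not actual Impala code.
           class ResetBoundaryCheck {
             // lastResetValue is the value nextTableId had when reset() was last called.
             static void checkNoMixedMetadata(List<Integer> queryTableIds, int lastResetValue) {
               boolean anyOld = queryTableIds.stream().anyMatch(id -> id < lastResetValue);
               boolean anyNew = queryTableIds.stream().anyMatch(id -> id >= lastResetValue);
               // A single query must not reference both pre-reset and post-reset metadata.
               assert !(anyOld && anyNew) : "query mixes old and new table IDs";
             }
           }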

          skye Skye Wanderman-Milne added a comment -

          Two things I tried (and gave up on):

           1) Similar to Dan's suggestion: adding a catalog_reset_count catalog variable, in addition to the catalog version, which is incremented each time CatalogServiceCatalog.reset() is called, i.e. each time an "invalidate metadata" is issued. The idea is to propagate the catalog_reset_count along with the catalog version, and then at the end of query analysis check that all referenced tables have the same catalog_reset_count. Like Dan's idea, we then don't rely on duplicate table IDs to catch this condition.

          I had trouble propagating the catalog_reset_count though. I basically tried to add it everywhere we pass around the catalog version, but it turns out this is a lot of places.

           2) Never resetting nextTableId_. nextTableId_ is an AtomicInt (vs. AtomicLong), meaning it can overflow. I don't think we can deal with negative table IDs; in particular, Id.INVALID_ID = -1. I tried changing it to an AtomicLong, but this affected so many code paths, some in non-trivial ways, that I gave up. Another approach may be to keep the AtomicInt but add logic to safely set it back to 0 when it overflows (see the sketch below).
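
           A rough sketch of that last alternative, keeping an AtomicInteger but wrapping it back to 0 instead of letting it overflow into negative values (hypothetical helper, not the actual Impala code):

           import java.util.concurrent.atomic.AtomicInteger;

           // Hypothetical ID generator illustrating an overflow-safe wrap-around.
           class WrappingTableIdGenerator {
             private final AtomicInteger nextTableId = new AtomicInteger(0);

             int getNextTableId() {
               while (true) {
                 int current = nextTableId.get();
                 // Wrap to 0 instead of overflowing; Id.INVALID_ID is -1, so
                 // negative IDs must never be handed out.
                 int next = (current == Integer.MAX_VALUE) ? 0 : current + 1;
                 if (nextTableId.compareAndSet(current, next)) return current;
               }
             }
           }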

          HuaisiXu Huaisi Xu added a comment -

          I received some comments on this:

          """
          all the hosts in this small cluster are VMware virtual machines, and the crashes seem to coincide with when VMware Consolidated Backup (VCB) snapshots are in progress.

          According to a "Understanding VMware Consolidated Backup" whitepaper I found on VMware's site:
          Before creating the snapshot, Consolidated Backup flushes the transient writes in the guest operating system and suspends any further writes for a few seconds in order to create a crash-consistent virtual machine image.
          """
           Not sure yet how this relates to this problem.

           HuaisiXu Huaisi Xu added a comment -

           https://github.com/apache/incubator-impala/commit/01e7b1101523a668b5959462490918ae1a312452

          tarmstrong Tim Armstrong added a comment -

          I believe this issue can sometimes cause another crash aside from IMPALA-4019:

          Thread 338 (crashed)
           0  impalad!impala::ExprContext::GetValue(impala::TupleRow const*) [expr-context.cc : 224 + 0x0]
              rax = 0x000000000d1e40c0   rdx = 0x0000000000000000
              rcx = 0x0000000000000000   rbx = 0x0000000000000002
              rsi = 0x0000000000000000   rdi = 0x0000000000000000
              rbp = 0x0000000000000010   rsp = 0x00007f6516560f88
               r8 = 0x00000000023523c6    r9 = 0x0000000000000000
              r10 = 0x0000000000000004   r11 = 0x00007f6653b1e4a0
              r12 = 0x000000000f61d2e0   r13 = 0x000000000d2ce750
              r14 = 0x000000000e500b40   r15 = 0x0000000015798840
              rip = 0x000000000081cc83
              Found by: given as instruction pointer in context
           1  impalad!impala::HdfsTableSink::Open(impala::RuntimeState*) [hdfs-table-sink.cc : 206 + 0xf]
              rbx = 0x0000000000000002   rbp = 0x0000000000000010
              rsp = 0x00007f6516560f90   r12 = 0x000000000f61d2e0
              r13 = 0x000000000d2ce750   r14 = 0x000000000e500b40
              r15 = 0x0000000015798840   rip = 0x0000000000d7a4ef
              Found by: call frame info
          
          tarmstrong Tim Armstrong added a comment -

          This can also manifest as the error "No default partition found for HdfsTextTableSink"


            People

            • Assignee: Huaisi Xu
            • Reporter: Skye Wanderman-Milne
            • Votes: 0
            • Watchers: 10
