Traffic Server / TS-1114

Crash report: HttpTransactCache::SelectFromAlternates

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.1.4, 3.0.5
    • Component/s: None
    • Labels: None
    • Backport to Version:

      Description

      This may or may not be an upstream issue; opening it for tracking.

      #0  0x000000000053075e in HttpTransactCache::SelectFromAlternates (cache_vector=0x2aaab80ff500, client_request=0x2aaab80ff4c0, 
          http_config_params=0x2aaab547b800) at ../../proxy/hdrs/HTTP.h:1375
      1375	  ((int32_t *) & val)[0] = m_alt->m_object_key[0];
      

          Activity

          Zhao Yongming added a comment -
          (gdb) f 1
          #1  0x0000000000644387 in CacheVC::openReadChooseWriter (this=0x2aaab80ff400, event=8, e=<value optimized out>) at CacheRead.cc:341
          (gdb) p vector
          $19 = {magic = 0x0, data = {data = 0x2aaabcc8bc78, fast_data = {{alternate = {m_alt = 0x0}}, {alternate = {m_alt = 0x0}}, {alternate = {
                    m_alt = 0x0}}, {alternate = {m_alt = 0x0}}}, default_val = 0xe85a58, size = 8, pos = 7}, xcount = 8, vector_buf = {m_ptr = 0x0}}
          (gdb)
          
          Leif Hedstrom added a comment -

          Moved to 3.1.4. Please move back to 3.1.3 any bugs that you will work on in the next 2 weeks.

          weijin added a comment -

          write_vector should be protected by the vol mutex.
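
For readers unfamiliar with the pattern, here is a minimal sketch of what "protected by the vol mutex" means. It is not the TS-1114 patch and does not use Traffic Server's own types: `Vol`, `WriteVector`, `add_alternate`, `count_alternates`, and `run_demo` are hypothetical stand-ins, and `std::mutex` is assumed in place of the real volume lock.

```cpp
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical stand-in for a cache volume: one mutex guards its shared state.
struct Vol {
  std::mutex mutex;
};

// Hypothetical stand-in for the shared vector of alternates.
struct WriteVector {
  std::vector<int> alternates;
};

// Every write to the vector takes the volume mutex first.
void add_alternate(Vol &vol, WriteVector &wv, int alt) {
  std::lock_guard<std::mutex> lock(vol.mutex);
  wv.alternates.push_back(alt);
}

// Reads take the same mutex, so a reader never sees a half-updated vector.
std::size_t count_alternates(Vol &vol, WriteVector &wv) {
  std::lock_guard<std::mutex> lock(vol.mutex);
  return wv.alternates.size();
}

// Two writers append concurrently; returns the final element count.
std::size_t run_demo(int per_thread) {
  Vol vol;
  WriteVector wv;
  auto writer = [&] {
    for (int i = 0; i < per_thread; ++i) {
      add_alternate(vol, wv, i);
    }
  };
  std::thread a(writer);
  std::thread b(writer);
  a.join();
  b.join();
  return count_alternates(vol, wv);
}
```

Without the lock in `add_alternate`, the two writers would race on the vector's internal storage, which is the kind of corruption that can surface later as a crash in an unrelated reader.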

          Zhao Yongming added a comment -

          This patch has run perfectly in our production environment for weeks.

          John Plevyak added a comment -

          Gads, yes, the write_vector needs to be protected by the vol mutex. This is a serious oversight. Thanx for finding it.

          This patch has to get in. Do you want to commit it or do you want me to do a closer read then commit it?

          Zhao Yongming added a comment -

          Yeah, we are confident that we have fixed the crash. We need your review; that is what we are waiting for.

          Zhao Yongming added a comment -

          While tracking down this issue, we pursued two directions:
          Weijin investigated why the event is "8", since no event in the event system should have the value 8. In other core dumps we are also sure that the event is not a real event; it looks like random data. That points to something really interesting: 1. the old data (which may or may not be the same event) was freed, but the event was not canceled; or 2. someone overwrote the data in this event. Weijin followed this path and found that the action cancel code can cause problems in certain situations. He made a patch in our tree, and we applied it to half of our servers, which have run without any crash for weeks.

          At the same time, Koutai worked on making vector writes and reads safer, even in very unusual situations. That patch was applied to half of our servers, which have also run without any crash.

          After careful discussion, we concluded that Weijin's patch is the one we need to keep, and here it is.

          Back to TS-857: looking at it again, there is a strange event in the backtrace; we have only , is that the same issue here? Where is the action canceled without mutex protection? If we consider TS-1114 a good fix, then we should consider TS-857 the same crash.

          So far, I am not sure how many crashes remain after patching with TS-1114; I just don't get many new backtraces for this issue. TS-1114 may have covered many strange crashes, since it makes the system behave really strangely.
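
The "freed but not canceled" failure mode described above can be illustrated with a small sketch. This is an assumption-laden simplification, not Traffic Server's event system and not the committed patch: the `Action` struct, its `cancel` and `dispatch` methods, and the use of `std::mutex` are all hypothetical. The idea is that the owner cancels the action under a mutex before freeing its state, and the dispatcher checks the flag under the same mutex before calling back.

```cpp
#include <functional>
#include <mutex>

// Hypothetical simplified action: a callback plus a cancel flag,
// both guarded by the same mutex.
struct Action {
  std::mutex mutex;
  bool cancelled = false;
  std::function<void(int)> callback;

  // The owner calls this before freeing whatever state the callback touches.
  void cancel() {
    std::lock_guard<std::mutex> lock(mutex);
    cancelled = true;
  }

  // The event system calls this; returns true only if the callback ran.
  // The callback runs with the mutex held, so it must not call cancel() itself.
  bool dispatch(int event) {
    std::lock_guard<std::mutex> lock(mutex);
    if (cancelled) {
      return false;  // stale event: drop it instead of touching freed state
    }
    callback(event);
    return true;
  }
};
```

If `cancel` set the flag without taking the mutex, a dispatcher could pass the `cancelled` check and then run the callback after the owner had already freed its data, which would look like an event carrying random values.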

          John Plevyak added a comment -

          This patch has been committed to 3.0.x and master branches. Please verify and mark this "fixed" if your testing confirms that the problem is gone.

          John Plevyak added a comment -

          Not in 3.0.x yet. Need to get agreement on a backport.

          Conan Wang added a comment -

          I get "Hunk #1 FAILED at 1491" when trying to backport, because TS-1084 also makes a simple modification to the same code.

          Brian Geffon added a comment -

          Committed to 3.0.x : 9dde9dc230d8e6ba4ed9af37a2692a09e9a73260


            People

            • Assignee: Brian Geffon
            • Reporter: Zhao Yongming