Hadoop Common / HADOOP-12941

abort in Unsafe_GetLong when running IA64 HPUX 64bit mode


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Won't Fix
    • Environment: HP-UX IA64 running in 64-bit mode

    Description

      Now that we have a core to look at we can sorta see what is going on:

      #14 0x9fffffffaf000dd0 in Java native_call_stub frame
      #15 0x9fffffffaf014470 in JNI frame: sun.misc.Unsafe::getLong (java.lang.Object, long) ->long
      #16 0x9fffffffaf0067a0 in interpreted frame: org.apache.hadoop.hbase.util.Bytes$LexicographicalComparerHolder$UnsafeComparer::compareTo (byte[], int, int, byte[], int, int) ->int bci: 74
      #17 0x9fffffffaf0066e0 in interpreted frame: org.apache.hadoop.hbase.util.Bytes$LexicographicalComparerHolder$UnsafeComparer::compareTo (java.lang.Object, int, int, java.lang.Object, int, int) ->int bci: 16
      #18 0x9fffffffaf006720 in interpreted frame: org.apache.hadoop.hbase.util.Bytes::compareTo (byte[], int, int, byte[], int, int) ->int bci: 11
      #19 0x9fffffffaf0066e0 in interpreted frame: org.apache.hadoop.hbase.KeyValue$KVComparator::compareRowKey (org.apache.hadoop.hbase.Cell, org.apache.hadoop.hbase.Cell) ->int bci: 36
      #20 0x9fffffffaf0066e0 in interpreted frame: org.apache.hadoop.hbase.KeyValue$KVComparator::compare (org.apache.hadoop.hbase.Cell, org.apache.hadoop.hbase.Cell) ->int bci: 3
      #21 0x9fffffffaf0066e0 in interpreted frame: org.apache.hadoop.hbase.KeyValue$KVComparator::compare (java.lang.Object, java.lang.Object) ->int bci: 9

      ;; Line: 400
      0xc00000003ad84d30:0 <Unsafe_GetLong+0x130>: (p1) ld8 r45=[r34]
      0xc00000003ad84d30:1 <Unsafe_GetLong+0x131>: adds r34=16,r32
      0xc00000003ad84d30:2 <Unsafe_GetLong+0x132>: adds ret0=8,r32;;
      0xc00000003ad84d40:0 <Unsafe_GetLong+0x140>: add ret1=r35,r45 <==== r35 is off
      0xc00000003ad84d40:1 <Unsafe_GetLong+0x141>: ld8 r35=[r34],24
      0xc00000003ad84d40:2 <Unsafe_GetLong+0x142>: nop.i 0x0
      0xc00000003ad84d50:0 <Unsafe_GetLong+0x150>: ld8 r41=[ret0];;
      0xc00000003ad84d50:1 <Unsafe_GetLong+0x151>: ld8.s r49=[r34],-24
      0xc00000003ad84d50:2 <Unsafe_GetLong+0x152>: nop.i 0x0
      0xc00000003ad84d60:0 <Unsafe_GetLong+0x160>: ld8 r39=[ret1];; <=== abort
      0xc00000003ad84d60:1 <Unsafe_GetLong+0x161>: ld8 ret0=[r35]
      0xc00000003ad84d60:2 <Unsafe_GetLong+0x162>: nop.i 0x0;;
      0xc00000003ad84d70:0 <Unsafe_GetLong+0x170>: cmp.ne.unc p1=r0,ret0;; M,MI
      0xc00000003ad84d70:1 <Unsafe_GetLong+0x171>: (p1) mov r48=r41
      0xc00000003ad84d70:2 <Unsafe_GetLong+0x172>: (p1) chk.s.i r49,Unsafe_GetLong+0x290

      (gdb) x /10i $pc-48*2
      0x9fffffffaf000d70: flushrs MMI
      0x9fffffffaf000d71: mov r44=r32
      0x9fffffffaf000d72: mov r45=r33
      0x9fffffffaf000d80: mov r46=r34 MMI
      0x9fffffffaf000d81: mov r47=r35
      0x9fffffffaf000d82: mov r48=r36
      0x9fffffffaf000d90: mov r49=r37 MMI
      0x9fffffffaf000d91: mov r50=r38
      0x9fffffffaf000d92: mov r51=r39
      0x9fffffffaf000da0: adds r14=0x270,r4 MMI

      (gdb) p /x $r35
      $9 = 0x22
      (gdb) x /x $ret1
      0x9ffffffe1d0d2bda: 0x677a68676c78743a
      (gdb) x /x $r45+0x22
      0x9ffffffe1d0d2bda: 0x677a68676c78743a

      So here is the problem. This is a 64-bit JVM:

      0 : /opt/java8/bin/IA64W/java
      1 : -Djava.util.logging.config.file=/test28/gzh/tomcat/conf/logging.properties
      2 : -Djava.util.logging.manager=org.apache.juli.ClassLoaderLogManager
      3 : -Dorg.apache.catalina.security.SecurityListener.UMASK=022
      4 : -server
      5 : -XX:PermSize=128m
      6 : -XX:MaxPermSize=256m
      7 : -Djava.endorsed.dirs=/test28/gzh/tomcat/endorsed
      8 : -classpath
      9 : /test28/gzh/tomcat/bin/bootstrap.jar:/test28/gzh/tomcat/bin/tomcat-juli.jar
      10 : -Dcatalina.base=/test28/gzh/tomcat
      11 : -Dcatalina.home=/test28/gzh/tomcat
      12 : -Djava.io.tmpdir=/test28/gzh/tomcat/temp
      13 : org.apache.catalina.startup.Bootstrap
      14 : start

      Since they are not passing any -Xmx values, we are taking the defaults, which look at the system resources. What is happening here is that an address that is not 64-bit aligned (0x9ffffffe1d0d2bda, which is the array object at 0x9ffffffe1d0d2bb8 plus the offset 0x22) is being used for an 8-byte load from a byte array:

      (gdb) jo 0x9ffffffe1d0d2bb8
      _mark = 0x0000000000000001, _klass = 0x9fffffffa8c00768, instance of type [B
      length of the array: 118
      0 0 0 102 0 0 0 8 0 70 103 122 104 103 108 120 116 58 70 83 78 95 50 48 49 53 49 48 50 50 44 65 44 49 52 52 53 52 55 57 57 51 51 57 53 56 46 52 56 54 55 50 48 51 49 99 57 97 101 52 57 101 97 101 49 100 56 49 51 53 51 99 99 97 97 54 98 56 100 46 4 105 110 102 111 115 101 113 110 117 109 68 117 114 105 110 103 79 112 101 110 0 0 1 80 -6 96 -95 -48 4 0 0 0 0 0 0 0 4

      This is the whole string:

      (gdb) x /2s 0x9ffffffe1d0d2bd8
      0x9ffffffe1d0d2bd8: ""
      0x9ffffffe1d0d2bd9: "Fgzhglxt:FSN_20151022,A,1445479933958.48672031c9ae49eae1d81353ccaa6b8d.\004infoseqnumDuringOpen"

      To me this is a bug in the caller, potentially in org.apache.hadoop.hbase.util.Bytes$LexicographicalComparerHolder$UnsafeComparer::compareTo. Why are they calling Unsafe_GetLong on a byte array? There is no checking of alignment, and I really think this is a bug on their part. As far as I know, GetLong expects 64-bit alignment. I did find some other 64-bit users who saw this with the same stack trace as this customer:
      https://issues.apache.org/jira/browse/PHOENIX-1438
      http://permalink.gmane.org/gmane.comp.java.hadoop.hbase.devel/39017
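The address arithmetic from the core can be checked with a few lines of Java. This is only a sketch for illustration: the class name `AlignmentCheck` and the helper `isAligned` are made up here, and the two constants are copied from the gdb session above (the `jo` address of the byte array and the value of r35).

```java
public class AlignmentCheck {

    /** Returns true when address is a multiple of size (size must be a power of two). */
    public static boolean isAligned(long address, int size) {
        return (address & (size - 1)) == 0;
    }

    public static void main(String[] args) {
        long arrayObject = 0x9ffffffe1d0d2bb8L; // byte[] object address from "jo" in the core
        long offset = 0x22L;                    // value of r35 in the core
        long target = arrayObject + offset;     // address used by "ld8 r39=[ret1]"

        System.out.println("target          = 0x" + Long.toHexString(target));
        System.out.println("8-byte aligned? = " + isAligned(target, 8));
        System.out.println("4-byte aligned? = " + isAligned(target, 4));
        System.out.println("2-byte aligned? = " + isAligned(target, 2));
    }
}
```

The array object itself is 8-byte aligned; it is the 0x22 offset that moves the load onto an address that is only 2-byte aligned.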

      The fix would go here, by adding a test for ia64. Looking at the code from a bug, they are checking whether the box is sparc:

      static Comparer<byte[]> getBestComparer() {
      +  if (System.getProperty("os.arch").equals("sparc")) { <====
      +    if (LOG.isTraceEnabled()) {
      +      LOG.trace("Lexicographical comparer selected for "
      +          + "byte aligned system architecture");
      +    }
      +    return lexicographicalComparerJavaImpl();
      +  }
         try {
           Class<?> theClass = Class.forName(UNSAFE_COMPARER_NAME);

      So this is 'fixable' from a Java class perspective. Hari said he will talk with his open source contact.

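For reference, the fallback that the sparc check selects does not need any 8-byte loads at all. Below is a minimal sketch of a pure-Java lexicographical comparison, assuming the same (buffer, offset, length) argument shape seen in the stack trace. The class name `PureJavaByteComparer` is hypothetical and this is not the actual Hadoop or HBase implementation.

```java
public class PureJavaByteComparer {

    /**
     * Compares two byte ranges lexicographically, treating bytes as unsigned.
     * Returns a negative value, zero, or a positive value.
     */
    public static int compareTo(byte[] a, int aOff, int aLen,
                                byte[] b, int bOff, int bLen) {
        int n = Math.min(aLen, bLen);
        for (int i = 0; i < n; i++) {
            // Mask to 0..255 so that 0x80 sorts after 0x7f.
            int x = a[aOff + i] & 0xff;
            int y = b[bOff + i] & 0xff;
            if (x != y) {
                return x - y;
            }
        }
        // All shared bytes are equal, so the shorter range sorts first.
        return aLen - bLen;
    }

    public static void main(String[] args) {
        byte[] left = {1, 2, 3};
        byte[] right = {1, 2, 4};
        System.out.println(compareTo(left, 0, left.length, right, 0, right.length) < 0);
    }
}
```

Every access here is a single-byte array read, so no alignment requirement applies on any architecture.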
      This Hadoop bug report points to the same problem in the same code:

      https://issues.apache.org/jira/browse/HADOOP-11466

      In that case the symptom of the unaligned accesses was bad performance instead of a crash. This shows diffs for that fix:

      http://mail-archives.apache.org/mod_mbox/hadoop-common-commits/201501.mbox/%3Cb19d5f83ca7148b782e5b432817b6448@git.apache.org%3E

      Those diffs show that the fix only avoids the bad code when running on "sparc". They really should have avoided that code on every architecture other than x86. They should not assume that the FastByteComparisons enhancement will work on other processors, or that it actually improves performance there. On processors that do allow unaligned accesses, but only at a high cost, they are just creating bad performance that will be hard for anyone to ever find.
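The "every architecture other than x86" suggestion amounts to an allowlist instead of a sparc-only denylist. A rough sketch of that idea follows. The class `ComparerSelection`, its method names, and the set of `os.arch` strings are all assumptions made for illustration; the exact `os.arch` values reported by a given JVM should be checked on the target platform.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

public class ComparerSelection {

    // Assumed os.arch spellings for x86-family JVMs; illustrative, not exhaustive.
    private static final Set<String> UNALIGNED_OK =
            new HashSet<>(Arrays.asList("x86", "i386", "amd64", "x86_64"));

    /** True only for architectures explicitly listed as tolerating unaligned loads. */
    public static boolean unalignedAccessAssumedSafe(String osArch) {
        return osArch != null && UNALIGNED_OK.contains(osArch.toLowerCase(Locale.ROOT));
    }

    /** Returns a label for the comparer that would be chosen (hypothetical labels). */
    public static String chooseComparer(String osArch) {
        return unalignedAccessAssumedSafe(osArch) ? "unsafe" : "pure-java";
    }

    public static void main(String[] args) {
        String arch = System.getProperty("os.arch");
        System.out.println(arch + " -> " + chooseComparer(arch));
    }
}
```

With this shape, an unknown or unlisted architecture such as IA64 or sparc falls back to the pure-Java path by default, rather than having to be named one at a time.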

      For all IA64 customers this will be an issue when running 64 bit. The IA64 processor enforces operand alignment for certain instruction types, such as the ld8 load that aborts in the disassembly above.
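If eight bytes at a time are still wanted, they can be assembled from single-byte reads, which carry no alignment requirement. The sketch below is an illustration only: `SafeLongRead` and `getLongBigEndian` are hypothetical names, and the sample bytes are the first 18 values of the array dumped from the core.

```java
public class SafeLongRead {

    /** Builds a big-endian long from 8 bytes starting at off, one byte at a time. */
    public static long getLongBigEndian(byte[] buf, int off) {
        long value = 0;
        for (int i = 0; i < 8; i++) {
            value = (value << 8) | (buf[off + i] & 0xffL);
        }
        return value;
    }

    public static void main(String[] args) {
        // First 18 bytes of the array from the core; index 10 is where the row key starts.
        byte[] head = {0, 0, 0, 102, 0, 0, 0, 8, 0, 70,
                       103, 122, 104, 103, 108, 120, 116, 58};
        System.out.println("0x" + Long.toHexString(getLongBigEndian(head, 10)));
    }
}
```

Reading at index 10 reproduces the value gdb shows at 0x9ffffffe1d0d2bda (0x677a68676c78743a, the bytes of "gzhglxt:"), without ever issuing an 8-byte load on an unaligned address.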

          People

            Assignee: Unassigned
            Reporter: gene bradley
            Votes: 0
            Watchers: 4
