IMPALA / IMPALA-3499

Backend cannot write catalog-update topic more than 2GB to jByteArray

    Details

      Description

      I've encountered an issue after impalad restarts. The log shows:

      E0509 11:44:08.168478 72547 impala-server.cc:1325] There was an error processing the impalad catalog update. Requesting a full topic update to recover: NegativeArraySizeException: null

      This may be related to the size of the catalog-update topic in statestored exceeding 2 GB.
      In updateCatalogCache, serialized data passed through JNI is represented as a byte[], which cannot exceed that limit.
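
The following is a minimal sketch of why a payload above 2 GB can surface as a NegativeArraySizeException. It is not Impala code: the class and method names are hypothetical, and it assumes the payload size is narrowed to a signed 32-bit length somewhere on the JNI path, which is what a Java array length requires.

```java
// Hypothetical illustration; not taken from the Impala code base.
class ByteArrayLimitDemo {
    /** Narrows a 64-bit payload size to the 32-bit length a Java array requires. */
    static int toArrayLength(long payloadBytes) {
        // The cast silently wraps for values above Integer.MAX_VALUE.
        return (int) payloadBytes;
    }

    /** Returns the name of the exception raised by allocating byte[length], or "ok (...)". */
    static String tryAllocate(int length) {
        try {
            byte[] buf = new byte[length];
            return "ok (" + buf.length + " bytes)";
        } catch (NegativeArraySizeException e) {
            return e.getClass().getSimpleName();
        }
    }

    public static void main(String[] args) {
        // Example size above 2 GB (assumed value for illustration).
        long topicBytes = (long) (2.35 * 1024 * 1024 * 1024);
        int narrowed = toArrayLength(topicBytes);
        System.out.println("Integer.MAX_VALUE = " + Integer.MAX_VALUE);
        System.out.println("topic size        = " + topicBytes);
        System.out.println("narrowed to int   = " + narrowed);
        System.out.println("allocation result = " + tryAllocate(narrowed));
    }
}
```

A size just over 2 GB wraps to a negative int, so the allocation fails before any data is copied, which is consistent with the error only appearing once the topic crosses that threshold.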

      When I restart catalogd (that is, reset the catalog-update topic), everything works fine.

      I assume that metadata larger than 2 GB can be common in large-scale deployments.
      Is there any possible approach to fix this?
      Thanks.

        Issue Links

          Activity

          aivanov_impala_e71b Antoni added a comment -

          Hi,

          We are hitting a similar error, though the message is slightly different (due to OOM):

          E1214 10:11:43.073958 9580 impala-server.cc:1325] There was an error processing the impalad catalog update. Requesting a full topic update to recover: OutOfMemoryError: GC overhead limit exceeded

          Our version is v2.5.0-cdh5.7.0
          Do you think it is related?

          HuaisiXu Huaisi Xu added a comment -

          He Tianyi, we ended up splitting the update into a two-dimensional array and passing that to the frontend. The null pointer exception comes from writing a >2 GB Java byte array in the backend, not from deserializing it in the frontend. Thank you for your help!

          IMPALA-3499: Split catalog update
          JNI does not support writing java byte array larger than 2GB.
          Instead of passing a single serialized update to frontend,
          this patch splits the update into a vector of updates less
          than 500MB each. Then they are serialized, sent to frontend,
          deserialized and merged before calling
          Frontend::updateCatalogCache().

          Change-Id: I176db25124a32944f2396ce8aafbed49cac95928
          Reviewed-on: http://gerrit.cloudera.org:8080/3067
          Reviewed-by: Huaisi Xu <hxu@cloudera.com>
          Tested-by: Huaisi Xu <hxu@cloudera.com>
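
The size-based splitting described in the commit message can be sketched as follows. This is an illustration under stated assumptions, not the actual patch: the class name, the `split` method, and the use of `byte[]` for each serialized catalog object are all hypothetical, and the sketch caps each batch at a byte budget rather than reproducing Impala's Thrift types.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration; not taken from the Impala code base.
class CatalogUpdateSplitter {
    /**
     * Groups serialized objects, in order, into batches whose total size stays
     * at or below maxBatchBytes. An object larger than the budget is placed in
     * a batch of its own rather than dropped.
     */
    static List<List<byte[]>> split(List<byte[]> objects, long maxBatchBytes) {
        List<List<byte[]>> batches = new ArrayList<>();
        List<byte[]> current = new ArrayList<>();
        long currentBytes = 0;
        for (byte[] obj : objects) {
            // Close the current batch if adding this object would exceed the budget.
            if (!current.isEmpty() && currentBytes + obj.length > maxBatchBytes) {
                batches.add(current);
                current = new ArrayList<>();
                currentBytes = 0;
            }
            current.add(obj);
            currentBytes += obj.length;
        }
        if (!current.isEmpty()) {
            batches.add(current);
        }
        return batches;
    }

    public static void main(String[] args) {
        List<byte[]> objects = new ArrayList<>();
        for (int i = 0; i < 10; i++) {
            objects.add(new byte[300]); // stand-in for one serialized catalog object
        }
        // A tiny budget is used here; the commit message uses 500MB.
        List<List<byte[]>> batches = split(objects, 1000);
        for (int i = 0; i < batches.size(); i++) {
            System.out.println("batch " + i + ": " + batches.get(i).size() + " objects");
        }
    }
}
```

Because the input order is preserved, an object that arrives last in the full update also ends up in the last batch, which matters for any receiver that treats a trailing object as an end-of-update marker.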

          He Tianyi He Tianyi added a comment -

          I have this right now in the catalogd JVM (from jstat -gc <pid>, sizes in KB):

               S0C      S1C S0U      S1U        EC        EU         OC         OU
          752640.0 233472.0 0.0 232963.1 8981504.0 8779528.7 20970496.0 18268452.1

          IMHO this is larger by an order of magnitude than the serialized one, right?

          HuaisiXu Huaisi Xu added a comment -

          Thanks. Could you also share your catalogd's heap size, used vs. total? Impala stores serialized metadata in the statestore but unserialized data in catalogd, and I suspect that serialization is sometimes not that efficient. What do you think?

          He Tianyi He Tianyi added a comment -

          Hi, I am not in the US.

          Currently in the cluster:
          Topic catalog-update Size (keys / values / total): 266.60 KB / 2.35 GB / 2.35 GB
          It just recently exceeded 2 GB, hence the issue.

          HuaisiXu Huaisi Xu added a comment -

          Thanks for letting me know. Are you in the US, by the way? How large is the metadata you have (catalog heap? statestore heap?)?

          He Tianyi He Tianyi added a comment -

          Looks good. I basically did the same, and it has been working well for a while (about 2 weeks) on a 50-node cluster.

          HuaisiXu Huaisi Xu added a comment -

          He Tianyi, thanks. I added a check in the frontend. Since the catalog object is guaranteed to be at the end, I think I can do this.

          Thanks for pointing that out.
          Here is the diff.

          diff --git a/fe/src/main/java/com/cloudera/impala/catalog/ImpaladCatalog.java b/fe/src/main/java/com/cloudera/impala/catalog/ImpaladCat
          index b0713a3..2f74e4a 100644
          --- a/fe/src/main/java/com/cloudera/impala/catalog/ImpaladCatalog.java
          +++ b/fe/src/main/java/com/cloudera/impala/catalog/ImpaladCatalog.java
          @@ -111,6 +111,7 @@ public class ImpaladCatalog extends Catalog {
              */
             public synchronized TUpdateCatalogCacheResponse updateCatalog(
               TUpdateCatalogCacheRequest req) throws CatalogException {
          +    boolean last_batch_update = false;
               // Check for changes in the catalog service ID.
               if (!catalogServiceId_.equals(req.getCatalog_service_id())) {
                 boolean firstRun = catalogServiceId_.equals(INITIAL_CATALOG_SERVICE_ID);
          @@ -127,6 +128,7 @@ public class ImpaladCatalog extends Catalog {
               for (TCatalogObject catalogObject: req.getUpdated_objects()) {
                 if (catalogObject.getType() == TCatalogObjectType.CATALOG) {
                   newCatalogVersion = catalogObject.getCatalog_version();
          +        last_batch_update = true;
                 } else {
                   try {
                     addCatalogObject(catalogObject);
          @@ -145,7 +147,7 @@ public class ImpaladCatalog extends Catalog {
               lastSyncedCatalogVersion_ = newCatalogVersion;
               // Cleanup old entries in the log.
               catalogDeltaLog_.garbageCollect(lastSyncedCatalogVersion_);
          -    isReady_.set(true);
          +    if (last_batch_update) isReady_.set(true);
          
               // Notify all the threads waiting on a catalog update.
               synchronized (catalogUpdateEventNotifier_) {
          
          HuaisiXu Huaisi Xu added a comment -

          Oh, you mean the first update. The catalog sends all "incomplete table" objects over at startup, which basically just tell Impala that the table names exist.

          Yes, that can happen when Impala does not receive the incomplete table in the first batch update; in that case Impala will say the table does not exist.

          This won't be a problem when the whole cluster has just started (the batch size is large enough), but it may be a problem when Impala has already been running for a while.

          Thanks for pointing that out.

          He Tianyi He Tianyi added a comment -

          LGTM. Thanks.

          HuaisiXu Huaisi Xu added a comment -

          Actually, that can happen all the time. In this case, Impala will block and wait for it to become available.

          https://issues.cloudera.org/browse/IMPALA-3568 may contain some information about that. I ran into many problems during testing.

          Here is the new code review: http://gerrit.cloudera.org:8080/#/c/3132/

          He Tianyi He Tianyi added a comment -

          Is it possible that a user connects to the 'partially initialized' impalad, executes a query, and gets an empty result? Could incomplete partition metadata be seen?

          HuaisiXu Huaisi Xu added a comment -

          Thanks, He Tianyi! Sorry, I just saw this.

          Yes, I think so. This changes the behavior a little. But I looked at https://issues.cloudera.org/browse/IMPALA-563, which added that flag, and it looks like it is only concerned with whether the catalog object is initialized. With batch updates, this object will be initialized. Impala can tolerate an incomplete catalog cache, since it holds no lock when accessing it even though the catalog cache is shared among many threads.

          He Tianyi He Tianyi added a comment -

          Hi @Huaisi, thanks for the update.
          Yes, it won't cause impalad to restart. It just won't serve correctly after a restart.

          I have one comment on the CR:
          The frontend assumes the first update is complete (it sets the ready flag, triggers listeners, etc.), but now we are sending partial updates during startup.
          This may make the frontend 'think' it is ready to serve when it actually is not.
          In my codebase, I've added another field, 'is_final', alongside 'is_delta'. State flags and triggers are then only changed or fired on receiving the 'final' update.
          I haven't read through all the frontend code, so I don't know whether this would actually cause a problem, but I think this may be safer.
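
The 'is_final' idea can be sketched as below. This is only an illustration of the gating logic, not code from either codebase: `BatchedCatalogReceiver`, `Batch`, and `apply` are hypothetical names, object names stand in for real catalog objects, and `isFinal` mirrors the 'is_final' field described above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical illustration; not taken from the Impala code base.
class BatchedCatalogReceiver {
    /** Stand-in for one partial update; isFinal marks the last batch of a full update. */
    static final class Batch {
        final List<String> objectNames;
        final boolean isFinal;

        Batch(List<String> objectNames, boolean isFinal) {
            this.objectNames = objectNames;
            this.isFinal = isFinal;
        }
    }

    private final List<String> cache = new ArrayList<>();
    private final AtomicBoolean ready = new AtomicBoolean(false);

    /** Applies one batch; the ready flag is only set once the final batch has arrived. */
    synchronized void apply(Batch batch) {
        cache.addAll(batch.objectNames);
        if (batch.isFinal) {
            ready.set(true);
        }
    }

    boolean isReady() {
        return ready.get();
    }

    synchronized int cachedObjectCount() {
        return cache.size();
    }

    public static void main(String[] args) {
        BatchedCatalogReceiver receiver = new BatchedCatalogReceiver();
        receiver.apply(new Batch(Arrays.asList("db1.t1", "db1.t2"), false));
        System.out.println("ready after partial batch: " + receiver.isReady());
        receiver.apply(new Batch(Arrays.asList("db1.t3"), true));
        System.out.println("ready after final batch:   " + receiver.isReady());
    }
}
```

With this shape, a reader that checks the ready flag never observes a cache that is still being filled by the first multi-batch update, at the cost of delaying readiness until the last batch is applied.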

          HuaisiXu Huaisi Xu added a comment -

          He Tianyi, I just pushed for review. Do you have any comments? http://gerrit.cloudera.org:8080/#/c/3067/

          HuaisiXu Huaisi Xu added a comment -

          I think this exception
          E0509 11:44:08.168478 72547 impala-server.cc:1325] There was an error processing the impalad catalog update. Requesting a full topic update to recover: NegativeArraySizeException: nul
          won't cause impalad to restart, unless you also hit IMPALA-3494.

          HuaisiXu Huaisi Xu added a comment -

          I think this is a better solution for https://issues.cloudera.org/browse/IMPALA-3494. Do you think it is better to base the update on byte size instead of the number of objects?

          He Tianyi He Tianyi added a comment -

          By the way, I've modified the impalad side to make the catalog update incremental (backend -> frontend only). Like this:

          I0510 09:42:18.441896 292909 impalad-main.cc:90] Impala has started.
          I0510 09:42:41.614565 293422 impala-server.cc:1295] Incrementally updating 256 catalog objects (is_delta: 0)
          I0510 09:42:41.870551 293422 impala-server.cc:1295] Incrementally updating 256 catalog objects (is_delta: 1)
          I0510 09:42:42.668447 293422 impala-server.cc:1295] Incrementally updating 256 catalog objects (is_delta: 1)
          I0510 09:42:42.689504 293422 impala-server.cc:1295] Incrementally updating 256 catalog objects (is_delta: 1)
          I0510 09:42:42.946655 293422 impala-server.cc:1295] Incrementally updating 256 catalog objects (is_delta: 1)
          I0510 09:42:43.989063 293422 impala-server.cc:1295] Incrementally updating 256 catalog objects (is_delta: 1)
          I0510 09:42:48.049051 293422 impala-server.cc:1295] Incrementally updating 256 catalog objects (is_delta: 1)
          I0510 09:42:48.051121 293422 impala-server.cc:1295] Incrementally updating 256 catalog objects (is_delta: 1)
          I0510 09:42:48.051954 293422 impala-server.cc:1295] Incrementally updating 256 catalog objects (is_delta: 1)
          I0510 09:42:48.060015 293422 impala-server.cc:1295] Incrementally updating 256 catalog objects (is_delta: 1)
          I0510 09:42:48.066064 293422 impala-server.cc:1295] Incrementally updating 256 catalog objects (is_delta: 1)
          I0510 09:42:48.067200 293422 impala-server.cc:1295] Incrementally updating 256 catalog objects (is_delta: 1)
          I0510 09:42:48.068032 293422 impala-server.cc:1295] Incrementally updating 256 catalog objects (is_delta: 1)
          I0510 09:42:48.068969 293422 impala-server.cc:1295] Incrementally updating 256 catalog objects (is_delta: 1)
          I0510 09:42:51.312445 293422 impala-server.cc:1348] Sending last updating with 217 catalog objects (is_delta: 1)
          I0510 09:43:49.522151 293422 ImpaladCatalog.java:151] Received final catalog update, ready to serve

          That is, the catalog update is split into multiple Thrift messages, each < 2 GB.
          It worked well (as long as the topic is still < 4 GB, I think).


            People

            • Assignee: Huaisi Xu
            • Reporter: He Tianyi
            • Votes: 0
            • Watchers: 9