Spark / SPARK-5739

Size exceeds Integer.MAX_VALUE in File Map

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Duplicate
    • Affects Version/s: 1.1.1
    • Fix Version/s: None
    • Component/s: MLlib
    • Labels:
      None
    • Environment:

      Spark 1.1.1 on a cluster with 12 nodes. Each node has 128GB RAM and 24 cores. The data is only 40GB, and there are 48 parallel tasks per node.

    Description

      I just ran the k-means algorithm on randomly generated data, but this problem occurred after some iterations. I tried several times, and the problem is reproducible.

      Because the data is randomly generated, I wonder whether this is a bug. Or, if random data can lead to a scenario where the size is bigger than Integer.MAX_VALUE, can we check the size before using the file map?

      2015-02-11 00:39:36,057 [sparkDriver-akka.actor.default-dispatcher-15] WARN org.apache.spark.util.SizeEstimator - Failed to check whether UseCompressedOops is set; assuming yes
      [error] (run-main-0) java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE
      java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE
      at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:850)
      at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:105)
      at org.apache.spark.storage.DiskStore.putIterator(DiskStore.scala:86)
      at org.apache.spark.storage.MemoryStore.putIterator(MemoryStore.scala:140)
      at org.apache.spark.storage.MemoryStore.putIterator(MemoryStore.scala:105)
      at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:747)
      at org.apache.spark.storage.BlockManager.putIterator(BlockManager.scala:598)
      at org.apache.spark.storage.BlockManager.putSingle(BlockManager.scala:869)
      at org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:79)
      at org.apache.spark.broadcast.TorrentBroadcast.<init>(TorrentBroadcast.scala:68)
      at org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:36)
      at org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:29)
      at org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:62)
      at org.apache.spark.SparkContext.broadcast(SparkContext.scala:809)
      at org.apache.spark.mllib.clustering.KMeans.initKMeansParallel(KMeans.scala:270)
      at org.apache.spark.mllib.clustering.KMeans.runBreeze(KMeans.scala:143)
      at org.apache.spark.mllib.clustering.KMeans.run(KMeans.scala:126)
      at org.apache.spark.mllib.clustering.KMeans$.train(KMeans.scala:338)
      at org.apache.spark.mllib.clustering.KMeans$.train(KMeans.scala:348)
      at KMeansDataGenerator$.main(kmeans.scala:105)
      at KMeansDataGenerator.main(kmeans.scala)
      at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:94)
      at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:55)
      at java.lang.reflect.Method.invoke(Method.java:619)

        Issue Links

          Activity

          DjvuLee added a comment -

          The data is generated by the KMeansDataGenerator example in Spark; I just changed the parameters to:

          val parts = 480
          val numPoints = 480   // 1500
          val k = 10            // args(3).toInt
          val d = 10000000      // args(4).toInt
          val r = 1.0           // args(5).toDouble
          val iter = 8
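
          As a rough sketch only (my own back-of-the-envelope arithmetic, assuming each point is a dense vector of doubles), these parameters line up with the ~40GB dataset mentioned in the environment:

          // Estimate only; not part of KMeansDataGenerator.
          val numPointsEst = 480
          val dims = 10000000
          val bytesPerPoint = dims * 8L                  // dense doubles: ~80 MB per point
          val totalBytes = numPointsEst * bytesPerPoint  // ~38.4 GB in total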

          Sean Owen added a comment -

          You are generating 10,000,000-dimensional data, so each vector is about 80MB. The init process will sample about 2k = 20 centers, and those centers are broadcast, so you're at 1.6GB or more, which is already flirting with the JVM's maximum byte array size of 2GB (a byte array is what is used under the hood to broadcast the serialized object). If you set runs > 1, it multiplies all of this even further.

          That's a size that's not likely to turn up in practice, and I think the implementation assumes it can broadcast these points for speed. So I'd suggest that if you're merely benchmarking, pick a smaller d. Even 1M should be fine.
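
          A quick sketch of that arithmetic (my numbers, assuming ~2k sampled centers per run, each a dense vector of doubles; an estimate, not Spark code):

          val dims = 10000000
          val k = 10
          val runs = 1
          val bytesPerCenter = dims * 8L                      // ~80 MB per center
          val broadcastBytes = runs * 2 * k * bytesPerCenter  // ~1.6 GB, close to the 2 GB byte-array limit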

          DjvuLee added a comment -

          Yes, 1M is probably enough for the k-means algorithm.

          But if we consider other machine learning algorithms, such as logistic regression, then 10^7 dimensions is not that big. LR for ad-click models at that scale is fairly common in practice (I have heard this from friends), so how can Spark deal with this well?

          The weight parameter in LR may be only a single vector, but when the dimension goes up to a billion, that data alone can reach gigabytes.

          Sean Owen added a comment -

          Yes, but you're talking about extremely sparse vectors in problems like that. Here you've set up a fully dense 10M-dimensional input. That's not typical.

          DjvuLee added a comment -

          Got it, thanks very much! But should I understand that Spark does not support sparse vectors very well right now?

          Sean Owen added a comment -

          No, it should be able to operate on sparse vectors, but what you generated and loaded was fully dense.
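
          For reference, a minimal sketch of the difference using org.apache.spark.mllib.linalg.Vectors (the indices and values below are made up for illustration):

          import org.apache.spark.mllib.linalg.Vectors

          // Dense: stores all 10,000,000 doubles, ~80 MB per vector.
          val dense = Vectors.dense(Array.fill(10000000)(0.0))

          // Sparse: stores only the non-zero entries.
          val sparse = Vectors.sparse(10000000, Array(0, 42, 999999), Array(1.0, 3.5, -2.0))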

          DjvuLee added a comment -

          OK, got it. I will look at the code in more detail.

          I will close this issue, but is adding a check for the size still good advice?

          Sean Owen added a comment -

          It's hard to say because it depends on d, k, runs, and some of the details of the initialization. I am not sure where a warning should kick in. Small vectors + a huge k and # runs would also produce an error.
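
          As a rough sketch only (assuming the serialized centers amount to roughly runs * 2k dense vectors of d doubles), such a warning would boil down to a heuristic like:

          // Hypothetical heuristic, not existing Spark code: flag configurations whose
          // estimated init broadcast approaches the 2 GB byte-array limit.
          def initBroadcastTooLarge(d: Long, k: Long, runs: Long): Boolean =
            runs * 2 * k * d * 8L > Int.MaxValue.toLong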

          DjvuLee added a comment -

          Yes, I did not explain it clearly. What I mean is that we could add a check in the getBytes method in DiskStore.scala, just before calling the following function:

          Some(channel.map(MapMode.READ_ONLY, segment.offset, segment.length))
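
          A minimal sketch of such a guard (hypothetical; not the actual DiskStore code apart from the line above):

          // Hypothetical pre-check before memory-mapping: FileChannel.map is limited to
          // Integer.MAX_VALUE bytes, so fail early with a more descriptive message.
          if (segment.length > Int.MaxValue) {
            throw new IllegalArgumentException(
              s"Block is ${segment.length} bytes, which exceeds the 2 GB limit of FileChannel.map")
          }
          Some(channel.map(MapMode.READ_ONLY, segment.offset, segment.length))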

          Sean Owen added a comment -

          What would that really do though except change one error into another? I think the current exception is quite clear.

          DjvuLee added a comment -

          Well, it makes little difference. What I had in mind is that checking the size first would avoid calling channel.map at all, rather than stepping into it and letting it throw an exception. Closing this issue is also fine, because the exception message is now clear enough to find the problem.

          Sean Owen added a comment -

          I think at best this reduces to just hitting the issue that blocks can't be larger than 2GB.

          Karl D. Gierach added a comment - edited

          Is there any way to increase this block limit? I'm hitting the same issue during a UnionRDD operation.

          Also, this issue's state above is "Resolved", but I'm not sure what the resolution is. Maybe a state of "Closed" with a reference to the duplicate ticket would make it clearer.


            People

            • Assignee: Unassigned
            • Reporter: DjvuLee
            • Votes: 0
            • Watchers: 3

              Dates

              • Created:
                Updated:
                Resolved:
