IGNITE-11783

Open file limit for deb distribution

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 2.7
    • Fix Version/s: None
    • Component/s: persistence
    • Environment:

      ubuntu-16.04

    • Ignite Flags:
      Docs Required

      Description

      Steps to reproduce:
      1) Install Ignite from the deb package on Ubuntu 16.04
      2) Start a node with persistence enabled
      3) Create 5 caches (or one cache with 4000+ partitions)
      Error text:

      [18:29:44,369][INFO][exchange-worker-#43][GridCacheDatabaseSharedManager] Restoring partition state for local groups [cntPartStateWal=0, lastCheckpointId=bd24ff23-da6f-46e5-bafd-b643db3870d4]
      [18:29:51,864][SEVERE][exchange-worker-#43][] Critical system error detected. Will be handled accordingly to configured handler [hnd=StopNodeOrHaltFailureHandler [tryStop=false, timeout=0, super=AbstractFailureHandler [ignoredFailureTypes=[SYSTEM_WORKER_BLOCKED]]], failureCtx=FailureContext [type=CRITICAL_ERROR, err=class o.a.i.i.processors.cache.persistence.StorageException: Failed to initialize partition file: /usr/share/apache-ignite/work/db/node00-f49af718-48da-4186-b664-62aca736bdc9/cache-SQL_PUBLIC_VERTEX_TBL/part-913.bin]]
      class org.apache.ignite.internal.processors.cache.persistence.StorageException: Failed to initialize partition file: /usr/share/apache-ignite/work/db/node00-f49af718-48da-4186-b664-62aca736bdc9/cache-SQL_PUBLIC_VERTEX_TBL/part-913.bin
              at org.apache.ignite.internal.processors.cache.persistence.file.FilePageStore.init(FilePageStore.java:444)
              at org.apache.ignite.internal.processors.cache.persistence.file.FilePageStore.ensure(FilePageStore.java:650)
              at org.apache.ignite.internal.processors.cache.persistence.file.FilePageStoreManager.ensure(FilePageStoreManager.java:712)
              at org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager.restorePartitionStates(GridCacheDatabaseSharedManager.java:2472)
              at org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager.applyLastUpdates(GridCacheDatabaseSharedManager.java:2419)
              at org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager.restoreState(GridCacheDatabaseSharedManager.java:1628)
              at org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager.beforeExchange(GridCacheDatabaseSharedManager.java:1302)
              at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.distributedExchange(GridDhtPartitionsExchangeFuture.java:1453)
              at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.init(GridDhtPartitionsExchangeFuture.java:806)
              at org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$ExchangeWorker.body0(GridCachePartitionExchangeManager.java:2667)
              at org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$ExchangeWorker.body(GridCachePartitionExchangeManager.java:2539)
              at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:120)
              at java.lang.Thread.run(Thread.java:748)
      Caused by: java.nio.file.FileSystemException: /usr/share/apache-ignite/work/db/node00-f49af718-48da-4186-b664-62aca736bdc9/cache-SQL_PUBLIC_VERTEX_TBL/part-913.bin: Too many open files
              at sun.nio.fs.UnixException.translateToIOException(UnixException.java:91)
              at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
              at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
              at sun.nio.fs.UnixFileSystemProvider.newAsynchronousFileChannel(UnixFileSystemProvider.java:196)
              at java.nio.channels.AsynchronousFileChannel.open(AsynchronousFileChannel.java:248)
              at java.nio.channels.AsynchronousFileChannel.open(AsynchronousFileChannel.java:301)
              at org.apache.ignite.internal.processors.cache.persistence.file.AsyncFileIO.<init>(AsyncFileIO.java:57)
              at org.apache.ignite.internal.processors.cache.persistence.file.AsyncFileIOFactory.create(AsyncFileIOFactory.java:53)
              at org.apache.ignite.internal.processors.cache.persistence.file.FilePageStore.init(FilePageStore.java:416)
              ... 12 more
      

      This happens because the systemd unit file (/etc/systemd/system/apache-ignite@.service) does not contain

      LimitNOFILE=500000
      (possibly together with) LimitNPROC=500000
      

      see: https://fredrikaverpil.github.io/2016/04/27/systemd-and-resource-limits/
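      One way to confirm that a changed LimitNOFILE actually reached the running node is to read the limits Linux reports for the process in /proc/<pid>/limits. The following is a minimal sketch, not part of Ignite: the class name ProcLimits and its method are hypothetical, and it assumes the usual /proc layout in which columns are separated by runs of spaces.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

/** Hypothetical helper (not part of Ignite): reads the open file limits Linux reports for a process. */
final class ProcLimits {
    /**
     * Returns {soft, hard} taken from the "Max open files" line of /proc/<pid>/limits,
     * or null if no such line is present. Values stay strings because they may be "unlimited".
     */
    static String[] parseMaxOpenFiles(List<String> lines) {
        for (String line : lines) {
            if (line.startsWith("Max open files")) {
                // The limit name contains single spaces, so split only on runs of two or more.
                String[] cols = line.trim().split("\\s{2,}");
                if (cols.length >= 3)
                    return new String[] {cols[1], cols[2]};
            }
        }
        return null;
    }

    public static void main(String[] args) throws IOException {
        // Pass the PID of the Ignite JVM; without arguments the helper inspects itself.
        String pid = args.length > 0 ? args[0] : "self";
        String[] lim = parseMaxOpenFiles(Files.readAllLines(Paths.get("/proc", pid, "limits")));
        System.out.println(lim == null
            ? "Max open files: not found"
            : "Max open files: soft=" + lim[0] + ", hard=" + lim[1]);
    }
}
```

      If the soft and hard values printed for the node's PID are still the old ones after editing the unit file, the service was most likely not reloaded and restarted.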
      The installation script should possibly also add:

      • "fs.file-max = 2097152" to "/etc/sysctl.conf"
      • the following lines to /etc/security/limits.conf:
        *         hard    nofile      500000
        *         soft    nofile      500000
        root      hard    nofile      500000
        root      soft    nofile      500000
        

        see: https://easyengine.io/tutorials/linux/increase-open-files-limit
        It would also be great if the Ignite startup process checked the open file limit and printed a link to the documentation page when:
        1) persistence is enabled
        2) the limit is below some value (<=4096)
        3) the limit is below the total number of partitions on the current node
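        The check described above could look roughly like the following sketch. It is not existing Ignite code: the class OpenFileLimitCheck, its methods and the placeholder documentation link are hypothetical. It assumes the conditions combine as "persistence enabled, and either the limit is at or below the threshold or it is below the partition count", and it reads the limit through the JDK's com.sun.management.UnixOperatingSystemMXBean, which is only available on Unix-like JVMs.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

/** Hypothetical startup check (not part of Ignite). */
final class OpenFileLimitCheck {
    /** Threshold taken from the issue text ("<=4096"). */
    static final long MIN_RECOMMENDED_LIMIT = 4096;

    /**
     * Assumed combination of the three conditions: persistence must be enabled,
     * and the limit is either at or below the threshold or below the partition count.
     * A negative limit means "unknown" and never triggers the warning.
     */
    static boolean shouldWarn(boolean persistenceEnabled, long maxFds, long localPartitions) {
        if (!persistenceEnabled || maxFds < 0)
            return false;
        return maxFds <= MIN_RECOMMENDED_LIMIT || maxFds < localPartitions;
    }

    /** Returns the process limit on open file descriptors, or -1 if the JVM does not expose it. */
    static long maxFileDescriptors() {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        if (os instanceof com.sun.management.UnixOperatingSystemMXBean)
            return ((com.sun.management.UnixOperatingSystemMXBean) os).getMaxFileDescriptorCount();
        return -1;
    }

    public static void main(String[] args) {
        long limit = maxFileDescriptors();
        // Assumption for illustration: five caches with 1024 partitions each.
        long partitions = 5 * 1024;
        if (shouldWarn(true, limit, partitions))
            System.out.println("WARN: open file limit " + limit + " may be too low for "
                + partitions + " local partitions; see <documentation link>");
    }
}
```

        Keeping the decision in a pure function (shouldWarn) separates it from the platform-specific lookup, so it can be exercised without changing the real limits.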
        One more thing: if Ignite gets a "Too many open files" exception in the middle of rebalancing, the situation is terrible, because the whole cluster just stops working. This can happen if each node is close to its limit and:

      • someone creates an additional cache
      • the topology changes (a node is removed) and each remaining node gets more local partitions.
        Can we remember the limit on startup and check it each time we are about to create a local partition?
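        A minimal sketch of that idea follows. It is not existing Ignite code: FileDescriptorBudget, its methods and the reserve of 128 descriptors are hypothetical, and the example assumes that each new partition needs one additional file descriptor. The limit is captured once, and the current descriptor count is compared against it before new partition files are opened.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

/** Hypothetical guard (not part of Ignite): remembers the descriptor limit and checks headroom. */
final class FileDescriptorBudget {
    private final long limit;   // limit remembered at startup
    private final long reserve; // descriptors kept free for sockets, WAL, logs and so on

    FileDescriptorBudget(long limit, long reserve) {
        this.limit = limit;
        this.reserve = reserve;
    }

    /** True if opening {@code toOpen} more files still leaves the reserve untouched. */
    boolean canOpen(long openNow, long toOpen) {
        return openNow + toOpen + reserve <= limit;
    }

    /** Fails early with a clear message instead of hitting "Too many open files" mid-operation. */
    void ensureCanOpen(long openNow, long toOpen) {
        if (!canOpen(openNow, toOpen))
            throw new IllegalStateException("Not enough file descriptors: open=" + openNow
                + ", requested=" + toOpen + ", reserve=" + reserve + ", limit=" + limit);
    }

    public static void main(String[] args) {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        if (!(os instanceof com.sun.management.UnixOperatingSystemMXBean)) {
            System.out.println("Descriptor counts are not available on this JVM");
            return;
        }
        com.sun.management.UnixOperatingSystemMXBean unix = (com.sun.management.UnixOperatingSystemMXBean) os;

        // Remember the limit once; the reserve of 128 is an arbitrary illustrative value.
        FileDescriptorBudget budget = new FileDescriptorBudget(unix.getMaxFileDescriptorCount(), 128);

        // Before creating, for example, 1024 local partitions, check the current usage.
        long toOpen = 1024;
        System.out.println("Can open " + toOpen + " more files: "
            + budget.canOpen(unix.getOpenFileDescriptorCount(), toOpen));
    }
}
```

        Rejecting the cache creation or partition assignment up front would keep the failure local to one request rather than stopping the node during rebalancing.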

            People

            • Assignee:
              Unassigned
              Reporter:
              sbberkov Alexander Belyak

                Time Tracking

                Estimated: 4h
                Remaining: 4h
                Logged: Not Specified