Uploaded image for project: 'Cassandra'
  1. Cassandra
  2. CASSANDRA-8308

Windows: Commitlog access violations on unit tests

    Details

      Description

      We have four unit tests failing on trunk on Windows, all with FileSystemException's related to the SchemaLoader:

      [junit] Test org.apache.cassandra.db.compaction.DateTieredCompactionStrategyTest FAILED
      [junit] Test org.apache.cassandra.cql3.ThriftCompatibilityTest FAILED
      [junit] Test org.apache.cassandra.io.sstable.SSTableRewriterTest FAILED
      [junit] Test org.apache.cassandra.repair.LocalSyncTaskTest FAILED
      

      Example error:

          [junit] Caused by: java.nio.file.FileSystemException: build\test\cassandra\commitlog;0\CommitLog-5-1415908745965.log: The process cannot access the file because it is being used by another process.
          [junit]
          [junit]     at sun.nio.fs.WindowsException.translateToIOException(WindowsException.java:86)
          [junit]     at sun.nio.fs.WindowsException.rethrowAsIOException(WindowsException.java:97)
          [junit]     at sun.nio.fs.WindowsException.rethrowAsIOException(WindowsException.java:102)
          [junit]     at sun.nio.fs.WindowsFileSystemProvider.implDelete(WindowsFileSystemProvider.java:269)
          [junit]     at sun.nio.fs.AbstractFileSystemProvider.delete(AbstractFileSystemProvider.java:103)
          [junit]     at java.nio.file.Files.delete(Files.java:1079)
          [junit]     at org.apache.cassandra.io.util.FileUtils.deleteWithConfirm(FileUtils.java:125)
      
      1. 8308_v1.txt
        7 kB
        Joshua McKenzie
      2. 8308_v2.txt
        8 kB
        Joshua McKenzie
      3. 8308_v3.txt
        7 kB
        Joshua McKenzie
      4. 8308-post-fix.txt
        2 kB
        Robert Stupp
      5. 8308-post-fix-3.txt
        23 kB
        Robert Stupp

        Activity

        Hide
        JoshuaMcKenzie Joshua McKenzie added a comment -

        This looks like a test-only artifact from schemaloader attempting to delete files that are opened as RAF in Windows which causes the error. As done in CASSANDRA-4050, converting the CommitLogSegment to nio would resolve this deletion sharing issue, however we rely on having an integer FD for trySkipCache calls; I'm not seeing an immediate method to get a FileDescriptor for a FileChannel. If I can track something down in that vein I'll pursue converting CommitLogSegment to nio so we can avoid potential future access violations. If not, refactoring the AbstractCommitLogService to allow stopping and restarting (rather than relying on the ctor only for functionality) should allow us to better test that component.

        Show
        JoshuaMcKenzie Joshua McKenzie added a comment - This looks like a test-only artifact from schemaloader attempting to delete files that are opened as RAF in Windows which causes the error. As done in CASSANDRA-4050 , converting the CommitLogSegment to nio would resolve this deletion sharing issue, however we rely on having an integer FD for trySkipCache calls; I'm not seeing an immediate method to get a FileDescriptor for a FileChannel. If I can track something down in that vein I'll pursue converting CommitLogSegment to nio so we can avoid potential future access violations. If not, refactoring the AbstractCommitLogService to allow stopping and restarting (rather than relying on the ctor only for functionality) should allow us to better test that component.
        Hide
        JoshuaMcKenzie Joshua McKenzie added a comment -
        1. Swapped to FileChannel on CommitLogSegment
        2. Use reflection to grab FD out of it to use in trySkipCache (new method in CLibrary)
        3. Added clibrary method for strerror
        4. Added logger warning on non-0 result from trySkipCache w/error message
        5. Tweaked logic on SchemaLoader to recursively delete contents of commitlog folder rather than the folder itself

        This fixes 3 of the 4 unit tests above; SSTableRewriterTest assumes that when a sstable is successfully deleted it won't be present in the filesystem, however on Windows when you delete a file with FILE_SHARE_DELETE and something else still has a handle to the file, it stays in its original location on the drive rather than getting removed and a link preserved in the /proc filesystem. I'll open another ticket to make the SSTableRewriter tests more multi-platform friendly.

        The warning on trySkipCache has uncovered that we have invalid fd references on our trySkipCache calls in SSTR.cloneWithNewStart - I haven't dug into it yet but it might be an ordering issue where a file's deleted before we posix_fadvise it.

        Show
        JoshuaMcKenzie Joshua McKenzie added a comment - Swapped to FileChannel on CommitLogSegment Use reflection to grab FD out of it to use in trySkipCache (new method in CLibrary) Added clibrary method for strerror Added logger warning on non-0 result from trySkipCache w/error message Tweaked logic on SchemaLoader to recursively delete contents of commitlog folder rather than the folder itself This fixes 3 of the 4 unit tests above; SSTableRewriterTest assumes that when a sstable is successfully deleted it won't be present in the filesystem, however on Windows when you delete a file with FILE_SHARE_DELETE and something else still has a handle to the file, it stays in its original location on the drive rather than getting removed and a link preserved in the /proc filesystem. I'll open another ticket to make the SSTableRewriter tests more multi-platform friendly. The warning on trySkipCache has uncovered that we have invalid fd references on our trySkipCache calls in SSTR.cloneWithNewStart - I haven't dug into it yet but it might be an ordering issue where a file's deleted before we posix_fadvise it.
        Hide
        JoshuaMcKenzie Joshua McKenzie added a comment -

        Benedict: any chance you'd take this review this since it's similar to CASSANDRA-4050 (though far less complicated)?

        Show
        JoshuaMcKenzie Joshua McKenzie added a comment - Benedict : any chance you'd take this review this since it's similar to CASSANDRA-4050 (though far less complicated)?
        Hide
        benedict Benedict added a comment -

        Sure. Will review tomorrow.

        Show
        benedict Benedict added a comment - Sure. Will review tomorrow.
        Hide
        benedict Benedict added a comment -
        • channel.truncate() is not equivalent to raf.setLength(), and we want the length to be set upfront to somewhat ensure contiguity
        • would be nice to extract the "is linux" decision to an enum, embed it in FBUtilities where we already have an isUnix() method (and an OPERATING_SYSTEM property, that could be converted to the enum)
        Show
        benedict Benedict added a comment - channel.truncate() is not equivalent to raf.setLength(), and we want the length to be set upfront to somewhat ensure contiguity would be nice to extract the "is linux" decision to an enum, embed it in FBUtilities where we already have an isUnix() method (and an OPERATING_SYSTEM property, that could be converted to the enum)
        Hide
        JoshuaMcKenzie Joshua McKenzie added a comment -

        v2 attached. Good catch on truncate on nio - I misread the javadoc on that and also assumed they were going for functional parity with RAF.setLength.

        I couldn't find an analogue to RAF.setLength in nio; rather than creating a single byte ByteBuffer, seeking to DD.getCommitLogSegmentSize(), writing that byte, and seeking back - I went ahead and just used RAF.setLength to get our size and then used the FileChannel API to map it later as it seems less prone to error and opening CLS isn't critical path. If there's a cleaner or more idiomatic way to do that in nio I'm all for it but I couldn't track it down.

        I also added another call to CommitLog.instance.resetUnsafe in SchemaLoader before we attempt to delete directories as it was failing to delete the memory-mapped files. Not sure why it worked in v1 but it definitely needs it now.

        Lastly - while I 100% agree the os determination needs to be tightened up (see CASSANDRA-8452), I'm not sure how that's related to this patch as none of the changes reference that.

        Show
        JoshuaMcKenzie Joshua McKenzie added a comment - v2 attached. Good catch on truncate on nio - I misread the javadoc on that and also assumed they were going for functional parity with RAF.setLength. I couldn't find an analogue to RAF.setLength in nio; rather than creating a single byte ByteBuffer, seeking to DD.getCommitLogSegmentSize(), writing that byte, and seeking back - I went ahead and just used RAF.setLength to get our size and then used the FileChannel API to map it later as it seems less prone to error and opening CLS isn't critical path. If there's a cleaner or more idiomatic way to do that in nio I'm all for it but I couldn't track it down. I also added another call to CommitLog.instance.resetUnsafe in SchemaLoader before we attempt to delete directories as it was failing to delete the memory-mapped files. Not sure why it worked in v1 but it definitely needs it now. Lastly - while I 100% agree the os determination needs to be tightened up (see CASSANDRA-8452 ), I'm not sure how that's related to this patch as none of the changes reference that.
        Hide
        benedict Benedict added a comment -

        I'm not sure how that's related to this patch

        My bad, misread the patch boundary,

        Since we're opening/closing an extra file, it might be worth only performing the action if the channel isn't the correct size, since it typically will be (so, open channel, if incorrect size close it, open raf, set length, reopen channel).

        I haven't tested the change to introduce strerror - are you confident of it, and have you tested it? Might be sensible to split into its own ticket.

        Otherwise LGTM

        Show
        benedict Benedict added a comment - I'm not sure how that's related to this patch My bad, misread the patch boundary, Since we're opening/closing an extra file, it might be worth only performing the action if the channel isn't the correct size, since it typically will be (so, open channel, if incorrect size close it, open raf, set length, reopen channel). I haven't tested the change to introduce strerror - are you confident of it, and have you tested it? Might be sensible to split into its own ticket. Otherwise LGTM
        Hide
        JoshuaMcKenzie Joshua McKenzie added a comment -

        v3 attached. Checked length of logFile and then RAF.setLength on it if it's incorrect rather than open / close / open on FileChannel (so we can keep the channel final). Also removed the strerror CLibrary code as that should be added separately.

        Show
        JoshuaMcKenzie Joshua McKenzie added a comment - v3 attached. Checked length of logFile and then RAF.setLength on it if it's incorrect rather than open / close / open on FileChannel (so we can keep the channel final). Also removed the strerror CLibrary code as that should be added separately.
        Hide
        benedict Benedict added a comment -

        +1

        Show
        benedict Benedict added a comment - +1
        Hide
        JoshuaMcKenzie Joshua McKenzie added a comment -

        committed

        Show
        JoshuaMcKenzie Joshua McKenzie added a comment - committed
        Hide
        snazy Robert Stupp added a comment - - edited

        The commit breaks a lot of unit tests - at least on my machine (14.0.0 Darwin Kernel Version 14.0.0) - with Java 7.

        ant test run with Java 7 is affected (tried some 7u65, 7u71)
        ant test run with Java 8 works.

        ant test fails with a lot of timeout exceptions. The first one is constantly this one:

            [junit] Testcase: org.apache.cassandra.config.DatabaseDescriptorTest:testTransKsMigration:	Caused an ERROR
            [junit] Timeout occurred. Please note the time in the report does not reflect the time until the timeout.
            [junit] junit.framework.AssertionFailedError: Timeout occurred. Please note the time in the report does not reflect the time until the timeout.
            [junit] 	at java.lang.Thread.run(Thread.java:745)
        

        system.log contains messages like this - which seems to be the reason for the unit test timeout:

        DEBUG [main] 2015-01-18 16:30:39,349 Syncing log with a batch window of 1.0
        INFO  [main] 2015-01-18 16:30:39,349 DiskAccessMode is mmap, indexAccessMode is mmap
        INFO  [main] 2015-01-18 16:30:39,360 Global memtable on-heap threshold is enabled at 227MB
        INFO  [main] 2015-01-18 16:30:39,360 Global memtable off-heap threshold is enabled at 227MB
        INFO  [main] 2015-01-18 16:30:39,372 Started the RoundRobin Request Scheduler
        INFO  [main] 2015-01-18 16:30:39,373 Loading settings from file:/Users/snazy/devel/cassandra/trunk/test/conf/cassandra.yaml
        INFO  [main] 2015-01-18 16:30:39,378 Node configuration:[cluster_name=Test Cluster; column_index_size_in_kb=4; commitlog_directory=build/test/cassandra/commitlog; commitlog_segment_size_in_mb=5; commitlog_sync=batch; commitlog_sync_batch_window_in_ms=1.0; compaction_throughput_mb_per_sec=0; concurrent_compactors=4; data_file_directories=[build/test/cassandra/data]; disk_access_mode=mmap; dynamic_snitch=true; endpoint_snitch=org.apache.cassandra.locator.SimpleSnitch; incremental_backups=true; listen_address=127.0.0.1; memtable_allocation_type=offheap_objects; native_transport_port=9042; partitioner=org.apache.cassandra.dht.ByteOrderedPartitioner; request_scheduler=org.apache.cassandra.scheduler.RoundRobinScheduler; request_scheduler_id=keyspace; rpc_port=9170; saved_caches_directory=build/test/cassandra/saved_caches; seed_provider=[{class_name=org.apache.cassandra.locator.SimpleSeedProvider, parameters=[{seeds=127.0.0.1}]}]; server_encryption_options=<REDACTED>; start_native_transport=true; storage_port=7010]
        DEBUG [COMMIT-LOG-ALLOCATOR] 2015-01-18 16:30:39,388 No segments in reserve; creating a fresh one
        DEBUG [main] 2015-01-18 16:30:39,400 Closing and clearing existing commit log segments...
        DEBUG [main] 2015-01-18 16:30:39,403 Deleting CommitLog-5-1421595039390.log
        DEBUG [main] 2015-01-18 16:30:39,417 Deleting backups
        DEBUG [main] 2015-01-18 16:30:39,418 Deleting schema_triggers-0359bc7171233ee19a4ab9dfb11fc125
        DEBUG [main] 2015-01-18 16:30:39,418 Deleting system
        DEBUG [main] 2015-01-18 16:30:39,418 Closing and clearing existing commit log segments...
        INFO  [COMMIT-LOG-ALLOCATOR] 2015-01-18 16:30:39,426 Overriding RING_DELAY to 1000ms
        ERROR [COMMIT-LOG-ALLOCATOR] 2015-01-18 16:30:39,517 Failed managing commit log segments. Commit disk failure policy is stop; terminating thread
        org.apache.cassandra.io.FSWriteError: java.nio.file.NoSuchFileException: build/test/cassandra/commitlog:0/CommitLog-5-1421595039390.log
                at org.apache.cassandra.db.commitlog.CommitLogSegment.<init>(CommitLogSegment.java:189) ~[main/:na]
                at org.apache.cassandra.db.commitlog.CommitLogSegment.freshSegment(CommitLogSegment.java:120) ~[main/:na]
                at org.apache.cassandra.db.commitlog.CommitLogSegmentManager$1.runMayThrow(CommitLogSegmentManager.java:119) ~[main/:na]
                at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) [main/:na]
                at java.lang.Thread.run(Thread.java:745) [na:1.7.0_71]
        Caused by: java.nio.file.NoSuchFileException: build/test/cassandra/commitlog:0/CommitLog-5-1421595039390.log
                at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86) ~[na:1.7.0_71]
                at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102) ~[na:1.7.0_71]
                at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107) ~[na:1.7.0_71]
                at sun.nio.fs.UnixFileSystemProvider.newFileChannel(UnixFileSystemProvider.java:177) ~[na:1.7.0_71]
                at java.nio.channels.FileChannel.open(FileChannel.java:287) ~[na:1.7.0_71]
                at java.nio.channels.FileChannel.open(FileChannel.java:334) ~[na:1.7.0_71]
                at org.apache.cassandra.db.commitlog.CommitLogSegment.<init>(CommitLogSegment.java:174) ~[main/:na]
                ... 4 common frames omitted
        

        EDIT: unit tests also fail with 8u25 (same exception in system.log) - just other tests

        Show
        snazy Robert Stupp added a comment - - edited The commit breaks a lot of unit tests - at least on my machine ( 14.0.0 Darwin Kernel Version 14.0.0 ) - with Java 7. ant test run with Java 7 is affected (tried some 7u65, 7u71) ant test run with Java 8 works . ant test fails with a lot of timeout exceptions. The first one is constantly this one: [junit] Testcase: org.apache.cassandra.config.DatabaseDescriptorTest:testTransKsMigration: Caused an ERROR [junit] Timeout occurred. Please note the time in the report does not reflect the time until the timeout. [junit] junit.framework.AssertionFailedError: Timeout occurred. Please note the time in the report does not reflect the time until the timeout. [junit] at java.lang.Thread.run(Thread.java:745) system.log contains messages like this - which seems to be the reason for the unit test timeout: DEBUG [main] 2015-01-18 16:30:39,349 Syncing log with a batch window of 1.0 INFO [main] 2015-01-18 16:30:39,349 DiskAccessMode is mmap, indexAccessMode is mmap INFO [main] 2015-01-18 16:30:39,360 Global memtable on-heap threshold is enabled at 227MB INFO [main] 2015-01-18 16:30:39,360 Global memtable off-heap threshold is enabled at 227MB INFO [main] 2015-01-18 16:30:39,372 Started the RoundRobin Request Scheduler INFO [main] 2015-01-18 16:30:39,373 Loading settings from file:/Users/snazy/devel/cassandra/trunk/test/conf/cassandra.yaml INFO [main] 2015-01-18 16:30:39,378 Node configuration:[cluster_name=Test Cluster; column_index_size_in_kb=4; commitlog_directory=build/test/cassandra/commitlog; commitlog_segment_size_in_mb=5; commitlog_sync=batch; commitlog_sync_batch_window_in_ms=1.0; compaction_throughput_mb_per_sec=0; concurrent_compactors=4; data_file_directories=[build/test/cassandra/data]; disk_access_mode=mmap; dynamic_snitch=true; endpoint_snitch=org.apache.cassandra.locator.SimpleSnitch; incremental_backups=true; listen_address=127.0.0.1; memtable_allocation_type=offheap_objects; native_transport_port=9042; partitioner=org.apache.cassandra.dht.ByteOrderedPartitioner; request_scheduler=org.apache.cassandra.scheduler.RoundRobinScheduler; request_scheduler_id=keyspace; rpc_port=9170; saved_caches_directory=build/test/cassandra/saved_caches; seed_provider=[{class_name=org.apache.cassandra.locator.SimpleSeedProvider, parameters=[{seeds=127.0.0.1}]}]; server_encryption_options=<REDACTED>; start_native_transport=true; storage_port=7010] DEBUG [COMMIT-LOG-ALLOCATOR] 2015-01-18 16:30:39,388 No segments in reserve; creating a fresh one DEBUG [main] 2015-01-18 16:30:39,400 Closing and clearing existing commit log segments... DEBUG [main] 2015-01-18 16:30:39,403 Deleting CommitLog-5-1421595039390.log DEBUG [main] 2015-01-18 16:30:39,417 Deleting backups DEBUG [main] 2015-01-18 16:30:39,418 Deleting schema_triggers-0359bc7171233ee19a4ab9dfb11fc125 DEBUG [main] 2015-01-18 16:30:39,418 Deleting system DEBUG [main] 2015-01-18 16:30:39,418 Closing and clearing existing commit log segments... INFO [COMMIT-LOG-ALLOCATOR] 2015-01-18 16:30:39,426 Overriding RING_DELAY to 1000ms ERROR [COMMIT-LOG-ALLOCATOR] 2015-01-18 16:30:39,517 Failed managing commit log segments. Commit disk failure policy is stop; terminating thread org.apache.cassandra.io.FSWriteError: java.nio.file.NoSuchFileException: build/test/cassandra/commitlog:0/CommitLog-5-1421595039390.log at org.apache.cassandra.db.commitlog.CommitLogSegment.<init>(CommitLogSegment.java:189) ~[main/:na] at org.apache.cassandra.db.commitlog.CommitLogSegment.freshSegment(CommitLogSegment.java:120) ~[main/:na] at org.apache.cassandra.db.commitlog.CommitLogSegmentManager$1.runMayThrow(CommitLogSegmentManager.java:119) ~[main/:na] at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) [main/:na] at java.lang.Thread.run(Thread.java:745) [na:1.7.0_71] Caused by: java.nio.file.NoSuchFileException: build/test/cassandra/commitlog:0/CommitLog-5-1421595039390.log at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86) ~[na:1.7.0_71] at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102) ~[na:1.7.0_71] at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107) ~[na:1.7.0_71] at sun.nio.fs.UnixFileSystemProvider.newFileChannel(UnixFileSystemProvider.java:177) ~[na:1.7.0_71] at java.nio.channels.FileChannel.open(FileChannel.java:287) ~[na:1.7.0_71] at java.nio.channels.FileChannel.open(FileChannel.java:334) ~[na:1.7.0_71] at org.apache.cassandra.db.commitlog.CommitLogSegment.<init>(CommitLogSegment.java:174) ~[main/:na] ... 4 common frames omitted EDIT: unit tests also fail with 8u25 (same exception in system.log ) - just other tests
        Hide
        snazy Robert Stupp added a comment -

        Some more information on this (using DatabaseDescriptorTest as an example):

        1. test methods testCFMetaDataSerialization + testKSMetaDataSerialization run fine
        2. test method testTransKsMigration calls SchemaLoader.cleanupAndLeaveDirs, which calls CommitLog.instance.resetUnsafe()
        3. CommitLog.instance.resetUnsafe() calls CommitLogSegmentManager.resetUnsafe, which waits until segmentManagementTasks is empty
        4. there is one task (guess its the one from CommitLogSegmentManager.wakeManager via advanceAllocatingFrom) in segmentManagementTasks
        5. but segmentManagementTasks cannot become empty, because COMMIT-LOG-ALLOCATOR is not running
        6. result is an infinite loop ; the test hangs and times out

        Removing the first CommitLog.instance.resetUnsafe(); in SchemaLoader.cleanupAndLeaveDirs fixes ant test - but I don't know what this means for Windows.

        Show
        snazy Robert Stupp added a comment - Some more information on this (using DatabaseDescriptorTest as an example): test methods testCFMetaDataSerialization + testKSMetaDataSerialization run fine test method testTransKsMigration calls SchemaLoader.cleanupAndLeaveDirs , which calls CommitLog.instance.resetUnsafe() CommitLog.instance.resetUnsafe() calls CommitLogSegmentManager.resetUnsafe , which waits until segmentManagementTasks is empty there is one task (guess its the one from CommitLogSegmentManager.wakeManager via advanceAllocatingFrom ) in segmentManagementTasks but segmentManagementTasks cannot become empty, because COMMIT-LOG-ALLOCATOR is not running result is an infinite loop ; the test hangs and times out Removing the first CommitLog.instance.resetUnsafe(); in SchemaLoader.cleanupAndLeaveDirs fixes ant test - but I don't know what this means for Windows.
        Hide
        JoshuaMcKenzie Joshua McKenzie added a comment -

        Unit tests are clean on both linux and windows on trunk on jdk7 on our CI environment, and have 2 failure on jdk8 on linux. Also, I'm running 7u71 locally on Windows and don't see these problems either.

        I'd expect we'd have heard more from other developers regarding this if it was a widespread issue so I'm inclined to suspect your environment before digging in and changing code. Have you done a new clone of the git repo and tried to test on a completely clean checkout?

        Even if you remove the 1st call to CommitLog.instance.resetUnsafe() in SchemaLoader.cleanupAndLeaveDirs, there's still a second call at the end of that method that should also hang if your theory as to why it's failing is correct.

        Is COMMIT-LOG-ALLOCATOR not running? Since CommitLog is a singleton, I would expect the allocator thread would be running since the managerThread is started in the constructor for CommitLogSegmentManager which is called in the singleton's constructor - unless I'm missing something.

        Show
        JoshuaMcKenzie Joshua McKenzie added a comment - Unit tests are clean on both linux and windows on trunk on jdk7 on our CI environment, and have 2 failure on jdk8 on linux. Also, I'm running 7u71 locally on Windows and don't see these problems either. I'd expect we'd have heard more from other developers regarding this if it was a widespread issue so I'm inclined to suspect your environment before digging in and changing code. Have you done a new clone of the git repo and tried to test on a completely clean checkout? Even if you remove the 1st call to CommitLog.instance.resetUnsafe() in SchemaLoader.cleanupAndLeaveDirs , there's still a second call at the end of that method that should also hang if your theory as to why it's failing is correct. Is COMMIT-LOG-ALLOCATOR not running? Since CommitLog is a singleton, I would expect the allocator thread would be running since the managerThread is started in the constructor for CommitLogSegmentManager which is called in the singleton's constructor - unless I'm missing something.
        Hide
        snazy Robert Stupp added a comment -

        I'll dig a bit deeper into this and try it on a Linux box and figure out why this happens (only to me and my Mac?).

        Show
        snazy Robert Stupp added a comment - I'll dig a bit deeper into this and try it on a Linux box and figure out why this happens (only to me and my Mac?).
        Hide
        snazy Robert Stupp added a comment -

        I tried on my Linux box and on another (5 yrs old) MBP: tests pass.
        Aleksey Yeschenko's Mac does also have no timeouts.

        Seems to be a pure local (mine) problem. Sorry for alerting for nothing.

        Show
        snazy Robert Stupp added a comment - I tried on my Linux box and on another (5 yrs old) MBP: tests pass. Aleksey Yeschenko 's Mac does also have no timeouts. Seems to be a pure local (mine) problem. Sorry for alerting for nothing.
        Hide
        snazy Robert Stupp added a comment -

        Joshua McKenzie cassci now also complains about TimeoutExceptions (http://cassci.datastax.com/job/trunk_utest/1353/testReport/ and various previous builds).

        I digged a bit more into it and have a patch that solves the TimeoutException-problem: https://gist.github.com/snazy/6f2f8203ca4ab7c19422 and a branch https://github.com/snazy/cassandra/tree/8308-post-fix
        The good: the problem does only affect unit tests - not the production code.

        Background:
        Sometimes CommitLogSegmentManager.resetUnsafe() produces a deadlock - when depends on 'random' - i.e. when COMMIT-LOG-ALLOCATOR thread works against data 'corrupted' by resetUnsafe. The patch of resetUnsafe just shuts down COMMIT-LOG-ALLOCATOR thread, does its work and restarts COMMIT-LOG-ALLOCATOR.

        Show
        snazy Robert Stupp added a comment - Joshua McKenzie cassci now also complains about TimeoutExceptions ( http://cassci.datastax.com/job/trunk_utest/1353/testReport/ and various previous builds). I digged a bit more into it and have a patch that solves the TimeoutException-problem: https://gist.github.com/snazy/6f2f8203ca4ab7c19422 and a branch https://github.com/snazy/cassandra/tree/8308-post-fix The good: the problem does only affect unit tests - not the production code. Background: Sometimes CommitLogSegmentManager.resetUnsafe() produces a deadlock - when depends on 'random' - i.e. when COMMIT-LOG-ALLOCATOR thread works against data 'corrupted' by resetUnsafe . The patch of resetUnsafe just shuts down COMMIT-LOG-ALLOCATOR thread, does its work and restarts COMMIT-LOG-ALLOCATOR .
        Hide
        JoshuaMcKenzie Joshua McKenzie added a comment -

        Changes on 8308-post-fix LGTM.

        Show
        JoshuaMcKenzie Joshua McKenzie added a comment - Changes on 8308-post-fix LGTM.
        Hide
        snazy Robert Stupp added a comment -

        Committed 8308-post-fix.txt as 3dd1bf10441a86a620976b479a66fe428acc2d66

        Show
        snazy Robert Stupp added a comment - Committed 8308-post-fix.txt as 3dd1bf10441a86a620976b479a66fe428acc2d66
        Hide
        snazy Robert Stupp added a comment -

        I'm really sorry to reopen this one again. My last patch accidentally made trunk-utest-win failing.

        Follow-up patch is here: https://gist.github.com/snazy/2494731cb7e2ed1e2479

        Show
        snazy Robert Stupp added a comment - I'm really sorry to reopen this one again. My last patch accidentally made trunk-utest-win failing. Follow-up patch is here: https://gist.github.com/snazy/2494731cb7e2ed1e2479
        Hide
        JoshuaMcKenzie Joshua McKenzie added a comment -

        Hm. Something is clearly different between our CI environment and my local win7 install, as both current trunk and your branch complete with no test errors.

        Having said that - your changes on post-fix-2 look like they should address the latest errors on trunk-utest-win and look benign. +1.

        Show
        JoshuaMcKenzie Joshua McKenzie added a comment - Hm. Something is clearly different between our CI environment and my local win7 install, as both current trunk and your branch complete with no test errors. Having said that - your changes on post-fix-2 look like they should address the latest errors on trunk-utest-win and look benign. +1.
        Hide
        snazy Robert Stupp added a comment -

        Okay - committed post-fix-2 as 0fa19b7ce791701231f888902ad7d981c17cb9f4

        Show
        snazy Robert Stupp added a comment - Okay - committed post-fix-2 as 0fa19b7ce791701231f888902ad7d981c17cb9f4
        Hide
        snazy Robert Stupp added a comment -

        clearly different between our CI environment and my local win7 install

        same for me - trunk_utest_win32 still failing (though less utest errors) and working fine in my Win7 VM

        Show
        snazy Robert Stupp added a comment - clearly different between our CI environment and my local win7 install same for me - trunk_utest_win32 still failing (though less utest errors) and working fine in my Win7 VM
        Hide
        snazy Robert Stupp added a comment -

        Hm - at the moment I've got no other idea than to try to delete each segment file in a for-try-wait loop.
        Made another one https://github.com/snazy/cassandra/commit/9d1cb6fc3a5a89480817434e55d515f4a37036dc

        Maybe there's something async going on here (which the new pray-n-error-patch could fix) or something with mmap cleaner is going weird (which would be a bigger problem). We could check the latter by setting disk_access_mode = standard.

        Show
        snazy Robert Stupp added a comment - Hm - at the moment I've got no other idea than to try to delete each segment file in a for-try-wait loop. Made another one https://github.com/snazy/cassandra/commit/9d1cb6fc3a5a89480817434e55d515f4a37036dc Maybe there's something async going on here (which the new pray-n-error-patch could fix) or something with mmap cleaner is going weird (which would be a bigger problem). We could check the latter by setting disk_access_mode = standard .
        Hide
        JoshuaMcKenzie Joshua McKenzie added a comment -

        I was able to reproduce the CI issues on my local machine once, so this isn't CI-specific.

        As per some offline discussion, I went ahead and threw together a version that integrates stopping and starting the AbstractCommitLogExecutorService into the mix in the off chance part of our failure is that the ACLE is futzing with files at the time we go to delete them. I did a full 10 runs of our unit tests on trunk on my local win7 install with 0 unit test failures.

        Robert Stupp: branch is here if you wouldn't mind taking a look (was your idea after all). Ryan McGuire: would it be possible to have our CI windows server do a run against that branch to see if it resolves the issues there? Our CI environment appears to be racing / failing on every run (as opposed to 1 out of 8 attempts to retry on my local this morning).

        Show
        JoshuaMcKenzie Joshua McKenzie added a comment - I was able to reproduce the CI issues on my local machine once, so this isn't CI-specific. As per some offline discussion, I went ahead and threw together a version that integrates stopping and starting the AbstractCommitLogExecutorService into the mix in the off chance part of our failure is that the ACLE is futzing with files at the time we go to delete them. I did a full 10 runs of our unit tests on trunk on my local win7 install with 0 unit test failures. Robert Stupp : branch is here if you wouldn't mind taking a look (was your idea after all ). Ryan McGuire : would it be possible to have our CI windows server do a run against that branch to see if it resolves the issues there? Our CI environment appears to be racing / failing on every run (as opposed to 1 out of 8 attempts to retry on my local this morning).
        Hide
        snazy Robert Stupp added a comment -

        Joshua McKenzie branch LGTM - except that there's a change in test/data/corrupt-sstables/Keyspace1-Standard3-jb-1-Summary.db
        I like the idea to test it from the branch on cassci

        Show
        snazy Robert Stupp added a comment - Joshua McKenzie branch LGTM - except that there's a change in test/data/corrupt-sstables/Keyspace1-Standard3-jb-1-Summary.db I like the idea to test it from the branch on cassci
        Hide
        JoshuaMcKenzie Joshua McKenzie added a comment -

        Reverted the test/data that snuck in - thanks for the heads up.

        Show
        JoshuaMcKenzie Joshua McKenzie added a comment - Reverted the test/data that snuck in - thanks for the heads up.
        Hide
        snazy Robert Stupp added a comment -

        I've setup a Jenkins instance against apache/trunk and Josh's branch. Both utests flap with TimeoutExceptions. Time's come for good-old LOGGER debugging...

        blame:off may I change the title to "pray-and-error-issue"?

        Show
        snazy Robert Stupp added a comment - I've setup a Jenkins instance against apache/trunk and Josh's branch. Both utests flap with TimeoutExceptions. Time's come for good-old LOGGER debugging... blame:off may I change the title to "pray-and-error-issue"?
        Hide
        snazy Robert Stupp added a comment - - edited

        I've taken Josh's diff from its branch and ensured that all unit tests pass.
        Additionally I've setup a Jenkins instance on Linux that runs utests against my branch every 30 minutes.
        Until now it made 27 builds without and failure/error. ("Only") two build failures during these runs: one-time flappy tests:

        • one TimeoutException in AggregationTest.testJavaAggregate
        • one NPE in LeveledCompactionStrategyTest.testValidationMultipleSSTablePerLevel

        But the TimeoutExceptions in these BeforeTest... methods are gone.
        I'll leave my Jenkins instance busy so we'll have a good expectation about the patch.
        Additionally I'd like to see it pass on a real Win box.

        Show
        snazy Robert Stupp added a comment - - edited I've taken Josh's diff from its branch and ensured that all unit tests pass. Additionally I've setup a Jenkins instance on Linux that runs utests against my branch every 30 minutes. Until now it made 27 builds without and failure/error. ("Only") two build failures during these runs: one-time flappy tests: one TimeoutException in AggregationTest.testJavaAggregate one NPE in LeveledCompactionStrategyTest.testValidationMultipleSSTablePerLevel But the TimeoutExceptions in these BeforeTest... methods are gone. I'll leave my Jenkins instance busy so we'll have a good expectation about the patch. Additionally I'd like to see it pass on a real Win box.
        Hide
        snazy Robert Stupp added a comment -

        Forgot one: also added a new MmapFileTest that basically ensures that mmap works the way it is used in CL including querying the appropriate JMX bean.

        Show
        snazy Robert Stupp added a comment - Forgot one: also added a new MmapFileTest that basically ensures that mmap works the way it is used in CL including querying the appropriate JMX bean.
        Hide
        JoshuaMcKenzie Joshua McKenzie added a comment -

        Feature branch with a couple of formatting nits applied here. Have job on cassci queued here.

        Show
        JoshuaMcKenzie Joshua McKenzie added a comment - Feature branch with a couple of formatting nits applied here . Have job on cassci queued here .
        Hide
        JoshuaMcKenzie Joshua McKenzie added a comment -

        Got Philip to run it against a Windows CI (thanks) - results are here. Looks like it's resolved the issue however before we commit I have a couple of questions and one concern:

        Why 10 retries and why 500ms sleep between retries on closeAndDeleteSegmentUnsafe (vs. 5 and 10ms, for instance, or some other arbitrary number)? The "keep trying to delete it until it works or eventually we time out" smells a lot like a micro-version of SSTableDeletingTask for CLS where we're just spinning locally instead of scheduling off GC. More limited in scope (also only applies to tests) but equally hackish.

        This may very well be the best we can do on Windows if we're in an environment where filesystem drivers can't be trusted to consistently stay out of our way and mangle up our deleteRecursive calls (which is looking more and more like the case) - but I think it's best we have a clear view of the pig before we put lipstick on it and call it a day.

        Show
        JoshuaMcKenzie Joshua McKenzie added a comment - Got Philip to run it against a Windows CI (thanks) - results are here . Looks like it's resolved the issue however before we commit I have a couple of questions and one concern: Why 10 retries and why 500ms sleep between retries on closeAndDeleteSegmentUnsafe (vs. 5 and 10ms, for instance, or some other arbitrary number)? The "keep trying to delete it until it works or eventually we time out" smells a lot like a micro-version of SSTableDeletingTask for CLS where we're just spinning locally instead of scheduling off GC. More limited in scope (also only applies to tests) but equally hackish. This may very well be the best we can do on Windows if we're in an environment where filesystem drivers can't be trusted to consistently stay out of our way and mangle up our deleteRecursive calls (which is looking more and more like the case) - but I think it's best we have a clear view of the pig before we put lipstick on it and call it a day.
        Hide
        snazy Robert Stupp added a comment -

        You're right, the retry-sleep-loop (including the randomly chosen 500ms sleep) would hide the problem. So, yup - we can remove it. I didn't see it spin there on OSX or Linux.

        +1 on getting a view of the pig and fix it where it occurs (and aims to occur in production?).

        Show
        snazy Robert Stupp added a comment - You're right, the retry-sleep-loop (including the randomly chosen 500ms sleep) would hide the problem. So, yup - we can remove it. I didn't see it spin there on OSX or Linux. +1 on getting a view of the pig and fix it where it occurs (and aims to occur in production?).
        Hide
        snazy Robert Stupp added a comment -

        Update from "my Jenkins":

        • nearly 100 builds
        • one TimeoutException in AggregationTest.testJavaAggregate
        • 3 failed with NPE in LeveledCompactionStrategyTest.testValidationMultipleSSTablePerLevel
        • 2 failed with AssertionError in LeveledCompactionStrategyTest.testValidationMultipleSSTablePerLevel

        My interpretation: LeveledCompactionStrategyTest.testValidationMultipleSSTablePerLevel needs some investigation but the flappy BeforeTest-TimeoutException is finally fixed.

        Note: scheduling GC should not be necessary if the Cleaner for DirectByteBuffer is working - the unit test MmapFileTest asserts that a mmap'd file can be deleted without GC.

        Show
        snazy Robert Stupp added a comment - Update from "my Jenkins": nearly 100 builds one TimeoutException in AggregationTest.testJavaAggregate 3 failed with NPE in LeveledCompactionStrategyTest.testValidationMultipleSSTablePerLevel 2 failed with AssertionError in LeveledCompactionStrategyTest.testValidationMultipleSSTablePerLevel My interpretation: LeveledCompactionStrategyTest.testValidationMultipleSSTablePerLevel needs some investigation but the flappy BeforeTest-TimeoutException is finally fixed. Note: scheduling GC should not be necessary if the Cleaner for DirectByteBuffer is working - the unit test MmapFileTest asserts that a mmap'd file can be deleted without GC.
        Hide
        JoshuaMcKenzie Joshua McKenzie added a comment -

        So to clarify from our earlier conversation regarding the retry-sleep-loop - are you seeing it spin there on Windows? I.e. is the success of the unit tests running within CI contingent upon retrying deletion of those files?

        Show
        JoshuaMcKenzie Joshua McKenzie added a comment - So to clarify from our earlier conversation regarding the retry-sleep-loop - are you seeing it spin there on Windows? I.e. is the success of the unit tests running within CI contingent upon retrying deletion of those files?
        Hide
        snazy Robert Stupp added a comment -

        I've no real Windows box - just a (slow) VM.
        I suppose the success on is not related to the try-spin-loop - but would like to it without the loop in CI environment.

        Show
        snazy Robert Stupp added a comment - I've no real Windows box - just a (slow) VM. I suppose the success on is not related to the try-spin-loop - but would like to it without the loop in CI environment.
        Hide
        JoshuaMcKenzie Joshua McKenzie added a comment -

        Philip Thompson: Could you do another run on cassci against this branch? Has the try/spin loop on segment deletion removed.

        Show
        JoshuaMcKenzie Joshua McKenzie added a comment - Philip Thompson : Could you do another run on cassci against this branch ? Has the try/spin loop on segment deletion removed.
        Hide
        philipthompson Philip Thompson added a comment -

        http://cassci.datastax.com/job/8308_trunk_utest_win32/2/testReport/
        http://cassci.datastax.com/job/8308_trunk_utest_win32/3/testReport/

        I ran it twice. Inconsistent results, but removing the try/spin loop didn't meaningfully affect results.

        Show
        philipthompson Philip Thompson added a comment - http://cassci.datastax.com/job/8308_trunk_utest_win32/2/testReport/ http://cassci.datastax.com/job/8308_trunk_utest_win32/3/testReport/ I ran it twice. Inconsistent results, but removing the try/spin loop didn't meaningfully affect results.
        Hide
        JoshuaMcKenzie Joshua McKenzie added a comment -

        Committed v3. Now to keep an eye on Windows CI...

        Show
        JoshuaMcKenzie Joshua McKenzie added a comment - Committed v3. Now to keep an eye on Windows CI...
        Hide
        snazy Robert Stupp added a comment -

        I'll keep my fingers crossed

        Show
        snazy Robert Stupp added a comment - I'll keep my fingers crossed

          People

          • Assignee:
            JoshuaMcKenzie Joshua McKenzie
            Reporter:
            JoshuaMcKenzie Joshua McKenzie
            Reviewer:
            Benedict
          • Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development