Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 5.1, 6.0
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      The issue is, today this means listAll() is always slow, sometimes MUCH slower, because it must do the fstat()-equivalent of each file to check if its a directory to filter it out.

      When i benchmarked this on a fast filesystem, doing all these filesystem metadata calls only makes listAll() 2.6x slower, but on a non-ssd, slower i/o, it can be more than 60x slower.

      Lucene doesn't make subdirectories, so hiding these for abuse cases just makes real use cases slower.

      To add insult to injury, most code (e.g. all of lucene except for where RAMDir copies from an FSDir) does not actually care if extraneous files are directories or not.

      Finally it sucks the name is listAll() when it is doing anything but that.

      I really hate to add a method here to deal with this abusive stuff, but I'd rather add isDirectory(String) for the rare code that wants to filter out, than just let stuff always be slow.

      1. LUCENE-6241.patch
        9 kB
        Robert Muir
      2. LUCENE-6241.patch
        9 kB
        Robert Muir
      3. LUCENE-6241-alternative.patch
        16 kB
        Robert Muir

        Issue Links

          Activity

          Hide
          Ryan Ernst added a comment -

          +1

          Show
          Ryan Ernst added a comment - +1
          Hide
          Robert Muir added a comment -

          add missing ensureOpen.

          Show
          Robert Muir added a comment - add missing ensureOpen.
          Hide
          Michael McCandless added a comment -

          +1

          Must we add isDirectory? Can't abusers do this themselves? E.g. there is no way to make a sub directory in a RAMDirectory...

          Show
          Michael McCandless added a comment - +1 Must we add isDirectory? Can't abusers do this themselves? E.g. there is no way to make a sub directory in a RAMDirectory...
          Hide
          Robert Muir added a comment -

          I'd love to, but how will ramdirs copy-ctor work?

          Maybe it doesnt need to take Directory 'other' and can take FSDirectory 'other' ?

          Show
          Robert Muir added a comment - I'd love to, but how will ramdirs copy-ctor work? Maybe it doesnt need to take Directory 'other' and can take FSDirectory 'other' ?
          Hide
          Michael McCandless added a comment -

          but how will ramdirs copy-ctor work?

          I think RAMDir's copy ctor shouldn't handle this abuse case? If you are an abuser, you do your own copying to RAMDir ...

          Show
          Michael McCandless added a comment - but how will ramdirs copy-ctor work? I think RAMDir's copy ctor shouldn't handle this abuse case? If you are an abuser, you do your own copying to RAMDir ...
          Hide
          Robert Muir added a comment -

          Well, or its copy filters by CODEC_PATTERN, so it only copies in lucene index files and not extraneous junk. Thats an alternative here too.

          But yes, you get the problem, its only non-lucene usage and lucene in general does not care here, so i really hate adding any method to the Directory api.

          Show
          Robert Muir added a comment - Well, or its copy filters by CODEC_PATTERN, so it only copies in lucene index files and not extraneous junk. Thats an alternative here too. But yes, you get the problem, its only non-lucene usage and lucene in general does not care here, so i really hate adding any method to the Directory api.
          Hide
          Robert Muir added a comment -

          Another option would be to change this signature from RAMDirectory(Directory other) to RAMDirectory(FSDirectory other). FSDirectory already has 'Path getDirectory()' so RAMDirectory could use this directly, to ignore subdirectories.

          Show
          Robert Muir added a comment - Another option would be to change this signature from RAMDirectory(Directory other) to RAMDirectory(FSDirectory other). FSDirectory already has 'Path getDirectory()' so RAMDirectory could use this directly, to ignore subdirectories.
          Hide
          Michael McCandless added a comment -

          Another option would be to change this signature from RAMDirectory(Directory other) to RAMDirectory(FSDirectory other). FSDirectory already has 'Path getDirectory()' so RAMDirectory could use this directly, to ignore subdirectories.

          +1 to this, or to simply disallow the abuse case to RAMDir's copy ctor.

          This way we don't add APIs solely for abuse cases.

          Show
          Michael McCandless added a comment - Another option would be to change this signature from RAMDirectory(Directory other) to RAMDirectory(FSDirectory other). FSDirectory already has 'Path getDirectory()' so RAMDirectory could use this directly, to ignore subdirectories. +1 to this, or to simply disallow the abuse case to RAMDir's copy ctor. This way we don't add APIs solely for abuse cases.
          Hide
          Robert Muir added a comment -

          +1 to this, or to simply disallow the abuse case to RAMDir's copy ctor.

          What do you mean by disallow the abuse case? The user will get IOException if we don't do something, and the IOException will make it look like a lucene bug to them.

          Show
          Robert Muir added a comment - +1 to this, or to simply disallow the abuse case to RAMDir's copy ctor. What do you mean by disallow the abuse case? The user will get IOException if we don't do something, and the IOException will make it look like a lucene bug to them.
          Hide
          Michael McCandless added a comment -

          The user will get IOException if we don't do something, and the IOException will make it look like a lucene bug to them.

          Well, I think that's OK? The IOExc will specific abusing directory they had added to Lucene's index directory? Plus I think we are talking about very few actual occurrences of this: it's abusers who also use RAMDir's copy ctor.

          Maybe we should simply remove RAMDir's copy ctor? That method seems abusive/trappy to me too!

          Show
          Michael McCandless added a comment - The user will get IOException if we don't do something, and the IOException will make it look like a lucene bug to them. Well, I think that's OK? The IOExc will specific abusing directory they had added to Lucene's index directory? Plus I think we are talking about very few actual occurrences of this: it's abusers who also use RAMDir's copy ctor. Maybe we should simply remove RAMDir's copy ctor? That method seems abusive/trappy to me too!
          Hide
          Robert Muir added a comment -

          Just illustration: remove the isDirectory() call in my patch and run TestRAMDirectory, the exception would look like this:

          java.io.IOException: Is a directory: NIOFSIndexInput(path="/tmp/lucene.store.TestRAMDirectory F412F4510E9B7B4-001/testsubdir-001/subdir")
          	at __randomizedtesting.SeedInfo.seed([F412F4510E9B7B4:3E7CBDE940612694]:0)
          	at org.apache.lucene.store.NIOFSDirectory$NIOFSIndexInput.readInternal(NIOFSDirectory.java:190)
          	at org.apache.lucene.store.BufferedIndexInput.readBytes(BufferedIndexInput.java:160)
          	at org.apache.lucene.store.BufferedIndexInput.readBytes(BufferedIndexInput.java:116)
          	at org.apache.lucene.store.MockIndexInputWrapper.readBytes(MockIndexInputWrapper.java:140)
          	at org.apache.lucene.store.DataOutput.copyBytes(DataOutput.java:275)
          	at org.apache.lucene.store.Directory.copyFrom(Directory.java:156)
          	at org.apache.lucene.store.RAMDirectory.<init>(RAMDirectory.java:102)
          	at org.apache.lucene.store.RAMDirectory.<init>(RAMDirectory.java:96)
          	at org.apache.lucene.store.TestRAMDirectory.testCopySubdir(TestRAMDirectory.java:78)
          

          Its not something we can easily "upgrade" to a better exception.

          Show
          Robert Muir added a comment - Just illustration: remove the isDirectory() call in my patch and run TestRAMDirectory, the exception would look like this: java.io.IOException: Is a directory: NIOFSIndexInput(path="/tmp/lucene.store.TestRAMDirectory F412F4510E9B7B4-001/testsubdir-001/subdir") at __randomizedtesting.SeedInfo.seed([F412F4510E9B7B4:3E7CBDE940612694]:0) at org.apache.lucene.store.NIOFSDirectory$NIOFSIndexInput.readInternal(NIOFSDirectory.java:190) at org.apache.lucene.store.BufferedIndexInput.readBytes(BufferedIndexInput.java:160) at org.apache.lucene.store.BufferedIndexInput.readBytes(BufferedIndexInput.java:116) at org.apache.lucene.store.MockIndexInputWrapper.readBytes(MockIndexInputWrapper.java:140) at org.apache.lucene.store.DataOutput.copyBytes(DataOutput.java:275) at org.apache.lucene.store.Directory.copyFrom(Directory.java:156) at org.apache.lucene.store.RAMDirectory.<init>(RAMDirectory.java:102) at org.apache.lucene.store.RAMDirectory.<init>(RAMDirectory.java:96) at org.apache.lucene.store.TestRAMDirectory.testCopySubdir(TestRAMDirectory.java:78) Its not something we can easily "upgrade" to a better exception.
          Hide
          Robert Muir added a comment - - edited

          Maybe we should simply remove RAMDir's copy ctor? That method seems abusive/trappy to me too!

          Well, I think there is a real use case for this? Despite how bad it might perform, some people want to load their entire index into memory.

          I would greatly prefer if we had an option to mmap that called MappedByteBuffer.load() or something for these users, this one will even do the correct madvise() call and read one byte for each page and do this essentially as efficiently as the OS can. I think its a better solution for that use case.

          In general, for this issue, I wanted to avoid controversies or larger changes like this, my problem is one with operations on Directory API not being transparent, I want to make some progress on this and I think they should map directly to what is happening on the filesystem. Thats easiest to understand and the least trappy.

          • listAll() should just list files, not readdir() + fstat()
          • rename() should just rename() and not rename() + fsync(dir)
          • createOutput should just createOutput, not delete() + create()

          edit: i meant load() not force()

          Show
          Robert Muir added a comment - - edited Maybe we should simply remove RAMDir's copy ctor? That method seems abusive/trappy to me too! Well, I think there is a real use case for this? Despite how bad it might perform, some people want to load their entire index into memory. I would greatly prefer if we had an option to mmap that called MappedByteBuffer.load() or something for these users, this one will even do the correct madvise() call and read one byte for each page and do this essentially as efficiently as the OS can. I think its a better solution for that use case. In general, for this issue, I wanted to avoid controversies or larger changes like this, my problem is one with operations on Directory API not being transparent, I want to make some progress on this and I think they should map directly to what is happening on the filesystem. Thats easiest to understand and the least trappy. listAll() should just list files, not readdir() + fstat() rename() should just rename() and not rename() + fsync(dir) createOutput should just createOutput, not delete() + create() edit: i meant load() not force()
          Hide
          Michael McCandless added a comment -

          OK then +1 to keep RAMDir's copy ctor, have it take FSDirectory, have it check itself for directories and skip them.

          Show
          Michael McCandless added a comment - OK then +1 to keep RAMDir's copy ctor, have it take FSDirectory, have it check itself for directories and skip them.
          Hide
          Robert Muir added a comment -

          And for the record, the exception may change in future JDK versions. Its actually related to the fsync-directory issue... today you are allowed to open a directory as a file and you dont get exception until you try to read.

          Show
          Robert Muir added a comment - And for the record, the exception may change in future JDK versions. Its actually related to the fsync-directory issue... today you are allowed to open a directory as a file and you dont get exception until you try to read.
          Hide
          Robert Muir added a comment -

          OK then +1 to keep RAMDir's copy ctor, have it take FSDirectory, have it check itself for directories and skip them.

          OK, well thats the current patch. Do you see any problems with it specifically? We can followup with some of the other ideas here and maybe clean this up in the future too, thats definitely worth the effort to me.

          Show
          Robert Muir added a comment - OK then +1 to keep RAMDir's copy ctor, have it take FSDirectory, have it check itself for directories and skip them. OK, well thats the current patch. Do you see any problems with it specifically? We can followup with some of the other ideas here and maybe clean this up in the future too, thats definitely worth the effort to me.
          Hide
          Robert Muir added a comment -

          Here is the alternative FSDir patch. For lucene, this is fine.

          But I am hesitant, because I think the ability to explicitly check for a subdirectory might be needed, even though its not needed by lucene. My main concern is honestly .DS_Store and things like that. In this case its not really an abuse case, but the user is a victim.

          On the other hand, things needing to filter out trash can do this stuff cleanly already. For example lucene's replication module only replicates index files matching IndexFileNames.CODEC_PATTERN. This will also take care of windows thumbs.db or whatever too. So maybe this patch is fine, and this should be the recommended approach?

          Show
          Robert Muir added a comment - Here is the alternative FSDir patch. For lucene, this is fine. But I am hesitant, because I think the ability to explicitly check for a subdirectory might be needed, even though its not needed by lucene. My main concern is honestly .DS_Store and things like that. In this case its not really an abuse case, but the user is a victim. On the other hand, things needing to filter out trash can do this stuff cleanly already. For example lucene's replication module only replicates index files matching IndexFileNames.CODEC_PATTERN. This will also take care of windows thumbs.db or whatever too. So maybe this patch is fine, and this should be the recommended approach?
          Hide
          Michael McCandless added a comment -

          +1 for the latest patch.

          Show
          Michael McCandless added a comment - +1 for the latest patch.
          Hide
          ASF subversion and git services added a comment -

          Commit 1659621 from Robert Muir in branch 'dev/trunk'
          [ https://svn.apache.org/r1659621 ]

          LUCENE-6241: FSDirectory.listAll doesnt filter out subdirectories anymore

          Show
          ASF subversion and git services added a comment - Commit 1659621 from Robert Muir in branch 'dev/trunk' [ https://svn.apache.org/r1659621 ] LUCENE-6241 : FSDirectory.listAll doesnt filter out subdirectories anymore
          Hide
          Robert Muir added a comment -

          thanks Mike

          Show
          Robert Muir added a comment - thanks Mike
          Hide
          ASF subversion and git services added a comment -

          Commit 1659628 from Robert Muir in branch 'dev/branches/branch_5x'
          [ https://svn.apache.org/r1659628 ]

          LUCENE-6241: FSDirectory.listAll doesnt filter out subdirectories anymore

          Show
          ASF subversion and git services added a comment - Commit 1659628 from Robert Muir in branch 'dev/branches/branch_5x' [ https://svn.apache.org/r1659628 ] LUCENE-6241 : FSDirectory.listAll doesnt filter out subdirectories anymore
          Hide
          Timothy Potter added a comment -

          Bulk close after 5.1 release

          Show
          Timothy Potter added a comment - Bulk close after 5.1 release

            People

            • Assignee:
              Unassigned
              Reporter:
              Robert Muir
            • Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development