Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.0-BETA, 6.0
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Carl Austin raised a good issue in a comment on my Lucene 4.0.0 alpha blog post: http://blog.mikemccandless.com/2012/07/lucene-400-alpha-at-long-last.html

      IndexWriter will now (as of 4.0) delete all foreign files from the index directory. We made this change because Codecs are free to write to any files now, so the space of filenames is hard to "bound".

      But if the user accidentally uses the wrong directory (eg c:/) then we will in fact delete important stuff.

      I think we can at least use some simple criteria (must start with _, maybe must fit certain pattern eg _<base36>(_X).Y), so we are much less likely to delete a non-Lucene file....

      1. LUCENE-4190.patch
        12 kB
        Robert Muir
      2. LUCENE-4190.patch
        5 kB
        Robert Muir
      3. LUCENE-4190.patch
        5 kB
        Robert Muir
      4. LUCENE-4190.patch
        5 kB
        Robert Muir

        Activity

        Hide
        Michael McCandless added a comment -

        History repeats itself: LUCENE-385.

        Show
        Michael McCandless added a comment - History repeats itself: LUCENE-385 .
        Hide
        Robert Muir added a comment -

        I dont think we should do anything beyond 'must start with _'

        Otherwise this file handling gets complicated again, I don't want to see that!

        In all cases, if we start enforcing what the file names for lucene-files must be,
        then when we call SegmentInfo.setFiles, we need an assert that they
        all in fact actually match this pattern.

        Show
        Robert Muir added a comment - I dont think we should do anything beyond 'must start with _' Otherwise this file handling gets complicated again, I don't want to see that! In all cases, if we start enforcing what the file names for lucene-files must be, then when we call SegmentInfo.setFiles, we need an assert that they all in fact actually match this pattern.
        Hide
        Shai Erera added a comment -

        I also raised an eyebrow when I read this comment. Many of the lucene+facet deployments that I know of store the taxonomy index as a sub-directory of the search index. Also, we've been storing other files in the index directory too ... this new feature will affect such existing deployments.

        I think it makes sense to change IW behavior to only delete files that start with _. It's a reasonable requirement IMO.

        While I don't know the nature of this change, I can assume it's related to IW not knowing which files to delete when a segment is no longer needed, because Codecs can pick their own file names. If we had an instance which kept track of all files that were created, e.g. every Codec would register the files there (if it wants to protect from their deletion), would make the decision of which files to delete easier?

        Show
        Shai Erera added a comment - I also raised an eyebrow when I read this comment. Many of the lucene+facet deployments that I know of store the taxonomy index as a sub-directory of the search index. Also, we've been storing other files in the index directory too ... this new feature will affect such existing deployments. I think it makes sense to change IW behavior to only delete files that start with _. It's a reasonable requirement IMO. While I don't know the nature of this change, I can assume it's related to IW not knowing which files to delete when a segment is no longer needed, because Codecs can pick their own file names. If we had an instance which kept track of all files that were created, e.g. every Codec would register the files there (if it wants to protect from their deletion), would make the decision of which files to delete easier?
        Hide
        Robert Muir added a comment -

        We track the files, but what if your computer crashes during flush, how would we know to erase the broken files.

        Lets keep it simple with _

        Show
        Robert Muir added a comment - We track the files, but what if your computer crashes during flush, how would we know to erase the broken files. Lets keep it simple with _
        Hide
        Robert Muir added a comment -

        patch is against branch_4x

        Show
        Robert Muir added a comment - patch is against branch_4x
        Hide
        Michael McCandless added a comment -

        Many of the lucene+facet deployments that I know of store the taxonomy index as a sub-directory of the search index

        We won't delete directories, just files.

        Also, we've been storing other files in the index directory too ... this new feature will affect such existing deployments.

        Yeah ... better to move them elsewhere or to a sub dir?

        I can assume it's related to IW not knowing which files to delete when a segment is no longer needed, because Codecs can pick their own file names

        Right: it's easy to track the positive set (files referenced by current segments), what's harder is the negative set (files created in the past but no longer referenced).

        If we had an instance which kept track of all files that were created, e.g. every Codec would register the files there (if it wants to protect from their deletion), would make the decision of which files to delete easier?

        In theory it would ... but this would add a fair amount of complexity (we'd have to save this list of files into segments_N). In fact long ago Lucene did this (it had a deletable file which stored the list of files previously created and now to-be-deleted).

        Show
        Michael McCandless added a comment - Many of the lucene+facet deployments that I know of store the taxonomy index as a sub-directory of the search index We won't delete directories, just files. Also, we've been storing other files in the index directory too ... this new feature will affect such existing deployments. Yeah ... better to move them elsewhere or to a sub dir? I can assume it's related to IW not knowing which files to delete when a segment is no longer needed, because Codecs can pick their own file names Right: it's easy to track the positive set (files referenced by current segments), what's harder is the negative set (files created in the past but no longer referenced). If we had an instance which kept track of all files that were created, e.g. every Codec would register the files there (if it wants to protect from their deletion), would make the decision of which files to delete easier? In theory it would ... but this would add a fair amount of complexity (we'd have to save this list of files into segments_N). In fact long ago Lucene did this (it had a deletable file which stored the list of files previously created and now to-be-deleted).
        Hide
        Andi Vajda added a comment -

        I think that the way to "bound" the namespace of files is to put everything in a subdirectory of the index directory chosen by the user and control the name of that subdirectory, making it clear that this is semi-private to Lucene and that all files in that subdirectory are fair game.

        Show
        Andi Vajda added a comment - I think that the way to "bound" the namespace of files is to put everything in a subdirectory of the index directory chosen by the user and control the name of that subdirectory, making it clear that this is semi-private to Lucene and that all files in that subdirectory are fair game.
        Hide
        Shai Erera added a comment -

        Let's keep it simple, I agree.

        Show
        Shai Erera added a comment - Let's keep it simple, I agree.
        Hide
        Robert Muir added a comment -

        We can maybe improve in the future besides the _ check, but I just think this is an easy improvement that will prevent most of the problems.

        I just think in this specific case, its critical to balance the complexity of our code against any improvement here, because there can always be a conflict no matter what we do. _ is a nice 80/20 that is simple to understand

        Show
        Robert Muir added a comment - We can maybe improve in the future besides the _ check, but I just think this is an easy improvement that will prevent most of the problems. I just think in this specific case, its critical to balance the complexity of our code against any improvement here, because there can always be a conflict no matter what we do. _ is a nice 80/20 that is simple to understand
        Hide
        Uwe Schindler added a comment -

        Robert: assertSaneFiles is called directly, so the loop is always executed.

        Should look like:

        assert assertSaneFiles();
        Show
        Uwe Schindler added a comment - Robert: assertSaneFiles is called directly, so the loop is always executed. Should look like: assert assertSaneFiles();
        Hide
        Robert Muir added a comment -

        nice catch, thanks for reviewing Uwe!

        Show
        Robert Muir added a comment - nice catch, thanks for reviewing Uwe!
        Hide
        Michael McCandless added a comment -

        Patch looks good!

        Show
        Michael McCandless added a comment - Patch looks good!
        Hide
        Hoss Man added a comment -

        I think that the way to "bound" the namespace of files is to put everything in a subdirectory of the index directory chosen by the user and control the name of that subdirectory, making it clear that this is semi-private to Lucene and that all files in that subdirectory are fair game.

        isn't that in theory already the point of the index directory anyway? how far down the rabit hole are we going to go?

        We won't delete directories, just files.

        One sanity check: this may be an orthoginal issue, but is there anything stoping a codec from using subdirectories? what if i have a codec that creates "_mycodec/foo" and "_mycodec/bar" ... will those not get cleaned up?

        Show
        Hoss Man added a comment - I think that the way to "bound" the namespace of files is to put everything in a subdirectory of the index directory chosen by the user and control the name of that subdirectory, making it clear that this is semi-private to Lucene and that all files in that subdirectory are fair game. isn't that in theory already the point of the index directory anyway? how far down the rabit hole are we going to go? We won't delete directories, just files. One sanity check: this may be an orthoginal issue, but is there anything stoping a codec from using subdirectories? what if i have a codec that creates "_mycodec/foo" and "_mycodec/bar" ... will those not get cleaned up?
        Hide
        Robert Muir added a comment -

        One sanity check: this may be an orthoginal issue, but is there anything stoping a codec from using subdirectories

        Yes: the fact that Directory abstraction doesnt have a way of dealing with subdirectories. you can only openInput(String) and createOutput(String)

        Show
        Robert Muir added a comment - One sanity check: this may be an orthoginal issue, but is there anything stoping a codec from using subdirectories Yes: the fact that Directory abstraction doesnt have a way of dealing with subdirectories. you can only openInput(String) and createOutput(String)
        Hide
        Yonik Seeley added a comment -

        isn't that in theory already the point of the index directory anyway?

        Solr has always supported certain files being in the index directory (elevation and external file field IIRC).

        Show
        Yonik Seeley added a comment - isn't that in theory already the point of the index directory anyway? Solr has always supported certain files being in the index directory (elevation and external file field IIRC).
        Hide
        Robert Muir added a comment -

        Then its probably a good time to put those somewhere else.

        the index directory is for the index.

        with flexible indexing, filenames are not set in stone.

        despite this issue, those files could get deleted by IndexFileDeleter (especially if they start with the name "segments" or an underscore)

        Show
        Robert Muir added a comment - Then its probably a good time to put those somewhere else. the index directory is for the index. with flexible indexing, filenames are not set in stone. despite this issue, those files could get deleted by IndexFileDeleter (especially if they start with the name "segments" or an underscore)
        Hide
        Yonik Seeley added a comment -

        How about we keep it simple and practical over pedantic and just go with the underscore.

        Show
        Yonik Seeley added a comment - How about we keep it simple and practical over pedantic and just go with the underscore.
        Hide
        Robert Muir added a comment -

        I'm not trying to be pedantic: I don't want the crazy files() handling we had before: its too much.

        Thats why i took this issue (and already committed, see the commits list) to only delete files that start with underscores.

        I'm just mentioning that in general its still not really safe to put files in the index directory: this is just a simple solution that doesnt bring back the complicated files() handling we had that nobody really understood.

        Show
        Robert Muir added a comment - I'm not trying to be pedantic: I don't want the crazy files() handling we had before: its too much. Thats why i took this issue (and already committed, see the commits list) to only delete files that start with underscores. I'm just mentioning that in general its still not really safe to put files in the index directory: this is just a simple solution that doesnt bring back the complicated files() handling we had that nobody really understood.
        Hide
        Andi Vajda added a comment -

        If Joe user gives c:\ as their index directory, which is silly, sure, it's even worse to just delete all files in there.
        Even if you just delete files there that are prefixed with _, we should know better than that. By putting the files we want to control into their own directory, a subdirectory of the Lucene index directory, there is very little room for mistakes.
        _ is just not a namespace for files reserved to Lucene, but a sub-directory chosen by Lucene instead is.
        If you persist in picking just , why not picking _90439043 to make it at least more unique ?

        Show
        Andi Vajda added a comment - If Joe user gives c:\ as their index directory, which is silly, sure, it's even worse to just delete all files in there. Even if you just delete files there that are prefixed with _, we should know better than that. By putting the files we want to control into their own directory, a subdirectory of the Lucene index directory, there is very little room for mistakes. _ is just not a namespace for files reserved to Lucene, but a sub-directory chosen by Lucene instead is. If you persist in picking just , why not picking _90439043 to make it at least more unique ?
        Hide
        Robert Muir added a comment -

        now I'm ready to be pedantic: I refuse to let file handling get complicated for this stuff. its not important and we are trying to make a search engine not a file manager.

        tomorrow, ill revert my commit and go back to deleting all files as I was totally happy with the previous situation: just don't put stuff in the lucene index directory.

        Show
        Robert Muir added a comment - now I'm ready to be pedantic: I refuse to let file handling get complicated for this stuff. its not important and we are trying to make a search engine not a file manager. tomorrow, ill revert my commit and go back to deleting all files as I was totally happy with the previous situation: just don't put stuff in the lucene index directory.
        Hide
        Yonik Seeley added a comment -

        Thats why i took this issue (and already committed, see the commits list)

        Hmmm, I don't see any commits under this issue. But searching the mailing list, I do see that you correctly used the issue name in the commit log - so I guess we can chalk it up to the recent ASF infra flakiness.

        tomorrow, ill revert my commit and go back to deleting all files

        -1 to reverting, we've made progress!

        Show
        Yonik Seeley added a comment - Thats why i took this issue (and already committed, see the commits list) Hmmm, I don't see any commits under this issue. But searching the mailing list, I do see that you correctly used the issue name in the commit log - so I guess we can chalk it up to the recent ASF infra flakiness. tomorrow, ill revert my commit and go back to deleting all files -1 to reverting, we've made progress!
        Hide
        Robert Muir added a comment -

        I still plan to revert tomorrow. I think the comments here are a bad sign... I sent us down an unfortunate slippery slope

        we are a search engine library!

        don't put shit in the index directory! or we will delete your shit. dead simple.

        Show
        Robert Muir added a comment - I still plan to revert tomorrow. I think the comments here are a bad sign... I sent us down an unfortunate slippery slope we are a search engine library! don't put shit in the index directory! or we will delete your shit. dead simple.
        Hide
        Yonik Seeley added a comment -

        I still plan to revert tomorrow.

        -1

        The now current behavior is more consistent with the legacy behavior, and everyone else seems to agree it's an improvement (although some feel we should go further). Reverting this bug fix without a better fix makes no sense.

        Show
        Yonik Seeley added a comment - I still plan to revert tomorrow. -1 The now current behavior is more consistent with the legacy behavior, and everyone else seems to agree it's an improvement (although some feel we should go further). Reverting this bug fix without a better fix makes no sense.
        Hide
        Robert Muir added a comment -

        pretty sure I have the right to revert my own commit.

        I can declare the licensing of asl2 as a mistake and instead full gpl if we want to press the point?

        Show
        Robert Muir added a comment - pretty sure I have the right to revert my own commit. I can declare the licensing of asl2 as a mistake and instead full gpl if we want to press the point?
        Hide
        Shai Erera added a comment -

        What if we had an object called IndexFileNames with a method accept(String name), that returns true if the file is recognized, false otherwise - that could give applications a way to create a recognized-set of index files:

        • Lucene would provide a DefaultIndexFileNames which recognizes all non-codec files
        • Either the app would provide an extension to the default (or a wrapper) which recognizes its codec files as well
          • Or, we make the Codec responsible for recognizing files too, and then the code would just query the Codec for non-default index files.

        Either way, it seems like we can very easily recognize what are index files and what aren't.

        When files need to be deleted, it seems simple as well:

        • Lucene lists all files in the directory
        • Any file that is referenced by the index (I assume we still know which files are needed right?) is kept
        • Any other file is queried against IndexFileNames.accept and if it is accepted, it's deleted, otherwise it's left alone.

        Since this looks too simple to me, I'm assuming that I'm missing something. If so, can someone please clarify the problem to me?

        Show
        Shai Erera added a comment - What if we had an object called IndexFileNames with a method accept(String name), that returns true if the file is recognized, false otherwise - that could give applications a way to create a recognized-set of index files: Lucene would provide a DefaultIndexFileNames which recognizes all non-codec files Either the app would provide an extension to the default (or a wrapper) which recognizes its codec files as well Or, we make the Codec responsible for recognizing files too, and then the code would just query the Codec for non-default index files. Either way, it seems like we can very easily recognize what are index files and what aren't. When files need to be deleted, it seems simple as well: Lucene lists all files in the directory Any file that is referenced by the index (I assume we still know which files are needed right?) is kept Any other file is queried against IndexFileNames.accept and if it is accepted, it's deleted, otherwise it's left alone. Since this looks too simple to me, I'm assuming that I'm missing something. If so, can someone please clarify the problem to me?
        Hide
        Gilad Barkai added a comment -

        Perhaps out of context, but here goes..
        Users sometimes do stupid things, me included, such as putting the index in a non-dedicated-directory. But should they pay the penalty just because the code should not get overly complicated?

        Codecs create their own files, and no one seems able to control what files they create (other than in assert?); Than, is it possible for the codec to handle the removal of the files it created?

        That would make codecs work the same way the Index handles the 'core' index files - each codec will be able to erase its own.
        Another closely related option - let IW consult with the codecs about 'non-core-files' and see which one should/could be removed.

        I only suggest this because I fear for users' files which might get erased.

        Disclosure:
        It'll be ages before I understand Lucene 4 half as much as I do Lucene 3.6 (not that that's much), so forgive me if I stepped on anyone's toes, or just described how to implement a time machine

        Show
        Gilad Barkai added a comment - Perhaps out of context, but here goes.. Users sometimes do stupid things, me included, such as putting the index in a non-dedicated-directory. But should they pay the penalty just because the code should not get overly complicated? Codecs create their own files, and no one seems able to control what files they create (other than in assert?); Than, is it possible for the codec to handle the removal of the files it created? That would make codecs work the same way the Index handles the 'core' index files - each codec will be able to erase its own. Another closely related option - let IW consult with the codecs about 'non-core-files' and see which one should/could be removed. I only suggest this because I fear for users' files which might get erased. Disclosure: It'll be ages before I understand Lucene 4 half as much as I do Lucene 3.6 (not that that's much), so forgive me if I stepped on anyone's toes, or just described how to implement a time machine
        Hide
        selckin added a comment - - edited

        <edit: remove unhelpful sarcasm, sorry>

        Show
        selckin added a comment - - edited <edit: remove unhelpful sarcasm, sorry>
        Hide
        Robert Muir added a comment -

        Again: I am totally against complicated file handling here for this reason. People can handle this some other way in their apps.
        We HAVE to keep this kind of code simple and maintainable in lucene.

        It was a mistake to start a slippery slope by being friendly at all to this. (I reverted)

        Show
        Robert Muir added a comment - Again: I am totally against complicated file handling here for this reason. People can handle this some other way in their apps. We HAVE to keep this kind of code simple and maintainable in lucene. It was a mistake to start a slippery slope by being friendly at all to this. (I reverted)
        Hide
        Carl Austin added a comment -

        I was the original commenter on the blog about this issue and have previously experienced the deletion of all files on a drive because of the exact same restriction - the fallout from this is massive.

        The issue here is that many people who use lucene will not realise that this can happen, and this situation will occur sooner or later. You can't expect that every developer who uses lucene will understand every in and out, read every bit of javadoc fully or every release change note. Look at the number of posts to the mailing list that are just people who haven't fully read or understood something. I firmly believe that this has to be handled by the library such that a simple mistake or misunderstanding by a developer does not lead to the loss of important files.

        Show
        Carl Austin added a comment - I was the original commenter on the blog about this issue and have previously experienced the deletion of all files on a drive because of the exact same restriction - the fallout from this is massive. The issue here is that many people who use lucene will not realise that this can happen, and this situation will occur sooner or later. You can't expect that every developer who uses lucene will understand every in and out, read every bit of javadoc fully or every release change note. Look at the number of posts to the mailing list that are just people who haven't fully read or understood something. I firmly believe that this has to be handled by the library such that a simple mistake or misunderstanding by a developer does not lead to the loss of important files.
        Hide
        Shai Erera added a comment -

        We HAVE to keep this kind of code simple and maintainable in lucene.

        Why? We write lots of other code that prevents users from shooting themselves in the legs, so why make an exception here? Just because a code might get complicated doesn't mean we don't need to write it.

        While I agree with you that Lucene is not a File manager, I think it'd be good if we can cleanup after ourselves rather than delete everything that we don't recognize.

        Since you're more familiar than me with the 4.0 internals, can you please relate to the simple proposal I outlined above? Can it even work?

        Show
        Shai Erera added a comment - We HAVE to keep this kind of code simple and maintainable in lucene. Why? We write lots of other code that prevents users from shooting themselves in the legs, so why make an exception here? Just because a code might get complicated doesn't mean we don't need to write it. While I agree with you that Lucene is not a File manager, I think it'd be good if we can cleanup after ourselves rather than delete everything that we don't recognize. Since you're more familiar than me with the 4.0 internals, can you please relate to the simple proposal I outlined above? Can it even work?
        Hide
        Michael McCandless added a comment -

        I agree there is a real danger here if users accidentally point
        IndexWriter at the wrong directory. This was found/fixed way in the
        past already: LUCENE-385.

        But I also don't want to go back to the hairy files(), extensions() we
        used to require of all codec components.

        Yet I think there's a good middle ground: only allow a codec to write
        to <seg>.* or _<seg>. files (ie the ones created by
        IndexFileNames). All of our codecs are (should be!) using IndexFileName.*
        to compute a file name to write to.

        In reality a codec already isn't free to just write to any file,
        because then it may conflict with another codec doing the same thing.
        So de-facto codecs already have a "private" namespace, prefixed by
        _<seg> and further refined by _N (ie when there are multiple postings
        formats in a single codec).

        Since a general codec must already obey its private namespace (to not
        step on other codecs) I think it's fine to enforce it?

        Show
        Michael McCandless added a comment - I agree there is a real danger here if users accidentally point IndexWriter at the wrong directory. This was found/fixed way in the past already: LUCENE-385 . But I also don't want to go back to the hairy files(), extensions() we used to require of all codec components. Yet I think there's a good middle ground: only allow a codec to write to <seg>.* or _<seg> . files (ie the ones created by IndexFileNames). All of our codecs are (should be!) using IndexFileName.* to compute a file name to write to. In reality a codec already isn't free to just write to any file, because then it may conflict with another codec doing the same thing. So de-facto codecs already have a "private" namespace, prefixed by _<seg> and further refined by _N (ie when there are multiple postings formats in a single codec). Since a general codec must already obey its private namespace (to not step on other codecs) I think it's fine to enforce it?
        Hide
        Robert Muir added a comment -

        The problem with a global filter is that what files a codec uses are an implementation detail of the codec. Currently today,
        a codec can name files pretty much whatever it wants (it must avoid _seg.cfs and segments_seg and segments.gen of course).

        In general other than exceptional cases, we know which files a codec owns because
        a codec writes the list of files that it uses for a segment into the SegmentInfo (http://lucene.apache.org/core/4_0_0-ALPHA/core/org/apache/lucene/codecs/lucene40/Lucene40SegmentInfoFormat.html).

        The problem is these exceptional cases: how can IndexFileDeleter distinguish between leftover partially written index files for a segment and some files of the user, since it may not have the SegmentInfo (.si) for that segment?

        Previous attempts at this still didnt work well:

        • listing the extensions() in the codec is not great, e.g. Sep codec uses .doc extension for documents!
        • having the codec list the files it uses for a segment isnt easy and causes a mess: previously files() had to be symmetric at read and write time and we often had bugs in this, because the files used by the codec often depends upon various things like options the user chooses (e.g. did they enable term vectors, payloads, etc etc). I will do anything to prevent this from coming back!

        So in my opinion, the only real, third option is to restrict what file names a codec can use, in a way thats not a huge imposition to the codec. My patch on this issue (which people weren't happy with) did just this: it restricted file names to begin with an underscore.

        Show
        Robert Muir added a comment - The problem with a global filter is that what files a codec uses are an implementation detail of the codec. Currently today, a codec can name files pretty much whatever it wants (it must avoid _seg.cfs and segments_seg and segments.gen of course). In general other than exceptional cases, we know which files a codec owns because a codec writes the list of files that it uses for a segment into the SegmentInfo ( http://lucene.apache.org/core/4_0_0-ALPHA/core/org/apache/lucene/codecs/lucene40/Lucene40SegmentInfoFormat.html ). The problem is these exceptional cases: how can IndexFileDeleter distinguish between leftover partially written index files for a segment and some files of the user, since it may not have the SegmentInfo (.si) for that segment? Previous attempts at this still didnt work well: listing the extensions() in the codec is not great, e.g. Sep codec uses .doc extension for documents! having the codec list the files it uses for a segment isnt easy and causes a mess: previously files() had to be symmetric at read and write time and we often had bugs in this, because the files used by the codec often depends upon various things like options the user chooses (e.g. did they enable term vectors, payloads, etc etc). I will do anything to prevent this from coming back! So in my opinion, the only real, third option is to restrict what file names a codec can use, in a way thats not a huge imposition to the codec. My patch on this issue (which people weren't happy with) did just this: it restricted file names to begin with an underscore.
        Hide
        Robert Muir added a comment -

        Since a general codec must already obey its private namespace (to not
        step on other codecs) I think it's fine to enforce it?

        The problem it seems is people want a perfect solution. An imperfect solution (_.*) seems to imply that
        its a bug if lucene deletes _myImportantDocument.doc.

        So if we insist on a perfect solution: then fine, the perfect solution I accept is for lucene to totally own
        the directory, don't put files in there! Then the behavior is clear, no bugs, we delete everything.

        Show
        Robert Muir added a comment - Since a general codec must already obey its private namespace (to not step on other codecs) I think it's fine to enforce it? The problem it seems is people want a perfect solution. An imperfect solution (_.*) seems to imply that its a bug if lucene deletes _myImportantDocument.doc. So if we insist on a perfect solution: then fine, the perfect solution I accept is for lucene to totally own the directory, don't put files in there! Then the behavior is clear, no bugs, we delete everything.
        Hide
        Gilad Barkai added a comment -

        So if we insist on a perfect solution: then fine, the perfect solution I accept is for lucene to totally own
        the directory, don't put files in there! Then the behavior is clear, no bugs, we delete everything.

        But than we're left with the original problem - should a poor user (say, me) accidentally put an index in an already filled directory (say /tmp) - the price to pay for is great.
        Too great IMHO.

        Show
        Gilad Barkai added a comment - So if we insist on a perfect solution: then fine, the perfect solution I accept is for lucene to totally own the directory, don't put files in there! Then the behavior is clear, no bugs, we delete everything. But than we're left with the original problem - should a poor user (say, me) accidentally put an index in an already filled directory (say /tmp) - the price to pay for is great. Too great IMHO.
        Hide
        Yonik Seeley added a comment -

        Robert, you completely ignored my explicit VETO.
        We had consensus, and code was committed. It's no longer "your commit" to do anything you want with over the objections of others.
        Undoubtedly, you would now revert any commit I would make to rectify the situation and fix this bug. So let's now take it to the PMC and codify if it's OK to ignore VETOs of other PMC members that you don't agree with. Perhaps we need to update the rules we operate under.

        Show
        Yonik Seeley added a comment - Robert, you completely ignored my explicit VETO. We had consensus, and code was committed. It's no longer "your commit" to do anything you want with over the objections of others. Undoubtedly, you would now revert any commit I would make to rectify the situation and fix this bug. So let's now take it to the PMC and codify if it's OK to ignore VETOs of other PMC members that you don't agree with. Perhaps we need to update the rules we operate under.
        Hide
        Robert Muir added a comment -

        you can't veto me backing out my own commit.

        Show
        Robert Muir added a comment - you can't veto me backing out my own commit.
        Hide
        Mark Miller added a comment -

        pretty sure I have the right to revert my own commit.

        Once something is in the code base, it doesn't matter who committed it - all the same rules apply. That it's your commit doesn't change anything. Unless you went insane and started threatening license revoking type...oh wait...

        I can declare the licensing of asl2 as a mistake and instead full gpl if we want to press the point?

        You can't be on the PMC and play games like this if you ask me. Being on the PMC means you have an obligation to act above this. Are we going back to the revert wars now?

        As far as I can tell you are trying to act like a dictator on this issue. You contribute a lot to Lucene, but you are not the dictator. Why do you need to demand that certain things happen as you prescribe? Why do you need to make threats about revoking licenses?

        This issue should be about consensus, not bullying.

        Are you kidding me dude?

        I hope you start working with the community and stop trying to step on it. Your stance is far too often, it's Roberts way or the highway if you ask me.

        Show
        Mark Miller added a comment - pretty sure I have the right to revert my own commit. Once something is in the code base, it doesn't matter who committed it - all the same rules apply. That it's your commit doesn't change anything. Unless you went insane and started threatening license revoking type...oh wait... I can declare the licensing of asl2 as a mistake and instead full gpl if we want to press the point? You can't be on the PMC and play games like this if you ask me. Being on the PMC means you have an obligation to act above this. Are we going back to the revert wars now? As far as I can tell you are trying to act like a dictator on this issue. You contribute a lot to Lucene, but you are not the dictator. Why do you need to demand that certain things happen as you prescribe? Why do you need to make threats about revoking licenses? This issue should be about consensus, not bullying. Are you kidding me dude? I hope you start working with the community and stop trying to step on it. Your stance is far too often, it's Roberts way or the highway if you ask me.
        Hide
        Robert Muir added a comment -

        Thats ok, its clearly another typical Solr-versus-Robert battle here, where Mark+Yonik both gang up on me.

        Another way to look at it: I committed the patch after Mike reviewed it, because it looked like consensus.
        There was then a ton of questions and commentary, arguably there wasnt really consensus and i prematurely committed.

        So i backed it out.

        Show
        Robert Muir added a comment - Thats ok, its clearly another typical Solr-versus-Robert battle here, where Mark+Yonik both gang up on me. Another way to look at it: I committed the patch after Mike reviewed it, because it looked like consensus. There was then a ton of questions and commentary, arguably there wasnt really consensus and i prematurely committed. So i backed it out.
        Hide
        Mark Harwood added a comment -

        -1 for merrily wiping contents of whatever directory a user happens to pick for an index location
        +0 on requiring all codecs to declare filenames because I take on board Rob's points re complexity
        +1 for the "_*" name-spacing proposal as a sensible compromise

        Show
        Mark Harwood added a comment - -1 for merrily wiping contents of whatever directory a user happens to pick for an index location +0 on requiring all codecs to declare filenames because I take on board Rob's points re complexity +1 for the "_*" name-spacing proposal as a sensible compromise
        Hide
        Mark Miller added a comment -

        Thats ok, its clearly another typical Solr-versus-Robert battle here, where Mark+Yonik both gang up on me.

        I don't have an opinion on this issue. A lot of smart people have already given input, and I was interested to read about it. I have not formulated my own opinion yet.

        I also don't mind if you and Yonik have disagreement or debate. As long as you act reasonably.

        Anyone that ignores consensus and threatens license revoking has me not on their side. That's part of the role of a PMC member IMO. To keep an eye out for unhealthy community behavior and point it out. All the disagreement in the world is fine, but you have to play in the sandbox.

        Show
        Mark Miller added a comment - Thats ok, its clearly another typical Solr-versus-Robert battle here, where Mark+Yonik both gang up on me. I don't have an opinion on this issue. A lot of smart people have already given input, and I was interested to read about it. I have not formulated my own opinion yet. I also don't mind if you and Yonik have disagreement or debate. As long as you act reasonably. Anyone that ignores consensus and threatens license revoking has me not on their side. That's part of the role of a PMC member IMO. To keep an eye out for unhealthy community behavior and point it out. All the disagreement in the world is fine, but you have to play in the sandbox.
        Hide
        Robert Muir added a comment -

        I think if you read through the issue, how is consensus being ignored?

        Again: I committed the patch after Mike reviewed it, because it looked like consensus.
        But I think this was premature, because a lot of questions and comments came afterwards.

        Backing it out is the right thing to do. It might be that we get consensus for this patch
        or something else and it might even go right back in the way it was.

        You just don't like the words I used.

        Show
        Robert Muir added a comment - I think if you read through the issue, how is consensus being ignored? Again: I committed the patch after Mike reviewed it, because it looked like consensus. But I think this was premature, because a lot of questions and comments came afterwards. Backing it out is the right thing to do. It might be that we get consensus for this patch or something else and it might even go right back in the way it was. You just don't like the words I used.
        Hide
        Mark Miller added a comment -

        Another way to look at it: I committed the patch after Mike reviewed it, because it looked like consensus.
        There was then a ton of questions and commentary, arguably there wasnt really consensus and i prematurely committed. So i backed it out.

        That would have been a good argument and much better than the alternative argument you took IMO.

        Show
        Mark Miller added a comment - Another way to look at it: I committed the patch after Mike reviewed it, because it looked like consensus. There was then a ton of questions and commentary, arguably there wasnt really consensus and i prematurely committed. So i backed it out. That would have been a good argument and much better than the alternative argument you took IMO.
        Hide
        Mark Miller added a comment -

        You just don't like the words I used.

        The words, the threat, the quick action - correct - that's my problem.

        Your objective is certainly not my problem.

        Had you made the argument you formulated after, I would not have a problem. That argument is within the 'sandbox'.

        Show
        Mark Miller added a comment - You just don't like the words I used. The words, the threat, the quick action - correct - that's my problem. Your objective is certainly not my problem. Had you made the argument you formulated after, I would not have a problem. That argument is within the 'sandbox'.
        Hide
        Robert Muir added a comment -

        The words, the threat, the quick action - correct - that's my problem.

        Right, i take offense to the idea that if i committed something too soon, i cant back it out.

        Sure, it didnt help that I was already frustrated with the technical situation (I thought and still do think,
        that the patch is a great compromise, easy solution, low risk, simple, etc).

        But i think if this situation happens, e.g. someone commits prematurely, then there are a bunch of comments
        on the issue that make it clear there really isnt consensus, then they have the right to back it out, in fact
        I think its the right thing to do.

        And nobody can veto that.

        Show
        Robert Muir added a comment - The words, the threat, the quick action - correct - that's my problem. Right, i take offense to the idea that if i committed something too soon, i cant back it out. Sure, it didnt help that I was already frustrated with the technical situation (I thought and still do think, that the patch is a great compromise, easy solution, low risk, simple, etc). But i think if this situation happens, e.g. someone commits prematurely, then there are a bunch of comments on the issue that make it clear there really isnt consensus, then they have the right to back it out, in fact I think its the right thing to do. And nobody can veto that.
        Hide
        Yonik Seeley added a comment -

        But I think this was premature, because a lot of questions and comments came afterwards.

        Except that if you read back through the issue, that's not what happened.

        Anyway, if you're interested in consensus now, I don't see anyone opposed to the underscore solution in the short term, even if some thought it didn't go far enough. I didn't see anyone saying that deleting all files was preferable in the short term.
        So if there are no objections, I'll re-commit the underscore fix (which there was consensus for), and then discussion can continue about better methods.

        Robert, I'll repeat your own words back to you:

        We can maybe improve in the future besides the _ check, but I just think this is an easy improvement that will prevent most of the problems.

        Show
        Yonik Seeley added a comment - But I think this was premature, because a lot of questions and comments came afterwards. Except that if you read back through the issue, that's not what happened. Anyway, if you're interested in consensus now, I don't see anyone opposed to the underscore solution in the short term, even if some thought it didn't go far enough. I didn't see anyone saying that deleting all files was preferable in the short term. So if there are no objections, I'll re-commit the underscore fix (which there was consensus for), and then discussion can continue about better methods. Robert, I'll repeat your own words back to you: We can maybe improve in the future besides the _ check, but I just think this is an easy improvement that will prevent most of the problems.
        Hide
        Robert Muir added a comment -

        I didn't see anyone saying that deleting all files was preferable in the short term.

        I'm not sure this is totally true.

        Again I think if its required that we have a perfect solution, then deleting all files is preferable to the alternative of codec having hairy code to detect if it owns or doesnt own a file.

        But this patch is a nice imperfect solution that probably prevents accidental deletion of MyImportantDocument.doc or whatever.

        Show
        Robert Muir added a comment - I didn't see anyone saying that deleting all files was preferable in the short term. I'm not sure this is totally true. Again I think if its required that we have a perfect solution, then deleting all files is preferable to the alternative of codec having hairy code to detect if it owns or doesnt own a file. But this patch is a nice imperfect solution that probably prevents accidental deletion of MyImportantDocument.doc or whatever.
        Hide
        Robert Muir added a comment -

        So if there are no objections, I'll re-commit the underscore fix (which there was consensus for), and then discussion can continue about better methods.

        I don't object to the patch being committed (though i think it would be good to wait a little bit), but
        I am still very very concerned that it starts a slippery slope back to Codec.files().

        Show
        Robert Muir added a comment - So if there are no objections, I'll re-commit the underscore fix (which there was consensus for), and then discussion can continue about better methods. I don't object to the patch being committed (though i think it would be good to wait a little bit), but I am still very very concerned that it starts a slippery slope back to Codec.files().
        Hide
        Uwe Schindler added a comment -

        Hi,
        I was thinking about the whole thing for longer time. My idea would limit us a bit more, but I really like Mike's proposal of fixed names. I would change the Directory class, so every method that handles or deletes files gets 2 parameters, segment name and one arbitrary codec-private file name. the directory is then responsible to create the file name, prefix with _ and so on. A custom directoy (like hbase), could use the segment name as table name and the private file name as identifier, so all segment files go into same hbase table. the diurectory would then also be responible to do a "cleanup"/"list of files", where it would only return files matching the pattern.

        For the index wide metdata like segments file we would then unfortunately need a special method to get indexoutput

        If we keep with current one-filename, i would make the format fixed, so it throws IOException if filename is invalid. Assert makes no sense here as it does not prevent people from doing the wrong thing. Then really nothing can create invalid files and deleting by "_[0-9a-z_]+" works and all would be happy.

        Alternatively, we could switch to the following:

        • If we create an new index, we enforce that listFiles returns empty list (., .. excluded, buts thats done already), otherwise we throw IOException("directory not empty").
        • If there is a segment file already there, we can delete everything not allowed in an index.

        Uwe

        Show
        Uwe Schindler added a comment - Hi, I was thinking about the whole thing for longer time. My idea would limit us a bit more, but I really like Mike's proposal of fixed names. I would change the Directory class, so every method that handles or deletes files gets 2 parameters, segment name and one arbitrary codec-private file name. the directory is then responsible to create the file name, prefix with _ and so on. A custom directoy (like hbase), could use the segment name as table name and the private file name as identifier, so all segment files go into same hbase table. the diurectory would then also be responible to do a "cleanup"/"list of files", where it would only return files matching the pattern. For the index wide metdata like segments file we would then unfortunately need a special method to get indexoutput If we keep with current one-filename, i would make the format fixed, so it throws IOException if filename is invalid. Assert makes no sense here as it does not prevent people from doing the wrong thing. Then really nothing can create invalid files and deleting by "_ [0-9a-z_] +" works and all would be happy. Alternatively, we could switch to the following: If we create an new index, we enforce that listFiles returns empty list (., .. excluded, buts thats done already), otherwise we throw IOException("directory not empty"). If there is a segment file already there, we can delete everything not allowed in an index. Uwe
        Hide
        Robert Muir added a comment -

        I was thinking about the whole thing for longer time. My idea would limit us a bit more, but I really like Mike's proposal of fixed names. I would change the Directory class, so every method that handles or deletes files gets 2 parameters, segment name and one arbitrary codec-private file name. the directory is then responsible to create the file name, prefix with _ and so on. A custom directoy (like hbase), could use the segment name as table name and the private file name as identifier, so all segment files go into same hbase table. the diurectory would then also be responible to do a "cleanup"/"list of files", where it would only return files matching the pattern.

        I'm not sure matching _[0-9a-z_]+ is really that big of an improvement over just the underscore. But i dont think we need
        to refactor Directory.java to do this. we could just change the underscore check to a regular expression.

        Assert makes no sense here as it does not prevent people from doing the wrong thing.

        I don't agree: i at first thought to do a hard check, but this is only really necessary for codec developers. So an assert
        is enough, because you catch it when developing your codec (its either gonna work, or completely not work here).

        If we create an new index, we enforce that listFiles returns empty list (., .. excluded, buts thats done already), otherwise we throw IOException("directory not empty").

        I thought about this but i have concerns about things like .DS_Store and .nfsXXXXX or other files that some system could
        be doing behind the scenes, etc.

        Show
        Robert Muir added a comment - I was thinking about the whole thing for longer time. My idea would limit us a bit more, but I really like Mike's proposal of fixed names. I would change the Directory class, so every method that handles or deletes files gets 2 parameters, segment name and one arbitrary codec-private file name. the directory is then responsible to create the file name, prefix with _ and so on. A custom directoy (like hbase), could use the segment name as table name and the private file name as identifier, so all segment files go into same hbase table. the diurectory would then also be responible to do a "cleanup"/"list of files", where it would only return files matching the pattern. I'm not sure matching _ [0-9a-z_] + is really that big of an improvement over just the underscore. But i dont think we need to refactor Directory.java to do this. we could just change the underscore check to a regular expression. Assert makes no sense here as it does not prevent people from doing the wrong thing. I don't agree: i at first thought to do a hard check, but this is only really necessary for codec developers. So an assert is enough, because you catch it when developing your codec (its either gonna work, or completely not work here). If we create an new index, we enforce that listFiles returns empty list (., .. excluded, buts thats done already), otherwise we throw IOException("directory not empty"). I thought about this but i have concerns about things like .DS_Store and .nfsXXXXX or other files that some system could be doing behind the scenes, etc.
        Hide
        Robert Muir added a comment -

        just to mention: the reason I don't like the Directory refactoring would be some of the crazy things we do (look at CompoundFileDirectory and also IndexWriter copySegmentAsIs, etc).

        This is basically what i think we should avoid: adding a lot of risky complexity for little gain.

        Show
        Robert Muir added a comment - just to mention: the reason I don't like the Directory refactoring would be some of the crazy things we do (look at CompoundFileDirectory and also IndexWriter copySegmentAsIs, etc). This is basically what i think we should avoid: adding a lot of risky complexity for little gain.
        Hide
        Uwe Schindler added a comment -

        Assert makes no sense here as it does not prevent people from doing the wrong thing.

        I don't agree: i at first thought to do a hard check, but this is only really necessary for codec developers. So an assert
        is enough, because you catch it when developing your codec (its either gonna work, or completely not work here).

        Why not make it a hard check, otherwise one could write a file without _ and schwupps, it's wech (German). Why only an assert? If we require all files start with _ lets enorce it, otherwise delete all files like we do currently. Using an assert would get my -1 to commit this again.

        Show
        Uwe Schindler added a comment - Assert makes no sense here as it does not prevent people from doing the wrong thing. I don't agree: i at first thought to do a hard check, but this is only really necessary for codec developers. So an assert is enough, because you catch it when developing your codec (its either gonna work, or completely not work here). Why not make it a hard check, otherwise one could write a file without _ and schwupps, it's wech (German). Why only an assert? If we require all files start with _ lets enorce it, otherwise delete all files like we do currently. Using an assert would get my -1 to commit this again.
        Hide
        Robert Muir added a comment - - edited

        I'm not 1000% determined for it to only be an assert, but then we should change how the code works to make
        sure that the check is not too expensive. The current assert makes SegmentInfo.addFiles/addFile very expensive (if its turned directly into a hard check)

        Show
        Robert Muir added a comment - - edited I'm not 1000% determined for it to only be an assert, but then we should change how the code works to make sure that the check is not too expensive. The current assert makes SegmentInfo.addFiles/addFile very expensive (if its turned directly into a hard check)
        Hide
        Uwe Schindler added a comment -

        String.startsWith("_") static string is cheap (a few cpu cycles, as it only needs compare length and one char... Please dont tell me that SI.addFilkes is called in inner loops like Scorers! Not doing this check is stupid.

        BTW: In CFSDirectory the assert about double entries on reading the dir should also throw CorruptIndexEx, because a CFS with duplicate file names is broken. This check is even cheaper. I am planning to open a new issue to fix all those I/O related checks to be hard, asserts are not appropriate here.

        Show
        Uwe Schindler added a comment - String.startsWith("_") static string is cheap (a few cpu cycles, as it only needs compare length and one char... Please dont tell me that SI.addFilkes is called in inner loops like Scorers! Not doing this check is stupid. BTW: In CFSDirectory the assert about double entries on reading the dir should also throw CorruptIndexEx, because a CFS with duplicate file names is broken. This check is even cheaper. I am planning to open a new issue to fix all those I/O related checks to be hard, asserts are not appropriate here.
        Hide
        Robert Muir added a comment -

        Uwe, no i mean that we check the entire list each time.

        So if someone were to call addFile(), addFile(), addFile() that would be very bad runtime. Ill update the patch.

        Show
        Robert Muir added a comment - Uwe, no i mean that we check the entire list each time. So if someone were to call addFile(), addFile(), addFile() that would be very bad runtime. Ill update the patch.
        Hide
        Robert Muir added a comment -

        updated patch with _ check turned into a hard check.

        Show
        Robert Muir added a comment - updated patch with _ check turned into a hard check.
        Hide
        Uwe Schindler added a comment -

        New patch looks good, I was not aware that the previous one was iterating over all files each time. As the SegmentInfo internal list should not be available outside, we have no problem anybody else changing this uncontrolled.

        See also issue LUCENE-4196.

        Show
        Uwe Schindler added a comment - New patch looks good, I was not aware that the previous one was iterating over all files each time. As the SegmentInfo internal list should not be available outside, we have no problem anybody else changing this uncontrolled. See also issue LUCENE-4196 .
        Hide
        Robert Muir added a comment -

        I think that the way to "bound" the namespace of files is to put everything in a subdirectory of the index directory chosen by the user and control the name of that subdirectory, making it clear that this is semi-private to Lucene and that all files in that subdirectory are fair game.

        Well there are a couple challenges with that I think:
        1. subdirectories currently are a foreign concept to Directory, we would have to make some serious changes there to support subdirectories.
        2. Lucene 3.x and Lucene4-alpha indexes still need to be supported, and we dont want to leave behind baggage when we merge, so the transition would be tricky.
        3. the user could also do this on their own right? e.g. we still have the same situation we have currently, where anything in that directory can get deleted by lucene, its just underneath another layer.

        Show
        Robert Muir added a comment - I think that the way to "bound" the namespace of files is to put everything in a subdirectory of the index directory chosen by the user and control the name of that subdirectory, making it clear that this is semi-private to Lucene and that all files in that subdirectory are fair game. Well there are a couple challenges with that I think: 1. subdirectories currently are a foreign concept to Directory, we would have to make some serious changes there to support subdirectories. 2. Lucene 3.x and Lucene4-alpha indexes still need to be supported, and we dont want to leave behind baggage when we merge, so the transition would be tricky. 3. the user could also do this on their own right? e.g. we still have the same situation we have currently, where anything in that directory can get deleted by lucene, its just underneath another layer.
        Hide
        Michael McCandless added a comment -

        I think

        _<seg>.*

        or

        _<seg>_*.

        is better than simply _*?

        Ie, it further reduces the risk of deleting a non-Lucene file, while
        not restricting the codec at all (this is already its private
        namespace)? No codec should be writing files outside of this space
        (and if somehow one needs to in the future, we can cross that bridge
        at that point?).

        It's of course still not perfect but I think it's as close as we should try
        to get.

        Show
        Michael McCandless added a comment - I think _<seg>.* or _<seg>_*. is better than simply _*? Ie, it further reduces the risk of deleting a non-Lucene file, while not restricting the codec at all (this is already its private namespace)? No codec should be writing files outside of this space (and if somehow one needs to in the future, we can cross that bridge at that point?). It's of course still not perfect but I think it's as close as we should try to get.
        Hide
        Robert Muir added a comment -

        I think the regex matching the current behavior (as restrictive as i can get) is:

        _<seg>(_.*)?\..*
        
        Show
        Robert Muir added a comment - I think the regex matching the current behavior (as restrictive as i can get) is: _<seg>(_.*)?\..*
        Hide
        Robert Muir added a comment -

        Updated patch with the regex instead of the underscore.

        additionally, dont fire off IndexFileDeleter's nuking of files if there is no current segments file.

        This means if you screw up and pick c:\, we do nothing. if you go and commit documents to c:\, then the next time you fire up indexwriter it will do the nuking as its then actually an index directory.

        But this gives you an additional chance to catch your mistake besides just the regex: plus its just a null check/optimization if you want to look at it.

        Show
        Robert Muir added a comment - Updated patch with the regex instead of the underscore. additionally, dont fire off IndexFileDeleter's nuking of files if there is no current segments file. This means if you screw up and pick c:\, we do nothing. if you go and commit documents to c:\, then the next time you fire up indexwriter it will do the nuking as its then actually an index directory. But this gives you an additional chance to catch your mistake besides just the regex: plus its just a null check/optimization if you want to look at it.
        Hide
        Andi Vajda added a comment -

        1. subdirectories currently are a foreign concept to Directory, we would have
        to make some serious changes there to support subdirectories.
        2. Lucene 3.x and Lucene4-alpha indexes still need to be supported, and we
        dont want to leave behind baggage when we merge, so the transition would be
        tricky.

        The way I imagined this working (short of looking at the code and proposing a
        patch) was to just append something like "lucene.index" to the directory path
        given by the user and pretend that's what the user gave it (and create that
        subdirectory). Code deleting the index files becomes simpler, it's just a
        recursive delete of that subdirectory Lucene created.

        That name, "lucene.index", could be in fact an additional parameter to the
        Directory class so that one could pick different names and store multiple
        indexes in the directory part.

        Having that extra name parameter would make it harder to a just use c:\ or c:\tmp which, while apparently silly, is also easy to hit accidentally. Imagine a shell script that puts together an index directory path with, say, some environment variables or command line parameters, and because of a bug there or some human error some inputs making up that path are now missing. Fancy generated path /home/fred/$(indexes) is now shorter all of a sudden, /home/fred gets used instead and later (partially) wiped. Ouch. No, I didn't just make this up

        3. the user could also do this on their own right? e.g. we still have the same situation we have currently, where anything in that directory can get deleted by lucene, its just underneath another layer.

        Of course they could, but the difference is that if Lucene is the one creating
        a directory, I expect it to more or less control the files in it, whereas if I
        create a directory myself, that is not the case. I don't expect Lucene to take
        it over by deleting files in it it didn't create.

        With the extra index name parameter, one would make it much less likely to stomp over something not belonging to Lucene. As for backwards compatibility, one could tolerate for the next release cycle, that this extra name be empty (which would be equivalent to the situation we have now, bug 4190).

        Show
        Andi Vajda added a comment - 1. subdirectories currently are a foreign concept to Directory, we would have to make some serious changes there to support subdirectories. 2. Lucene 3.x and Lucene4-alpha indexes still need to be supported, and we dont want to leave behind baggage when we merge, so the transition would be tricky. The way I imagined this working (short of looking at the code and proposing a patch) was to just append something like "lucene.index" to the directory path given by the user and pretend that's what the user gave it (and create that subdirectory). Code deleting the index files becomes simpler, it's just a recursive delete of that subdirectory Lucene created. That name, "lucene.index", could be in fact an additional parameter to the Directory class so that one could pick different names and store multiple indexes in the directory part. Having that extra name parameter would make it harder to a just use c:\ or c:\tmp which, while apparently silly, is also easy to hit accidentally. Imagine a shell script that puts together an index directory path with, say, some environment variables or command line parameters, and because of a bug there or some human error some inputs making up that path are now missing. Fancy generated path /home/fred/$(indexes) is now shorter all of a sudden, /home/fred gets used instead and later (partially) wiped. Ouch. No, I didn't just make this up 3. the user could also do this on their own right? e.g. we still have the same situation we have currently, where anything in that directory can get deleted by lucene, its just underneath another layer. Of course they could, but the difference is that if Lucene is the one creating a directory, I expect it to more or less control the files in it, whereas if I create a directory myself, that is not the case. I don't expect Lucene to take it over by deleting files in it it didn't create. With the extra index name parameter, one would make it much less likely to stomp over something not belonging to Lucene. As for backwards compatibility, one could tolerate for the next release cycle, that this extra name be empty (which would be equivalent to the situation we have now, bug 4190).
        Hide
        Robert Muir added a comment -

        As for backwards compatibility, one could tolerate for the next release cycle, that this extra name be empty (which would be equivalent to the situation we have now, bug 4190).

        2 release cycles: as we need to be able to support the 4.0-ALPHA indexes through the 5.x series, we wouldn't even be able to implement this change until Lucene 6.0.

        Show
        Robert Muir added a comment - As for backwards compatibility, one could tolerate for the next release cycle, that this extra name be empty (which would be equivalent to the situation we have now, bug 4190). 2 release cycles: as we need to be able to support the 4.0-ALPHA indexes through the 5.x series, we wouldn't even be able to implement this change until Lucene 6.0.
        Hide
        Michael McCandless added a comment -

        I think that the way to "bound" the namespace of files is to put everything in a subdirectory of the index directory chosen by the user and control the name of that subdirectory, making it clear that this is semi-private to Lucene and that all files in that subdirectory are fair game.

        I think this idea is compelling. It would clearly succeed in creating
        a private namespace (unless someone had happened to separately create
        a directory named 'lucene.index' there, which seems very unlikely).

        For back compat... if we open an existing index (not in the subdir),
        we'd have to just continue writing there? But when creating a new
        index, we'd create the subdir and write files into it (and maybe we
        fix index upgrader to somehow move the files to the subdir?).

        We could make this change for 4.0 Beta, but it wouldn't be until 6.0
        that we could remove the back compat (which is fine).

        Hmm but then an old index would never actually migrate "forward". Not
        quite sure how to do the transition ...

        It seems like a rather big change, which makes me nervous ... and I
        imagine apps will be confused eg when they try to replicate or backup
        (but, they'd just have to fix themselves: apps can't rely on the index
        files).

        It seems less risky to start with the current patch; in fact I think
        the two changes can be done separately?

        Show
        Michael McCandless added a comment - I think that the way to "bound" the namespace of files is to put everything in a subdirectory of the index directory chosen by the user and control the name of that subdirectory, making it clear that this is semi-private to Lucene and that all files in that subdirectory are fair game. I think this idea is compelling. It would clearly succeed in creating a private namespace (unless someone had happened to separately create a directory named 'lucene.index' there, which seems very unlikely). For back compat... if we open an existing index (not in the subdir), we'd have to just continue writing there? But when creating a new index, we'd create the subdir and write files into it (and maybe we fix index upgrader to somehow move the files to the subdir?). We could make this change for 4.0 Beta, but it wouldn't be until 6.0 that we could remove the back compat (which is fine). Hmm but then an old index would never actually migrate "forward". Not quite sure how to do the transition ... It seems like a rather big change, which makes me nervous ... and I imagine apps will be confused eg when they try to replicate or backup (but, they'd just have to fix themselves: apps can't rely on the index files). It seems less risky to start with the current patch; in fact I think the two changes can be done separately?
        Hide
        Michael McCandless added a comment -

        I think for now we should just commit the last "best effort regexp" patch ... progress not perfection.

        We can separately / later explore creating our own sub-directory...

        Show
        Michael McCandless added a comment - I think for now we should just commit the last "best effort regexp" patch ... progress not perfection. We can separately / later explore creating our own sub-directory...

          People

          • Assignee:
            Michael McCandless
            Reporter:
            Michael McCandless
          • Votes:
            0 Vote for this issue
            Watchers:
            10 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development