Lucene - Core
  1. Lucene - Core
  2. LUCENE-2282

Expose IndexFileNames as public, and make use of its methods in the code

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.0-ALPHA
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      IndexFileNames is useful for applications that extend Lucene, an in particular those who extend Directory or IndexWriter. It provides useful constants and methods to query whether a certain file is a core Lucene file or not. In addition, IndexFileNames should be used by Lucene's code to generate segment file names, or query whether a certain file matches a certain extension.

      I'll post the patch shortly.

      1. LUCENE-2282.patch
        42 kB
        Shai Erera
      2. LUCENE-2282.patch
        41 kB
        Shai Erera
      3. LUCENE-2282.patch
        41 kB
        Shai Erera

        Activity

        Hide
        Marvin Humphrey added a comment -

        It seems to me that identifying only core index files conflicts with the idea
        of pluggable index formats. Presumably plugins would use their own file
        extensions. Would these belong to the index, according to a detector based
        off of IndexFileNames? Presumably not, which would either limit the
        usefulness of such a utility, or outright encourage anti-patterns such as a
        sweeper that zaps files created by plugins because they aren't "core Lucene"
        enough.

        Also, are temporary files "core Lucene"? Lockfiles? Only sometimes?

        What are the applications that we are trying to support by exposing this API?

        Show
        Marvin Humphrey added a comment - It seems to me that identifying only core index files conflicts with the idea of pluggable index formats. Presumably plugins would use their own file extensions. Would these belong to the index, according to a detector based off of IndexFileNames? Presumably not, which would either limit the usefulness of such a utility, or outright encourage anti-patterns such as a sweeper that zaps files created by plugins because they aren't "core Lucene" enough. Also, are temporary files "core Lucene"? Lockfiles? Only sometimes? What are the applications that we are trying to support by exposing this API?
        Hide
        Shai Erera added a comment -

        Patch provides:

        • IFN constants and methods as public
        • segmentFileName concatenates the name and extension w/ '.', using a StringBuilder. Using SB is a minor efficiency, and since the + operator on strings allocate it anyway, I figure we'd better allocate it directly, especially when two + operators are involved.
        • matchesExtension checks whether a filename matches an extension
        • fileNameFromGeneration adds the '.' when neccessary.
        • Core code does not concatenate file, '.' and extension on its own, but uses IFN.

        All core tests pass. I have a local problem w/ contrib tests (the Analyzers data dir is locked for some reason and I cannot unlock it), but I see no reasons for them to fail.

        If this can go out in 3.0.2 instead of waiting for 3.1, then all the better.

        Show
        Shai Erera added a comment - Patch provides: IFN constants and methods as public segmentFileName concatenates the name and extension w/ '.', using a StringBuilder. Using SB is a minor efficiency, and since the + operator on strings allocate it anyway, I figure we'd better allocate it directly, especially when two + operators are involved. matchesExtension checks whether a filename matches an extension fileNameFromGeneration adds the '.' when neccessary. Core code does not concatenate file, '.' and extension on its own, but uses IFN. All core tests pass. I have a local problem w/ contrib tests (the Analyzers data dir is locked for some reason and I cannot unlock it), but I see no reasons for them to fail. If this can go out in 3.0.2 instead of waiting for 3.1, then all the better.
        Hide
        Shai Erera added a comment -

        What are the applications that we are trying to support by exposing this API?

        Any application that extends IW, or provide its own Directory implementation, and wants to reference Lucene's file extensions properly (i.e. not by putting its code under o.a.l.index or hardcoding ".del" as its deletions file) will benefit from making it public.
        The class already exists, and declares the right constants and methods. All I want is to expose it as public. This information is not hidden, it's out there. We're just making it easier for apps to reference it.

        Like Mike mentioned on the java-dev thread http://old.nabble.com/IndexFileNames-td27699103.html, he exposes it (or a variant) on the flex branch as public, because classes outside the 'index' package need to reference it. Indeed once Flex is out, that file will need to change or extend to accommodate pluggable index formats / files, but that's not the matter of this issue. This issue is just about exposing an already existing class as public instead of package-private for convenience ...

        Show
        Shai Erera added a comment - What are the applications that we are trying to support by exposing this API? Any application that extends IW, or provide its own Directory implementation, and wants to reference Lucene's file extensions properly (i.e. not by putting its code under o.a.l.index or hardcoding ".del" as its deletions file) will benefit from making it public. The class already exists, and declares the right constants and methods. All I want is to expose it as public. This information is not hidden , it's out there. We're just making it easier for apps to reference it. Like Mike mentioned on the java-dev thread http://old.nabble.com/IndexFileNames-td27699103.html , he exposes it (or a variant) on the flex branch as public, because classes outside the 'index' package need to reference it. Indeed once Flex is out, that file will need to change or extend to accommodate pluggable index formats / files, but that's not the matter of this issue. This issue is just about exposing an already existing class as public instead of package-private for convenience ...
        Hide
        Shai Erera added a comment -

        Forgot to tag IFN as @lucene.internal

        Show
        Shai Erera added a comment - Forgot to tag IFN as @lucene.internal
        Hide
        Marvin Humphrey added a comment -

        > Any application that extends IW, or provide its own Directory
        > implementation, and wants to reference Lucene's file extensions properly
        > (i.e. not by putting its code under o.a.l.index or hardcoding ".del" as its
        > deletions file) will benefit from making it public.

        > Forgot to tag IFN as @lucene.internal

        ?

        If the class is tagged as "internal", then external applications like the one
        you describe above aren't supposed to use it, right?

        But I don't really care about whether this goes into Lucene. Go ahead, make
        it fully public and omit the "internal" tag. Not my problem.

        The thing is, I really don't understand what kind of thing you want to do.
        Are you writing your own deletions files?

        I'm trying to understand because the only use cases I can think of for this
        aren't compatible with index pluggability, which is a high priority for Lucy.

        • Sweep "non-core-Lucene files" to "clean up" an index.
        • Gather up "core-Lucene files" for export.
        • Audit "core-Lucene files" to determine whether the index is in a valid
          state.
        • Differentiate between "core-Lucene and "non-core-Lucene" files when
          writing a compound file.

        Maybe there's something I haven't thought of, though. Why do you want to
        "reference Lucene's file extensions properly"? Once you've identified
        identified which files are "core Lucene" and which files aren't, what are you
        going to do with them?

        Show
        Marvin Humphrey added a comment - > Any application that extends IW, or provide its own Directory > implementation, and wants to reference Lucene's file extensions properly > (i.e. not by putting its code under o.a.l.index or hardcoding ".del" as its > deletions file) will benefit from making it public. > Forgot to tag IFN as @lucene.internal ? If the class is tagged as "internal", then external applications like the one you describe above aren't supposed to use it, right? But I don't really care about whether this goes into Lucene. Go ahead, make it fully public and omit the "internal" tag. Not my problem. The thing is, I really don't understand what kind of thing you want to do. Are you writing your own deletions files? I'm trying to understand because the only use cases I can think of for this aren't compatible with index pluggability, which is a high priority for Lucy. Sweep "non-core-Lucene files" to "clean up" an index. Gather up "core-Lucene files" for export. Audit "core-Lucene files" to determine whether the index is in a valid state. Differentiate between "core-Lucene and "non-core-Lucene" files when writing a compound file. Maybe there's something I haven't thought of, though. Why do you want to "reference Lucene's file extensions properly"? Once you've identified identified which files are "core Lucene" and which files aren't, what are you going to do with them?
        Hide
        Shai Erera added a comment - - edited

        The thing is, I really don't understand what kind of thing you want to do.

        Consider a variation of FileSwitchDirectory which someone wants to write. Extensions xyz, abc go do directory 1 and def got do directory 2. That someone will want to reference Lucene extensions in order to achieve that. FSD only takes care of extensions, but someone might need to work on prefixes of files as well.
        You can say "that class itself should not care about Lucene's extensions, it should get the extensions/prefixes in a ctor" ...

        Well ... let's take FileSwitchDirectory (which is a core Lucene class) and say I want to instantiate it. I need to pass file extensions. How do I pass the .del file as an extension? Do I hard code it to ".del" (or just "del"?) or, or do I put that code in o.a.l.store just for that? Or ... we make IndexFileNames publc and I can happily and safely reference it.
        BTW, I'm writing this from a computer w/o the Lucene code – I wonder how FSD is tested, is the test in o.a.l.index because it references IFN or in o.a.l.store because it tests FSD? If it's the former, then I (usually) think it points at a problem in the design.

        The @lucene.internal tag gives us exactly that freedom. Instead of me putting code in o.a.l.* (and feel like I'm dong something wrong, and make my code look wrong), I can put it in my package but know I'm risking the chance a certain constant or method impl will change in the future. I've already took that risk when I chose to put my code in o.a.l.* so I prefer to take that risk and not make my code look patchy.

        I, personally, don't mind what Lucene's back-compat policy is. Whenever I upgrade my code to a new Lucene version, I re-compile anything. I am happy to get rid of deprecated API, I am happy to take advantage of new, more efficient API, and I wouldn't care if a @lucene.internal class changed, as long as it's documented in the "Changes in back-compat" section in CHANGES, as a FYI - not because it's a change in the policy, so I can read about it quickly (that's the first section I usually check).

        That class is just a reference to Lucene's core files. If you want to write that 'Sweep' thing, you might benefit from knowing what is a core Lucene file and delegate the sweep task to a core Lucene instance, while sweeping the rest of the files by another instance (which created them?). If Lucene core will create files w/ other names/extensions, then I believe this class (IFN) should be changed entirely, but I don't think making it public blocks any of these changes, as long as it's tagged by @lucene.internal.

        Show
        Shai Erera added a comment - - edited The thing is, I really don't understand what kind of thing you want to do. Consider a variation of FileSwitchDirectory which someone wants to write. Extensions xyz, abc go do directory 1 and def got do directory 2. That someone will want to reference Lucene extensions in order to achieve that. FSD only takes care of extensions, but someone might need to work on prefixes of files as well. You can say "that class itself should not care about Lucene's extensions, it should get the extensions/prefixes in a ctor" ... Well ... let's take FileSwitchDirectory (which is a core Lucene class) and say I want to instantiate it. I need to pass file extensions. How do I pass the .del file as an extension? Do I hard code it to ".del" (or just "del"?) or, or do I put that code in o.a.l.store just for that? Or ... we make IndexFileNames publc and I can happily and safely reference it. BTW, I'm writing this from a computer w/o the Lucene code – I wonder how FSD is tested, is the test in o.a.l.index because it references IFN or in o.a.l.store because it tests FSD? If it's the former, then I (usually) think it points at a problem in the design. The @lucene.internal tag gives us exactly that freedom. Instead of me putting code in o.a.l.* (and feel like I'm dong something wrong, and make my code look wrong), I can put it in my package but know I'm risking the chance a certain constant or method impl will change in the future. I've already took that risk when I chose to put my code in o.a.l.* so I prefer to take that risk and not make my code look patchy. I, personally, don't mind what Lucene's back-compat policy is. Whenever I upgrade my code to a new Lucene version, I re-compile anything. I am happy to get rid of deprecated API, I am happy to take advantage of new, more efficient API, and I wouldn't care if a @lucene.internal class changed, as long as it's documented in the "Changes in back-compat" section in CHANGES, as a FYI - not because it's a change in the policy, so I can read about it quickly (that's the first section I usually check). That class is just a reference to Lucene's core files. If you want to write that 'Sweep' thing, you might benefit from knowing what is a core Lucene file and delegate the sweep task to a core Lucene instance, while sweeping the rest of the files by another instance (which created them?). If Lucene core will create files w/ other names/extensions, then I believe this class (IFN) should be changed entirely, but I don't think making it public blocks any of these changes, as long as it's tagged by @lucene.internal.
        Hide
        Shai Erera added a comment -

        Updated TestFileSwitchDirectory to use the constants instead of hard coding "fdt" and "fdx". It couldn't have done that because IFN was package-private, which shows the problem with it.

        Show
        Shai Erera added a comment - Updated TestFileSwitchDirectory to use the constants instead of hard coding "fdt" and "fdx". It couldn't have done that because IFN was package-private, which shows the problem with it.
        Hide
        Michael McCandless added a comment -

        Patch looks good Shai!

        But I don't think we should back port to 3.0.2 – it's non-trivial
        enough that there is some risk?

        As the API is now marked @lucene.internal, and it'll only be very
        expert usage, I'm not as concerned as Marvin is about the risks of
        even exposing this. Also, even with flex, a good number of Lucene's
        index files are not under codec control (codec only touches postings
        files – .tis, .tii, .frq, .prx for the standard codec). But I do
        agree it's not ideal that the knowledge of file extensions is split
        across this class and the codec. The IndexFileNameFilter in flex now
        takes a Codec as input, to make up for that... but IndexFileNames just
        has a NOTE at the top stating the limitation.

        Show
        Michael McCandless added a comment - Patch looks good Shai! But I don't think we should back port to 3.0.2 – it's non-trivial enough that there is some risk? As the API is now marked @lucene.internal, and it'll only be very expert usage, I'm not as concerned as Marvin is about the risks of even exposing this. Also, even with flex, a good number of Lucene's index files are not under codec control (codec only touches postings files – .tis, .tii, .frq, .prx for the standard codec). But I do agree it's not ideal that the knowledge of file extensions is split across this class and the codec. The IndexFileNameFilter in flex now takes a Codec as input, to make up for that... but IndexFileNames just has a NOTE at the top stating the limitation.
        Hide
        Shai Erera added a comment -

        But I don't think we should back port to 3.0.2

        Ok, I can live w/ 3.1, as long as it's not released at the end of 2010. I can for now put that part of my code in o.a.l.index, until 3.1 is out.

        As I wrote in the TestFileSwitchDirectory comment, this IMO has to go in, because otherwise it would make the code of users of FSD fragile (potentially).

        Thanks for looking at this !

        Show
        Shai Erera added a comment - But I don't think we should back port to 3.0.2 Ok, I can live w/ 3.1, as long as it's not released at the end of 2010. I can for now put that part of my code in o.a.l.index, until 3.1 is out. As I wrote in the TestFileSwitchDirectory comment, this IMO has to go in, because otherwise it would make the code of users of FSD fragile (potentially). Thanks for looking at this !
        Hide
        Uwe Schindler added a comment -

        But I don't think we should back port to 3.0.2 - it's non-trivial enough that there is some risk?

        Please no backport to 3.0.2, its an API change. And we are not sure if there will be ever a 3.0.2. BTW: Version 3.0.1 comes out latest on Friday, will appear on the mirrors soon!

        Show
        Uwe Schindler added a comment - But I don't think we should back port to 3.0.2 - it's non-trivial enough that there is some risk? Please no backport to 3.0.2, its an API change. And we are not sure if there will be ever a 3.0.2. BTW: Version 3.0.1 comes out latest on Friday, will appear on the mirrors soon!
        Hide
        Marvin Humphrey added a comment -

        > As the API is now marked @lucene.internal, and it'll only be very
        > expert usage, I'm not as concerned as Marvin is about the risks of
        > even exposing this.

        Um, the only possible concerns I could have had were regarding public exposure
        of this API. If it's marked as internal, it's an implementation detail.
        Whether or not the dot is included in internal-use-only constant strings isn't
        something I'm going to waste a lot of time thinking about.

        So now, not only do I really, really not care whether this goes in, I have no
        qualms about it either.

        Having users like Shai who are willing to recompile and regenerate to take
        advantage of experimental features is a big boon, as it allows us to test
        drive features before declaring them stable. Designing optimal APIs without
        usability testing is difficult to impossible.

        Show
        Marvin Humphrey added a comment - > As the API is now marked @lucene.internal, and it'll only be very > expert usage, I'm not as concerned as Marvin is about the risks of > even exposing this. Um, the only possible concerns I could have had were regarding public exposure of this API. If it's marked as internal, it's an implementation detail. Whether or not the dot is included in internal-use-only constant strings isn't something I'm going to waste a lot of time thinking about. So now, not only do I really, really not care whether this goes in, I have no qualms about it either. Having users like Shai who are willing to recompile and regenerate to take advantage of experimental features is a big boon, as it allows us to test drive features before declaring them stable. Designing optimal APIs without usability testing is difficult to impossible.
        Hide
        Michael McCandless added a comment -

        Thanks Shai!

        Show
        Michael McCandless added a comment - Thanks Shai!

          People

          • Assignee:
            Michael McCandless
            Reporter:
            Shai Erera
          • Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development