Lucene - Core
  1. Lucene - Core
  2. LUCENE-3586

Choose a specific Directory implementation running the CheckIndex main

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.6, 4.0-ALPHA
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      It should be possible to choose a specific Directory implementation to use during the CheckIndex process when we run it from its main.
      What about an additional main parameter?
      In fact, I'm experiencing some problems with MMapDirectory working with a big segment, and after some failed attempts playing with maxChunkSize, I decided to switch to another FSDirectory implementation but I needed to do that on my own main.
      Should we also consider to use a FileSwitchDirectory?
      I'm willing to contribute, could you please let me know your thoughts about it?

      1. LUCENE-3586.patch
        18 kB
        Luca Cavanna
      2. LUCENE-3586.patch
        18 kB
        Luca Cavanna
      3. LUCENE-3586.patch
        12 kB
        Luca Cavanna
      4. LUCENE-3586.patch
        10 kB
        Luca Cavanna

        Activity

        Hide
        Simon Willnauer added a comment -

        I think this makes sense though. sometimes like in your usecase MMap just fails with OOM and since we create the directory instance based on constants like 64bit platform, OS etc. we could make this optional. another option would be to simply use NIOFSDirectory no matter where we are. I don't think it is necessary to use FileSwitchDirectory for CheckIndex. Maybe you can add a commandline option and get a specific Dir impl if specified? NIO, MMAP & SimpleFS should be enough and use FSDirectory.open by default.

        Show
        Simon Willnauer added a comment - I think this makes sense though. sometimes like in your usecase MMap just fails with OOM and since we create the directory instance based on constants like 64bit platform, OS etc. we could make this optional. another option would be to simply use NIOFSDirectory no matter where we are. I don't think it is necessary to use FileSwitchDirectory for CheckIndex. Maybe you can add a commandline option and get a specific Dir impl if specified? NIO, MMAP & SimpleFS should be enough and use FSDirectory.open by default.
        Hide
        Simon Willnauer added a comment -

        Luca, I made you a contributor for Lucene & Solr (JIRA Role) so you can assign issues to you.

        Show
        Simon Willnauer added a comment - Luca, I made you a contributor for Lucene & Solr (JIRA Role) so you can assign issues to you.
        Hide
        Luca Cavanna added a comment -

        Cool! Thanks, tomorrow I'll add a patch for this.

        Show
        Luca Cavanna added a comment - Cool! Thanks, tomorrow I'll add a patch for this.
        Hide
        Uwe Schindler added a comment -

        Hi Simon, hi Luca,

        Maybe we get some introduction and background from Luca, because in general issues are assigned to committers/contributors that have already done work for Lucene/Solr?

        Uwe

        Show
        Uwe Schindler added a comment - Hi Simon, hi Luca, Maybe we get some introduction and background from Luca, because in general issues are assigned to committers/contributors that have already done work for Lucene/Solr? Uwe
        Hide
        Uwe Schindler added a comment - - edited

        About the issue at all:
        we should maybe have this parameter to all command line tools (e.g. IndexUpgrader,...). It would be good to have some general command line parser for those comand line tools that work on indexes, so we don't have the same/similar code several times.

        For me much more interesting is the problem you have with MMAP? Whats you Java VM / Lucene version and what's your complete command line to CheckIndex? How big is the index? What architecture are you using and which OS? Changing the chunk size is not needed in most cases, if it needs to we need complete configuration details to optimize auto detection.

        Show
        Uwe Schindler added a comment - - edited About the issue at all: we should maybe have this parameter to all command line tools (e.g. IndexUpgrader,...). It would be good to have some general command line parser for those comand line tools that work on indexes, so we don't have the same/similar code several times. For me much more interesting is the problem you have with MMAP? Whats you Java VM / Lucene version and what's your complete command line to CheckIndex? How big is the index? What architecture are you using and which OS? Changing the chunk size is not needed in most cases, if it needs to we need complete configuration details to optimize auto detection.
        Hide
        Luca Cavanna added a comment -

        Hi Uwe, hi all,
        you're right, I know (virtually) almost every people working at Solr/Lucene, but you of course don't know me!
        I'm Luca, I'm a software developer actually working at Dutchworks in Amsterdam within the enterprise search team. You certainly know my colleague Martijn, Solr/Lucene committer. I really like the open source world and the solr/lucene project. These are my contributions so far: SOLR-1499 and SOLR-2591. Just a few, but I would like to contribute as much as I can. Please,don't hesitate to ask if you have any questions.
        I assigned the issue to myself because I thought I was gonna work on it, but let me know if I made some mistakes.
        Tomorrow I'll add more details about the MMAP and the OOM.
        Thanks for your help and for your point of view about the issue, it would be great to do something more generic than I initially thought, I'll have a look at the code.

        Show
        Luca Cavanna added a comment - Hi Uwe, hi all, you're right, I know (virtually) almost every people working at Solr/Lucene, but you of course don't know me! I'm Luca, I'm a software developer actually working at Dutchworks in Amsterdam within the enterprise search team. You certainly know my colleague Martijn, Solr/Lucene committer. I really like the open source world and the solr/lucene project. These are my contributions so far: SOLR-1499 and SOLR-2591 . Just a few, but I would like to contribute as much as I can. Please,don't hesitate to ask if you have any questions. I assigned the issue to myself because I thought I was gonna work on it, but let me know if I made some mistakes. Tomorrow I'll add more details about the MMAP and the OOM. Thanks for your help and for your point of view about the issue, it would be great to do something more generic than I initially thought, I'll have a look at the code.
        Hide
        Luca Cavanna added a comment - - edited

        Hi Uwe,
        I'm working with lucene 3.4 and jvm 1.6.0_23 64 bit. I have ulimit unlimited and /proc/sys/vm/max_map_count 65530. The index size is 76GB, I have problems with just one segment, the first and definitely the biggest one, on which the fdt is 22GB, tvf is 13GB, prx is 4.2GB, frq is 2.8GB, and so on (others are under 1GB). I tried to run the checkindex with -Xmx512m and -Xms512m. Here is my output:

        Segments file=segments_kj numSegments=31 version=3.4 format=FORMAT_3_1 [Lucene 3.1+]
        1 of 31: name=_1iz docCount=3186709
        compound=false
        hasProx=true
        numFiles=12
        size (MB)=41,902.362
        diagnostics =

        {optimize=true, mergeFactor=23, os.version=2.6.27.59-1POi-x86_64, os=Linux, lucene.version=3.4.0 1167142 - mike - 2011-09-09 09:02:09, source=merge, os.arch=amd64, java.version=1.6.0_23, java.vendor=Sun Microsystems Inc.}

        has deletions [delFileName=_1iz_r.del]
        test: open reader.........FAILED

        Thanks, and please let me know if you have any questions.

        Show
        Luca Cavanna added a comment - - edited Hi Uwe, I'm working with lucene 3.4 and jvm 1.6.0_23 64 bit. I have ulimit unlimited and /proc/sys/vm/max_map_count 65530. The index size is 76GB, I have problems with just one segment, the first and definitely the biggest one, on which the fdt is 22GB, tvf is 13GB, prx is 4.2GB, frq is 2.8GB, and so on (others are under 1GB). I tried to run the checkindex with -Xmx512m and -Xms512m. Here is my output: Segments file=segments_kj numSegments=31 version=3.4 format=FORMAT_3_1 [Lucene 3.1+] 1 of 31: name=_1iz docCount=3186709 compound=false hasProx=true numFiles=12 size (MB)=41,902.362 diagnostics = {optimize=true, mergeFactor=23, os.version=2.6.27.59-1POi-x86_64, os=Linux, lucene.version=3.4.0 1167142 - mike - 2011-09-09 09:02:09, source=merge, os.arch=amd64, java.version=1.6.0_23, java.vendor=Sun Microsystems Inc.} has deletions [delFileName=_1iz_r.del] test: open reader.........FAILED Thanks, and please let me know if you have any questions.
        Hide
        Luca Cavanna added a comment -

        Here is the first version of my patch, please let me know your comments.

        I tried to do something more generic for command line tools and less "static" but I found myself reinventing the wheel, writing stuff similar to commons cli, so I gave up.
        I basically added a new option called -dir-impl to CheckIndex, IndexReader and IndexUpgrader (please let me know if I forgot some other tools). The new option can have the following values (case insensitive): mmap, nio or simple. Any other value is ignored and considered as default (FSDirectory.open).

        I noticed a finally block used to close the directory inside IndexReader, but it's missing on the other two classes, wouldn't it be better to add it to CheckIndex and IndexUpgrader?

        Show
        Luca Cavanna added a comment - Here is the first version of my patch, please let me know your comments. I tried to do something more generic for command line tools and less "static" but I found myself reinventing the wheel, writing stuff similar to commons cli, so I gave up. I basically added a new option called -dir-impl to CheckIndex, IndexReader and IndexUpgrader (please let me know if I forgot some other tools). The new option can have the following values (case insensitive): mmap, nio or simple. Any other value is ignored and considered as default (FSDirectory.open). I noticed a finally block used to close the directory inside IndexReader, but it's missing on the other two classes, wouldn't it be better to add it to CheckIndex and IndexUpgrader?
        Hide
        Luca Cavanna added a comment -

        Forgot to mention that the patch applies to the 3x branch. After your comments, and maybe some changes, I'll attach a patch for the trunk as well.

        Show
        Luca Cavanna added a comment - Forgot to mention that the patch applies to the 3x branch. After your comments, and maybe some changes, I'll attach a patch for the trunk as well.
        Hide
        Simon Willnauer added a comment -

        hey luca, you patch looks good. I wonder if we should maybe add an enum to FSDirectory like

        enum Type {
          NIOFS,
          MMAP,
          SIMPLE;
        
        }
        
        

        and add FSDirectory#open(File, Type). Just an idea, this would make the CommandlineUtils obsolete and I think it could be useful... not sure what others think... uwe?

        Forgot to mention that the patch applies to the 3x branch

        usually we (I) only provide a patch against trunk since we do a merge when we backport so a patch doesn't help very often. So it'd be better if you'd use trunk to create patches.

        simon

        Show
        Simon Willnauer added a comment - hey luca, you patch looks good. I wonder if we should maybe add an enum to FSDirectory like enum Type { NIOFS, MMAP, SIMPLE; } and add FSDirectory#open(File, Type). Just an idea, this would make the CommandlineUtils obsolete and I think it could be useful... not sure what others think... uwe? Forgot to mention that the patch applies to the 3x branch usually we (I) only provide a patch against trunk since we do a merge when we backport so a patch doesn't help very often. So it'd be better if you'd use trunk to create patches. simon
        Hide
        Luca Cavanna added a comment -

        Hey Simon,
        looks better! I thought I wasn't allowed to edit FSDirectory just for a new command line option
        I suppose we don't need also FSDirectory#open(File, LockFactory, Type) because we can't specify the LockFactory from command line, right?

        Show
        Luca Cavanna added a comment - Hey Simon, looks better! I thought I wasn't allowed to edit FSDirectory just for a new command line option I suppose we don't need also FSDirectory#open(File, LockFactory, Type) because we can't specify the LockFactory from command line, right?
        Hide
        Michael McCandless added a comment -

        Hmm, I don't think we should add an enum to FSDir here? Can we simply accept the class name and then just load that class (maybe prefixing oal.store so user doesn't have to type that all the time)?

        Also, can we make it a hard error if the specified name isn't recognized? (Instead of silently falling back to FSDir.open).

        Show
        Michael McCandless added a comment - Hmm, I don't think we should add an enum to FSDir here? Can we simply accept the class name and then just load that class (maybe prefixing oal.store so user doesn't have to type that all the time)? Also, can we make it a hard error if the specified name isn't recognized? (Instead of silently falling back to FSDir.open).
        Hide
        Luca Cavanna added a comment -

        Hmm, I don't think we should add an enum to FSDir here? Can we simply accept the class name and then just load that class (maybe prefixing oal.store so user doesn't have to type that all the time)?

        Also, can we make it a hard error if the specified name isn't recognized? (Instead of silently falling back to FSDir.open).

        That's fine as well. Just a little bit longer than writing NIOFS, MMAP or SIMPLE, but I guess it doesn't matter. Mike, do you mean to load the class using reflection or compare the input string to those three class names?

        Any other opinion?

        Show
        Luca Cavanna added a comment - Hmm, I don't think we should add an enum to FSDir here? Can we simply accept the class name and then just load that class (maybe prefixing oal.store so user doesn't have to type that all the time)? Also, can we make it a hard error if the specified name isn't recognized? (Instead of silently falling back to FSDir.open). That's fine as well. Just a little bit longer than writing NIOFS, MMAP or SIMPLE, but I guess it doesn't matter. Mike, do you mean to load the class using reflection or compare the input string to those three class names? Any other opinion?
        Hide
        Michael McCandless added a comment -

        I think just load the classes by name via reflection? This way if I have my own external Dir impl somewhere I can also have CheckIndex use that...

        Show
        Michael McCandless added a comment - I think just load the classes by name via reflection? This way if I have my own external Dir impl somewhere I can also have CheckIndex use that...
        Hide
        Luca Cavanna added a comment -

        New patch against trunk according to Michael's hints.
        It's now possible to use external FSDirectory implementations. The package oal.store is used if no package is specified. This isn't good if someone has the FSDirectory implementation in the default package, but I'm not sure if this case is worth a fall back. Please, let me know what you think.

        Show
        Luca Cavanna added a comment - New patch against trunk according to Michael's hints. It's now possible to use external FSDirectory implementations. The package oal.store is used if no package is specified. This isn't good if someone has the FSDirectory implementation in the default package, but I'm not sure if this case is worth a fall back. Please, let me know what you think.
        Hide
        Michael McCandless added a comment -

        Thanks Luca; I think this patch is close!

        One thing I noticed is that LuceneTestCase.java (under
        src/test-framework/java/org/apache/lucene/util), has a method
        newDirectoryImpl doing almost the same thing as the new static method
        here. Maybe merge these two (have LTC call the new one, but leave that
        random logic in LTC)?

        Separately, can we put the static method in oal.util? Maybe in a new
        file, CommandLineUtil.java or something?

        Show
        Michael McCandless added a comment - Thanks Luca; I think this patch is close! One thing I noticed is that LuceneTestCase.java (under src/test-framework/java/org/apache/lucene/util), has a method newDirectoryImpl doing almost the same thing as the new static method here. Maybe merge these two (have LTC call the new one, but leave that random logic in LTC)? Separately, can we put the static method in oal.util? Maybe in a new file, CommandLineUtil.java or something?
        Hide
        Luca Cavanna added a comment -

        Good eye! Thanks Michael for your feedback.
        I attached a new patch version. The merge hasn't been so easy since LTC needs sometimes a Directory, while other times a FSDirectory. Furthermore commmand line utilities don't need a fallback while tests do..before the tests we even create the directory on file system etc.
        Let me know what you think!

        Show
        Luca Cavanna added a comment - Good eye! Thanks Michael for your feedback. I attached a new patch version. The merge hasn't been so easy since LTC needs sometimes a Directory, while other times a FSDirectory. Furthermore commmand line utilities don't need a fallback while tests do..before the tests we even create the directory on file system etc. Let me know what you think!
        Hide
        Michael McCandless added a comment -

        Hi Luca,

        Actually I don't like all the generics required in CommandLineUtil solely to support LTC's case; if not for LTC the util would be much simpler (just one method taking a String class name/path and returning a Directory instance).

        So.... I think we should go back, and leave LTC with its current (private) code for finding the FS/Directory class/instance, and keep CommandLineUtil simple? Sorry!

        Otherwise the patch looks great! Can you add the lost " // for javadocs" comment on one of CheckIndex's imports? Thanks.

        Show
        Michael McCandless added a comment - Hi Luca, Actually I don't like all the generics required in CommandLineUtil solely to support LTC's case; if not for LTC the util would be much simpler (just one method taking a String class name/path and returning a Directory instance). So.... I think we should go back, and leave LTC with its current (private) code for finding the FS/Directory class/instance, and keep CommandLineUtil simple? Sorry! Otherwise the patch looks great! Can you add the lost " // for javadocs" comment on one of CheckIndex's imports? Thanks.
        Hide
        Luca Cavanna added a comment -

        Hi Michael,

        Actually I don't like all the generics required in CommandLineUtil solely to support LTC's case; if not for LTC the util would be much simpler (just one method taking a String class name/path and returning a Directory instance).

        I see. But I think we were on the right track. Have a look at the last version, I kept the merged code removing that method with generics. If you don't like even this version I'll go back, no problem at all!

        Can you add the lost " // for javadocs" comment on one of CheckIndex's imports?

        Ops, I restored it.

        Show
        Luca Cavanna added a comment - Hi Michael, Actually I don't like all the generics required in CommandLineUtil solely to support LTC's case; if not for LTC the util would be much simpler (just one method taking a String class name/path and returning a Directory instance). I see. But I think we were on the right track. Have a look at the last version, I kept the merged code removing that method with generics. If you don't like even this version I'll go back, no problem at all! Can you add the lost " // for javadocs" comment on one of CheckIndex's imports? Ops, I restored it.
        Hide
        Uwe Schindler added a comment -

        What is the problem with the generics code? I like it

        Show
        Uwe Schindler added a comment - What is the problem with the generics code? I like it
        Hide
        Michael McCandless added a comment -

        New patch looks great Luca! I'll commit & backport shortly...

        Show
        Michael McCandless added a comment - New patch looks great Luca! I'll commit & backport shortly...
        Hide
        Michael McCandless added a comment -

        Thanks Luca!

        Show
        Michael McCandless added a comment - Thanks Luca!
        Hide
        Luca Cavanna added a comment -

        Cool, thanks to you Michael!

        Show
        Luca Cavanna added a comment - Cool, thanks to you Michael!

          People

          • Assignee:
            Luca Cavanna
            Reporter:
            Luca Cavanna
          • Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development