Lucene - Core / LUCENE-2950

Modules under top-level modules/ directory should be included in lucene's build targets, e.g. 'package-tgz', 'package-tgz-src', and 'javadocs'

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Duplicate
    • Affects Version/s: 4.0-ALPHA
    • Fix Version/s: 4.0-ALPHA
    • Component/s: general/build
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Lucene's top level modules/ directory is not included in the binary or source release distribution Ant targets package-tgz and package-tgz-src, or in javadocs, in lucene/build.xml. (However, these targets do include Lucene contribs.)

      This issue is visible via the nightly Jenkins (formerly Hudson) job named "Lucene-trunk", which publishes binary and source artifacts, using package-tgz and package-tgz-src, as well as javadocs using the javadocs target, all run from the top-level lucene/ directory.

        Activity

        Steve Rowe created issue -
        Steve Rowe made changes -
        Description: corrected the typo "sourse" to "source"; no other changes.
        Robert Muir added a comment -

        This is a big problem: the source.tgz artifact that is produced will not even compile, because unfortunately some contribs depend on /modules.

        Ideally we could remove these dependencies, though.

        Steve Rowe made changes -
        Summary: changed from "Lucene's build targets 'package-tgz' and 'package-tgz-src' don't include anything under top-level modules/ directory" to "Modules under top-level modules/ directory should be included in lucene's build targets, e.g. 'package-tgz', 'package-tgz-src', and 'javadocs'"
        Description: extended to cover the javadocs target in addition to package-tgz and package-tgz-src.
        Steve Rowe added a comment -

        "ideally we could remove these dependencies though."

        How would this work? E.g. many contribs depend on the common-analyzers module. Removing this dependency would almost certainly make the contribs non-functional.

        Maybe you mean, we should move contribs with modules/ dependencies into modules/?

        Robert Muir added a comment -

        "How would this work? E.g. many contribs depend on the common-analyzers module. Removing this dependency would almost certainly make the contribs non-functional."

        The dependency is mostly bogus. Here are the contribs in question:

        • ant
        • demo
        • lucli
        • misc
        • spellchecker
        • swing
        • wordnet

        For example the ant IndexTask only depends on this so it can make this hashmap:

            static {
              analyzerLookup.put("simple", SimpleAnalyzer.class.getName());
              analyzerLookup.put("standard", StandardAnalyzer.class.getName());
              analyzerLookup.put("stop", StopAnalyzer.class.getName());
              analyzerLookup.put("whitespace", WhitespaceAnalyzer.class.getName());
            }
        

        I think we could remove this: it already has reflection code to build the analyzer, so if you supply "Xyz", why not just look for XyzAnalyzer as a fallback?
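
A minimal sketch of the fallback idea above, not the actual IndexTask code. The class name `AnalyzerNameResolver`, its constructor arguments, and the search packages are all hypothetical; the sketch only shows how a short name could be turned into a class by naming convention instead of a hard-coded map.

```java
import java.util.List;
import java.util.Locale;
import java.util.Optional;

/** Hypothetical helper: resolves a short name to a class by naming convention. */
final class AnalyzerNameResolver {
    private final List<String> packages; // packages to search, supplied by the caller
    private final String suffix;         // e.g. "Analyzer"

    AnalyzerNameResolver(List<String> packages, String suffix) {
        this.packages = packages;
        this.suffix = suffix;
    }

    /** Tries the name as a fully qualified class first, then package + Capitalized + suffix. */
    Optional<Class<?>> resolve(String name) {
        Optional<Class<?>> exact = load(name);
        if (exact.isPresent() || name.isEmpty()) {
            return exact;
        }
        String simple = name.substring(0, 1).toUpperCase(Locale.ROOT) + name.substring(1) + suffix;
        for (String pkg : packages) {
            Optional<Class<?>> found = load(pkg + "." + simple);
            if (found.isPresent()) {
                return found;
            }
        }
        return Optional.empty();
    }

    /** Loads a class by name, treating any lookup failure as "not found". */
    private static Optional<Class<?>> load(String className) {
        try {
            return Optional.of(Class.forName(className));
        } catch (ClassNotFoundException | LinkageError e) {
            return Optional.empty();
        }
    }
}
```

With the suffix "Analyzer" and the analyzer packages on the search list, a name such as "whitespace" would resolve to a class named WhitespaceAnalyzer if one is on the classpath, so the task itself would need no compile-time reference to any analyzer.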

        The lucli code has 'StandardAnalyzer' as a default: I think it's best to not have a default analyzer at all. I would have fixed this already, but this contrib module has no tests! This makes it hard to want to get in there and clean up.

        The misc code mostly supplies an Analyzer inside embedded tools that don't actually analyze anything. We could add a pkg-private NullAnalyzer that throws UOE (UnsupportedOperationException) on its tokenStream(). Since these tools shouldn't be analyzing anything, that seems reasonable to do?
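
A minimal sketch of the NullAnalyzer idea above. To keep it compilable on its own, `AnalyzerStandIn` is a hypothetical stand-in for Lucene's Analyzer class reduced to a single method with an `Object` return type; a real version would extend Analyzer and return a TokenStream.

```java
import java.io.Reader;

/** Hypothetical stand-in for Lucene's Analyzer so the sketch compiles without Lucene. */
abstract class AnalyzerStandIn {
    abstract Object tokenStream(String fieldName, Reader reader);
}

/** Analyzer for tools that must pass an analyzer along but should never analyze text. */
final class NullAnalyzer extends AnalyzerStandIn {
    @Override
    Object tokenStream(String fieldName, Reader reader) {
        // Failing loudly exposes any code path that analyzes text unexpectedly.
        throw new UnsupportedOperationException(
            "NullAnalyzer cannot analyze field '" + fieldName + "'");
    }
}
```

The design choice is to fail fast: a tool that is believed never to analyze text would surface an exception immediately if that belief turned out to be wrong, rather than silently using some default analysis.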

        The spellchecker code has a hardcoded WhitespaceAnalyzer... why is this? Seems like the whole spellchecking n-gramming is wrong anyway. Spellchecker uses a special form of n-gramming that depends upon the word length. Currently it does this in Java code and indexes with WhitespaceAnalyzer (creating a lot of garbage in the process, e.g. lots of Field objects), but it seems this could all be cleaned up so that the spellchecker uses its own SpellCheckNgramAnalyzer, for better performance to boot.
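
To illustrate what n-gramming that depends upon the word length can look like, here is a small sketch. The class name `LengthDependentNgrams` is hypothetical, and the length thresholds are assumptions chosen for illustration; they are not taken from this issue.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: the n-gram sizes are chosen from the length of the word. */
final class LengthDependentNgrams {
    // Assumed thresholds: longer words get longer grams.
    static int minGram(int wordLength) {
        return wordLength > 5 ? 3 : (wordLength == 5 ? 2 : 1);
    }

    static int maxGram(int wordLength) {
        return wordLength > 5 ? 4 : (wordLength == 5 ? 3 : 2);
    }

    /** Returns every substring of the word whose length lies in [minGram, maxGram]. */
    static List<String> ngrams(String word) {
        List<String> grams = new ArrayList<>();
        int len = word.length();
        for (int n = minGram(len); n <= maxGram(len); n++) {
            for (int start = 0; start + n <= len; start++) {
                grams.add(word.substring(start, start + n));
            }
        }
        return grams;
    }
}
```

Under these assumed thresholds, a six-letter word such as "lucene" yields its 3-grams followed by its 4-grams. Moving this logic into a dedicated analyzer, as the comment suggests, would let the grams be produced as a token stream instead of as separately constructed fields.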

        The swing code defaults to a WhitespaceAnalyzer... in my opinion again it's best to not have a default analyzer and make the user somehow specify one.

        The wordnet code uses StandardAnalyzer for indexing the wordnet database. It also includes a very limited SynonymTokenFilter. In my opinion, now that we merged the SynonymTokenizer from Solr that supports multi-word synonyms etc (which this wordnet module DOES NOT!), we should nuke this whole thing.

        Instead, we should make the synonym-loading process more flexible, so that one can produce the SynonymMap from various formats (such as the existing Solr format, a relational database, wordnet's format, or OpenOffice thesaurus format, among others). We could have parsers for these various formats. This would allow us to have a much more powerful synonym capability that works nicely regardless of format. We could then look at other improvements, such as allowing SynonymFilter to use a more RAM-conscious data structure for its synonym mappings (e.g. FST), and everyone would see the benefits. So hopefully this entire contrib could be deprecated.
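
A rough sketch of the "one parser per format" idea above. The interface `SynonymSource` and the class `CommaSeparatedSynonyms` are hypothetical names, and a plain `Map` stands in for SynonymMap. The parser is loosely modeled on the comma-separated lines of the Solr synonyms format and deliberately omits explicit `=>` mappings.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

/** Hypothetical format-neutral contract: every parser yields the same term-to-synonyms map. */
interface SynonymSource {
    Map<String, Set<String>> load(Reader in) throws IOException;
}

/** Parses lines of comma-separated equivalent terms, e.g. "tv, television". */
final class CommaSeparatedSynonyms implements SynonymSource {
    @Override
    public Map<String, Set<String>> load(Reader in) throws IOException {
        Map<String, Set<String>> map = new TreeMap<>();
        BufferedReader lines = new BufferedReader(in);
        for (String line = lines.readLine(); line != null; line = lines.readLine()) {
            line = line.trim();
            if (line.isEmpty() || line.startsWith("#")) {
                continue; // skip blank lines and comments
            }
            String[] terms = line.split("\\s*,\\s*");
            // Every term on a line is a synonym of every other term on that line.
            for (String term : terms) {
                for (String other : terms) {
                    if (!term.equals(other)) {
                        map.computeIfAbsent(term, k -> new LinkedHashSet<>()).add(other);
                    }
                }
            }
        }
        return map;
    }
}
```

A WordNet or thesaurus parser would implement the same interface, so the code that consumes the map would not need to know which format the synonyms came from. Because terms are split only on commas, multi-word entries such as "new york" stay intact.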

        Robert Muir added a comment -

        just following up: the only thing in lucene reaching back into modules right now is contrib/demo...

        Chris Male added a comment -

        The xml-query-parser demo also reaches back to StandardAnalyzer. Does this get included in the packaging?

        Robert Muir added a comment -

        This is related to LUCENE-2999 (or a subset of it)

        Simon Willnauer added a comment -

        this is a duplicate of LUCENE-2999 right?

        Steve Rowe added a comment -

        "this is a duplicate of LUCENE-2999 right?"

        vice-versa, but yes.

        Robert Muir added a comment -

        Out of that huge list, the only contrib still reaching back into /modules is contrib/demo.

        Steve Rowe added a comment -

        "Out of that huge list, the only contrib still reaching back into /modules is contrib/demo."

        Maybe lucene/contrib/demo/ should be moved to modules/demo/?

        Also, the javadocs-all target in lucene/build.xml depends on three modules: modules/queryparser/, modules/analysis/common/, and modules/queries/.

        Robert Muir added a comment -

        "Maybe lucene/contrib/demo/ should be moved to modules/demo/?"

        I don't remember which issue the discussions were on, but somewhere we discussed improving the way we do this,
        such that we have something like an 'examples' module, which could have various examples showing how to use lucene/solr,
        including modules (e.g. faceting).

        I think it's nice to have little examples in javadocs, but at the same time it would be really nice to have
        more fleshed-out examples which can span across different use-cases, for example faceting and grouping combined or whatever.

        As a real module of course we enforce we don't break these (unlike javadocs examples: which really I think should not
        grow too large but just be easy demonstrations of the basics), and we can even have tests that they are also
        example-ing correctly (the demo stuff does have some tests).

        So my first idea would be a top-level examples/... in fact I don't even think it should be rooted underneath modules/.
        Examples are for new users and they shouldn't have to dig around to find them.

        Steve Rowe added a comment -

        Interesting - I assume, though, that solr/example/ would stay where it is?

        I like the idea of continuously testing examples.

        "I don't remember which issue the discussions were on, but somewhere we discussed improving the way we do this, such that we have something like an 'examples' module, which could have various examples showing how to use lucene/solr, including modules (e.g. faceting)."

        I think you're thinking of LUCENE-3550.

        Robert Muir added a comment -

        "I think you're thinking of LUCENE-3550."

        That's the one, though it does seem to hint at examples being within each module.

        I am proposing instead that we have examples that can show how the different parts and pieces work together,
        "big picture", which I think is important as 4.0 is already a more modular architecture than 3.x.

        "Interesting - I assume, though, that solr/example/ would stay where it is?"

        I have no opinion for now. I'd say leave it for the time being, there is already lots of documentation
        and stuff about how to use this in the place where it is now. I also think this is less of an 'example'
        and more of a 'defaults'... whenever I say this people argue with me, but let's face reality.

        But from the lucene side, as mentioned on LUCENE-3550, lots of APIs have changed, things have been modularized
        and refactored and moved around, and when lucene 4.x comes out (unless people are working really hard in secret
        on this and I don't know about it), we won't have any "lucene in action" book for 4.x (to my knowledge) to shove
        the responsibility onto... I sorta feel that's how users have dealt with this lack of documentation in the past.

        Examples are like the least-maintained part of lucene, so I don't have any solutions for 'hey how do we get people
        excited about writing some nice clean examples', but it would be good to at least set things up so people can
        do this easily without refactoring the build system.

        Robert Muir added a comment -

        Fixed in LUCENE-3965

        Robert Muir made changes -
        Status: Open [ 1 ] -> Resolved [ 5 ]
        Resolution: Duplicate [ 3 ]
        Uwe Schindler made changes -
        Status: Resolved [ 5 ] -> Closed [ 6 ]

        Transition           Time In Source Status   Execution Times   Last Executer   Last Execution Date
        Open -> Resolved     411d 15h 43m            1                 Robert Muir     19/Apr/12 07:12
        Resolved -> Closed   386d 4h 31m             1                 Uwe Schindler   10/May/13 11:43

          People

          • Assignee: Unassigned
          • Reporter: Steve Rowe
          • Votes: 0
          • Watchers: 1