Lucene - Core
  1. Lucene - Core
  2. LUCENE-1377

Add HTMLStripReader and WordDelimiterFilter from SOLR

    Details

    • Type: Improvement Improvement
    • Status: Resolved
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: 2.3.2
    • Fix Version/s: None
    • Component/s: modules/analysis
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      SOLR has two classes HTMLStripReader and WordDelimiterFilter which are very useful for a wide variety of use cases. It would be good to place them into core Lucene.

        Issue Links

          Activity

          Hide
          Robert Muir added a comment -

          the WordDelimiterFilter in SOLR trunk as it stands today would be a significant benefit to lucene.

          Also, I think there's a very valuable use for an analyzer something like the following:
          WhitespaceTokenizer
          WordDelimiterFilter (default settings)
          LowerCaseFilter

          This simple configuration would provide some much needed functionality to lucene, specifically the ability to index things like Hindi text. Its not perfect and will add some additional nonsense terms for some languages, but in the short term, is much more friendly to a variety of languages where no viable option in lucene exists at all right now.

          Show
          Robert Muir added a comment - the WordDelimiterFilter in SOLR trunk as it stands today would be a significant benefit to lucene. Also, I think there's a very valuable use for an analyzer something like the following: WhitespaceTokenizer WordDelimiterFilter (default settings) LowerCaseFilter This simple configuration would provide some much needed functionality to lucene, specifically the ability to index things like Hindi text. Its not perfect and will add some additional nonsense terms for some languages, but in the short term, is much more friendly to a variety of languages where no viable option in lucene exists at all right now.
          Hide
          Otis Gospodnetic added a comment -

          Could we make this even more generic and say that all basic tokenizers and filters that currently live in Solr should really move to Lucene?

          Show
          Otis Gospodnetic added a comment - Could we make this even more generic and say that all basic tokenizers and filters that currently live in Solr should really move to Lucene?
          Hide
          Robert Muir added a comment -

          i don't really know enough to say that, but this one is especially important (imho)

          without getting into details, the changes Yonik made to WordDelimiterFilter, in combination with WhitespaceTokenizer treat the various unicode categories correctly, unlike any of the other analyzers in lucene.

          No, it doesn't actually use the Word_Break properties which is really the key, but if you toss some Hindi, Bangla, Tibetan, Arabic, Tamil, ... [list goes on very long] text at it, in general it's gonna work pretty well. This is a big improvement over any of the other default options!

          Show
          Robert Muir added a comment - i don't really know enough to say that, but this one is especially important (imho) without getting into details, the changes Yonik made to WordDelimiterFilter, in combination with WhitespaceTokenizer treat the various unicode categories correctly, unlike any of the other analyzers in lucene. No, it doesn't actually use the Word_Break properties which is really the key, but if you toss some Hindi, Bangla, Tibetan, Arabic, Tamil, ... [list goes on very long] text at it, in general it's gonna work pretty well. This is a big improvement over any of the other default options!
          Hide
          Michael McCandless added a comment -

          Robert, would ICUTokenizer (LUCNEE-1488) subsume WordDelimiterFilter? Meaning, are there things WordDelimiterFilter would do that ICUTokenizer would not?

          Show
          Michael McCandless added a comment - Robert, would ICUTokenizer (LUCNEE-1488) subsume WordDelimiterFilter? Meaning, are there things WordDelimiterFilter would do that ICUTokenizer would not?
          Hide
          Robert Muir added a comment -

          they are a bit different.

          for example:

          worddelimiterfilter has functionality to split on case-changes, etc.
          the default rule set for ICUTokenizer does not do this, because its not in the unicode defaults.
          but you could write up a quick .txt file grammar and supply a tailored RuleBasedBreakIterator to ICUTokenizer and do that if you wanted.

          Show
          Robert Muir added a comment - they are a bit different. for example: worddelimiterfilter has functionality to split on case-changes, etc. the default rule set for ICUTokenizer does not do this, because its not in the unicode defaults. but you could write up a quick .txt file grammar and supply a tailored RuleBasedBreakIterator to ICUTokenizer and do that if you wanted.
          Hide
          Grant Ingersoll added a comment -

          Could we make this even more generic and say that all basic tokenizers and filters that currently live in Solr should really move to Lucene?

          I'm for this, but unless I see a linked issue in JIRA with patches for both Lucene and Solr, I am going to be -1. Longer term, it is always going to be the case that there is analysis stuff in Solr that isn't in Lucene unless we make it such that Solr committers can have write access to the Lucene analysis piece.

          Show
          Grant Ingersoll added a comment - Could we make this even more generic and say that all basic tokenizers and filters that currently live in Solr should really move to Lucene? I'm for this, but unless I see a linked issue in JIRA with patches for both Lucene and Solr, I am going to be -1. Longer term, it is always going to be the case that there is analysis stuff in Solr that isn't in Lucene unless we make it such that Solr committers can have write access to the Lucene analysis piece.
          Hide
          Robert Muir added a comment -

          ok, well i was stirring up trouble recommending this filter from Solr to users so they can index hindi, etc in the short term.

          its a pretty good workaround for the missing unicode support in lucene, but hopefully this won't take much longer to fix.

          Show
          Robert Muir added a comment - ok, well i was stirring up trouble recommending this filter from Solr to users so they can index hindi, etc in the short term. its a pretty good workaround for the missing unicode support in lucene, but hopefully this won't take much longer to fix.
          Hide
          Michael McCandless added a comment -

          I agree we'd need a more comprehensive "strategy" to consolidate all
          analyzers in one place. I also think it's important. But it's a
          biggie.

          its a pretty good workaround for the missing unicode support in lucene, but hopefully this won't take much longer to fix.

          I'm having trouble keeping track of the various issues to fix the
          "missing unicode support in lucene". Are there issues opened for all?
          Should we open a consolidated issue for properly handling surrogate
          pairs?

          Show
          Michael McCandless added a comment - I agree we'd need a more comprehensive "strategy" to consolidate all analyzers in one place. I also think it's important. But it's a biggie. its a pretty good workaround for the missing unicode support in lucene, but hopefully this won't take much longer to fix. I'm having trouble keeping track of the various issues to fix the "missing unicode support in lucene". Are there issues opened for all? Should we open a consolidated issue for properly handling surrogate pairs?
          Hide
          Robert Muir added a comment -

          Michael, I only opened one issue: LUCENE-1488...
          When I talk about missing unicode support, I'm not really referring to the surrogate pair issue.
          I'm talking about support for the unicode standard (yes the stuff like breaking text into words and erasing case differences and normalization and what not).

          separately, maybe your confusion is related to the fact that there are a lot of existing jira issues whose root cause is the lack of this functionality. I didn't cause this!

          I consider the surrogate pair issue a separate java 5 migration issue, and I think it is consolidated: LUCENE-1689.

          Show
          Robert Muir added a comment - Michael, I only opened one issue: LUCENE-1488 ... When I talk about missing unicode support, I'm not really referring to the surrogate pair issue. I'm talking about support for the unicode standard (yes the stuff like breaking text into words and erasing case differences and normalization and what not). separately, maybe your confusion is related to the fact that there are a lot of existing jira issues whose root cause is the lack of this functionality. I didn't cause this! I consider the surrogate pair issue a separate java 5 migration issue, and I think it is consolidated: LUCENE-1689 .
          Hide
          Robert Muir added a comment -

          hello, I would like to raise attention to this issue.

          I am currently trying to figure out how to nicely integrate wordnet support into solr (unrelated) when I thought about this again.

          Can we do something to fix this situation? Perhaps move all the Solr tokenstreams into lucene-contrib and make all solr committers lucene contrib committers?

          Personally, I feel if these tokenstreams were in lucene-contrib, it would be easier for me to help improve them, and also easier to make sure we do not make changes to lucene that cause problems for Solr. I am sure I am not the only one, I think there are other lucene committers that would help improve and maintain them.

          Show
          Robert Muir added a comment - hello, I would like to raise attention to this issue. I am currently trying to figure out how to nicely integrate wordnet support into solr (unrelated) when I thought about this again. Can we do something to fix this situation? Perhaps move all the Solr tokenstreams into lucene-contrib and make all solr committers lucene contrib committers? Personally, I feel if these tokenstreams were in lucene-contrib, it would be easier for me to help improve them, and also easier to make sure we do not make changes to lucene that cause problems for Solr. I am sure I am not the only one, I think there are other lucene committers that would help improve and maintain them.
          Hide
          Uwe Schindler added a comment -

          +1, I'll help. And I would also help to convert the last few TokenStreams in Solr that not yet moved to attributes!

          Show
          Uwe Schindler added a comment - +1, I'll help. And I would also help to convert the last few TokenStreams in Solr that not yet moved to attributes!
          Hide
          Yonik Seeley added a comment -

          There are some downsides for Solr: if these are in Lucene, we (solr) lose the ability to easily add new features or fix bugs... mainly because Solr doesn't use Lucene trunk anymore.

          Show
          Yonik Seeley added a comment - There are some downsides for Solr: if these are in Lucene, we (solr) lose the ability to easily add new features or fix bugs... mainly because Solr doesn't use Lucene trunk anymore.
          Hide
          Earwin Burrfoot added a comment -

          Hehehe. There is an upside for Lucene: if new and exciting stuff lingers in trunk for too long, Solr developers will come and complain, which will hopefully make Lucene releases not such rare a sight.

          Show
          Earwin Burrfoot added a comment - Hehehe. There is an upside for Lucene: if new and exciting stuff lingers in trunk for too long, Solr developers will come and complain, which will hopefully make Lucene releases not such rare a sight.
          Hide
          Robert Muir added a comment -

          Yonik, I suppose what I am suggesting is a way to make this easier. Isn't this one of the things hindering adoption of Lucene 3.x in solr?

          I think it is silly that lucene has Pattern-based tokenization, but solr has a separate impl which is better.
          I think it is silly that lucene has synonym support, but solr has a separate impl which is better.
          I think it is silly that lucene has wordnet support, but the right pieces are not exposed so they can be used in solr (for its better synonym support).
          I think it is terrible that people post to the lucene user list asking how to tokenize hindi (or other complex scripts), when whitespace + worddelimiter works very well for the time being.

          I think I could go on and on, but we should remove this duplicated effort and try to keep things simpler.

          For one, I do not want to break things in solr with a lucene update. this is easier if the analysis components are consolidated.

          Show
          Robert Muir added a comment - Yonik, I suppose what I am suggesting is a way to make this easier. Isn't this one of the things hindering adoption of Lucene 3.x in solr? I think it is silly that lucene has Pattern-based tokenization, but solr has a separate impl which is better. I think it is silly that lucene has synonym support, but solr has a separate impl which is better. I think it is silly that lucene has wordnet support, but the right pieces are not exposed so they can be used in solr (for its better synonym support). I think it is terrible that people post to the lucene user list asking how to tokenize hindi (or other complex scripts), when whitespace + worddelimiter works very well for the time being. I think I could go on and on, but we should remove this duplicated effort and try to keep things simpler. For one, I do not want to break things in solr with a lucene update. this is easier if the analysis components are consolidated.
          Hide
          Jake Mannix added a comment -

          There are some downsides for Solr: if these are in Lucene, we (solr) lose the ability to easily add new features or fix bugs... mainly because Solr doesn't use Lucene trunk anymore.

          Who is this "we" of which you speak? If someone using Solr needs something changed in Lucene, a patch submitted to Lucene JIRA should be no more hard to get committed than a patch submitted to the Solr JIRA, no? The release cycle may be slower, but yes, as Earwin says, maybe this just gives another reason for more rapid minor version iteration in Lucene.

          Show
          Jake Mannix added a comment - There are some downsides for Solr: if these are in Lucene, we (solr) lose the ability to easily add new features or fix bugs... mainly because Solr doesn't use Lucene trunk anymore. Who is this "we" of which you speak? If someone using Solr needs something changed in Lucene, a patch submitted to Lucene JIRA should be no more hard to get committed than a patch submitted to the Solr JIRA, no? The release cycle may be slower, but yes, as Earwin says, maybe this just gives another reason for more rapid minor version iteration in Lucene.
          Hide
          Yonik Seeley added a comment -

          Who is this "we" of which you speak?

          I did put solr in parens to clarify that I meant the solr development community.

          If someone using Solr needs something changed in Lucene, a patch submitted to Lucene JIRA should be no more hard to get committed than a patch submitted to the Solr JIRA, no?

          Right, but getting it into Lucene alone doesn't help much if you're using Solr. One would first have to get it into Lucene, and then get that version of Lucene into Solr.

          Show
          Yonik Seeley added a comment - Who is this "we" of which you speak? I did put solr in parens to clarify that I meant the solr development community. If someone using Solr needs something changed in Lucene, a patch submitted to Lucene JIRA should be no more hard to get committed than a patch submitted to the Solr JIRA, no? Right, but getting it into Lucene alone doesn't help much if you're using Solr. One would first have to get it into Lucene, and then get that version of Lucene into Solr.
          Hide
          Yonik Seeley added a comment -

          maybe this just gives another reason for more rapid minor version iteration in Lucene.

          We've always wanted that for both Lucene and Solr. But we've been unable to make that happen in the past, so betting on it going forward seems like a mistake.

          Show
          Yonik Seeley added a comment - maybe this just gives another reason for more rapid minor version iteration in Lucene. We've always wanted that for both Lucene and Solr. But we've been unable to make that happen in the past, so betting on it going forward seems like a mistake.
          Hide
          Jake Mannix added a comment -

          But we've been unable to make that happen in the past, so betting on it going forward seems like a mistake.

          Sure, but keeping code which could be used by any Lucene project sequestered in Solr because you want to be able to modify it on a different time-schedule seems like bad open-source policy.

          I guess what I'd change my statement to is that yes, it would be a downside to the Solr community (anytime you depend on another project, you also depend on its release schedule). But I'd still hold that this fact alone should not keep code out of Lucene-core. Anything which isn't Solr-specific should belong deeper in, to allow use by more people, otherwise only Solr users get the benefit.

          Show
          Jake Mannix added a comment - But we've been unable to make that happen in the past, so betting on it going forward seems like a mistake. Sure, but keeping code which could be used by any Lucene project sequestered in Solr because you want to be able to modify it on a different time-schedule seems like bad open-source policy. I guess what I'd change my statement to is that yes, it would be a downside to the Solr community (anytime you depend on another project, you also depend on its release schedule). But I'd still hold that this fact alone should not keep code out of Lucene-core. Anything which isn't Solr-specific should belong deeper in, to allow use by more people, otherwise only Solr users get the benefit.
          Hide
          Michael McCandless added a comment -

          Yonik, could Solr pull only Lucene's contrib analyzers JAR, frequently?

          Would it help if we also consolidated all of Lucene's core analyzers into contrib/analyzers, so there'd be a single place for all analyzers?

          Show
          Michael McCandless added a comment - Yonik, could Solr pull only Lucene's contrib analyzers JAR, frequently? Would it help if we also consolidated all of Lucene's core analyzers into contrib/analyzers, so there'd be a single place for all analyzers?
          Hide
          Robert Muir added a comment -

          Hi, maybe we could simplify some things here if we were to refine this issue as its described: maybe not all tokenstreams but just some.

          Maybe just tokenstreams considered to be more 'core/stable' by solr should belong in lucene-contrib?

          I would think word delimiter filter at least, maybe html strip and synonyms too would belong in this category. maybe not the pattern tokenizer even though I think its duplicated effort... at least this would be a start?

          Show
          Robert Muir added a comment - Hi, maybe we could simplify some things here if we were to refine this issue as its described: maybe not all tokenstreams but just some. Maybe just tokenstreams considered to be more 'core/stable' by solr should belong in lucene-contrib? I would think word delimiter filter at least, maybe html strip and synonyms too would belong in this category. maybe not the pattern tokenizer even though I think its duplicated effort... at least this would be a start?
          Hide
          Robert Muir added a comment -

          Would it help if we also consolidated all of Lucene's core analyzers into contrib/analyzers, so there'd be a single place for all analyzers?

          +1 for consolidating lucene analyzers, including snowball, all into one pkg so we can remove further duplication. Simon has mentioned this already before.

          Show
          Robert Muir added a comment - Would it help if we also consolidated all of Lucene's core analyzers into contrib/analyzers, so there'd be a single place for all analyzers? +1 for consolidating lucene analyzers, including snowball, all into one pkg so we can remove further duplication. Simon has mentioned this already before.
          Hide
          Yonik Seeley added a comment -

          Anything which isn't Solr-specific should belong deeper in, to allow use by more people, otherwise only Solr users get the benefit.

          That logic only works if the lucene and solr development communities were essentially one. They currently aren't.
          The downsides of the lucene/solr divide works both ways:

          A solr developer adds something to solr w/o putting it in lucene - only solr users benefit.
          A lucene developer adds something to lucene w/o adding support for that in solr - only lucene users benefit.

          Show
          Yonik Seeley added a comment - Anything which isn't Solr-specific should belong deeper in, to allow use by more people, otherwise only Solr users get the benefit. That logic only works if the lucene and solr development communities were essentially one. They currently aren't. The downsides of the lucene/solr divide works both ways: A solr developer adds something to solr w/o putting it in lucene - only solr users benefit. A lucene developer adds something to lucene w/o adding support for that in solr - only lucene users benefit.
          Hide
          Jake Mannix added a comment -

          A solr developer adds something to solr w/o putting it in lucene - only solr users benefit.

          A lucene developer adds something to lucene w/o adding support for that in solr - only lucene users benefit.

          I agree with the first, but not the second: solr users benefit on the next release cycle of solr which includes the lucene changes. Your statements here only apply if there wasn't a dependency of solr on lucene.

          Show
          Jake Mannix added a comment - A solr developer adds something to solr w/o putting it in lucene - only solr users benefit. A lucene developer adds something to lucene w/o adding support for that in solr - only lucene users benefit. I agree with the first, but not the second: solr users benefit on the next release cycle of solr which includes the lucene changes. Your statements here only apply if there wasn't a dependency of solr on lucene.
          Hide
          Yonik Seeley added a comment -

          solr users benefit on the next release cycle of solr which includes the lucene changes.

          Not really... Many new lucene features require quite a bit of work in solr-land to make them fully integrated and usable.
          It's also not on the next release cycle of Solr either - you would be surprised the number of people that use solr-trunk (even in production), and that serves to help along solr development.

          Show
          Yonik Seeley added a comment - solr users benefit on the next release cycle of solr which includes the lucene changes. Not really... Many new lucene features require quite a bit of work in solr-land to make them fully integrated and usable. It's also not on the next release cycle of Solr either - you would be surprised the number of people that use solr-trunk (even in production), and that serves to help along solr development.
          Hide
          Michael McCandless added a comment -

          Here's a semi-concrete proposal: how about we plan to move all
          analyzers (Solr's, Lucene's core analyzers, snowball,
          contrib/analyzers) into one place.

          We could probably just use Lucene's existing contrib/analyzers as that
          place.

          We then change Solr's checkout/build/ship process to pull directly
          from contrib/analyzers. So when I checkout Solr, I get Lucene's
          contrib analyzers and test against that. We also fix Lucene's build
          scripts – we'd have to build analyzers first, and make it available
          for core tests.

          Any changes to contrib/analyzers must pass both Lucene's and Solr's
          unit tests before being committed, which is great because it also
          means more/better test coverage for all analyzers changes.

          We may have some issues with someone being a committer on one project
          but not another, but we can take those up on a case by case basis.
          The wost case is we post a patch to a Solr or Lucene issue and a
          committer picks it up, which would be fine.

          This will require some one-time effort – fixing the ant build scripts
          for both Solr and Lucene, doing the initial move, etc. I'm happy to
          help out, but will probably need help with ant

          We could even promote contrib/analyzers to its own sub-project, but
          I think that's probably overkill for now.

          Could something like this work?

          Show
          Michael McCandless added a comment - Here's a semi-concrete proposal: how about we plan to move all analyzers (Solr's, Lucene's core analyzers, snowball, contrib/analyzers) into one place. We could probably just use Lucene's existing contrib/analyzers as that place. We then change Solr's checkout/build/ship process to pull directly from contrib/analyzers. So when I checkout Solr, I get Lucene's contrib analyzers and test against that. We also fix Lucene's build scripts – we'd have to build analyzers first, and make it available for core tests. Any changes to contrib/analyzers must pass both Lucene's and Solr's unit tests before being committed, which is great because it also means more/better test coverage for all analyzers changes. We may have some issues with someone being a committer on one project but not another, but we can take those up on a case by case basis. The wost case is we post a patch to a Solr or Lucene issue and a committer picks it up, which would be fine. This will require some one-time effort – fixing the ant build scripts for both Solr and Lucene, doing the initial move, etc. I'm happy to help out, but will probably need help with ant We could even promote contrib/analyzers to its own sub-project, but I think that's probably overkill for now. Could something like this work?
          Hide
          Grant Ingersoll added a comment -

          (kind of rambling, but...)
          We probably should also include Nutch. I think they have their own analyzers too.

          While I think it is reasonable to consolidate Analyzers, it is a slippery slope (next up Function queries - Solr users love function queries and they get a lot of love both from users and devs while Lucene's function queries languish and we're never brought over to Solr b/c no one did the work and Solr's moved on as well-, then faceting, then the schema, ZooKeeper integration etc.). One could easily make the case that all of Solr should be a part of Lucene or that all the guts of Solr should be pulled into Lucene other than the "scaffolding" part at which point Solr simply becomes the Lucene search server. At the same time, you could make the case that Solr could be it's own TLP: Solr often has different goals from Lucene, not to mention different release cycles, etc. (for instance, we don't ask Compass, DBSight, et. al. to donate back their analyzers, right?) Besides the fact that there are many things that are just easier to do in Solr b/c it provides the framework around Lucene that ALL OF US HAVE BUILT in one form or another over the years. That being said, Solr and Lucene have a special relationship in that they both fly under the Lucene flag and many Solr committers are also Lucene committers, so we should try to coordinate a bit more, while maintaining independence. As I've said before, though, it often seems like a one way street for those Lucene committers who are not Solr users. Namely, they grab stuff from Solr and put it into Lucene, but then they don't make Solr "whole" again (function queries are "exhibit A").

          At the same time, when I see talk of Lucene adding schema like features or other stuff that is already in Solr and that I don't think belongs in Lucene, I think "why not just use Solr", yet the Lucene community, at times seems intent on duplication, too, so the knife cuts both ways, as they say.

          At any rate, the reason I'm pro Analyzer consolidation (I could even see it being a standalone sub-project) is that Analyzers are useful in their own right outside of search all together. For instance, Mahout currently has a dep. on both core and contrib/analyzers for the sole purpose of using the Analyzers (well, we also have a utility dependency on Lucene core to create feature vectors from a Lucene index, but the core has a dep on Analyzers). So, perhaps, the PMC and broader community should take this up on general@ and we should create an Analyzers project and all Lucene ecosystem committers have commit writes on it WITH THE VERY RESTRICTIVE approach that changes need to be tested across projects. Of course, this will, as Yonik points out, become a bottleneck for each project and it is far from perfect too. However, we could still allow projects to create their own and then they get promoted when deemed useful by others.

          On the other hand, I kind of am not thrilled about a whole other subproject with it's own lists, JIRA, etc. just for Analyzers. Maybe we could just have separate SVN (so that we can change commit access, builds, etc.) and java-dev is still the place for discussion. Then each project could have it's dependency on that code as needed and the Analyzers code could be released separately, etc. So, pretty much a full project, but not the overhead.

          Show
          Grant Ingersoll added a comment - (kind of rambling, but...) We probably should also include Nutch. I think they have their own analyzers too. While I think it is reasonable to consolidate Analyzers, it is a slippery slope (next up Function queries - Solr users love function queries and they get a lot of love both from users and devs while Lucene's function queries languish and we're never brought over to Solr b/c no one did the work and Solr's moved on as well-, then faceting, then the schema, ZooKeeper integration etc.). One could easily make the case that all of Solr should be a part of Lucene or that all the guts of Solr should be pulled into Lucene other than the "scaffolding" part at which point Solr simply becomes the Lucene search server. At the same time, you could make the case that Solr could be it's own TLP: Solr often has different goals from Lucene, not to mention different release cycles, etc. (for instance, we don't ask Compass, DBSight, et. al. to donate back their analyzers, right?) Besides the fact that there are many things that are just easier to do in Solr b/c it provides the framework around Lucene that ALL OF US HAVE BUILT in one form or another over the years. That being said, Solr and Lucene have a special relationship in that they both fly under the Lucene flag and many Solr committers are also Lucene committers, so we should try to coordinate a bit more, while maintaining independence. As I've said before, though, it often seems like a one way street for those Lucene committers who are not Solr users. Namely, they grab stuff from Solr and put it into Lucene, but then they don't make Solr "whole" again (function queries are "exhibit A"). At the same time, when I see talk of Lucene adding schema like features or other stuff that is already in Solr and that I don't think belongs in Lucene, I think "why not just use Solr", yet the Lucene community, at times seems intent on duplication, too, so the knife cuts both ways, as they say. At any rate, the reason I'm pro Analyzer consolidation (I could even see it being a standalone sub-project) is that Analyzers are useful in their own right outside of search all together. For instance, Mahout currently has a dep. on both core and contrib/analyzers for the sole purpose of using the Analyzers (well, we also have a utility dependency on Lucene core to create feature vectors from a Lucene index, but the core has a dep on Analyzers). So, perhaps, the PMC and broader community should take this up on general@ and we should create an Analyzers project and all Lucene ecosystem committers have commit writes on it WITH THE VERY RESTRICTIVE approach that changes need to be tested across projects. Of course, this will, as Yonik points out, become a bottleneck for each project and it is far from perfect too. However, we could still allow projects to create their own and then they get promoted when deemed useful by others. On the other hand, I kind of am not thrilled about a whole other subproject with it's own lists, JIRA, etc. just for Analyzers. Maybe we could just have separate SVN (so that we can change commit access, builds, etc.) and java-dev is still the place for discussion. Then each project could have it's dependency on that code as needed and the Analyzers code could be released separately, etc. So, pretty much a full project, but not the overhead.
          Hide
          Robert Muir added a comment -

          As I've said before, though, it often seems like a one way street for those Lucene committers who are not Solr users. Namely, they grab stuff from Solr and put it into Lucene, but then they don't make Solr "whole" again (function queries are "exhibit A").

          Really? we don't add any analysis capabilities to lucene that solr uses too?

          Show
          Robert Muir added a comment - As I've said before, though, it often seems like a one way street for those Lucene committers who are not Solr users. Namely, they grab stuff from Solr and put it into Lucene, but then they don't make Solr "whole" again (function queries are "exhibit A"). Really? we don't add any analysis capabilities to lucene that solr uses too?
          Hide
          Grant Ingersoll added a comment -

          Really? we don't add any analysis capabilities to lucene that solr uses too?

          Yes, but Solr has a dependency on Lucene, not the other way around. Solr is, by definition, at a higher level than Lucene. In order for Lucene to use something of Solr's, it has to, essentially, fork the code. It's happened several times where stuff was pulled out of Solr and put in Lucene, but then Solr wasn't updated to use it or it was updated due to Solr undertaking a fair amount of work to then use the exact same feature it had in it's own code base that Lucene then added. Since Solr has the dep. on Lucene, it's natural Solr takes advantage of what Lucene has to offer, just like any other project that uses Lucene.

          Like I said, though, it may make sense for analysis to be separate. I was just pointing out it is a slippery slope.

          Show
          Grant Ingersoll added a comment - Really? we don't add any analysis capabilities to lucene that solr uses too? Yes, but Solr has a dependency on Lucene, not the other way around. Solr is, by definition, at a higher level than Lucene. In order for Lucene to use something of Solr's, it has to, essentially, fork the code. It's happened several times where stuff was pulled out of Solr and put in Lucene, but then Solr wasn't updated to use it or it was updated due to Solr undertaking a fair amount of work to then use the exact same feature it had in it's own code base that Lucene then added. Since Solr has the dep. on Lucene, it's natural Solr takes advantage of what Lucene has to offer, just like any other project that uses Lucene. Like I said, though, it may make sense for analysis to be separate. I was just pointing out it is a slippery slope.
          Hide
          Robert Muir added a comment -

          ok i am sorry, i didnt realize what you are saying.

          taking stuff from solr and putting in lucene without fixing solr to use it from lucene is terrible, we should not do this.

          i think we want to remove, not create duplication.

          Show
          Robert Muir added a comment - ok i am sorry, i didnt realize what you are saying. taking stuff from solr and putting in lucene without fixing solr to use it from lucene is terrible, we should not do this. i think we want to remove, not create duplication.
          Hide
          Grant Ingersoll added a comment -

          i think we want to remove, not create duplication.

          I think we all agree on that. Alas, though, the devil is in the details and that's where it always seems to get hung up. Not saying it can't work (I've often been an advocate for it), just saying we've gone around on this a number of times and I think it gets hung up on the fact that the two communities are fairly independent with the exception of a few core committers.

          Show
          Grant Ingersoll added a comment - i think we want to remove, not create duplication. I think we all agree on that. Alas, though, the devil is in the details and that's where it always seems to get hung up. Not saying it can't work (I've often been an advocate for it), just saying we've gone around on this a number of times and I think it gets hung up on the fact that the two communities are fairly independent with the exception of a few core committers.
          Hide
          Robert Muir added a comment -

          one thing we could do, is to just have a general rule that we should not copy stuff like this and instead it should be moved, with tests and back compat and all that working on both projects.

          Show
          Robert Muir added a comment - one thing we could do, is to just have a general rule that we should not copy stuff like this and instead it should be moved, with tests and back compat and all that working on both projects.
          Hide
          Grant Ingersoll added a comment -

          one thing we could do, is to just have a general rule that we should not copy stuff like this and instead it should be moved, with tests and back compat and all that working on both projects.

          Yeah, and we can probably take this on a more case by case basis, but it still is creating extra work for Solr committers w/ very little benefit to the project. Not a big deal for the analyzers stuff, since Solr has that process mostly automated anyway, but may be a bigger issue for other stuff.

          So, if we go with Mike's proposal and make Lucene core have a dep on a new Analyzers module, then this could work, but even that I'm not sure about, as Solr is not on Lucene 3.x yet (and doesn't have immediate plans for it either). At any rate, let's get concrete w/ a patch.

          Show
          Grant Ingersoll added a comment - one thing we could do, is to just have a general rule that we should not copy stuff like this and instead it should be moved, with tests and back compat and all that working on both projects. Yeah, and we can probably take this on a more case by case basis, but it still is creating extra work for Solr committers w/ very little benefit to the project. Not a big deal for the analyzers stuff, since Solr has that process mostly automated anyway, but may be a bigger issue for other stuff. So, if we go with Mike's proposal and make Lucene core have a dep on a new Analyzers module, then this could work, but even that I'm not sure about, as Solr is not on Lucene 3.x yet (and doesn't have immediate plans for it either). At any rate, let's get concrete w/ a patch.
          Hide
          Mark Miller added a comment -

          with the exception of a few core committers.

          I think the exception is the other way around, especially considering Lucene contrib. Lets look at the Solr list (and consider some are not very active in Solr currently)

          name status
          Bill Au  
          Doug Cutting Lucene Core Committer
          Otis Gospodnetić Lucene Core Committer
          Erik Hatcher Lucene Core Committer
          Chris Hostetter Lucene Core Committer
          Grant Ingersoll Lucene Core Committer
          Mike Klaas  
          Shalin Shekhar Mangar  
          Ryan McKinley Lucene Contrib Committer
          Mark Miller Lucene Core Committer
          Noble Paul  
          Yonik Seeley Lucene Core Committer
          Koji Sekiguchi Lucene Contrib Committer
          Show
          Mark Miller added a comment - with the exception of a few core committers. I think the exception is the other way around, especially considering Lucene contrib. Lets look at the Solr list (and consider some are not very active in Solr currently) name status Bill Au   Doug Cutting Lucene Core Committer Otis Gospodnetić Lucene Core Committer Erik Hatcher Lucene Core Committer Chris Hostetter Lucene Core Committer Grant Ingersoll Lucene Core Committer Mike Klaas   Shalin Shekhar Mangar   Ryan McKinley Lucene Contrib Committer Mark Miller Lucene Core Committer Noble Paul   Yonik Seeley Lucene Core Committer Koji Sekiguchi Lucene Contrib Committer
          Hide
          Robert Muir added a comment - - edited

          as Solr is not on Lucene 3.x yet (and doesn't have immediate plans for it either). At any rate, let's get concrete w/ a patch.

          then in this case, in my opinion this would be one strategy (for the word delimiter case as an example):

          • we should fix the word delimiter filter in solr itself to use the new lucene tokenstream API. I think we can start working on this now?
          • we must wait until solr has moved to lucene 3.x, otherwise this will be a code copy and a bad idea
          • at this point we create an issue in each project linked together, with the code changes on each side. Once everyone agrees on the patch on each side, the movement, commit, jar files update in solr should be as atomic as possible.
          Show
          Robert Muir added a comment - - edited as Solr is not on Lucene 3.x yet (and doesn't have immediate plans for it either). At any rate, let's get concrete w/ a patch. then in this case, in my opinion this would be one strategy (for the word delimiter case as an example): we should fix the word delimiter filter in solr itself to use the new lucene tokenstream API. I think we can start working on this now? we must wait until solr has moved to lucene 3.x, otherwise this will be a code copy and a bad idea at this point we create an issue in each project linked together, with the code changes on each side. Once everyone agrees on the patch on each side, the movement, commit, jar files update in solr should be as atomic as possible.
          Hide
          Robert Muir added a comment -

          fixed as of rev 940781.

          Show
          Robert Muir added a comment - fixed as of rev 940781.

            People

            • Assignee:
              Unassigned
              Reporter:
              Jason Rutherglen
            • Votes:
              3 Vote for this issue
              Watchers:
              0 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Time Tracking

                Estimated:
                Original Estimate - 24h
                24h
                Remaining:
                Remaining Estimate - 24h
                24h
                Logged:
                Time Spent - Not Specified
                Not Specified

                  Development