Details

    • Type: Task Task
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.0
    • Component/s: modules/analysis
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      I looked into clover analysis: It seems to be no longer used since I removed the tests yesterday - I am happy!

      1. LUCENE-1946.patch
        41 kB
        Uwe Schindler
      2. LUCENE-1946.patch
        89 kB
        Uwe Schindler
      3. LUCENE-1946.patch
        145 kB
        Uwe Schindler
      4. LUCENE-1946.patch
        137 kB
        Uwe Schindler

        Issue Links

          Activity

          Hide
          Uwe Schindler added a comment -

          Short and simple patch for Lucene core. The tests were modified yesterday to no longer call deprecated TS methods and no longer use setOnlyUseNewAPI()

          Show
          Uwe Schindler added a comment - Short and simple patch for Lucene core. The tests were modified yesterday to no longer call deprecated TS methods and no longer use setOnlyUseNewAPI()
          Hide
          Uwe Schindler added a comment -

          One general question:

          My last patch also removes the ISOLatin1AccentFilter, because it is deprecated. The problem with removing this filter is, that current indexes cannot be used anymore, because you have no possibility to add new documents using this filter in 3.0 or do queries with accents against it. Maybe we should leave this class deprecated alive or move it to contrib? Or we force people to reindex.

          The same applies to DateField from the o.a.l.document package. It is deprecated, so no one should use it, but we cannot remove it, because existing indexes need this support. This is different for e.g. Field.Store.COMPRESS, because you can still read the indexes (decompression is automatically), and can add new content without compression. Or reindex and use a own compression using byte[] stored field (as suggested).

          Show
          Uwe Schindler added a comment - One general question: My last patch also removes the ISOLatin1AccentFilter, because it is deprecated. The problem with removing this filter is, that current indexes cannot be used anymore, because you have no possibility to add new documents using this filter in 3.0 or do queries with accents against it. Maybe we should leave this class deprecated alive or move it to contrib? Or we force people to reindex. The same applies to DateField from the o.a.l.document package. It is deprecated, so no one should use it, but we cannot remove it, because existing indexes need this support. This is different for e.g. Field.Store.COMPRESS, because you can still read the indexes (decompression is automatically), and can add new content without compression. Or reindex and use a own compression using byte[] stored field (as suggested).
          Hide
          Uwe Schindler added a comment -

          Some more changes and additional tests (removed by the backwards-testcase removal). It now also fixes contrib/analyzers.

          There are still some problems in contrib like tests or old-style consumers.

          Show
          Uwe Schindler added a comment - Some more changes and additional tests (removed by the backwards-testcase removal). It now also fixes contrib/analyzers. There are still some problems in contrib like tests or old-style consumers.
          Hide
          Michael Busch added a comment -

          In case the change in our backwards-compatibility policy happens (see LUCENE-1698) we could think about removing the old TokenStream API in 3.1, considering how central this API is.

          Show
          Michael Busch added a comment - In case the change in our backwards-compatibility policy happens (see LUCENE-1698 ) we could think about removing the old TokenStream API in 3.1, considering how central this API is.
          Hide
          Uwe Schindler added a comment -

          I already promoted (and also Grant in his webinar) that Lucene 3.0 will not contain the old TokenStream API anymore. Nobody had problems with it (even the UIMA people were very happy about the new API, because it fits better to the UIMA architecture [they also have such things like Attribute but called different] - there was a conference about UIMA in Potsdam).

          I keep this patch pending (only the generics changes are submitted), but all tests for the BW compatibility were removed in my deprecated cleanup, so there are no tests for the old api anymore in trunk.

          Show
          Uwe Schindler added a comment - I already promoted (and also Grant in his webinar) that Lucene 3.0 will not contain the old TokenStream API anymore. Nobody had problems with it (even the UIMA people were very happy about the new API, because it fits better to the UIMA architecture [they also have such things like Attribute but called different] - there was a conference about UIMA in Potsdam). I keep this patch pending (only the generics changes are submitted), but all tests for the BW compatibility were removed in my deprecated cleanup, so there are no tests for the old api anymore in trunk.
          Hide
          Michael Busch added a comment -

          even the UIMA people were very happy about the new API, because it fits better to the UIMA architecture

          Yeah, I had the feeling they like it too (had lunch with them a couple weeks ago when I was working from Boeblingen).

          I'm personally ok with removing the old API; just thought Grant and others mentioned concerns about removing it too soon, because lots of users have their own TokenStreams.

          Show
          Michael Busch added a comment - even the UIMA people were very happy about the new API, because it fits better to the UIMA architecture Yeah, I had the feeling they like it too (had lunch with them a couple weeks ago when I was working from Boeblingen). I'm personally ok with removing the old API; just thought Grant and others mentioned concerns about removing it too soon, because lots of users have their own TokenStreams.
          Hide
          Michael Busch added a comment -

          This patch was sitting here for a week now and it seems like nobody objects removing the old API in 3.0.

          So my +1 for committing this!

          Show
          Michael Busch added a comment - This patch was sitting here for a week now and it seems like nobody objects removing the old API in 3.0. So my +1 for committing this!
          Hide
          Uwe Schindler added a comment -

          +1 from me too. The patch needs some more work, as I stopped somewhere in contrib to remove, as it was not clear to remove it for 3.0I will do the rest now. They only problem is InstantiatedIndexWriter, which still uses the old API to consume. I will try to modify it. But maybe this contrib will get removed (see other issue LUCENE-1948).

          So I will find out what to do.

          Show
          Uwe Schindler added a comment - +1 from me too. The patch needs some more work, as I stopped somewhere in contrib to remove, as it was not clear to remove it for 3.0I will do the rest now. They only problem is InstantiatedIndexWriter, which still uses the old API to consume. I will try to modify it. But maybe this contrib will get removed (see other issue LUCENE-1948 ). So I will find out what to do.
          Hide
          Uwe Schindler added a comment -

          Here the patch with all contrib's fixed:

          • PrecedenceQueryParser was missing new TokenStream API, I fixed it somehow with States and restoreState. I also added a javacc-target, which was missing
          • InstantiatedIndexWriter was also changed to use the new TokenStream API. The fix is very hackish, but works for the beginning. The class uses lots of Lists/Sets with cloned Token instances inside, so I simple used an AttributeImpl iterator and used copyTo(token). This works most cases (other cases are ignored by a empty Exception catch block). But this should really be fixed or the whole class removed (as suggested)

          There is one question: I removed IsoLatin1Filter. Thismay be a backwards break, so that old indexes using this filter in the analyzer need to be reindexed. But for most cases the AccentFilter would also work, but some hits may be missing when you query such an index. What should we do. Leave the deprecated analyzer in and remove it with 4.0 when all old indexes cannot be read anymore?

          Show
          Uwe Schindler added a comment - Here the patch with all contrib's fixed: PrecedenceQueryParser was missing new TokenStream API, I fixed it somehow with States and restoreState. I also added a javacc-target, which was missing InstantiatedIndexWriter was also changed to use the new TokenStream API. The fix is very hackish, but works for the beginning. The class uses lots of Lists/Sets with cloned Token instances inside, so I simple used an AttributeImpl iterator and used copyTo(token). This works most cases (other cases are ignored by a empty Exception catch block). But this should really be fixed or the whole class removed (as suggested) There is one question: I removed IsoLatin1Filter. Thismay be a backwards break, so that old indexes using this filter in the analyzer need to be reindexed. But for most cases the AccentFilter would also work, but some hits may be missing when you query such an index. What should we do. Leave the deprecated analyzer in and remove it with 4.0 when all old indexes cannot be read anymore?
          Hide
          Robert Muir added a comment -

          There is one question: I removed IsoLatin1Filter. Thismay be a backwards break, so that old indexes using this filter in the analyzer need to be reindexed. But for most cases the AccentFilter would also work, but some hits may be missing when you query such an index. What should we do. Leave the deprecated analyzer in and remove it with 4.0 when all old indexes cannot be read anymore?

          I do not think this is a backwards break, because IsoLatin1Filter is deprecated?

           * @deprecated in favor of {@link ASCIIFoldingFilter} which covers a superset 
           * of Latin 1. This class will be removed in Lucene 3.0.
          
          Show
          Robert Muir added a comment - There is one question: I removed IsoLatin1Filter. Thismay be a backwards break, so that old indexes using this filter in the analyzer need to be reindexed. But for most cases the AccentFilter would also work, but some hits may be missing when you query such an index. What should we do. Leave the deprecated analyzer in and remove it with 4.0 when all old indexes cannot be read anymore? I do not think this is a backwards break, because IsoLatin1Filter is deprecated? * @deprecated in favor of {@link ASCIIFoldingFilter} which covers a superset * of Latin 1. This class will be removed in Lucene 3.0.
          Hide
          Uwe Schindler added a comment -

          This is correct. But e.g. NumberTools and DateTools will also stay deprecated in 3.0, because you need them to use Indexes from previous versions.

          So there is a difference between deprecated APIs and deprecated functionality that is maybe needed, as long as old indexes are available. With 4.0 we will change the index format and then the problem is gone.

          But I agree with you, the differences between both classes are so minimal and only western european languages would have used the ISO filter. The superset does not hurt, only when extended chars (>255) are involved.

          Show
          Uwe Schindler added a comment - This is correct. But e.g. NumberTools and DateTools will also stay deprecated in 3.0, because you need them to use Indexes from previous versions. So there is a difference between deprecated APIs and deprecated functionality that is maybe needed, as long as old indexes are available. With 4.0 we will change the index format and then the problem is gone. But I agree with you, the differences between both classes are so minimal and only western european languages would have used the ISO filter. The superset does not hurt, only when extended chars (>255) are involved.
          Hide
          Robert Muir added a comment - - edited

          Uwe, well this class already uses new tokenstream API, so it is not hurting anything right?
          maybe follow what Mark said and keep it, change the javadoc to say 'this class will be removed in Lucene 4.0' ?

          edit. also maybe another option, you could move it to contrib?

          Show
          Robert Muir added a comment - - edited Uwe, well this class already uses new tokenstream API, so it is not hurting anything right? maybe follow what Mark said and keep it, change the javadoc to say 'this class will be removed in Lucene 4.0' ? edit. also maybe another option, you could move it to contrib?
          Hide
          Uwe Schindler added a comment -

          Cool idea. I move it to contrib/analyzers/common. But the class name will stay, the same so it will be in the top-level dir (without language suffix) as the only one.

          With NumberTools/DateField/DateTools we can do the same? This would be good in misc or somewhere else. But we then have to remove support for them from core's query parser? — mhhm bad idea for the beginning. I leave it there

          Show
          Uwe Schindler added a comment - Cool idea. I move it to contrib/analyzers/common. But the class name will stay, the same so it will be in the top-level dir (without language suffix) as the only one. With NumberTools/DateField/DateTools we can do the same? This would be good in misc or somewhere else. But we then have to remove support for them from core's query parser? — mhhm bad idea for the beginning. I leave it there
          Hide
          Robert Muir added a comment -

          Cool idea. I move it to contrib/analyzers/common. But the class name will stay, the same so it will be in the top-level dir (without language suffix) as the only one.

          or maybe put it in misc also? I don't understand the Number/Date issues like you do, so maybe contrib is a bad idea... just throwing it out there.

          Show
          Robert Muir added a comment - Cool idea. I move it to contrib/analyzers/common. But the class name will stay, the same so it will be in the top-level dir (without language suffix) as the only one. or maybe put it in misc also? I don't understand the Number/Date issues like you do, so maybe contrib is a bad idea... just throwing it out there.
          Hide
          Uwe Schindler added a comment -

          Patch with ISOLatin1Filter deprecated with a hint to 4.0 and additional text. I also made the missing TokenStreams (intended not to be extended) final (see LUCENE-1753).

          All test pass. I will commit soon.

          Show
          Uwe Schindler added a comment - Patch with ISOLatin1Filter deprecated with a hint to 4.0 and additional text. I also made the missing TokenStreams (intended not to be extended) final (see LUCENE-1753 ). All test pass. I will commit soon.
          Hide
          Uwe Schindler added a comment -

          Committed revision: 824116

          Show
          Uwe Schindler added a comment - Committed revision: 824116

            People

            • Assignee:
              Uwe Schindler
              Reporter:
              Uwe Schindler
            • Votes:
              0 Vote for this issue
              Watchers:
              0 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development