Key: LUCENE-1166
Type: New Feature
Status: Resolved
Resolution: Fixed
Priority: Minor
Assignee: Grant Ingersoll
Reporter: Thomas Peuss
Votes: 0
Watchers: 1
Lucene - Java

A tokenfilter to decompose compound words

Created: 06/Feb/08 11:08 AM   Updated: 18/May/08 01:21 PM
Component/s: Analysis
Affects Version/s: None
Fix Version/s: None

Time Tracking:
Not Specified

File Attachments:
  Size
Text File Licensed for inclusion in ASF works CompoundTokenFilter.patch 2008-05-16 11:32 AM Thomas Peuss 106 kB
Text File Licensed for inclusion in ASF works CompoundTokenFilter.patch 2008-04-30 02:26 PM Thomas Peuss 106 kB
Text File Licensed for inclusion in ASF works CompoundTokenFilter.patch 2008-04-30 09:17 AM Thomas Peuss 105 kB
Text File Licensed for inclusion in ASF works CompoundTokenFilter.patch 2008-04-24 10:11 AM Thomas Peuss 99 kB
Text File Licensed for inclusion in ASF works CompoundTokenFilter.patch 2008-03-29 11:04 AM Thomas Peuss 90 kB
Text File Licensed for inclusion in ASF works CompoundTokenFilter.patch 2008-03-25 12:56 PM Thomas Peuss 90 kB
Text File Licensed for inclusion in ASF works CompoundTokenFilter.patch 2008-03-25 12:49 PM Thomas Peuss 91 kB
Text File Licensed for inclusion in ASF works CompoundTokenFilter.patch 2008-03-03 04:35 PM Thomas Peuss 90 kB
Text File Licensed for inclusion in ASF works CompoundTokenFilter.patch 2008-02-14 04:22 PM Thomas Peuss 85 kB
Text File Licensed for inclusion in ASF works CompoundTokenFilter.patch 2008-02-12 11:09 AM Thomas Peuss 76 kB
Text File Licensed for inclusion in ASF works CompoundTokenFilter.patch 2008-02-06 11:08 AM Thomas Peuss 71 kB
XML File de.xml 2008-02-06 11:10 AM Thomas Peuss 48 kB
File hyphenation.dtd 2008-02-06 11:11 AM Thomas Peuss 3 kB
Issue Links:
Reference
 

Lucene Fields: Patch Available
Resolution Date: 16/May/08 12:28 PM


Description
A tokenfilter that decomposes the compound words found in many Germanic languages (like German, Swedish, ...) into single tokens.

An example: Donaudampfschiff would be decomposed to Donau, dampf, schiff so that you can find the word even when you only enter "Schiff".

I use the hyphenation code from the Apache XML project FOP (http://xmlgraphics.apache.org/fop/) to do the first step of decomposition. Currently I use the FOP jars directly. I only use a handful of classes from the FOP project.
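
One way to picture this "first step": a hyphenator proposes break positions inside a token, and only substrings that start and end on such positions are checked against a word list. The sketch below is an illustration under assumptions, not the code from the patch. The break positions are passed in as a plain int array instead of being computed by the FOP classes, and the class name, method name, and word list are made up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch: candidate subwords are the substrings between any two
// hyphenation points; a candidate is kept only if its lower-cased form is in
// the dictionary.
class HyphenationCandidatesSketch {

    // points: sorted break positions, including 0 and word.length()
    static List<String> decompose(String word, int[] points, Set<String> dictionary) {
        List<String> parts = new ArrayList<>();
        for (int i = 0; i < points.length; i++) {
            for (int j = i + 1; j < points.length; j++) {
                String candidate = word.substring(points[i], points[j]);
                if (dictionary.contains(candidate.toLowerCase(Locale.ROOT))) {
                    parts.add(candidate);
                }
            }
        }
        return parts;
    }

    public static void main(String[] args) {
        // Assumed break positions for Do-nau-dampf-schiff; a real hyphenator would supply them.
        int[] points = {0, 2, 5, 10, 16};
        Set<String> dictionary = new HashSet<>(Arrays.asList("donau", "dampf", "schiff"));
        System.out.println(decompose("Donaudampfschiff", points, dictionary));
    }
}
```

Restricting candidates to hyphenation points keeps the number of dictionary lookups small compared with testing every substring.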

My question now:
Would it be OK to copy these classes over to the Lucene project (renaming the packages of course) or should I stick with the dependency on the FOP jars? The FOP code uses the ASF V2 license as well.

What do you think?



Thomas Peuss added a comment - 06/Feb/08 11:09 AM
A preliminary version of the token filter.

Thomas Peuss added a comment - 06/Feb/08 11:10 AM

Thomas Peuss added a comment - 06/Feb/08 11:11 AM
The DTD describing the hyphenation grammar XML files.

Steven Rowe added a comment - 06/Feb/08 04:39 PM
Hi Thomas,

Looking at http://offo.sourceforge.net/hyphenation/licenses.html, which seems to be the same information as in the off-hyphenation.zip file you attached to this issue, the license issue may be a problem - the hyphenation data is covered by different licenses on a per-language basis. For example, there are two German data files, and both are licensed under a LaTeX license, as is the Danish file, and these two languages are the most likely targets for your TokenFilter. IANAL, but unless Apache licenses can be secured for this data, I don't think the files can be incorporated directly into an Apache project.

Also, I don't see Swedish among the hyphenation data licenses - is it covered in some other way?


Thomas Peuss added a comment - 06/Feb/08 05:33 PM

Looking at http://offo.sourceforge.net/hyphenation/licenses.html, which seems to be the same information as in the off-hyphenation.zip file you attached to this issue, the license issue may be a problem - the hyphenation data is covered by different licenses on a per-language basis. For example, there are two German data files, and both are licensed under a LaTeX license, as is the Danish file, and these two languages are the most likely targets for your TokenFilter. IANAL, but unless Apache licenses can be secured for this data, I don't think the files can be incorporated directly into an Apache project.

This is true. And that's why I uploaded the two files without the ASF license grant. The FOP project does not have the files in the code base as well because of the licensing problem.

Also, I don't see Swedish among the hyphenation data licenses - is it covered in some other way?

OFFO has no Swedish grammar file. We can generate a Swedish grammar file out of the LaTeX grammar files. I will look into this tonight.

All other hyphenation implementations I have found so far use them either directly or in a converted variant like the FOP code. What we can do of course is to ask the authors of the LaTeX files if they want to license their work under the ASF license as well. It is worth a try. But I suppose that many email addresses in the LaTeX files are not used anymore. I will try to contact the authors of the German grammar files tomorrow.

BTW: an example for those that don't want to try the patch:
Input token stream:
Rindfleischüberwachungsgesetz Drahtschere abba

Output token stream:
(Rindfleischüberwachungsgesetz,0,29)
(Rind,0,4,posIncr=0)
(fleisch,4,11,posIncr=0)
(überwachung,11,22,posIncr=0)
(gesetz,23,29,posIncr=0)
(Drahtschere,30,41)
(Draht,30,35,posIncr=0)
(schere,35,41,posIncr=0)
(abba,42,46)


Thomas Peuss added a comment - 07/Feb/08 12:22 PM
A Swedish hyphenation grammar is available at http://www.peuss.de/node/64

Thomas Peuss added a comment - 12/Feb/08 11:09 AM
Changes:
  • added unittest
  • minor tweaks for getting the encoding of the XML files right

Thomas Peuss added a comment - 14/Feb/08 04:22 PM
Updated version:
  • new dumb decomposition filter
    • uses a brute-force approach by generating substrings and checking them against the dictionary
    • seems to work better for languages that have no patterns file with a lot of special cases
    • Is roughly 3 times slower than the decomposition filter using hyphenation patterns
    • No licensing problems because of the hyphenation pattern files
  • Refactoring to have all methods used by both decomposition filters in one place
  • Minor performance improvements
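
The brute-force approach described in this update can be sketched as follows. This is a minimal illustration, not the filter from the patch: the class name, the whitespace tokenization, the size limits, and the output format (borrowed from the token listing earlier in this thread) are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch: emit every token, then every substring of it that is
// found in the dictionary, with offsets into the original text.
class BruteForceDecompounderSketch {

    static List<String> analyze(String text, Set<String> dictionary,
                                int minSubwordSize, int maxSubwordSize) {
        List<String> out = new ArrayList<>();
        int pos = 0;
        while (pos < text.length()) {
            if (Character.isWhitespace(text.charAt(pos))) {
                pos++;
                continue;
            }
            int start = pos;
            while (pos < text.length() && !Character.isWhitespace(text.charAt(pos))) {
                pos++;
            }
            String token = text.substring(start, pos);
            out.add("(" + token + "," + start + "," + pos + ")");
            // Try every start position and every allowed length.
            for (int i = 0; i + minSubwordSize <= token.length(); i++) {
                for (int len = minSubwordSize;
                     len <= maxSubwordSize && i + len <= token.length(); len++) {
                    String sub = token.substring(i, i + len);
                    if (dictionary.contains(sub.toLowerCase(Locale.ROOT))) {
                        out.add("(" + sub + "," + (start + i) + "," + (start + i + len)
                                + ",posIncr=0)");
                    }
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // \u00fc is the letter u-umlaut; the dictionary entries are lower case.
        Set<String> dictionary = new HashSet<>(Arrays.asList(
                "rind", "fleisch", "\u00fcberwachung", "gesetz", "draht", "schere"));
        String text = "Rindfleisch\u00fcberwachungsgesetz Drahtschere abba";
        for (String t : analyze(text, dictionary, 4, 15)) {
            System.out.println(t);
        }
    }
}
```

The nested loops perform one dictionary lookup per (start, length) pair, which is consistent with this variant being slower than one that only tests substrings between hyphenation points.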

Otis Gospodnetic added a comment - 17/Feb/08 05:48 AM
I haven't looked at the patch.
But I'm wondering if a similar approach could be used for, say, word segmentation in Chinese?
That is, iterate through a string of Chinese characters, buffering them and looking up the buffered string in a Chinese dictionary. Once there is a dictionary match, and the addition of the following character results in a string that has no entry in the dictionary, that previous buffered string can be considered a word/token.

I'm not sure if your patch does something like this, but if it does, I am wondering if it is general enough that what you did can be used (as the basis of) word segmentation for Chinese, and thus for a Chinese Analyzer that's not just a dumb n-gram Analyzer (which is what we have today).


Thomas Peuss added a comment - 17/Feb/08 03:24 PM

But I'm wondering if a similar approach could be used for, say, word segmentation in Chinese? That is, iterate through a string of Chinese characters, buffering them and looking up the buffered string in a Chinese dictionary. Once there is a dictionary match, and the addition of the following character results in a string that has no entry in the dictionary, that previous buffered string can be considered a word/token. I'm not sure if your patch does something like this, but if it does, I am wondering if it is general enough that what you did can be used (as the basis of) word segmentation for Chinese, and thus for a Chinese Analyzer that's not just a dumb n-gram Analyzer (which is what we have today).

Currently the code adds a token to the stream when an n-gram from the current token in the token stream matches a word in the dictionary (I am only speaking about the DumbCompoundWordTokenFilter because I doubt that hyphenation patterns exist for Chinese languages). I don't know enough about the structure of Chinese characters to answer this question in detail. You can have a look at the test case in the patch to see how the filters work.


Otis Gospodnetic added a comment - 28/Feb/08 11:11 PM
Thomas, I think that might work for Chinese: go through the "string" of Chinese characters one at a time, looking up the dictionary after each additional character. Once you find a dictionary match, you look at one more character. If that matches a dictionary entry, keep doing that for as long as you keep matching dictionary entries (in order to grab the longest dictionary-matching string of characters). If the next character does not match, then the previous/last character was the end of the dictionary entry.
That would work, no?
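
The segmentation idea in this comment resembles forward maximum matching. The sketch below is a variant under stated assumptions rather than a literal transcription: instead of stopping at the first extension that misses, it tries every length up to a hypothetical maxWordLength and keeps the longest hit, so a word is still found when one of its prefixes is not in the dictionary. ASCII letters stand in for Chinese characters, and all names are invented for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of forward maximum matching over an unsegmented string.
class LongestMatchSegmenterSketch {

    static List<String> segment(String text, Set<String> dictionary, int maxWordLength) {
        List<String> words = new ArrayList<>();
        int pos = 0;
        while (pos < text.length()) {
            int bestEnd = pos + 1; // fall back to a single character if nothing matches
            int limit = Math.min(text.length(), pos + maxWordLength);
            for (int end = pos + 1; end <= limit; end++) {
                if (dictionary.contains(text.substring(pos, end))) {
                    bestEnd = end; // later hits are longer, so the last one wins
                }
            }
            words.add(text.substring(pos, bestEnd));
            pos = bestEnd;
        }
        return words;
    }

    public static void main(String[] args) {
        Set<String> dictionary = new HashSet<>(Arrays.asList(
                "the", "tab", "table", "down", "there"));
        // "tabl" is not a word, yet "table" is still preferred over "tab".
        System.out.println(segment("thetabledownthere", dictionary, 5));
    }
}
```

Greedy matching of this kind can pick the wrong split when a long dictionary entry overlaps the start of the following word, so it is a baseline rather than a complete segmenter.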

As for the license info, I think you could take the approach where the required libraries are not included in the contribution in the ASF repo, but are downloaded on the fly, at build time, much like some other contributions. Could you do that?


Thomas Peuss added a comment - 03/Mar/08 10:33 AM

Thomas, I think that might work for Chinese: go through the "string" of Chinese characters one at a time, looking up the dictionary after each additional character. Once you find a dictionary match, you look at one more character. If that matches a dictionary entry, keep doing that for as long as you keep matching dictionary entries (in order to grab the longest dictionary-matching string of characters). If the next character does not match, then the previous/last character was the end of the dictionary entry. That would work, no?

I have started to look into this. I will add the constructor parameter "onlyLongestMatch" (default is false).
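
One plausible reading of such a flag, sketched under assumptions: during the brute-force scan, when several dictionary entries start at the same position, only the longest one is emitted if the flag is set. The class name, method signature, and the lower-case test word are hypothetical, and the behaviour in the patch may differ in detail.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of an "onlyLongestMatch" switch for a dictionary scan.
class OnlyLongestMatchSketch {

    static List<String> subwords(String word, Set<String> dictionary,
                                 int minSize, int maxSize, boolean onlyLongestMatch) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + minSize <= word.length(); i++) {
            String longest = null;
            for (int len = minSize; len <= maxSize && i + len <= word.length(); len++) {
                String candidate = word.substring(i, i + len);
                if (dictionary.contains(candidate)) {
                    if (onlyLongestMatch) {
                        longest = candidate; // lengths grow, so this ends as the longest hit
                    } else {
                        out.add(candidate);
                    }
                }
            }
            if (longest != null) {
                out.add(longest);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Set<String> dictionary = new HashSet<>(Arrays.asList(
                "fuss", "fussball", "ball", "pumpe"));
        System.out.println(subwords("fussballpumpe", dictionary, 3, 10, false));
        System.out.println(subwords("fussballpumpe", dictionary, 3, 10, true));
    }
}
```

With the flag off, both "fuss" and "fussball" are emitted for position 0; with it on, only "fussball" survives, while "ball" and "pumpe" are unaffected because they start elsewhere.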

As for the license info, I think you could take the approach where the required libraries are not included in the contribution in the ASF repo, but are downloaded on the fly, at build time, much like some other contributions. Could you do that?

I already pull the grammar files for the tests. But I don't know if it makes sense to pull them at build time, because the end user can easily download them. I need the XML versions now, so the jar file from Sourceforge does not help anymore (I have included the needed classes from the FOP project; they use the ASF license as well).


Thomas Peuss added a comment - 03/Mar/08 04:35 PM
Updated patch according to Otis's suggestions for longest match.

Next step: move to contrib


Thomas Peuss added a comment - 25/Mar/08 12:43 PM
Moved the compound word tokenfilter stuff to contrib.

Thomas Peuss added a comment - 25/Mar/08 12:49 PM
Moved compound word token filter to contrib.

Thomas Peuss added a comment - 25/Mar/08 12:56 PM
Dropped Java5 dependencies.

Thomas Peuss added a comment - 29/Mar/08 11:04 AM
Fixed a compilation bug in the testcase.

Grant Ingersoll added a comment - 24/Apr/08 01:14 AM

I already pull the grammar files for the tests. But I don't know if it makes sense to pull them at build time, because the end user can easily download them. I need the XML versions now, so the jar file from Sourceforge does not help anymore (I have included the needed classes from the FOP project; they use the ASF license as well).

I think they have to download automatically, otherwise the automated tests, etc. will not run. I applied the patch and ran "ant test" and it fails b/c I didn't download the files.

Also, much of the code has author tags that are not you. I am assuming you got it from FOP per your comments above, but can you explicitly mark all the files as to their origin?


Thomas Peuss added a comment - 24/Apr/08 06:13 AM
All files in the package org.apache.lucene.analysis.compound.hyphenation are from the FOP project (also ASF-licensed). Should I add a comment to them to state where they come from? All other files are from me. I have to check why it fails when you run "ant test" by downloading a fresh copy of Lucene-trunk.

Thomas Peuss added a comment - 24/Apr/08 06:51 AM
The error is
[junit] Testsuite: org.apache.lucene.analysis.compound.TestCompoundWordTokenFilter
    [junit] Tests run: 4, Failures: 0, Errors: 2, Time elapsed: 2,139 sec
    [junit]
    [junit] Testcase: testHyphenationCompoundWordsDE(org.apache.lucene.analysis.compound.TestCompoundWordTokenFilter):  Caused an ERROR
    [junit] File not found: /home/thomas/projects/lucene-trunk-compound/hyphenation.dtd (No such file or directory)
    [junit] org.apache.lucene.analysis.compound.hyphenation.HyphenationException: File not found: /home/thomas/projects/lucene-trunk-compound/hyphenation.dtd (No such file or directory)
    [junit]     at org.apache.lucene.analysis.compound.hyphenation.PatternParser.parse(PatternParser.java:123)
    [junit]     at org.apache.lucene.analysis.compound.hyphenation.HyphenationTree.loadPatterns(HyphenationTree.java:138)
    [junit]     at org.apache.lucene.analysis.compound.HyphenationCompoundWordTokenFilter.getHyphenationTree(HyphenationCompoundWordTokenFilter.java:142)
    [junit]     at org.apache.lucene.analysis.compound.TestCompoundWordTokenFilter.testHyphenationCompoundWordsDE(TestCompoundWordTokenFilter.java:70)
    [junit]
    [junit]
    [junit] Testcase: testHyphenationCompoundWordsDELongestMatch(org.apache.lucene.analysis.compound.TestCompoundWordTokenFilter):      Caused an ERROR
    [junit] File not found: /home/thomas/projects/lucene-trunk-compound/hyphenation.dtd (No such file or directory)
    [junit] org.apache.lucene.analysis.compound.hyphenation.HyphenationException: File not found: /home/thomas/projects/lucene-trunk-compound/hyphenation.dtd (No such file or directory)
    [junit]     at org.apache.lucene.analysis.compound.hyphenation.PatternParser.parse(PatternParser.java:123)
    [junit]     at org.apache.lucene.analysis.compound.hyphenation.HyphenationTree.loadPatterns(HyphenationTree.java:138)
    [junit]     at org.apache.lucene.analysis.compound.HyphenationCompoundWordTokenFilter.getHyphenationTree(HyphenationCompoundWordTokenFilter.java:142)
    [junit]     at org.apache.lucene.analysis.compound.TestCompoundWordTokenFilter.testHyphenationCompoundWordsDELongestMatch(TestCompoundWordTokenFilter.java:96)
    [junit]
    [junit]
    [junit] Test org.apache.lucene.analysis.compound.TestCompoundWordTokenFilter FAILED

So it does not find the hyphenation.dtd. I have to investigate how I can make that DTD known to the parser without copying the hyphenation.dtd to Lucene's base directory.


Thomas Peuss added a comment - 24/Apr/08 10:11 AM
  • Fixed the problem with the hyphenation.dtd file that was not found
  • Removed all @author tags
  • Added a note to all files I copied from the FOP project
  • Added package.html files (not very much in there - but credits for the FOP project)
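
The thread does not say how the DTD lookup was fixed. A common way to make a DTD known to a JAXP parser without placing the file in the working directory is an EntityResolver that answers the request for hyphenation.dtd itself. The sketch below assumes that approach; the DTD text and the XML document are small stand-ins, and the class and field names are invented.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical sketch: resolve "hyphenation.dtd" from memory instead of the
// current directory. Real code would more likely return a stream obtained
// from the classpath.
class DtdResolverSketch {

    // Stand-in DTD, far smaller than the real hyphenation.dtd.
    static final String DTD =
            "<!ELEMENT hyphenation-info (patterns)>\n"
          + "<!ELEMENT patterns (#PCDATA)>\n";

    static final String XML =
            "<?xml version=\"1.0\"?>\n"
          + "<!DOCTYPE hyphenation-info SYSTEM \"hyphenation.dtd\">\n"
          + "<hyphenation-info><patterns>.ab4</patterns></hyphenation-info>";

    static int resolvedCount = 0;

    static Document parse(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            builder.setEntityResolver((publicId, systemId) -> {
                // The parser may hand over an absolute form of the system id.
                if (systemId != null && systemId.endsWith("hyphenation.dtd")) {
                    resolvedCount++;
                    return new InputSource(new StringReader(DTD));
                }
                return null; // fall back to the parser's default resolution
            });
            return builder.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse pattern file", e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse(XML);
        System.out.println(doc.getDocumentElement().getNodeName());
    }
}
```

Because the resolver supplies the DTD content, parsing no longer depends on the directory the tests are started from.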

Grant Ingersoll added a comment - 30/Apr/08 01:12 AM
So, why would I ever want to use a "Dumb" compound filter? Any suggestions for a better name? No need for a patch, I can just make the change.

Grant Ingersoll added a comment - 30/Apr/08 01:20 AM
This looks pretty good Thomas. I think the last bit that would be good is to add to the package docs an example of start to finish using it, kind of like in the test case. You might want to explain a little bit about where to get the hyphenation files, etc. (if I am understanding them correctly).

I think if we can finish that up, we can look to commit.

The other interesting thing here, as an aside, is the Ternary Tree might be worth pulling up to a "util" package (no need to do so now, just thinking out loud), as it could be used for other interesting things. For instance, see http://www.javaworld.com/javaworld/jw-02-2001/jw-0216-ternary.html The version we have needs a little work, but I have been thinking about how it might be used to improve spelling, etc.


Otis Gospodnetic added a comment - 30/Apr/08 03:16 AM
Ah, TST was in there, lovely! +1 to what Grant said about getting it into util later.
Noticed a misspelling in javadoc while glancing at TST: hibrid -> hybrid

Grant Ingersoll added a comment - 30/Apr/08 03:28 AM

cool. It needs some work, IMO to add more features, per that article
I sent, but no biggie.

I saw some of those, but purposely left in the FOP typos... There were
more than just that one.

--------------------------
Grant Ingersoll

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ


Thomas Peuss added a comment - 30/Apr/08 06:49 AM

So, why would I ever want to use a "Dumb" compound filter? Any suggestions for a better name? No need for a patch, I can just make the change.

A better name would be DictionaryCompoundWordTokenFilter. I called it "Dumb" because it uses a brute-force approach. But DictionaryCompoundWordTokenFilter characterizes it better.


Thomas Peuss added a comment - 30/Apr/08 09:17 AM
  • Renamed DumbCompoundWordTokenFilter to DictionaryCompoundWordTokenFilter
  • Added more text to the package description file (package.html)
  • Removed some code that was necessary because of LUCENE-1163 (in HyphenationCompoundWordTokenFilter and DictionaryCompoundWordTokenFilter)

Thomas Peuss added a comment - 30/Apr/08 02:26 PM
  • Minor bugfix in DictionaryCompoundWordTokenFilter: it was not using the maxSubwordSize parameter
  • Major performance improvement for the DictionaryCompoundWordTokenFilter: we now convert all dictionary strings to lower case before adding them to the CharArraySet and set the ignoreCase parameter of CharArraySet to false. The filter makes a lower case copy of the token before it starts working on it. This avoids many toLowerCase() calls in CharArraySet.
  • Minor performance improvement for the HyphenationCompoundWordTokenFilter: see above
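
The lower-casing change described in this update can be illustrated roughly as follows. This is a sketch under assumptions: a plain HashSet stands in for CharArraySet, and the class and method names are hypothetical. The dictionary is normalized once when it is built, and each token is lower-cased once, character by character so that its length and offsets stay the same.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: pay the case-folding cost once per dictionary and once
// per token instead of once per lookup.
class LowerCaseLookupSketch {

    // Lower-case every entry at construction time; lookups are then exact.
    static Set<String> buildDictionary(String... words) {
        Set<String> set = new HashSet<>();
        for (String w : words) {
            set.add(new String(lowerCaseCopy(w)));
        }
        return set;
    }

    // One lower-case copy per token; char-by-char keeps the length unchanged.
    static char[] lowerCaseCopy(String token) {
        char[] copy = token.toCharArray();
        for (int i = 0; i < copy.length; i++) {
            copy[i] = Character.toLowerCase(copy[i]);
        }
        return copy;
    }

    // Case-sensitive lookup of a slice of the already lower-cased token.
    static boolean containsSubword(Set<String> dictionary, char[] lowered, int start, int len) {
        return dictionary.contains(new String(lowered, start, len));
    }

    public static void main(String[] args) {
        Set<String> dictionary = buildDictionary("Rind", "Fleisch");
        char[] lowered = lowerCaseCopy("RindFleisch");
        System.out.println(containsSubword(dictionary, lowered, 0, 4));
        System.out.println(containsSubword(dictionary, lowered, 4, 7));
    }
}
```

Offsets computed on the lower-cased copy can be applied to the original token, so emitted subwords can keep their original casing.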

François Terrier added a comment - 07/May/08 09:48 AM
Is there any plan to integrate this patch into the official Lucene libraries in the short term?

Grant Ingersoll added a comment - 07/May/08 10:42 AM
Yes.



Grant Ingersoll added a comment - 14/May/08 10:43 AM - edited
I'm now getting:
..../lucene/java/lucene-clean/contrib/analyzers/src/test/org/apache/lucene/analysis/compound/TestCompoundWordTokenFilter.java:60: warning: unmappable character for encoding utf-8
[javac] "Aufgabe", "Überwachung" };

Can you convert the classes in question to UTF-8 for the source?
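
A warning like this typically appears when a source file saved in a single-byte encoding is compiled as UTF-8. The sketch below assumes the file was ISO-8859-1, which the thread does not state, and shows both the misreading and the conversion; the class and method names are made up.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: what happens to "Überwachung" when ISO-8859-1 bytes
// are decoded as UTF-8, and how to re-encode them properly.
class SourceEncodingSketch {

    // Simulates a compiler reading ISO-8859-1 bytes as if they were UTF-8.
    static String misread(String original) {
        byte[] latin1 = original.getBytes(StandardCharsets.ISO_8859_1);
        return new String(latin1, StandardCharsets.UTF_8);
    }

    // Decode with the encoding the bytes were written in, then encode as UTF-8.
    static byte[] convertLatin1ToUtf8(byte[] latin1) {
        return new String(latin1, StandardCharsets.ISO_8859_1)
                .getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String word = "\u00DCberwachung"; // escape used so this file is encoding-neutral
        // The umlaut byte is not valid UTF-8 here, so it becomes U+FFFD.
        System.out.println(misread(word).indexOf('\uFFFD') >= 0);
        byte[] utf8 = convertLatin1ToUtf8(word.getBytes(StandardCharsets.ISO_8859_1));
        System.out.println(new String(utf8, StandardCharsets.UTF_8).equals(word));
    }
}
```

Writing non-ASCII literals as Unicode escapes, as in the main method, is another way to keep test sources independent of the compiler's encoding setting.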


Thomas Peuss added a comment - 16/May/08 11:32 AM
UTF-8 problem fixed...

Grant Ingersoll added a comment - 16/May/08 12:28 PM
Committed revision 657027.