Thomas Peuss added a comment - 06/Feb/08 11:09 AM
A preliminary version of the token filter.
A hyphenation grammar. You can download the grammars from: http://downloads.sourceforge.net/offo/offo-hyphenation.zip?modtime=1168687306&big_mirror=0
The DTD describing the hyphenation grammar XML files.
Hi Thomas,
Looking at http://offo.sourceforge.net/hyphenation/licenses.html Also, I don't see Swedish among the hyphenation data licenses - is it covered in some other way?
This is true. And that's why I uploaded the two files without the ASF license grant. The FOP project does not have the files in its code base either, because of the licensing problem.
OFFO has no Swedish grammar file. We can generate a Swedish grammar file from the LaTeX grammar files. I will look into this tonight. All other hyphenation implementations I have found so far use them either directly or in a converted variant, like the FOP code. What we can do, of course, is ask the authors of the LaTeX files whether they want to license their work under the ASF license as well. It is worth a try, but I suppose that many email addresses in the LaTeX files are no longer in use. I will try to contact the authors of the German grammar files tomorrow. BTW: an example for those that don't want to try the patch: Output token stream: A Swedish hyphenation grammar is available at http://www.peuss.de/node/64
Changes:
Updated version:
I haven't looked at the patch.
But I'm wondering if a similar approach could be used for, say, word segmentation in Chinese? That is, iterate through a string of Chinese characters, buffering them and looking up the buffered string in a Chinese dictionary. Once there is a dictionary match, and the addition of the following character results in a string that has no entry in the dictionary, that previous buffered string can be considered a word/token. I'm not sure if your patch does something like this, but if it does, I am wondering if it is general enough that what you did can be used (as the basis of) word segmentation for Chinese, and thus for a Chinese Analyzer that's not just a dumb n-gram Analyzer (which is what we have today).
Currently the code adds a token to the stream when an n-gram from the current token in the token stream matches a word in the dictionary (I am only speaking about the DumbCompoundWordTokenFilter because I doubt that hyphenation patterns exist for Chinese languages). I don't know enough about the structure of Chinese characters to answer this question in detail. You can have a look at the test case in the patch to see how the filters work.
Thomas, I think that might work for Chinese - going through the "string" of Chinese characters, one at a time, and looking up a dictionary after each additional character. Once you find a dictionary match, you look at one more character. If that matches a dictionary entry, keep doing that for as long as you keep matching dictionary entries (in order to grab the longest dictionary-matching string of characters). If the next character does not match, then the previous/last character was the end of the dictionary entry.
That would work, no? As for the license info, I think you could take the approach where the required libraries are not included in the contribution in the ASF repo, but are downloaded on the fly, at build time, much like some other contributions. Could you do that?
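For readers following the longest-match discussion above, here is a minimal sketch of greedy dictionary segmentation. It is not code from the patch: the class name `LongestMatchSegmenter` and the `maxWordLength` parameter are made up for illustration. Unlike the description above, it does not stop at the first extension that fails to match; it tries every length up to `maxWordLength`, so the dictionary does not need to contain every prefix of a longer entry.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Hypothetical helper, not part of the patch: greedy longest-match segmentation. */
final class LongestMatchSegmenter {

    /**
     * Splits text into tokens by repeatedly taking the longest dictionary
     * entry that starts at the current position. A character that starts
     * no dictionary entry is emitted as a single-character token.
     * Works on Java chars, so supplementary characters are not handled.
     */
    static List<String> segment(String text, Set<String> dictionary, int maxWordLength) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < text.length()) {
            int longestEnd = -1;
            int limit = Math.min(text.length(), pos + maxWordLength);
            // Grow the buffered string one character at a time and
            // remember the end offset of the longest match seen so far.
            for (int end = pos + 1; end <= limit; end++) {
                if (dictionary.contains(text.substring(pos, end))) {
                    longestEnd = end;
                }
            }
            if (longestEnd < 0) {
                longestEnd = pos + 1; // no entry starts here: emit one character
            }
            tokens.add(text.substring(pos, longestEnd));
            pos = longestEnd;
        }
        return tokens;
    }
}
```

A greedy strategy like this can pick the wrong split when a long entry hides a better pair of shorter ones, so it is a starting point rather than a complete segmenter.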
I have started to look into this. I will add the constructor parameter "onlyLongestMatch" (default is false).
I already pull the grammar files for the tests. But I don't know if it makes sense to pull them at build time, because the end user can easily download them. I need the XML versions now, so the jar file from SourceForge does not help anymore (I have included the needed classes from the FOP project - they use the ASF license as well). Updated patch according to Otis's suggestions for longest match.
Next step: move to contrib.
Moved the compound word token filter code to contrib.
Moved compound word token filter to contrib.
Fixed a compilation bug in the testcase.
I think they have to download automatically, otherwise the automated tests, etc. will not run. I applied the patch and ran "ant test" and it fails b/c I didn't download the files. Also, much of the code has author tags that are not you; I am assuming you got it from FOP per your comments above, but can you explicitly mark all the files as to their origin?
All files in the package org.apache.lucene.analysis.compound.hyphenation are from the FOP project (also ASF licensed). Should I add a comment to them stating where they come from? All other files are mine. I have to check why it fails when you run "ant test" by downloading a fresh copy of Lucene trunk.
The error is
[junit] Testsuite: org.apache.lucene.analysis.compound.TestCompoundWordTokenFilter
[junit] Tests run: 4, Failures: 0, Errors: 2, Time elapsed: 2,139 sec
[junit]
[junit] Testcase: testHyphenationCompoundWordsDE(org.apache.lucene.analysis.compound.TestCompoundWordTokenFilter): Caused an ERROR
[junit] File not found: /home/thomas/projects/lucene-trunk-compound/hyphenation.dtd (No such file or directory)
[junit] org.apache.lucene.analysis.compound.hyphenation.HyphenationException: File not found: /home/thomas/projects/lucene-trunk-compound/hyphenation.dtd (No such file or directory)
[junit] at org.apache.lucene.analysis.compound.hyphenation.PatternParser.parse(PatternParser.java:123)
[junit] at org.apache.lucene.analysis.compound.hyphenation.HyphenationTree.loadPatterns(HyphenationTree.java:138)
[junit] at org.apache.lucene.analysis.compound.HyphenationCompoundWordTokenFilter.getHyphenationTree(HyphenationCompoundWordTokenFilter.java:142)
[junit] at org.apache.lucene.analysis.compound.TestCompoundWordTokenFilter.testHyphenationCompoundWordsDE(TestCompoundWordTokenFilter.java:70)
[junit]
[junit]
[junit] Testcase: testHyphenationCompoundWordsDELongestMatch(org.apache.lucene.analysis.compound.TestCompoundWordTokenFilter): Caused an ERROR
[junit] File not found: /home/thomas/projects/lucene-trunk-compound/hyphenation.dtd (No such file or directory)
[junit] org.apache.lucene.analysis.compound.hyphenation.HyphenationException: File not found: /home/thomas/projects/lucene-trunk-compound/hyphenation.dtd (No such file or directory)
[junit] at org.apache.lucene.analysis.compound.hyphenation.PatternParser.parse(PatternParser.java:123)
[junit] at org.apache.lucene.analysis.compound.hyphenation.HyphenationTree.loadPatterns(HyphenationTree.java:138)
[junit] at org.apache.lucene.analysis.compound.HyphenationCompoundWordTokenFilter.getHyphenationTree(HyphenationCompoundWordTokenFilter.java:142)
[junit] at org.apache.lucene.analysis.compound.TestCompoundWordTokenFilter.testHyphenationCompoundWordsDELongestMatch(TestCompoundWordTokenFilter.java:96)
[junit]
[junit]
[junit] Test org.apache.lucene.analysis.compound.TestCompoundWordTokenFilter FAILED
So it does not find the hyphenation.dtd. I have to investigate how I can make that DTD known to the parser without copying the hyphenation.dtd to Lucene's base directory.
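One common way to do this with the JDK's SAX API is to install an `EntityResolver` that serves the DTD from somewhere other than the working directory, for example from the classpath. The sketch below is only an illustration and not what the patch does; `DtdResolverSketch`, `resolverFor` and `parses` are hypothetical names.

```java
import java.io.InputStream;
import java.io.StringReader;
import java.util.function.Supplier;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical sketch: hand the DTD to the parser through an EntityResolver. */
final class DtdResolverSketch {

    /** Returns a resolver that serves any system id ending in dtdName from the given source. */
    static EntityResolver resolverFor(String dtdName, Supplier<InputStream> source) {
        return (publicId, systemId) -> {
            if (systemId != null && systemId.endsWith(dtdName)) {
                InputStream in = source.get();
                if (in != null) {
                    InputSource dtd = new InputSource(in);
                    dtd.setSystemId(systemId);
                    return dtd;
                }
            }
            return null; // let the parser fall back to its default lookup
        };
    }

    /** Parses xml with the resolver installed; returns true if parsing succeeded. */
    static boolean parses(String xml, EntityResolver resolver) {
        try {
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            reader.setEntityResolver(resolver);
            reader.setContentHandler(new DefaultHandler());
            reader.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (Exception e) {
            return false;
        }
    }
}
```

To load the DTD from the classpath, the source could be something like `() -> SomeClass.class.getResourceAsStream("hyphenation.dtd")`, where `SomeClass` is any class in the package that ships the DTD as a resource.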
So, why would I ever want to use a "Dumb" compound filter? Any suggestions for a better name? No need for a patch, I can just make the change.
This looks pretty good Thomas. I think the last bit that would be good is to add to the package docs an example of start to finish using it, kind of like in the test case. You might want to explain a little bit about where to get the hyphenation files, etc. (if I am understanding them correctly).
I think if we can finish that up, we can look to commit. The other interesting thing here, as an aside, is that the Ternary Tree might be worth pulling up to a "util" package (no need to do so now, just thinking out loud), as it could be used for other interesting things. For instance, see http://www.javaworld.com/javaworld/jw-02-2001/jw-0216-ternary.html
Ah, TST was in there, lovely! +1 to what Grant said about getting it into util later.
Noticed a misspelling in javadoc while glancing at TST: hibrid -> hybrid
Cool. It needs some work, IMO, to add more features, per that article. I saw some of those, but purposely left in the FOP typos... There were
A better name would be DictionaryCompoundWordTokenFilter. I called it "Dumb" because it uses a brute-force approach. But DictionaryCompoundWordTokenFilter characterizes it better.
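To illustrate what "brute force" means here, the following sketch checks every substring of a token against a dictionary, which is the approach described earlier in the thread. It is an illustration, not the code from the patch: the class name, the `minLen`/`maxLen` bounds and the exact behaviour of `onlyLongestMatch` (keep only the longest match per start offset) are assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Hypothetical sketch of brute-force dictionary decompounding. */
final class BruteForceDecompounder {

    /**
     * Returns every substring of token (lower-cased) whose length is between
     * minLen and maxLen and that occurs in the dictionary. When
     * onlyLongestMatch is true, only the longest match per start offset is kept.
     */
    static List<String> subwords(String token, Set<String> dictionary,
                                 int minLen, int maxLen, boolean onlyLongestMatch) {
        String lower = token.toLowerCase(Locale.ROOT);
        List<String> result = new ArrayList<>();
        for (int start = 0; start + minLen <= lower.length(); start++) {
            String longest = null;
            for (int len = minLen; len <= maxLen && start + len <= lower.length(); len++) {
                String candidate = lower.substring(start, start + len);
                if (dictionary.contains(candidate)) {
                    if (onlyLongestMatch) {
                        longest = candidate; // lengths grow, so the last hit is the longest
                    } else {
                        result.add(candidate);
                    }
                }
            }
            if (longest != null) {
                result.add(longest);
            }
        }
        return result;
    }
}
```

The cost grows with the token length times the number of candidate lengths, which is why a hyphenation-based filter that only tests substrings between hyphenation points can be attractive for long compounds.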
Is there any plan to integrate this patch into the official Lucene libraries in the short term?
Yes.
I'm now getting:
..../lucene/java/lucene-clean/contrib/analyzers/src/test/org/apache/lucene/analysis/compound/TestCompoundWordTokenFilter.java:60: warning: unmappable character for encoding utf-8
[javac] "Aufgabe", "Überwachung" };
Can you convert the classes in question to UTF-8 for the source?
Committed revision 657027.