  1. Lucene - Core
  2. LUCENE-1166

A tokenfilter to decompose compound words

    Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: modules/analysis
    • Labels:
      None
    • Lucene Fields:
      Patch Available

      Description

      A token filter that decomposes the compound words found in many Germanic languages (such as German and Swedish) into their component words, emitted as separate tokens.

      An example: Donaudampfschiff would be decomposed into Donau, dampf, and schiff, so that you can find the word even when you only enter "Schiff".
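
      The decomposition in the example can be illustrated with a minimal sketch. It is not the code from the attached patch: it uses a plain dictionary substring lookup instead of hyphenation patterns, and the class name CompoundSplitSketch, the method decompose, and the parameters minSubwordSize and maxSubwordSize are all hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class CompoundSplitSketch {

    /**
     * Hypothetical helper (not taken from the patch): returns every substring
     * of the token that is a dictionary word, in order of start position.
     * The dictionary is assumed to hold lowercase entries, and lowercasing is
     * assumed not to change the token length (true for the ASCII example).
     */
    static List<String> decompose(String token, Set<String> dictionary,
                                  int minSubwordSize, int maxSubwordSize) {
        List<String> parts = new ArrayList<>();
        String lower = token.toLowerCase(Locale.ROOT);
        for (int start = 0; start <= lower.length() - minSubwordSize; start++) {
            for (int len = minSubwordSize;
                 len <= maxSubwordSize && start + len <= lower.length();
                 len++) {
                if (dictionary.contains(lower.substring(start, start + len))) {
                    // Emit the part with the casing of the original token.
                    parts.add(token.substring(start, start + len));
                }
            }
        }
        return parts;
    }

    public static void main(String[] args) {
        // Toy dictionary for the example word only (an assumption).
        Set<String> dictionary = Set.of("donau", "dampf", "schiff");
        // Prints [Donau, dampf, schiff]
        System.out.println(decompose("Donaudampfschiff", dictionary, 3, 15));
    }
}
```

      In a real filter the parts would be emitted as additional tokens next to the original word, so that a query for "Schiff" (lowercased by the analyzer) can match a document containing "Donaudampfschiff".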

      I use the hyphenation code from the Apache XML project FOP (http://xmlgraphics.apache.org/fop/) to do the first step of decomposition. Currently I use the FOP jars directly. I only use a handful of classes from the FOP project.

      My question now:
      Would it be OK to copy these classes over to the Lucene project (renaming the packages, of course), or should I stick with the dependency on the FOP jars? The FOP code uses the ASF V2 license as well.

      What do you think?

        Attachments

        1. CompoundTokenFilter.patch
          106 kB
          Thomas Peuss
        2. CompoundTokenFilter.patch
          106 kB
          Thomas Peuss
        3. CompoundTokenFilter.patch
          105 kB
          Thomas Peuss
        4. CompoundTokenFilter.patch
          99 kB
          Thomas Peuss
        5. CompoundTokenFilter.patch
          90 kB
          Thomas Peuss
        6. CompoundTokenFilter.patch
          90 kB
          Thomas Peuss
        7. CompoundTokenFilter.patch
          91 kB
          Thomas Peuss
        8. CompoundTokenFilter.patch
          90 kB
          Thomas Peuss
        9. CompoundTokenFilter.patch
          85 kB
          Thomas Peuss
        10. CompoundTokenFilter.patch
          76 kB
          Thomas Peuss
        11. CompoundTokenFilter.patch
          71 kB
          Thomas Peuss
        12. de.xml
          48 kB
          Thomas Peuss
        13. hyphenation.dtd
          3 kB
          Thomas Peuss

              People

              • Assignee:
                gsingers Grant Ingersoll
                Reporter:
                tpeuss Thomas Peuss