Lucene - Core / LUCENE-1166

A tokenfilter to decompose compound words

Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: modules/analysis
    • Labels: None
    • Lucene Fields: Patch Available

    Description

      A token filter that decomposes the compound words found in many Germanic languages (such as German, Swedish, ...) into single tokens.

      An example: Donaudampfschiff would be decomposed to Donau, dampf, schiff so that you can find the word even when you only enter "Schiff".
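
      The decomposition in the example above can be illustrated with a minimal sketch. It is not the code from the attached patch: it assumes a plain dictionary lookup instead of the FOP hyphenation step, and the class name CompoundSketch, the toy dictionary, and the minSubwordSize parameter are all hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class CompoundSketch {
    // Hypothetical toy dictionary; a real filter would load a full word list.
    static final Set<String> DICTIONARY =
            new HashSet<>(Arrays.asList("donau", "dampf", "schiff"));

    // Splits a compound word into known subwords, scanning left to right and
    // preferring the longest dictionary entry that starts at the current offset.
    static List<String> decompose(String word, Set<String> dictionary, int minSubwordSize) {
        String lower = word.toLowerCase(Locale.ROOT);
        List<String> parts = new ArrayList<>();
        int start = 0;
        while (start < lower.length()) {
            int matchEnd = -1;
            for (int end = lower.length(); end >= start + minSubwordSize; end--) {
                if (dictionary.contains(lower.substring(start, end))) {
                    matchEnd = end;
                    break;
                }
            }
            if (matchEnd < 0) {
                // No dictionary entry starts here; skip one character
                // (for example a linking letter between two subwords).
                start++;
            } else {
                parts.add(lower.substring(start, matchEnd));
                start = matchEnd;
            }
        }
        return parts;
    }

    public static void main(String[] args) {
        // Prints [donau, dampf, schiff]
        System.out.println(decompose("Donaudampfschiff", DICTIONARY, 3));
    }
}
```

      Because the sketch lowercases its input, a search for "Schiff" would match once the query is lowercased the same way. A greedy longest-match scan like this can mis-split words when dictionary entries overlap, which is one reason to narrow down the candidate split points first, as the hyphenation step described in this issue does.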

      I use the hyphenation code from the Apache XML project FOP (http://xmlgraphics.apache.org/fop/) to do the first step of decomposition. Currently I use the FOP jars directly. I only use a handful of classes from the FOP project.

      My question now:
      Would it be OK to copy these classes over to the Lucene project (renaming the packages, of course), or should I stick with the dependency on the FOP jars? The FOP code uses the ASF V2 license as well.

      What do you think?

      Attachments

        1. CompoundTokenFilter.patch
          106 kB
          Thomas Peuss
        2. CompoundTokenFilter.patch
          106 kB
          Thomas Peuss
        3. CompoundTokenFilter.patch
          105 kB
          Thomas Peuss
        4. CompoundTokenFilter.patch
          99 kB
          Thomas Peuss
        5. CompoundTokenFilter.patch
          90 kB
          Thomas Peuss
        6. CompoundTokenFilter.patch
          90 kB
          Thomas Peuss
        7. CompoundTokenFilter.patch
          91 kB
          Thomas Peuss
        8. CompoundTokenFilter.patch
          90 kB
          Thomas Peuss
        9. CompoundTokenFilter.patch
          85 kB
          Thomas Peuss
        10. CompoundTokenFilter.patch
          76 kB
          Thomas Peuss
        11. CompoundTokenFilter.patch
          71 kB
          Thomas Peuss
        12. de.xml
          48 kB
          Thomas Peuss
        13. hyphenation.dtd
          3 kB
          Thomas Peuss

          People

            Assignee: Grant Ingersoll (gsingers)
            Reporter: Thomas Peuss (tpeuss)
            Votes: 0
            Watchers: 1
