[JENA-1326] Generic Lucene Analyzers - ASF JIRA

Details

Type: New Feature
Status: Closed
Priority: Major
Resolution: Done
Affects Version/s: Jena 3.2.0
Fix Version/s: Jena 3.4.0
Component/s: Jena, Text

Description

Some analyzers in the Lucene distribution bundled with jena-text cannot currently be referenced in assembler configurations. In addition, many analyzers provide constructors that accept parameters, such as stop word sets and stem exclusion sets, that are not currently supported. Finally, analyzers that are not part of the Lucene distribution may be needed, and there is currently no way to refer to them without modifying the jena-text source.

This issue proposes a jena-text assembler configuration feature that permits the specification of generic Lucene Analyzers, given a fully qualified class name and a list of parameters for a constructor of that class. It also allows such specifications to be named for use in the Multilingual feature and in other text:analyzer specifications.

A text:GenericAnalyzer specification is similar to other text:analyzer specifications:

           text:analyzer [
               a text:GenericAnalyzer ;
               text:class "org.apache.lucene.analysis.en.EnglishAnalyzer" ;
               text:params (
                    [ text:paramName "stopwords" ;
                      text:paramType text:TypeSet ;
                      text:paramValue ("the" "a" "an") ]
                    [ text:paramName "stemExclusionSet" ;
                      text:paramType text:TypeSet ;
                      text:paramValue ("ing" "ed") ]
                    )
           ] .

The text:class is the fully qualified class name of the desired analyzer.

The parameters may be of the following types:

Type	Description
`text:TypeAnalyzer`	a subclass of `org.apache.lucene.analysis.Analyzer`
`text:TypeBoolean`	a Java `boolean`
`text:TypeFile`	the `String` path to a file materialized as a `java.io.FileReader`
`text:TypeInt`	a Java `int`
`text:TypeString`	a Java `String`
`text:TypeSet`	an `org.apache.lucene.analysis.CharArraySet`

Although the list of types is not exhaustive, it is a simple matter to create a wrapper Analyzer that reads a file containing whatever information is needed to initialize the parameters of a given Analyzer. The provided types cover the most common cases.

For example, org.apache.lucene.analysis.ja.JapaneseAnalyzer has a constructor with four parameters: a UserDictionary, a JapaneseTokenizer.Mode, a CharArraySet, and a Set<String>. A simple wrapper can extract the values for the parameters whose types are not available in this extension, construct the required instances, and instantiate the JapaneseAnalyzer.

Adding a custom Analyzer, such as the wrapper described above, is a simple matter of adding the Analyzer class and any associated filters, tokenizers, and so on to the classpath for Jena. All of the Analyzers included in the Lucene distribution bundled with Jena are available as well.
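
As a sketch of how such a wrapper might be referenced once it is on the classpath, the fragment below assumes a hypothetical class org.example.analysis.JapaneseAnalyzerWrapper whose constructor takes a single String: the path of its own settings file. Both the class name and the path are placeholders for illustration and are not part of jena-text or Lucene:

    text:analyzer [
        a text:GenericAnalyzer ;
        # hypothetical wrapper class; not shipped with Jena or Lucene
        text:class "org.example.analysis.JapaneseAnalyzerWrapper" ;
        text:params (
            # hypothetical settings file that the wrapper reads itself
            [ text:paramName "configPath" ;
              text:paramType text:TypeString ;
              text:paramValue "conf/japanese-analyzer.properties" ]
            )
    ] .

Passing the path as a text:TypeString leaves all parsing to the wrapper; text:TypeFile could be used instead if the wrapper's constructor took the opened file.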

Each parameter object is specified with:

an optional text:paramName that may be used to document which parameter is represented;
a text:paramType, which is one of text:TypeAnalyzer, text:TypeBoolean, text:TypeFile, text:TypeInt, text:TypeSet, or text:TypeString;
a text:paramValue, which is an xsd:string, xsd:boolean, xsd:int, or Resource.

A text:TypeSet parameter may have zero or more text:paramValue.

All other parameter types must have a single text:paramValue of the appropriate type.
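
To illustrate a single-valued parameter, the following sketch passes a stop word file to org.apache.lucene.analysis.core.StopAnalyzer, which has a constructor taking a java.io.Reader. The file path is a placeholder, and it is an assumption that the text:TypeFile value is matched against that Reader constructor:

    text:analyzer [
        a text:GenericAnalyzer ;
        text:class "org.apache.lucene.analysis.core.StopAnalyzer" ;
        text:params (
            # exactly one text:paramValue: the path of the stop word file (placeholder)
            [ text:paramName "stopwords" ;
              text:paramType text:TypeFile ;
              text:paramValue "conf/stopwords.txt" ]
            )
    ] .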

An example configuration using the ShingleAnalyzerWrapper is:

    text:map (
         [ text:field "text" ;
           text:predicate rdfs:label ;
           text:analyzer [
               a text:GenericAnalyzer ;
               text:class "org.apache.lucene.analysis.shingle.ShingleAnalyzerWrapper" ;
               text:params (
                    [ text:paramName "defaultAnalyzer" ;
                      text:paramType text:TypeAnalyzer ;
                      text:paramValue [ a text:SimpleAnalyzer ] ]
                    [ text:paramName "maxShingleSize" ;
                      text:paramType text:TypeInt ;
                      text:paramValue 3 ]
                    )
           ]
         ]
    ) .

The text:defineAnalyzers feature allows the Multilingual support to be extended. It can also be used to name analyzers defined via text:GenericAnalyzer so that a single (perhaps complex) analyzer configuration can be used in several places.

text:defineAnalyzers is used with text:TextIndexLucene to provide a list of analyzer definitions:

    <#indexLucene> a text:TextIndexLucene ;
        text:directory <file:Lucene> ;
        text:entityMap <#entMap> ;
        text:defineAnalyzers (
            [ text:addLang "sa-x-iast" ;
              text:analyzer [ . . . ] ]
            [ text:defineAnalyzer <#foo> ;
              text:analyzer [ . . . ] ]
        )
        .

References to a defined analyzer may be made in the entity map like:

    text:analyzer [
        a text:DefinedAnalyzer ;
        text:useAnalyzer <#foo> ]

Multilingual support currently allows a fixed set of ISO 2-letter language codes to select among built-in analyzers, each instantiated with its nullary constructor. So if one wants to:

use a language not included, e.g., Brazilian; or
use additional constructors defining stop words, stem exclusions and so on; or
refer to custom analyzers that might be associated with generalized BCP-47 language tags, such as sa-x-iast for Sanskrit in the IAST transliteration,

then text:defineAnalyzers with text:addLang will add the desired analyzers to the multilingual support so that fields with the appropriate language tags will use the appropriate custom analyzer.

When text:defineAnalyzers is used with text:addLang, text:multilingualSupport is implicitly added if it is not already specified, and a warning is written to the log. For example:

        text:defineAnalyzers (
            [ text:addLang "sa-x-iast" ;
              text:analyzer [ . . . ] ]

This adds an analyzer to be used when the text:langField has the value sa-x-iast during indexing and search.
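
As an illustration of a language that is not built in combined with a non-default constructor, the following sketch registers org.apache.lucene.analysis.br.BrazilianAnalyzer with a custom stop word set. The language tag pt-BR and the stop words are assumptions chosen for the example:

    text:defineAnalyzers (
        [ text:addLang "pt-BR" ;     # assumed tag for Brazilian Portuguese
          text:analyzer [
              a text:GenericAnalyzer ;
              text:class "org.apache.lucene.analysis.br.BrazilianAnalyzer" ;
              text:params (
                   # illustrative stop words only
                   [ text:paramName "stopwords" ;
                     text:paramType text:TypeSet ;
                     text:paramValue ("de" "a" "o") ]
                   )
          ] ]
    )

With such an entry, fields whose language tag is pt-BR would be expected to use the configured BrazilianAnalyzer.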

Repeating a text:GenericAnalyzer specification for use with multiple fields in an entity map may be cumbersome. text:defineAnalyzer is used in an element of a text:defineAnalyzers list to associate a resource with an analyzer so that it may be referred to later in a text:analyzer object. Assuming that an analyzer definition such as the following appears in the text:defineAnalyzers list:

    [ text:defineAnalyzer <#foo> ;
      text:analyzer [ . . . ] ]

then in a text:analyzer specification in an entity map, for example, a reference to analyzer <#foo> is made via:

    text:map (
         [ text:field "text" ;
           text:predicate rdfs:label ;
           text:analyzer [
               a text:DefinedAnalyzer ;
               text:useAnalyzer <#foo> ]
         ]
    ) .

This makes it straightforward to refer to the same (possibly complex) analyzer definition in multiple fields.
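
Putting the pieces together, a minimal sketch of an index definition that names one analyzer and uses it for two fields might look as follows. The resource <#shingles>, the field names, and the choice of skos:prefLabel as the second predicate are illustrative assumptions; the text:, rdfs:, and skos: prefixes are assumed to be declared elsewhere in the file:

    <#indexLucene> a text:TextIndexLucene ;
        text:directory <file:Lucene> ;
        text:entityMap <#entMap> ;
        text:defineAnalyzers (
            # name the analyzer once (resource name is illustrative)
            [ text:defineAnalyzer <#shingles> ;
              text:analyzer [
                  a text:GenericAnalyzer ;
                  text:class "org.apache.lucene.analysis.shingle.ShingleAnalyzerWrapper" ;
                  text:params (
                       [ text:paramName "defaultAnalyzer" ;
                         text:paramType text:TypeAnalyzer ;
                         text:paramValue [ a text:SimpleAnalyzer ] ]
                       [ text:paramName "maxShingleSize" ;
                         text:paramType text:TypeInt ;
                         text:paramValue 3 ]
                       )
              ] ]
        )
        .

    <#entMap> a text:EntityMap ;
        text:entityField "uri" ;
        text:defaultField "label" ;
        text:map (
             # both fields refer to the same named analyzer
             [ text:field "label" ;
               text:predicate rdfs:label ;
               text:analyzer [ a text:DefinedAnalyzer ; text:useAnalyzer <#shingles> ] ]
             [ text:field "prefLabel" ;
               text:predicate skos:prefLabel ;
               text:analyzer [ a text:DefinedAnalyzer ; text:useAnalyzer <#shingles> ] ]
             )
        .

Changing the shingle size then requires editing only the single definition under text:defineAnalyzers.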
