Apache Jena / JENA-1453

jena-text Lucene docs contain graph field duplicates

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: Jena 3.6.0
    • Fix Version/s: Jena 3.7.0
    • Component/s: Jena, Text
    • Labels: None
    • Environment: All

      Description

      The current jena-text integration of Lucene has both duplicate and unused fields that increase the required space and reduce the performance of the Lucene integration.

      Consider:

          ex:SomeOne
             a       ex:Item ;
             skos:prefLabel "Some One" ;
             skos:prefLabel "Some Neat One"@en .
      

      Assuming that:

      [] a text:EntityMap ;
          text:entityField      "uri" ;
          text:uidField         "uid" ;
          text:defaultField     "label" ;
          text:langField        "lang" ;
          text:graphField       "graph" ;
          text:map (
               [ text:field "label" ;
                 text:predicate skos:prefLabel ]
          ) .
      

      and that text:multilingualSupport is false, the two Lucene documents that will be indexed appear as follows:

      Document<
        stored,indexed,indexOptions=DOCS<uri:http://example.org/SomeOne> 
        indexed,omitNorms,indexOptions=DOCS<graph:http://example.org/G1> 
        stored,indexed,tokenized<label:Some One>
        stored,indexed,omitNorms,indexOptions=DOCS 
          <uid:e7e369a1db7ff71723fda412d1f6308e1f71dd413621f0804ab97858af51196b> 
        stored,indexed,tokenized<graph:http://example.org/G1> 
        stored,indexed,omitNorms,indexOptions=DOCS
          <uid:50b49835488db84487e6e11287b570d7a9b8624fa714a9d51bf8ef444cc60bee>
      >
      
      Document<
        stored,indexed,indexOptions=DOCS<uri:http://example.org/SomeOne> 
        indexed,omitNorms,indexOptions=DOCS<graph:http://example.org/G1> 
        stored,indexed,tokenized<label:Some Neat One> 
        stored,indexed,omitNorms,indexOptions=DOCS<lang:en> 
        stored,indexed,omitNorms,indexOptions=DOCS
          <uid:2cf2b62a4a048d6517a0edddb0dabfdf190f4e074daf077b21a3844c5831376f> 
        stored,indexed,tokenized<graph:http://example.org/G1> 
        stored,indexed,omitNorms,indexOptions=DOCS<lang:en> 
        stored,indexed,omitNorms,indexOptions=DOCS
          <uid:b5dbce956b7105e9c5424620330e5ec3a9d78e8c7d73cba5a880984fa2e89bfd>
      >
      

      The graph field (with its associated lang and uid fields) appears twice in each document. The first occurrence results from the text:graphField configuration; the second is an artifact of TextQueryFuncs.entityFromQuad adding the graph to the Entity via entity.put(...).

      This second occurrence of the graph field serves no purpose: there is no search over tokenized graph URIs, and since there is currently no way to return the graph field, there is no need to store it.

      It might well be a useful improvement to allow the graph field to be retrieved via the text:query property function (PF), but that would most reasonably be done by setting Field.Store.YES on the FieldType for the initial occurrence of the graph field.

      The second occurrence of the uid field is a by-product of the unnecessary graph occurrence, produced during the Entity to Document conversion in TextLuceneIndex. This second uid is never used, since the purpose of the uid field is to handle deleting documents from the Lucene index when a triple is deleted, and that does not involve the graph URI.

      The solution is to delete lines 89-90 of TextQueryFuncs.
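
      The duplication described above, and the effect of removing the extra entity.put(...), can be illustrated with a small stand-alone model. This is only a sketch under stated assumptions: the class GraphFieldDuplicates, its nested Entity, and the methods entityFromQuad and toDocument here are hypothetical stand-ins written for illustration, not the actual jena-text code, and the uid values are placeholders instead of real hashes.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Simplified, hypothetical model of the Entity-to-Document conversion.
// None of these types are the real jena-text or Lucene classes.
class GraphFieldDuplicates {

    // Stand-in for an entity: subject URI, graph URI, optional language tag,
    // and the map of field name -> value that will be indexed.
    static final class Entity {
        final String uri;
        final String graph;
        final String lang; // null when the literal has no language tag
        final Map<String, String> fields = new LinkedHashMap<>();

        Entity(String uri, String graph, String lang) {
            this.uri = uri;
            this.graph = graph;
            this.lang = lang;
        }
    }

    // Builds an entity from one quad. When putGraphInMap is true the graph
    // URI is also stored as an ordinary field, which models the behavior
    // the issue proposes to remove.
    static Entity entityFromQuad(String graph, String subject, String label,
                                 String lang, boolean putGraphInMap) {
        Entity e = new Entity(subject, graph, lang);
        e.fields.put("label", label);
        if (putGraphInMap) {
            e.fields.put("graph", graph);
        }
        return e;
    }

    // Converts the entity to a flat list of "name:value" strings.
    // Assumption: every mapped field is followed by its lang (if any)
    // and a uid, mirroring the document dumps in the description.
    static List<String> toDocument(Entity e) {
        List<String> doc = new ArrayList<>();
        doc.add("uri:" + e.uri);
        doc.add("graph:" + e.graph); // from the graphField configuration
        for (Map.Entry<String, String> f : e.fields.entrySet()) {
            doc.add(f.getKey() + ":" + f.getValue());
            if (e.lang != null) {
                doc.add("lang:" + e.lang);
            }
            doc.add("uid:<hash for " + f.getKey() + ">"); // placeholder, not a real hash
        }
        return doc;
    }

    // Counts how many entries in the document belong to the named field.
    static long count(List<String> doc, String fieldName) {
        return doc.stream().filter(s -> s.startsWith(fieldName + ":")).count();
    }

    public static void main(String[] args) {
        String g = "http://example.org/G1";
        String s = "http://example.org/SomeOne";
        List<String> before = toDocument(entityFromQuad(g, s, "Some Neat One", "en", true));
        List<String> after = toDocument(entityFromQuad(g, s, "Some Neat One", "en", false));
        System.out.println("with extra put:    " + before);
        System.out.println("without extra put: " + after);
    }
}
```

      In this model, dropping the extra put leaves one graph, one lang, and one uid entry per document, which is the outcome the proposed deletion is expected to have on the real index.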

              People

              • Assignee: Code Ferret (code-ferret)
              • Reporter: Code Ferret (code-ferret)