Tika / TIKA-1691

Apache Tika for enabling metadata interoperability


Details

    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved

    Description

      If I am not mistaken, Tika already provides (partially) consistent metadata across file formats through TikaCoreProperties and, within the context of Solr, through ExtractingRequestHandler (by defining how to map metadata fields in solrconfig.xml). However, I am working on a new component for both schema mapping (which operates on the names of metadata properties) and instance transformation (which operates on the values of metadata). It essentially consists of the following changes:

      • A wrapper around the Metadata object (MappedMetadata.java) that decorates the set method (currently at line 367 of Metadata.java) by applying the configured mapping functions before setting metadata properties.
      • Basic mapping functions (BasicMappingUtils.java): utility methods that map a set of metadata to the target schema.
      • A new MetadataConfig object that, like TikaConfig, can be configured via an XML file (organized as shown in the following snippet) and enables fine-grained metadata mapping by using Java reflection.
      tika-metadata.xml
      <?xml version="1.0" encoding="UTF-8" standalone="no"?>
      <properties>
        <mappings>
          <mapping type="type/sub-type">
            <relation name="SOURCE_FIELD">
              <target>TARGET_FIELD</target>
              <expression>exclude|include|equivalent|overlap</expression>
              <function name="FUNCTION_NAME">
                <argument>ARGUMENT_VALUE</argument>
              </function>
              <cardinality>
                <source>SOURCE_CARDINALITY</source>
                <target>TARGET_CARDINALITY</target>
                <order>ORDER_NUMBER</order>
                <dependencies>
                  <field>FIELD_NAME</field>
                </dependencies>
              </cardinality>
            </relation>
          </mapping>
          ...
          <mapping> <!-- This contains the fallback strategy for unknown metadata -->
            <relation>
              ...
            </relation>
          </mapping>
        </mappings>
      </properties>
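To make the file layout above more concrete, here is a minimal sketch of how such a file might be read with the JDK's DOM parser. It is not the proposed MetadataConfig: the class name MappingConfigReader and the method readTargets are hypothetical, and only the type and name attributes and the direct target child of each relation are read. The expression, function and cardinality elements are ignored.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

/** Hypothetical reader for the tika-metadata.xml layout; not part of Tika. */
class MappingConfigReader {

    /** Returns source field name -> target field name for one media type. */
    static Map<String, String> readTargets(String xml, String mediaType) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalStateException("Cannot parse mapping configuration", e);
        }
        Map<String, String> targets = new LinkedHashMap<>();
        NodeList mappings = doc.getElementsByTagName("mapping");
        for (int i = 0; i < mappings.getLength(); i++) {
            Element mapping = (Element) mappings.item(i);
            if (!mediaType.equals(mapping.getAttribute("type"))) {
                continue; // a different media type, or the fallback mapping without a type
            }
            NodeList relations = mapping.getElementsByTagName("relation");
            for (int j = 0; j < relations.getLength(); j++) {
                Element relation = (Element) relations.item(j);
                String target = directChildText(relation, "target");
                if (target != null) {
                    targets.put(relation.getAttribute("name"), target);
                }
            }
        }
        return targets;
    }

    /** Text of the first direct child with the given tag, so cardinality/target is skipped. */
    private static String directChildText(Element parent, String tag) {
        for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n instanceof Element && tag.equals(((Element) n).getTagName())) {
                return n.getTextContent().trim();
            }
        }
        return null;
    }
}
```

The helper looks only at direct children because target appears twice in the layout, once under relation and once under cardinality.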
      
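The decorated set method from the first bullet can be sketched roughly as follows. To stay self-contained, the sketch uses a tiny stand-in class (SimpleMetadata) rather than Tika's Metadata, and the names Relation and MappedMetadataSketch are hypothetical. It only illustrates the idea of renaming a property and transforming its value before it is stored, not the actual MappedMetadata.java.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.UnaryOperator;

/** Hypothetical stand-in for Tika's Metadata, reduced to single-valued set/get. */
class SimpleMetadata {
    private final Map<String, String> values = new LinkedHashMap<>();

    public void set(String name, String value) {
        values.put(name, value);
    }

    public String get(String name) {
        return values.get(name);
    }
}

/** One mapping rule: the target field name plus a function applied to the value. */
final class Relation {
    final String target;
    final UnaryOperator<String> function;

    Relation(String target, UnaryOperator<String> function) {
        this.target = target;
        this.function = function;
    }
}

/** Decorates set(): applies the configured relation before storing the property. */
class MappedMetadataSketch extends SimpleMetadata {
    private final Map<String, Relation> relations;

    MappedMetadataSketch(Map<String, Relation> relations) {
        this.relations = relations;
    }

    @Override
    public void set(String name, String value) {
        Relation relation = relations.get(name);
        if (relation == null) {
            super.set(name, value); // no rule for this field: store it unchanged
        } else {
            // schema mapping (rename) and instance transformation (value function)
            super.set(relation.target, relation.function.apply(value));
        }
    }
}

class MappedMetadataDemo {
    public static void main(String[] args) {
        Map<String, Relation> relations = new LinkedHashMap<>();
        relations.put("Author", new Relation("dc:creator", String::trim));
        SimpleMetadata metadata = new MappedMetadataSketch(relations);
        metadata.set("Author", "  Giuseppe  ");
        System.out.println(metadata.get("dc:creator"));
    }
}
```

Here the relation map is built by hand; in the proposal it would come from the XML configuration.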

      The theoretical definition of metadata mapping is available in "A survey of techniques for achieving metadata interoperability". The paper also shows some basic examples of metadata mappings.

      Currently, I am still working on some core functionality, but I have already run some experiments with a small prototype.

      By the way, I think we should modify the add method to call set instead of metadata.put (currently at line 316 of Metadata.java). This is a trivial change (I could create a new Jira issue for it), but it would make this method consistent with the other implementation of add and, moreover, would make the methods of Metadata easier to extend.
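Why routing add through set helps extension can be shown with a small stand-in. The classes MultiValueStore and TrimmingStore below are hypothetical and do not reproduce Tika's Metadata. The point is only that when add writes through set, a subclass that overrides set automatically covers add as well.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical multi-valued store; a stand-in, not Tika's Metadata class. */
class MultiValueStore {
    private final Map<String, String[]> data = new LinkedHashMap<>();

    public void set(String name, String[] values) {
        data.put(name, values);
    }

    public String[] getValues(String name) {
        String[] values = data.get(name);
        return values == null ? new String[0] : values;
    }

    /** Appends a value and stores the result through set(), not data.put(). */
    public void add(String name, String value) {
        String[] old = getValues(name);
        String[] merged = Arrays.copyOf(old, old.length + 1);
        merged[old.length] = value;
        set(name, merged); // single write path: overriding set() also covers add()
    }
}

/** Overrides only set(); add() picks up the behaviour without being overridden. */
class TrimmingStore extends MultiValueStore {
    @Override
    public void set(String name, String[] values) {
        String[] trimmed = new String[values.length];
        for (int i = 0; i < values.length; i++) {
            trimmed[i] = values[i].trim();
        }
        super.set(name, trimmed);
    }
}
```

If add called data.put directly, TrimmingStore would also have to override add to get the same effect.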

      I would really appreciate your feedback on this proposal. If you think it is a good idea, I could provide the code in a few days.

      Thanks a lot,
      Giuseppe


          People

            gostep Giuseppe Totaro
