  1. Tika
  2. TIKA-2085 Tika 2.0.0 -- Overarching task list for what we need to do before 2.0.0
  3. TIKA-1607

Introduce new arbitrary object key/values data structure for persistence of Tika Metadata

Details

    • Sub-task
    • Status: Open
    • Critical
    • Resolution: Unresolved
    • None
    • 1.17, 2.0.0-BETA, 2.1.0
    • core, metadata
    • None

    Description

      I am currently working on more comprehensive phone number extraction in Tika, together with better metadata modeling for the extracted numbers.
      Right now we use the String[] multi-valued support available within Tika to persist phone numbers as:

      Metadata: String: String[]
      Metadata: phonenumbers: number1, number2, number3, ...
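For illustration, the current String-to-String[] model can be approximated with a small stand-in class. This is a minimal sketch, not Tika's own code: `FlatMetadata` is a hypothetical name, and its `add` and `getValues` methods only imitate the behaviour of the same-named methods on `org.apache.tika.metadata.Metadata`.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the String -> String[] model; not Tika's actual class.
class FlatMetadata {
    private final Map<String, List<String>> values = new LinkedHashMap<>();

    // Append one more value under the given key.
    void add(String name, String value) {
        values.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
    }

    // Return all values stored for a key, or an empty array if the key is absent.
    String[] getValues(String name) {
        List<String> list = values.get(name);
        return list == null ? new String[0] : list.toArray(new String[0]);
    }
}
```

Each phone number is a bare string here, so there is nowhere to attach per-number attributes such as a country code.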
      

      I would like to propose that we extend multi-valued support beyond the String[] paradigm by introducing a more abstract Collection of Objects, so that the phone number use case could be modeled as follows:

      Metadata: String:  Object
      

      Where Object could be a Collection<HashMap<String/Property, HashMap<String/Property, String/Int/Long>>>, e.g.

      Metadata: phonenumbers: [(+162648743476: (LibPN-CountryCode : US), (LibPN-NumberType: International), (etc: etc)...), (+1292611054: (LibPN-CountryCode : UK), (LibPN-NumberType: International), (etc: etc)...), (etc)]
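The nested structure above could be expressed with plain Java collections roughly as follows. This is a sketch under assumptions: `StructuredMetadataSketch`, `build`, and `entry` are hypothetical names, plain `String` keys stand in for the String/Property alternative, and only the two attributes named in the example are filled in.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration of a String -> Object metadata model; not an existing Tika type.
class StructuredMetadataSketch {

    // Build a metadata map whose "phonenumbers" value is a collection of
    // {number -> {attribute -> value}} maps, mirroring the example above.
    static Map<String, Object> build() {
        Collection<Map<String, Map<String, Object>>> numbers = new ArrayList<>();
        numbers.add(entry("+162648743476", "US"));
        numbers.add(entry("+1292611054", "UK"));

        Map<String, Object> metadata = new LinkedHashMap<>();
        metadata.put("phonenumbers", numbers);
        return metadata;
    }

    // One phone number together with its attribute map.
    static Map<String, Map<String, Object>> entry(String number, String countryCode) {
        Map<String, Object> attributes = new LinkedHashMap<>();
        attributes.put("LibPN-CountryCode", countryCode);
        attributes.put("LibPN-NumberType", "International");

        Map<String, Map<String, Object>> result = new LinkedHashMap<>();
        result.put(number, attributes);
        return result;
    }
}
```

Because the value type is `Object`, callers have to cast before they can read the nested attributes, which is part of the cost of this flexibility.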
      

      There are obvious backwards compatibility issues with this approach. Additionally, it is a fundamental change to the core Metadata API. I hope, however, that the <String, Object> mapping is flexible enough to allow me to model Tika Metadata the way I want.
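One conceivable way to soften the compatibility problem is to derive a String[] view from the richer values for callers that still expect the old shape. The sketch below is only an assumption about how that might look, not something taken from the attached patches; `LegacyView` and `toStringValues` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.Map;

// Hypothetical adapter: flattens an arbitrary metadata value into the legacy String[] shape.
class LegacyView {

    static String[] toStringValues(Object value) {
        List<String> out = new ArrayList<>();
        if (value == null) {
            return new String[0];
        }
        if (value instanceof Collection) {
            for (Object item : (Collection<?>) value) {
                if (item instanceof Map) {
                    // For an entry such as {number -> attributes}, keep only the keys.
                    for (Object key : ((Map<?, ?>) item).keySet()) {
                        out.add(String.valueOf(key));
                    }
                } else {
                    out.add(String.valueOf(item));
                }
            }
        } else {
            // A single non-collection value becomes a one-element array.
            out.add(String.valueOf(value));
        }
        return out.toArray(new String[0]);
    }
}
```

A view like this is lossy: the nested attributes are dropped, so it helps existing readers but does not remove the need for an API change.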

      Any comments folks? Thanks
      Lewis

      Attachments

        1. TIKA-1607_bytes_dom_values.patch
          18 kB
          Tim Allison
        2. TIKA-1607v1_rough_rough.patch
          42 kB
          Tim Allison
        3. TIKA-1607v2_rough_rough.patch
          51 kB
          Tim Allison
        4. TIKA-1607v3.patch
          61 kB
          Tim Allison

        Issue Links

        Activity

          People

            lewismc Lewis John McGibbney
            lewismc Lewis John McGibbney

            Dates

              Created:
              Updated:
