TIKA-928

Separation of Tika Core Properties From Metadata Processing

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.1
    • Fix Version/s: None
    • Component/s: metadata
    • Labels: None

      Description

      The Metadata class is a bit overloaded with both processing and core Tika properties defined in the same place.

      Separating the core properties into a TikaCoreProperties class, containing only composite properties that reference other standards such as DublinCore, will allow the Metadata class to focus on processing. It will also ease the transition away from the now-deprecated String properties that were included directly in Metadata via the implements clause.

      This will also allow us to cherry-pick only the properties we want from a standard as Tika core properties, rather than having to include every property in a standard's interface, some of which may be more specific to a particular content type than we want.
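
      As a rough illustration of the idea, here is a minimal, self-contained sketch. It does not use Tika's actual classes: Property, MetadataSketch, DublinCoreSketch, LegacyKeys, CoreProperties and all key names below are hypothetical stand-ins. It assumes a composite property has one primary key taken from a standard plus secondary keys kept for backwards compatibility, and that setting the composite writes all of them.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class CorePropertiesDemo {

    /** Hypothetical metadata key. A composite also lists secondary keys it mirrors to. */
    static final class Property {
        final String name;
        final Property[] secondary;

        private Property(String name, Property... secondary) {
            this.name = name;
            this.secondary = secondary;
        }

        static Property simple(String name) {
            return new Property(name);
        }

        /** The composite takes the primary's name and remembers the secondary keys. */
        static Property composite(Property primary, Property... secondary) {
            return new Property(primary.name, secondary);
        }
    }

    /** Stand-in for a standard's full interface (assumed key names). */
    static final class DublinCoreSketch {
        static final Property TITLE = Property.simple("dc:title");
        static final Property CREATOR = Property.simple("dc:creator");
        // Defined by the standard, but deliberately not promoted to a core property.
        static final Property COVERAGE = Property.simple("dc:coverage");
    }

    /** Stand-in for the deprecated String keys (assumed names). */
    static final class LegacyKeys {
        static final Property TITLE = Property.simple("title");
        static final Property AUTHOR = Property.simple("Author");
    }

    /** Only the cherry-picked properties, each a composite over a standard key. */
    static final class CoreProperties {
        static final Property TITLE =
                Property.composite(DublinCoreSketch.TITLE, LegacyKeys.TITLE);
        static final Property CREATOR =
                Property.composite(DublinCoreSketch.CREATOR, LegacyKeys.AUTHOR);
    }

    /** Processing side only: stores values and knows nothing about specific keys. */
    static final class MetadataSketch {
        private final Map<String, String> values = new LinkedHashMap<>();

        void set(Property property, String value) {
            values.put(property.name, value);
            // Mirror the value to every secondary key so older callers still find it.
            for (Property alias : property.secondary) {
                values.put(alias.name, value);
            }
        }

        String get(String name) {
            return values.get(name);
        }
    }

    public static void main(String[] args) {
        MetadataSketch metadata = new MetadataSketch();
        metadata.set(CoreProperties.TITLE, "Annual Report");
        System.out.println(metadata.get("dc:title"));
        System.out.println(metadata.get("title"));
    }
}
```

      In this sketch, a caller that sets CoreProperties.TITLE populates both the standard key and the legacy key, while a standard property that was not cherry-picked (COVERAGE here) stays out of the core set.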

          Activity

          Ray Gauss II added a comment -

          Apply to tika-core.

          Nick Burch added a comment -

          Thanks, applied (with a few extra JavaDoc bits) in r1339404.

          It's good to finally make clear which bits of metadata we try to keep consistent across formats, and which ones are file-type specific. That way, external consumers who need format-specific details know which ones they are, while general users can be sure that the metadata they're looking at is consistent.

          Ray Gauss II added a comment -

          Changes to call properties defined in the new TikaCoreProperties class rather than the now deprecated Metadata string keys.

          Nick Burch added a comment -

          Core patch applied in r1339804, thanks

          Nick Burch added a comment -

          Parsers patch applied in r1339833, that was an epic patch, thanks!

          Nick Burch added a comment -

          As of r1339868 we now have most of a core set of Properties defined, including aliases for backwards compatibility (until Tika 2.0 when we can tidy things up!)

          There are probably still a few more common properties we should bring across (including converting to Properties with Prefixes as needed). These would come from the interfaces we've not yet worked on, and possibly a few more from MSOffice (which would need converting to Office in the process).

          With that in mind, I'll leave this issue open for now, until we finish the review and sort these additional properties out.


            People

            • Assignee: Unassigned
            • Reporter: Ray Gauss II
            • Votes: 0
            • Watchers: 2
