TIKA-759

Better handling of content type metadata

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: metadata, mime
    • Labels: None

      Description

      Currently we use the "Content-Type" metadata key for storing (and looking up) the media type of a document. This is simple enough and works well, especially with HTTP, but it is not well aligned with XMP or other metadata standards like Dublin Core. As an improvement, I propose the following:

      • Switch to "dc:format" as the standard metadata key for the content type
      • Keep the existing "Content-Type" key for backwards compatibility with existing clients
      • Make the Metadata class aware of such aliases
      • Add getFormat() and setFormat() utility methods to Metadata to simplify client code and to make the exact metadata key more of an implementation detail in Tika
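
      The proposal above could be sketched roughly as follows. This is only an illustration under assumptions: AliasAwareMetadata is a hypothetical stand-in, not Tika's actual Metadata class, and it stores a single value per key to keep the example short. Only the key names "dc:format" and "Content-Type" and the getFormat()/setFormat() method names come from the proposal itself.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a metadata container; not the actual Tika API.
class AliasAwareMetadata {
    static final String FORMAT = "dc:format";          // proposed canonical key
    static final String CONTENT_TYPE = "Content-Type"; // legacy key, kept as an alias

    // Maps each alias to its canonical key (assumed design, one alias here).
    private static final Map<String, String> ALIASES = new HashMap<>();
    static {
        ALIASES.put(CONTENT_TYPE, FORMAT);
    }

    // Simplification: one value per key.
    private final Map<String, String> values = new HashMap<>();

    // Resolve an alias to its canonical key; other keys pass through unchanged.
    private static String canonical(String key) {
        return ALIASES.getOrDefault(key, key);
    }

    public void set(String key, String value) {
        values.put(canonical(key), value);
    }

    public String get(String key) {
        return values.get(canonical(key));
    }

    // Utility accessors so client code does not need to know the exact key.
    public String getFormat() {
        return get(FORMAT);
    }

    public void setFormat(String mediaType) {
        set(FORMAT, mediaType);
    }

    public static void main(String[] args) {
        AliasAwareMetadata metadata = new AliasAwareMetadata();
        metadata.set(CONTENT_TYPE, "application/pdf"); // legacy client code
        System.out.println(metadata.getFormat());      // read back via the new accessor
    }
}
```

      Resolving aliases on both read and write means existing clients that use "Content-Type" and new clients that use "dc:format" or getFormat() see the same value, which is one way to get the backwards compatibility the proposal asks for.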

          Activity

          Chris A. Mattmann added a comment -

          +1 to this Jukka!

          In OODT-ville, for many years we've had something called a "Profile", see:

          http://svn.apache.org/repos/asf/oodt/trunk/profile/src/main/java/org/apache/oodt/profile/Profile.java

          A Profile is a metadata description of a resource with 3 different sets of attributes:

          • housekeeping information about the Profile (its ID, created time, etc.)
          • information about the data that the Profile points to (this is the Dublin Core set of information + some mods, and is housed in the http://svn.apache.org/repos/asf/oodt/trunk/profile/src/main/java/org/apache/oodt/profile/ResourceAttributes.java file)
          • domain-specific metadata, which we keep as a set of ProfileElements (housed in http://svn.apache.org/repos/asf/oodt/trunk/profile/src/main/java/org/apache/oodt/profile/ProfileElement.java) and its sub-classes, RangedProfileElement.java and EnumeratedProfileElement.java. ProfileElements correspond to ISO-11179 style elements, with information about, e.g., valid values, ranges, min/max, etc.

          Not saying we should adopt the above. Our OODT stuff is bloated in some areas, and could be reduced, but just thought I'd pass it along for some inspiration!

          Jörg Ehrlich added a comment -

          The move to dc:format should be done once this patch has been applied.

            People

            • Assignee: Jukka Zitting
            • Reporter: Jukka Zitting
            • Votes: 1
            • Watchers: 0
