TIKA-930

Consolidation of Some Tika Core Properties

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.2
    • Fix Version/s: 1.2
    • Component/s: metadata
    • Labels:
      None

      Description

      There are a few properties in TikaCoreProperties which overlap, and I think we should minimize ambiguity by consolidating each group into a single composite property with the clearest name, with the most general specification referenced as its primary property and the others, plus deprecated strings, as its secondaries.

      Here's the proposed pseudo-code for the changes:

      Remove TikaCoreProperties.SUBJECT
      TikaCoreProperties.KEYWORDS <- DublinCore.SUBJECT, { Office.KEYWORDS, MSOffice.KEYWORDS, Metadata.SUBJECT }

      Remove TikaCoreProperties.DATE
      TikaCoreProperties.CREATION_DATE <- DublinCore.DATE, { Office.CREATION_DATE, MSOffice.CREATION_DATE, Metadata.DATE }

      Remove TikaCoreProperties.MODIFIED
      TikaCoreProperties.SAVE_DATE <- DublinCore.MODIFIED, { Office.SAVE_DATE, MSOffice.LAST_SAVED, Metadata.MODIFIED, "Last-Modified" }

      and an example of the Java changes:

      TikaCoreProperties.java Before
          /**
           * @see DublinCore#SUBJECT
           */
          public static final Property SUBJECT = Property.composite(DublinCore.SUBJECT, 
                  new Property[] { Property.internalText(Metadata.SUBJECT) });
            
          /**
           * @see Office#KEYWORDS
           */
          public static final Property KEYWORDS = Property.composite(Office.KEYWORDS,
                  new Property[] { Property.internalTextBag(MSOffice.KEYWORDS) });
      

      would become

      TikaCoreProperties.java After
          /**
           * @see DublinCore#SUBJECT
           * @see Office#KEYWORDS
           */
          public static final Property KEYWORDS = Property.composite(DublinCore.SUBJECT,
                  new Property[] {
                      Office.KEYWORDS,
                      Property.internalTextBag(MSOffice.KEYWORDS),
                      Property.internalText(Metadata.SUBJECT)
                  });
      

      Since this would require a bit of refactoring for parsers that use the properties being removed, I thought it best to get some feedback before working up a full patch.

      Does this seem like a reasonable approach?
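To make the intent concrete, here is a minimal, self-contained sketch of the composite idea: one primary key plus secondary keys kept only for backwards compatibility, where a single write populates all of them. `CompositeKey` and `MetadataSketch` are hypothetical names used for illustration, not Tika classes, and the string keys are placeholders rather than the real property names.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a composite property: a primary key plus
// secondary keys that exist only for backwards compatibility.
final class CompositeKey {
    final String primary;
    final List<String> secondaries;

    CompositeKey(String primary, String... secondaries) {
        this.primary = primary;
        this.secondaries = Arrays.asList(secondaries);
    }
}

// Hypothetical metadata store; this is not Tika's Metadata class.
final class MetadataSketch {
    private final Map<String, String> values = new LinkedHashMap<>();

    // Writing through the composite sets the primary and every secondary key.
    void set(CompositeKey key, String value) {
        values.put(key.primary, value);
        for (String secondary : key.secondaries) {
            values.put(secondary, value);
        }
    }

    // Returns null when nothing was stored under the given name.
    String get(String name) {
        return values.get(name);
    }

    public static void main(String[] args) {
        // Placeholder key names, chosen only for this example.
        CompositeKey keywords = new CompositeKey("primary-subject", "old-keywords", "old-subject");
        MetadataSketch metadata = new MetadataSketch();
        metadata.set(keywords, "tika, metadata");
        System.out.println(metadata.get("primary-subject"));
        System.out.println(metadata.get("old-subject"));
    }
}
```

With this shape, code that still reads one of the older keys keeps working while parsers only need to write the consolidated property once.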

          Activity

          Jörg Ehrlich added a comment -

          Consolidation is a good idea.
          As a general comment up front, the DublinCore interface contains properties from the newer “Terms” namespace (http://dublincore.org/documents/dcmi-terms/). Please note that this newer version of DC has not been standardized yet, so the general question is whether the DC interface should use those properties. They are interesting because the new namespace introduces refinements of older properties (like “created” and “modified” instead of “date”), but such refinements are also available in already standardized namespaces like XMP.

          Here is a list of recommendations for the core properties:

          Creator:
          Remove Author because this is already covered by Creator.
          Creator is the Author, so there is no need to have two properties for this. DublinCore.Creator should also be an ordered array (as defined in the IPTC spec) instead of a simple text field.
          TikaCoreProperties.CREATOR <- DublinCore.CREATOR, { Metadata.CREATOR, Office.AUTHOR, MSOffice.AUTHOR }

          If Creator becomes an array, INITIAL_AUTHOR or LAST_AUTHOR are not necessarily needed anymore.
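
As a rough sketch of why an ordered array could make the separate author fields redundant: if creators are appended in order of contribution (an assumption of this example, not something the specs guarantee), the first and last entries can be read directly from the list. `CreatorListSketch` is a hypothetical helper, not part of Tika.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical holder for an ordered, multi-valued creator field.
final class CreatorListSketch {
    private final List<String> creators = new ArrayList<>();

    // Appends a creator; insertion order is preserved.
    void add(String name) {
        creators.add(name);
    }

    // First recorded creator, or null when the list is empty.
    String initialCreator() {
        return creators.isEmpty() ? null : creators.get(0);
    }

    // Most recently recorded creator, or null when the list is empty.
    String lastCreator() {
        return creators.isEmpty() ? null : creators.get(creators.size() - 1);
    }

    // Defensive copy of all creators in insertion order.
    List<String> all() {
        return new ArrayList<>(creators);
    }
}
```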

          Creation date:
          The original DublinCore.Date is not a specific point in time. That’s why it has never been used for a creation date in any application. And that’s why the DC organization has set up a newer namespace (see above) which introduces new date properties.
          But as this newer namespace is not really used yet, I propose the following:
          TikaCoreProperties.CREATION_DATE <- XMP.CREATE_DATE, { DublinCore.CREATED, Office.CREATION_DATE, MSOffice.CREATION_DATE, DublinCore.DATE, Metadata.DATE }

          Modification date:
          I would keep TikaCoreProperties.MODIFIED because so far “modified” has been the vocabulary used for “date the asset was last saved”. Here again the DC property is a newer, non-standardized one. Removal of SAVE_DATE is good.
          TikaCoreProperties.MODIFIED <- XMP.MODIFY_DATE, { Office.SAVE_DATE, MSOffice.LAST_SAVED, DublinCore.MODIFIED, Metadata.MODIFIED, "Last-Modified" }

          CreatorTool:
          Add CreatorTool which is the application that created the asset. It’s different to “Creator”.
          TikaCoreProperties.CREATOR_TOOL <- XMP.CREATOR_TOOL
          I have provided the XMP namespace in TIKA-908.

          Rating:
          A rating property is being used in almost all applications today, so this should be added:
          TikaCoreProperties.RATING <- XMP.RATING

          Metadata date:
          A lot of applications want to know whether only the metadata has changed but not the content; e.g., a movie application does not need to re-render the movie if only the title has changed. I recommend adding this property.
          TikaCoreProperties.METADATA_DATE <- XMP.METADATA_DATE
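
The movie example can be sketched as a comparison of timestamps. This is only an illustration under the assumption that an application tracks when it last rendered an asset; `RenderCheckSketch` and its method names are hypothetical and not part of Tika or XMP.

```java
import java.time.Instant;

// Hypothetical helper showing how a separate metadata date lets an
// application tell content changes apart from metadata-only changes.
final class RenderCheckSketch {

    // True when the content itself changed after the last render.
    static boolean needsRerender(Instant lastRendered, Instant contentModified) {
        return contentModified.isAfter(lastRendered);
    }

    // True when metadata changed after the last render but the content did not.
    static boolean metadataOnlyChange(Instant lastRendered,
                                      Instant contentModified,
                                      Instant metadataDate) {
        return metadataDate.isAfter(lastRendered)
                && !contentModified.isAfter(lastRendered);
    }
}
```

Without a distinct metadata date, a title edit would be indistinguishable from a content edit and would force an unnecessary render.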

          Geo coordinates:
          Almost all camera devices today use the EXIF namespace to capture geolocation information. I would recommend using the EXIF properties as the primary ones and the W3C ones as secondary, and renaming the Geographic interface to something like W3CGeographic.
          TikaCoreProperties.LATITUDE <- EXIF.GPS_LATITUDE, { W3CGeographic.LATITUDE }

          The same for Longitude and Altitude.
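
One practical difference between the two vocabularies is representation: EXIF GPS coordinates are stored as degrees, minutes and seconds plus a hemisphere reference, while the W3C geo vocabulary uses signed decimal degrees. A minimal conversion sketch follows; `GeoSketch` is a hypothetical helper, not an existing Tika class.

```java
// Hypothetical converter from EXIF-style degrees/minutes/seconds with a
// hemisphere reference to signed decimal degrees.
final class GeoSketch {

    // ref is 'N' or 'S' for latitude, 'E' or 'W' for longitude.
    // South and west are returned as negative values.
    static double toDecimalDegrees(double degrees, double minutes, double seconds, char ref) {
        double value = degrees + minutes / 60.0 + seconds / 3600.0;
        return (ref == 'S' || ref == 'W') ? -value : value;
    }
}
```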

          Copyright:
          In the future all needed core copyright properties should be added, but as this issue is about consolidating existing properties, this can be tracked in a follow-up issue.

          Ray Gauss II added a comment -

          I'm not sure what our policy is on using standards that aren't yet ratified, but I'm in favor of using the most generic standard out there.

          For all of these, the composite's secondary properties array is really just a means of providing backwards compatibility. Individual parsers can set multiple metadata properties with the same value if they desire.

          Individual property comments:

          Creator/Author:
          DublinCore seems to be a bit vague here, but I believe most users treat DublinCore.CREATOR as the creator of the file. Author is the creator of the intellectual property that the file represents. IPTC.CREATOR, which references DublinCore.CREATOR, does further define it as the IP creator. I think something that describes the IP creator should stay in TikaCoreProperties, distinct from the file creator, but naming it AUTHOR isn't as general as something like INTELLECTUAL_PROPERTY_CREATOR would be.

          I'm not sure INITIAL_AUTHOR and LAST_AUTHOR need to be included in TikaCoreProperties though. Those seem like something individual parsers should set.

          Creation date:
          If we go with the newer DC namespace then DublinCore.CREATED should be the primary for TikaCoreProperties.CREATION_DATE. Individual parsers can also set XMP.CREATE_DATE if they want and it doesn't need to be included here.

          Modification date:
          I was just trying to consolidate the naming convention. If everyone thinks 'modified' is more standard vocabulary, that's fine, but then TikaCoreProperties.CREATION_DATE should be TikaCoreProperties.CREATED. Individual parsers can also set XMP.MODIFY_DATE if they want, and it doesn't need to be included here.

          Creator tool:
          Sounds reasonable, and likely a common need.

          Rating:
          Sounds like a common need, though the externalReal and -1 or [0..5] definition of ratings in XMP.RATING may not be generic enough for inclusion here. I'd be interested to hear others' thoughts on this.
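
For reference, the constraint mentioned above (a real number that is either -1 or within [0..5]) can be sketched as a simple check. `RatingSketch` is a hypothetical helper written for this comment, not part of Tika.

```java
// Hypothetical validator for the rating range described above:
// either exactly -1, or a real value from 0 to 5 inclusive.
final class RatingSketch {

    static boolean isValidRating(double rating) {
        if (Double.isNaN(rating)) {
            return false;
        }
        return rating == -1.0 || (rating >= 0.0 && rating <= 5.0);
    }
}
```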

          Metadata date:
          Sounds reasonable, and likely a common need.

          Geo coordinates:
          W3C are the most generic and make sense for all file types. Individual parsers can also set EXIF properties. Renaming Geographic to W3CGeographic may make sense.

          Copyright:
          Agreed.

          Jörg Ehrlich added a comment -

          Some answers to Ray's comments:

          Creator:
          The DublinCore creator is usually considered the creator of the intellectual property, not the creator of the file. That is what the "creator tool" property is for. So we should stick with the "creator" property and not use "author" or any other additional key.

          Rating:
          I think we had better not use anything more generic here. The generic approaches taken in the past are the reason why we have this huge mess of incompatible applications today. There is a strong reason why the Metadata Working Group introduced this definition as it is. A lot of important applications understand and use this definition today. And didn't we say we wanted to use only something which is clearly defined?

          Geographic:
          Have you found any files or file types which are actually using the W3C approach to store geolocation data? All the ones I have seen so far use Exif to store it.

          Ray Gauss II added a comment -

          Creator:
          "The DublinCore creator is usually considered the creator of the intellectual property, not the creator of the file"

          If we're talking about developers that deal with metadata frequently, or librarians, taxonomists, etc., then I'd agree, but the average developer may not. I don't have any data to back that up, but I don't think we can assume everyone knows DublinCore.CREATOR should be used as the IP creator. As such, we should have separate properties, since knowing who created a file can be quite useful.

          Rating:
          I don't have a strong opinion here, but we should elicit conversation on it from the group. This should probably be a separate issue.

          Geographic:
          We're not really concerned with how the geo data is stored in the file, just how we want to present the metadata key and value to users, as generically as possible. This page seems to list several other formats that might conceivably be used with Tika: http://en.wikipedia.org/wiki/Geotagging

          Nick Burch added a comment -

          In terms of DCMI, we tend to take a pragmatic view on metadata standards. If it's good enough to be useful, and it won't confuse, use it! Try to keep things simple though, so don't include a whole standard just for the sake of it... But if it provides value then go for it.

          Jörg Ehrlich added a comment -

          Hi Ray and Nick,

          It is very important to also "educate" average developers to use the standards in the proper way. As I wrote for the Rating field: it is imperative to stick with standards, otherwise you risk sacrificing interoperability, which is one of the most important features for metadata.
          Regarding the Creator field: with IPTC and PLUS there are very strong and well-known standards to depict who created which part of an asset. I strongly recommend sticking with at least one of them instead of coming up with our own proprietary creator scheme which no one knows about.
          It's nice to be able to be pragmatic, but not using standards for metadata today causes a lot of headaches in the future.

          Regarding Geo data: I'm ok with using the W3C properties for the core properties.

          Ray Gauss II added a comment -

          Fixed in r1356560.

          This ended up being a fairly large commit. Feel free to revert or re-open this issue if I've messed something up.

          I've included the commit message here as it describes the majority of the changes:

          • Added the Dublin Core Terms namespace and prefix
          • Changed DublinCore.CREATOR to multi-valued property
          • Consolidated TikaCoreProperties.AUTHOR to TikaCoreProperties.CREATOR
          • Removed TikaCoreProperties.LAST_AUTHOR and added TikaCoreProperties.MODIFIER
          • Added DublinCore.CREATED
          • Consolidated TikaCoreProperties.DATE and TikaCoreProperties.CREATION_DATE to TikaCoreProperties.CREATED
          • Consolidated TikaCoreProperties.SAVE_DATE to TikaCoreProperties.MODIFIED
          • Updated DublinCore.MODIFIED to correct terms namespace
          • Added OfficeOpenXMLCore.SUBJECT
          • Consolidated TikaCoreProperties.SUBJECT to TikaCoreProperties.KEYWORDS
          • Added several temporary transition properties to TikaCoreProperties to ease migrating previous use of 'subject' to more specific properties and maintain backwards compatibility
          • For most mail-related parsers/handlers, transition subject to dc:title
          • For most office-related parsers/handlers, transition subject to OO cp:subject
          • Added TikaCoreProperties.CREATOR_TOOL
          • Added TikaCoreProperties.METADATA_DATE
          • Added TikaCoreProperties.RATING
          • Changed XMP to use common namespace delimiter
          • Added Open Office word processing namespace and prefix to OfficeOpenXMLExtended
          • Added OfficeOpenXMLExtended.COMMENTS
          • Added TikaCoreProperties.COMMENTS which is a composite of OfficeOpenXMLExtended.COMMENTS, ClimateForecast.COMMENT and MSOffice.COMMENTS
          • Deprecated MSOffice.COMMENTS
          • Changed OpenDocumentMetaParser to accommodate TikaCoreProperties since the XML it processes treats dc:date and dc:subject differently than DcXMLParser
          • Changed nextMetadata in TextExtractor to a Property rather than String key
          • Changed DcXMLParser to use namespace already defined in DublinCore
          • Updated parsers to reflect TikaCoreProperties changes
          • Updated tika-xmp to reflect TikaCoreProperties changes
          • Registered dcterms namespace in XMPMetadataTest
          • Updated tests to reflect new changes and added some tests for backwards compatibility
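
For anyone migrating code, the consolidations listed in the commit message can be summarized as a lookup table. The sketch below is only a reading aid built from the list above; `RenameTableSketch` is a hypothetical class, the keys are the bare TikaCoreProperties constant names, and it does not cover the temporary transition properties or the LAST_AUTHOR removal.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical lookup of consolidated TikaCoreProperties constant names,
// taken from the commit message above.
final class RenameTableSketch {
    static final Map<String, String> RENAMES = new LinkedHashMap<>();

    static {
        RENAMES.put("AUTHOR", "CREATOR");
        RENAMES.put("DATE", "CREATED");
        RENAMES.put("CREATION_DATE", "CREATED");
        RENAMES.put("SAVE_DATE", "MODIFIED");
        RENAMES.put("SUBJECT", "KEYWORDS");
    }

    // Returns the consolidated name, or the input unchanged when it was not renamed.
    static String currentName(String oldName) {
        return RENAMES.getOrDefault(oldName, oldName);
    }
}
```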

            People

            • Assignee:
              Unassigned
              Reporter:
              Ray Gauss II