  1. Tika
  2. TIKA-2761

XML Structured Text Is Missing Metadata Fields for mp3 files



    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 1.19.1
    • Fix Version/s: 2.0.0, 1.20
    • Component/s: metadata
    • Labels:
    • Environment:



      I am using the Tika 1.19 GUI to extract metadata from an .mp3 file. The sample rate is available and I am able to access it, but only as a string or as part of a JSON document. I am working in XML and would like to use an XML content handler. However, when the metadata is returned as 'structured text' (XML), the sample rate is not included. I have also tried Tika 1.19 in a Maven project and experimented with different content handlers, and the same issue occurs: I cannot get the sample rate returned in an XML document, although I can read it from the metadata object itself. Am I doing something wrong or misunderstanding something? Could this be an issue with the parser or the content handler that is used?
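      Since the values are readable from the metadata object, one possible stopgap is to write the name/value pairs out as XML directly. The following is only a sketch using the JDK's StAX writer: the class and method names are hypothetical, the map is a stand-in for Tika's metadata object, and none of this is Tika's own serialization code.

```java
import java.io.StringWriter;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

// Hypothetical helper, not part of Tika.
class MetadataXmlSketch {

    // Writes every name/value pair as <meta name="..." content="..."/> inside <head>.
    // The map is an assumed stand-in for a metadata object (name -> one or more values).
    static String toMetaXml(Map<String, List<String>> metadata) {
        StringWriter out = new StringWriter();
        try {
            XMLStreamWriter xml = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
            xml.writeStartElement("head");
            for (Map.Entry<String, List<String>> entry : metadata.entrySet()) {
                for (String value : entry.getValue()) {
                    xml.writeEmptyElement("meta");
                    xml.writeAttribute("name", entry.getKey());
                    xml.writeAttribute("content", value); // the writer escapes the value
                }
            }
            xml.writeEndElement();
            xml.flush();
            xml.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not serialize metadata", e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Two of the fields that are missing from the structured text view.
        Map<String, List<String>> metadata = new LinkedHashMap<>();
        metadata.put("samplerate", List.of("44100"));
        metadata.put("xmpDM:audioSampleRate", List.of("44100"));
        System.out.println(toMetaXml(metadata));
    }
}
```

      In a real project the map would be filled from the metadata object after parsing has finished, so that fields set late in the parse are included.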


      Tika 1.19 'Metadata' view (sample rate is available):


      Author: Glee Cast

      Content-Length: 8251946

      Content-Type: audio/mpeg

      X-Parsed-By: org.apache.tika.parser.DefaultParser

      X-Parsed-By: org.apache.tika.parser.mp3.Mp3Parser

      X-TIKA:digest:MD5: e0bdf3a0e171fca838604f9baad46612

      X-TIKA:digest:SHA256: ea1e4aa998f2c6e80139fa100c62fc1ee17652cf702cd484532b90183e7c5cc0

      channels: 2

      creator: Glee Cast

      dc:creator: Glee Cast

      dc:title: Rehab (Glee Cast Version)

      meta:author: Glee Cast

      resourceName: USQX90900223_A4_T7.mp3

      samplerate: 44100

      title: Rehab (Glee Cast Version)

      version: MPEG 3 Layer III Version 1

      xmpDM:album: Glee: The Music, The Complete Season One

      xmpDM:artist: Glee Cast

      xmpDM:audioChannelType: Stereo

      xmpDM:audioCompressor: MP3

      xmpDM:audioSampleRate: 44100

      xmpDM:duration: 206301.296875


      xmpDM:logComment: XXX -

      (P) 2009 Twentieth Century Fox Television - USQX90900223


      xmpDM:trackNumber: 4



      Tika 1.19 'Structured Text' view (no sample rate):


      <?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">


      <meta name="xmpDM:genre" content=""/>

      <meta name="X-Parsed-By" content="org.apache.tika.parser.DefaultParser"/>

      <meta name="X-Parsed-By" content="org.apache.tika.parser.mp3.Mp3Parser"/>

      <meta name="creator" content="Glee Cast"/>

      <meta name="xmpDM:album" content="Glee: The Music, The Complete Season One"/>

      <meta name="xmpDM:releaseDate" content=""/>

      <meta name="meta:author" content="Glee Cast"/>

      <meta name="xmpDM:artist" content="Glee Cast"/>

      <meta name="X-TIKA:digest:SHA256" content="ea1e4aa998f2c6e80139fa100c62fc1ee17652cf702cd484532b90183e7c5cc0"/>

      <meta name="dc:creator" content="Glee Cast"/>

      <meta name="xmpDM:audioCompressor" content="MP3"/>

      <meta name="resourceName" content="USQX90900223_A4_T7.mp3"/>

      <meta name="xmpDM:logComment" content="XXX - (P) 2009 Twentieth Century Fox Television - USQX90900223"/>

      <meta name="dc:title" content="Rehab (Glee Cast Version)"/>

      <meta name="Author" content="Glee Cast"/>

      <meta name="Content-Length" content="8251946"/>

      <meta name="X-TIKA:digest:MD5" content="e0bdf3a0e171fca838604f9baad46612"/>

      <meta name="Content-Type" content="audio/mpeg"/>

      <title>Rehab (Glee Cast Version)</title>


      <body><h1>Rehab (Glee Cast Version)</h1>

      <p>Glee Cast</p>

      <p>Glee: The Music, The Complete Season One, track 4</p>


      <p>XXX -  (P) 2009 Twentieth Century Fox Television - USQX90900223</p>



      Tika 1.19 Recursive JSON view (the sample rate is there):




      {
        "Author": "Glee Cast",
        "Content-Type": "audio/mpeg",
        "X-Parsed-By": [
          "org.apache.tika.parser.DefaultParser",
          "org.apache.tika.parser.mp3.Mp3Parser"
        ],
        "X-TIKA:content": "Rehab (Glee Cast Version)\nGlee Cast\nGlee: The Music, The Complete Season One, track 4\n206301.3\nXXX - \n(P) 2009 Twentieth Century Fox Television - USQX90900223\n",
        "X-TIKA:digest:MD5": "e0bdf3a0e171fca838604f9baad46612",
        "X-TIKA:digest:SHA256": "ea1e4aa998f2c6e80139fa100c62fc1ee17652cf702cd484532b90183e7c5cc0",
        "X-TIKA:parse_time_millis": "86",
        "channels": "2",
        "creator": "Glee Cast",
        "dc:creator": "Glee Cast",
        "dc:title": "Rehab (Glee Cast Version)",
        "meta:author": "Glee Cast",
        "samplerate": "44100",
        "title": "Rehab (Glee Cast Version)",
        "version": "MPEG 3 Layer III Version 1",
        "xmpDM:album": "Glee: The Music, The Complete Season One",
        "xmpDM:artist": "Glee Cast",
        "xmpDM:audioChannelType": "Stereo",
        "xmpDM:audioCompressor": "MP3",
        "xmpDM:audioSampleRate": "44100",
        "xmpDM:duration": "206301.296875",
        "xmpDM:genre": "",
        "xmpDM:logComment": "XXX - \n(P) 2009 Twentieth Century Fox Television - USQX90900223",
        "xmpDM:releaseDate": "",
        "xmpDM:trackNumber": "4"
      }
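      To see exactly which fields are affected, the names in the 'Metadata' view can be compared with the meta names in the 'Structured Text' view. The sketch below does that with a plain set difference; the class and method names are hypothetical, and the two lists are copied by hand from the output pasted above.

```java
import java.util.Collection;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical helper for comparing the two views; not part of Tika.
class MissingMetadataKeys {

    // Returns the names in `expected` that do not appear in `actual`, sorted.
    static SortedSet<String> missing(Collection<String> expected, Collection<String> actual) {
        SortedSet<String> result = new TreeSet<>(expected);
        result.removeAll(actual);
        return result;
    }

    public static void main(String[] args) {
        // Field names from the 'Metadata' view.
        List<String> metadataView = List.of(
            "Author", "Content-Length", "Content-Type", "X-Parsed-By",
            "X-TIKA:digest:MD5", "X-TIKA:digest:SHA256", "channels", "creator",
            "dc:creator", "dc:title", "meta:author", "resourceName", "samplerate",
            "title", "version", "xmpDM:album", "xmpDM:artist",
            "xmpDM:audioChannelType", "xmpDM:audioCompressor",
            "xmpDM:audioSampleRate", "xmpDM:duration", "xmpDM:logComment",
            "xmpDM:trackNumber");

        // <meta name="..."> values from the 'Structured Text' view.
        List<String> structuredTextView = List.of(
            "xmpDM:genre", "X-Parsed-By", "creator", "xmpDM:album",
            "xmpDM:releaseDate", "meta:author", "xmpDM:artist",
            "X-TIKA:digest:SHA256", "dc:creator", "xmpDM:audioCompressor",
            "resourceName", "xmpDM:logComment", "dc:title", "Author",
            "Content-Length", "X-TIKA:digest:MD5", "Content-Type");

        // Note: "title" is reported as missing here because it appears as a
        // <title> element in the XML rather than as a <meta> element.
        for (String name : missing(metadataView, structuredTextView)) {
            System.out.println(name);
        }
    }
}
```

      For the output above this lists channels, samplerate, title, version, xmpDM:audioChannelType, xmpDM:audioSampleRate, xmpDM:duration and xmpDM:trackNumber, so the sample rate is not the only field that is absent from the XML.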





            • Assignee:
              tallison Tim Allison
              nsincaglia Nick Sincaglia


              • Created: