Tika / TIKA-793

Invalid ASCII character (65533) when retrieving MP3 metadata

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 1.0
    • Fix Version/s: 1.1
    • Component/s: metadata, parser
    • Labels:
      None
    • Environment:

      Ubuntu 10.04 (x64), Android (2.2 +)

      Description

      When extracting metadata from certain MP3s (the ID3 version appears to be 2.4), I'm seeing invalid characters at the end of the parsed fields. For example:

      American M�

      which should be:

      American Me
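
The character code 65533 in the title is U+FFFD, the Unicode replacement character, which Java substitutes for bytes it cannot decode. As a minimal sketch of one way such output can arise (an assumption for illustration, not a trace of Tika's code), decoding a UTF-16 field that has lost its final byte reproduces the symptom. The class and method names here are hypothetical.

```java
import java.nio.charset.StandardCharsets;

final class ReplacementCharDemo {
    /** Decodes the first {@code length} bytes as UTF-16LE; malformed input becomes U+FFFD. */
    static String decodeUtf16Le(byte[] data, int length) {
        return new String(data, 0, length, StandardCharsets.UTF_16LE);
    }

    public static void main(String[] args) {
        // 11 characters, 2 bytes each: 22 bytes in total.
        byte[] full = "American Me".getBytes(StandardCharsets.UTF_16LE);

        // Drop the final byte (the 0x00 high byte of 'e'), leaving an odd byte count.
        String damaged = decodeUtf16Le(full, full.length - 1);

        int last = damaged.charAt(damaged.length() - 1);
        System.out.println(last); // prints 65533
    }
}
```

The first ten characters decode normally; only the dangling single byte at the end is replaced, which matches the "invalid character at the end of the field" pattern reported here.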

      1. TikaTest.java
        2 kB
        William Seemann

        Activity

        William Seemann added a comment -

        The code I'm using to test

        William Seemann added a comment -

        It's also worth noting that I see this issue in almost all of the MP3s I've downloaded from Amazon.com.

        Nick Burch added a comment -

        I've managed to reproduce this on one of my Amazon MP3s, and will use that file to test a fix when I have a chance.

        Nick Burch added a comment -

        I've tracked this to two bugs, both related to the handling of UTF-16 encoded strings.

        The first, a problem in the null-termination stripping, is fixed in r1224865.

        The second is in the handling of the COMM (Comment) tag, which contains both a language and text. We don't currently support the language being encoded differently from the text; that remains to be fixed (and really needs a test file too).
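
To illustrate why null-termination stripping is sensitive to UTF-16, here is a hedged sketch contrasting a byte-wise strip with one that works in whole code units. It is not the code from r1224865; the class and method names are hypothetical, and the assumption is that the field is little-endian text followed by a two-byte terminator.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class TerminatorStripping {
    /** Naive version: drops every trailing zero byte, regardless of encoding. */
    static String stripBytewise(byte[] data, Charset charset) {
        int end = data.length;
        while (end > 0 && data[end - 1] == 0) {
            end--;
        }
        return new String(data, 0, end, charset);
    }

    /** Encoding-aware version: drops trailing terminators one whole code unit at a time. */
    static String stripByCodeUnit(byte[] data, Charset charset) {
        int unit = charset.name().startsWith("UTF-16") ? 2 : 1;
        int end = data.length;
        while (end >= unit && isZeroUnit(data, end - unit, unit)) {
            end -= unit;
        }
        return new String(data, 0, end, charset);
    }

    private static boolean isZeroUnit(byte[] data, int offset, int unit) {
        for (int i = 0; i < unit; i++) {
            if (data[offset + i] != 0) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] text = "American Me".getBytes(StandardCharsets.UTF_16LE);
        // copyOf pads with zeros, which appends the 0x00 0x00 terminator.
        byte[] field = Arrays.copyOf(text, text.length + 2);

        // The byte-wise strip also removes the 0x00 high byte of the final 'e'.
        System.out.println(stripBytewise(field, StandardCharsets.UTF_16LE));
        // The code-unit strip removes only the terminator.
        System.out.println(stripByCodeUnit(field, StandardCharsets.UTF_16LE));
    }
}
```

With Latin text in UTF-16LE, every character ends in a 0x00 byte, so the byte-wise loop cannot tell the terminator apart from the last character's high byte.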

        William Seemann added a comment -

        Nick, thanks for the prompt fix. Keep up the good work.

        Nick Burch added a comment -

        Comment (COM/COMM) tag handling is fixed in r1225480. The tag uses a different format from the other text tags, so it needs explicit, encoding-aware handling of its separate parts.
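
For readers unfamiliar with the frame, the ID3v2.3/2.4 COMM body is laid out as one text-encoding byte, a three-byte language code, a terminated short description, and then the comment text. Only the description and text use the declared encoding. The sketch below parses that layout under those assumptions. It is an illustration, not the code from r1225480, and `CommFrameSketch`, `Comment`, `charsetFor` and `parse` are hypothetical names.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class CommFrameSketch {
    /** Simple holder for the three parts of a comment frame. */
    static final class Comment {
        final String language;
        final String description;
        final String text;

        Comment(String language, String description, String text) {
            this.language = language;
            this.description = description;
            this.text = text;
        }
    }

    /** Maps the ID3v2 text-encoding byte to a Java charset (2 and 3 are v2.4 only). */
    static Charset charsetFor(int encoding) {
        switch (encoding) {
            case 0: return StandardCharsets.ISO_8859_1;
            case 1: return StandardCharsets.UTF_16;   // byte order taken from the BOM
            case 2: return StandardCharsets.UTF_16BE;
            case 3: return StandardCharsets.UTF_8;
            default: throw new IllegalArgumentException("Unknown encoding byte: " + encoding);
        }
    }

    /** Parses a COMM frame body (frame header already removed). */
    static Comment parse(byte[] body) {
        if (body.length < 4) {
            throw new IllegalArgumentException("COMM body too short");
        }
        int encoding = body[0] & 0xFF;
        Charset charset = charsetFor(encoding);
        int unit = (encoding == 1 || encoding == 2) ? 2 : 1; // terminator width in bytes

        // The language code is always three single-byte characters.
        String language = new String(body, 1, 3, StandardCharsets.ISO_8859_1);

        // The description runs up to the first terminator in the declared encoding.
        int descEnd = 4;
        while (descEnd + unit <= body.length && !isZeroUnit(body, descEnd, unit)) {
            descEnd += unit;
        }
        String description = new String(body, 4, descEnd - 4, charset);

        // The text follows the terminator; drop any trailing terminators in whole units.
        int textStart = Math.min(descEnd + unit, body.length);
        int end = body.length;
        while (end - unit >= textStart && isZeroUnit(body, end - unit, unit)) {
            end -= unit;
        }
        String text = new String(body, textStart, end - textStart, charset);
        return new Comment(language, description, text);
    }

    private static boolean isZeroUnit(byte[] data, int offset, int unit) {
        for (int i = 0; i < unit; i++) {
            if (data[offset + i] != 0) {
                return false;
            }
        }
        return true;
    }
}
```

Decoding the language separately from the description and text is the point of the sketch: treating the whole body as one string in the declared encoding would misread the three language bytes whenever that encoding is UTF-16.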


          People

          • Assignee:
            Unassigned
            Reporter:
            William Seemann