  1. Tika
  2. TIKA-1310

Parse error - fb:admins property

Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 1.5
    • Fix Version/s: None
    • Component/s: metadata
    • Labels: None
    • Environment: java version "1.7.0_60-ea"
      Java(TM) SE Runtime Environment (build 1.7.0_60-ea-b01)
      Java HotSpot(TM) 64-Bit Server VM (build 24.60-b03, mixed mode)

    Description

      Steps to reproduce the problem:

      1) Download the HTML file:
      curl --output default.html "http://techcrunch.com/2014/05/01/snapchat-adds-text-chat-and-video-calls/"

      2) Extract the metadata
      java -jar tika-app-1.5.jar --json default.html --encoding=UTF-8 > metadata.json

      The "fb:admins" property is written in a way that prevents the resulting JSON file from being parsed.

      Attachments

        1. cli_json_test.patch
          0.7 kB
          Tyler Bui-Palsulich
        2. multi_metadata_expected.json
          0.2 kB
          Tyler Bui-Palsulich
        3. multi_metadata_output.json
          0.2 kB
          Tyler Bui-Palsulich
        4. multi_valued_test.html
          0.1 kB
          Tyler Bui-Palsulich

          Activity

            vitormil Vitor Oliveira added a comment -

            Awesome! Thank you all

            hudson Hudson added a comment -

            SUCCESS: Integrated in tika-trunk-jdk1.6 #7 (See https://builds.apache.org/job/tika-trunk-jdk1.6/7/)
            TIKA-1291/TIKA-1310 fix bug in JSON output from CLI (tallison: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1598023)

            • /tika/trunk/CHANGES.txt
            • /tika/trunk/tika-app/src/main/java/org/apache/tika/cli/TikaCLI.java
            • /tika/trunk/tika-app/src/main/java/org/apache/tika/io
            • /tika/trunk/tika-app/src/main/java/org/apache/tika/io/json
            • /tika/trunk/tika-app/src/main/java/org/apache/tika/io/json/JsonMetadataSerializer.java
            • /tika/trunk/tika-app/src/test/java/org/apache/tika/cli/TikaCLITest.java
            • /tika/trunk/tika-app/src/test/resources/test-data/testJsonMultipleInts.html
            hudson Hudson added a comment -

            SUCCESS: Integrated in tika-trunk-jdk1.7 #7 (See https://builds.apache.org/job/tika-trunk-jdk1.7/7/)
            TIKA-1291/TIKA-1310 fix bug in JSON output from CLI (tallison: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1598023)

            • /tika/trunk/CHANGES.txt
            • /tika/trunk/tika-app/src/main/java/org/apache/tika/cli/TikaCLI.java
            • /tika/trunk/tika-app/src/main/java/org/apache/tika/io
            • /tika/trunk/tika-app/src/main/java/org/apache/tika/io/json
            • /tika/trunk/tika-app/src/main/java/org/apache/tika/io/json/JsonMetadataSerializer.java
            • /tika/trunk/tika-app/src/test/java/org/apache/tika/cli/TikaCLITest.java
            • /tika/trunk/tika-app/src/test/resources/test-data/testJsonMultipleInts.html
            tallison Tim Allison added a comment -

            r1598023. Let me know if there are any surprises.

            tallison Tim Allison added a comment -

            tpalsulich, thank you for attaching the files.

            I think the correct json should be "multiple_values":"1,2,3,4"

            I don't think that our metadata parsers should be responsible for recognizing a comma-delimited value as multiple values. Our json output, however, should properly put double-quotes around a value that contains numerals and commas, which it currently doesn't.

            Fix on the way. Thank you again, tpalsulich, vitormil and Steffen (over on TIKA-1291).
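
            The quoting rule described in this comment can be sketched as follows. This is only an illustration under stated assumptions, not the code committed in r1598023: the class name JsonMetadataSketch and its methods quote and toJson are hypothetical, and the sketch assumes every metadata value is emitted as a JSON string, with several values for one name emitted as an array of strings.

            ```java
            import java.util.Arrays;
            import java.util.LinkedHashMap;
            import java.util.List;
            import java.util.Map;

            // Hypothetical sketch: every metadata value is written as a quoted JSON string.
            public class JsonMetadataSketch {

                /** Escapes a raw string and wraps it in double quotes. */
                public static String quote(String s) {
                    StringBuilder sb = new StringBuilder("\"");
                    for (char c : s.toCharArray()) {
                        switch (c) {
                            case '"':  sb.append("\\\""); break;
                            case '\\': sb.append("\\\\"); break;
                            case '\n': sb.append("\\n"); break;
                            case '\r': sb.append("\\r"); break;
                            case '\t': sb.append("\\t"); break;
                            default:
                                if (c < 0x20) {
                                    // Remaining control characters need a four-digit hex escape.
                                    sb.append(String.format("\\u%04x", (int) c));
                                } else {
                                    sb.append(c);
                                }
                        }
                    }
                    return sb.append('"').toString();
                }

                /** One value becomes a string; several values become an array of strings. */
                public static String toJson(Map<String, List<String>> metadata) {
                    StringBuilder sb = new StringBuilder("{");
                    boolean first = true;
                    for (Map.Entry<String, List<String>> e : metadata.entrySet()) {
                        if (!first) {
                            sb.append(',');
                        }
                        first = false;
                        sb.append(quote(e.getKey())).append(':');
                        List<String> values = e.getValue();
                        if (values.size() == 1) {
                            sb.append(quote(values.get(0)));
                        } else {
                            sb.append('[');
                            for (int i = 0; i < values.size(); i++) {
                                if (i > 0) {
                                    sb.append(',');
                                }
                                sb.append(quote(values.get(i)));
                            }
                            sb.append(']');
                        }
                    }
                    return sb.append('}').toString();
                }

                public static void main(String[] args) {
                    Map<String, List<String>> metadata = new LinkedHashMap<>();
                    metadata.put("multiple_values", Arrays.asList("1,2,3,4"));
                    // Prints {"multiple_values":"1,2,3,4"}
                    System.out.println(toJson(metadata));
                }
            }
            ```

            Quoting unconditionally avoids having to decide whether a value "looks numeric", at the cost of never emitting bare JSON numbers.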


            chrismattmann Chris A. Mattmann added a comment -

            Thanks for attaching these, tpalsulich.

            tpalsulich Tyler Bui-Palsulich added a comment -

            This also affects the current revision. The issue is that <meta property="multiple_values" content="1,2,3,4" /> is parsed into "multiple_values":1,2,3,4. I'm pretty sure (correct me if I'm wrong) the proper response should be "multiple_values":[1,2,3,4]. The JSON formatter at org.apache.tika.cli.TikaCLI.NoDocumentJSONMetHandler should handle multiple values, but there are two issues: the JSON option of the CLI is not tested at all, and 1,2,3,4 is parsed as a single value, so no brackets are printed around the list.

            I attached a simple HTML file with this issue, the current JSON output, the expected JSON output, and a patch with a (currently failing) unit test.
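
            One plausible way output like "multiple_values":1,2,3,4 can arise is a lenient "is this a number?" check. The sketch below is an assumption for illustration only: this issue does not show the TikaCLI code, and the class NumericCheckSketch and its methods are hypothetical. It contrasts java.text.NumberFormat, which accepts any text that merely starts with a number, with java.math.BigDecimal, which rejects text that is not a number as a whole.

            ```java
            import java.math.BigDecimal;
            import java.text.NumberFormat;
            import java.text.ParseException;
            import java.util.Locale;

            // Hypothetical sketch of how a lenient numeric check yields an unquoted value.
            public class NumericCheckSketch {

                /** Lenient: true when NumberFormat can parse a number from the start of the text. */
                public static boolean lenientLooksNumeric(String value) {
                    try {
                        NumberFormat.getInstance(Locale.US).parse(value);
                        return true;
                    } catch (ParseException e) {
                        return false;
                    }
                }

                /** Strict: true only when the whole text is accepted by the BigDecimal constructor. */
                public static boolean strictLooksNumeric(String value) {
                    try {
                        new BigDecimal(value);
                        return true;
                    } catch (NumberFormatException e) {
                        return false;
                    }
                }

                /** Builds a "name":value pair; no escaping is done, this is illustration only. */
                public static String pair(String name, String value, boolean numeric) {
                    return "\"" + name + "\":" + (numeric ? value : "\"" + value + "\"");
                }

                public static void main(String[] args) {
                    String value = "1,2,3,4";
                    // The lenient check accepts the text, so the value is left unquoted.
                    System.out.println(pair("multiple_values", value, lenientLooksNumeric(value)));
                    // The strict check rejects it, so the value is quoted.
                    System.out.println(pair("multiple_values", value, strictLooksNumeric(value)));
                }
            }
            ```

            Note that BigDecimal accepts some forms, such as a leading plus sign, that the JSON number grammar does not, so the strict check here is closer to correct but still not a JSON validator.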
            vitormil Vitor Oliveira added a comment -

            metadata.json: https://gist.github.com/vitormil/baea93ea4a9d2ab2e138

            People

              Assignee: tallison Tim Allison
              Reporter: vitormil Vitor Oliveira
              Votes: 0
              Watchers: 5
