Details

    • Type: Sub-task
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.8.0-incubator
    • Component/s: None
    • Labels: None

      Description

      One issue raised by the license review (PDFBOX-366) is the status of the various test PDF files included in the test directory. Many of these don't seem to come with a license that would permit redistribution within an Apache project, so our only option seems to be to remove or replace the files before we can make the first Apache release.

      The full list of potentially troublesome test files is below (I haven't looked at all of these in detail, so some might be OK for us to keep):

      $ find test -name '*.pdf'
      test/encryption/encrypted_doc_no_id.pdf
      test/input/10101-AR.pdf
      test/input/601501018.pdf
      test/input/Exolab.pdf
      test/input/FreedomExpressions.pdf
      test/input/Garcia2003b__Correlative_exploration_of_EEG_Signals.pdf
      test/input/Garcia2004_thesis.pdf
      test/input/Hd301212.pdf
      test/input/JavaMail-1.2.pdf
      test/input/Liste732004001452_001_0.pdf_0_.pdf
      test/input/Michel2001__Review_p2_structured.pdf
      test/input/News-Oct-2001-RUS.pdf
      test/input/OLS2000-rsync.pdf
      test/input/OSP_framework.pdf
      test/input/SphericalHomeomorphism.pdf
      test/input/T05140.pdf
      test/input/TEST_SetCharSpacing_Error.pdf
      test/input/amyuni2_05d__pdf1_3_acro4x.pdf
      test/input/authentication.pdf
      test/input/c21-5916 .pdf
      test/input/citi-tr-00-4.ps.gz.pdf
      test/input/connection_pool.pdf
      test/input/cweb.pdf
      test/input/data-000001.pdf
      test/input/defensive_driving_class_schedule.pdf
      test/input/ekb_deutsch.pdf
      test/input/emsv4a4.pdf
      test/input/fdeb.pdf
      test/input/frweb-f-332-18.pdf
      test/input/hexnumberproblem.pdf
      test/input/irs tax guide for small businesses.pdf
      test/input/jose-lugo-test.pdf
      test/input/jun2003.pdf
      test/input/null_thread_bead.pdf
      test/input/ocalc.pdf
      test/input/openoffice-test-document.pdf
      test/input/org.eclipse.platform.doc.isv.pdf
      test/input/pdf_with_lots_of_fields.pdf
      test/input/rc5.pdf
      test/input/reservedparkingsalaryreductionauthorization.pdf
      test/input/ruminations.pdf
      test/input/sampleForSpec.pdf
      test/input/sample_fonts_solidconvertor.pdf
      test/input/sha256.pdf
      test/input/simple-openoffice.pdf
      test/input/surface_interpolation.pdf
      test/input/tech_report.pdf
      test/input/terms_and_conditions_book.pdf
      test/input/test_rotate_270.pdf
      test/input/warp.pdf
      test/input/welcome.pdf
      test/input/whats_new.pdf
      test/input/yaddatest.pdf
      test/pdfparser/genko_oc_shiryo1.pdf
      test/pdfreader/debug.xml.pdf
      test/pdfreader/excel.pdf
      test/pdfreader/ollix_test_2005-03-11_bin.pdf
      test/pdfreader/pdfbox_webpage.pdf

      My suggestion is that (in line with PDFBOX-368) we create a new src/test/resources directory where we move all reviewed and accepted test cases. Once all these files have been reviewed, we just drop the remaining ones for which an acceptable license could not be found. It would be nice if replacements could be created for such test cases, but in some cases (special PDF constructs, etc.) that might be a bit troublesome so I guess we'll just need to live with some reduction in test coverage due to this.

      For more background, see the discussions at http://markmail.org/message/z7meilylwifef7db and http://markmail.org/message/cuyylr6zqs4fwdiz.

          Activity

          jukkaz Jukka Zitting added a comment -

          I quickly browsed through the test files, and only the following look like something that I'd feel comfortable redistributing within an Apache project:

          test/input/cweb.pdf
          test/input/data-000001.pdf
          test/input/Liste732004001452_001_0.pdf_0_.pdf
          test/input/openoffice-test-document.pdf
          test/input/sample_fonts_solidconvertor.pdf
          test/input/sampleForSpec.pdf
          test/input/simple-openoffice.pdf
          test/input/yaddatest.pdf
          test/pdfreader/debug.xml.pdf
          test/pdfreader/excel.pdf
          test/pdfreader/ollix_test_2005-03-11_bin.pdf
          test/pdfreader/pdfbox_webpage.pdf

          Note that there is a clear distinction between using and redistributing something. We could still come up with a way to use the test suite in our Hudson CI build and individually by each developer, but we probably can't keep the documents in svn and we definitely can't release them as a part of PDFBox.

          carrier Brian Carrier added a comment -

          The trunk now supports a feature for "external" test files to be stored in the input-ext directory. If the test suite finds that directory, it will process its contents:

          Sending build.xml
          Sending src/test/java/org/apache/pdfbox/util/TestTextStripper.java
          Transmitting file data ..
          Committed revision 742644.

          Now we need an automated way to populate the 'input-ext' directory with the files that were removed.
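
          The optional-directory behaviour described above can be sketched roughly as follows. This is only an illustration, not the code committed to TestTextStripper: the class name, the method names, and the directory paths `test/input` and `test/input-ext` are assumptions made for the example.

          ```java
          import java.io.File;
          import java.util.ArrayList;
          import java.util.Arrays;
          import java.util.List;

          // Hypothetical helper, not part of PDFBox.
          public class ExternalTestFiles {

              // Returns the PDF files in dir sorted by name, or an empty list
              // if dir does not exist or is not a directory.
              static List<File> pdfsIn(File dir) {
                  List<File> result = new ArrayList<>();
                  File[] found = dir.listFiles((d, name) -> name.toLowerCase().endsWith(".pdf"));
                  if (found != null) { // listFiles returns null for a missing directory
                      Arrays.sort(found);
                      result.addAll(Arrays.asList(found));
                  }
                  return result;
              }

              // Combines the bundled test files with the external ones, if any.
              static List<File> collect(File inputDir, File externalDir) {
                  List<File> all = new ArrayList<>(pdfsIn(inputDir));
                  all.addAll(pdfsIn(externalDir));
                  return all;
              }

              public static void main(String[] args) {
                  // Assumed locations; adjust to the actual project layout.
                  File input = new File("test/input");
                  File external = new File("test/input-ext");
                  List<File> files = collect(input, external);
                  System.out.println(files.size() + " test file(s) found");
              }
          }
          ```

          With this shape, a checkout without the external directory still runs the tests for the bundled files, and the external files are picked up as soon as the directory appears.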

          danielwilson Daniel Wilson added a comment -

          What about files that are attached to issues? Those, in my opinion, form some of the most valuable test cases.

          Additionally, I have received specific permission from the owner to attach a couple of the test files.

          I'm seeing our entire rendering test knocked out here. I understand there can be legal issues, but the quality of our development will surely drop if we can't test an entire area like that.

          lehmi Andreas Lehmkühler added a comment -

          As Jukka already stated in his comment, we have to remove the troublesome test files from svn, and we can't release them as part of PDFBox, but of course we can still use them in our own testing. We have to place them somewhere else (perhaps as a zip in the Maven repository?) and modify the build process to fetch these files automatically and use them in our test suite.

          I think the question about the files attached to issues is quite a difficult one. There are many of them, and some of the issue creators allow us to redistribute the files by ticking the "grant" checkbox. But I'm afraid that some of these people aren't in a position to give us that permission because they aren't the authors of the documents, e.g. PDFBOX-450. In the end we have to double-check the attached documents before we add them to the "official" test cases.

          lehmi Andreas Lehmkühler added a comment -

          Now that the CMap files are on their way to the Maven repository, the last question is where to put the test files that can no longer be kept in svn.
          Is it OK to put them on the PDFBox homepage? Or is that too "official"? As an alternative, we could put them on someone's homepage on people.a.o, couldn't we?

          Any ideas, suggestions, objections?

          jukkaz Jukka Zitting added a comment -

          We could simply attach the tests here in Jira and point developers to get them from here.

          lehmi Andreas Lehmkühler added a comment -

          With revision 793340 I've added support for processing an "external" test file directory named input-ext, as is already available for TestTextStripper.

          lehmi Andreas Lehmkühler added a comment -

          With revision 793349 I've removed all the test files in question. They will be downloaded automatically from PDFBOX-492 if the test cases need them.

          lehmi Andreas Lehmkühler added a comment -

          With revision 793461 I've removed the encryption test files I forgot yesterday. They are attached to PDFBOX-492 too.

          jukkaz Jukka Zitting added a comment -

          Re: automatically downloaded

          It would be better if the user had to explicitly request these test files by running "ant get.testfiles" before building the project. If the user didn't do that, then the relevant tests would simply not run.

          The licensing of these files is quite unclear, so I'd prefer that people explicitly choose to download them rather than have PDFBox fetch them automatically as part of the normal build process.
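
          A minimal sketch of the "simply not run" behaviour, assuming a test can be wrapped in a guard that checks for its input file first. The class and method names here are hypothetical and do not come from the PDFBox test suite, and the sketch does not implement the "ant get.testfiles" target itself.

          ```java
          import java.io.File;

          // Hypothetical guard, not part of PDFBox.
          public class OptionalTestInput {

              // Runs the test only if its input file is present.
              // Returns true if the test ran, false if it was skipped.
              static boolean runIfPresent(File input, Runnable test) {
                  if (!input.isFile()) {
                      System.out.println("Skipping: " + input.getPath()
                              + " not found (fetch the external test files first)");
                      return false;
                  }
                  test.run();
                  return true;
              }

              public static void main(String[] args) {
                  // Assumed path, for illustration only.
                  File input = new File("test/input-ext/sample.pdf");
                  boolean ran = runIfPresent(input,
                          () -> System.out.println("Testing " + input.getName()));
                  System.out.println(ran ? "Test executed" : "Test skipped");
              }
          }
          ```

          The point of the guard is that a missing file is reported and skipped rather than treated as a failure, so a build without the optional files still passes.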

          lehmi Andreas Lehmkühler added a comment -

          From that point of view I agree with you. I'll change that behaviour. But first of all some of the smaller tests (FDF and encryption) have to be adjusted. Both expect their files to be there. I'll make that optional too.

          lehmi Andreas Lehmkühler added a comment -

          I've removed the automatic download, and the input files for TestFDF are now optional. It seems that the encryption test is never called.


            People

            • Assignee: Unassigned
            • Reporter: jukkaz Jukka Zitting
            • Votes: 0
            • Watchers: 0
