Apache Any23 / ANY23-37

LGPL'ed components cannot be included in distribution packages

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 0.7.0
    • Fix Version/s: 0.7.0
    • Component/s: None
    • Labels: None

      Description

      While reviewing dependency licenses, I noticed that the transitive dependency it.unimi.dsi:dsiutils:2.0.1 is released under the LGPL, so it cannot be included in the non-Maven binary archives.
      A first workaround could be to keep it out of the archives and mention this in the README.

      1. ANY23-37.patch (4 kB, Lewis John McGibbney)
      2. ANY23-37-v2.patch (7 kB, Lewis John McGibbney)
      3. ANY23-37-v3.patch (8 kB, Lewis John McGibbney)
      4. package.txt (5 kB, Lewis John McGibbney)

          Activity

          Lewis John McGibbney added a comment -

          Bulk close for 0.7.0-incubating release

          Lewis John McGibbney added a comment -

          Can we close?

          Simone Tripodi added a comment -

          Thanks for noticing, Andy! Going to drop it right now!

          Andy Seaborne added a comment -

          The maven dependency tree does not show dsiutils.

          The mention is in plugins/basic-crawler/src/main/assembly/bin.xml as an exclusion, and the warning is because the exclusion is triggered - double evidence it's not in the dependency tree now.

          The exclusion can be removed.

          <dependencySets>
            <dependencySet>
              <useProjectArtifact>true</useProjectArtifact>
              <outputDirectory>/lib</outputDirectory>
              <excludes>
                <!-- dsiutils is LGPLed -->
                <exclude>it.unimi.dsi:dsiutils</exclude> ******
                <!-- already provided from Any23 -->
                <exclude>org.slf4j:*</exclude>
              </excludes>
            </dependencySet>
          </dependencySets>

          Lewis John McGibbney added a comment -

          File attachment showing some DSIutils output. Seems as if there is still something we need to address!

          Lewis John McGibbney added a comment -

          This is a good step forward, Michele. I was pleased to see this getting dealt with today.

          When I run

          $ cd ANY23_HOME/plugins/basic-crawler
          $ mvn clean package
          

          OR

          $ mvn clean assembly:assembly
          

          I'm still getting mentions of DSIutils...
          I've attached my log output for you to see.

          I'm also not getting very far with running the crawler tool from CLI :|
          Before we close this off, it would be great to test that this is actually working OK.

          Michele Mostarda added a comment -

          Fixed @ r1296216.

          The basic-crawler module now uses crawler4j 3.3, which no longer depends on it.unimi.dsi:dsiutils:2.0.1.

          Please review.

          Lewis John McGibbney added a comment -

          Another update to the patch. Unfortunately this is still not working: there is some work to be done on configuring the constructors for the CrawlConfig and CrawlController classes in Crawler4j. It is starting to annoy me and it's getting late, so I am going to call it a day at this stage and maybe pick it up tomorrow or some other time.

          Andy Seaborne added a comment -

          Crawler.java compiles after that (import edu.uci.ics.crawler4j.parser.HtmlParseData), but then SiteCrawler.java does not compile.

          "variable controller not initialized" and I see in the patch:

          • controller = new CrawlController( storageFolder.getAbsolutePath() );

          yet controller is final which should be a compile time error in the first place.

          Sorry I can't be more specific (I'm using maven to compile, so this is a somewhat inefficient workflow).
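
The final-field complaint above can be illustrated with a minimal, self-contained Java sketch. ControllerStandIn and SiteCrawlerSketch are hypothetical stand-ins, not the crawler4j or Any23 classes, so treat this as an assumption-laden illustration rather than the actual SiteCrawler code. The only point is that a final field must be assigned in every constructor, so a patch that drops the assignment leaves javac with an uninitialised final field.

```java
import java.io.File;

// Hypothetical stand-in for crawler4j's CrawlController, so the sketch compiles on its own.
final class ControllerStandIn {
    private final String storagePath;

    ControllerStandIn(String storagePath) {
        this.storagePath = storagePath;
    }

    String storagePath() {
        return storagePath;
    }
}

// Hypothetical stand-in for SiteCrawler; only the field initialisation matters here.
final class SiteCrawlerSketch {
    // A final field must be definitely assigned exactly once by the end of every constructor.
    private final ControllerStandIn controller;

    SiteCrawlerSketch(File storageFolder) {
        // Deleting this assignment makes javac reject the class,
        // because 'controller' would never be initialised.
        controller = new ControllerStandIn(storageFolder.getAbsolutePath());
    }

    String storagePath() {
        return controller.storagePath();
    }
}
```

If the controller has to be built differently for a newer crawler4j, the assignment would need to be replaced rather than removed, or the field would need to stop being final.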

          Lewis John McGibbney added a comment - - edited

          OK, so this patch also removes the DSIutils and fastutil libraries from the basic-crawler pom.xml.

          The compile-time error will still be there, because getHTML() is no longer available on Page in the newer version of Crawler4j.
          Around lines 89-98 of Crawler.java [0], instead of calling page.getHTML() (line 96), we should use something like:

          // Only HTML pages carry markup to extract from.
          if (page.getParseData() instanceof HtmlParseData) {
              HtmlParseData htmlParseData = (HtmlParseData) page.getParseData();
              String html = htmlParseData.getHtml();

              Crawler.super.performExtraction(
                  new StringDocumentSource(html, pageURL)
              );
          }
          

          I got totally sidetracked from this after last weekend, so apologies for the half-baked patch. More details on this can be seen at [1].

          [0] https://svn.apache.org/viewvc/incubator/any23/trunk/plugins/basic-crawler/src/main/java/org/apache/any23/cli/Crawler.java?view=markup
          [1] http://code.google.com/p/crawler4j/

          Andy Seaborne added a comment -

          I checked out any23/trunk and applied the patch.

          I got a compile error in basic-crawler:
          ------------------------------------------------------------
          [INFO] Compilation failure
          Crawler.java:[93,48] cannot find symbol
          symbol : method getHTML()
          location: class edu.uci.ics.crawler4j.crawler.Page
          ------------------------------------------------------------

          The dependency on it.unimi.dsi:dsiutils is still in the basic-crawler pom.xml, but when I removed dsiutils and fastutil as explicit dependencies and used the crawler4j 3.3 POM for its dependencies, mvn dependency:tree on basic-crawler showed neither dsiutils nor fastutil as a dependency.

          Still had the same compile time error.

          Lewis John McGibbney added a comment -

          Does anyone feel like having a crack at this one? If you apply the patch on this issue, there should be little left to do to get this working with Crawler4j upgraded and the offending library removed. Thank you if anyone can pick this up.

          Lewis John McGibbney added a comment -

          This patch merely upgrades the crawler4j library in the plugin's Maven pom.xml, then updates some class references which were causing compile errors. There is one error I am aware of which I can't fix just now because I don't know enough about the crawler4j changes. crawler4j is hosted on GitHub here [1].

          I've also not tried removing the offending LGPL library yet so this really is a stab in the dark.

          [1] https://github.com/yasserg/crawler4j/

          Andy Seaborne added a comment -

          crawler4j v3.0 and v3.1 are available in the central repo:

          http://repo1.maven.org/maven2/edu/uci/ics/crawler4j/

          Its dependencies appear to be under Apache, BDB-JE and JUnit licences.

          Andy Seaborne added a comment -

          I removed it ... and the code compiles but tests fail

          crawler4j seems to need it but there's no declared dependency. Didn't pursue very deeply.

          Getting a release done is a big deal.

          Is there a useful subset of any23 that can be released without basic-crawler? Doing a release of something-not-everything is still worth doing for the process.

          I'll take this to any23-dev.

          Paolo Castagna added a comment -

          FYI: I asked Sebastiano and he has no plans or intention to change the dsiutils license from LGPL to ASL.
          So the only option left is to remove the dsiutils dependency. What the shortest path to do that is, I have no idea.

          Lewis John McGibbney added a comment -

          Yes I agree with your comments Paolo, but I personally need to be realistic with myself and the Any23 community and can state that it might not be feasible for me to put in the time just now to get the basic-crawler package migrated to o.a.n. Additionally this would certainly be of secondary importance to getting our first 0.7.0-incubating release out of the door. I would be very happy to work on ANY23-47 once this major milestone has been achieved. I'll get back on this issue when I've spoken with Sebastiano, thank you for providing the details Paolo.

          Paolo Castagna added a comment - - edited

          Hi Lewis, considering the context (i.e. Apache (and you as committer in both projects)) it makes a lot of sense for a project such as Any23 to use Nutch as crawler.
          If that is possible and a comparable amount of work, it certainly makes even more sense.
          I had not looked at the details, in particular at what it would take to get rid of the dsiutils dependency and/or ANY23-47.

          Lewis John McGibbney added a comment -

          Thanks Paolo. I think you're right. I opened ANY23-47 to try and deal with this head on, but it seems a hellish amount of work to address the issue given the time I have and the fact that we would really like to get the 0.7.0 release going.

          I'll contact him directly.

          Paolo Castagna added a comment -

          > fastutil is ASL
          > dsiutils is LGPL

          The author of fastutil and dsiutils (i.e. Sebastiano Vigna --> http://vigna.dsi.unimi.it/) has changed the license of fastutil to ASL when moving from version 5.1.5 to 6.0.0.
          fastutil is in Maven Central, while dsiutils is not.
          Maybe, dsiutils could/should follow and change to ASL as well as being published/released in the Maven Central repository.
          Asking Sebastiano directly is certainly an option on the table.

          Simone Tripodi added a comment -

          Hi Lewis,

          IIUC there are some transitive dependencies that are not available on the Maven central repo, which is why that repo is still there, but I obviously 100% agree that it must be removed.

          I am investigating how we can drop that repo; I'll let you know as soon as I have concrete results to share.

          Thanks for your help!

          Lewis John McGibbney added a comment -

          These are all being pulled from http://any23.googlecode.com/svn/repo-ext/...

          Why have we still got this repository? A HUGE number of our dependencies are being pulled from here when I view my install log output. Has anyone tried removing this from the parent pom and building the project? From what I can see, most of the deps are available on Maven central!

          I also can't see where in the basic-crawler code anything from the offending dep is actually being used. However, when I removed it my tests failed, so I'm guessing it's somewhere in there.

          Andy Seaborne added a comment -

          It's an immediate dependency in basic-crawler.

          dependency:tree at the top gives me ....

          [INFO] org.apache.any23.plugin:any23-basic-crawler:jar:1.0.0-incubating-SNAPSHOT
          ....
          [INFO] +- it.unimi.dsi:fastutil:jar:6.4.1:compile
          [INFO] +- it.unimi.dsi:dsiutils:jar:2.0.1:compile

          fastutil is ASL
          dsiutils is LGPL
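
A check like the one above, done by eye on the dependency tree, could also be scripted. The following Java sketch scans lines of mvn dependency:tree output for flagged groupId:artifactId pairs. LicenseScanSketch and flagged are hypothetical names introduced only for illustration, and the set of flagged artifacts is supplied by the caller; nothing here determines licences by itself.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical helper: finds flagged artifacts in textual dependency:tree output.
final class LicenseScanSketch {

    // Returns the full coordinates of every tree line whose
    // groupId:artifactId is contained in flaggedArtifacts.
    static List<String> flagged(List<String> treeLines, Set<String> flaggedArtifacts) {
        List<String> hits = new ArrayList<>();
        for (String line : treeLines) {
            // Drop the "[INFO]" prefix, then the tree-drawing characters ("+- ", "|  ", "\- ").
            String coord = line
                    .replaceFirst("^\\[INFO\\]\\s*", "")
                    .replaceFirst("^[|+\\\\\\-\\s]+", "");
            String[] parts = coord.split(":");
            if (parts.length >= 2 && flaggedArtifacts.contains(parts[0] + ":" + parts[1])) {
                hits.add(coord);
            }
        }
        return hits;
    }
}
```

Fed the three tree lines quoted above and a flagged set containing only it.unimi.dsi:dsiutils, the helper would report the dsiutils coordinate and ignore fastutil.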

          Lewis John McGibbney added a comment -

          This is critical. Does anyone have suggestions to get this sorted? It's this kind of stuff we need to deal with if we want to shift 0.7.0? No?...

          Also where exactly is this dependency managed or even specified? I can't find it. Thanks


            People

            • Assignee: Michele Mostarda
            • Reporter: Simone Tripodi
            • Votes: 0
            • Watchers: 1
