[SOLR-11087] Get rid of jar duplicates in release - ASF JIRA

Details

Type: Sub-task
Status: Closed
Priority: Major
Resolution: Duplicate
Affects Version/s: None
Fix Version/s: 8.1, 9.0
Component/s: Build
Labels: None

Description

The dist/ folder contains many duplicate jar files, totalling 10,5M:

4,6M   ./dist/solr-core-6.6.0.jar (WEB-INF/lib)
1,2M   ./dist/solr-solrj-6.6.0.jar (WEB-INF/lib)
4,7M   ./dist/solrj-lib/* (WEB-INF/lib and server/lib/ext)

The rest of the files in dist/ are contrib jars and test-framework.

To weed out the duplicates and save 10,5M, we can simply add a dist/README.md file listing which jar files are located where. The file could also contain a bash one-liner to copy them to the dist folder. Another possibility is to ship the binary release tarball with symlinks in the dist folder and advise people to use cp -RL dist mydist, which makes a copy with the real files. The downside is that this won't work for ZIP archives that do not preserve symlinks, nor on Windows.
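For platforms without `cp`, the effect of `cp -RL dist mydist` can be approximated in plain Java. The following is only a sketch under assumptions: the class and method names (`DerefCopy`, `copyFollowingLinks`) are made up for illustration and are not part of Solr or of the attached patch.

```java
import java.io.IOException;
import java.nio.file.FileVisitOption;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import java.util.stream.Stream;

// Hypothetical helper, not part of Solr: copies a tree and replaces
// every symlink with the real file or directory it points to.
final class DerefCopy {

    static void copyFollowingLinks(Path src, Path dst) throws IOException {
        // FOLLOW_LINKS makes the walk descend into symlinked directories.
        try (Stream<Path> paths = Files.walk(src, FileVisitOption.FOLLOW_LINKS)) {
            for (Path p : (Iterable<Path>) paths::iterator) {
                Path target = dst.resolve(src.relativize(p).toString());
                if (Files.isDirectory(p)) {
                    Files.createDirectories(target);
                } else {
                    Path parent = target.getParent();
                    if (parent != null) {
                        Files.createDirectories(parent);
                    }
                    // Without NOFOLLOW_LINKS, Files.copy copies the link's target.
                    Files.copy(p, target, StandardCopyOption.REPLACE_EXISTING);
                }
            }
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.out.println("usage: java DerefCopy <dist> <mydist>");
            return;
        }
        copyFollowingLinks(Paths.get(args[0]), Paths.get(args[1]));
    }
}
```

This does not address the ZIP limitation mentioned above: if the archive format drops the symlinks, there is nothing left to dereference.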

Attachments

SOLR-11087.patch
17/Jul/17 11:27
16 kB
Jan Høydahl

Issue Links

duplicates

SOLR-15916 Remove "dist/" from distribution

Closed

Activity

Jan Høydahl added a comment - 17/Jul/17 11:33

First attempt at this; see the attached patch.

- Changes the build so that duplicate jars are not copied into dist/
- The webapp "dist" target copies the solr-core and solr-solrj jars from the build folders, not from dist
- The dist target puts a README.md file in dist/, explaining where to find the remaining files
- New bin/solr utils dist <path-to-folder> command that assembles a complete dist folder

The tarball created before this patch is 144 MB, and after it 132 MB, which is a 12 MB (9%) reduction.

Todo:

- Test on Windows
- Update ref-guide
- Assess whether the change has other side effects (scripts, docs, etc.)

Uwe Schindler added a comment - 17/Jul/17 12:06 - edited

Hi,

I like the cleanup. Personally, I'd go the other route: I would remove the whole webapp folder and, instead of using start.jar, assemble the servlet context in a simple Java main() method, where the JAR files are picked from the server or dist folder. I have done this several times. You can build a webapp without a web.xml in about 30 lines of code that start up Jetty and wire in the servlets and filters. On top of that, you don't even need a webapp folder anywhere. And static resources can also be delivered directly from a JAR file! Here is a simple example I did for a micro-service:

      final Server server = new Server();
      server.setStopAtShutdown(true);
      server.setStopTimeout(1000L);
      
      // setup connectors...
      final ServerConnector ipv6 = new ServerConnector(server);
      ipv6.setPort(PORT);
      ipv6.setHost("::1");
      ipv6.setIdleTimeout(IDLE_TIMEOUT);
      server.addConnector(ipv6);

      // add servlet context:
      final ServletContextHandler context = new ServletContextHandler(ServletContextHandler.NO_SECURITY | ServletContextHandler.NO_SESSIONS);
      context.insertHandler(new ResourceHandler());
      context.setBaseResource(Resource.newClassPathResource("/webroot/"));
      context.addServlet(SelectServlet.class, "/select");
      context.addServlet(RecordServlet.class, "/record/*");
      server.setHandler(context);

      // start webserver
      server.start();
      server.join();

This is just an example binding 2 servlets; I removed the other non-servlet and logging stuff, so you can add your own access/Jetty logging, too. Another good thing is that we can get rid of the stupid internal "/c" redirects, as you have full flexibility over what you bind where. And finally the WAR file is gone and the server does not unpack it on startup. In addition, we can do all the startup logic before spawning Jetty (e.g. starting embedded ZooKeeper, checking the log4j config, ...).

Static files and stuff are loaded using context.insertHandler(new ResourceHandler()); context.setBaseResource(Resource.newClassPathResource("/webroot/")); directly from one of the JAR files (where a webroot/ folder is part of the JAR's contents/resources). This can be a separate JAR file or simply in main solr.jar. This makes it very small and unpacking Solr would take only a second then (currently unzipping all those small static files is a mess).

So I'd go that route and as a first step refactor the web.xml file into a simple startup main() method as seen before. I can help with that. I have some time this week, so I may make a quick and dirty mockup - if you agree.

Jan Høydahl added a comment - 17/Jul/17 12:20

> This makes it very small and unpacking Solr would take only a second then (currently unzipping all those small static files is a mess).

Solr no longer unzips a war since long time ago?

I did not go down that route now because it sounds much more invasive. But I like it a lot, and it would give us much more control over everything, including the naming of Java options such as jetty.port, which we could then rename to solr.port etc. I guess this is how our test framework constructs Jetty instances already.

I'm in no hurry with this. If you have time to mock up a refactoring that gets rid of the dependence on all jars being in WEB-INF/lib and instead picks them from /dist, then that would be perfect and achieve two goals in one.

Uwe Schindler added a comment - 17/Jul/17 12:37

> Solr no longer unzips a war since long time ago?

This was meant about unzipping the Solr distribution itself. If all the "static stuff" like jQuery and a lot of HTML is gone, it would unzip faster and take up a lot less disk space. You would have the whole admin interface in a single JAR file. This would also allow us to package the old and new admin interfaces separately and maybe exchange them or make them pluggable!? I agree, the WAR is no longer unpacked - you are right. I took care of this with the others a while ago. Unzipping the distribution is now also a lot faster since we removed the Javadocs - thanks for that!

Jan Høydahl added a comment - 17/Jul/17 12:47

Ah, now I see. You mean jarring up the webapp/web folder. I did a count, and that is 310 files. I agree it would compress better and leave fewer files to unzip.
But on the other hand, I like the fact that you can very easily edit an HTML or JS file and see the change in Solr's Admin. I think it has helped people identify and fix UI bugs for us. If it were all inside a jar that needed unpack -> edit -> re-pack to see the change, then hacking on Solr's UI would be less accessible for both users and developers.

Moving all the solr.xml things into Java code is actually just a benefit, it keeps folks from treating Solr as a Web app and doing things the wrong way. At the same time, I think we'll see complaints once we do the move, from people who depend on their custom filters etc. But as I said, I think that is just a positive step towards a truly standalone app.

Jan Høydahl added a comment - 03/Jan/19 17:39

thetaphi, what do you think of committing this for the sake of the size improvement in 8.0, or are you planning to get rid of start.jar and the webapp folder already for 8.0?

Shawn Heisey added a comment - 21/Oct/19 16:29

I like the idea of replacing the duplicates in dist with a README describing exactly which jars to copy out of the webapp if they are needed and where to find them. One thing I wonder is whether we expect hardlinks to be supported across all operating systems that natively support tarballs – if we do, we could use hardlinks to share files in dist and the webapp, and mention this fact in the README. If we think that hardlinks might be a specialty item, then just the README would be appropriate.

While looking at ~~SOLR-13841~~ I found that we have duplicate jars other than the ones in dist, which currently are expected. My question for the moment: Should I open a new issue for those other duplicates, or would we want to expand this issue to cover a full audit of the jars in the binary release? I almost started with a new issue ... glad I searched first.
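A full audit of the kind asked about here could start from a content-hash comparison of every jar in the unpacked release. The sketch below is an assumption about how such an audit might look, not an existing Solr tool; `JarAudit` and its method names are hypothetical. It groups `*.jar` files by SHA-256 and reports how many bytes the extra copies occupy.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical audit helper, not part of Solr.
final class JarAudit {

    static String sha256(Path file) throws IOException {
        MessageDigest md;
        try {
            md = MessageDigest.getInstance("SHA-256");
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
        try (InputStream in = Files.newInputStream(file)) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                md.update(buf, 0, n);
            }
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    // Groups all *.jar files under root by content hash and keeps
    // only the groups that contain more than one file.
    static Map<String, List<Path>> findDuplicates(Path root) throws IOException {
        List<Path> jars;
        try (Stream<Path> paths = Files.walk(root)) {
            jars = paths.filter(Files::isRegularFile)
                    .filter(p -> p.getFileName().toString().endsWith(".jar"))
                    .sorted()
                    .collect(Collectors.toList());
        }
        Map<String, List<Path>> byHash = new TreeMap<>();
        for (Path jar : jars) {
            byHash.computeIfAbsent(sha256(jar), k -> new ArrayList<>()).add(jar);
        }
        byHash.values().removeIf(group -> group.size() < 2);
        return byHash;
    }

    // Bytes occupied by the redundant copies (all but one file per group).
    static long wastedBytes(Map<String, List<Path>> dups) throws IOException {
        long total = 0;
        for (List<Path> group : dups.values()) {
            total += Files.size(group.get(0)) * (group.size() - 1);
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        Map<String, List<Path>> dups = findDuplicates(root);
        dups.forEach((hash, group) -> System.out.println(hash.substring(0, 12) + "  " + group));
        System.out.println("Reclaimable bytes: " + wastedBytes(dups));
    }
}
```

Run against the root of an unpacked release, this would list every group of byte-identical jars regardless of where they live. Jars that differ only in version or build metadata would not be caught.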

Uwe Schindler added a comment - 21/Oct/19 16:43 - edited

Hi,
I think the issue here should focus on getting rid of the web application and having a single lib folder directly below the root dir of the distribution. Then we have a solr-main.jar (without SolrJ), which also contains a Main.class to bootstrap Jetty. This would make deployment much easier. As said before, the tons of HTML/JavaScript should also be packaged into a JAR file, to get rid of the many small files that make unzipping damn slow and consume lots of space (block size). Jetty is able to deliver static content from a JAR file directly; I use this all the time for microservice-like stuff.

Once we are at that place we could maybe split the root lib folder into 2 of them: One with Solrj and one with the remaining stuff to startup the server. The contrib modules can be linked into cores the usual way with <lib>.

If I had some more time, I could start refactoring all this, but my knowledge of Solr is limited. I'd really like to get rid of SolrDispatchFilter and replace it with a simple servlet or, better, a Jetty Handler directly added to the root context in the Solr Bootstrap class.

About your comment: Hardlinks are a problem on Windows, and symlinks are no better. I would not do this.

Uwe Schindler added a comment - 21/Oct/19 16:49

But nevertheless, we can get rid of the duplicates if we do some classpath magic. If we move all SolrJ JAR files up the tree and add them to Jetty's classpath, we can still keep the rest in the webapp folder. Webapps see JAR files from the context, too.

Shawn Heisey added a comment - 21/Oct/19 18:19

One long-term goal that I have (and I think it's shared by others) is to make the precise way of providing network services into an implementation detail. Solr should become a standalone application, and one way to do that is to embed Jetty into the application, so it's completely under our control and hidden from the user.

We historically have left classpath management mostly up to the container, but we're going to have to take that over if we want the goal above to succeed. Simplification and good separation will be important for that.

On the hardlink idea: It's really only viable in the tarball. I don't think we could do it in the zip version, and in truth I don't like it any more than you do. But if we do proper classpath management with our scripting, the whole notion is moot anyway, because we will have solved the problem.
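If a script were to populate dist/ after unpacking instead of relying on links stored in the archive (an assumption for illustration, not something proposed in this thread), the portability concern could be handled by trying a hard link and falling back to a plain copy. A minimal sketch, with the hypothetical names `LinkOrCopy` and `linkOrCopy`:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of Solr.
final class LinkOrCopy {

    // Returns true if a hard link was created, false if the file was copied.
    static boolean linkOrCopy(Path existing, Path link) throws IOException {
        try {
            // Note the argument order: the new link comes first.
            Files.createLink(link, existing);
            return true;
        } catch (UnsupportedOperationException | IOException e) {
            // Hard links unsupported, different volume, or no permission:
            // fall back to a real copy. If the destination already exists,
            // this copy fails as well, so nothing is overwritten silently.
            Files.copy(existing, link);
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.out.println("usage: java LinkOrCopy <existing-jar> <new-path>");
            return;
        }
        boolean linked = linkOrCopy(Paths.get(args[0]), Paths.get(args[1]));
        System.out.println(linked ? "hard link created" : "copied");
    }
}
```

Either way the destination ends up with the same bytes; only the disk usage differs.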

Jan Høydahl added a comment - 18/Jan/22 09:19

Closing this as duplicate of ~~SOLR-15916~~, as I think they achieve the same

Jan Høydahl added a comment - 12/May/22 00:26

Closing after the 9.0.0 release

People

Assignee: Unassigned

Reporter: Jan Høydahl

Votes: 0

Watchers: 4

Dates

Created: 14/Jul/17 22:25

Updated: 12/May/22 00:26

Resolved: 18/Jan/22 09:19