Description
Sub task of SOLR-6806
The dist/ folder contains many duplicate jar files, totalling 10.5M:
- 4.6M ./dist/solr-core-6.6.0.jar (WEB-INF/lib)
- 1.2M ./dist/solr-solrj-6.6.0.jar (WEB-INF/lib)
- 4.7M ./dist/solrj-lib/* (WEB-INF/lib and server/lib/ext)
The rest of the files in dist/ are contrib jars and test-framework.
To weed out the duplicates and save 10.5M, we can simply add a dist/README.md file listing which jar files are located where. The file could also contain a bash one-liner to copy them to the dist folder. Another possibility is to ship the binary release tarball with symlinks in the dist folder, and advise people to use cp -RL dist mydist, which will make a copy with the real files. The downside is that this won't work for ZIP archives that do not preserve symlinks, nor on Windows.
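To see which jars such a README would have to list, the duplicates can be found mechanically. The following is only a minimal sketch, not part of the attached patch: the class name DuplicateJarFinder and its methods are hypothetical, and it matches jars by file name only, which assumes that equally named jars in a release have identical content.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical helper: reports jar file names that occur more than once under a directory tree. */
final class DuplicateJarFinder {

    /** Maps each jar file name to all of its locations, keeping only names found at least twice. */
    static Map<String, List<Path>> findDuplicates(Path root) throws IOException {
        List<Path> jars;
        try (Stream<Path> paths = Files.walk(root)) {
            jars = paths.filter(Files::isRegularFile)
                        .filter(p -> p.getFileName().toString().endsWith(".jar"))
                        .sorted()
                        .collect(Collectors.toList());
        }
        Map<String, List<Path>> byName = new TreeMap<>();
        for (Path jar : jars) {
            byName.computeIfAbsent(jar.getFileName().toString(), k -> new ArrayList<>()).add(jar);
        }
        // Names that appear only once are not duplicates.
        byName.values().removeIf(locations -> locations.size() < 2);
        return byName;
    }

    /** Sums the sizes of every copy after the first, i.e. the bytes that are redundant. */
    static long redundantBytes(Map<String, List<Path>> duplicates) throws IOException {
        long total = 0;
        for (List<Path> locations : duplicates.values()) {
            for (Path extra : locations.subList(1, locations.size())) {
                total += Files.size(extra);
            }
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        // Assumption: the first argument is the root of an unpacked release; defaults to ".".
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        Map<String, List<Path>> duplicates = findDuplicates(root);
        duplicates.forEach((name, locations) -> System.out.println(name + " -> " + locations));
        System.out.println("Redundant bytes: " + redundantBytes(duplicates));
    }
}
```

Run against the root of an unpacked release, the printed locations could serve as the raw material for the proposed dist/README.md.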
Attachments
- SOLR-11087.patch
- 16 kB
- Jan Høydahl
Issue Links
- duplicates
- SOLR-15916 Remove "dist/" from distribution
- Closed
Activity
Hi,
I like the cleanup. Personally, I'd go the other route: I would remove the whole webapp folder and, instead of using start.jar, I'd assemble the servlet context in a simple Java main() method, where the JAR files are picked from the server or dist folder. I have done this several times. You can build a webapp without a web.xml in code, with about 30 lines to start up Jetty and link servlet filters and code. An added benefit is that you don't even need a webapp folder anywhere. And static resources can also be delivered directly from a JAR file! Here is a simple example I did for a micro-service:
    final Server server = new Server();
    server.setStopAtShutdown(true);
    server.setStopTimeout(1000L);

    // setup connectors...
    final ServerConnector ipv6 = new ServerConnector(server);
    ipv6.setPort(PORT);
    ipv6.setHost("::1");
    ipv6.setIdleTimeout(IDLE_TIMEOUT);
    server.addConnector(ipv6);

    // add servlet context:
    final ServletContextHandler context = new ServletContextHandler(
        ServletContextHandler.NO_SECURITY | ServletContextHandler.NO_SESSIONS);
    context.insertHandler(new ResourceHandler());
    context.setBaseResource(Resource.newClassPathResource("/webroot/"));
    context.addServlet(SelectServlet.class, "/select");
    context.addServlet(RecordServlet.class, "/record/*");
    server.setHandler(context);

    // start webserver
    server.start();
    server.join();
This is just an example binding two servlets; I removed the other non-servlet and logging code, so you can add your own access/Jetty logging, too. Another benefit is that we can get rid of the stupid internal "/c" redirects, as you have full flexibility over what you bind where. And finally, the WAR file is gone and the server does not unpack it on startup. In addition, we can do all the startup logic before spawning Jetty (e.g. starting embedded ZooKeeper, checking the log4j config, ...).
Static files are loaded using context.insertHandler(new ResourceHandler()); context.setBaseResource(Resource.newClassPathResource("/webroot/")); directly from one of the JAR files (where a webroot/ folder is part of the JAR's contents/resources). This can be a separate JAR file or simply the main solr.jar. This makes it very small and unpacking Solr would take only a second then (currently unzipping all those small static files is a mess).
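The point that static content can be read from a JAR without unpacking it can be illustrated with the JDK alone. This is a minimal sketch, not Jetty's implementation and not code from the micro-service above: the class JarWebrootDemo, its methods, and the webroot/index.html entry are made up for illustration, and a plain URLClassLoader stands in for whatever class path lookup the server actually performs.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.jar.JarEntry;
import java.util.jar.JarOutputStream;

/** Hypothetical demo: a static file is read straight out of a jar through a class loader. */
final class JarWebrootDemo {

    /** Writes a tiny jar with a single entry (a stand-in for a packaged admin UI). */
    static Path writeDemoJar(Path jarFile, String entryName, String content) throws IOException {
        try (OutputStream out = Files.newOutputStream(jarFile);
             JarOutputStream jar = new JarOutputStream(out)) {
            jar.putNextEntry(new JarEntry(entryName));
            jar.write(content.getBytes(StandardCharsets.UTF_8));
            jar.closeEntry();
        }
        return jarFile;
    }

    /** Reads a resource from the jar via the class path; nothing is extracted to disk. */
    static String readFromJar(Path jarFile, String resourceName) throws IOException {
        URL[] urls = { jarFile.toUri().toURL() };
        try (URLClassLoader loader = new URLClassLoader(urls, null);
             InputStream in = loader.getResourceAsStream(resourceName)) {
            if (in == null) {
                throw new IOException("Resource not found: " + resourceName);
            }
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            byte[] chunk = new byte[4096];
            int n;
            while ((n = in.read(chunk)) != -1) {
                buffer.write(chunk, 0, n);
            }
            return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws IOException {
        Path jar = Files.createTempFile("webroot-demo", ".jar");
        writeDemoJar(jar, "webroot/index.html", "<h1>hello</h1>");
        System.out.println(readFromJar(jar, "webroot/index.html"));
        Files.delete(jar);
    }
}
```

The distribution would then ship one jar instead of hundreds of small files, at the cost of having to repackage the jar whenever a static file is edited.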
So I'd go that route and, as a first step, refactor the web.xml file into a simple startup main() method as shown above. I can help with that. I have some time this week, so I may make a quick-and-dirty mockup, if you agree.
> This makes it very small and unpacking Solr would take only a second then (currently unzipping all those small static files is a mess).
Solr stopped unzipping a WAR a long time ago, didn't it?
I did not go down that route now because it sounds much more invasive. But I like it a lot, and it would give us much more control over everything, including the naming of Java options such as jetty.port, which we could then rename to solr.port etc. I guess this is how our test framework already constructs Jetty instances.
I'm in no hurry with this. If you have time to mock up a refactoring that removes the requirement for all jars to be in WEB-INF/lib and instead picks them from /dist, that would be perfect and achieve two goals in one.
> Solr stopped unzipping a WAR a long time ago, didn't it?
I meant unzipping the Solr distribution itself. If all the "static stuff" like jQuery and a lot of HTML is gone, it would unzip faster and take up a lot less disk space. You would just have the whole admin interface as a single JAR. This would also allow us to package the old and new admin interfaces separately and maybe exchange them or make them pluggable!? I agree, the WAR is no longer unpacked; you are right. I took care of this with the others a while ago. Unzipping the distribution is now also a lot faster since we removed the Javadocs. Thanks for that!
Ah, now I see. You mean jarring up the webapp/web folder. I did a count and that is 310 files. I agree it would compress better and leave fewer files to unzip.
But on the other hand, I like the fact that you can very easily edit an HTML or JS file and see the change in Solr's Admin. I think it has helped people identify and fix UI bugs for us. If it were all inside a jar that needed unpack -> edit -> re-pack to see the change, then hacking on Solr's UI would be less accessible for both users and developers.
Moving all the solr.xml things into Java code is actually just a benefit, it keeps folks from treating Solr as a Web app and doing things the wrong way. At the same time, I think we'll see complaints once we do the move, from people who depend on their custom filters etc. But as I said, I think that is just a positive step towards a truly standalone app.
thetaphi, what do you think of committing this for 8.0 for the sake of the size improvement, or are you already planning to get rid of start.jar and the webapp folder for 8.0?
I like the idea of replacing the duplicates in dist with a README describing exactly which jars to copy out of the webapp if they are needed and where to find them. One thing I wonder is whether we expect hardlinks to be supported across all operating systems that natively support tarballs – if we do, we could use hardlinks to share files in dist and the webapp, and mention this fact in the README. If we think that hardlinks might be a specialty item, then just the README would be appropriate.
While looking at SOLR-13841 I found that we have duplicate jars other than the ones in dist, which currently are expected. My question for the moment: Should I open a new issue for those other duplicates, or would we want to expand this issue to cover a full audit of the jars in the binary release? I almost started with a new issue ... glad I searched first.
Hi,
I think the issue here should focus on getting rid of the web application and having a single lib folder directly below the root dir of the distribution. Then we have a solr-main.jar (without solrj) that also contains a Main.class to bootstrap Jetty. This would make deployment much easier. As said before, the tons of HTML/JavaScript should also be packaged into a JAR file, to get rid of the many small files that make unzipping damn slow and consume lots of space (block size). Jetty is able to deliver static content from a JAR file directly; I use this all the time for microservice-like stuff.
Once we are at that place we could maybe split the root lib folder into 2 of them: One with Solrj and one with the remaining stuff to startup the server. The contrib modules can be linked into cores the usual way with <lib>.
If I had more time, I could start refactoring all this, but my knowledge of Solr is limited. I'd really like to get rid of SolrDispatchFilter and replace it with a simple servlet or, better, a Jetty Handler added directly to the root context in the Solr bootstrap class.
About your comment: hardlinks are a problem on Windows, and symlinks are no better. I'd not do this.
But nevertheless, we can get rid of the duplicates if we do some classpath magic. If we move all SolrJ JAR files up the tree and add them to Jetty's classpath, we can still keep the rest in the webapp folder. Webapps see JAR files from the context, too.
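The delegation that this idea relies on can be sketched with plain JDK class loaders. This is an illustration under stated assumptions, not Jetty code: SharedJarClasspathDemo and the marker file names are hypothetical, a URLClassLoader with a parent stands in for the webapp's loader, and Jetty's real web application class loader applies additional rules that the sketch ignores.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.jar.JarEntry;
import java.util.jar.JarOutputStream;

/** Hypothetical demo: a child loader (the "webapp") sees jars held by its parent (the "server"). */
final class SharedJarClasspathDemo {

    /** Writes a jar with one marker entry so that its visibility can be checked. */
    static Path writeJar(Path jarFile, String entryName, String content) throws IOException {
        try (OutputStream out = Files.newOutputStream(jarFile);
             JarOutputStream jar = new JarOutputStream(out)) {
            jar.putNextEntry(new JarEntry(entryName));
            jar.write(content.getBytes(StandardCharsets.UTF_8));
            jar.closeEntry();
        }
        return jarFile;
    }

    /** Creates a loader for one jar; a null parent means only JDK classes sit above it. */
    static URLClassLoader loaderFor(Path jarFile, ClassLoader parent) throws IOException {
        return new URLClassLoader(new URL[] { jarFile.toUri().toURL() }, parent);
    }

    public static void main(String[] args) throws IOException {
        Path shared = writeJar(Files.createTempFile("shared", ".jar"), "shared/marker.txt", "solrj");
        Path webapp = writeJar(Files.createTempFile("webapp", ".jar"), "webapp/marker.txt", "core");
        try (URLClassLoader server = loaderFor(shared, null);
             URLClassLoader webappLoader = loaderFor(webapp, server)) {
            // The child delegates to its parent, so the shared jar is visible without a second copy.
            System.out.println("webapp sees shared jar: "
                + (webappLoader.getResource("shared/marker.txt") != null));
            // The parent does not look into the child, so webapp-only jars stay private to it.
            System.out.println("server sees webapp jar: "
                + (server.getResource("webapp/marker.txt") != null));
        }
        Files.delete(shared);
        Files.delete(webapp);
    }
}
```

Under this model each shared jar needs to exist only once, in the parent's location, which is the deduplication the comment is after.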
One long-term goal that I have (and I think it's shared by others) is to make the precise way of providing network services into an implementation detail. Solr should become a standalone application, and one way to do that is to embed Jetty into the application, so it's completely under our control and hidden from the user.
We historically have left classpath management mostly up to the container, but we're going to have to take that over if we want the goal above to succeed. Simplification and good separation will be important for that.
On the hardlink idea: It's really only viable in the tarball. I don't think we could do it in the zip version, and in truth I don't like it any more than you do. But if we do proper classpath management with our scripting, the whole notion is moot anyway, because we will have solved the problem.
Closing this as a duplicate of SOLR-15916, as I think they achieve the same thing.
First attempt at this; see the patch.
The tarball created before this patch is 144MB, and after it 132MB, which is a 12MB (9%) reduction.
Todo: