Details
- Type: Sub-task
- Priority: Major
- Status: Closed
- Resolution: Implemented
Description
Solr's BackupRepository interface provides an abstraction around the physical location and format in which backups are stored. This allows plugin writers to create "repositories" for a variety of storage mediums. It would be nice if Solr offered more mediums out of the box, though, such as some of the "blobstore" offerings provided by various cloud providers.
This ticket proposes a "BackupRepository" implementation for Amazon's popular S3 blobstore, so that Solr users can use it for backups without needing to write their own code.
Amazon offers an S3 Java client with acceptable licensing, and the required code is relatively simple. The biggest challenge in supporting this will likely be procedural: integration testing requires S3 access, and S3 access costs money. We can check with INFRA to see whether there is any way to get cloud credits so an integration test can run in nightly Jenkins runs on the ASF Jenkins server. Alternatively, we can try to stub out the blobstore in some reliable way.
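To make the abstraction concrete, here is a heavily simplified, hypothetical sketch of the kind of plugin surface BackupRepository provides (the real Solr interface has many more methods — listing, directory creation, checksums, etc.), together with a local-filesystem implementation; an S3 implementation would make the same calls against a blobstore client instead of java.nio.file. All names below are illustrative, not the actual Solr API.

```java
import java.io.IOException;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical, simplified stand-in for Solr's BackupRepository surface.
interface SimpleBackupRepository {
    void write(URI dest, byte[] data) throws IOException;
    byte[] read(URI src) throws IOException;
    boolean exists(URI path);
}

// Local-filesystem implementation; an S3 version would translate these
// calls into putObject/getObject/headObject requests instead.
class LocalFsRepository implements SimpleBackupRepository {
    public void write(URI dest, byte[] data) throws IOException {
        Path p = Paths.get(dest);
        Files.createDirectories(p.getParent());
        Files.write(p, data);
    }
    public byte[] read(URI src) throws IOException {
        return Files.readAllBytes(Paths.get(src));
    }
    public boolean exists(URI path) {
        return Files.exists(Paths.get(path));
    }
}

public class RepositorySketch {
    public static void main(String[] args) throws IOException {
        SimpleBackupRepository repo = new LocalFsRepository();
        Path dir = Files.createTempDirectory("backup-demo");
        URI dest = dir.resolve("segments_1").toUri();
        repo.write(dest, "index data".getBytes());
        System.out.println("restored: " + new String(repo.read(dest)));
    }
}
```

Because callers only see the interface, swapping the storage medium (local disk, HDFS, S3, GCS) is a configuration change rather than a code change.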
Attachments
Issue Links
- is related to: SOLR-15599 Upgrade AWS SDK from v1 to v2 for the S3 Repository (Closed)
- relates to: SOLR-15051 Shared storage -- BlobDirectory (de-duping) (Resolved)
- supersedes: SOLR-9952 S3BackupRepository (Closed)
Activity
Thanks, it'll be awesome to get this in! In terms of testing, I agree S3 mocks are a good way to go. Even if we can get some sort of cloud credits from INFRA, devs will still likely want something they can run anywhere to sanity-check related changes.
I haven't done much looking into libraries yet, though. Very interested if you have any suggestions for a particular library to stand in for real S3!
This is really cool, Jason!
I don't think there's any way (yet) to get AWS credits via Apache. Anyone who needs something along those lines would have to use free credits or pay for it themselves.
As Varun mentioned, I think mocks are a reasonable way to proceed, and there'll be enough people who'd want to take this to production and would have access to test and report. I know it's not ideal, but I guess it'll work.
Great idea, gerlowskija! How far along are you in coding this effort? I work at Salesforce and we wrote+use an S3 implementation of BackupRepository for a production Solr stack. It's not in an upstreamable state right now (e.g., uses some internal libraries for grabbing keys/secrets, etc.), but I would be happy to look into cleaning it up and submitting it for consideration if you haven't started yet. Or if you've already written the code, then feel free to add me on your code review.
In regards to testing, we use the Adobe S3Mock (https://github.com/adobe/S3Mock) library for writing unit tests. Since this code is fairly simple, as you mentioned, the S3 APIs it uses are all mainstream and mockable with that framework. For larger, end-to-end integration tests, we've also started using Minio (https://min.io/) to emulate an S3 server, but I would think that's outside the scope of this ticket.
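The mock approach described above boils down to substituting the storage layer in tests. As a purely illustrative sketch (S3Mock does this at the HTTP layer against the real AWS client; the class and method names below are hypothetical), an in-memory key/value store can emulate the handful of S3 operations a backup repository needs:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.NoSuchElementException;

// Illustrative in-memory stand-in for an S3-style blob store, the kind
// of substitution a unit test makes instead of talking to real S3.
class InMemoryBlobStore {
    private final Map<String, Map<String, byte[]>> buckets = new HashMap<>();

    void createBucket(String bucket) {
        buckets.putIfAbsent(bucket, new HashMap<>());
    }
    void putObject(String bucket, String key, byte[] data) {
        buckets.get(bucket).put(key, data.clone());
    }
    byte[] getObject(String bucket, String key) {
        byte[] data = buckets.get(bucket).get(key);
        if (data == null) throw new NoSuchElementException(bucket + "/" + key);
        return data.clone();
    }
    // Prefix listing, mirroring S3's list-objects-with-prefix semantics.
    List<String> listKeys(String bucket, String prefix) {
        List<String> keys = new ArrayList<>();
        for (String k : buckets.get(bucket).keySet()) {
            if (k.startsWith(prefix)) keys.add(k);
        }
        Collections.sort(keys);
        return keys;
    }
}

public class BlobStoreStubDemo {
    public static void main(String[] args) {
        InMemoryBlobStore store = new InMemoryBlobStore();
        store.createBucket("solr-backups");
        store.putObject("solr-backups", "backup1/segments_1", new byte[] {1, 2, 3});
        store.putObject("solr-backups", "backup1/meta.json", new byte[] {4});
        System.out.println(store.listKeys("solr-backups", "backup1/"));
    }
}
```

The advantage of a framework like S3Mock over a hand-rolled stub is that it exercises the real AWS client code paths (signing, serialization) rather than bypassing them.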
I would strongly prefer for this to stay outside of solr-core, preferably in solr-extras repo (when that's created). Having AWS libraries shipped with Solr by default would feel very awkward.
I would be happy to look into cleaning it up and submitting it for consideration if you haven't started yet
Hey, that's great news Andy! I'm in a similar situation - I have code from my employer (written largely by Shalin and Dat who I used to work with), but it also needs some recontextualization/cleanup work before it's ready to share here.
FWIW, my employer's implementation is well-tested but I don't think it's seen much production traffic. So maybe your Salesforce implementation is a better base to start from, since it sounds like it's seen a good bit of production usage? If you're able to get things cleaned up for contribution, let's work from what you have, and we can use my copy as a fallback or a sanity-check as necessary.
(Ishan raised some good questions about where this code should ultimately live, but if it makes it easier for you to share code we can handle that last. Feel free to put your S3Repository code where-ever is easiest for the moment, and we can relocate it as necessary at the end.)
I would strongly prefer for this to stay outside of solr-core, preferably in solr-extras repo (when that's created).
My primary goal for this is that it lives as ASF code somewhere. So I'm not against solr-extras as a home, if the community has decided on that approach for handling future contrib-y modules.
But does that consensus exist right now? The last email thread about it ends ambiguously, with Hoss asking some questions (that I seconded) about what benefits solr-extras really provides over a single-repo approach.
From that I was under the impression that solr-extras might happen, but was still very much up in the air. But I might've missed some mail on it?
Sounds good gerlowskija, I'll look into tidying up our code to make it open-sourceable. Ours has been used in production for a little over a year, but since the cleanup work may be on the heavy side, the eventual codebase that gets open-sourced won't really have any time in production.
I agree with you and Ishan, solr-core isn't the place for this, but we can I think figure out exactly where at a later date. One more thing to call out is that our implementation is built on AWS SDK v1. We would like to move to v2 at a later date (for built-in metrics, etc.) but haven't had the time yet.
While Andy is working on cleaning up his AWS code, I wanted to get the ball rolling on some of the bigger questions around packaging and setup. AFAIK this (or one of the parallel tickets for GCS or Azure) will be Solr's first "first-class" plugin (right?), so there's some things to figure out.
- Where should the source for "first-class" plugins live? Ishan replied to my questions above by continuing the mailing list discussion I referenced in my previous comment. See there for discussion.
- Where should the project host first-class plugins from an artifact/repository sense? The easiest thing to do would probably be to publish our repository of first-class plugins as a new page or pages on our website (maybe /solr/plugins/downloads.html?)
- What should the release process for first-class plugins look like, and how (if at all) should it interlock with the release process for Solr proper? Some of this will need to flow from the ASF's release policy, but there's a lot to figure out beyond that as well.
We don't need answers for all of these questions necessarily to close this ticket - some could be punted to other tickets if desired. But as they'll be eventual blockers I wanted to lay out some of the open questions here as a start.
I think most people in that thread agree that the best place for plugins that need to be developed and released with Solr is the Solr repo itself. Honestly, I feel this should just be a contrib, the same as the existing ones. There is SOLR-14688 for discussing if we want to move to a different strategy. Let's get S3 backup support in Solr first, it's obviously a very important feature that everyone seems to be building on their own these days.
or one of the parallel tickets for GCS or Azure
Wondering if this implementation using the S3 client could just work with GCS? https://cloud.google.com/storage/docs/migrating#migration-simple
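For reference, the migration doc linked above relies on GCS exposing an S3-compatible XML API, so in principle the same client can be repointed at GCS by overriding the endpoint and using GCS HMAC credentials. A rough, untested configuration sketch against the AWS SDK v1 builder (the credential values are placeholders):

```java
// Untested sketch: AWS SDK v1 client pointed at GCS's S3-compatible
// XML API. Authentication uses GCS HMAC keys, not AWS credentials.
AmazonS3 s3 = AmazonS3ClientBuilder.standard()
    .withEndpointConfiguration(
        new AwsClientBuilder.EndpointConfiguration(
            "https://storage.googleapis.com", "auto"))  // endpoint swap
    .withCredentials(new AWSStaticCredentialsProvider(
        new BasicAWSCredentials("GCS_HMAC_ACCESS_ID", "GCS_HMAC_SECRET")))
    .build();
```

Whether the feature subset the repository uses (multipart uploads, listing semantics, etc.) behaves identically on GCS would need verification.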
Honestly, I feel this should just be a contrib, the same as the existing ones. There is SOLR-14688 for discussing if we want to move to a different strategy.
I didn't realize there was a whole Jira for hashing this out. Thanks for the pointer! In terms of contrib vs plugin, I'm of two minds.
The contrib approach is well trodden, easy, and reversible (contribs can be made into plugins later). All very appealing.
But on the other hand I thought there was a certain amount of consensus about net-new optional modules being created as plugins (and not contribs). I don't want to go against consensus just out of convenience.
(Though maybe this consensus never existed - I spent a little time looking for the relevant thread but couldn't find it. Even if the consensus was clear, there's probably a good argument that the project needs more time to hash out the SOLR-14688 questions, and that devs should continue to create contribs until those discussions play out.)
Anyway, I'll give this some more thought.
the S3 client could just work with GCS?
Interesting, this is news to me and I'll have to read more.
I am uncomfortable with this being a "contrib" in the lucene-solr repository for the following reasons:
- I don't want bloat in the main repository with code that is clearly optional to the search engine. solr-sandbox or solr-extras is better, IMHO.
- I don't want to see flurry of JIRAs, PRs, comments etc. for the main repository dealing with fringe issues to do with S3 or GCS or Azure Blobstore etc. It is a distraction.
- A PR for this "contrib" could include changes to other important modules, and I don't want that to be possible without us going in every time and looking into the PRs.
- I don't want to be blocked on committing something to the main repository, just because it breaks tests in s3mock, S3 support etc.
And of course, I don't want such things being shipped by default (and today all contribs are shipped with Solr).
I've had a lot of thoughts about these questions. More on the release side than the code side.
I don't have strong thoughts about whether these ancillary components should be contrib or not. We do need to decide that, but it's independent from concerns about how we release the artifacts.
I would like to see a situation where we have a "main" solr download that is a third of the current download size, or smaller if possible. That download would only include core functionality, no bells or whistles. On the code side, there should be a main repository that builds the main download. We would need to decide whether we want individual repositories for other components, or one big "plugins" repository. The same decision would need to be made for the release side – one big plugins download, or a bunch of small ones. I can see advantages and disadvantages to either approach.
Not too long ago, I noticed that the download for ES, our biggest competitor, was under 30 MB, compared to 150MB for Solr. I just downloaded the latest ES version, and it has ballooned to 300MB. I guess they now prefer the same "kitchen sink" philosophy that we have always used ... and they have REALLY embraced it.
But on the other hand I thought there was a certain amount of consensus about net-new optional modules being created as plugins (and not contribs).
I don't think there is a consensus for that? Some contribs were removed from the code recently, but I don't think that means no more contribs; I believe each deserves its own discussion, like the one we are having here. The closest thing to a conclusion in the thread you referenced is that people aren't in favor of moving "first party plugins" out of the Solr repo.
I don't want to go against consensus just out of convenience.
Contribs are convenient for you, for users, and for other devs, which makes them a good option IMO. Again, I think there is a specific discussion to be had about each one of them (new and old); we certainly don't want to allow code dumps that fall unsupported and largely unused.
- I don't want bloat in the main repository with code that is clearly optional to the search engine. solr-sandbox or solr-extras is better, IMHO.
I understand Ishan, but that is exactly the discussion going on in the dev list thread linked above. Most people agree that these first party plugins/modules/contribs are better off with the Solr code, for various reasons stated there.
- I don't want to be blocked on committing something to the main repository, just because it breaks tests in s3mock, S3 support etc.
I know that it may be annoying for a feature you don't want/use/need. I see failures for features I don't use all the time, but that's exactly what supporting a feature means. We want to know in advance if a change in Solr is breaking our officially supported plugins, and we don't want to release a Solr version that inadvertently breaks things we support. I want our testing to prevent it.
I would like to see a situation where we have a "main" solr download that is a third of the current download size, or smaller if possible. That download would only include core functionality, no bells or whistles. On the code side, there should be a main repository that builds the main download. We would need to decide whether we want individual repositories for other components, or one big "plugins" repository. The same decision would need to be made for the release side – one big plugins download, or a bunch of small ones. I can see advantages and disadvantages to either approach.
This is the discussion going on here right now. Docker already has a "slim" and a "fat" distribution, we can certainly do that with Solr tars too if we want (I do believe we aren't there yet, see my comment here). But again, I don't think we should block this work until that decision is made.
This particular feature has been requested multiple times, and different users have had to build their own; it's obviously something people want and need, so let's have it!
I don't think there is a consensus for that?
Ok, fair enough. I started to wonder the same thing when I couldn't find the discussion I thought I remembered.
The concerns Ishan cited above arguing for a separate repo don't resonate with me personally, and I agree with Tomas' responses to them.
But that makes me "anti-separate-repo" not "anti-plugin".
Assuming it lives in `lucene-solr`, there are good technical reasons to consider plugin packaging: plugins can have their own release cadence, they needn't be shipped by default, etc. There are good strategic reasons too. We've told the community that plugins are a first class thing, and they're the way to go in Solr development. It's been highlighted in upgrade notes, presented on at conferences, there's a website of sorts, etc. Telling users to do one thing while we continue to do another undercuts that message in a big way. And it deprives users of concrete examples they need to make their own non-trivial plugins.
That's not to say that this must be a plugin IMO. Just that in an ideal world where the community has answers to SOLR-14688's questions, that's the way I'd lean.
But at the same time, Tomas is right that there's a clear desire for this feature and letting it miss the next release while we wait on answers would be a mistake. So for now I will go with contrib-packaging on these object-store tickets, and they can be reviewed in that form. But I'll leave the PRs uncommitted until either there's answers on SOLR-14688 that I can adopt if applicable, or 8.9 or 9.0 starts to get concrete.
Hey athrog - hope everything's going well over there! I wanted to check in on the S3Repository implementation you mentioned sharing - are you still aiming to share it, or did you hit a roadblock of some sort? Hopefully we'll still get a peek at yours, but if something's come up, I just figured I'd reiterate that I'm totally fine with cleaning up what I've got on my side instead. Either way, lmk!
Hi gerlowskija, thanks for checking in again. I don't foresee any major roadblocks – we got some time to work on it recently and found some differences between the Solr version we're on and 8.6, so that took a little longer than I expected to sort things out. Our next steps will be to add some more tests and do some ad-hoc testing with S3 again, since we've now deviated a little from the code we have in production.
Glad to hear it! Lmk if anything comes up or there's anything I can help with.
Hi gerlowskija!
I'm working with athrog on this. I spent a lot of time this week testing, cleaning things up, and integrating the recent changes from SOLR-15090 (mostly in tests). I think we are getting close to opening a pull request to start getting feedback from the community.
What we have so far is fully functional for an end-to-end backup/restore cycle with S3. There are still a few TODOs to address in the code; those could be handled right now or deferred past this ticket if they need more discussion.
The implementation has a layer of abstraction that hides the underlying blob-store substrate (in our case S3). It was initially designed to be easily extended to other storage providers like Azure or GCS with the same implementation of BackupRepository. Since pushing collection backups to a remote blob store shares some concepts across providers, I think it makes sense to also share code.
It's still unclear whether we will keep that abstraction as other implementations of BackupRepository are added. In the longer term, if we keep it, we should merge similar backup repository implementations.
pierre.salagnac - thanks for the update!
I'd really appreciate it if you could share the code while it's still evolving, so the rest of the folks can view it or even contribute. The sooner the better, I guess.
Hey pierre.salagnac,
Yeah, I lean Anshum's direction here. If the code has end-to-end functionality, then it sounds to me like it's far enough along to be worth sharing. No one expects 100% perfection, and the sooner it's posted, the sooner we can get you guys feedback or even start helping with some of the remaining TODOs you mentioned.
To your specific point about having an abstraction specifically for the blob-store BackupRepository implementations, I can see how that might make sense but would have to think it over some more. Seeing how much similarity there is between your S3BackupRepository and (e.g.) the recently added GCSBackupRepository will probably help make that case too.
Anyway, looking forward to seeing what you guys have when you're ready to share. Thanks for helping get it out there!
Thanks everyone for the feedback. Just submitted a PR: https://github.com/apache/solr/pull/120
Getting to this point has very much been a team effort, and I want to thank pierre.salagnac for his time and effort spent on this project, as well as Rajeev Bansal, who got this implementation off the ground and working in production originally.
Hey athrog - did you get a chance to look over my comments on PR 120? Just figured I'd check in and see if you had any thoughts or replies there.
(I often miss Github notifications myself, so just wanted to make sure you weren't still waiting on feedback already given!)
Appreciate the bump gerlowskija – I had indeed missed your (thorough) review! Will respond over there.
x-posting a comment from GH, in case you're not getting notifications from there:
Hey athrog - anything holding this up, or did you just miss the notification on the latest round of comments? (If it's a time thing, lmk and I can try to help out a bit myself.)
Seems like the consensus so far would be to (1) remove the BlobRepository abstraction and (2) repurpose/rebrand the contrib you're adding now to be s3-specific, pending any big objections or arguments from your side of things?
Commit 1cb0850b70a7583501718ed635964f2f605d1742 in solr's branch refs/heads/main from Andy Throgmorton
[ https://gitbox.apache.org/repos/asf?p=solr.git;h=1cb0850 ]
SOLR-15089: Allow backup/restoration to Amazon's S3 blobstore (#120)
See solr/contrib/s3-repository/README.md for more information.
Co-authored-by: Andy Throgmorton <athrogmorton@salesforce.com>
Co-authored-by: Pierre Salagnac <psalagnac@salesforce.com>
Co-authored-by: Houston Putman <houston@apache.org>
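The commit above points to the contrib's README for setup; for orientation, enabling the repository is a solr.xml configuration change along these lines (a minimal sketch — the bucket and region values are placeholders, and the exact parameter names should be taken from the contrib's README):

```xml
<!-- Sketch of enabling the S3 backup repository in solr.xml.
     Values below are placeholders; see solr/contrib/s3-repository/README.md. -->
<solr>
  <backup>
    <repository name="s3" class="org.apache.solr.s3.S3BackupRepository" default="false">
      <str name="s3.bucket.name">my-backup-bucket</str>
      <str name="s3.region">us-east-1</str>
    </repository>
  </backup>
</solr>
```

Backups then target this repository by passing its name (e.g. `repository=s3`) to the Collections API backup/restore commands.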
I missed the maven and idea files in the dev-tools folder. Will fix those after I have it all working on the 8.10 backport.
Commit aa9def440df23db2fd221fa4e5787fb2fd52831d in lucene-solr's branch refs/heads/branch_8x from Houston Putman
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=aa9def4 ]
SOLR-15089: Allow backup/restoration to Amazon's S3 blobstore (#2554)
See solr/contrib/s3-repository/README.md for more information.
Co-authored-by: Andy Throgmorton <athrog@users.noreply.github.com>
Co-authored-by: Andy Throgmorton <athrogmorton@salesforce.com>
Co-authored-by: Pierre Salagnac <psalagnac@salesforce.com>
Commit c112f03482645269d59ffdd9c77bcbbd0248d939 in solr's branch refs/heads/main from Houston Putman
[ https://gitbox.apache.org/repos/asf?p=solr.git;h=c112f03 ]
SOLR-15089 - Various fixes for S3 Repository
- Sensitive AWS sysprop credentials hidden
- AWS credentials in tests for less error logging
- Stax2-api version now set in versions.props
- Protected class made Public for documentation fix in 8.10
Commit a9d4c4a06a22f7ebf97a0068fa758fd89a23e379 in lucene-solr's branch refs/heads/branch_8x from Houston Putman
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=a9d4c4a ]
SOLR-15089: Protect sensitive S3 system properties.
This PR is causing build failures in 8.x and main, will fix shortly.
Commit 72b5b1a73e2251a19cc48ebe1a526377caa13947 in lucene-solr's branch refs/heads/branch_8x from Houston Putman
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=72b5b1a ]
SOLR-15089: Remove unecessary license files.
Commit 26b8fa5350ba5b7c073942f6c084aefa8c62d2ec in solr's branch refs/heads/main from Houston Putman
[ https://gitbox.apache.org/repos/asf?p=solr.git;h=26b8fa5 ]
SOLR-15089: Use synchronous logger in s3 tests.
Commit 919ad85ebe3130f82f793dd8b5fdaff808eb9e3e in solr's branch refs/heads/main from Houston Putman
[ https://gitbox.apache.org/repos/asf?p=solr.git;h=919ad85 ]
SOLR-15089: Solidify S3 tests when running with less resources.
Commit 3f6bb96294b467c4c73bcc4487d2f7b06c9d45d2 in lucene-solr's branch refs/heads/branch_8x from Houston Putman
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=3f6bb96 ]
SOLR-15089: Solidify S3 tests when running with less resources.
Remove non-jar dependency to fix maven artifact generation.
athrog it looks like we need to remove the javax.xml.bind:jaxb-api dependency, as javax dependencies aren't allowed in the project (at least in 8.x).
I see from your comment in build.gradle that this dependency isn't needed if we upgrade to the AWS SDK v2. I'm looking into doing that currently, and was wondering whether there was a reason you targeted the v1 SDK in the first place. If not, I'll continue on (might need to change some things, such as the proxy configuration); seeing as it hasn't been released yet, we should be fine.
Will open a separate ticket for that discussion. SOLR-15599
The ant and gradle targets pass now, with a single exception. The ant smoketest target fails due to a dependency on a javax api. This will be addressed in SOLR-15599.
houston Strictly speaking, that lib does not need to be present. If it's not there, the AWS SDK will complain "JAXB is unavailable. Will fallback to SDK implementation which may be less performant" but it will still function the same. I never tested what the performance difference is, but if you're looking for an easy solution for backporting, IMO removing that jar should be safe.
As to why v1 vs v2, it's just a relic of when we first built our plugin – v2 didn't exist yet. We've talked about migrating our internal version to v2 (to get some cool stuff like metrics) but it hasn't been a priority for us. When we last looked at it, pretty much all the bindings changed, so we didn't want to rewrite the entire plugin at the eleventh hour.
That being said, I am very pro-v2 and think it's a good idea to migrate now that the plugin is committed in 9.x. I'll follow that jira; please feel free to @ me on any v2 PR you write.
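If dropping the optional JAXB jar is the route taken for the backport, the exclusion could be expressed in the contrib's build file roughly like this (a sketch only — the artifact coordinates are illustrative, and the SDK will log its "JAXB is unavailable" fallback warning as Andy describes):

```groovy
// Sketch: exclude the optional JAXB dependency pulled in by the AWS SDK v1.
// The SDK falls back to its own (possibly slower) implementation.
dependencies {
    implementation('com.amazonaws:aws-java-sdk-s3') {
        exclude group: 'javax.xml.bind', module: 'jaxb-api'
    }
}
```

This keeps javax artifacts out of the distribution without changing any plugin code, deferring the real fix to the SDK v2 upgrade in SOLR-15599.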
I see sporadic test failures for S3IncrementalBackupTest. See http://fucit.org/solr-jenkins-reports/failure-report.html
This is exciting! Maybe we could start by using an S3 Mock?