Augment Solr’s schema-guessing (aka “schemaless”) mode with a new interactive Schema Designer feature in the Admin UI to improve the getting started experience.
The goal of Solr’s current schema guessing mode was to reduce friction when first getting started. However, the current solution suffers from two main problems:
- Most decisions are based on the first doc seen by the schema guessing logic, which leads to poor decisions around text fields, single vs. multi-valued, and numeric types. Modern data is complicated and any opaque schema guessing tool that only looks at a single doc is too limited. For an in-depth analysis of the issues surrounding this feature, see: https://issues.apache.org/jira/browse/SOLR-14701.
- It is difficult to iterate on and refine the schema. If the schema guesser makes an incorrect decision, Solr puts the onus on the user to troubleshoot, which typically requires looking at logs, addressing issues with the guessed schema (via cumbersome API calls and knowledge of fields / field types / dynamic fields, etc.), and deleting and re-indexing the documents. Instead of a friendly getting started experience, the user now has to climb a steep learning curve of looking at logs, deleting documents, using the Schema and/or ConfigSets API correctly, and re-indexing. Operations like changing a single-valued field to multi-valued (or vice-versa) with docValues enabled require deleting the entire Lucene index and rebuilding it.
Put frankly, the current “getting started” experience misses the mark on ease of use. The community largely agrees on this and seeks a better solution. Problem #1 can be addressed using a sampling approach where the schema guessing logic looks at multiple docs instead of a single one before making decisions.
Problem #2 requires a solution that allows users to quickly iterate on the schema design and immediately see the results of a change. No API-only solution is sufficient for solving this issue. Users need a GUI to assist them in tuning the schema interactively without having to mess with XML or the Schema or ConfigSet APIs directly.
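To make the sampling idea concrete, here is a minimal sketch of guessing field types from several docs rather than one. It is not Solr's guessing logic: the function names, the type labels, and the widening rules are assumptions chosen only for illustration.

```python
from typing import Any, Dict, Iterable


def guess_type(value: Any) -> str:
    """Map one JSON value to a coarse, hypothetical type label."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    return "string"


def widen(a: str, b: str) -> str:
    """Pick a type that can hold values of both a and b (assumed rules)."""
    if a == b:
        return a
    if {a, b} == {"long", "double"}:
        return "double"
    return "string"  # fall back to the most permissive type


def guess_schema(docs: Iterable[Dict[str, Any]]) -> Dict[str, Dict[str, Any]]:
    """Infer a type and multiValued flag per field from all sample docs."""
    schema: Dict[str, Dict[str, Any]] = {}
    for doc in docs:
        for name, value in doc.items():
            field = schema.setdefault(name, {"type": None, "multiValued": False})
            if isinstance(value, list):
                field["multiValued"] = True
                values = value
            else:
                values = [value]
            for v in values:
                if v is None:
                    continue  # a missing value tells us nothing about the type
                t = guess_type(v)
                field["type"] = t if field["type"] is None else widen(field["type"], t)
    for field in schema.values():
        if field["type"] is None:
            field["type"] = "string"  # no evidence seen: assume string
    return schema


sample = [
    {"id": "1", "price": 10, "tags": "a"},
    {"id": "2", "price": 9.99, "tags": ["a", "b"]},
]
schema = guess_schema(sample)
```

With only the first doc, price would look like an integer and tags like a single-valued field. Looking at both docs widens price to double and marks tags as multi-valued.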
We can assume that users will be able to start Solr locally and launch the Admin UI. I don’t think we can throw them directly into defining a collection (config set, shards, replicas, etc). But we can safely assume they have some data they want to search. Thus, a GUI driven approach based around the user’s sample data is a natural first step for improving the getting started experience (see attached schema-designer-1.png).
Moreover, Solr schema design involves a number of non-trivial concepts that may be unfamiliar to new users, e.g. dynamic fields, doc values, copy fields, indexed vs. stored, term vectors, and so on. A GUI-based approach can guide the user in the nuances of Solr schemas. Context-sensitive help can link to the Reference Guide.
The best way to do that is to show users how their data will get indexed (visually) and let them tweak the results interactively. For instance, if the user unchecks indexed for a field, they will see that they cannot sort by that field in the Query Tester. The Query Tester will be schema driven with type-ahead drop-down fields populated from the current schema. If users change the stop words file, they can see the result take effect immediately in the UI.
Screenshots from a prototype schema designer UI are attached to this Jira (schema-designer-1.png). The prototype repurposes several existing views into a more seamless, interactive workflow vs. a number of different screens which require the user to stitch together a cohesive experience.
The basic workflow for the Schema Designer is:
- Launch Solr and open Admin UI, click on Schema Designer. At this point, there are no cores or collections but the _default config set is loaded into ZK. The end user does not care about collections or cores or config sets at this point. Rather, their main goal is to get some data indexed correctly so they can start playing around with Solr queries, i.e. the fun stuff.
- User either selects an existing schema (via type-ahead drop-down) or enters the name of a new schema, e.g. “books”. If new, then the _default configset is used as the starting point. (see attached schema-designer-2.png)
- Next, the user either uploads sample docs or pastes text into the sample docs text area.
- User pushes the Analyze Documents button, which populates the Schema Editor tree in the middle with the results of the “guessing”. This is where we can apply as much intelligence as possible to aid the user in getting started.
- The Schema Editor is a tree with nodes for Fields, Field Types, and Files, with the Fields tab being their main focus. User tweaks the schema settings for each field as needed. They can also add new fields & field types.
- When saving changes, the updates are stored in a temp configset in ZK. This way, the user won’t lose any changes if their connection drops or they leave and come back a few days later. Live config sets will not be affected until the user Publishes their changes.
- Users can switch types (string -> text), single/multi-valued, enable doc values, vectors, etc. directly in the Schema Editor.
- As the user refines their schema, they can use the Query Tester form in the lower left to see how their schema changes impact document matching results.
- As the user changes their schema, the query is re-executed against the updates. Behind the scenes, the Schema Designer may need to delete and re-index all sample documents, but this is transparent to the user.
- Once satisfied with the schema, the user can apply the changes to Solr directly via Publish (save as a ConfigSet in ZK) or download the ConfigSet to a zip file. (see schema-designer-3.png) The user can choose to index the sample docs after applying the updates by specifying a target collection. If the collection doesn’t exist, the Schema Designer creates it on-the-fly using the saved ConfigSet. Our goal is ease of use, so we don’t want to make the user go elsewhere to create a collection; we just do it inline if that’s what they want. The Publish dialog also offers a Schema Diff button (see schema-designer-7.png) to show changes made to the schema by the designer. The diff is computed against the published version of the schema (if it exists) or against the original schema that the designer copied from (e.g. _default).
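As a rough illustration of what the Schema Diff step could compute, the sketch below compares two sets of field definitions and reports added, removed, and changed fields. The dict-based representation and the function name diff_fields are assumptions for this example, not the designer's actual data model.

```python
from typing import Any, Dict

Fields = Dict[str, Dict[str, Any]]


def diff_fields(old: Fields, new: Fields) -> Dict[str, Dict[str, Any]]:
    """Compare field definitions from a base schema and an edited schema."""
    added = {name: new[name] for name in new if name not in old}
    removed = {name: old[name] for name in old if name not in new}
    changed: Dict[str, Dict[str, Any]] = {}
    for name in old.keys() & new.keys():
        props = {}
        for key in sorted(old[name].keys() | new[name].keys()):
            before, after = old[name].get(key), new[name].get(key)
            if before != after:
                props[key] = (before, after)  # (base value, edited value)
        if props:
            changed[name] = props
    return {"added": added, "removed": removed, "changed": changed}


# Base schema (e.g. the published version) vs. the designer's working copy.
published = {
    "id": {"type": "string", "multiValued": False},
    "title": {"type": "string", "multiValued": False},
}
edited = {
    "id": {"type": "string", "multiValued": False},
    "title": {"type": "text_general", "multiValued": False},
    "tags": {"type": "string", "multiValued": True},
}
result = diff_fields(published, edited)
```

Here result reports tags as added and title as changed from string to text_general, while id does not appear because it is unchanged.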
During the analysis step, the designer backend creates a temporary config set in ZK named designer<schema>, where <schema> is provided by the user, such as “books” in the example wireframe. This allows the designer backend to persist changes to the schema automatically as the user refines the schema. We use a temporary configset in ZK so that live configsets and collections are not affected during the refinement process. The ZK version of the schema is used to enforce MVCC to ensure that two users cannot step on each other’s changes concurrently, although it’s envisioned that the typical use case is for one user to refine the schema at a time.
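The version check described above can be sketched as optimistic concurrency control. The class below is an in-memory stand-in, not ZooKeeper or the designer backend; VersionedStore, VersionConflict, and the method names are hypothetical and exist only to show the compare-version-then-write pattern.

```python
import threading
from typing import Tuple


class VersionConflict(Exception):
    """Raised when a write is based on a stale version of the schema."""


class VersionedStore:
    """In-memory stand-in for a versioned schema node (hypothetical)."""

    def __init__(self, data: str) -> None:
        self._data = data
        self._version = 0
        self._lock = threading.Lock()

    def read(self) -> Tuple[str, int]:
        with self._lock:
            return self._data, self._version

    def write(self, data: str, expected_version: int) -> int:
        """Store data only if the caller saw the latest version."""
        with self._lock:
            if expected_version != self._version:
                raise VersionConflict(
                    f"expected version {expected_version}, current is {self._version}"
                )
            self._data = data
            self._version += 1
            return self._version


store = VersionedStore("<schema v0/>")

# Two users read the same version of the schema.
_, version_a = store.read()
_, version_b = store.read()

# User A saves first and succeeds.
new_version = store.write("<schema edited by A/>", version_a)

# User B's save is rejected because it was based on a stale version.
try:
    store.write("<schema edited by B/>", version_b)
    conflict = False
except VersionConflict:
    conflict = True
```

In this sketch the second writer has to re-read the schema and reapply their change, which is the behavior that keeps two users from silently overwriting each other.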
Additionally, during the refinement process, the schema designer creates a temporary collection named designer<schema>. The temp collection allows the designer backend to quickly index the sample docs to support the Query Tester feature. It also serves as a real-time tester of the schema changes before the changes are applied to live collections.
The sample docs provided by the user are stored in the Solr blob store so they don’t have to be re-parsed on every change to the schema / query request.