CHAPTER 5
Have you ever heard of the learning triangle? It basically states that your level of mastery of any topic increases as you go through the following process: reading, seeing, hearing, watching, doing, and teaching. You are reading this right now, but to maximize the learning process, I encourage you to follow along in your own Solr installation.
In this chapter, we will create our own example using real-life data: a list of books in the Syncfusion Succinctly Series. It's not a large set of data, but it'll do just fine for me to demonstrate the steps required to index your own content.
I chose the Succinctly series because it is something that I identify with, and it is easy to understand. We will take the library and create an application to index the books, and allow people to browse them via tags or by text searching. Let’s get this party started!

We will start by indexing data for only three fields, and then over the course of the chapter, incrementally add a few more so we can perform queries with faceting, dates, multi-values, and other features that you would most likely need in your application. Let’s take a quick look at our sample data to see what it contains. As you can see, we have things like book title, description, and author. We will be using a CSV file; however, for display, I am currently showing you the data using Excel.
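To give you an idea, here are a couple of rows in CSV form. The column names and values shown are an illustrative sketch, not the exact contents of the download:

```csv
bookid,title,description,author,tags
1,jQuery Succinctly,Learn the jQuery essentials for building dynamic pages,Cody Lindley,"javascript,jquery"
2,HTTP Succinctly,A concise tour of the protocol that powers the web,Scott Allen,"http,web"
```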

Whenever you want to add fields to your index, you need to tell Solr the name, type, and a couple of other attributes so that it knows what to do with them. In layman’s terms, you define the structure of the data in the index.
You do this by using the schema.xml file. This file is usually the first one you configure when setting up a new installation. In it you declare your fields, field types, and attributes. You specify how to treat each field when documents are added to or queried from the index, whether fields are required or multi-valued, and whether they need to be stored or used for searching. Even though it is not required, you can also declare which field is the unique key for each document. One very important thing to remember is that it is not advisable to change the schema after documents have been added to the index, so try to make sure you have everything you need before adding any.
If you look at the schema.xml provided in your download, you'll see it includes the following sections:
The version number tells Solr how to treat some of the attributes in the schema. The current version is 1.5 as of Solr 4.10, and you should not change this version in your application.
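For example, the opening schema tag in the example schema that ships with Solr 4.10 looks like this (the name attribute may differ in your download):

```xml
<schema name="example" version="1.5">
```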
Logically, there are two kinds of types: simple and complex. A simple type is defined by a set of attributes that determine its behavior. First you have the name, which is required, and then a class that indicates where the type is implemented. An example of a simple type is string.
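Its definition, as it appears in the example schema that ships with Solr 4.x, looks like this:

```xml
<fieldType name="string" class="solr.StrField" sortMissingLast="true" />
```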
Complex types, besides storing data, include tokenizers and filters grouped into analyzers for additional processing. Let’s define what each one is used for:
Tokenizers are responsible for dividing the contents of a field into tokens. Wikipedia defines a token as: “a string of one or more characters that are significant as a group. The process of forming tokens from an input stream of characters is called tokenization.” A token can be a letter, one word, or multiple words all embedded within a single phrase. How those tokens emerge depends on the tokenizer we are currently using.
For example, the Standard Tokenizer splits the text field into tokens, treating whitespace and punctuation as delimiters. Delimiter characters are discarded, with a couple of exceptions. Another example is the Lower Case Tokenizer that tokenizes the input stream by delimiting at non-letters and then converting all letters to lowercase. Whitespace and non-letters are discarded. A third one is the Letter Tokenizer, which creates tokens from strings of contiguous letters, discarding all non-letter characters. And the list goes on and on.
A filter consumes input and produces a stream of tokens. It basically looks at each token in the stream sequentially and decides whether to pass it along, replace it, or discard it. It can also do more complex analysis by looking ahead and considering multiple tokens at once, even though this is not very common.
Filters are chained; therefore, the order affects the outcome significantly. In a typical scenario, general filters are used first, while specialized ones are left at the end of the chain.
Field analyzers are in charge of examining the text of fields and producing an output token stream. In simpler terms, an analyzer is a logical grouping of operations: a single tokenizer plus zero or more filters. It is possible to specify one analyzer to be used at query time and another at index time.
Let’s take a look at one example. In this case, we are going to use one of the most commonly used types, text_general. By using this field type to store text, you will be removing stop words and applying synonyms at query time, among other operations. Also, you can see that there are two analyzers: one for query time, and the other for index time.
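Here is the text_general definition from the example schema that ships with Solr 4.x, trimmed slightly for readability:

```xml
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Notice that the synonym filter appears only in the query-time analyzer, which is why synonyms are applied at query time for this type.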

In this section, you specify which fields will make up your index. For example, if you wanted to index and search over the books in Syncfusion’s Succinctly series or Pluralsight’s online trainings, then you could specify the following fields:
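A minimal sketch of what such field definitions look like; the field names here are illustrative, and the attributes are explained in the following paragraphs:

```xml
<field name="title" type="text_general" indexed="true" stored="true"/>
<field name="description" type="text_general" indexed="true" stored="true"/>
<field name="author" type="string" indexed="true" stored="true"/>
```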

A field definition has a name, a type, and multiple attributes that tell Solr how to manage each specific field. These are known as Static Fields.
Solr first looks for static definitions, and if none are found, it tries to find a match in dynamic fields. Dynamic fields are not covered in this book.
You might want to interpret some document fields in more than one way. For this purpose, Solr has a way of performing automatic field copying. To do this, you specify in the copyField tag the source field, the destination (dest), and optionally a maximum number of characters to copy (maxChars). Multiple source fields can easily be copied into a single destination field using this functionality.

Copy fields can also be specified using patterns; for example, source="*_i" will copy all fields that end in _i into a single destination field.
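A sketch of both forms, assuming a catchall destination field named text like the one in the example schema:

```xml
<!-- Copy individual fields into the catchall "text" field -->
<copyField source="title" dest="text"/>
<copyField source="author" dest="text" maxChars="30000"/>

<!-- Copy every field whose name ends in _i into the same destination -->
<copyField source="*_i" dest="text"/>
```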
In the Apache Solr documentation wiki, there is an incredibly useful table that tells you the required values of the attributes for each use case. I am reproducing the relevant part of it here, and will explain with an example. Please look for “Field Properties by Use Case” in the Solr wiki for the full table.

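An excerpt covering only the use cases discussed below, reconstructed from the Solr reference guide (a blank cell means the attribute does not matter for that use case):

| Use Case            | indexed | stored | multiValued |
|---------------------|---------|--------|-------------|
| Search within field | true    |        |             |
| Retrieve contents   |         | true   |             |
| Sort on field       | true    |        | false       |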
The way to use this table is to look for the specific scenario that you want for your field, and determine the attributes. Let’s say you want a field where you can search, sort, and retrieve contents.
This means there are three scenarios: Search within field, Retrieve contents, and Sort on field. Looking for the required attributes in the columns, you would need to set indexed="true", stored="true", and multiValued="false".
Now let's talk about how to avoid some of the mistakes that people make with the schema.xml.
It’s time to make Solr our own, with our own data. We will take our sample data, which can be found on GitHub in the following repository: https://github.com/xaviermorera/solr-succinctly.git.

The repository includes two main folders:
Understanding the documents that we will index in this demo is easy. In the real world, it can be trickier.
Up until now, we indexed some sample documents included in the Solr download. We will use this collection as a base to create our own, and will use a more appropriate name. It is worth mentioning that whenever the word “document” is used, it refers to a logical group of data. It is basically like saying a “record” or “row” in database language. I’ve been in meetings where non-search-savvy attendees only think of Word documents (or something similar) when we use this specific word. Don’t get confused.
Here are the steps to create our first index:
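Based on what follows, the first step is to stop Solr and make a copy of the existing collection1 folder named succinctlybooks. The paths below assume the C:\solr-succinctly layout used throughout this book:

```
cd C:\solr-succinctly\succinctly\solr
xcopy collection1 succinctlybooks /E /I
```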

Now go into the succinctlybooks folder and open core.properties. Here is where you specify the name of the core, which is also called a collection. It should look like this:
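Assuming we name our core succinctlybooks:

```properties
name=succinctlybooks
```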


If you forget to rename the collection name within core.properties and try to restart, you will get an error telling you that the collection already exists. The error displayed in the console will be similar to the following:
2972 [main] ERROR org.apache.solr.core.SolrCore – null:org.apache.solr.common.SolrException: Found multiple cores with the name [collection1], with instancedirs [C:\solr-succinctly\succinctly\solr\collection1\] and [C:\solr-succinctly\succinctly\solr\succinctlybooks\]
It is not a requirement to clear the index and comment out the existing fields; however, given that we have data in our index, we need to do it to avoid errors on fields we remove and types we change.
The following two steps will show you how to ensure we clean out the redundant data.
Step 1: Clear the index
The collection that we just copied came with the sample data we indexed recently. So where does Solr store the index data? Inside the current collection, in a folder called “index” within the data folder. If you ever forget, just open the Overview section in the Admin UI, where you can see the current working directory (CWD), instance location, data, and index.

In our case, it can be found here: C:\solr-succinctly\succinctly\solr\succinctlybooks\data\index. If you view the folder contents, this is what a Lucene index looks like:
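A typical Lucene 4.x index folder contains segment files similar to the following; the exact file names and counts vary with your data:

```
_0.fdt  _0.fdx  _0.fnm  _0.si
_0_Lucene41_0.doc  _0_Lucene41_0.pos
_0_Lucene41_0.tim  _0_Lucene41_0.tip
segments.gen  segments_2  write.lock
```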

The next step is to clear the index, as we will be modifying the fields to create our new one. First stop Solr by pressing Ctrl + C in the console window where you started it, then open Windows Explorer at your Lucene index folder, select all the files within it, and delete them.
When you restart Solr, your index will have 0 documents, giving us an empty index to start with.
It is necessary to point out what happens if you do not delete the index and then change the uniquekey’s type from string to int. Given that some of the keys in the original samples look like “MA147LL/A,” you will get the following error when you restart:

Soon, we will be changing our uniquekey’s name, but not its type. If you insist on using int as the type for bookid instead of string, you will run into the error I just showed you, even if you have a clean index. Figure 65 shows the error you will run into if you do not follow the instructions.

I’ll leave it to you to play around and figure out what the elevate.xml file is used for, which is one of the two potential culprits of this error:

Step 2: Comment out existing fields
There are two sections that I like to remove within schema.xml:
First, look for the definition of id and comment it out all the way down to store, as shown in the following example. Do it with an XML comment, which starts with <!-- and ends with -->.
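The commented-out block begins and ends like this; the many fields in between are elided here with an ellipsis:

```xml
<!--
<field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />
...
<field name="store" type="location" indexed="true" stored="true"/>
-->
```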

Now let’s look for the Solr Cell fields, and comment out from title all the way to links. There are a few more fields that you should comment out, which are content, manu_exact, and payloads. Notice I did not comment out text, as it is a catchall field implemented via copyFields. We will soon get to it.

Finally, look for copyFields and comment them out.

Leave dynamicFields and uniqueKey as they are; we will get to them soon.
Creating a search UI for Syncfusion’s Succinctly series could take a long time and potentially give you some headaches, or it can be done rather quickly if you have the proper resources. And if you have this book in your hands, you are in luck, as you have a proper resource. We will be working with the sample data we saw at the beginning of the chapter: book titles, descriptions, authors, and tags.

Open the schema.xml file for the "succinctlybooks" collection in Notepad++ or any other text editor. In case you forgot or skipped the previous exercises, it is located here: C:\solr-succinctly\succinctly\solr\succinctlybooks\conf.
It is time to define our static fields. The fields should be located in the same section as the sample data fields that we just commented out. Please look for the id field definition, and add them at the same level, starting with bookid.
The bookid field will be our unique key. We declare a field with this name and add the type, which in this case is string. If you want, it can also be an int; it does not really make a big difference. Given that it is a unique key, it needs to be indexed so that a specific document can be retrieved; it is required, and unique keys cannot be multiValued. Remember Field Properties by Use Case? Also, please be mindful of capitalization; for example, multiValued has an uppercase V.
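Putting those attributes together, the declaration looks like this:

```xml
<field name="bookid" type="string" indexed="true" stored="true" required="true" multiValued="false" />
```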
We also need to change the name of the unique key from id to bookid. Look for the uniqueKey tag and change it accordingly:
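The updated tag:

```xml
<uniqueKey>bookid</uniqueKey>
```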
And now we define the rest of the static fields. You should end up with some entries in the schema like this:
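A sketch of the resulting declarations follows; the multiValued setting on tags is an assumption on my part, since a book can carry several tags:

```xml
<field name="bookid" type="string" indexed="true" stored="true" required="true" multiValued="false" />
<field name="title" type="text_general" indexed="true" stored="true"/>
<field name="description" type="text_general" indexed="true" stored="true"/>
<field name="author" type="string" indexed="true" stored="true"/>
<field name="tags" type="string" indexed="true" stored="true" multiValued="true"/>
```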

You may have noticed by now that title and description are of type text_general, while author and tags are of type string. As you might have guessed, these are different data types in the Solr landscape.
String is defined as a simple type with no tokenization. That is, it stores a word or sentence as an exact string, as there are no analyzers involved. It is useful for exact matches, for example, in faceting.
On the other hand, the type definition of text_general is more complex, including query-time and index-time analyzers that perform tokenization and secondary processing like lowercasing. It is useful in all scenarios where we want to match part of a sentence. If you defined title as a string and then searched for jquery, you would not find jQuery Succinctly; you would need to query for the exact string. This is most definitely not what we want.

We will be creating facets for tags and authors, which means string is the correct type to use for these fields. Will we be able to find authors if we only type the first or last name? Let’s wait and see.
In this chapter, we started looking at the schema.xml file. We found out how important this file is to Solr, and we started editing it to define our own collection containing information about the Succinctly e-book series.
In the next chapter, we'll move on to the next stage in our game plan and cover the subject of indexing.