How do I prepare a dataset for pattern recognition search?
Traditional search technology only works with lists of individual words. It will match each word, no matter how long or short the list is. Indx takes a different approach. It looks at the entire searchable pattern, whether it is one word or many words. To make this work well, the search data may need to be optimized. The following factors affect how well a pattern matches:
- First letters or word in the Document text
- The length of the searchable pattern
- The amount of repeated text patterns in the index
- Unique patterns or characters such as & % or $
- Words not separated by a space
Pre-processing recommendations
Prepare the text patterns to be indexed with as similar length as possible
Let’s imagine an index with these two products:
Nike Air Zoom Tempo NEXT% Men’s Running Shoes Road |
Adidas Running Shoe |
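When someone searches for "Running shoes", the results that match the length of the phrase most closely will appear higher in the list because they are the most similar to what was typed. Here, the short Adidas pattern is closer in length to the query than the long Nike pattern, even though both contain the phrase.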
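The effect of pattern length can be illustrated with a small sketch. It uses Python's `difflib.SequenceMatcher` as a stand-in similarity measure. This is an assumption made for illustration only and is not the algorithm Indx uses; the `similarity` helper is hypothetical.

```python
from difflib import SequenceMatcher


def similarity(query: str, pattern: str) -> float:
    """Stand-in similarity score between 0 and 1 (not Indx's real ranking)."""
    return SequenceMatcher(None, query.lower(), pattern.lower()).ratio()


products = [
    "Nike Air Zoom Tempo NEXT% Men's Running Shoes Road",
    "Adidas Running Shoe",
]

query = "Running shoes"

# Sort so that the pattern most similar to the query comes first.
ranked = sorted(products, key=lambda p: similarity(query, p), reverse=True)

for product in ranked:
    print(f"{similarity(query, product):.2f}  {product}")
```

With this stand-in measure, the extra words in the long pattern dilute its score, so the shorter pattern is listed first.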
Add a space around binding characters such as “&” or “-”
Let’s imagine a product that is indexed with a pattern like this:
Baguette chicken&herbs |
Suppose a user searches for “Baguette with chicken and herbs”. The query contains general words that can also match other products, while the indexed “..n&h..” sequence is a unique pattern that the query does not contain. As a result, the matcher might rank other products above the one the user is looking for. Indexing the pattern as “chicken & herbs” helps avoid this.
In this example you can also consider using the StringReplacer to replace the & entirely with a space.
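As a rough sketch, both options can be applied as a pre-processing step before the text is indexed. The code below is standalone Python using the standard `re` module. It does not use or reproduce the StringReplacer API, and the function names are hypothetical.

```python
import re

# Matches "&" or "-" together with any whitespace already around it.
_BINDING = re.compile(r"\s*([&-])\s*")


def space_binding_chars(text: str) -> str:
    """Put exactly one space on each side of a binding character."""
    return _BINDING.sub(r" \1 ", text).strip()


def drop_binding_chars(text: str) -> str:
    """Replace binding characters with a single space."""
    return " ".join(_BINDING.sub(" ", text).split())


print(space_binding_chars("Baguette chicken&herbs"))
print(drop_binding_chars("Baguette chicken&herbs"))
```

Note that the first function also splits hyphenated words such as "cross-country", so the list of binding characters is worth reviewing against your own data.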
Consider normalizing national characters such as “ø” to “oe”
If you are working with languages or data that have a lot of unique characters, and your users typically search with standard ASCII characters, you can use the StringReplacer class to change these. A good example of this is wine products.
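A minimal sketch of this kind of normalization in plain Python is shown below. It is not the StringReplacer class. Only the “ø” to “oe” mapping comes from the text above; the other mappings and the accent stripping are assumptions that should be adjusted to your own data and language.

```python
import unicodedata

# Explicit replacements for letters that have no plain ASCII base letter.
# Only "ø" -> "oe" comes from the recommendation above; the rest are assumed.
REPLACEMENTS = str.maketrans({
    "ø": "oe", "Ø": "Oe",
    "æ": "ae", "Æ": "Ae",
    "å": "aa", "Å": "Aa",
})


def to_ascii_pattern(text: str) -> str:
    """Normalize national characters before the text is indexed."""
    text = text.translate(REPLACEMENTS)
    # Split accented letters into base letter + combining mark ...
    decomposed = unicodedata.normalize("NFKD", text)
    # ... and drop the combining marks.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(to_ascii_pattern("Rødvin"))
print(to_ascii_pattern("Château Léoville"))
```

If users search with both spellings, indexing the original and the normalized pattern under the same key is another option to consider.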
Best practices for hierarchies and categories
When working with categorized data, the indexed pattern will be identified and judged based on the built-in Relevancy Ranking of Indx Search. This means that repeated patterns (like a shop category named "winter sports") will be lower in relevancy, but less frequent patterns such as "fischer" and "cross-country skis" will be higher in relevancy in the indexing.
If you want to create a search that can match across categories and product names, here are some suggestions:
Avoid categories that contain general keywords that cross paths, such as a single category named “running, swimming, and cycling”.
For example, if a user searches for “running gear” but your category also contains swimming accessories, the query would return results that are not relevant to the user.
👉 Instead, spread the products across more unique and descriptive categories that will get relevant matches in the pattern search. In this example, that means setting up Running, Swimming, and Cycling as three different categories.
These patterns will help the matcher to get relevant matches to a more explorative search than simply finding the exact product you are looking for.
A pair of skis could be structured like this:
Category index
Sports |
Sports → Winter |
Sports → Winter → Skis |
Sports → Winter → Skis → Cross-country |
Sports → Winter → Skis → Cross-country → Skating |
Product index
Sports → Winter → Skis → Cross-country → Skating → Fischer Carbonlite RCS Cold |
In this example your customers could further explore your product range, for example by searching for “skating skis for cold conditions”.
When searching categorized data, it is often useful for the user to go directly to a category page rather than to a product.
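The category and product patterns in this example can be generated from a category path. The sketch below assumes the path is available as a list of names; the helper functions are hypothetical and only build the text patterns, not the index itself.

```python
SEPARATOR = " → "


def category_patterns(path: list[str]) -> list[str]:
    """One pattern per level of the hierarchy, from the root downwards."""
    return [SEPARATOR.join(path[:depth]) for depth in range(1, len(path) + 1)]


def product_pattern(path: list[str], product_name: str) -> str:
    """Full category path followed by the product name."""
    return SEPARATOR.join([*path, product_name])


path = ["Sports", "Winter", "Skis", "Cross-country", "Skating"]

for pattern in category_patterns(path):
    print(pattern)

print(product_pattern(path, "Fischer Carbonlite RCS Cold"))
```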
What data should be indexed, and what data should I fetch from another database?
Often we find that regular search engines push us to include a lot of data in their index, which can lead to multiple copies of our datasets and more work to maintain them.
Indx uses RAM instead of a database for ultra-fast performance. To keep memory usage low, only store the necessary search patterns in the index. Include text like names, titles, and categories, and skip non-searchable data like pricing or URLs. To retrieve non-searchable info, use the foreign key in the Document class to fetch it from another database.
This way, you can keep the search fast and responsive, while keeping maintenance tasks manageable.
Let’s look at an example of a pair of Nike running shoes in a store.
Indexed text | Optional* | Database |
Nike | ||
Air Zoom Tempo NEXT% Men’s | ||
Running → Shoes → Road | ||
Description text | ||
In stock? | ||
Shoe size | ||
Url | ||
Picture urls | ||
Price | ||
Sale price | ||
GUID or codes |
* Whether to index fields such as description or body text depends on the situation. We do not recommend pattern searching through large amounts of unimportant text, as it can produce results that don't seem relevant to the user.
Booleans such as availability can be added as key filters.
With this pattern a user could get relevant results for many search variants:
- Nike running shoes
- Zoomshoe
- Nike Air Next Percent (would hit even with an “irrelevant” word)
- Running shoe (would list more from the category)
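The split between indexed text and database fields described in this section can be sketched as follows. Everything here is a hypothetical stand-in: `IndexEntry` is not the Indx Document class, `toy_search` is a naive placeholder for the real matcher, and the database values are invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexEntry:
    text: str  # searchable pattern kept in memory
    key: int   # foreign key into the external database


# Searchable text only: brand, product name, and category path.
INDEX = [
    IndexEntry("Nike Air Zoom Tempo NEXT% Men's Running → Shoes → Road", key=1),
]

# Stand-in for an external database; all values are invented.
DATABASE = {
    1: {"url": "/example/product/1", "price": 100.0, "in_stock": True},
}


def toy_search(query: str) -> list[int]:
    """Naive placeholder matcher: every query word must occur in the text."""
    words = query.lower().split()
    return [e.key for e in INDEX if all(w in e.text.lower() for w in words)]


def search_with_details(query: str) -> list[dict]:
    """Search the in-memory patterns, then fetch the rest by foreign key."""
    return [DATABASE[key] for key in toy_search(query)]


print(search_with_details("nike running shoes"))
```

The index stays small because it only holds the text patterns and keys, while prices, URLs, and images are looked up after the search.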
How does personalization work in Indx?
Personalized search based on user-specific data can usually be ready shortly after a user logs in, because Indx indexes very quickly.
Imagine that a customer logs in and the system knows, among other things, that this person often buys 🇮🇹 Italian food, that their favorite soda is 🖤 Pepsi Max, and which vegetables they buy most often 🥦🍅🫑🥬🥑. They also have their own recipes in the system, which should be searchable both as whole recipes and by single ingredient.
In the background we are still running the index for the entire selection, but when the user logs in we spin up an additional index on the fly. It holds the products that should be given a higher relevancy, along with the user's recipes and personal information. This added index should be up in less than a second as long as it has fewer than a thousand entries.
In this scenario our user can for example search for the following:
- 🍝 Spaghetti (and get their own recipes in the search)
- 🥤 Soda or Pe (and get Pepsi Max at the top)
- 🥫 Tomato (and get their favorite kind on the top)
- 🥦 Veg (and get their common purchases on top)
This method can work without overloading your server with hundreds of users logged in at the same time, and if you have a massive user base (Hello 👋🏻) the memory and CPU usage will scale linearly. Technically, this method can be done with a keyfilter or by running multiple indexes. See “When to choose keyfilter or multiple indexes?” below for guidance on which approach to choose.
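The multiple-index variant of this idea can be sketched in a few lines. The code is a simplified illustration under stated assumptions: the indexes are plain Python lists, `matches` is a naive prefix matcher rather than Indx's pattern matching, and all names are hypothetical.

```python
def matches(query: str, text: str) -> bool:
    """Naive placeholder: the query is a prefix of any word in the text."""
    q = query.lower()
    return any(word.startswith(q) for word in text.lower().split())


def search(index: list[str], query: str) -> list[str]:
    return [text for text in index if matches(query, text)]


def personalized_search(global_index: list[str],
                        user_index: list[str],
                        query: str) -> list[str]:
    """Hits from the small per-user index come first, then the rest."""
    personal = search(user_index, query)
    rest = [t for t in search(global_index, query) if t not in personal]
    return personal + rest


# The global index is always running; the user index is built at login.
global_index = ["Pepper", "Pepsi Max", "Pesto", "Coca-Cola"]
user_index = ["Pepsi Max"]

print(personalized_search(global_index, user_index, "Pe"))
```

Because the per-user index is small and built only at login, it can be discarded again when the session ends.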
How do I work with aliases?
Aliases to a Document can be added simply by using the same key for multiple Documents.
Here’s an example:
Product name | Key |
Hiking shoes | 0 |
Hiking boots | 0 |
Fishing pole | 1 |
Fishing rod | 1 |
When searching, only one of the results with the same key (the one with the highest score) will be shown to the user.
Refer to the API docs to see how to do this in practice.
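To illustrate the behavior only, the sketch below collapses a list of scored hits so that each key appears once. It is not the Indx API: the `Hit` class, the function name, and the scores are invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    text: str
    key: int
    score: float


def collapse_aliases(hits: list[Hit]) -> list[Hit]:
    """Keep only the highest-scoring hit for each key."""
    best: dict[int, Hit] = {}
    for hit in hits:
        current = best.get(hit.key)
        if current is None or hit.score > current.score:
            best[hit.key] = hit
    # Present the surviving hits with the best score first.
    return sorted(best.values(), key=lambda h: h.score, reverse=True)


# Invented scores for a query such as "hiking shoe".
hits = [
    Hit("Hiking shoes", key=0, score=0.9),
    Hit("Hiking boots", key=0, score=0.6),
    Hit("Fishing rod", key=1, score=0.3),
]

shown = collapse_aliases(hits)
print([hit.text for hit in shown])
```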
What are the use cases for exclusive filtering?
Indx is currently the only search engine that allows for exclusive filtering, meaning that the user can choose to include all properties except the chosen ones, rather than only selecting the properties to include.
Exclusive filtering can be a good option when there are many options to filter through. For example, if a user is searching for an apartment in London, they might be presented with 33 areas to filter through. Using exclusive filtering, they can quickly narrow down their search by selecting the areas they don't want to see results from. This can be much more efficient than having to select all the areas they do want to see results from.
We encourage thinking about this when structuring your data.
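The difference between the two filter styles can be shown with a small standalone sketch. The listings and function names are made up for the example, and the code does not use the Indx filter API.

```python
# Invented example data: apartment listings tagged with an area.
listings = [
    {"title": "Two-bed flat", "area": "Camden"},
    {"title": "Studio", "area": "Hackney"},
    {"title": "Loft", "area": "Croydon"},
    {"title": "Garden flat", "area": "Camden"},
]


def include_filter(items: list[dict], areas: set[str]) -> list[dict]:
    """Inclusive filtering: keep only the selected areas."""
    return [item for item in items if item["area"] in areas]


def exclude_filter(items: list[dict], excluded: set[str]) -> list[dict]:
    """Exclusive filtering: keep everything except the selected areas."""
    return [item for item in items if item["area"] not in excluded]


# One exclusion replaces having to tick every other area.
print([item["title"] for item in exclude_filter(listings, {"Croydon"})])
```

With 33 areas and only a few unwanted ones, the exclusion set stays short while the equivalent inclusion set would list almost every area.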
When to choose keyfilter or multiple indexes?
When you search, it's helpful to narrow down the search as much as possible. This is called the 'context of the search'. With a faceted search, you have two options. You can either set up a separate instance for each category, or set up one instance that includes all categories with a filter for each category.
If you have multiple indexes, you can combine them for a single search.
Whether to use filters or multiple indexes depends mainly on how much data you have. For a small amount of data, it won't make much difference, and most people use filters.
However, if you have a lot of data, you may benefit from dividing it into multiple indexes:
- It doesn't take up much extra memory.
- Indexing is faster.
- Searches are faster when you're only looking in a portion of the data.
- The search results are more relevant.
- You can save memory if you temporarily shut down an index.
It's hard to say exactly how much data is too much, as it depends on the data itself. Indx has tools to analyse the data and is available to help customers find the best solution.
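The two setups described in this section can be compared in a simplified sketch. It assumes plain Python lists and dictionaries in place of real indexes and a naive substring matcher in place of pattern matching; none of the names below are part of the Indx API.

```python
from collections import defaultdict

# Invented example data: (searchable text, category).
documents = [
    ("Nike Air Zoom", "shoes"),
    ("Fischer Carbonlite", "skis"),
    ("Adidas Running Shoe", "shoes"),
]


def matches(query: str, text: str) -> bool:
    """Naive placeholder matcher."""
    return query.lower() in text.lower()


# Option 1: a single index, narrowed with a filter on the category.
def search_filtered(index, query, category=None):
    return [text for text, cat in index
            if (category is None or cat == category) and matches(query, text)]


# Option 2: one index per category, combined when needed.
def split_by_category(index):
    indexes = defaultdict(list)
    for text, cat in index:
        indexes[cat].append(text)
    return dict(indexes)


def search_indexes(indexes, query, categories=None):
    selected = categories if categories is not None else list(indexes)
    results = []
    for cat in selected:
        # Only the selected indexes are scanned at all.
        results.extend(t for t in indexes.get(cat, []) if matches(query, t))
    return results


indexes = split_by_category(documents)
print(search_filtered(documents, "a", category="shoes"))
print(search_indexes(indexes, "a", categories=["shoes"]))
```

Both calls return the same hits here. The practical difference is that the second setup never touches the data outside the selected indexes, which is where the speed and memory benefits listed above come from.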