As the VP of Engineering for Search and Recommendations, I am responsible for making sure that we have a world-class e-commerce search platform that scales as needed and serves the most relevant results to end users. As we keep adding more complex features, scaling a platform as complex as search requires a very different skill set from scaling run-of-the-mill applications.
When I first started, the existing platform was unstable. It often went down and was buggy. There were constant discrepancies between the product list and detail pages in terms of stock and price. My goal for the first 3 months was to stabilise the platform and only then think about scaling it. I will try to detail our journey in a 3-part blog series: Stabilisation, Scale/Growth and Future.
The search system in our e-commerce platform is designed not only for “search”; it is also a cataloging system in itself. This means that for any snapshot of a SKU, including price, stock and the merchant selling it, search is the only system that has the full detail. This alone introduced a lot of complexity. Our architecture relies heavily on microservices, so to fulfil the catalog part of the requirements, search indexes data from 13 different microservices, and the list keeps growing. The architecture looks something like this:
To stabilise the platform in the initial phase, we first went after the low-hanging fruit.
- The first thing that we did was get the logging right by making sure that we capture all the necessary data.
- We had 2 nodes for the indexing cluster initially. We found that only one was getting used. We built a property/DB based numbering mechanism and brought parallelism into play while indexing data.
- We cleaned up the SOLR schema to remove unnecessary fields. This did wonders for the size of the index. Along with this, we identified the docs which weren’t needed and removed them permanently.
- Very long query strings degrade performance. A human user typically doesn’t enter more than 40 characters in a query string, yet a lot of queries were landing on the system with a huge number of characters, some with more than 10k. These typically came from bots, both malicious and non-malicious. We added checks to strip off query strings above a limit.
- Similar to the length of the query string, we also put a limit on the number of keywords that could be entered in the search box.
- Wildcards are a somewhat grey area. Not many users use them, but some power users do. It is extremely important to handle wildcards carefully, as they can easily break the system by causing performance bottlenecks. How they are actually used can easily be studied from the query history. Decide which wildcards are absolutely necessary and only allow those. A better option is to remove them completely. At the very least, multiple regexes inside a single query should not be allowed.
- We changed all /browse SOLR handler queries to /select queries (/browse was our custom handler). Even though this is a pretty simple change, it gives a lot of benefits.
- There were lots of *:* queries where we had to dish out lots of data. We converted all of these into id:* queries or <field>:* queries.
- We converted the numeric fields used for faceting into strings. Since SOLR is primarily geared towards handling strings, this also made a huge difference in average response time.
- Blank queries on SOLR typically cause performance degradation, since they get converted to *:* queries by default. It is always better to handle blank queries at the UI layer, or else to convert them directly into -id:* queries, which return no results.
- One more important thing is that SOLR is not good at deep pagination. Our system allowed unlimited results for any query, and since each page had 24 results, this led to a huge number of available pages. A normal user will typically never go beyond 8–10 pages at the maximum. We found that the deep pagination was caused by bots and crawlers that were doing competitive crawling or were used for SEO. SOLR GC shoots up drastically on deeper pages, so it is very important to keep the number of browsable pages low.
- We also upgraded SOLR from 4.3 to SolrCloud-based SOLR 6.6. This gave us the added advantage of HA for indexing. On the earlier master-slave model, we had lost some data whenever the master had an outage. The upgrade gave a solid boost to our HA and also brought load balancing to the querying infrastructure.
- Finally, all the necessary socket timeouts to SOLR and the internal systems were set/reset properly, and switches were implemented. Whenever there was a spike in traffic and we found ourselves in trouble, we could play around with the timeouts/switches and just show nice system problem pages to the user, instead of letting the entire SOLR cluster go down.
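To make the query-side guards above more concrete, here is a minimal sketch in Python of how the length limit, keyword limit, wildcard removal, blank-query handling and pagination cap could fit together. It is an illustration, not our production code: the function names and the limits `MAX_QUERY_CHARS`, `MAX_KEYWORDS` and `MAX_PAGE` are assumptions made up for this example; only the page size of 24 and the `-id:*` trick come from the points above.

```python
import re
from typing import Dict, Optional, Union

# Illustrative limits (assumptions for this sketch, not our production values).
MAX_QUERY_CHARS = 100   # humans rarely type more than ~40 characters
MAX_KEYWORDS = 10       # cap on the number of keywords per query
MAX_PAGE = 10           # normal users rarely go beyond 8-10 pages
PAGE_SIZE = 24          # results per page, as on our listing pages


def sanitize_query(raw: str) -> Optional[str]:
    """Return a cleaned query string, or None if nothing usable is left."""
    q = raw.strip()[:MAX_QUERY_CHARS]      # strip off anything above the limit
    q = re.sub(r"[*?]", "", q)             # remove SOLR wildcards completely
    words = q.split()[:MAX_KEYWORDS]       # cap the number of keywords
    return " ".join(words) if words else None


def build_params(raw: str, page: int) -> Dict[str, Union[str, int]]:
    """Build the q/start/rows parameters for a /select request."""
    q = sanitize_query(raw)
    if q is None:
        # Blank query: ask for "no documents" instead of falling back to *:*.
        return {"q": "-id:*", "start": 0, "rows": PAGE_SIZE}
    page = min(max(page, 1), MAX_PAGE)     # clamp deep pagination
    return {"q": q, "start": (page - 1) * PAGE_SIZE, "rows": PAGE_SIZE}
```

With these assumed limits, a crawler asking for page 500 of "shoes" is served page 10 instead, and a query made only of wildcards is treated like a blank query. Whether to truncate an over-long query (as here) or reject it outright is a design choice; the sketch simply picks truncation.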
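The timeouts and switches from the last point can be sketched along the following lines. Note that this shows an automatic variant (a simple circuit breaker) rather than the exact switches we used; the class `SearchSwitch`, the helper `solr_fetch`, and all thresholds are hypothetical names and values chosen for the example.

```python
import json
import time
import urllib.request
from typing import Any, Callable


def solr_fetch(url: str, timeout_seconds: float = 2.0) -> Any:
    """Call SOLR with an explicit socket timeout (hypothetical helper)."""
    with urllib.request.urlopen(url, timeout=timeout_seconds) as resp:
        return json.load(resp)


class SearchSwitch:
    """Stop calling SOLR for a cool-off period after repeated failures."""

    def __init__(self, max_failures: int = 5, cooloff_seconds: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._max_failures = max_failures
        self._cooloff = cooloff_seconds
        self._clock = clock
        self._failures = 0
        self._open_until = 0.0

    def call(self, fetch: Callable[[], Any], fallback: Any) -> Any:
        now = self._clock()
        if now < self._open_until:
            # Switch is off: serve the friendly error page, skip SOLR.
            return fallback
        try:
            result = fetch()
        except Exception:
            self._failures += 1
            if self._failures >= self._max_failures:
                # Too many consecutive failures: flip the switch.
                self._open_until = now + self._cooloff
                self._failures = 0
            return fallback
        self._failures = 0
        return result
```

A caller would wrap each search request, for example `switch.call(lambda: solr_fetch(url), fallback=error_page)`, where `url` and `error_page` are placeholders. The point is the same as above: during a spike, users get a clean "system problem" page quickly, and the SOLR cluster gets room to recover instead of being pushed over.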
Using all of these steps, we were able to bring much-needed stability to the system. Of course there were other tweaks and adjustments, but I think I have covered the salient points in bringing stability to the search system. In the next post I will try to go into detail on scaling SOLR.
Feel free to post questions/comments and I will try to answer them.