Several things have changed in SharePoint 2010 search. Search is now componentized, so you only provision what you need. This blog series will go through what these components are, how they work together, and how to provision them. This first post focuses on scaling crawl; the next will cover query, and so on…
Special shout-out goes to Jon Waite for his valuable technical input\review…
Basics
Search is no longer tied to a Shared Services Provider (SSP). Search consists of a Search Service Application, a Search Service Application proxy, web services, a search service instance, and the following SQL databases by default:
- Search Service Application DB
- Search Service Application Crawl Store DB
- Search Service Application Property Store DB
Search takes advantage of the shared services framework built on SharePoint Foundation. For more details on shared services, see my previous blog. It’s necessary to define the components that make up search, because each component plays an important role. This blog will focus on the Crawl Component and the SQL Crawl Database.
It’s possible to provision a search service application\proxy in three ways:
- Within Central Administrator\Manage Service Applications page
- Within Central Administrator\Farm Configuration Wizard
- Using PowerShell. See this blog for more details
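For the PowerShell route, a minimal sketch might look like the following. The application pool, database, and service application names are placeholders, and a fully scripted deployment still needs crawl and query topology steps beyond this:

```powershell
# Load the SharePoint snap-in if not running in the SharePoint 2010 Management Shell
Add-PSSnapin Microsoft.SharePoint.PowerShell -ErrorAction SilentlyContinue

# Start the search service instance on the local server
$instance = Get-SPEnterpriseSearchServiceInstance -Local
Start-SPEnterpriseSearchServiceInstance -Identity $instance

# Create the Search Service Application (the application pool is assumed to already exist;
# create one with New-SPServiceApplicationPool if needed)
$ssa = New-SPEnterpriseSearchServiceApplication -Name "Search Service Application" `
        -ApplicationPool "SearchServiceAppPool" -DatabaseName "Search_Service_DB"

# Point the search administration component at this server and create the proxy
Set-SPEnterpriseSearchAdministrationComponent -SearchApplication $ssa -SearchServiceInstance $instance
New-SPEnterpriseSearchServiceApplicationProxy -Name "Search Service Application Proxy" -SearchApplication $ssa
```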
Crawl Component and Crawl Database
I’ll use the terms indexer and crawler component interchangeably throughout the blog. An indexer is simply a physical server that hosts one or more crawl components. The huge change in 2010 is that the indexer no longer stores a copy of the index. As items are indexed, they are streamed\propagated to a Query server. Because the indexer no longer holds a physical copy of the index, it’s no longer a single point of failure. By default, when you provision a search service application using Central Administrator or the Farm Configuration Wizard, a crawler component and crawl database are provisioned for you. For lack of better words, a crawl component’s job is to crawl\index content. The crawl component runs within the MSSearch.exe process, and MSSearch.exe runs as the Windows service “SharePoint Server Search 14”.
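If you want to see this on an index server, a quick check (assuming the default SharePoint 2010 service name) looks like:

```powershell
# The crawl component runs inside the MSSearch.exe process...
Get-Process mssearch -ErrorAction SilentlyContinue

# ...which is hosted by the "SharePoint Server Search 14" Windows service (service name OSearch14)
Get-Service OSearch14
```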
It’s possible to provision multiple crawl databases and crawler components for a single Search service application. There are several reasons for doing this, some of which are explained throughout this post. When a crawler component is provisioned, it requires a mapping to a SQL crawl database. Both can be created using either Central Administrator or PowerShell. To simplify things a bit, I’ll cover how to do it in Central Administrator. To make changes to the Search topology, access the Search Administration page via the following:
Central Administrator\Application Management\Manage Service Applications\Select Search Service Application and select Manage from Ribbon
Scroll to the bottom of the page and this is where you can view\change the search topology.
To provision a new crawl database within Search Admin Page:
- Select Modify
- New Crawl Database
- Populate the new crawl database page and select OK
- Select Apply Topology Changes
To provision a new crawl component:
- Select Modify
- New Crawl Component
- Populate the New Crawl Component page and select OK
Note: The server specified will host the crawl component. Associated Crawl Database is where you specify which crawl database this crawl component will be mapped to. In this case, I will map it to my newly created one.
- Select Apply Topology Changes
The end result is a secondary crawl database and crawl component.
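If you prefer to script it, here’s a rough PowerShell sketch of the same change. The server, database, and service application names are placeholders, and keep in mind that an activated crawl topology must contain a component for every crawl database you want to keep:

```powershell
$ssa      = Get-SPEnterpriseSearchServiceApplication "Search Service Application"
$instance = Get-SPEnterpriseSearchServiceInstance -Identity "IndexServer2"

# Create the secondary crawl database
$crawlDb2 = New-SPEnterpriseSearchCrawlDatabase -SearchApplication $ssa -DatabaseName "SSA_CrawlStore2"

# Build a new crawl topology and add a crawl component mapped to the new database
# (also add components for the existing crawl database(s) before activating)
$topology = New-SPEnterpriseSearchCrawlTopology -SearchApplication $ssa
New-SPEnterpriseSearchCrawlComponent -SearchApplication $ssa -CrawlTopology $topology `
    -CrawlDatabase $crawlDb2 -SearchServiceInstance $instance

# Apply the topology change
Set-SPEnterpriseSearchCrawlTopology -Identity $topology -Active
```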
Fault tolerance + Performance
Just because the index doesn’t physically reside on the indexer doesn’t mean you should have only one. For example, if only one crawler component is provisioned and the index server hosting that component fails, a major part of search is broken in that no further crawls will take place. Fault tolerance can be achieved by provisioning a secondary crawl component on a secondary server. A crawl component can only map to one SQL crawl database, but multiple crawl components can map to the same crawl database. By having multiple crawl components mapped to the same crawl database, fault tolerance is achieved: if the server hosting crawl component 1 crashes, crawl component 2 picks up the additional load while 1 is down. This achieves fault tolerance for the indexer, but what about fault tolerance for the Crawl DB? We fully support SQL mirroring, and the Crawl DB can be mirrored in SQL to achieve fault tolerance.
Performance is improved in this setup because you effectively now have two indexers crawling the content instead of one. If you’re not satisfied with crawl times, simply add an additional crawl component mapped to the same crawl DB. The load is distributed across both index servers.
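Here’s a rough sketch of that fault-tolerant layout in PowerShell (placeholder names again; the search service instance must already be started on both servers):

```powershell
# Fault tolerance sketch: two crawl components on different servers, both mapped to one crawl DB
$ssa      = Get-SPEnterpriseSearchServiceApplication "Search Service Application"
$crawlDb  = Get-SPEnterpriseSearchCrawlDatabase -SearchApplication $ssa | Select-Object -First 1
$topology = New-SPEnterpriseSearchCrawlTopology -SearchApplication $ssa

foreach ($server in "IndexServer1", "IndexServer2") {
    $instance = Get-SPEnterpriseSearchServiceInstance -Identity $server
    New-SPEnterpriseSearchCrawlComponent -SearchApplication $ssa -CrawlTopology $topology `
        -CrawlDatabase $crawlDb -SearchServiceInstance $instance
}

Set-SPEnterpriseSearchCrawlTopology -Identity $topology -Active
```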
Question: If I specify two indexers to crawl the same content (two crawl components mapped to the same Crawl DB), is it possible that both indexers might attempt to crawl the same newly discovered items?
Answer: No overlapping would occur. Items are crawled and “picked up” in batches by both index servers for processing. If Indexer 1 picks up and processes Batch 1, then Index Server 2 will process Batch 2.
Crawl Distribution
Crawl Distribution can be achieved by provisioning the following:
Crawl Component 1 –> Crawl DB 1
Crawl Component 2 –> Crawl DB 2
By having multiple crawl components each mapped to a unique Crawl database, each host is assigned to only one Crawl DB at crawl time. A host is simply an address defined in a content source. So if I have two web applications provisioned using Host Headers called Sales.Contoso.Com and TPSReports.Contoso.Com, each one is a unique host. I provision a new Search Service application with the following setup:
Index Server 1 “Crawl Component 1” <–> Crawl DB 1
Index Server 2 “Crawl Component 2” <–> Crawl DB 2
When I start a crawl, each host in this example is evenly distributed so it will look like this.
Sales.Contoso.Com –> Index Server 1 “Crawl Component 1” <–> Crawl DB 1
TPSReports.Contoso.Com –> Index Server 2 “Crawl Component 2” <–> Crawl DB 2
The indexer “server hosting a crawl component” associated with that crawl database crawls that host. When multiple crawl databases exist, an attempt is made to distribute hosts evenly across them. In this example, I only have two hosts. What if I add a third host after the fact — how does the crawler determine which Crawl DB the host will be assigned to? The decision is based on the number of items (doc IDs) stored in each Crawl DB. From the example above, Sales.Contoso.Com has 300,000 items and is assigned to Crawl DB 1, while TPSReports.Contoso.Com has 200 items and is assigned to Crawl DB 2. The following hosts are added as content sources:
HR.Contoso.Com has 2,000 items
Facilities.Contoso.Com has 8,000 items
When a crawl is initiated, both hosts will be assigned to Crawl DB 2 because Crawl DB 2 contains fewer doc IDs than Crawl DB 1. Host distribution isn’t determined solely by the number of hosts but rather by the number of items in a particular Crawl DB. By default, the Crawl DB with the fewest doc IDs will always get assigned a new host. It’s possible to control this automated behavior using host distribution rules.
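To make the default assignment rule concrete, it behaves roughly like the illustrative logic below. This is not a SharePoint API — just the decision sketched as code, using the item counts from the example:

```powershell
# Illustrative only: a new host is assigned to the crawl DB that currently holds the fewest items (doc IDs)
$crawlDbs = @(
    @{ Name = "Crawl DB 1"; ItemCount = 300000 },   # Sales.Contoso.Com already lives here
    @{ Name = "Crawl DB 2"; ItemCount = 200 }       # TPSReports.Contoso.Com already lives here
)

foreach ($newHost in "HR.Contoso.Com", "Facilities.Contoso.Com") {
    $target = $crawlDbs | Sort-Object { $_.ItemCount } | Select-Object -First 1
    Write-Host "$newHost -> $($target.Name)"
    # Assignment happens at crawl time; the counts above don't change until items are actually crawled,
    # which is why both new hosts land in Crawl DB 2 in this example.
}
```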
Note: Having more crawl databases than crawl components makes no sense and is a waste of disk space and resources on the SQL server.
Question: One crawl component\crawl database exists and has already crawled all hosts in the farm. A decision is made, for various reasons, to add an additional crawl component\crawl database. The SharePoint administrator would like to forklift half of the host addresses already indexed in the original Crawl DB over to the newly created Crawl DB. How can this be done?
Answer: Once the new crawl component\crawl database is provisioned, there are a couple of ways to move half of the hosts:
- Option 1: Reset the index and perform a full crawl (sketched below).
- Option 2: Add host distribution rules for half of the hosts and redistribute. Once complete, the host distribution rule(s) can be removed.
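As a rough sketch of Option 1, the reset can be done from the search object model and a full crawl kicked off per content source. The Reset signature shown here should be verified against your build, and remember that search results are unavailable until the full crawl completes:

```powershell
$ssa = Get-SPEnterpriseSearchServiceApplication "Search Service Application"

# Reset the index (parameters: disable alerts during reset, ignore unreachable servers)
$ssa.Reset($true, $true)

# Kick off a full crawl of every content source
Get-SPEnterpriseSearchCrawlContentSource -SearchApplication $ssa | ForEach-Object { $_.StartFullCrawl() }
```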
Controlling Host Address Distribution
In some instances, a SharePoint administrator will want greater control over the decision making that goes into assigning hosts to a specific indexer\crawl DB. This is controlled through host distribution rules, which are accessible via a link on the Search Admin page.
When you add a host and apply the rule, if the host is already distributed to a crawl store, the crawler is paused and the content is physically moved and assigned to the crawl DB you selected once you click the Redistribute Now button. This is a resource-intensive operation.
Need more control?
You can also control whether a crawl DB will be used only for hosts specified by a host distribution rule. When provisioning a new crawl database, the last option at the bottom of the page lets you mark the database as dedicated.
By marking a crawl database as available only to hosts stored in host distribution rules, the indexer is aware of the change and will not automatically allocate newly discovered hosts to this crawl database at crawl time. Only host addresses defined in host distribution rules mapped to a Crawl DB with this property set will get assigned to it.
A good example of when a SharePoint administrator might take advantage of this functionality is when a newly added host consists of several million items. This provides a dedicated crawl database for that host, assuming the host distribution rule was created prior to crawling. Since the database is dedicated to this host, it is exempt from receiving any new hosts.
Question: Can I combine fault tolerance and shorter crawl times for hosts that are defined in host distribution rules and mapped to a crawl database marked as dedicated?
Answer: Sure, it’s as simple as defining another crawl component mapped to the dedicated crawl database. Remember, multiple crawl components can map to the same crawl DB, which effectively achieves both fault tolerance and shorter crawl times.
Question: Wait just a minute! I’m happy with my single crawl component\crawl database. I’m retiring the indexer “server hosting the crawl component” and I want this shiny new server to become my indexer. Do I need to create a new crawl component?
Answer: Nope, there’s no need to create additional crawl components when migrating to a new index server. It’s as simple as editing the current crawl component via the Search Admin Page\Search Application Topology and specifying the new server.
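If you’d rather script the move, the same topology cmdlets from earlier apply. The sketch below assumes a placeholder server name and that the search service instance has already been started on the new box; the old component is retired when the new topology becomes active:

```powershell
# Rough PowerShell equivalent of "editing" the crawl component: build a new topology whose only
# crawl component lives on the new server, mapped to the existing crawl DB, then activate it.
$ssa       = Get-SPEnterpriseSearchServiceApplication "Search Service Application"
$newServer = Get-SPEnterpriseSearchServiceInstance -Identity "ShinyNewIndexServer"
$crawlDb   = Get-SPEnterpriseSearchCrawlDatabase -SearchApplication $ssa | Select-Object -First 1

$topology = New-SPEnterpriseSearchCrawlTopology -SearchApplication $ssa
New-SPEnterpriseSearchCrawlComponent -SearchApplication $ssa -CrawlTopology $topology `
    -CrawlDatabase $crawlDb -SearchServiceInstance $newServer
Set-SPEnterpriseSearchCrawlTopology -Identity $topology -Active
```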
Observing is Step 1; Taking Action is Step 2
Before arbitrarily provisioning new crawler components and crawl DBs, observe the current environment and crawl health so you have evidence to support this important decision. The obvious reasons of fault tolerance and decreased crawl times are covered in the previous sections, so I won’t discuss those further. Observing for system\hardware bottlenecks is a good first step before considering adding more crawl components\crawl DBs.
Monitoring Index Server
Observation: The index server is nearly maxed out on CPU and\or available physical memory, and crawl times have increased as a result.
Action Taken: Provision a new crawl component on a different server, mapped to the same crawl DB.
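A simple way to gather that evidence is to sample a few counters over a crawl window, for example (the server name is a placeholder):

```powershell
# Sample CPU and available memory on the index server every 15 seconds for 5 minutes
Get-Counter -ComputerName "INDEXSERVER1" `
    -Counter '\Processor(_Total)\% Processor Time', '\Memory\Available MBytes' `
    -SampleInterval 15 -MaxSamples 20
```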
Monitoring SQL Server
Observation: Crawl Database is I/O bound on SQL and disk latency is unexpectedly high.
Action Taken 1: Provision a new crawl database on the same SQL server “on a different set of spindles” and forklift half of the hosts over to it via host distribution rules.
or
Action Taken 2: Provision a new crawl database on a different SQL server with a more suitable disk subsystem and forklift half of the hosts over to it via host distribution rules.
Note: Provisioning a new crawl DB also requires provisioning a new crawl component mapped to it.
Observation: SQL Server Memory\CPU is peaking and is unable to sustain the heavy load during crawl times.
Action Taken: Provision a new SQL server and place a new Crawl DB on it. Forklift half of the hosts over via host distribution rules.
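The same counter sampling works on the SQL side (server name is a placeholder). A commonly cited rule of thumb is that sustained Avg. Disk sec/Read or Write values well above 0.020 seconds on the crawl database volumes point to a disk bottleneck:

```powershell
$counters = '\LogicalDisk(*)\Avg. Disk sec/Read',
            '\LogicalDisk(*)\Avg. Disk sec/Write',
            '\Processor(_Total)\% Processor Time',
            '\Memory\Available MBytes'

# Sample disk latency, CPU, and memory on the SQL server hosting the crawl database(s)
Get-Counter -ComputerName "SQLSERVER1" -Counter $counters -SampleInterval 15 -MaxSamples 20
```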
Important: These are very basic approaches to system bottlenecks. For example, don’t assume that a general observation of a spiked CPU automatically means you should provision additional crawl components. More analysis is required, such as finding answers to the following questions:
- Does CPU only spike during crawl times?
- Which process is spiking? (a quick counter check is sketched after this list)
- Were any hosts recently added to content sources?
- Does SP health monitoring or Performance monitor reveal anything of use?
- Etc….
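For example, here’s a quick way to see whether MSSearch.exe is the process driving the spike, sampled on the index server (the counter instance name assumes the default process name):

```powershell
# Per-process CPU for mssearch (the crawl component's process).
# Run this while a crawl is active and again while idle to see whether the spike tracks crawl times.
Get-Counter -Counter '\Process(mssearch)\% Processor Time' -SampleInterval 5 -MaxSamples 12
```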
Thanks,
Russ Maxwell, MSFT