Keys to Designing and Managing Large Repositories

The B&R team has deep experience managing large structured document repositories within SharePoint. In some cases, those repositories were established and grew organically within SharePoint over time; in others, large sets of content were migrated in from network file shares or from other ECM solutions like Documentum, OpenText, or FileNet.

Throughout SharePoint’s long history there has often been confusion between software limitations and best practices. To make matters worse, in many cases there is no global agreement on what the best practices are. In our experience, many best practices are really context-specific guidelines: there is no generic answer, but rather a starting point from which the requirements shape the most appropriate solution within the right context. In our experience, some of those key decision points are:

Structured vs. Unstructured Content

[Image: Loosely categorized unstructured content stored in disparate locations across the organization.]

[Image: Centrally managed structured content that has been properly categorized.]

While SharePoint can be used in many ways, the first contextual decision point is structured versus unstructured content. In this post, we focus specifically on structured storage repositories, with content types and metadata defined, rather than on unstructured repositories. This is an important differentiator for us, since the organization and usability of content in unstructured systems is radically different.

Software Boundaries and IT Capabilities

Understanding the limits of your platforms and systems is vitally important.

When thinking about actual software and storage boundaries, SharePoint as a platform is very flexible, and its limits continue to increase as modern storage and database capabilities grow. Microsoft publishes software boundaries and limits documentation for SharePoint 2013, SharePoint 2016, and Office 365.

One common misconception is that the infamous “List View Threshold” introduced in SharePoint 2010 is a boundary limiting the number of items in a list or library to 5,000. It is not a limit on how many items a library can hold, but rather on the number of rows that can be read as part of a single database query. This topic is addressed in more detail in the System User Experience and Content Findability section below.
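
To make the distinction concrete, here is a minimal sketch using Python and the standard SharePoint REST endpoints. The site URL, library name, column name, and credentials are hypothetical placeholders, and the NTLM authentication shown applies to on-premises farms (Office 365 would use an OAuth bearer token instead).

```python
import requests
from requests_ntlm import HttpNtlmAuth  # pip install requests-ntlm

SITE = "https://sharepoint.example.com/sites/records"  # hypothetical site
auth = HttpNtlmAuth("DOMAIN\\user", "password")        # placeholder credentials
json_hdr = {"Accept": "application/json;odata=nometadata"}

# Reading list metadata succeeds no matter how many items the library
# holds; the threshold does not cap storage.
r = requests.get(
    f"{SITE}/_api/web/lists/getbytitle('Contracts')",
    params={"$select": "ItemCount"}, headers=json_hdr, auth=auth)
print(r.json()["ItemCount"])  # e.g. 250000 -- far beyond 5,000

# A query is only throttled when it must scan more than 5,000 rows.
# Filtering on an *indexed* column keeps the scan under the threshold...
r = requests.get(
    f"{SITE}/_api/web/lists/getbytitle('Contracts')/items",
    params={"$filter": "Department eq 'Legal'", "$top": "100"},
    headers=json_hdr, auth=auth)
print(r.status_code, len(r.json().get("value", [])))

# ...while the same filter on an unindexed column fails with an HTTP 500
# "list view threshold" error once the library passes 5,000 items.
```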

For on-premises versions of SharePoint Server, including 2010, 2013, and 2016, our focus has been on establishing system sizes that can be reliably maintained, including backup and recovery procedures. This is an important point because, in our experience, the capabilities and expectations of clients vary widely. Some have deep experience with multi-terabyte databases and plenty of working room to back up and restore databases as needed. Others struggle to back up and restore databases of just a few hundred gigabytes due to their backup technologies or a lack of available working storage.

With this in mind, our initial guidance is to isolate the structured repositories into dedicated site collections, each with a dedicated content database. The number and size of those site collections vary depending on the customer’s requirements and backup and recovery capabilities. We frequently start by advising smaller database sizes of around 100 GB and then adjust based on the customer’s comfort level and capabilities, but a database should never grow beyond the system administrator’s ability to capture and restore a backup.

For Office 365, Microsoft has taken ownership of system maintenance and regular backup and recovery operations. Within the service, they have also extended the software boundaries, which can make it much easier to support systems with larger repositories and fewer site collections, pushing much of the decision to the next two points relating to system usability and content findability.

System User Experience and Content Findability

The user experience of the repository is essential to its long-term success.

We will look at the processes for initially adding or authoring content, and then at how that content is later discovered. Patterns and techniques that work fine in other sites or repositories can completely fail with large repositories.

While SharePoint as a platform is typically thought of in terms of collaboration and collaborative content, the scenarios for structured content in large repositories are often different. In some scenarios, the content may be comprised of scanned documents or images sent directly to SharePoint, while in others it could be bulk-loaded electronic documents.

Unlike the collaborative scenarios, you very rarely want to add content to the root of a SharePoint library; instead you organize the content across libraries and/or sub-folders. To better handle this, we often incorporate the Content Organizer feature that Microsoft made available with SharePoint Server 2010, which offers a temporary drop-off library and rules to selectively route content to another site collection, library, or folder. This rules-based approach provides great automation that helps keep things properly organized, while making it much easier to add content to the system. While the Content Organizer covers most of our common scenarios, we can support even more advanced automation scenarios by leveraging a workflow tool or customization when needed.
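
As a rough illustration, a scanning or bulk-load process can feed the Content Organizer by dropping files into the Drop Off Library over REST and stamping the metadata the routing rules key off of. This is a sketch only: the site URL, library path, field names, and credentials are hypothetical, and SharePoint 2013 on-premises may require odata=verbose payloads instead.

```python
import requests
from requests_ntlm import HttpNtlmAuth  # pip install requests-ntlm

SITE = "https://sharepoint.example.com/sites/records"   # hypothetical
auth = HttpNtlmAuth("DOMAIN\\svc_scanner", "password")  # placeholder
json_hdr = {"Accept": "application/json;odata=nometadata"}

# Write operations need a form digest from the contextinfo endpoint.
digest = requests.post(f"{SITE}/_api/contextinfo",
                       headers=json_hdr, auth=auth).json()["FormDigestValue"]

# 1. Upload the raw bytes into the Drop Off Library.
with open("scan-0001.pdf", "rb") as f:
    requests.post(
        f"{SITE}/_api/web/GetFolderByServerRelativeUrl"
        "('/sites/records/DropOffLibrary')/Files/"
        "add(url='scan-0001.pdf',overwrite=true)",
        data=f.read(),
        headers={**json_hdr, "X-RequestDigest": digest},
        auth=auth)

# 2. Stamp the fields the organizer's rules route on (hypothetical names);
#    the Content Organizer then moves the file to its destination library
#    or folder.
requests.post(
    f"{SITE}/_api/web/GetFileByServerRelativeUrl"
    "('/sites/records/DropOffLibrary/scan-0001.pdf')/ListItemAllFields",
    json={"DocumentType": "Invoice", "Department": "Finance"},
    headers={**json_hdr,
             "Content-Type": "application/json;odata=nometadata",
             "X-RequestDigest": digest,
             "X-HTTP-Method": "MERGE",
             "IF-MATCH": "*"},
    auth=auth)
```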

Previously, the List View Threshold was mentioned. While it is often discussed as a boundary or limitation, it is actually a feature intended to help maintain system performance. In SharePoint Server 2010, 2013, and 2016 it is a setting configured at the web application level. Its intent is to protect the back-end SQL Server from unoptimized queries. The default value of 5,000 was chosen because that is the point at which the database’s query engine begins processing queries differently, and where performance-related problems start to appear. While it is safe to make small changes beyond the default limit, you will quickly experience the performance impacts the feature was designed to avoid.

The important thing to remember is that the threshold applies to a given query, so the key task is to plan and implement optimized views; a brief code sketch follows this list. We do this by thinking about a few key things:

Configure the Default View:

By default, SharePoint uses All Items as the default view. Ideally, no view should be without a viable filter, and the All Items view absolutely should not be the default view in these libraries.

Column Indexes:

Key columns used to drive views, or used as the primary filter within your list, can be indexed to improve performance. Additional information is available in Microsoft’s documentation on indexed columns.

View Filters:

Ideally, all views will be filtered enough to return fewer items than the List View Threshold of 5,000. This keeps view load times low.

Lookup Fields:

Avoid lookup fields, as they require inefficient queries that perform table scans to return content. Even smaller repositories of just a few hundred items can exceed the List View Threshold because of how those queries are formed.

Avoid Group By, Collapsed Option:

While the ability to group by your metadata can be powerful, we typically instruct our clients to avoid the option that collapses the Group By selections. The collapse option has unexpected behavior: it issues additional database queries for each Group By level and value, and it disregards the view’s item limit and paging configuration. You can limit a view to, say, 30 items, but if it is configured to group by a value and collapse by default, a first category containing 1,000 items will be queried and loaded in full, ignoring the 30-item limit. This can have severe performance implications, and it is typically the primary culprit when we are asked to troubleshoot page load performance in a specific repository.
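
The sketch below, using the same hypothetical site, list, and column names as the earlier examples, shows the first few points programmatically: flagging the filter column as indexed, then issuing a filtered, paged query so no single request scans more than 5,000 rows.

```python
import requests
from requests_ntlm import HttpNtlmAuth  # pip install requests-ntlm

SITE = "https://sharepoint.example.com/sites/records"  # hypothetical
auth = HttpNtlmAuth("DOMAIN\\user", "password")        # placeholder
json_hdr = {"Accept": "application/json;odata=nometadata"}
digest = requests.post(f"{SITE}/_api/contextinfo",
                       headers=json_hdr, auth=auth).json()["FormDigestValue"]

# Mark the key filter column as indexed (equivalent to checking "Indexed"
# on the column in list settings).
requests.post(
    f"{SITE}/_api/web/lists/getbytitle('Contracts')"
    "/fields/getbytitle('Department')",
    json={"Indexed": True},
    headers={**json_hdr,
             "Content-Type": "application/json;odata=nometadata",
             "X-RequestDigest": digest,
             "X-HTTP-Method": "MERGE",
             "IF-MATCH": "*"},
    auth=auth)

# Query against the indexed column with an explicit page size; follow the
# "odata.nextLink" value for later pages rather than pulling everything.
r = requests.get(
    f"{SITE}/_api/web/lists/getbytitle('Contracts')/items",
    params={"$select": "Title,Department,Modified",
            "$filter": "Department eq 'Legal'",
            "$top": "30"},
    headers=json_hdr, auth=auth)
for item in r.json()["value"]:
    print(item["Title"], item["Modified"])
```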

While the ability to easily and effectively locate content has a big impact on the user experience of the system, I would argue that it is the most critical factor, and one that needs to be thought through carefully when working within the SharePoint platform, so I have broken the topic out into its own section.

If you think about SharePoint sites on a continuum, from small team sites with a few libraries containing a handful of documents up to large enterprise repositories with millions of documents, it should be clear that the way you find and interact with content at the two ends of the spectrum needs to evolve. As systems grow well beyond the List View Threshold, they need to become more sophisticated and move away from manually browsing to content or running unstructured keyword searches, toward a more intelligent, search-driven experience.

While most systems include a properly configured search service, a much smaller percentage have it optimized in a way that can be leveraged to provide structured searches or dynamically surface relevant content. This optimization takes place at two levels: first within the search service itself, and then with customizations available in the system.

Within the Search Service, we work to identify the key metadata fields that should be established as managed properties for property-specific searching, and we determine which fields need to be available on a results screen for sorting and refinement. These changes allow us to execute more precise search queries and optimize the search results for our specific needs.

Within the site, we then define the scenarios in which people need to look for and find content, and build structured search forms and results pages optimized for those scenarios. In some cases, these are generic to the content in the repository; in others, they are specific to a given task or role, helping to simplify things for specific business processes. By leveraging structured search pages, we can provide an improved user experience that dramatically reduces the time it takes to locate relevant content: the initial result set is smaller, and it is then easily pared down through relevant search refiners. In addition, on common landing pages we leverage the search-driven web parts to highlight relevant, related, or new content as needed to support the usage scenarios.
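
To ground this, here is a hedged sketch of a structured query against the Search REST API: a managed-property restriction rather than a bare keyword, selected properties for the results grid, and refiners for narrowing. The managed property names (RefinableString00, etc.) and site details are assumptions, and the response shape shown is for the nometadata OData mode (verbose mode nests the same data under d.query).

```python
import requests
from requests_ntlm import HttpNtlmAuth  # pip install requests-ntlm

SITE = "https://sharepoint.example.com/sites/records"  # hypothetical
auth = HttpNtlmAuth("DOMAIN\\user", "password")        # placeholder

r = requests.get(
    f"{SITE}/_api/search/query",
    params={
        # Property restrictions instead of a bare keyword query.
        "querytext": "'ContentType:Contract RefinableString00:Legal'",
        # Managed properties to bring back for the results grid.
        "selectproperties": "'Title,Path,RefinableString00,LastModifiedTime'",
        # Refiners the UI can offer for paring results down further.
        "refiners": "'RefinableString00,FileType'",
        "rowlimit": "30",
    },
    headers={"Accept": "application/json;odata=nometadata"},
    auth=auth)

rows = r.json()["PrimaryQueryResult"]["RelevantResults"]["Table"]["Rows"]
for row in rows:
    cells = {c["Key"]: c["Value"] for c in row["Cells"]}
    print(cells.get("Title"), cells.get("Path"))
```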

Our Approach to Designing Record Center

As we set out to design and implement our Record Center product, we knew that it must scale to tens of millions of records, both in technical performance and in user experience. To accomplish this, we automated the setup and configuration process in ways that optimize the solution for this specific purpose and use case.

While doing a product feature overview is outside the scope of this post, we are happy to report that our approach and techniques have been successfully adopted by our clients and that today the average repository size is in the hundreds of thousands of documents while still meeting performance, usability, and system maintenance goals.

Next Steps

I hope this post provided a good overview of how to plan and maintain large repositories. It is a big topic, with lots of nuances and techniques that are learned over time in the trenches. If your group is struggling with designing and managing large repositories and needs help, reach out and set up a consultation. We can assist your team with advisory services or help implement a robust system.

Can We Help?

Contact us today for a free consultation!