TL;DR

Producing a scorecard that accurately measures or predicts relevant outcomes (such as manager quality) requires a multi-stage, data-driven process. Items and weights can’t be chosen by intuition. The effort is worth it, however, because an effective due diligence scorecard can greatly improve the quality, transparency and speed of selection decisions and manager reviews.

Critical issues commonly found with scorecards are:

  • Failure to obtain conceptual buy-in from decision makers
  • Failure to validate the items and measures
  • Failure to create a dynamic dashboard
  • Failure to maintain objectivity
  • Failure to correctly weight the measures
  • Failure to interpret the results
  • Failure to audit and continuously improve

Introduction

Rubrics and dashboards can provide convenient summaries for use in manager selection and performance reviews, but validation is essential.

Most fund selection processes use some form of scorecard to rank and compare managers, as well as to provide a high-level overview of their strengths and weaknesses. Considering how much work goes into populating and maintaining an up-to-date, accurate scorecard, and how important it is to end up with an output that’s genuinely useful as a decision-making tool, it’s worth investing in getting the design right from the beginning.

Scorecard development can be broken down into three components:

  • Items: What categories will the scorecard contain? Examples of common elements include past performance, team stability, and an assessment of the parent firm.
  • Measures: How will each item be measured? Where does the data come from, how frequently can it be collected, and how delayed is it? What is the scale of the measurement? How accurate is it, and how do we know?
  • Weights: How will the measures be weighted within an item? How will the items be weighted to produce a final score? Will the score be numeric or categorical, and what is the interpretation of each score range? (A minimal structural sketch of these three components follows this list.)
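To make the three components concrete, here is a minimal structural sketch in Python. The item names, measures, weights and normalization shown are hypothetical placeholders for illustration, not a recommended configuration.

```python
from dataclasses import dataclass, field

@dataclass
class Measure:
    name: str            # e.g. "rolling 3-year excess return" (hypothetical)
    weight: float        # weight of this measure within its parent item
    value: float = 0.0   # latest observation, normalized to a common scale

@dataclass
class Item:
    name: str            # e.g. "past performance", "team stability"
    weight: float        # weight of this item in the final score
    measures: list = field(default_factory=list)

    def score(self) -> float:
        # Weighted average of the item's measures.
        total = sum(m.weight for m in self.measures)
        return sum(m.weight * m.value for m in self.measures) / total

def total_score(items: list) -> float:
    """Combine item scores into a single summary number via a weighted average."""
    total = sum(i.weight for i in items)
    return sum(i.weight * i.score() for i in items) / total
```

In practice, the measures, their scaling, and every weight in a structure like this should come out of the validation process described below, not from the structure itself.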

While this high-level process may sound relatively straightforward, each of these questions embeds numerous complex issues and tradeoffs.

The ultimate aim of a scorecard is to assess performance against outcomes, so before jumping into creating a rubric, it’s critical to first get clarity around the end objectives of the process: what are we seeking to score, and why? Then, work backwards to identify items that are correlated to those outcomes. (See Dimensions of Fund Evaluation for the key ingredients that institutional investors can use to construct an objective function.)

There are often many candidate measurements for a given construct, and not all are created equal. It’s not necessary to include every data point available; doing so only leads to confusion and information overload. Rather, the focus should be on finding reliable measures that represent the item and can be observed as frequently and objectively as possible.

The need for objectivity and speed is one reason why Empirically generally recommends quantitative as opposed to qualitative measures whenever possible. It’s not that qualitative measures can’t be just as valuable; it’s that they’re often very time-consuming to construct properly, and they’re more vulnerable to criticism or revision when stakeholders disagree with what they say. So if there’s an adequate quantitative measure available for an item, it’s usually best to favor it over a categorical alternative.

Finally, the measures need to be weighted to produce a summary result. It’s intuitive to choose weights that correspond to the perceived importance of each item, but in fact, this rarely results in an optimal scoring system. Just because a fund investor values Item A more highly than Item B, does not necessarily mean that Item A is a better statistical predictor of the investor’s desired outcome.

Instead, quantitative model selection techniques must be used to let the data speak for itself. These techniques can test large numbers of candidate weighting schemes to select the best performing one, and can also integrate information in a wide range of ways. For example, a given measure may have a non-linear impact on outcomes; such a relationship would be very difficult to identify and specify accurately using judgment alone.
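As an illustration only, the sketch below searches a coarse grid of candidate weighting schemes, scores each by the rank correlation between the resulting composite score and a subsequent outcome, and checks the winner on a holdout sample. The simulated data, the three-measure setup, and the rank-correlation criterion are assumptions made for the example; they are not a description of Empirically’s methodology.

```python
import numpy as np
from itertools import product
from scipy.stats import spearmanr

rng = np.random.default_rng(0)

# Hypothetical panel: rows = manager-period observations, columns = standardized measures.
X = rng.standard_normal((400, 3))
# Hypothetical outcome (e.g. next-period excess return), driven non-linearly by measure 0.
y = 0.8 * np.tanh(X[:, 0]) + 0.3 * X[:, 1] + 0.5 * rng.standard_normal(400)

train, test = slice(0, 300), slice(300, None)

# Candidate weighting schemes: non-negative weights on a coarse grid, summing to 1.
grid = [w for w in product(np.linspace(0, 1, 11), repeat=3) if abs(sum(w) - 1) < 1e-6]

def rank_ic(weights, rows):
    """Spearman rank correlation between the weighted score and the outcome."""
    rho, _ = spearmanr(X[rows] @ np.asarray(weights), y[rows])
    return rho

# Pick the scheme with the best in-sample rank IC, then evaluate it out of sample.
best = max(grid, key=lambda w: rank_ic(w, train))
print("selected weights:", np.round(best, 2))
print("out-of-sample rank IC:", round(rank_ic(best, test), 3))
```

A real exercise would use richer model classes (which can also capture non-linear relationships like the one simulated above), proper cross-validation, and far more history, but the principle is the same: the data, not intuition, selects the weights.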


Empirically specializes in developing state-of-the-art dashboards for manager evaluation and review, which incorporate all of the best practices discussed in this article. Schedule a Demo to learn more.

7 Common Issues with Due Diligence Scorecards

We now move on to some of the pitfalls that commonly befall scorecards, and what can be done to avoid them.

Failure to Obtain Conceptual Buy-In from Decision Makers

If the users of the scorecard are not on board with the conceptual or methodological approach being used, they will not trust what it says and, as a result, will not feel comfortable relying on it to make decisions. Objections can range from very specific – such as questioning the weighting of a particular measure – to very general, such as fundamental disagreement about how managers should be evaluated.

It bears repeating that high-level differences about objectives need to be worked through before commencing development of the scorecard. (See: Building Blocks of an Active Manager Program) Then, a transparent and impartial data-driven process should be used to determine the sets of items, measures and weights that best predict future performance against those objectives. If decision makers have bought in to both the objectives and the approach to the scorecard, they will give its conclusions much more credence.

Keep in mind that individual decision makers may have different priorities. An investment committee is not a homogeneous entity, but (hopefully) a diverse group of people with their own unique beliefs, opinions, and priorities. Therefore, it is critical that flexibility be built into the scorecard to facilitate exploration of how its outputs would change under different scenarios and objective functions.

No static scorecard will be capable of pleasing everyone. It’s best to anticipate differences and build them into the design of the decision tool; otherwise, if the tool is inflexible, discussions will quickly move away from the scorecard and lose structure, leading to a lower quality of debate and to conclusions that are not backed by the stated process and may be inconsistent with the investment policy statement.

Failure to Validate the Items and Measures

A poorly designed scorecard risks introducing a false sense of objectivity and accuracy into a process. Just using numbers and weights to perform a structured evaluation offers no advantage over other methods, if the structure is incorrect.

When building a scorecard, investors must beware of simplifications that are appealing yet lack predictive power. While the output should be elegant and user-friendly, the inputs to a well-performing predictive scorecard are often not intuitive and can’t be selected by conjecture.

Each choice of item and measure in the scorecard needs to have a data-driven basis for inclusion. Similar justification should exist for items or measures that are not included in a scorecard, but which might seem likely candidates. For example, if rolling 3-year performance versus the benchmark is not a component of the final scorecard, evidence-based rationale should be available to explain to stakeholders why other approaches to performance measurement are being favored.
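For instance, a minimal first-pass check on whether a candidate measure deserves a place in the scorecard is to look at its historical rank correlation with the subsequent outcome of interest. The measure names, data, and setup below are hypothetical and purely illustrative.

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(1)
n = 250  # hypothetical manager-period observations

# Hypothetical standardized values of candidate measures.
candidates = {
    "rolling_3yr_excess_return": rng.standard_normal(n),
    "information_ratio_5yr":     rng.standard_normal(n),
    "team_turnover_pct":         rng.standard_normal(n),
}
# Hypothetical subsequent outcome, partly driven by one of the candidates.
future_outcome = 0.3 * candidates["information_ratio_5yr"] + rng.standard_normal(n)

# Rank correlation of each candidate with the subsequent outcome, plus its p-value.
for name, values in candidates.items():
    rho, pval = spearmanr(values, future_outcome)
    print(f"{name:28s} rank IC = {rho:+.2f}   p-value = {pval:.3f}")
```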

Failure to Create a Dynamic Dashboard

It’s important for the dashboard to be sensitive enough to capture changes in manager quality. Ideally, measures should be selected that can be observed at a regular frequency – such as monthly or quarterly – and with as short a lag as possible, if not in real time.

Frequent updates do not imply that the output ratings will also change frequently; in fact, constant movement suggests that the weighting scheme is too sensitive or miscalibrated. But the scorecard needs to be capable of detecting important shifts in a reasonable amount of time; the sooner, the better. If a scorecard is perceived as stale, it will lose credibility with decision makers and become backward-looking rather than predictive.

To this end, measures should only be included if it’s possible to update them regularly and consistently. For example, a qualitative rating on a given topic based upon a one-off set of interviews or an ad hoc questionnaire is likely not a good measure for a scorecard, unless the item it is measuring is highly likely to be stable over the period of the manager relationship. Otherwise, this static component will bias the manager’s overall rating once it becomes out-of-date.
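A simple way to enforce this in practice is to track when each measure was last observed and flag anything older than its expected update cycle. The measure names, dates, and age limits below are hypothetical.

```python
from datetime import date

as_of = date(2020, 9, 30)

# Hypothetical inventory: when each measure was last observed and its maximum acceptable age (days).
measures = {
    "rolling_3yr_excess_return": (date(2020, 8, 31), 45),    # monthly data
    "team_headcount":            (date(2020, 6, 30), 120),   # quarterly data
    "ops_questionnaire_rating":  (date(2018, 11, 15), 180),  # one-off interview
}

for name, (last_observed, max_age_days) in measures.items():
    age = (as_of - last_observed).days
    if age > max_age_days:
        print(f"STALE: {name} is {age} days old (limit {max_age_days})")
```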

Failure to Maintain Objectivity

As already noted, a data-driven process to select the items, measures and weights that comprise the scorecard is essential. In addition, investors should be aware of potential bias-introducing subjectivity in the measurements themselves.

Quantitative measures are not always immune to subjectivity, which can creep in both through their design and through their measurement. For example, consider the metric “size of senior investment team,” measured as a number of individuals. This requires defining what should constitute “senior” for the purposes of the scorecard. If seniority is to be judged by years of investment experience, there are several plausible cutoffs, such as 10 years or 15 years. Subjectivity might again be required in applying the definition when performing the measurement: for example, should a given work experience count as “investment experience”, or not?

With that said, qualitative measures tend to be much more subjective, which is especially concerning when they carry a large weight in a scorecard. Subjectivity can introduce both unintentional bias and intentional bias, if the individuals compiling the ratings game the scorecard in favor of their preferred manager. Three key questions of note are:

  • Inter-rater agreement: Would multiple qualified individuals performing the same qualitative assessment of the same item, according to the same rubric, arrive at the same conclusion? (A minimal sketch of this check follows the list.)
  • Reproducibility: Would the same individual performing the same assessment of the same manager under the same conditions, arrive at the same conclusion?
  • Sub-conscious bias: Is the rating being unintentionally influenced by irrelevant factors? For example, is a rating of “investment process quality” being biased by attributes such as the Portfolio Manager’s gender, alma mater, office décor, PowerPoint formatting, accent, appearance, or any number of other noise variables?
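As a minimal illustration of the inter-rater agreement question, the sketch below compares two analysts’ hypothetical 1-5 ratings of the same managers using Cohen’s kappa, which adjusts raw agreement for the agreement expected by chance. The ratings are invented for the example.

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical 1-5 ratings of "investment process quality" for ten managers,
# assigned independently by two analysts following the same rubric.
analyst_a = [4, 3, 5, 2, 4, 3, 5, 1, 2, 4]
analyst_b = [4, 2, 5, 2, 3, 3, 4, 1, 2, 5]

# weights="quadratic" penalizes large disagreements more heavily than near-misses.
kappa = cohen_kappa_score(analyst_a, analyst_b, weights="quadratic")
print(f"weighted kappa = {kappa:.2f}")  # values above ~0.8 are commonly read as strong agreement
```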

As a result of these challenges, developing, deploying and auditing qualitative measures that are truly valid can be highly time-consuming and expensive. It also may require a degree of access to the manager being rated that is impossible or impractical to obtain uniformly, leading to a new set of missing data issues.

While working through these issues may be worth it in certain circumstances, Empirically has a strong preference for objective quantitative measures. In our research, what they lose in potential predictive power (sometimes nothing at all) is more than compensated for by the speed, cost-efficiency and built-in impartiality that they confer.

Where qualitative measures must be used, they should be as specific as possible, and each category of the scale should be clearly defined. Each rating should be conducted by a minimum of two independent analysts, with a third tiebreaker being brought in to resolve large differences of opinion.

Failure to Correctly Weight the Measures

Measures should be weighted using a quantitative model selection algorithm, not based on subjective views of variable importance or predictive power. This algorithm should be tuned to predict the objectives of interest.

Often, the objectives of an investment committee are more complex than simply maximizing a single metric, such as expected alpha. The weights need to be calibrated so that the scorecard’s ratings incorporate the full complexity of the decision.

For example, a fund investor might place high value on a manager that can generate a less-correlated return stream, and be willing to trade off some alpha for the diversification benefit. The ranking model needs to be calibrated to embed this tradeoff, so that the output of the scorecard takes into account the whole picture and has the most straightforward interpretation possible.
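One way to express such a tradeoff is as a composite objective that the weighting model is then calibrated to predict. The sketch below uses a hypothetical penalty coefficient and invented monthly return series to show the general idea; the specific functional form is an assumption for illustration, not a recommended objective.

```python
import numpy as np

def composite_objective(mgr_excess: np.ndarray, portfolio: np.ndarray,
                        correlation_penalty: float = 0.02) -> float:
    """Annualized excess return minus a penalty for correlation with the existing portfolio.

    correlation_penalty is the annualized alpha the investor would give up to move
    from a fully correlated to a fully uncorrelated return stream (illustrative: 2%).
    """
    alpha = mgr_excess.mean() * 12                     # monthly -> annualized (rough proxy)
    corr = np.corrcoef(mgr_excess, portfolio)[0, 1]
    return alpha - correlation_penalty * corr

# Hypothetical monthly series: one candidate manager vs. the investor's existing portfolio.
rng = np.random.default_rng(2)
mgr = rng.normal(0.002, 0.02, 60)
port = rng.normal(0.005, 0.03, 60)
print(f"composite objective = {composite_objective(mgr, port):.2%}")
```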

While the model construction itself is a statistical exercise, the issue of weighting once more relates back to the importance of ensuring that the scorecard embeds decision makers’ true priorities and that any differences are resolved. For example, it’s common to see scorecards that place a lower emphasis on performance than most committee members would individually assign, if asked.

Failure to Interpret the Results

Users of the scorecard need to be provided with data-driven guidance on what the results mean, how to interpret them, and what to expect in terms of accuracy and limitations. Without such information, even if the rating system is perfectly constructed and executed, it will still remain unclear how the ratings should be translated into investment decisions.

The development, backtesting and validation process of a scorecard must generate answers to the following questions, among others:

  • Based on the precision achieved, will the output be numeric (e.g., 1-100) or binned (e.g., Buy, Hold, Sell)?
  • If categorical, what does the “confusion matrix” look like? For example, in what (estimated) percentage of cases will a Buy actually be a Sell, and vice versa? (A minimal sketch of this calculation follows the list.)
  • If numeric, what do the numbers mean? What is the margin of error? Is a 95 better than a 92, or is that just noise? What does a 90 versus an 80 score imply in terms of expected future investment outcomes?
  • Is the measure linear? Is the difference between 100 vs. 90 the same as 90 vs. 80, or Buy vs. Hold and Hold vs. Sell, in terms of the ability to achieve the objectives?
  • What are the key sensitivities and drivers of a given score? How robust is it? What factors would cause it to be inaccurate?
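As an illustration of the categorical case, the sketch below builds a confusion matrix from hypothetical backtested ratings, from which quantities such as the share of Buys that turned out to be Sells can be estimated. The ratings and realized outcomes are invented for the example.

```python
from sklearn.metrics import confusion_matrix

labels = ["Buy", "Hold", "Sell"]
# Hypothetical backtest: the scorecard's rating at the start of each period,
# and the category the manager's realized outcome actually fell into.
predicted = ["Buy", "Buy", "Hold", "Sell", "Buy", "Hold", "Hold", "Sell", "Buy", "Hold"]
realized  = ["Buy", "Hold", "Hold", "Sell", "Buy", "Buy", "Hold", "Hold", "Sell", "Hold"]

cm = confusion_matrix(realized, predicted, labels=labels)
print("rows = realized outcome, columns = predicted rating")
for label, row in zip(labels, cm):
    print(f"{label:>4s}: {row}")

# Estimated share of Buy ratings that ended up as Sell outcomes.
buys = cm[:, labels.index("Buy")].sum()
print("Buy rated, Sell realized:", cm[labels.index("Sell"), labels.index("Buy")] / buys)
```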

Failure to Audit and Continuously Improve

Investing in getting the design and implementation of a scorecard right will hopefully lead to a system that can be used for many years without major modifications. However, both the external market environment and internal priorities undergo continuous change, and new data is being collected every month or quarter about the inputs and outcomes of interest. Therefore, there’s no such thing as “set and forget” when it comes to maintaining a top-performing decision tool.

Incredibly, many institutional investors and advisors do not formally track and audit the performance of their evaluation process. It’s fundamentally important to know:

  • How often is the scorecard right vs. wrong? (This can be defined in a number of ways.)
  • What is the value added or subtracted by using the scorecard as opposed to alternative selection methods? (Benchmarking the benchmark; see the sketch after this list.)
  • How can performance be improved? Are there new data sources or methodologies we can bring in?
  • Where has the system failed? Are there any structural issues that can be fixed?
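As one illustration of the first two questions, the sketch below tracks how often the scorecard’s top pick subsequently outperformed, and compares the result against a naive alternative of picking on trailing returns alone. All inputs are randomly generated placeholders; a real audit would use the program’s actual rating and outcome history.

```python
import numpy as np

rng = np.random.default_rng(3)
periods, managers = 20, 30

scorecard_rank = rng.random((periods, managers))   # hypothetical scorecard scores per period
trailing_rank = rng.random((periods, managers))    # naive alternative: trailing-return rank
future_excess = rng.normal(0.0, 0.02, (periods, managers))  # subsequent excess returns

def avg_pick_return(scores):
    """Average subsequent excess return of the top-rated manager in each period."""
    picks = scores.argmax(axis=1)
    return future_excess[np.arange(periods), picks].mean()

top_picks = scorecard_rank.argmax(axis=1)
hit_rate = (future_excess[np.arange(periods), top_picks] > 0).mean()
print(f"hit rate (top pick beat its benchmark): {hit_rate:.0%}")
print(f"value added vs. trailing-return picks: "
      f"{avg_pick_return(scorecard_rank) - avg_pick_return(trailing_rank):+.2%}")
```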

Conclusion

When it’s built correctly, there’s an incredible amount of thought and work behind what can look like a simple scorecard. On the other hand, when a scorecard doesn’t address the above issues, and can’t answer some of the questions discussed here, it’s a cause for concern. It’s inappropriate to rely on a tool to make decisions without strong evidence and reasons to believe that its results are accurate. Therefore, investors and fiduciaries should ask the tough questions whenever considering using or adopting a rubric for manager evaluation or review. If the answers aren’t satisfying, that’s an indication that more work is needed to produce a framework that’s ready for use in portfolio management.


Author Information: Jordan Boslego is a Partner at Empirically.

Updated September 2020.