### Introduction

In this Home Analytics series, we will develop a toolbox of statistical methods to help analyze the home as a financial asset. We begin by estimating expected risk, returns, and correlation to other asset classes. Later, we will explore applications of our estimates as they relate to the financial decisions of households, investors, and the public sector. The toolbox will rely on as much historical research as possible (most notably the work of Karl Case and Robert Shiller), but innovations will be necessary to accommodate the unique characteristics of this asset class and its data-generating process.

In part 1 of the Home Analytics series, we will estimate a single parameter, $\mu$, the long-run average return of real estate. Over a long enough timeline, the long-run historical average return of an asset provides a good first-order approximation of expected future returns.

The prior art on long-run historical average estimation is extensive. But, unlike highly homogeneous and liquid assets, each home is unique and doesn’t trade often. The estimation methods proposed herein are inspired by those methods used for more liquid asset classes but can accommodate the heterogeneity in holding periods.

### Estimated Long Run Returns

Below, we show the average historical returns for the country, states, and Core Based Statistical Areas (CBSAs), which include both larger metropolitan and smaller micropolitan areas. Included are the standard errors and data counts. The following data cleaning steps have been taken:

- States and CBSAs with fewer than 10,000 sale pairs have been removed.
- Homes with returns beyond ±50% per year are removed (these are typically flips or are not arm's-length transactions).
- Only homes purchased for between $50,000 and $5,000,000 are included.
- Only arm's-length residential repeat sales are included (no foreclosures or short sales).
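As a rough illustration, the cleaning steps above could be sketched in Python as follows. The `SalePair` record and its field names are hypothetical, "return per year" is assumed here to mean the annualized log-return, and the region count is assumed to be taken after the other filters; the source does not specify these details.

```python
from collections import Counter
from dataclasses import dataclass
import math


@dataclass
class SalePair:
    region: str            # hypothetical field: state or CBSA identifier
    purchase_price: float  # dollars
    sale_price: float      # dollars
    holding_years: float   # years between purchase and sale
    arms_length: bool      # False for foreclosures, short sales, etc.


def annual_log_return(pair):
    # Assumption: "return per year" is the log-return divided by the holding period.
    return math.log(pair.sale_price / pair.purchase_price) / pair.holding_years


def clean(pairs, min_pairs_per_region=10_000, max_abs_annual_return=0.50,
          min_price=50_000, max_price=5_000_000):
    # Pair-level filters: arm's-length only, price bounds, annual return bounds.
    kept = [
        p for p in pairs
        if p.arms_length
        and p.holding_years > 0
        and p.sale_price > 0
        and min_price <= p.purchase_price <= max_price
        and abs(annual_log_return(p)) <= max_abs_annual_return
    ]
    # Region-level filter: drop regions with too few remaining sale pairs.
    counts = Counter(p.region for p in kept)
    return [p for p in kept if counts[p.region] >= min_pairs_per_region]
```

The thresholds are exposed as keyword arguments so that the sensitivity of the estimates to each cleaning rule can be checked.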

*Figure 1: Average long-run returns of homes. Interact with the chart to see the underlying data, data count, and standard errors. For CBSA, the bubble sizes correspond to the underlying data count.*

### Inspiration for the Methodology

With the luxury of homogeneity and a dense time series of return data, a good estimate of the average long-run return $\hat{\mu}$ can be computed by taking the average return of the series:

$$\hat{\mu} = \frac{1}{N}\sum_{i=1}^{N} r_i$$

- $\hat{\mu}$ is an estimate of the long-run average return
- $r_i$ is return $i$ in the sample of $N$ returns

*Figure 2: The last 150 years of S&P 500 total returns have returned an average of 8.7% per year.*

Under certain assumptions, the estimate $\hat{\mu}$ represents the "best guess" for a return $r_i$ in the sample, assuming the following model of returns:

$$r_i \sim N(\mu, \sigma^2)$$

- $\mu$ is the true long-run average return
- $\sigma^2$ is the variance in returns

Equivalently:

$$r_i = \mu + \sigma\varepsilon_i$$

$$\varepsilon_i \sim N(0, 1)$$

- $\varepsilon_i$ is a random noise variable
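As a minimal sketch of the simple average estimator under this normal model, the Python snippet below computes the sample mean together with the usual textbook standard error $s/\sqrt{N}$. The simulated returns and the parameter values (an 8% mean and 18% volatility) are hypothetical and chosen only for illustration; they are not estimates from the data discussed here.

```python
import math
import random


def mean_and_standard_error(returns):
    """Sample mean of the returns and the standard error of that mean."""
    n = len(returns)
    mu_hat = sum(returns) / n
    # Unbiased sample variance (divides by n - 1).
    s2 = sum((r - mu_hat) ** 2 for r in returns) / (n - 1)
    return mu_hat, math.sqrt(s2 / n)


# Hypothetical parameters, for illustration only.
random.seed(0)
true_mu, true_sigma = 0.08, 0.18
simulated = [random.gauss(true_mu, true_sigma) for _ in range(150)]

mu_hat, se = mean_and_standard_error(simulated)
print(f"estimated mean = {mu_hat:.4f}, standard error = {se:.4f}")
```

The same function can be applied to log-returns by first transforming each return with `math.log1p(r)`.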

On a side note, S&P 500 returns more closely fit a log-normal distribution than a normal distribution. In this case, the average of the log-returns is a more robust measure of expectation. This can be achieved very easily by replacing $r_i$ with $\ln(r_i + 1)$.

### Sale Pair Data

Home sales do not provide us with the same stream of period-by-period returns. In order to obtain the return for a home, we need at least two transactions: a purchase (at time $t$) and a sale (at time $T$):

$$R = \frac{P_{sale}}{P_{purchase}} - 1$$

- $P_{sale}$ is the sale price
- $P_{purchase}$ is the purchase price
- $R$ is the total holding period return

This return occurs over a holding period $\tau$:

$$\tau = T - t$$

- $T$ is the sale date
- $t$ is the purchase date
- $\tau$ is the holding period
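To make the sale pair quantities concrete, here is a small Python sketch that turns one purchase and one sale into a total return, a holding period, and a log-return. It assumes the total return is the price ratio minus one and that the holding period is measured in years of 365.25 days; the function name and the example prices and dates are hypothetical.

```python
from datetime import date
import math


def sale_pair_return(purchase_price, sale_price, purchase_date, sale_date):
    """Return (total return R, holding period in years, log-return)."""
    total_return = sale_price / purchase_price - 1.0
    # Assumption: holding period expressed in years of 365.25 days.
    holding_years = (sale_date - purchase_date).days / 365.25
    log_return = math.log1p(total_return)  # ln(R + 1)
    return total_return, holding_years, log_return


# Hypothetical sale pair, for illustration only.
R, tau, r = sale_pair_return(400_000, 500_000, date(2012, 6, 1), date(2020, 6, 1))
print(f"R = {R:.4f}, tau = {tau:.2f} years, log-return = {r:.4f}")
```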

*Figure 3: Observed home returns in Seattle exhibit a diffusion pattern, inspiring the use of a geometric Brownian motion model. The color coding differentiates purchase year. A small number of outliers have been removed.*

If we had period-by-period returns for each home in the U.S. (e.g. a market price of each home was generated every period), the simple average formula would produce a good long-run average estimate. However, we do not control the holding period. Figure 3 suggests that both the expected return and the variance of returns increase almost linearly with the holding period. A robust model needs to adapt to the fact that both the expectation and the errors around the expectation are functions of time.

On a slightly more subtle point, despite the more than 120 million homes in America, only about 20-25 million sale pairs can be generated from the past 20 years of transactions. Our model implicitly assumes that the observed returns are uniformly distributed across geography and time. This isn't strictly true. Homes in dense urban geographies tend to outperform their rural counterparts and trade more frequently, leading to a slight positive geographic bias. Separately, homes tend to turn over more quickly when the economy is strong and home price appreciation is high, leading to an additional positive temporal bias. We will explore solutions to these geographic and temporal biases in future posts by increasing the geographic granularity (i.e. estimating zip code level returns) and by estimating period-by-period return indices.

### A Simple Model for Home Returns

Home returns tend to be log-normally distributed. Said another way, the logarithm of home returns is normally distributed. Most traded assets exhibit log-normally distributed returns. We define the log-return $r$ as:

$$r = \ln(R + 1) = \ln(P_{sale}) - \ln(P_{purchase})$$

$$r \sim N(\mu\tau, \sigma^2\tau)$$

- $r$ is the log-return of a home

This model is a simple geometric Brownian motion (GBM) diffusion model for asset returns. Buried within is an assumption that both the expected log-return and the variance of log-returns scale linearly with time:

$$E[r(\tau)] = \mu\tau$$

$$Var[r(\tau)] = \sigma^2\tau$$

*Figure 4: A box-and-whiskers plot of home returns shows (1) just how dispersed home returns are around the average and (2) how that variance increases with time.*
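The linear scaling assumed by the GBM model can be checked with a short simulation. The sketch below draws log-returns from a normal distribution with mean $\mu\tau$ and variance $\sigma^2\tau$ for a few holding periods, then reports the sample mean and variance divided by $\tau$; both ratios should stay roughly constant. The values of `mu` and `sigma` are hypothetical and are not estimates from the data.

```python
import math
import random
import statistics


def simulate_log_returns(mu, sigma, tau, n, rng):
    """Draw n log-returns for holding period tau under the GBM assumption."""
    return [rng.gauss(mu * tau, sigma * math.sqrt(tau)) for _ in range(n)]


rng = random.Random(42)
mu, sigma = 0.04, 0.10  # hypothetical annual drift and volatility

for tau in (1, 5, 10):
    sample = simulate_log_returns(mu, sigma, tau, 50_000, rng)
    mean = statistics.fmean(sample)
    var = statistics.variance(sample)
    # Under the model, mean / tau is close to mu and var / tau is close to sigma**2.
    print(f"tau={tau:>2}: mean/tau={mean / tau:.4f}, var/tau={var / tau:.4f}")
```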

In future posts, we will explore the limitations of these assumptions. Some of the more important model breaks are:

- Home returns appear to be slightly auto-correlated (not independent)
- Home variance more accurately follows a jump-diffusion model, with a jump in volatility at transaction (due in part to information asymmetry/transaction costs)

For now, these assumptions only cause a slight bias in our desired estimates (and their respective standard errors).

Now that we are armed with a model of the data generating process of home returns, we can begin to think about how to estimate our parameter of interest and its standard errors.

### Estimating the Long Run Average Return

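Provided the following vector of returns:

$$\mathbf{r} = (r_1, r_2, \ldots, r_N)$$

… and vector of holding periods:

$$\boldsymbol{\tau} = (\tau_1, \tau_2, \ldots, \tau_N)$$

We can write our model as a linear regression:

$$r_i = \mu\tau_i + \sigma\sqrt{\tau_i}\,\varepsilon_i$$

$$\varepsilon_i \sim N(0, 1)$$

- $r_i$ is the return of sale pair $i$
- $\tau_i$ is the holding period of sale pair $i$
- $\varepsilon_i$ is the white noise random error of sale pair $i$
- $\mu$ is the true long-run average rate of return
- $\sigma^2$ is the true variance rate of returns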

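We are trying to find the estimate $\hat{\mu}$ that minimizes the sum of squared errors $\varepsilon_i^2$ of this model:

$$\hat{\mu} = \arg\min_{\mu} \sum_{i=1}^{N} \frac{(r_i - \mu\tau_i)^2}{\sigma^2\tau_i}$$

At this point, it is worth comparing the new optimization program to our previous homogeneous/homoscedastic model:

$$\hat{\mu} = \arg\min_{\mu} \sum_{i=1}^{N} \frac{(r_i - \mu)^2}{\sigma^2}$$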

The two big innovations of our home return model are:

- Heterogeneity of Holding Period Returns: The prediction $\mu\tau_i$ scales with the holding period.
- Heteroscedastic Errors: Our squared prediction errors $(r_i - \mu\tau_i)^2$ are scaled by a known variance factor $\sigma^2\tau_i$.
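To see the weighted objective in action, the sketch below evaluates the sum of squared errors, each scaled by its holding period, and finds the minimizing $\mu$ by brute-force grid search. The four sale pairs are made-up numbers, and the grid search is only a didactic stand-in for a proper solver. The factor $\sigma^2$ is left out because a common positive constant does not change where the minimum sits.

```python
def weighted_sse(mu, returns, taus):
    """Sum of squared prediction errors, each scaled by its holding period.

    The common factor 1 / sigma^2 is omitted: it rescales the objective
    but does not move the minimizer.
    """
    return sum((r - mu * t) ** 2 / t for r, t in zip(returns, taus))


# Hypothetical log-returns and holding periods (years), for illustration only.
returns = [0.05, 0.18, 0.22, 0.55]
taus = [1.0, 4.0, 6.0, 12.0]

# Candidate annual rates from 0% to 10% in steps of one basis point.
grid = [i / 10_000 for i in range(0, 1001)]
best_mu = min(grid, key=lambda m: weighted_sse(m, returns, taus))
print(f"grid-search estimate of mu: {best_mu:.4f}")
```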

This beautifully simple model is convex and has a unique closed-form solution. It is known as weighted least-squares regression:

$$\hat{\mu} = \frac{\sum_{i} w_i\tau_i r_i}{\sum_{i} w_i\tau_i^2}$$

Where:

$$w_i = \frac{1}{\sigma^2\tau_i}$$

Since $\sigma^2$ appears in all terms in both the numerator and denominator, it has no bearing on the estimate $\hat{\mu}$. This is convenient because we don't yet have an estimate for $\sigma^2$.

Rewritten, the solution takes the following simplified form:

$$\hat{\mu} = \frac{\sum_{i} r_i}{\sum_{i} \tau_i}$$
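A quick numerical check of this cancellation is sketched below. It assumes weights of $1/(\sigma^2\tau_i)$ for a no-intercept regression of returns on holding periods, and compares the weighted least-squares slope against the ratio of total log-return to total holding time. The function names and the sample data are hypothetical.

```python
def wls_slope(returns, taus, sigma2):
    """No-intercept weighted least-squares slope with weights 1 / (sigma2 * tau)."""
    weights = [1.0 / (sigma2 * t) for t in taus]
    numerator = sum(w * t * r for w, t, r in zip(weights, taus, returns))
    denominator = sum(w * t * t for w, t in zip(weights, taus))
    return numerator / denominator


def simplified_estimate(returns, taus):
    """Total log-return divided by total holding time."""
    return sum(returns) / sum(taus)


# Hypothetical log-returns and holding periods (years), for illustration only.
returns = [0.05, 0.18, 0.22, 0.55]
taus = [1.0, 4.0, 6.0, 12.0]

# Two very different values of sigma^2 give the same slope.
for sigma2 in (0.01, 4.0):
    print(f"sigma2={sigma2}: WLS slope = {wls_slope(returns, taus, sigma2):.6f}")
print(f"simplified form    = {simplified_estimate(returns, taus):.6f}")
```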

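The standard error can be backed out from the limiting distribution of our estimate $\hat{\mu}$:

$$\hat{\mu} \sim N\left(\mu, \frac{\sigma^2}{\sum_{i} \tau_i}\right)$$

$$\hat{\sigma}^2 = \frac{1}{N}\sum_{i=1}^{N} \frac{(r_i - \hat{\mu}\tau_i)^2}{\tau_i}, \qquad \hat{s}^2 = \frac{N}{N-1}\hat{\sigma}^2$$

- $\hat{\sigma}^2$ is the maximum likelihood estimate (MLE) for the return variance rate
- $\hat{s}^2$ is the ordinary least squares (OLS) estimate for the return variance rate

With this we can estimate the distribution, and therefore the (unbiased) standard errors, of our estimate $\hat{\mu}$:

$$\hat{\mu} \sim N\left(\mu, \frac{\hat{s}^2}{\sum_{i} \tau_i}\right), \qquad SE(\hat{\mu}) = \frac{\hat{s}}{\sqrt{\sum_{i} \tau_i}}$$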
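Putting the pieces together, a minimal Python sketch of the estimate and its standard error might look like the following. It assumes the variance rate is estimated from the squared residuals scaled by holding period with an $N - 1$ divisor, and that the variance of $\hat{\mu}$ is the variance rate divided by total holding time, which is what the model $r_i \sim N(\mu\tau_i, \sigma^2\tau_i)$ implies for independent sale pairs. The function name and data are hypothetical.

```python
import math


def estimate_with_standard_error(returns, taus):
    """Long-run average return and its standard error from sale pairs."""
    n = len(returns)
    total_tau = sum(taus)
    mu_hat = sum(returns) / total_tau
    # Squared residuals scaled by holding period: each term estimates sigma^2.
    scaled_sq_resid = sum((r - mu_hat * t) ** 2 / t for r, t in zip(returns, taus))
    s2 = scaled_sq_resid / (n - 1)  # assumption: N - 1 divisor, one fitted parameter
    standard_error = math.sqrt(s2 / total_tau)
    return mu_hat, standard_error


# Hypothetical log-returns and holding periods (years), for illustration only.
returns = [0.05, 0.18, 0.22, 0.55]
taus = [1.0, 4.0, 6.0, 12.0]

mu_hat, se = estimate_with_standard_error(returns, taus)
print(f"mu_hat = {mu_hat:.4f} per year, standard error = {se:.4f}")
```

Because the standard error shrinks with total holding time rather than with the raw count of sale pairs, a region with many short holding periods can be less precisely estimated than one with fewer but longer ones.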