Posts on Models for missing data

Non-ignorable missingness

Statistics is basically a missing data problem!

– Little 2013

Nearly all samples – whether by design or by accident – are incomplete. We very rarely make a complete census of all individuals in a population or all sites on a landscape. Sometimes we don’t collect, or can’t collect, complete information for individual samples or measures. For instance, we might know an animal was alive when it was last seen, so we know it survived at least that long, but know nothing about its current status. Or we might have information on the coverage of an invasive species down to a certain patch size, beyond which patches are too small or numerous to survey.

Disentangling concepts of status, trend, and trajectory

The terms status and trend are ubiquitous in resource monitoring and management settings. To be useful and robust, however, they require precise (mathematical) definitions. It has been my experience that misunderstanding these terms can lead to misapplication of model predictions and to researchers and managers drawing the wrong conclusions from the data. In this post we show how relatively simple, even intuitive, definitions for each of these terms clarify their intent and improve the insights provided by models of monitoring data.

Unequal inclusion probabilities

The Sonoran Desert is among the most extreme environments on Earth. Sampling in these remote, rugged landscapes requires a different approach. When the Park Service established monitoring in Organ Pipe Cactus National Monument, it selected sites based on the cost of travel across the broader landscape, visiting less “costly” sites with higher probability than more costly ones. The cost surface that defined each site’s probability of inclusion was developed using terrain data and a tool that estimates the time to travel to any arbitrary location on the landscape.
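
The idea of cost-driven inclusion probabilities can be sketched in a few lines of R. This is only an illustration, not the design actually used at Organ Pipe: the travel costs are simulated, the rule that inclusion probability is proportional to the inverse of cost is an assumption, and the selection step uses simple Poisson sampling (each site is included independently) rather than any spatially balanced method. All object names are hypothetical.

```r
set.seed(1)

# Hypothetical travel cost (hours) to each of 200 candidate sites.
n_sites <- 200
travel_cost <- runif(n_sites, min = 0.5, max = 10)

# Assumption: inclusion probability is proportional to 1 / cost, scaled so
# that the expected sample size is at most n_target (probabilities are
# capped at 1).
n_target <- 30
w <- 1 / travel_cost
incl_prob <- pmin(1, n_target * w / sum(w))

# Poisson sampling: include each site independently with its own probability.
selected <- runif(n_sites) < incl_prob

# Cheaper sites tend to be over-represented among the selected sites.
mean(travel_cost[selected])
mean(travel_cost)
```

Because sites enter the sample with different probabilities, the selected sites are not a miniature of the landscape, which is why the inclusion probabilities need to be carried forward into any analysis.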

Sampling and populations

We sample for a very practical reason. It’s usually impossible to get information on the whole population, so we use a sample to make inferences about the population. In our case, the population is typically all sites in a stratum or all sites – in all strata – at the scale of an entire park. Typically, the inference we seek entails three questions.

What’s the best estimate of the population mean?

We can generate a sample mean, \(\bar{x}\), from our sample. This is the best estimate of the population mean.
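
For reference, and assuming a simple random sample in which every site has the same inclusion probability, the sample mean of \(n\) observed values \(x_1, \dots, x_n\) (notation introduced here for illustration) is:

\[\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i\]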

Interpreting coefficients

Making sense of the effects of variables included as predictors

Some aspects of covariate effects are readily apparent – for instance, the sign of a coefficient in a model says at least something about the general directionality of the effect, positive or negative. However, a deeper understanding of a model typically requires inferences that go well beyond simple measures of the directionality or significance of effects – it requires understanding the size of effects.
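
As a rough sketch of the difference between direction and size, the example below simulates counts from a Poisson model with a log link and then reads the fitted coefficient in both ways. The data, the "true" values `b0` and `b1`, and the choice of a Poisson model are assumptions made purely for illustration.

```r
set.seed(42)

# Simulated data: one predictor and a count response.
n <- 2000
x <- rnorm(n)
b0 <- 0.5  # assumed intercept on the log scale
b1 <- 0.3  # assumed slope on the log scale
y <- rpois(n, lambda = exp(b0 + b1 * x))

fit <- glm(y ~ x, family = poisson(link = "log"))
b1_hat <- unname(coef(fit)["x"])

# Direction of the effect: just the sign of the coefficient.
sign(b1_hat)

# Size of the effect: with a log link, exp(b1_hat) is the multiplicative
# change in the expected count for a one-unit increase in x.
exp(b1_hat)

# The same effect on the response scale: expected counts at x = 0 and x = 1.
pred <- predict(fit, newdata = data.frame(x = c(0, 1)), type = "response")
diff(pred)
```

The sign alone says the expected count goes up with `x`; the exponentiated coefficient and the difference in predictions say by how much.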

Stratum-varying fixed effects

Assume we have three strata, \(s_0\), \(s_1\), and \(s_2\), where \(s_0\) is the “reference” stratum. In other words, \(s_0\) is the stratum for which both 0/1 stratum indicators are 0 in the indicator matrix below (the first row; the leading column of 1s is the intercept):

\[\begin{bmatrix} 1 & 0 & 0 \\ 1 & 1 & 0 \\ 1 & 0 & 1 \end{bmatrix}\]

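# linear predictor: the slope on x_1 gets an offset in strata s1 and s2
B_0 + (B_1 + B_1_s1_offset * s1 + B_1_s2_offset * s2) * x_1

# in stratum s0 (s1 = 0, s2 = 0)
B_0 + (B_1) * x_1

# in stratum s1 (s1 = 1, s2 = 0)
B_0 + (B_1 + B_1_s1_offset) * x_1

# in stratum s2 (s1 = 0, s2 = 1)
B_0 + (B_1 + B_1_s2_offset) * x_1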

# design matrix used by lm(y ~ x1 * x2): intercept, x1, x2, and x1:x2
library(tibble)
model.matrix(~ x1 * x2, tibble(x1 = runif(5), x2 = runif(5)))
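
To connect the stratum-specific expressions to a design matrix, here is a minimal sketch using a hypothetical data frame with two observations per stratum. The formula `~ x1 + x1:stratum` is one way (an assumption about how the model might be specified, not necessarily the specification used in the post) to get a common intercept, a reference slope, and one slope offset column per non-reference stratum.

```r
# Hypothetical data: three strata with two observations each.
d <- data.frame(
  stratum = factor(rep(c("s0", "s1", "s2"), each = 2),
                   levels = c("s0", "s1", "s2")),
  x1 = c(0.5, 1.0, 1.5, 2.0, 2.5, 3.0)
)

# Columns: intercept, x1 (slope in the reference stratum s0), and the
# x1-by-stratum columns that carry the slope offsets for s1 and s2.
X <- model.matrix(~ x1 + x1:stratum, data = d)
X
```

In rows belonging to `s0` the two offset columns are zero, so only `B_0` and `B_1` contribute; in rows belonging to `s1` or `s2` the matching offset column equals `x1`, which is what adds the stratum-specific offset to the slope.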

The offset term

Counts of things naturally scale with the length or duration of observation, the area sampled, and sampling intensity (McElreath, 2018: Statistical Rethinking: A Bayesian Course with Examples in R and Stan, Chapman and Hall/CRC). For instance, the longer the river stretch we survey, the more fish we’ll tend to find.

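Offset terms are used to model rates – e.g., counts per unit area or time. In the context of the model, the response stays a count; the linear predictor describes the rate, and the offset term converts that rate into the expected count for the amount of effort actually observed.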
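
A small simulation may help make this concrete. Assuming a Poisson model with a log link, an expected count \(\mu_i\), and an exposure \(E_i\) (here, kilometers of river surveyed), the offset enters as \(\log(\mu_i) = \log(E_i) + \beta_0\), so \(\exp(\beta_0)\) is the rate per unit of exposure. The data, the variable names, and the true density of 4 fish per km are all hypothetical.

```r
set.seed(7)

# Hypothetical survey: 500 river stretches of varying length (km).
n <- 500
stretch_km <- runif(n, min = 0.5, max = 5)

# Assumed true density: 4 fish per km, so expected counts scale with length.
true_rate <- 4
fish <- rpois(n, lambda = true_rate * stretch_km)

# log(stretch_km) enters as an offset, i.e. with its coefficient fixed at 1,
# so the intercept is the log of the rate (fish per km).
fit <- glm(fish ~ 1 + offset(log(stretch_km)), family = poisson)

rate_hat <- unname(exp(coef(fit)[1]))
rate_hat
```

In this intercept-only case the estimated rate is simply the total number of fish divided by the total length surveyed, which is a useful sanity check on what the offset is doing.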