Coping with Selection Effects: A Primer on Regression with Truncated Data. (arXiv:1901.10522v1 [astro-ph.IM])
Adam B. Mantz (KIPAC, Stanford)
The finite sensitivity of instruments or detection methods means that data
sets in many areas of astronomy, for example cosmological or exoplanet surveys,
are necessarily systematically incomplete. Such data sets, where the population
being investigated is of unknown size and only partially represented in the
data, are called “truncated” in the statistical literature. In many
circumstances, truncation can be accounted for through a relatively
straightforward modification to the model being fitted, provided that the model
can be extended to describe the population of undetected sources. Here I examine the problem of
regression using truncated data in general terms, and use a simple example to
show the impact of selecting a subset of potential data on the dependent
variable, on the independent variable, and on a second dependent variable that
is correlated with the variable of interest. Special circumstances in which
selection effects are ignorable are noted. I also comment on computational
strategies for performing regression with truncated data, as an extension of
methods that have become popular for the non-truncated case, and provide some
general recommendations.
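
As a rough illustration of the kind of model modification described above, the following minimal sketch simulates a linear relation, applies a hard detection threshold on the dependent variable, and compares a naive least-squares fit with a fit whose likelihood is renormalized by the detection probability at each x. Everything here is an assumption made for illustration rather than the paper's own example: the toy population, the threshold `y_min`, the Gaussian scatter, and all variable names are hypothetical, and the likelihood is the standard truncated-normal form conditional on x, which may be simpler than the general treatment the paper develops.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)

# Hypothetical toy population: y = a + b*x + Gaussian scatter.
a_true, b_true, sigma_true = 0.0, 1.0, 1.0
x_all = rng.uniform(-2.0, 2.0, size=5000)
y_all = a_true + b_true * x_all + rng.normal(0.0, sigma_true, size=x_all.size)

# Assumed selection on the dependent variable: only y > y_min is detected.
y_min = 0.0
detected = y_all > y_min
x, y = x_all[detected], y_all[detected]

# Naive fit that ignores the selection (np.polyfit returns slope, then intercept).
b_naive, a_naive = np.polyfit(x, y, 1)


def neg_log_like(params):
    """Negative log-likelihood of the detected points under truncation at y_min."""
    a, b, log_sigma = params
    sigma = np.exp(log_sigma)  # fit log(sigma) so the scatter stays positive
    mu = a + b * x
    log_pdf = norm.logpdf(y, loc=mu, scale=sigma)
    # Probability that a source at this x would have been detected at all.
    log_p_det = norm.logsf(y_min, loc=mu, scale=sigma)
    return -np.sum(log_pdf - log_p_det)


# Start the optimizer from the naive solution.
resid_std = np.std(y - (a_naive + b_naive * x))
start = np.array([a_naive, b_naive, np.log(resid_std)])
fit = minimize(neg_log_like, start, method="Nelder-Mead",
               options={"maxiter": 2000, "maxfev": 2000})
a_trunc, b_trunc = fit.x[0], fit.x[1]
sigma_trunc = np.exp(fit.x[2])

print(f"naive fit:     a = {a_naive:.2f}, b = {b_naive:.2f}")
print(f"truncated fit: a = {a_trunc:.2f}, b = {b_trunc:.2f}, sigma = {sigma_trunc:.2f}")
```

In this toy setup the naive slope should come out noticeably shallower than the input value, because low-x sources are only detected when they scatter upward, while the renormalized likelihood should land much closer to the input parameters. The sketch conditions on x and so does not attempt to model the size of the undetected population.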