PTOPALS

We have a set of observed population numbers \begin{equation} \opop=\left[\begin{array}{c} \opop_1\\ \vdots\\ \opop_\nop \end{array}\right] \end{equation} over \(\nop\) age intervals \([a_i,a_i+n_i)\) for \(i=1,\ldots,\nop\). Data are often tabulated for fixed intervals of one year (\(n_i=1\)) or five years (\(n_i=5\)), with the final age interval possibly open (\(n_\nop=\infty\)). The objective is to estimate the population at single years of age, \(\pop_x\) for \(x=0,1,2,\ldots,\maxage\), out to a maximum age \(\maxage\), when \(\opop\) contains noise added as part of a confidentialization process. In P-TOPALS (Dyrting, Flaxman, and Sharygin, 2022) the estimate is expressed relative to a prior age distribution \(\spop\) (the standard) \begin{equation} \pop_x = \spop_x \, \exp\left(\bmat_x\mmult\bvec\right) , \end{equation} where \(\bmat_x\) is a row vector of B-splines evaluated at age \(x\) and \(\bvec\) is a column vector of weights to be determined. This form allows the user to incorporate prior information about the age distribution into the estimation problem. This information might be specific knowledge of the components of population change (births, deaths, and net migration) that have been used to make a population estimate independent of the census data, or general views on the persistence of stationary features of the distribution due to the predominance of special populations with stable age distributions, or on the propagation `up' the age profile of non-stationary features associated with past major demographic events. The weights \(\bvec\) are found by maximizing the penalized log likelihood function \begin{equation} \label{eqn:loglik} \loglik(\bvec) = \ploglik(\bvec) - \frac{\pen}{2}\,\bvec'\mmult \diff_\order'\mmult \diff_\order\mmult\bvec \end{equation} where \(\diff_\order\) is the \(\order\)th order difference matrix and \(\pen\) is the roughness penalty.
The first term on the right-hand side of the above equation is the log likelihood of observing the tabulated distribution \(\opop\) conditional on the underlying true distribution being \(\pop\). We assume for simplicity that the injected noise can be approximated by a normal distribution, in which case \begin{equation} \label{eqn:ploglik} \ploglik(\bvec)=-\ones'\mmult \frac{1}{2\var}(\opop-\apop)^2 , \end{equation} where the variance \(\var\) can be age-dependent and the square and the division are taken element-wise. Here \(\ones\) is a vector of ones and \(\apop\) is the vector of \(\nop\) smoothed numbers given in terms of \(\pop_x\) by the sum \begin{equation} \label{eqn:apop} \apop_i = \sum_{a_i\le x\lt a_i+n_i} \pop_x,\quad i=1,\ldots,\nop . \end{equation} The B-splines are defined on a relatively fine grid of knots, and smoothing relative to the standard is achieved by the second term, which penalizes \(\order\)th order differences in the weights of adjacent splines. Assuming \(\loglik(\bvec)\) is maximized at a stationary point, we get the following nonlinear equation for \(\bvec\) \begin{equation} \label{eqn:nonlinear} \gmat'(\bvec)\mmult\vmat\mmult (\opop-\apop)-\pen \diff_\order'\mmult \diff_\order\mmult \bvec=0, \end{equation} where \begin{equation} \vmat = \diag(\apop/\var), \end{equation} and \(\gmat(\bvec)\) is the matrix of logarithmic derivatives \begin{equation} \gmat(\bvec) = \frac{1}{\apop}\grad{\apop}. \end{equation} This equation can be solved by iterated linear regressions as shown in Dyrting (2020). The penalty \(\pen\) can be set manually or chosen using one of the criteria discussed in Dyrting (2020). By default PTOPALS uses the penalty that minimizes the Bayesian Information Criterion.
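As an illustration of the aggregation sum and the normal log likelihood above, here is a minimal Python sketch. It is not the PTOPALS implementation: the function names `aggregate` and `gaussian_loglik` are hypothetical, the single-year populations are taken as given rather than computed from the splines, and additive constants of the normal density are dropped, as in the formula above.

```python
import math

def aggregate(pop, intervals):
    """Sum single-year populations pop[x] over each age interval [a, a + n).

    `intervals` is a list of (a, n) pairs; n may be math.inf for an open interval.
    """
    totals = []
    for a, n in intervals:
        upper = len(pop) if math.isinf(n) else min(a + n, len(pop))
        totals.append(sum(pop[a:upper]))
    return totals

def gaussian_loglik(observed, smoothed, var):
    """Return -sum_i (O_i - A_i)^2 / (2 var_i), with constants dropped."""
    return -sum((o - s) ** 2 / (2.0 * v)
                for o, s, v in zip(observed, smoothed, var))

# Toy data (made up for illustration): ages 0..11 with 100 people at each age.
pop = [100.0] * 12
intervals = [(0, 5), (5, 5), (10, math.inf)]
smoothed = aggregate(pop, intervals)
observed = [510.0, 495.0, 200.0]
loglik = gaussian_loglik(observed, smoothed, [1000.0] * 3)
```

The open final interval simply runs to the last single year of age held in `pop`, which plays the role of the maximum age.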

Extra Information

The user can alter the default settings of the method using an extrainfo string of the form PTOPALS:ExtraInfo. ExtraInfo is an optional comma-separated list of quoted strings of the form "Parameter=Value".
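To make the format concrete, the sketch below splits such a string into a method name and a dictionary of settings. It reflects one reading of the description above, not the parser used by the software: the function `parse_extrainfo` and the example string are hypothetical, and the sketch assumes that values never contain commas.

```python
def parse_extrainfo(text):
    """Split 'METHOD:"Name=Value","Name=Value"' into (method, settings)."""
    method, _, rest = text.partition(":")
    settings = {}
    for item in rest.split(","):
        item = item.strip().strip('"')
        if not item:
            continue  # no ExtraInfo was supplied
        name, sep, value = item.partition("=")
        if not sep:
            raise ValueError("expected Parameter=Value, got " + repr(item))
        settings[name.strip()] = value.strip()
    return method.strip(), settings

# Hypothetical example string, assembled from the parameters documented below.
method, settings = parse_extrainfo('PTOPALS:"MaxAge=100","KnotSpacing=5","Penalty=AIC"')
```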

Standard
By default PTOPALS uses a constant (flat) standard \(\spop_x\). This can be changed with the extrainfo string "Standard=PCurveHandle" where PCurveHandle is the name of the PCURVE object holding the standard.

Degree
By default PTOPALS uses cubic B-splines. The degree of the splines can be changed with the extrainfo string "Degree=\(\degree\)".
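For readers unfamiliar with B-splines, the following Python sketch evaluates a basis of a given degree with the Cox-de Boor recursion and uses it to form the estimate as the standard times the exponential of the spline sum. It is only a sketch under stated assumptions: the function names are hypothetical, and the boundary knots are repeated (a clamped knot vector), which may differ from how PTOPALS extends its knots.

```python
import math

def bspline_basis(x, knots, degree):
    """Values at x of all B-splines of the given degree on a clamped knot vector."""
    # Assumption: repeat each boundary knot `degree` extra times.
    t = [knots[0]] * degree + list(knots) + [knots[-1]] * degree
    # Degree-0 splines are indicators of the half-open knot intervals.
    basis = [1.0 if t[j] <= x < t[j + 1] else 0.0 for j in range(len(t) - 1)]
    if x == t[-1]:
        # Close the last non-empty interval on the right.
        last = max(j for j in range(len(t) - 1) if t[j] < t[j + 1])
        basis[last] = 1.0
    for d in range(1, degree + 1):
        raised = []
        for j in range(len(t) - d - 1):
            left_den = t[j + d] - t[j]
            right_den = t[j + d + 1] - t[j + 1]
            left = (x - t[j]) / left_den * basis[j] if left_den > 0 else 0.0
            right = (t[j + d + 1] - x) / right_den * basis[j + 1] if right_den > 0 else 0.0
            raised.append(left + right)
        basis = raised
    return basis

def estimate(x, standard_x, weights, knots, degree=3):
    """Standard at age x times exp of the weighted sum of B-splines at x."""
    row = bspline_basis(x, knots, degree)
    return standard_x * math.exp(sum(b * w for b, w in zip(row, weights)))

knots = [2.5 * i for i in range(45)]   # 0, 2.5, ..., 110
row = bspline_basis(50.3, knots, 3)    # 47 cubic B-splines under this construction
```

With all weights equal to zero the estimate reduces to the standard, which is why the weights can be read as smooth log-scale deviations from it.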

Order
By default PTOPALS uses a second order penalty. The order can be changed with the extrainfo string "Order=\(\order\)".
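The effect of the order can be seen by building the difference matrix explicitly. The Python sketch below is illustrative only (the names `difference_matrix` and `roughness` are hypothetical): it constructs the difference matrix by differencing the rows of an identity matrix repeatedly and evaluates the penalty term as half the penalty times the sum of squared differences of the weights.

```python
def difference_matrix(size, order):
    """Difference matrix of the given order acting on vectors of length `size`."""
    rows = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(order):
        # Each pass replaces the rows by differences of consecutive rows.
        rows = [[b - a for a, b in zip(rows[i], rows[i + 1])]
                for i in range(len(rows) - 1)]
    return rows

def roughness(weights, order, pen):
    """Penalty term: (pen / 2) * sum of squared order-th differences of weights."""
    diffs = [sum(d * w for d, w in zip(row, weights))
             for row in difference_matrix(len(weights), order)]
    return 0.5 * pen * sum(v * v for v in diffs)

second_order = difference_matrix(4, 2)
```

Under a second order penalty, weights that are linear in the spline index are not penalized at all, whereas a first order penalty leaves only constant weights unpenalized.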

Knots
There are two ways to set the B-spline knots. The first way is to set the maximum age using the extrainfo string "MaxAge=\(\maxage\)". The internal knots are then equally spaced upwards from \(0\) as follows \begin{eqnarray} x_0&=&0\\ x_i &=& x_{i-1}+\knotspacing,\quad i=1,\ldots,\nintern\\ x_{\nintern+1}&=&\maxage \end{eqnarray} where \begin{equation} \nintern=\ceil{\frac{\maxage}{\knotspacing}}-1 \end{equation} and the spacing \(\knotspacing\) is specified by the extrainfo string "KnotSpacing=\(\knotspacing\)". The default values are \(\maxage=110\) and \(\knotspacing=2.5\). The second way is to set the knots explicitly using the extrainfo string "Knots=\(x_0\)|\(x_1\)|\(x_2\)|\(\ldots\)|\(x_\nintern\)|\(\maxage\)".
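The knot rule above can be checked with a few lines of Python. This is a sketch rather than the software's own code: the function names are hypothetical, and the number formatting used for the "Knots=" string is an assumption.

```python
import math

def default_knots(max_age=110.0, spacing=2.5):
    """Knots 0, spacing, 2*spacing, ... followed by max_age as the final knot."""
    n_internal = math.ceil(max_age / spacing) - 1
    return [i * float(spacing) for i in range(n_internal + 1)] + [float(max_age)]

def knots_to_extrainfo(knots):
    """Format an explicit knot list as a 'Knots=x0|x1|...' setting."""
    return "Knots=" + "|".join(format(k, "g") for k in knots)

knots = default_knots()        # default MaxAge and KnotSpacing
uneven = default_knots(100, 3)  # last gap is shorter: ..., 96, 99, 100
```

With the defaults this gives 45 knots from 0 to 110. When the maximum age is not a multiple of the spacing, only the final gap is shorter than the others.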

Penalty
By default PTOPALS automatically calculates the penalty \(\pen\) by minimizing the Bayesian information criterion ("Penalty=BIC"). An alternative is the Akaike information criterion ("Penalty=AIC"). The penalty can also be specified explicitly using "Penalty=\(\pen\)".

Var
The noise variance is set using the extrainfo string "Var=VarPCurveName" where VarPCurveName is the name of a PCURVE object with the age-specific variances in the Numbers column. Alternatively a flat variance \(\beta\) can be specified using "Var=\(\beta\)". By default PTOPALS uses a flat variance with \(\beta=1000\).

MaxIts
The maximum number of iterations used to solve for \(\bvec\) can be set using the extrainfo string "MaxIts=\(\maxits\)". The default is \(\maxits=10\).