# Model validation

*Validity* represents the degree to which the method actually measures what it claims to
measure. *Reliability* is an integral part of validity, representing the absence of
measurement error.

*Concurrent validity* refers to the degree to which an outcome of a (simplified) method
correlates with the 'gold-standard' method; it is common that the outcome of the simplified method
is not the same as the outcome of the gold-standard method (for example,
SUV versus K_{i}).
When the outcome measure of both methods is the same (for example, K_{i} from
Patlak plot versus K_{i} from
compartmental model), we can refer to it as
*absolute validity*.
A Bland-Altman plot
(Bland & Altman, 1986) can be used to
report the absolute validation results and to detect systematic error and heteroscedasticity.
Intraclass correlation coefficients can also be used to report absolute
validation results. *Categorical* measures can be compared using *weighted kappa*
(de Vet et al., 2011).
The use of correlation coefficients (Pearson's or Spearman's) is not recommended, because they measure the strength of association between two methods rather than their agreement.
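As a rough illustration, the bias and 95% limits of agreement reported in a Bland-Altman analysis can be computed as sketched below. The function name and the K_{i} values are hypothetical, and the limits assume that the differences are approximately normally distributed.

```python
from statistics import mean, stdev

def bland_altman(method_a, method_b):
    """Return (bias, lower limit, upper limit) for paired measurements."""
    if len(method_a) != len(method_b) or len(method_a) < 2:
        raise ValueError("need two equally long series with at least 2 pairs")
    diffs = [a - b for a, b in zip(method_a, method_b)]
    bias = mean(diffs)   # systematic difference between the methods
    sd = stdev(diffs)    # sample SD of the paired differences
    return bias, bias - 1.96 * sd, bias + 1.96 * sd

# Hypothetical Ki estimates from two methods for the same five subjects
patlak = [0.021, 0.034, 0.018, 0.045, 0.029]
compartmental = [0.023, 0.035, 0.021, 0.049, 0.030]
bias, lower, upper = bland_altman(patlak, compartmental)
print(f"bias={bias:.4f}, limits of agreement=({lower:.4f}, {upper:.4f})")
```

Plotting the differences against the pairwise means would additionally show whether the spread grows with the magnitude of the measurement (heteroscedasticity).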

## Kinetic validation

Kinetic validation is done to find out whether the model is quantitatively consistent with the experimental tracer kinetic data. Inconsistent models are rejected; usually, the kinetic and biochemical information gained in the validation studies is useful for reformulating the model. Mismatches are revealed by non-random residuals: the difference between the experimental data and the response function fitted to the data shows a time-dependent pattern (Motulsky and Ransnas, 1987).
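One possible way to check residuals for a time-dependent pattern is a runs test on their signs. The sketch below uses the normal approximation of the Wald-Wolfowitz runs test; the function name is hypothetical, and the approximation is only reasonable when there are enough time frames.

```python
import math

def runs_test(residuals):
    """Return (observed runs, expected runs, z score) for residual signs."""
    signs = [r > 0 for r in residuals if r != 0]  # zero residuals are skipped
    n_pos = sum(signs)
    n_neg = len(signs) - n_pos
    n = n_pos + n_neg
    if n_pos == 0 or n_neg == 0:
        raise ValueError("residuals must contain both signs")
    # A new run starts whenever the sign changes between neighbours
    runs = 1 + sum(1 for a, b in zip(signs, signs[1:]) if a != b)
    expected = 2.0 * n_pos * n_neg / n + 1.0
    variance = (2.0 * n_pos * n_neg * (2.0 * n_pos * n_neg - n)
                / (n * n * (n - 1.0)))
    z = (runs - expected) / math.sqrt(variance)
    return runs, expected, z

# Residuals that are first positive and then negative: too few runs
runs, expected, z = runs_test([1.2, 0.8, 0.5, 0.3, -0.2, -0.6, -0.9, -1.1])
print(runs, expected, round(z, 2))
```

A strongly negative z score (fewer runs than expected) suggests that the fitted curve deviates systematically from the data.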

The standard errors of the estimates also need to be considered to determine the
*reliability* of the estimated parameters.
Large standard errors of the estimates usually mean that the data are very noisy or that the number
of variable parameters in the model is more than necessary (over-parameterization). Statistical
methods (Schwarz criterion, Akaike information criterion, F test; see
Landaw and DiStefano, 1984 or
Glatting et al. 2007) that consider the goodness of
fit versus the number of parameters in the model should be used to guide the selection among
alternative models.
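The trade-off between goodness of fit and number of parameters can be sketched with the commonly used least-squares forms of the Akaike and Schwarz criteria. This is a minimal sketch, assuming unweighted fits with Gaussian errors; the function names and the sums of squares are hypothetical, and some authors count the noise variance as one additional parameter.

```python
import math

def aic(ss_res, n_samples, n_params):
    """Akaike information criterion for a least-squares fit."""
    return n_samples * math.log(ss_res / n_samples) + 2 * n_params

def aicc(ss_res, n_samples, n_params):
    """AIC with the small-sample correction (needs n_samples > n_params + 1)."""
    correction = 2.0 * n_params * (n_params + 1) / (n_samples - n_params - 1)
    return aic(ss_res, n_samples, n_params) + correction

def schwarz(ss_res, n_samples, n_params):
    """Schwarz criterion (also known as BIC) for a least-squares fit."""
    return n_samples * math.log(ss_res / n_samples) + n_params * math.log(n_samples)

# Hypothetical fits to 24 time frames: the 4-parameter model lowers the
# residual sum of squares only slightly compared with the 2-parameter model
aic_two = aic(12.0, 24, 2)
aic_four = aic(11.5, 24, 4)
print("prefer 2 parameters" if aic_two < aic_four else "prefer 4 parameters")
```

The model with the lowest criterion value is preferred; here the small improvement in fit does not justify the two extra parameters.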

The noise level in the data affects the number of parameters that may be estimated from the data and usually is the primary determinant of the precision in the estimated parameters. To determine the range of optimized parameter values, Monte Carlo simulations are recommended. The alternative approach, using covariance matrices, is faster and often used, but it requires the assumption that the process being modelled is linear. Since most of the kinetic models of metabolism are nonlinear, this assumption is invalid (Graham, 1985).
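A Monte Carlo estimate of parameter precision can be sketched as follows. This is a simplified illustration, not a kinetic model: a mono-exponential curve stands in for the nonlinear model, the noise is assumed Gaussian with constant SD, and the golden-section search assumes a single minimum within the search range. All names and values are hypothetical.

```python
import math
import random
from statistics import mean, stdev

def model(t, amplitude, rate):
    """Mono-exponential stand-in for a nonlinear kinetic model."""
    return amplitude * math.exp(-rate * t)

def fit_monoexp(times, values, lo=1e-4, hi=2.0, iterations=40):
    """Least-squares fit: golden-section search over the rate constant,
    with the amplitude solved in closed form for each trial rate."""
    def best_amplitude(rate):
        basis = [math.exp(-rate * t) for t in times]
        return sum(v * b for v, b in zip(values, basis)) / sum(b * b for b in basis)

    def sum_of_squares(rate):
        amplitude = best_amplitude(rate)
        return sum((v - model(t, amplitude, rate)) ** 2
                   for t, v in zip(times, values))

    golden = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    for _ in range(iterations):
        c = b - golden * (b - a)
        d = a + golden * (b - a)
        if sum_of_squares(c) < sum_of_squares(d):
            b = d
        else:
            a = c
    rate = 0.5 * (a + b)
    return best_amplitude(rate), rate

def monte_carlo_rate(times, amplitude, rate, noise_sd, n_runs=200, seed=1):
    """Simulate noisy curves, refit each, and summarise the rate estimates."""
    rng = random.Random(seed)
    clean = [model(t, amplitude, rate) for t in times]
    rates = []
    for _ in range(n_runs):
        noisy = [c + rng.gauss(0.0, noise_sd) for c in clean]
        rates.append(fit_monoexp(times, noisy)[1])
    rates.sort()
    lower = rates[int(0.025 * n_runs)]      # approximate 2.5th percentile
    upper = rates[int(0.975 * n_runs) - 1]  # approximate 97.5th percentile
    return mean(rates), stdev(rates), (lower, upper)

times = [1, 2, 4, 6, 8, 10, 15, 20, 30, 40]
mc_mean, mc_sd, (mc_low, mc_high) = monte_carlo_rate(times, 100.0, 0.1, noise_sd=2.0)
print(f"rate: mean={mc_mean:.4f}, SD={mc_sd:.4f}, range=({mc_low:.4f}, {mc_high:.4f})")
```

In practice the simulated noise should mimic the noise of the measured data, and the simulation would be repeated with the model and parameter values of interest.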

The choice of the model should not be made solely according to identifiability criteria (e.g. F test). A non-identifiable model may still permit the determination of the range of some meaningful parameters, whereas an identifiable model may yield exact values for parameters that have no physical or physiological meaning.

To describe the movements of all known celestial bodies, Ptolemy used abstract mathematical functions (cycles and epicycles). He obtained a remarkably good fit.

However, the good fit did not help in predicting future observations.

Aristarchus of Samos (~310-230 BC), and later Copernicus (AD 1473-1545), based their computations on a physical basis, the heliocentric hypothesis. However, their fit was not better than Ptolemy's.

If the correlation between two estimated parameters is very high, then the likelihood of determining a unique set of values for the parameters is low (Budinger et al., 1985); this suggests that either one or more of the model parameters should be fixed or the model should be reduced.
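To illustrate, parameter correlations can be derived from a covariance matrix of the estimates, for example one computed from Monte Carlo results. The sketch below flags highly correlated pairs; the function names, the covariance values and the threshold of 0.95 are hypothetical choices, not recommendations from the cited work.

```python
import math

def correlation_from_covariance(cov):
    """Convert a parameter covariance matrix into a correlation matrix."""
    size = len(cov)
    sds = [math.sqrt(cov[i][i]) for i in range(size)]
    return [[cov[i][j] / (sds[i] * sds[j]) for j in range(size)]
            for i in range(size)]

def correlated_pairs(cov, names, threshold=0.95):
    """List parameter pairs whose absolute correlation reaches the threshold."""
    corr = correlation_from_covariance(cov)
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if abs(corr[i][j]) >= threshold:
                pairs.append((names[i], names[j], corr[i][j]))
    return pairs

# Hypothetical covariance matrix of three rate constant estimates
names = ["K1", "k2", "k3"]
cov = [[4.0e-4, 5.8e-4, 2.0e-5],
       [5.8e-4, 9.0e-4, -3.0e-5],
       [2.0e-5, -3.0e-5, 1.0e-4]]
flagged = correlated_pairs(cov, names)
print(flagged)
```

In this made-up example only K1 and k2 are flagged, which would suggest fixing one of them or estimating their ratio instead.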

## Biochemical validation

A critical test of the validity of a model is to compare the biochemical results predicted from the kinetic data by the model with those measured directly by chemical means (e.g. microdialysis or biopsy samples). These studies not only provide the time course of total radioactivity in plasma and in the target organ, but also differentiate between the original tracer and its label-carrying metabolites.

Intervention studies can be performed to estimate the sensitivity of the kinetic data and estimates to the physiologic parameter of interest. Model parameters of interest should change in the proper direction and by an appropriate magnitude in response to a variety of biologic stimuli. In addition, it is useful to test whether the parameters of interest do not change in response to a perturbation in a different factor, e.g. does an estimate of receptor density remain unchanged when blood flow is increased (Carson, 1991).

The assumptions made in the reduction of the model should be tested, either by explicit experimentation or at least by computer simulation. Simulations provide the magnitude of bias in the parameter estimates due to errors in various assumptions.

The absolute accuracy of model parameters should be tested with a "gold standard", if one is available for the measurement of interest (e.g., the microsphere technique for perfusion measurements). Note that factors such as scanner resolution and changes in the physical meaning of parameters due to model reduction can make the comparison of results difficult.

The validation of a PET method for a neuroreceptor assay can be done by comparing the results
with those obtained from *in vitro* binding experiments. However, caution must be exercised
in these comparisons. Binding discrepancies between *in vitro* and *in vivo*
conditions have been observed. In *in vitro* receptor binding experiments, the concentration
of the unbound ligand is uniform, while the uniformity of the unbound ligand concentration in tissue
*in vivo* is less likely to be maintained owing to a multiplicity of factors such as the fast
binding-rebinding phenomenon in the synaptic zone and diffusional restrictions.
In addition, the ligand binding characteristics of receptor systems have been shown to have large
species-to-species differences in both *K_{D}* and *B_{max}* values (Huang et al., 1986).

In some cases, the PET method can be compared to the traditional Fick's method, if venous blood sampling from the organ of interest is possible.

## Test/retest studies

Test/retest studies are conducted by repeating the PET study twice for a group of subjects in
similar conditions. A test/retest study provides a simple estimate of the *within-subject
variability* (lack of *repeatability*) of the method. *Reproducibility* is tested
by a replication study, which must be completely independent, ideally utilizing different
instruments and analysis tools.

However, *repeatability* or *reproducibility* is only one aspect of the problem.
A simple ratio may provide more reproducible results than a model-based method, yet the model-based
method might reveal true between-subject or treatment differences that are ignored or "normalized"
by empirical methods.

Repeatability is estimated by calculating the mean and standard deviation of the absolute values
of the difference between test and retest values. The *repeatability coefficient* (*RC*)
is recommended by the British Standards Institution (1979), and when applied to PET test-retest
studies, it is defined as twice the standard deviation of the differences between test and retest values (Bland and Altman, 1986). To facilitate comparisons across regions of interest, *test-retest variability* (*TRV*) can be calculated as the absolute value of the difference between test and retest values, divided by the mean of both measurements (Parsey et al., 2000). The mean *TRV* percentage ± SD should be reported.
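The two repeatability measures can be sketched as below. The repeatability coefficient is taken here as twice the standard deviation of the test-retest differences, following Bland and Altman (1986); using the sample SD around the mean difference is an assumption of this sketch. The function names and the data are hypothetical.

```python
from statistics import mean, stdev

def repeatability_coefficient(first, second):
    """Twice the sample SD of the paired test-retest differences."""
    diffs = [a - b for a, b in zip(first, second)]
    return 2.0 * stdev(diffs)

def trv_percent(first, second):
    """Mean and SD of |test - retest| divided by the pair mean, in percent."""
    values = [abs(a - b) / ((a + b) / 2.0) * 100.0
              for a, b in zip(first, second)]
    return mean(values), stdev(values)

# Hypothetical regional estimates from two scans of three subjects
scan1 = [2.0, 3.0, 4.0]
scan2 = [2.2, 2.8, 4.4]
rc = repeatability_coefficient(scan1, scan2)
trv_mean, trv_sd = trv_percent(scan1, scan2)
print(f"RC={rc:.3f}, TRV={trv_mean:.1f}% ± {trv_sd:.1f}%")
```

Because TRV is scaled by the pair mean, it can be compared between regions with very different parameter values, whereas RC is in the units of the measurement.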

The *intraclass correlation coefficient* (*ICC*)
(Shrout and Fleiss, 1979;
Parsey et al., 2000) is a useful
measure of reliability, as this statistical parameter compares the within-subject (WS) variability
to the between-subject (BS) variability in repeated observations.

In the one-way ANOVA model, *BSMSS* is the mean sum of squares between subjects,
*WSMSS* is the mean sum of squares within subjects, *K* is the number of repeated
observations (*K=2* in test-retest study) and *N* is the number of subjects:

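$$ \mathit{ICC} = \frac{\mathit{BSMSS} - \mathit{WSMSS}}{\mathit{BSMSS} + (K-1)\,\mathit{WSMSS}} $$

$$ \mathit{BSMSS} = \frac{K}{N-1}\sum_{n=1}^{N}\left(y_{n.} - y_{..}\right)^2 , \qquad \mathit{WSMSS} = \frac{1}{N(K-1)}\sum_{n=1}^{N}\sum_{k=1}^{K}\left(y_{nk} - y_{n.}\right)^2 $$

where *y_{nk}* is the *k*th observation of subject *n*, *y_{n.}* is the mean of observations for subject *n*, and *y_{..}* is the mean of all observations.

A negative *ICC* value indicates that more differences are observed within than between
subjects. In a test-retest study (*K=2*), *ICC* ranges from -1 (no reliability, i.e. *BSMSS=0*) to 1 (maximum
reliability, achieved in the case of identity between test and retest, i.e. *WSMSS=0*)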
(Parsey et al., 2000).
As a rule, the method with the highest reliability (highest *ICC*) should be chosen
(Laruelle, 1999).
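A minimal sketch of the one-way ANOVA calculation is given below, assuming the form ICC = (BSMSS - WSMSS) / (BSMSS + (K-1) WSMSS), which corresponds to ICC(1,1) of Shrout and Fleiss (1979). The function name and the data layout (one list of K observations per subject) are hypothetical.

```python
def icc_one_way(data):
    """One-way ANOVA ICC for N subjects with K repeated observations each."""
    n_subjects = len(data)
    k = len(data[0])
    if n_subjects < 2 or k < 2 or any(len(row) != k for row in data):
        raise ValueError("need >= 2 subjects, each with the same K >= 2 observations")
    grand_mean = sum(sum(row) for row in data) / (n_subjects * k)
    subject_means = [sum(row) / k for row in data]
    # Mean sum of squares between subjects
    bsmss = k * sum((m - grand_mean) ** 2 for m in subject_means) / (n_subjects - 1)
    # Mean sum of squares within subjects
    wsmss = sum((y - m) ** 2
                for row, m in zip(data, subject_means)
                for y in row) / (n_subjects * (k - 1))
    return (bsmss - wsmss) / (bsmss + (k - 1) * wsmss)

# Hypothetical test-retest pairs for four subjects
pairs = [[2.0, 2.2], [3.0, 2.8], [4.0, 4.4], [5.1, 4.9]]
print(round(icc_one_way(pairs), 3))
```

The limiting cases follow directly: identical test and retest values give WSMSS = 0 and ICC = 1, while equal subject means give BSMSS = 0 and, for K = 2, ICC = -1.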

Ludbrook (2010) has reviewed linear regression methods for comparing two methods or two measurers.

Any diurnal variation, or stress-induced changes, should be controlled in test-retest studies.

## Interobserver variability

A good parameter should also have low *interobserver variability*. Interobserver variability
can be assessed by using Bland-Altman analysis
(Bland & Altman, 1986) and Lin's
concordance correlation coefficient (*CCC*)
(Lin, 1989;
Barnhart et al., 2002).
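Lin's concordance correlation coefficient for two observers can be sketched as follows, using the variances and covariance with a 1/n divisor as in Lin (1989). The function name and the example values are hypothetical.

```python
def concordance_correlation(x, y):
    """Lin's CCC: 2*s_xy / (s_x^2 + s_y^2 + (mean_x - mean_y)^2)."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equally long series with at least 2 pairs")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    var_x = sum((a - mean_x) ** 2 for a in x) / n
    var_y = sum((b - mean_y) ** 2 for b in y) / n
    cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    return 2.0 * cov_xy / (var_x + var_y + (mean_x - mean_y) ** 2)

# Observer B reads consistently one unit higher than observer A
observer_a = [1.0, 2.0, 3.0]
observer_b = [2.0, 3.0, 4.0]
ccc = concordance_correlation(observer_a, observer_b)
print(round(ccc, 4))
```

In this example Pearson's correlation would be 1, but the CCC is lower because the constant offset between the observers counts as disagreement.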

Parameters for assessing the effectiveness of therapy should, in addition to low interobserver
variability, also have the ability to differentiate between treatment responders and nonresponders.
These qualities can be combined in the *variability effect coefficient* (*VEC*)
(Benz et al., 2008).

## References:

Barnhart HX, Haber M, Song J. Overall concordance correlation coefficient for evaluating
agreement among multiple observers. *Biometrics* 2002; 58:1020-1027. doi:
10.1111/j.0006-341X.2002.01020.x.

Barnhart HX, Barboriak DP. Applications of the repeatability of quantitative imaging biomarkers:
a review of statistical analysis of repeat data sets. *Transl Oncol.* 2009; 2(4): 231-235.
doi: 10.1593/tlo.09268.

Benz MR, Evilevitch V, Allen-Auerbach MS, Eilber FC, Phelps ME, Czernin J, Weber WA.
Treatment monitoring by ^{18}F-FDG PET/CT in patients with sarcomas: interobserver
variability of quantitative parameters in treatment-induced changes in histopathologically
responding and nonresponding tumors. *J Nucl Med.* 2008; 49(7): 1038-1046.
doi: 10.2967/jnumed.107.050187.

Bertoldo A, Cobelli C. Data modeling and simulation. *In:* Feng DD (ed.):
*Biomedical Information Technology.* Elsevier, 2008, pp 115-136.

Bland JM, Altman DG. Statistical methods for assessing agreement between two
methods of clinical measurement. *Lancet* 1986; i: 307-310.
doi: 10.1016/S0140-6736(86)90837-8.

British Standards Institution. Precision of test methods I: Guide for the determination and
reproducibility for a standard test method. *British Standard 5497*; 1979.

Budinger TF, Huesman RH, Knittel B, Friedland RP, Derenzo SE (1985):
Physiological modeling of dynamic measurements of metabolism using positron emission tomography.
*In:* The Metabolism of the Human Brain Studied with Positron Emission Tomography.
(Eds: Greitz T et al.) Raven Press, New York, 165-183.

de Vet HCW, Terwee CB, Mokkink LB, Knol DL (eds.): *Measurement in Medicine: A Practical
Guide*. Cambridge University Press, 2011. ISBN:
9780521118200.

Gerke O, Vilstrup MH, Segtnan EA, Halekoh U, Høilund-Carlsen PF.
How to assess intra- and inter-observer agreement with quantitative PET using variance component
analysis: a proposal for standardisation. *BMC Med Imaging* 2016; 16:54.
doi: 10.1186/s12880-016-0159-3.

Glatting G, Kletting P, Reske SN, Hohl K, Ring C. Choosing the optimal fit function: Comparison
of the Akaike information criterion and the F-test. *Med Phys.* 2007; 34(11): 4285-4292.
doi: 10.1118/1.2794176.

Huang EP, Wang XF, Choudhury KR, McShane LM, Gönen M, Ye J, Buckler AJ, Kinahan PE, Reeves AP,
Jackson EF, Guimaraes AR, Zahlmann G. Meta-analysis of the technical performance of an imaging
procedure: guidelines and statistical methodology. *Stat Methods Med Res.* 2015; 24(1):
141-174. doi: 10.1177/0962280214537394.

Huang SC, Phelps ME (1986): Principles of tracer kinetic modeling in positron emission tomography
and autoradiography. *In:* Positron Emission Tomography and Autoradiography: Principles and
Applications for the Brain and Heart.
(Eds: Phelps M, Mazziotta J, Schelbert H) Raven Press, New York, 287-346.

Johnson M, Karanikolas BDW, Priceman SJ, Powell R, Black ME, Wu H-M, Czernin J, Huang S-C, Wu L.
Titration of variant HSV1-tk gene expression to determine the sensitivity of ^{18}F-FHBG PET
imaging in a prostate tumor. *J Nucl Med.* 2009; 50(5): 757-764.
doi: 10.2967/jnumed.108.058438.

Kessler LG, Barnhart HX, Buckler AJ, Choudhury KR, Kondratovich MV, Toledano A, Guimaraes AR,
Filice R, Zhang Z, Sullivan DC. The emerging science of quantitative imaging biomarkers terminology
and definitions for scientific studies and regulatory submissions.
*Stat Methods Med Res.* 2015; 24(1): 9-26.
doi: 10.1177/0962280214537333.

Lin LI. A concordance correlation coefficient to evaluate reproducibility.
*Biometrics.* 1989; 45: 255-268.
doi: 10.2307/2532051.

Ludbrook J. Linear regression analysis for comparing two measurers or methods of measurements:
But which regression? *Clin Exp Pharmacol Physiol.* 2010; 37: 692-699. doi:
10.1111/j.1440-1681.2010.05376.x.

Ogden RT, Ojha A, Erlandsson K, Oquendo MA, Mann JJ, Parsey RV. In vivo quantification of
serotonin transporters using [^{11}C]DASB and positron emission tomography in humans:
modeling considerations. *J Cereb Blood Flow Metab* 2007; 27: 205-217.
doi: 10.1038/sj.jcbfm.9600329.

Parsey RV, Slifstein M, Hwang D-R, Abi-Dargham A, Simpson N, Mawlawi O, Guo N-N, Van Heertum R,
Mann JJ, Laruelle M. Validation and reproducibility of measurement of 5-HT_{1A} receptor
parameters with [*carbonyl*-^{11}C]WAY-100635 in humans: comparison of arterial
and reference tissue input functions. *J Cereb Blood Flow Metab* 2000; 20: 1111-1133. doi:
10.1097/00004647-200007000-00011.

Phair RD. Development of kinetic models in the nonlinear world of molecular cell biology.
*Metabolism* 1997; 46:1489-1495.
doi: 10.1016/s0026-0495(97)90154-2.

Raunig DL, McShane LM, Pennello G, Gatsonis C, Carson PL, Voyvodic JT, Wahl RL, Kurland BF,
Schwarz AJ, Gönen M, Zahlmann G, Kondratovich MV, O'Donnell K, Petrick N, Cole PE, Garra B,
Sullivan DC. Quantitative imaging biomarkers: a review of statistical methods for technical
performance assessment. *Stat Methods Med Res.* 2015; 24(1): 27-67.
doi: 10.1177/0962280214537344.

Riaño Barros DA, McGinnity CJ, Rosso L, Heckemann RA, Howes OD, Brooks DJ, Duncan JS,
Turkheimer FE, Koepp MJ, Hammers A. Test-retest reproducibility of cannabinoid-receptor type 1
availability quantified with the PET ligand [^{11}C]MePPEP.
*Neuroimage* 2014; 97: 151-162. doi:
10.1016/j.neuroimage.2014.04.020.

Shrout PE, Fleiss JL. Intraclass correlations: uses in assessing rater reliability.
*Psychol Bull* 1979; 86: 420-428.
doi: 10.1037/0033-2909.86.2.420

Tags: Modeling, Validation, Bland-Altman

Updated at: 2020-12-10

Created at: 2011-11-22

Written by: Vesa Oikonen, Kaisa Liukko, Jarkko Johansson