Model validation

Validity represents the degree to which the method actually measures what it claims to measure. Reliability is an integral part of validity, representing the absence of measurement error.

Concurrent validity refers to the degree to which the outcome of a (simplified) method correlates with that of the ‘gold-standard’ method; the outcome measure of the simplified method is often not the same as that of the gold-standard method (for example, SUV versus Ki). When the outcome measure of both methods is the same (for example, Ki from a Patlak plot versus Ki from a compartmental model), we can refer to it as absolute validity. A Bland-Altman plot (Bland & Altman, 1986) can be used to report the absolute validation results and to reveal systematic error and heteroscedasticity. Intraclass correlation coefficients can also be used to report absolute validation results. Categorical measures can be compared using weighted kappa (de Vet et al., 2011). The use of correlation coefficients (Pearson’s or Spearman’s) is not recommended, because correlation measures association, not agreement.
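As a minimal sketch of such an absolute validation, the bias and 95% limits of agreement of a Bland-Altman analysis can be computed as below; the Ki values are hypothetical illustration data, not from any actual study.

```python
# Bland-Altman analysis: simplified method vs gold-standard method.
# Ki values (1/min) below are hypothetical illustration data.
import statistics

ki_gold       = [0.031, 0.042, 0.055, 0.038, 0.061, 0.049, 0.027, 0.044]
ki_simplified = [0.033, 0.040, 0.058, 0.041, 0.059, 0.052, 0.030, 0.043]

diffs = [s - g for s, g in zip(ki_simplified, ki_gold)]
means = [(s + g) / 2 for s, g in zip(ki_simplified, ki_gold)]

bias = statistics.mean(diffs)    # systematic error (mean difference)
sd   = statistics.stdev(diffs)   # SD of the differences
loa_low  = bias - 1.96 * sd      # lower 95% limit of agreement
loa_high = bias + 1.96 * sd      # upper 95% limit of agreement

print(f"bias = {bias:.4f}, limits of agreement = [{loa_low:.4f}, {loa_high:.4f}]")
```

In a Bland-Altman plot, the differences are plotted against the pairwise means; a trend of the differences against the means would indicate heteroscedasticity.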

Kinetic validation

Kinetic validation is performed to determine whether the model is quantitatively consistent with the experimental tracer kinetic data. Inconsistent models are rejected; usually, the kinetic and biochemical information gained in the validation studies is useful for reformulating the model. Mismatches are revealed by the existence of non-random residuals: the difference between the experimental data and the response function fitted to the data shows a time-dependent pattern (Motulsky and Ransnas, 1987).

The standard errors of the estimates also need to be considered to determine the reliability of the estimated parameters. Large standard errors of the estimates usually mean that the data are very noisy or that the number of variable parameters in the model is more than necessary (over-parameterization). Statistical methods (Schwarz criterion, Akaike criterion, F test; see Landaw and DiStefano 1984 or Glatting et al. 2007) that weigh the goodness of fit against the number of parameters in the model should be used to guide the selection among alternative models.
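As an illustration of such goodness-of-fit versus parsimony criteria, the Akaike criterion for least-squares fits can be computed from the residual sum of squares; the RSS values and model configurations below are hypothetical.

```python
# Akaike information criterion for least-squares fits with Gaussian errors:
# AIC = n*ln(RSS/n) + 2k, where n is the number of data points, RSS the
# residual sum of squares, and k the number of fitted parameters.
# RSS values below are hypothetical illustration data.
import math

def aic(n, rss, k):
    """AIC for a least-squares fit; lower values are preferred."""
    return n * math.log(rss / n) + 2 * k

n = 30                             # number of time frames in the TAC
aic_1tc = aic(n, rss=4.8, k=2)     # one-tissue model (K1, k2)
aic_2tc = aic(n, rss=3.9, k=4)     # two-tissue model (K1, k2, k3, k4)

# The better fit of the larger model must outweigh the penalty
# for its two extra parameters.
best = "2TC" if aic_2tc < aic_1tc else "1TC"
```

With these hypothetical numbers the reduction in RSS outweighs the parameter penalty, so the two-tissue model would be selected.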

The noise level in the data affects the number of parameters that may be estimated from the data and usually is the primary determinant of the precision in the estimated parameters. To determine the range of optimized parameter values, Monte Carlo simulations are recommended. The alternative approach, using covariance matrices, is faster and often used, but it requires the assumption that the process being modelled is linear. Since most of the kinetic models of metabolism are nonlinear, this assumption is invalid (Graham, 1985).
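The Monte Carlo approach can be sketched as follows: refit the model to many noise realizations of a simulated noise-free curve and inspect the spread of the estimates. The mono-exponential model, rate constant, and noise level below are hypothetical simplifications.

```python
# Monte Carlo estimation of parameter precision: add noise to a simulated
# noise-free curve, refit, and repeat. Mono-exponential model and noise
# level are hypothetical illustration choices.
import math
import random

random.seed(1)
t = [i * 2.0 for i in range(1, 31)]            # mid-frame times (min)
k_true = 0.10                                  # true rate constant (1/min)
noise_free = [math.exp(-k_true * ti) for ti in t]

def fit_k(y):
    """Grid-search least-squares fit of k in y(t) = exp(-k*t)."""
    best_k, best_rss = None, float("inf")
    for i in range(1, 500):
        k = i * 0.001
        rss = sum((yi - math.exp(-k * ti)) ** 2 for yi, ti in zip(y, t))
        if rss < best_rss:
            best_k, best_rss = k, rss
    return best_k

estimates = []
for _ in range(100):                           # 100 noise realizations
    noisy = [c + random.gauss(0.0, 0.02) for c in noise_free]
    estimates.append(fit_k(noisy))

mean_k = sum(estimates) / len(estimates)
sd_k = math.sqrt(sum((e - mean_k) ** 2 for e in estimates) / (len(estimates) - 1))
```

The standard deviation of the estimates over the realizations gives the precision of the parameter without any linearity assumption.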

The choice of the model should not be based solely on identifiability criteria (e.g. the F test). A non-identifiable model may still permit the determination of the range of some meaningful parameters, while an identifiable model may yield exact values for parameters that are deprived of any physical or physiological meaning.

To describe the movements of all known celestial bodies, Ptolemy used abstract mathematical functions (cycles and epicycles) and obtained a remarkably good fit. However, the good fit did not help in predicting future observations. Aristarchus of Samos (~310-230 BC), and later Copernicus (AD 1473-1543), based their computations on a physical basis, the heliocentric hypothesis; yet their fit was not better than Ptolemy’s.

If the correlation between two estimated parameters is very high, then the likelihood of determining a unique set of values for the parameters is low (Budinger et al., 1985); this suggests that either one or more of the model parameters should be fixed or the model should be reduced.

Biochemical validation

A critical test of the validity of a model is to compare the biochemical results predicted from the kinetic data by the model with those measured directly by chemical means (e.g. microdialysis or biopsy samples). These studies not only provide the time course of total radioactivity in plasma and in the target organ, but also differentiate between the original tracer and its label-carrying metabolites.

Intervention studies can be performed to estimate the sensitivity of the kinetic data and estimates to the physiologic parameter of interest. Model parameters of interest should change in the proper direction and by an appropriate magnitude in response to a variety of biologic stimuli. In addition, it is useful to confirm that the parameters of interest do not change in response to a perturbation of a different factor; for example, does an estimate of receptor density remain unchanged when blood flow is increased (Carson, 1991)?

The assumptions made in the reduction of the model should be tested, either by explicit experimentation or at least by computer simulation. Simulations provide the magnitude of bias in the parameter estimates due to errors in various assumptions.

The absolute accuracy of model parameters should be tested against a “gold standard”, if one is available for the measurement of interest (e.g., the microsphere technique for perfusion measurements). It should be noted that factors such as scanner resolution, and changes in the physical meaning of parameters due to model reduction, can make the comparison of results difficult.

The validation of a PET method for a neuroreceptor assay can be done by comparing the results with those obtained from in vitro binding experiments. However, caution must be exercised in these comparisons. Binding discrepancies between in vitro and in vivo conditions have been observed. In in vitro receptor binding experiments, the concentration of the unbound ligand is uniform, while the uniformity of the unbound ligand concentration in tissue in vivo is less likely to be maintained owing to a multiplicity of factors such as the fast binding-rebinding phenomenon in the synaptic zone and diffusional restrictions. In addition, the ligand binding characteristics of receptor systems have been shown to have large species-to-species differences in both KD and Bmax values (Huang et al., 1986).

In some cases, a PET method can be compared to the traditional Fick method, if venous blood sampling from the organ of interest is possible.

Test/retest studies

Test/retest studies are conducted by repeating the PET study twice for a group of subjects under similar conditions. A test/retest study provides a simple estimate of the within-subject variability (lack of repeatability) of the method. Reproducibility is tested by a replication study, which must be completely independent, ideally utilizing different instruments and analysis tools.

However, repeatability or reproducibility is only one aspect of the problem. A simple ratio may provide more reproducible results than a model-based method; yet the model-based method might reveal true between-subject or treatment differences that are ignored or “normalized” by empirical methods.

Repeatability is estimated by calculating the mean and standard deviation of the absolute values of the difference between test and retest values. The repeatability coefficient (RC) is recommended by the British Standards Institution (1979), and when applied to PET test-retest studies, it is defined as

RC = 1.96 × SDd

where SDd is the standard deviation of the differences between the test and retest values.

Assuming that the data are normally distributed, in 95% of the cases the difference between the two measurements will be less than RC (Bland and Altman, 1986). To facilitate comparisons across regions of interest, test-retest variability (TRV) can be calculated as the absolute value of the difference between test and retest values, divided by the mean of both measurements (Parsey et al., 2000). The mean TRV percentage ± SD should be reported.
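These repeatability measures can be sketched as below, using RC as 1.96 times the standard deviation of the test-retest differences; the paired measurement values are hypothetical illustration data.

```python
# Repeatability coefficient (RC) and test-retest variability (TRV)
# from paired test/retest measurements; values below are hypothetical.
import statistics

test   = [2.10, 1.85, 2.40, 1.95, 2.25, 2.05]
retest = [2.05, 1.95, 2.30, 2.05, 2.20, 2.10]

diffs = [a - b for a, b in zip(test, retest)]
rc = 1.96 * statistics.stdev(diffs)      # 95% repeatability coefficient

# TRV per subject: |test - retest| / mean(test, retest)
trv = [abs(a - b) / ((a + b) / 2) for a, b in zip(test, retest)]
mean_trv = 100 * statistics.mean(trv)    # mean TRV as a percentage
sd_trv   = 100 * statistics.stdev(trv)   # its SD, reported alongside
```

The mean TRV percentage ± SD would then be reported for each region of interest.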

The intraclass correlation coefficient (ICC) (Shrout and Fleiss, 1979) is a useful measure of reliability, as this statistical parameter compares the within-subject (WS) variability to the between-subject (BS) variability in repeated observations.

In the one-way ANOVA model, BSMSS is the mean sum of squares between subjects, WSMSS is the mean sum of squares within subjects, K is the number of repeated observations (K=2 in a test-retest study), and N is the number of subjects:

ICC = (BSMSS − WSMSS) / (BSMSS + (K−1) × WSMSS)

where BSMSS = K × Σn (yn. − y..)² / (N−1) and WSMSS = Σn Σk (ynk − yn.)² / (N × (K−1)), in which ynk is the kth observation of subject n, yn. is the mean of observations for subject n, and y.. is the mean of all observations.

A negative ICC value indicates that more variability is observed within than between subjects. ICC ranges from -1 (no reliability, i.e. BSMSS=0) to 1 (maximum reliability, achieved when test and retest are identical, i.e. WSMSS=0) (Parsey et al., 2000). As a rule, the method with the highest reliability (highest ICC) should be chosen (Laruelle 1999).
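The one-way ANOVA ICC for a test-retest study can be computed as sketched below; the test/retest pairs are hypothetical illustration data.

```python
# One-way ANOVA intraclass correlation coefficient for a test-retest
# study (K = 2 observations per subject); data below are hypothetical.
data = [
    (2.10, 2.05),
    (1.85, 1.95),
    (2.40, 2.30),
    (1.95, 2.05),
    (2.25, 2.20),
    (2.05, 2.10),
]
N = len(data)   # number of subjects
K = 2           # repeated observations per subject

grand_mean = sum(v for pair in data for v in pair) / (N * K)
subj_means = [(a + b) / 2 for a, b in data]

# Between-subject and within-subject mean sums of squares
bsmss = K * sum((m - grand_mean) ** 2 for m in subj_means) / (N - 1)
wsmss = sum((v - m) ** 2
            for pair, m in zip(data, subj_means)
            for v in pair) / (N * (K - 1))

icc = (bsmss - wsmss) / (bsmss + (K - 1) * wsmss)
```

With these hypothetical data the within-subject variability is small relative to the between-subject variability, so the ICC is close to 1.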

Ludbrook (2010) has reviewed linear regression methods for comparing two methods or two measurers.

Interobserver variability

A good parameter should also have low interobserver variability. Interobserver variability can be assessed using Bland-Altman analysis (Bland & Altman, 1986) and Lin’s concordance correlation coefficient (CCC) (Lin, 1989; Barnhart et al., 2002).
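Lin’s CCC for two observers can be computed as sketched below, using the population (biased) variances and covariance as in Lin (1989); the observer readings are hypothetical illustration data.

```python
# Lin's concordance correlation coefficient between two observers;
# the readings below are hypothetical illustration data.
import statistics

obs1 = [3.2, 2.8, 4.1, 3.6, 2.5, 3.9, 3.0]
obs2 = [3.0, 2.9, 4.3, 3.4, 2.6, 3.8, 3.1]

n = len(obs1)
m1, m2 = statistics.mean(obs1), statistics.mean(obs2)

# Population (biased, divide-by-n) variances and covariance
v1  = sum((x - m1) ** 2 for x in obs1) / n
v2  = sum((y - m2) ** 2 for y in obs2) / n
cov = sum((x - m1) * (y - m2) for x, y in zip(obs1, obs2)) / n

# CCC penalizes both poor correlation and systematic shifts
ccc = 2 * cov / (v1 + v2 + (m1 - m2) ** 2)
```

Unlike Pearson’s correlation, the CCC is reduced by any systematic difference in location or scale between the observers, which is why it measures agreement rather than mere association.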

Parameters for assessing the effectiveness of therapy should, in addition to low interobserver variability, also have the ability to differentiate between treatment responders and nonresponders. These qualities can be combined in the variability effect coefficient (VEC) (Benz et al., 2008).



References:

Barnhart HX, Haber M, Song J. Overall concordance correlation coefficient for evaluating agreement among multiple observers. Biometrics 2002; 58:1020-1027.

Barnhart HX, Barboriak DP. Applications of the repeatability of quantitative imaging biomarkers: a review of statistical analysis of repeat data sets. Transl Oncol. 2009; 2(4): 231-235.

Benz MR, Evilevitch V, Allen-Auerbach MS, Eilber FC, Phelps ME, Czernin J, Weber WA. Treatment monitoring by 18F-FDG PET/CT in patients with sarcomas: interobserver variability of quantitative parameters in treatment-induced changes in histopathologically responding and nonresponding tumors. J Nucl Med. 2008; 49(7): 1038-1046.

Bertoldo A, Cobelli C. Data modeling and simulation. In: Feng DD (ed.): Biomedical Information Technology. Elsevier, 2008, pp 115-136.

Bland JM, Altman DG. Statistical methods for assessing agreement between two methods of clinical measurement. Lancet 1986; i: 307-310.

British Standards Institution. Precision of test methods, Part 1: Guide for the determination of repeatability and reproducibility for a standard test method. British Standard 5497: Part 1; 1979.

Budinger TF, Huesman RH, Knittel B, Friedland RP, Derenzo SE (1985): Physiological modeling of dynamic measurements of metabolism using positron emission tomography. In: The Metabolism of the Human Brain Studied with Positron Emission Tomography. (Eds: Greitz T et al.) Raven Press, New York, 165-183.

de Vet HCW, Terwee CB, Mokkink LB, Knol DL (eds.): Measurement in Medicine: A Practical Guide. Cambridge University Press, 2011. ISBN: 9780521118200.

Garfinkel D. Computer modeling, complex biological systems, and their simplifications. Am J Physiol 1980; 239: R1-R6.

Gerke O, Vilstrup MH, Segtnan EA, Halekoh U, Høilund-Carlsen PF. How to assess intra- and inter-observer agreement with quantitative PET using variance component analysis: a proposal for standardisation. BMC Med Imaging 2016; 16:54.

Glatting G, Kletting P, Reske SN, Hohl K, Ring C. Choosing the optimal fit function: Comparison of the Akaike information criterion and the F-test. Med Phys. 2007; 34(11): 4285-4292.

Huang EP, Wang XF, Choudhury KR, McShane LM, Gönen M, Ye J, Buckler AJ, Kinahan PE, Reeves AP, Jackson EF, Guimaraes AR, Zahlmann G. Meta-analysis of the technical performance of an imaging procedure: guidelines and statistical methodology. Stat Methods Med Res. 2015; 24(1): 141-174.

Huang SC, Phelps ME (1986): Principles of tracer kinetic modeling in positron emission tomography and autoradiography. In: Positron Emission Tomography and Autoradiography: Principles and Applications for the Brain and Heart. (Eds: Phelps M, Mazziotta J, Schelbert H) Raven Press, New York, 287-346.

Johnson M, Karanikolas BDW, Priceman SJ, Powell R, Black ME, Wu H-M, Czernin J, Huang S-C, Wu L. Titration of variant HSV1-tk gene expression to determine the sensitivity of 18F-FHBG PET imaging in a prostate tumor. J Nucl Med. 2009; 50(5): 757-764.

Kessler LG, Barnhart HX, Buckler AJ, Choudhury KR, Kondratovich MV, Toledano A, Guimaraes AR, Filice R, Zhang Z, Sullivan DC. The emerging science of quantitative imaging biomarkers terminology and definitions for scientific studies and regulatory submissions. Stat Methods Med Res. 2015; 24(1): 9-26.

Laruelle M. Modelling: when and why? Eur J Nucl Med. 1999; 26, 571-572.

Lin LI. A concordance correlation coefficient to evaluate reproducibility. Biometrics. 1989; 45: 255-268.

Ludbrook J. Linear regression analysis for comparing two measurers or methods of measurements: But which regression? Clin Exp Pharmacol Physiol. 2010; 37: 692-699.

Ogden RT, Ojha A, Erlandsson K, Oquendo MA, Mann JJ, Parsey RV. In vivo quantification of serotonin transporters using [11C]DASB and positron emission tomography in humans: modeling considerations. J Cereb Blood Flow Metab 2007; 27: 205-217.

Parsey RV, Slifstein M, Hwang D-R, Abi-Dargham A, Simpson N, Mawlawi O, Guo N-N, Van Heertum R, Mann JJ, Laruelle M. Validation and reproducibility of measurement of 5-HT1A receptor parameters with [carbonyl-11C]WAY-100635 in humans: comparison of arterial and reference tissue input functions. J Cereb Blood Flow Metab 2000; 20: 1111-1133.

Phair RD. Development of kinetic models in the nonlinear world of molecular cell biology. Metabolism 1997; 46:1489-1495.

Raunig DL, McShane LM, Pennello G, Gatsonis C, Carson PL, Voyvodic JT, Wahl RL, Kurland BF, Schwarz AJ, Gönen M, Zahlmann G, Kondratovich MV, O’Donnell K, Petrick N, Cole PE, Garra B, Sullivan DC. Quantitative imaging biomarkers: a review of statistical methods for technical performance assessment. Stat Methods Med Res. 2015; 24(1): 27-67.

Riaño Barros DA, McGinnity CJ, Rosso L, Heckemann RA, Howes OD, Brooks DJ, Duncan JS, Turkheimer FE, Koepp MJ, Hammers A. Test-retest reproducibility of cannabinoid-receptor type 1 availability quantified with the PET ligand [11C]MePPEP. Neuroimage 2014; 97: 151-162.

Shrout PE, Fleiss JL. Intraclass correlations: uses in assessing rater reliability. Psychol Bull 1979; 86: 420-428.





Created at: 2011-11-22
Updated at: 2017-08-07
Written by: Vesa Oikonen, Kaisa Liukko, Jarkko Johansson