## ABSTRACT

**Objective**
To determine the sensitivity and specificity of several classification rules for stability and instability of angle in childhood esotropia.

**Methods**
We conducted 10 000 Monte Carlo simulations of participants with no actual change in angle of esotropia during follow-up, where “observed” changes in ocular alignment were sampled from a distribution of measurement errors for the prism and alternate cover test. Additional simulations were conducted for a range of “true” changes (1.0, 2.5, 4.2, 5.0, 7.5, and 10.0 prism diopters [PD] per visit) with up to 10 follow-up visits. We then estimated sensitivities and specificities for specific rules for retrospectively classifying stability (all measurements within 0, 5, 10, or 15 PD) and instability (≥2 measurements differing by ≥10 PD, etc) across a fixed number of visits. Results were extended to classifying ocular alignment stability and instability prospectively based on a varying number of measurements.

**Results**
For a series of 4 measurements, the rules that optimized sensitivity and specificity were “all measurements within 5 PD” for stability and “at least 2 measurements differing by 15 PD or more” for instability. For a series of 3 measurements, all 3 measurements needed to be identical to confirm stability.

**Conclusions**
We derived definitions of stability and instability in childhood esotropia using estimates of actual measurement error that may be useful for clinical practice and for future clinical studies of esotropia.

One factor that may affect the timing of strabismus surgery is whether the magnitude of the angle of misalignment seems to be changing (instability) or not changing (stability) with time. Although stability has been defined by some authors for data analysis,^{1-4} previous definitions do not account for the presence of measurement error in strabismus assessment. Measurement error can have a large effect on the accuracy of classification when the classification is based on a cutoff point for a continuous measurement.^{5} The choice of the cutoff point itself also affects misclassification. Real data are not directly useful for determining the extent of misclassification because real data include measurement error, and, hence, the true value and the true classification are unknown. As a result, simulated data are commonly used to determine the impact of measurement error on misclassification.

In the present study, we used Monte Carlo simulations to quantify the effect of measurement error. We explored the effect on misclassification of several rules using differing cutoff points for defining stability and instability by calculating sensitivity and specificity. In each simulation, ocular alignment data are generated using a statistical model in which the “true” angle of misalignment and changes in the true angle are specified as part of the simulation and, thus, are known. The simulated ocular alignment data also include “observed” data that incorporate estimates of measurement error into the true data. The estimates of measurement error (4.2 prism diopters [PD] for angles >20 PD and 1.7 PD for angles ≤20 PD) were based on results from a previous test-retest study^{6} and did not vary by age. Based on the present findings, we make recommendations regarding rules for prospective classification of stability and instability in children with esotropia.

Data for a true angle and an observed angle at 10 time points were simulated for 10 000 hypothetical participants with moderate esotropia (baseline angle within 20-50 PD) and no change in angle over time according to the procedures in the Appendix. This process was repeated to simulate data for participants with a constant, predefined amount of change in angle (1.0, 2.5, 4.2, 5.0, 7.5, and 10.0 PD) between successive time points. The final data set contained 130 000 simulated participants, each of whom had a constant, predefined amount of change in angle between successive visits that ranged from −10.0 to 10.0 PD, including 0 PD (no change over time).
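The simulation of one participant can be sketched roughly as follows. This is an illustration only, not the procedure in the Appendix: the uniform baseline distribution, the normally distributed measurement error, the contents of `PRISMS`, and all function names are assumptions introduced here. Only the standard errors of measurement (4.2 PD for angles >20 PD and 1.7 PD for angles ≤20 PD), the 20-50 PD baseline range, and the rounding down to the nearest prism come from the text.

```python
import bisect
import random

# Hypothetical prism set (an assumption): 2-PD steps up to 20 PD, 5-PD steps above.
PRISMS = list(range(0, 21, 2)) + list(range(25, 101, 5))

def sem_for(angle):
    """Standard error of measurement reported in the text, chosen by true angle."""
    return 4.2 if angle > 20 else 1.7

def round_down_to_prism(angle):
    """Round a measurement down to the nearest prism in the assumed set.

    Values outside the assumed set are clipped to its smallest or largest prism.
    """
    clipped = min(max(angle, PRISMS[0]), PRISMS[-1])
    return PRISMS[bisect.bisect_right(PRISMS, clipped) - 1]

def simulate_participant(rng, change_per_visit, n_visits=10):
    """Return (true_angles, observed_angles) for one simulated participant."""
    baseline = rng.uniform(20.0, 50.0)  # assumed uniform baseline within 20-50 PD
    true_angles = [baseline + change_per_visit * v for v in range(n_visits)]
    # Assumed normal measurement error around each true angle.
    observed = [round_down_to_prism(rng.gauss(t, sem_for(t))) for t in true_angles]
    return true_angles, observed

rng = random.Random(0)
true_angles, observed = simulate_participant(rng, change_per_visit=2.5)
```

Repeating `simulate_participant` 10 000 times for each of the 13 per-visit changes (−10.0 to 10.0 PD, including 0) would give a data set of the size described above.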

We considered 2 situations in which classification of stability and instability of ocular alignment is of interest: (1) when each participant has a fixed number of measurements and stability is being classified retrospectively, such as for a research study, and (2) when the participant is being observed prospectively for stability, such as in clinical practice. In the latter case, the number of measurements is not fixed in advance; at any follow-up time a decision may be made to classify the participant as stable based on the available measurements or to continue follow-up. The 2 situations require different classification rules.

Clinicians commonly think of changes in strabismus angle according to whether the measurement exceeds specific steps in the prism sets used to measure the misalignment. For moderate angles of strabismus (20-50 PD), commonly available prism sets have 5-PD increments. Hence, we considered classification rules that used multiples of 5 PD as the cutoff points for classification. The rules that we evaluated are listed in Table 1.

Although in truth a participant's alignment is either stable or unstable, the stability rules classify a participant's alignment as either stable or not stable and the instability rules classify a participant's alignment as either unstable or not unstable. Note that the definitions of stable and unstable are mutually exclusive; however, a classification of not stable (using the stability rule) is not the same as a classification of unstable (using the instability rule). Likewise, a classification of stable is not the same as a classification of not unstable.
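A minimal sketch of the two kinds of rule, assuming that “all measurements within X PD” means the range of the measurements is at most X and that “at least 2 measurements differing by X PD or more” means the range is at least X. The function names are hypothetical, and Table 1 is not reproduced here.

```python
def is_stable(angles, cutoff_pd):
    """Stability rule: all measurements lie within cutoff_pd of one another."""
    return max(angles) - min(angles) <= cutoff_pd

def is_unstable(angles, cutoff_pd):
    """Instability rule: at least 2 measurements differ by cutoff_pd or more."""
    return max(angles) - min(angles) >= cutoff_pd

angles = [30, 35, 40, 35]  # observed angles in PD; the range is 10 PD
print(is_stable(angles, 5))     # False: not stable under the 5 PD rule
print(is_unstable(angles, 15))  # False: not unstable under the 15 PD rule
```

The example series is classified as neither stable nor unstable, which illustrates why “not stable” and “unstable” are distinct outcomes.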

As with the rules for classifying stability and instability retrospectively based on a fixed number of measurements, rules for classifying stability and instability prospectively were based on using multiples of 5 PD for cutoff points as listed in Table 1.

For analysis, the true alignment of participants with no change (0 PD) in true angle over time was considered stable, and the true alignment of those with any amount of change in true angle was considered unstable. The interval need not be specified; the rules can be applied to any evenly spaced measurements, with the results applying over the interval that was used.

Each rule was used to classify the alignment of each simulated participant according to the simulated observed angles and simulated true angles. The observed classification was compared with the true classification to obtain the sensitivity and specificity for each classification rule. For a stability rule, the sensitivity is the probability that a truly stable alignment is classified as stable based on observed angles, and the specificity is the probability that a truly unstable alignment is classified as not stable based on observed angles. One minus specificity is the false-positive rate, that is, the probability that a truly unstable alignment is classified as stable. For an instability rule, the sensitivity is the probability that a truly unstable alignment is classified as unstable based on observed angles, and the specificity is the probability that a truly stable alignment is classified as not unstable based on observed angles. One minus specificity is the probability that a truly stable alignment is classified as unstable (false-positive probability).
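The definitions above amount to a standard two-by-two tabulation. The following sketch computes both quantities for a stability rule from paired true and observed classifications; the function name and the toy data are assumptions for illustration, and both classes are assumed to be present so that neither denominator is zero.

```python
def sensitivity_specificity(truth, predicted):
    """Compute (sensitivity, specificity) from parallel lists of booleans.

    For a stability rule, True means "stable": truth is the true status and
    predicted is the classification based on observed angles.
    """
    pairs = list(zip(truth, predicted))
    tp = sum(1 for t, p in pairs if t and p)          # truly stable, classified stable
    fn = sum(1 for t, p in pairs if t and not p)      # truly stable, classified not stable
    tn = sum(1 for t, p in pairs if not t and not p)  # truly unstable, classified not stable
    fp = sum(1 for t, p in pairs if not t and p)      # truly unstable, classified stable
    return tp / (tp + fn), tn / (tn + fp)

truly_stable = [True, True, True, True, False, False, False, False]
classified_stable = [True, True, False, False, False, False, False, True]
sens, spec = sensitivity_specificity(truly_stable, classified_stable)
print(sens, spec)      # 0.5 0.75
print(1 - spec)        # 0.25, the false-positive rate
```

For an instability rule the same function applies with True meaning "unstable" in both lists.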

In the sensitivity analysis, participants with patterns of change over time other than a constant increase or decrease were simulated, and other variables of the original simulations also were varied, such as baseline distribution of angles (mean, standard deviation, range, and form of distribution) and standard error of measurement, to determine how these factors affected results.

The Esotropia Treatment Study^{7} was conducted by the Pediatric Eye Disease Investigator Group to investigate the stability of ocular alignment in childhood esotropia. Participants had follow-up visits every 6 weeks for 18 weeks for a total of 4 visits. Results for the classification rules based on 4 visits were used to select a rule for stability and a rule for instability, which then were used to classify participants in the Esotropia Treatment Study.

The sensitivity and specificity of each of the classification rules for stability and instability for a fixed number of measurements are given in Table 2 (stability) and Table 3 (instability). For example, the sensitivity of a 0 PD rule for defining stability (all measured angles must be identical) applied to 4 measurements is 4%, whereas the sensitivity of the 5 PD rule (all measured angles are within 5 PD) applied to 4 measurements is 45%. In other words, because of measurement error, the 4 measured angles of a participant with an unchanging angle have only a 4% probability that they will be identical and a 45% probability that they will differ by no more than 5 PD (Table 2). Participants with a small change in the true angle of −2.5 or 2.5 PD between successive measurements across 4 measurements are slightly less likely to have measurements that do not differ (2% probability), whereas the probability that their measurements differ by 5 PD at most is 23% (for a −2.5-PD change) or 25% (for a 2.5-PD change). The slight asymmetry in specificity for an increase vs decrease in angle is due to several factors: (1) regression to the mean resulting from limiting the analysis to a range of baseline angles representing moderate esotropia, (2) higher precision of measurement for smaller angles (≤20 PD) compared with larger angles (>20 PD), and (3) rounding observed angles down to the nearest prism in the prism set. Nevertheless, these same effects would be expected in the clinical situation.

It is evident from Table 2 that the probability that a truly unstable alignment is misclassified as stable for any given rule depends on the amount of true angle change: the smaller the change, the more likely a participant's alignment will be misclassified as stable. Likewise, the larger the change in true angle, the more likely the alignment will be correctly classified as unstable for any given rule.

Applying the “all measurements within 5 PD” rule for stability to fewer than 4 measurements results in higher sensitivity for stability than with 4 measurements (60% for 3 measurements and 80% for 2 measurements) because there is less chance that a stable participant's measurements will differ by more than 5 PD due to measurement error when there are fewer measurements. However, the probability of a false-positive stable classification is substantially increased with fewer measurements.
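The direction of this effect can be checked with a small simulation. The sketch below is a simplification under stated assumptions (normal error with a single standard error of 4.2 PD, a uniform true angle of 25-45 PD, and rounding down to 5-PD steps) and is not expected to reproduce the percentages in Table 2. All names are hypothetical. Because the 2- and 3-measurement series are prefixes of the 4-measurement series, the estimated sensitivity cannot increase as measurements are added.

```python
import random

def range_within(angles, cutoff_pd):
    """True when all measurements lie within cutoff_pd of one another."""
    return max(angles) - min(angles) <= cutoff_pd

def stable_fraction_by_count(n_participants=2000, sem=4.2, cutoff_pd=5, seed=1):
    """Fraction of truly stable participants classified as stable using the
    first 2, 3, or 4 of their observed measurements."""
    rng = random.Random(seed)
    counts = {2: 0, 3: 0, 4: 0}
    for _ in range(n_participants):
        true_angle = rng.uniform(25.0, 45.0)  # unchanging true angle (assumed range)
        # Simplification: round each noisy measurement down to a 5-PD step.
        observed = [5 * int(rng.gauss(true_angle, sem) // 5) for _ in range(4)]
        for n in counts:
            if range_within(observed[:n], cutoff_pd):
                counts[n] += 1
    return {n: c / n_participants for n, c in counts.items()}

fractions = stable_fraction_by_count()
print(fractions)  # sensitivity estimates for 2, 3, and 4 measurements
```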

If planning a study in which angle stability will be classified, the choice of number of measurements will depend on the false-positive and false-negative misclassification rates that can be tolerated. For example, for a target of 20% or less false-positive rates for angle changes corresponding to measurement error or greater (−4.2 or 4.2 PD per visit), one could take 3 measurements and apply the 0 PD rule, 4 measurements and apply the 5 PD rule, or 5 measurements and apply the 10 PD rule. The corresponding sensitivities for detecting stability are 12%, 45%, and 79%. With only 2 measurements, the false-positive rate for angle changes of −4.2 or 4.2 PD per visit is greater than 20% regardless of what rule is used, so there is no acceptable rule for stability that is based on only 2 measurements.

For instability (Table 3), the rule that classifies a participant's alignment as unstable when a second measurement is observed that differs by 15 PD or more from any previous measurement has a false-positive rate of less than 20% when the number of measurements is 5 or less. When there are only 2 measurements, a rule that classifies the angle as unstable when the 2 measurements differ by 10 PD or more has a false-positive rate of 18%. For classifying instability, the false-positive rate decreases as the number of measurements decreases because the likelihood that 1 measurement will differ by any given amount (or more) from other measurements due to measurement error alone is reduced when there are fewer measurements.

Of interest to the clinician is the ability to classify stability during a series of return visits. A set of rules for classification of stability was developed by selecting the pragmatic rule for each possible choice of stability cutoff value (0, 5, 10, or 15 PD) that had a 1 − specificity (false-positive rate) no greater than 20% for per-visit angle changes that met or exceeded measurement error and was based on the smallest number of measurements meeting this criterion. We also considered combinations of the pragmatic rules that would be expected to meet this criterion. This resulted in the rules for prospective stability classification given in Table 1.

The sensitivity and specificity for stability classification with prospective follow-up depended on the number of visits over which one was willing to continue following participants who had not yet met the criteria for stability (data not shown). At least 3 visits were required. As the number of visits was increased, the sensitivity increased but the specificity decreased because the probability that a participant with changing angle would meet stability criteria by chance increased as the number of visits increased. Chances of misclassification of a participant with a large angle change (more than measurement error) remained low as the number of visits was increased, but the chances of misclassification of a participant with an angle change smaller than measurement error were greatly increased as the number of measurements increased. Rule performance differed little for angle changes that were much larger or smaller than measurement error, but for angle changes in the vicinity of measurement error, rule 2 (classify as stable if the last 4 consecutive measurements are within 5 PD) performed slightly better than the other proposed rules. Rules meeting the criterion of 20% or less false-positive rates that are based on more than 4 measurements would be expected to perform somewhat better still.
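Rule 2 can be applied visit by visit as measurements accumulate. The sketch below is one possible reading of that rule, with a hypothetical function name and made-up example angles; it reports the first visit at which the most recent 4 measurements all fall within 5 PD of one another.

```python
def first_stable_visit(angles, window=4, cutoff_pd=5):
    """Return the 1-based visit number at which the last `window` consecutive
    measurements are first all within cutoff_pd, or None if this never occurs."""
    for end in range(window, len(angles) + 1):
        recent = angles[end - window:end]
        if max(recent) - min(recent) <= cutoff_pd:
            return end
    return None

print(first_stable_visit([30, 40, 35, 35, 35, 30]))  # 5
print(first_stable_visit([30, 40, 30, 40, 30, 40]))  # None
```

In the first series the criterion is not met at visit 4 (range 10 PD) but is met at visit 5, when the earliest measurement drops out of the window. This is also how a changing angle can meet the criterion by chance as follow-up is extended.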

Neither the stability rule nor the instability rule was sensitive to the population mean angle, the standard deviation of the population angles, the shape of the population angle distribution, or the range of the population angle distribution (data not shown). There was a very slight effect of changing the population mean or range that was due to a different standard error of measurement for smaller angles (≤20 PD) as opposed to large angles (>20 PD) and regression to the mean. For example, when the mean angle was smaller, a greater proportion of the population was measured with the smaller measurement error found for smaller angles, and this resulted in better specificity for classifying negative angle changes. However, there also was regression to the mean in the negative direction in a greater proportion of the population, which resulted in worse specificity for classifying positive angle changes. Varying standard error of measurement, as expected, had a large effect on classification: the smaller the measurement error, the more accurate the classification (Figure 1). Likewise, varying the pattern of angle change affected the accuracy of classification (Figure 2). Participants whose angle was changing but the amount of change was getting smaller over time were harder to classify correctly (as unstable) than those with constant angle change, who were, in turn, harder to classify than those whose angle was changing but the amount of change was getting larger over time. Similar results were obtained for the prospective classification rules and for the other fixed rules, regardless of the number of measurements, that is, the results were sensitive to only the standard error of measurement and the pattern of angle change.

**Figure 1.** Effect of measurement error on the probability of stable (A) and unstable (B) classification for the rule “all 4 measurements are within 5 prism diopters (PD)” (see Appendix, http://archopht.ama-assn.org/cgi/content/full/128/12/1555/DC1) as a function of true angle change. The base assumption (standard error of measurement [SEM] was 4.2 for angles >20 PD and 1.7 for angles ≤20 PD) is represented by the solid black line. The broken lines show the effect of varying the base assumption.

**Figure 2.** Effect of pattern of angle change on the probability of stable (A) and unstable (B) classification for the rule “at least 2 measurements differing by 15 PD or more” (see Appendix) as a function of true angle change. The base assumption (constant rate of angle change) is represented by the solid black line. The broken lines show the effect of varying the base assumption.

For classifying alignment of participants in the Esotropia Treatment Study,^{7} the aim was to maximize sensitivity while keeping the percentage of false-positive classifications for stability and instability lower than 20% for per-visit angle changes exceeding measurement error (4.2 PD per visit). The “all measurements within 5 PD” rule was adopted for stability, despite its low sensitivity (45%), because less stringent rules did not meet the specified false-positive criterion. Likewise, the rule for instability of “at least 2 measures differing by 15 PD or more” led to the highest sensitivities while meeting the target of less than 20% false-positive results (ie, 12%). Using these rules, 46% of 59 participants with infantile esotropia were classified as having unstable ocular alignment, and 20% were classified as having stable alignment. For 60 participants with acquired nonaccommodative esotropia, the corresponding figures were 22% unstable and 37% stable, and for 41 participants with acquired partially accommodative esotropia, the figures were 15% unstable and 39% stable.^{7}

We modeled “changes in the observed angle of misalignment,” using estimates of measurement error based on a test-retest study, to test potential rules for stability and instability in childhood esotropia. For a series of 4 measurements, such as used in the Esotropia Treatment Study,^{7} the rules that optimized sensitivity and specificity were “all measurements within 5 PD” as the rule for stability and “at least 2 measures differing by 15 PD or more” as the rule for instability. For the more common clinical scenario of the performance of an undefined number of prospective measurements, we found that a minimum of 3 identical measurements or 4 measurements within 5 PD would be needed to confirm stability.

In the present modeling, we also noted that chances for prospective misclassification of alignment of a participant with changing (unstable) alignment as stable are high for modest angle changes (≤5 PD per visit), and so clinically it would be reasonable to space visits far enough apart to give changes time to manifest. The optimum timing of such visits needs to be the subject of a future study.

In considering the practical application of these results, one approach to account for measurement error in strabismus measurements might be to make multiple measurements during the same office visit. Such an approach would require the clinician to make each measurement independently, accepting that the previous measurement might have been unrepresentative of the true misalignment. Ignoring the previous measurement and approaching the next measurement independently a few minutes later is difficult if not impossible. Acquiring multiple unbiased measurements during the same office visit may require a more automated means of measuring the magnitude of strabismus, such as eye tracker technology.^{8}

One potential weakness of this study was highlighted by the sensitivity analysis; the sensitivity and specificity of the proposed classification rules were highly dependent on the magnitude of measurement error. Nevertheless, we believe that the present estimates of prism and alternate cover test measurement error are reasonable because they were derived from a prospective test-retest study specifically designed for the purpose of estimating measurement error.^{6} The sensitivity and specificity of the proposed classification rules remained high even when the population mean, standard deviation, range, and shape of the angle distribution were varied. In conclusion, clinically useful definitions of stability and instability in childhood esotropia have been derived using estimates of actual measurement error and may be reasonable to apply to clinical practice and future strabismus studies.

**Correspondence:** B. Michele Melia, ScM, Jaeb Center for Health Research, 15310 Amberly Dr, Ste 350, Tampa, FL 33647 (pedig@jaeb.org).

**Submitted for Publication:** December 22, 2009; final revision received April 8, 2010; accepted April 13, 2010.

**Pediatric Eye Disease Investigator Group:** A full list of the members is available at http://www.ncbi.nlm.nih.gov/pubmed/18973948.

**Funding/Support:** This study was supported by grants EY011751 and EY015799 from the National Eye Institute of the National Institutes of Health, Department of Health and Human Services.

**Previous Presentation:** This study was presented in part at the annual meeting of the Association for Research in Vision and Ophthalmology; April 28, 2008; Fort Lauderdale, Florida.

## REFERENCES

1. *Am J Ophthalmol*. 2004;138(6):1003-1009.
2. *J AAPOS*. 2003;7(5):349-353.
3. *J AAPOS*. 2008;12(1):66-68.
4. *Trans Am Ophthalmol Soc*. 1994;92:117-125, discussion 126-131.
5. *Ophthalmology*. 2007;114(10):1804-1809.
6. *Arch Ophthalmol*. 2009;127(1):59-65.
7. *Ophthalmology*. 2008;115(12):2266-2274, e4.
8. *Invest Ophthalmol Vis Sci*. 2007;48: E-Abstract 900. http://www.arringtonresearch.com/kfa/abstract/Infrared%20Eyetracker%20MethodsforMeasuringStrabismus.pdf. Accessed August 28, 2009.

## Multimedia

**Classifying Stability of Misalignment in Children With Esotropia Using Simulations**

*Arch Ophthalmol*. 2010;128(12):1555-1560.

Appendix. Procedures for simulating angle data.