Interpreting European Organisation for Research and Treatment for Cancer Quality of life Questionnaire core 30 scores as minimally importantly different for patients with malignant melanoma

Introduction: Health-related quality of life (HRQOL) is increasingly recognised as an important end-point in cancer clinical trials. The concept of minimally important difference (MID) enables interpreting differences and changes in HRQOL scores in terms of clinical meaningfulness. We aimed to estimate MIDs for interpreting group-level change of European Organisation for Research and Treatment for Cancer Quality of life Questionnaire core 30 (EORTC QLQ-C30) scores in patients with malignant melanoma. Methods: Data were pooled


Introduction
Health-related quality of life (HRQOL) is increasingly recognised as an important end-point in cancer clinical trials [1]. Understanding how much change in HRQOL scores is clinically relevant is crucial for interpretation. The concept of minimally important difference (MID) enables the interpretation of differences between groups and changes over time in HRQOL scores in terms of clinical meaningfulness [2–6]. MID is defined as 'the smallest difference in score in the outcome of interest that informed patients or informed proxies perceive as important, either beneficial or harmful, and which would lead the patient or clinician to consider a change in the management' [2]. MIDs are commonly determined by anchor-based and distribution-based methods [7]. Anchor-based methods express differences or change in HRQOL scores by linking specific HRQOL domains to clinical variables that have known clinical relevance [3,8–10] or to patient-/physician-derived ratings of change in the specific domain [4–6]. The usefulness of anchor-based MIDs depends on the anchor selected, how discriminant groups are defined with respect to that anchor and the strength of the relationship (conceptually and empirically) between the anchor and the target HRQOL domain [11]. Distribution-based methods rely on the statistical distribution of HRQOL scores, e.g. standard deviation (SD) criteria or the standard error of measurement (SEM) [12,13]. Because distribution-based methods do not consider the patients' or clinicians' perspective, they have been recommended as supportive evidence for anchor-based methods [7].
The European Organisation for Research and Treatment for Cancer Quality of life Questionnaire core 30 (EORTC QLQ-C30) is widely used to assess HRQOL in cancer patients [14]. Osoba et al. [4] published guidelines for interpreting small (5–10 points), moderate (10–20 points) and large changes (>20 points) in EORTC QLQ-C30 scores using a global patient rating of change as anchor, in patients with breast and small-cell lung cancer. In an early application of clinical anchors, King [3] compiled published evidence about differences in EORTC QLQ-C30 scores between groups for multiple cancer sites and clinical anchors and found that the score range for small, moderate and large effects differed between HRQOL scales. More recent guidelines by Cocks et al. [5,6] highlighted the need to differentiate not only between the EORTC QLQ-C30 scales but also between the direction of change (improvement versus deterioration) and clinical settings. This implies that a global rule for MIDs applicable to all situations is highly unlikely [7,11,15].
This study aims to provide MID estimates for EORTC QLQ-C30 scales in patients with malignant melanoma who undergo adjuvant treatment. We focused on examining MIDs for group-level change (both within and between groups) in HRQOL scores over time [16]. There are currently no MID guidelines for the EORTC QLQ-C30 specific to malignant melanoma. In contrast to Osoba et al. [4], we used multiple clinical anchors that were available in our database. Furthermore, the guidelines of King [3] and Cocks et al. [5,6] were based on meta-analyses of published studies, pooling across cancer sites, whereas we used individual patient data from archived EORTC melanoma trials.

Data description
Data were pooled from three published adjuvant melanoma phase III EORTC trials. Trial 1 assessed the effect of two regimens of intermediate-dose interferon versus observation alone in patients with stage IIb/III melanoma after surgery and enrolled 1388 patients [17]. Trial 2 compared adjuvant immunotherapy with anti-CTLA-4 monoclonal antibody (ipilimumab) versus placebo after complete resection of high-risk stage III melanoma and enrolled 951 patients [18,19]. Trial 3 compared adjuvant therapy with PEG-Intron with observation after adequate dissection of the regional lymph nodes in American Joint Committee on Cancer stage III melanoma and enrolled 1256 patients [20]. All three trials assessed HRQOL using the EORTC QLQ-C30 at baseline, during treatment and at several follow-up time points after the end of treatment. When pooling, three key time points were identified that were common across all three trials: (i) Start of treatment (T1); time point before or on the first day of treatment administration. If no treatment was administered, then T1 was the time point before or on the date of randomisation. (ii) End of treatment (T2); last day of protocol treatment administration. Patients who were under observation alone did not contribute data at T2. (iii) End of follow-up (T3); the last day of the protocol follow-up period. For patients under observation, T3 was the last day after baseline.
Trial 1 used version 2 of the QLQ-C30, whereas trials 2 and 3 used version 3. The two versions differ only in the response categories of questions 1–5 (in the PF domain), coded as yes/no in version 2, whereas version 3 uses a four-point Likert scale ranging from 'not at all' to 'very much'. The scoring of the EORTC QLQ-C30 scales was done according to the EORTC QLQ-C30 scoring manual [14], with the means of the raw scores for each scale transformed to fall between 0 and 100. For consistency in signs of the change scores across the various scales, the symptom scores were reversed to follow the functioning scales interpretation, i.e. all scales were scored such that 0 represents the worst possible score and 100 the best possible score. The financial difficulties (FI) scale was omitted from the analysis because suitable anchors were not available.
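The scoring step can be illustrated with a minimal Python sketch. It assumes the linear transformation given in the EORTC QLQ-C30 scoring manual and items coded from 1 upwards; the function names (`scale_score`, `reverse_symptom`) and the example responses are hypothetical and are not taken from the study data.

```python
def scale_score(items, item_range, kind):
    """Linear 0-100 transformation of one QLQ-C30 scale (sketch).

    items      -- raw item responses, coded from 1 upwards
    item_range -- highest minus lowest possible response
                  (3 for four-point items, 6 for the global QL items)
    kind       -- 'functioning', 'symptom' or 'global'
    """
    raw = sum(items) / len(items)          # mean of the raw item scores
    if kind == "functioning":
        # High raw responses mean more problems, so invert:
        # 100 = best functioning.
        return (1 - (raw - 1) / item_range) * 100
    # Symptom and global QL scales: higher raw response = higher score.
    return (raw - 1) / item_range * 100


def reverse_symptom(score):
    """Reverse a symptom score so that 100 is the best possible score,
    matching the functioning-scale interpretation used in this study."""
    return 100 - score


# Hypothetical responses, for illustration only.
pf = scale_score([1, 2, 1, 1, 2], item_range=3, kind="functioning")
fa = reverse_symptom(scale_score([2, 3, 2], item_range=3, kind="symptom"))
print(round(pf, 1), round(fa, 1))
```

After reversal, a positive change score indicates improvement on every scale, which is what allows a common sign convention for the MIDs.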

Clinical anchors
Anchors were constructed using clinical data from physician examinations, common terminology criteria for adverse events (CTCAE) and laboratory results that were available in the trial data sets. Anchors were initially selected based on the strength of correlation with the corresponding QLQ-C30 scale. We prioritised anchors with absolute correlations of ≥0.30, as proposed by Revicki et al. [7], and where achievable, anchors with stronger correlations were targeted [21]. The selected anchors were further verified for clinical plausibility by a panel of melanoma and HRQOL experts to avoid spurious findings. This panel was also tasked to identify clinically relevant changes for each of the selected anchors. For each QLQ-C30 scale, multiple anchors could be selected. Details on the anchor selection procedures have been described by Musoro et al. [16]. The retained anchors comprised World Health Organisation performance status (PS) and 7 CTCAEs (gastrointestinal disorder, anorexia, pain, fatigue, immune disorder, diarrhoea and nervous system disorder). The PS was scored between 0 (no symptoms of cancer) and 4 (bedbound), whereas the CTCAEs were graded between 0 (no toxicity) and 4 (life-threatening).

Definition of clinical change groups
Three clinical change status groups (CCGs) were defined after consultation with our panel of clinical experts: deterioration (worsened by 1 anchor category), stable (no change in anchor category) and improvement (improved by 1 anchor category). Patients who changed by 2 or more categories of an anchor were considered to be above the 'minimal' expected change and so were excluded from data sets used to estimate mean change and MIDs.
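The grouping rule can be sketched in a few lines of Python. The sketch assumes, as for PS and CTCAE grades, that a higher anchor category is worse, so an anchor change of +1 is a deterioration and −1 an improvement; the function name `clinical_change_group` is hypothetical.

```python
def clinical_change_group(anchor_before, anchor_after):
    """Assign a clinical change group (CCG) from two anchor grades.

    Assumes higher anchor grades are worse (e.g. PS or CTCAE, 0-4).
    Returns None when the anchor moved by 2 or more categories,
    because such patients were excluded from MID estimation.
    """
    change = anchor_after - anchor_before
    if change == 0:
        return "stable"
    if change == 1:
        return "deterioration"
    if change == -1:
        return "improvement"
    return None  # more than a 'minimal' change: excluded


# Hypothetical grades at two time points, for illustration only.
print(clinical_change_group(1, 2))  # deterioration
print(clinical_change_group(2, 0))  # None (excluded)
```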

Data analysis
Individual-level change scores of the EORTC QLQ-C30 scales and their corresponding anchors were computed between T1 and T2 and between T2 and T3. Only subjects with both EORTC QLQ-C30 and anchor data available for a given pair of time points contributed to calculation of change scores.
Two anchor-based methods were then used to estimate MIDs for improvement and deterioration for each EORTC QLQ-C30 scale and its corresponding anchors. The primary method involved calculating the mean HRQOL change score for the improvement and deterioration CCGs. This is applicable for interpreting change within a group of patients, and it is analogous to the mean HRQOL change score over time for a single treatment group in a trial. Effect sizes (ESs) were computed by dividing the mean HRQOL change score between adjacent time points (e.g. T1 and T2) by the SD of the HRQOL scores at the earlier time point (T1). Only mean change scores with an ES of ≥0.2 and <0.8 were considered appropriate for inclusion as MIDs. This was based on Cohen's [13] recommendations that an ES of 0.2 is small, 0.5 is moderate and ≥0.8 is large. The rationale here was that observed effect sizes <0.2 reflected changes that were clinically unimportant, and those ≥0.8 were clearly more than minimally important. We also compared the difference in change scores between the improvement (or deterioration) CCG and the no change CCG using analysis of variance (ANOVA).
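The effect size screen described above can be sketched as follows. This is a minimal illustration, not the SAS code used in the study: it assumes the sample SD (n − 1 denominator) of the scores at the earlier time point and applies the 0.2 and 0.8 bounds to the absolute ES so that the same rule covers deterioration. The function names and the data are hypothetical.

```python
from statistics import mean, stdev


def effect_size(change_scores, earlier_scores):
    """Mean change score divided by the SD of scores at the earlier time point."""
    return mean(change_scores) / stdev(earlier_scores)


def qualifies_as_mid(es):
    """Keep a mean change as an MID candidate only if 0.2 <= |ES| < 0.8."""
    return 0.2 <= abs(es) < 0.8


# Hypothetical improvement CCG, for illustration only.
t1_scores = [60, 70, 80, 90, 100]
changes = [10, 5, 0, 15, 10]           # T2 minus T1, per patient

es = effect_size(changes, t1_scores)
print(round(mean(changes), 1), round(es, 2), qualifies_as_mid(es))
```

In this example the mean change of 8 points would be retained as an MID candidate because its ES of about 0.51 lies between the two bounds.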
The secondary method involved linear regression applied to compare change scores for subjects in the improvement (or deterioration) CCGs versus the stable CCG. For a given EORTC QLQ-C30 scale/anchor pair, separate models were fitted for improving and deteriorating scores. The outcome variable was the HRQOL change score, and the covariate was a binary anchor variable, coded as 'stable' = 0 and 'improvement' = 1 when modelling improvement and 'stable' = 0 and 'deterioration' = 1 when modelling deterioration. The resulting slope parameters correspond to the mean change score for improvement and deterioration, respectively, relative to the stable group. This is useful for interpreting changes between groups of patients, and it is analogous to comparing the mean HRQOL change score in a target treatment group to a control group in a trial. For a given HRQOL scale, the anchor-based estimates from multiple anchors were triangulated to a single value via a correlation-based weighted average.
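A short sketch may help to see what the slope of this regression measures. With a single binary covariate, the ordinary least squares slope equals the difference between the mean change score of the changed group and that of the stable group. The code below uses the textbook closed-form slope rather than the SAS procedures used in the study; `ols_slope` and the data are hypothetical.

```python
from statistics import mean


def ols_slope(x, y):
    """Ordinary least squares slope for simple linear regression of y on x."""
    mx, my = mean(x), mean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    return sxy / sxx


# Hypothetical HRQOL change scores, for illustration only.
stable = [0, 2, -1, 1, -2]      # anchor unchanged, coded 0
improved = [6, 8, 4, 10]        # anchor improved by 1 category, coded 1

x = [0] * len(stable) + [1] * len(improved)
y = stable + improved

slope = ols_slope(x, y)
print(slope, mean(improved) - mean(stable))
```

The two printed values agree up to floating-point rounding, which is why the regression estimate is suited to between-group comparisons.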
Distribution-based techniques were used as supportive methods by estimating the 0.2 SD, 0.3 SD, 0.5 SD and SEM separately at T1, T2 and T3. These techniques have previously been used in the literature to estimate MIDs [7]. However, because these estimates rely solely on the statistical distribution of the HRQOL scores and do not include an inherent valuation of clinical relevance, they are used to give context to our derived anchor-based estimates. Test–retest reliability estimates to compute SEM for the QLQ-C30 were obtained from Hjermstad et al. [22]. All statistical analyses were performed using the SAS software [23]. An in-depth description of the statistical methodology, including the anchor selection process, has previously been published [16].
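The distribution-based quantities can be sketched as below, assuming the usual definition SEM = SD × √(1 − reliability) and the sample SD. The reliability value of 0.84 and the scores are hypothetical placeholders, not the test–retest estimates from Hjermstad et al. [22].

```python
from math import sqrt
from statistics import stdev


def distribution_based_estimates(scores, reliability):
    """Return 0.2 SD, 0.3 SD, 0.5 SD and SEM for one scale at one time point."""
    sd = stdev(scores)
    return {
        "0.2 SD": 0.2 * sd,
        "0.3 SD": 0.3 * sd,
        "0.5 SD": 0.5 * sd,
        "SEM": sd * sqrt(1 - reliability),  # standard error of measurement
    }


# Hypothetical scale scores and reliability, for illustration only.
estimates = distribution_based_estimates([60, 70, 80, 90, 100], reliability=0.84)
for name, value in estimates.items():
    print(name, round(value, 2))
```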

Results
The baseline demographic and clinical characteristics of the study population are presented in Tables 1 and 2. The characteristics of the patients across the 3 trials were similar. In Table 3, the descriptive statistics of the QLQ-C30 scale scores at T1, T2 and T3 are summarised. The distribution of the various scale scores was similar across the different time points. The time period (in months) between T1 and T2 ranged from 0.1 to 24.2 with a mean of 10.4 (SD = 6.1) for trial 1, from 0 to 38.4 with a mean of 12.3 (SD = 12.8) for trial 2 and from 0.1 to 57 with a mean of 23.7 (SD = 16.6) for trial 3. The period between T2 and T3 ranged from 0 to 31.3 with a mean of 8.9 (SD = 6.4) for trial 1, from 0 to 64.4 with a mean of 11.2 (SD = 11) for trial 2 and from 0.5 to 64.4 with a mean of 27.5 (SD = 19.7) for trial 3.
Cross-sectional correlations of the QLQ-C30 scale scores with their corresponding selected anchors (at T1, T2 and T3) and correlations between their change scores (T2 − T1 and T3 − T2) are presented in Table 4. At least one anchor was constructed for each QLQ-C30 scale, except for the constipation scale, for which no suitable anchors were found. The cross-sectional correlations ranged from 0.16 to 0.76 in absolute value, with more than 90% of the correlation coefficients being above the 0.3 threshold [7]. Much lower correlations (range: 0.1–0.53) were observed between the change scores.
The distribution of patients across the different anchor categories is summarised in Table A.1. According to the anchors, most patients remained stable (63%–88%) in both periods (between T1 and T2 and between T2 and T3). Relatively low proportions of patients either improved (4%–20%) or deteriorated (2%–11%).
Table 5 presents the range of estimated MID values from the mean change method and the linear regression for each HRQOL scale, across multiple anchors and over time (change between T1 and T2 versus change between T2 and T3). MID estimates are presented only for scales with at least one appropriate anchor and for which the CCG had an ES of ≥0.2 and <0.8. Detailed results on the estimates per anchor from the mean change method and the linear regression are presented in Tables A.2 and A.3, respectively. Generally, the MID estimates varied by scale, direction of change scores (improvement versus deterioration), selected anchor and time point. This is illustrated in Fig. 1, in which estimates from the mean change method in Table 5 are plotted along with their 95% confidence intervals (CIs). Although the MID estimates for change between T1 and T2 were comparable to those for change between T2 and T3, relatively wider CIs were observed in the latter time period, reflecting the relatively smaller sample size. The MID estimates were always in the expected direction according to the anchor, i.e. positive versus negative change scores within the improvement versus deterioration CCG, respectively. Based on ANOVA, the difference in change scores between the improvement (or deterioration) CCG and the no change CCG was statistically significant (p-value <0.05) for most of the EORTC QLQ-C30 scales. Nonsignificant differences were mostly observed among the CCGs with an ES of <0.2. As shown in Table 5, the MIDs for interpreting within-group change in HRQOL scores (estimated using the mean change method) generally ranged from 4 to 18 points and −16 to −4 points for improvement and deterioration, respectively. MIDs for between-group change (estimated using the linear regression) ranged from 3 to 16 points and −16 to −3 points for improvement and deterioration, respectively. For the majority of the QLQ-C30 scales, the estimated MIDs ranged from 5 to 10 points in absolute values.
The results in Table 5 were further summarised to single MID values per scale in Table 6 by taking a correlation-weighted average across multiple anchors. This facilitates the selection of MIDs per QLQ-C30 scale for use in practice. Furthermore, in Table 6, we also compared the anchor-based MIDs to estimates from distribution-based approaches commonly used in the literature. The distribution-based estimates for each QLQ-C30 scale were very similar across T1, T2 and T3. For a particular distribution-based approach, the estimates across the different time points were mostly within a <1 point range for a given QLQ-C30 scale. Therefore, only results at T1 are reported in Table 6. The anchor-based MID estimates tended to be larger than the 0.2 SD and smaller than the 0.5 SD. Most of the anchor-based estimates were close to both the 0.3 SD and the 1 SEM.
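One possible reading of the correlation-weighted average is sketched below. The exact weighting scheme is not spelled out in this article (it is described in [16]); the sketch assumes that each anchor's MID estimate is weighted by the absolute correlation between the anchor and scale change scores. The function name and the numbers are hypothetical.

```python
def correlation_weighted_mid(mid_estimates, correlations):
    """Combine per-anchor MID estimates into a single value.

    Assumption: weights are the absolute anchor/scale change-score
    correlations, so better-correlated anchors count for more.
    """
    weights = [abs(r) for r in correlations]
    weighted_sum = sum(w * m for w, m in zip(weights, mid_estimates))
    return weighted_sum / sum(weights)


# Hypothetical estimates from two anchors for one scale.
single_mid = correlation_weighted_mid([6.0, 9.0], [-0.2, 0.4])
print(round(single_mid, 1))
```

Under this assumption the anchor with the stronger correlation pulls the combined value towards its own estimate, here 8 points rather than the unweighted mean of 7.5.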

Discussion
Our study determined MIDs for group-level change of the EORTC QLQ-C30 scores over time, using individual patient data pooled across three published international randomised EORTC adjuvant melanoma clinical trials. Anchors for each QLQ-C30 scale were selected based on both the statistical correlation and clinical plausibility. Multiple anchors were selected for most QLQ-C30 scales. The cross-sectional correlations between the anchors and their corresponding scales were usually greater than the recommended 0.3 correlation threshold [7]. However, lower correlations were observed when considering the changes over time, which may be attributed to cumulative measurement error. The use of multiple anchors per scale provided some reassurance about the plausibility of the estimated MIDs. Despite the modest correlation between the anchor and scale change scores, the estimated MIDs were often within a small range (generally a range of <5 points) and were also in the expected direction of change according to the anchor.
Similar to recent findings on MIDs for the QLQ-C30 by Cocks et al. [5,6] and Maringwa et al. [8,9], we observed that MIDs vary by scale as well as by the direction of change (improvement versus deterioration). Furthermore, akin to the study by Maringwa et al. [8,9], there were no systematic differences in the magnitude of change between deteriorating and improving scores. This is in contrast to the study by Cocks et al. [6] and other studies that assessed MIDs for the Functional Assessment of Cancer Therapy questionnaires [24,25], in which estimates for deterioration tended to be larger than those for improvement. However, we noted that the latter studies used patient- or clinician-rated global ratings of change as anchors, whereas our study and those of Maringwa et al. applied clinical anchors. It will be interesting to further examine this observation in other studies.
Our MID estimates across many scales were largely within the 5–10 point range suggested by Osoba et al. [4], as shown in Table 5. Cocks et al. [5,6] and Maringwa et al. [8,9] also made similar observations, which is reassuring. However, as pointed out by Cocks et al. [5,6], we also observed that the thresholds for some scales could be much lower. For example, the MIDs for the EF and CF scales could be as low as 3 points. On the other hand, much bigger thresholds were observed for scales such as RF and AP; MIDs for the AP scale could be as high as 18 points. This reinforces the evidence that there is no single global standard for clinically meaningful change, and scale-specific MIDs should therefore be selected with caution.
For any given QLQ-C30 scale, no remarkable differences were observed among MIDs for change scores between T1 and T2 and between T2 and T3. This is probably because the patients' HRQOL in these adjuvant melanoma studies was relatively stable over time, as shown by the mean scores at T1, T2 and T3 in Table 3. Furthermore, according to the anchors, the majority of the patients remained stable over time or changed by only one category (Table A.1). Comparable estimates (results not shown) were also obtained from applying the mean change method to the merged data of all possible pairwise time point differences of HRQOL scores (where a subject can contribute multiple change scores that are calculated across different pairs of time points). We also made a distinction between MIDs for interpreting within-group changes, obtained from the mean change method, and MIDs for interpreting changes between groups, obtained from the linear regression. Estimates from both approaches were often in the same range. While clinicians and researchers seeking MIDs would often like simple guidance, results such as those presented in this article are often complex, as a consequence of there being numerous anchors, various distribution-based criteria and various HRQOL scales. In Table 5, we represented this complexity as the range of MIDs generated by the various anchors. However, we acknowledge that end-users may find such a range of options confusing and may wonder which they should use. So, to provide a single MID value per QLQ-C30 scale, we further simplified by calculating a correlation-weighted average across multiple anchors. End-users can choose to work with either the ranges provided in Table 5 or the single values provided in Table 6, whichever they feel most comfortable with.
A limitation of our study is that anchor-based MIDs could only be estimated for QLQ-C30 scales for which a suitable anchor was available in the database. For example, no suitable anchors were found for the constipation (CO) scale. Different anchors also represent different categorisations of clinical relevance that may or may not exceed a 'true' MID. Furthermore, the available anchors relied exclusively on clinical observations or interpretations. The potentially inflated MID estimates for scales such as RF and AP may be due to an underestimation of their relevance by the physician-rated anchors (such as performance status or CTCAE grades) compared to the patient self-reported assessment. However, given that our data set is limited, it will be interesting to further examine this observation in future studies. Anchors related to mental health/distress of patients were not available in our study, which is a notable gap because these are important aspects of HRQOL. In addition, anchors that are based on the patient's perspective of change (e.g. subjective significance questionnaires) were not available. Nonetheless, it is reassuring to notice the considerable overlap between our findings and those of Osoba et al. [4], which were based on using individual patients' ratings of change as anchor. One out of the three trials that were pooled in this study used version 2 of the EORTC QLQ-C30. Although the scales were transformed to have values between 0 and 100, the PF scale of version 2 can only take a limited range of values compared to version 3. It will be interesting to further investigate in a larger sample if these differences may affect MID estimates.
Another limitation is that our data originate from three controlled clinical trials, each with specific selection and treatment criteria. Although results are consistent among the three trials, extrapolation beyond their specific setting remains unverified.
In conclusion, our findings can help clinicians and researchers to interpret the clinical relevance of group-level change of QLQ-C30 scores over time in patients with malignant melanoma. We have provided MID estimates for interpreting changes in HRQOL scores over time both within a group and between groups of patients. Our results will also help in performing more accurate sample size calculations when primary outcomes are based on EORTC QLQ-C30 scales.

Fig. 1. Mean change and 95% confidence interval for improvement and deterioration in EORTC QLQ-C30 scales, across multiple anchors and at different time periods. Estimates are available only for scales with at least 1 suitable anchor and with effect size ≥0.2 and <0.8 within the deterioration and improvement groups, respectively. These mean change scores are useful for interpreting within-group change over time. Abbreviations: AP, appetite loss; CF, cognitive functioning; CO, constipation; DI, diarrhoea; DY, dyspnoea; EF, emotional functioning; FA, fatigue; NV, nausea/vomiting; PA, pain; PF, physical functioning; QL, global quality of life; RF, role functioning; SF, social functioning; SL, sleep disturbance. Deterioration = worsened by 1 anchor category, no change = no change in anchor category and improvement = improved by 1 category.

Table 1
Selected baseline demographic and clinical characteristics of the patients by study.

Table 2
Distribution of patients by baseline disease stage.

Table 3
Summary statistics of the EORTC QLQ-C30 scale scores at T1, T2 and T3. T1, T2 and T3 are time points for the start of treatment, end of treatment and end of follow-up, respectively. AP, appetite loss; CF, cognitive functioning; CO, constipation; DI, diarrhoea; DY, dyspnoea; EF, emotional functioning; FA, fatigue; NV, nausea/vomiting; PA, pain; PF, physical functioning; QL, global quality of life; RF, role functioning; SF, social functioning; SL, sleep disturbance; SD, standard deviation; EORTC QLQ-C30, European Organisation for Research and Treatment for Cancer Quality of life Questionnaire core 30.

Table 4
Cross-sectional correlations of the EORTC QLQ-C30 scale scores with anchors and correlations between their change scores. EORTC QLQ-C30, European Organisation for Research and Treatment for Cancer Quality of life Questionnaire core 30. Example of cross-sectional correlations: PF at T1 versus performance status at T1 = −0.39, PF at T2 versus performance status at T2 = −0.41 and PF at T3 versus performance status at T3 = −0.35. Example of change score correlations: (PF at T2 − PF at T1) versus (performance status at T2 − performance status at T1) = −0.23 and (PF at T3 − PF at T2) versus (performance status at T3 − performance status at T2) = −0.28.

Table 5
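Range of anchor-based MID estimates from the mean change method and linear regression. MIDs from the mean change method and the linear regression are useful for interpreting within-group and between-group change, respectively. The symptom scores were reversed to follow the functioning scales interpretation, i.e. 0 represents the worst possible score and 100 the best possible score; no MID (nM) is used where no MID estimate is available, either because no suitable anchor was available or because the ES was <0.2 or ≥0.8. Abbreviations: T1, T2 and T3 are time points for the start of treatment, end of treatment and end of follow-up, respectively. AP, appetite loss; CF, cognitive functioning; CO, constipation; DI, diarrhoea; DY, dyspnoea; EF, emotional functioning; FA, fatigue; NV, nausea/vomiting; PA, pain; PF, physical functioning; QL, global quality of life; RF, role functioning; SF, social functioning; SL, sleep disturbance; MID, minimally important difference.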

Table A.1
Frequency of patients by change scores of anchors. T1, T2 and T3 are time points for the start of treatment, end of treatment and end of follow-up, respectively. Anchor change scores: −4 to −1, 0 and 1 to 4 represent improvement, no change and deterioration, respectively. Only the −1, 0 and 1 change score categories were used to estimate MIDs. No MIDs for deterioration were calculated for CTCAE diarrhoea between T2 and T3 because only 2 patients experienced a clinically minimal deterioration.

Table A.2
All the ESs for the no change group were <0.2. The symptom scores were reversed to follow the functioning scales interpretation, i.e. 0 represents the worst possible score and 100 the best possible score. No results are presented for deterioration in the DI scale based on CTCAE diarrhoea between T2 and T3 because only 2 patients experienced a clinically minimal deterioration. Abbreviations: T1, T2 and T3 are time points for the start of treatment, end of treatment and end of follow-up, respectively. AP, appetite loss; CF, cognitive functioning; CO, constipation; DI, diarrhoea; DY, dyspnoea; EF, emotional functioning; FA, fatigue; NV, nausea/vomiting; PA, pain; PF, physical functioning; QL, global quality of life; RF, role functioning; SF, social functioning; SL, sleep disturbance; EORTC QLQ-C30, European Organisation for Research and Treatment for Cancer Quality of life Questionnaire core 30. a These estimated change scores were not considered to summarise the MID estimate because their ESs were either <0.2 or ≥0.8.

Table A.3
Separate regression models were fitted for each scale/anchor pair: outcome = HRQOL change score, covariate = binary anchor variable, coded as 'stable' = 0 and 'improvement' = 1 or 'deterioration' = 1 for models on improvement and deterioration, respectively. The mean change scores = slope parameters. No results are presented for deterioration in the DI scale based on CTCAE diarrhoea between T2 and T3 because only 2 patients experienced a clinically minimal deterioration. Abbreviations: T1, T2 and T3 are time points for the start of treatment, end of treatment and end of follow-up, respectively. AP, appetite loss; CF, cognitive functioning; CO, constipation; DI, diarrhoea; DY, dyspnoea; EF, emotional functioning; FA, fatigue; NV, nausea/vomiting; PA, pain; PF, physical functioning; QL, global quality of life; RF, role functioning; SF, social functioning; SL, sleep disturbance; CTCAE, common terminology criteria for adverse events; MID, minimally important difference; ES, effect size. a These estimated change scores were not considered to summarise the MID estimate because their ES was either <0.2 or ≥0.8.