Comparative responsiveness of outcome measures for total knee arthroplasty

Summary

Objective: The aim of this study was to compare the responsiveness of various patient-reported outcome measures (PROMs) and clinician-reported outcomes following total knee arthroplasty (TKA) over a 2-year period.

Methods: Data were collected in a prospective cohort study of primary TKA. Patients who had completed the Forgotten Joint Score-12 (FJS-12), Western Ontario and McMaster Universities (WOMAC) osteoarthritis (OA) index, EQ-5D, Knee Society Score (KSS) and range of movement (ROM) assessment were included. Five time points were assessed: pre-operative, 2 months, 6 months, 1 year and 2 years post-operative.

Results: Data from 98 TKAs were available for analysis. The largest effect sizes (ES) for change from pre-operative to 2-month follow-up were observed for the KSS Knee score (1.70) and WOMAC Total (−1.50). For the period from 6 months to 1 year the largest ES for change were shown by the FJS-12 (0.99) and the KSS Function Score (0.88). The EQ-5D showed the strongest ceiling effect at 1-year follow-up, with 84.4% of patients scoring the maximum score. ES for the time from 1- to 2-year follow-up were largest for the FJS-12 (0.50). All other outcome measures showed ES of 0.30 or below.

Conclusion: Outcome measures differ considerably in responsiveness, especially beyond one year post-operatively. Joint-specific outcome measures are more responsive than clinician-reported or generic health outcome tools. The FJS-12 was the most responsive of the tools assessed, suggesting that joint awareness may be a more discerning measure of patient outcome than traditional PROMs.


Introduction
The outcomes of total knee arthroplasty (TKA) can be assessed with various methods: implant survivorship, image-based assessment, clinical assessment and patient-reported outcome measures (PROMs). While the first three modalities are objective in nature, patient report can provide a subjective measure of the patient's perception of the success of an intervention.
The importance of including patients' views on treatment outcome in orthopaedics has been well established in recent years and a variety of patient-reported measures are available 1 . Furthermore, self-reported questionnaires are a potentially cost-effective way of monitoring patient outcome in large volumes. PROMs can be broadly dichotomised into generic health status questionnaires such as the EQ-5D or SF-36 (which assess the individual's overall quality of life) and disease/joint-specific tools such as the Western Ontario and McMaster Universities (WOMAC) score, which focus on specific constructs such as pain, stiffness and joint function in activities of daily living 2 . The latter allow a more focused evaluation of an intervention such as TKA. The most common orthopaedic patient-reported outcome (PRO) tools have been extensively analysed regarding their validity and reproducibility 3-5 . More recently, researchers have turned to assessing responsiveness and floor/ceiling effects 6,7 . Responsiveness to change is of particular importance in longitudinal studies where the scoring should reflect changes over time. If a questionnaire is not sufficiently responsive to the construct being assessed, it will not capture changes at follow-up. This is especially important in mid-to-long-term studies, where changes in the patients' pain and function are typically not as pronounced as in the early post-operative phase. It is of direct relevance to measuring PRO following TKA, where patient function changes markedly in the early post-operative phase but is followed by more subtle changes over time 8 .
Previous studies of instrument responsiveness, however, tended to focus on comparison of general health measures vs joint-specific measures 6,9 or covered follow-up only up to 12 months 10-13 . Comprehensive analyses of multiple outcome assessment tools at various time points over 2 years are lacking.
The aim of this study is to compare the responsiveness of various PROMs (FJS-12, WOMAC score, EQ-5D) and clinician-reported outcomes (Knee Society Score, range of motion) following TKA.

Sample population
Data were collected in a prospective cohort study of primary TKA between 2007 and 2009 at Kantonsspital St. Gallen, Switzerland. This was a pragmatic study that reflected local surgical practice at the time, using both mobile and fixed-bearing designs. Informed consent was obtained from the participants and ethical approval was granted by the local ethics committee. Patients who had completed the FJS-12, WOMAC score, EQ-5D and Knee Society Score (KSS) were included. Participants were assessed at five different time points: pre-operatively, and at 2, 6, 12 and 24 months post-operatively. Socio-demographic and clinical data included gender, Body Mass Index (BMI), age at time of surgery and side of implant.

Outcome measures
A single experienced study nurse performed the clinical examinations and handed the questionnaires to the patients, who completed them independently.

WOMAC
The WOMAC osteoarthritis (OA) index is a widely used self-report outcome measure in patients with lower limb OA that was introduced by Bellamy and Buchanan 14 . The original score with 5-point Likert response categories consists of 24 questions covering three dimensions: pain (five questions), stiffness (two questions), and function (17 questions). The WOMAC has been extensively tested for validity, reliability, feasibility, and responsiveness for measuring changes after different OA interventions 14-17 and has also been evaluated in an electronic form 18 . WOMAC scores were linearly transformed to a 0-100 scale with higher scores indicating more severe impairment.
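The linear transformation to a 0-100 scale can be sketched as a simple min-max rescaling. This is only an illustration: the function name `rescale_to_100` is hypothetical, and the item coding (each of the 24 items scored 0 to 4, giving a raw total of 0 to 96) is an assumption that the text does not state.

```python
def rescale_to_100(raw_score, raw_min, raw_max):
    """Linearly map a raw scale score onto 0-100 (min-max rescaling)."""
    if raw_max <= raw_min:
        raise ValueError("raw_max must be greater than raw_min")
    return 100.0 * (raw_score - raw_min) / (raw_max - raw_min)


# Assumed coding: 24 items scored 0-4, so the raw total ranges from 0 to 96.
womac_total_transformed = rescale_to_100(48, raw_min=0, raw_max=96)
```

Under this assumed coding, a higher transformed score still indicates more severe impairment, because the rescaling preserves the direction of the raw score.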

Forgotten Joint Score (FJS-12)
The FJS-12 is a recently published PRO scale to assess joint awareness in hips and knees during various activities of daily living 19,20 . It uses a 5-point Likert response format, consisting of 12 equally weighted questions with the raw score transformed to range from 0 to 100 points. High scores indicate good outcome, i.e., a high degree of being able to forget about the affected joint in daily life. In its validation study 19 it showed a low ceiling effect and high internal consistency (Cronbach's Alpha 0.95) and discriminated well between patient groups known to show different outcome.

EQ-5D
The EQ-5D is a standardised generic quality of life assessment instrument with five items for use as a measure of self-reported general health 21 . Applicable to a wide range of health conditions and treatments, it provides a simple descriptive profile and a single index value for health status. It is one of the most frequently used measures internationally for deriving quality of life scores for health economic analysis, as it yields utility weights (ranging from 0 to 1) for calculating quality-adjusted life years (QALYs) 22 .

KSS
The KSS 23 is a widely used clinician-reported outcome score with good published validity data 24 . The clinical part (Knee Score) of the KSS covers pain, range of movement (ROM), alignment and stability. The functional part (Function Score) of the KSS covers the patient's mobility (walking distance and stairs) and potential walking aids. Score range of the KSS is from 0 to 100 points for each part with higher scores indicating less severe impairment.

ROM
Active measures of flexion and extension were determined using universal goniometry. A high level of accuracy has been previously demonstrated assessing knee range of motion with this instrument in the clinical setting 25 and specifically in this patient group 26 . All measurements were made by the study nurse.

Statistical analysis
Sample characteristics are given as means, standard deviations (SDs), ranges, and frequencies. As measures of responsiveness we provide effect sizes (ES, mean difference divided by SD at earlier assessment), standardised response means (SRMs, mean change divided by the standard deviation of the change score) and relative validity (RV). RV was obtained from the ratio of the F-statistics from an analysis of variance for repeated measures, comparing two time points. As a reference measure (the denominator) we used the WOMAC total scale for all time points. In addition, we provide percentages of patients obtaining the highest or the lowest possible score on a measure (i.e., floor and ceiling effects). Statistical analyses were performed with SPSS 20.0.
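The responsiveness indices defined above can be sketched in a few lines of Python. This is a minimal illustration rather than the SPSS procedure used in the study: the function names are hypothetical, the data are assumed to be paired (each patient has a score at both time points), and the sample SD (n − 1 denominator) is an assumption.

```python
from statistics import mean, stdev


def effect_size(earlier, later):
    """ES: mean change divided by the SD of the scores at the earlier assessment."""
    diffs = [b - a for a, b in zip(earlier, later)]
    return mean(diffs) / stdev(earlier)


def standardised_response_mean(earlier, later):
    """SRM: mean change divided by the SD of the change scores."""
    diffs = [b - a for a, b in zip(earlier, later)]
    return mean(diffs) / stdev(diffs)


def floor_ceiling(scores, lowest, highest):
    """Percentage of patients at the lowest and at the highest possible score."""
    n = len(scores)
    floor_pct = 100.0 * sum(s == lowest for s in scores) / n
    ceiling_pct = 100.0 * sum(s == highest for s in scores) / n
    return floor_pct, ceiling_pct


# Made-up scores for four patients on a 0-100 scale where higher is worse.
pre_op = [40, 50, 60, 50]
follow_up = [20, 35, 35, 30]
es = effect_size(pre_op, follow_up)                   # negative: scores decreased
srm = standardised_response_mean(pre_op, follow_up)
```

The sign of ES and SRM follows the direction of the scale, which is why measures on which higher scores indicate more impairment (such as the WOMAC) produce negative values when patients improve.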

Patient characteristics
During the study period 537 patients underwent TKA at our institution. Our part-time study nurse recruited 98 of these for the study. Mean age at baseline was 68.1 years (SD 8.6), 49% were female (Table I). The number of subjects for whom data was available varied according to the different time points as shown in Table II. All available data points were included in the analyses.

Responsiveness over time
To highlight how the different measures perform over different time intervals following surgery, we analysed responsiveness compared to baseline and also to the previous follow-up assessment. Presenting responsiveness indices this way shows more clearly at which time point after surgery the various measures are able to capture change. Baseline comparisons are also detailed in Tables III and IV.

Pre-operative to 2-month follow-up
The largest ES for change from pre-operative to 2-month follow-up were observed for the KSS knee score (1.70), WOMAC pain (−1.52) and WOMAC total (−1.50). In contrast, range of motion changed only a little, with an ES of −0.19. SRM was biggest for WOMAC pain (−1.18) and WOMAC function (−0.91) and smallest for ROM (−0.20). At baseline, only WOMAC stiffness showed floor and ceiling effects, with 12.4% of the patients obtaining the lowest possible score and 14.6% the highest possible score. At 2-month follow-up the most pronounced floor and ceiling effects were observed for the EQ-5D (39.4% highest score) and again WOMAC stiffness (29.0% lowest score). Score change of the FJS-12 could not be calculated as this score was not administered pre-operatively. Further details are given in Tables II-V.

2-Month to 6-month follow-up
From 2-month to 6-month follow-up the biggest change in terms of ES was found for the FJS-12 and the KSS function score (both 1.20). The KSS knee score (0.52) and the EQ-5D (0.37) showed the smallest change for this period. SRM was smallest for the EQ-5D (0.35) and biggest for the KSS function score (1.14) and the WOMAC total score (−0.86). At 6-month follow-up the most pronounced floor and ceiling effects were found for the EQ-5D (67.4% highest score), the WOMAC stiffness score (51.6%) and the KSS function score (14.6% highest score). Further details are given in Tables II-V.

6-Month to 1-year follow-up
For the period from 6 months to 1 year the greatest ES for change were shown by the FJS-12 (0.99) and the KSS function score (0.88). The FJS-12 was also the largest in terms of SRM (0.99), followed by the WOMAC function and total score (both −0.90) and the KSS function score (0.89). Again the EQ-5D and the WOMAC stiffness score performed worst with regard to ES and SRM. These two scores also showed the strongest floor and ceiling effects at 1-year follow-up (EQ-5D 84.4% highest score and WOMAC stiffness 64.6% lowest score). At this time point the FJS-12 was the only score that had less than 10% in the highest or lowest category. Further details are given in Tables II-V.

1- to 2-year follow-up
ES for the time from 1- to 2-year follow-up were biggest for the FJS-12 (0.50). All other scores showed ES of 0.30 or below. SRM was highest for the WOMAC Total score (0.31) and the FJS-12 (0.30). ROM remained constant at a mean of 120°, showing no ceiling effect, as TKA patients' ROM is naturally less than a healthy individual's ROM. All other outcome measures showed substantial floor and ceiling effects. The FJS-12 had 33.0% of patients showing the highest score followed by the KSS Knee score (37.6%) and the WOMAC Total score (39.6%). Further details are given in Tables II-V.

Discussion
This study demonstrates that outcome measures widely used in orthopaedic research differ substantially with regard to their responsiveness. Previous authors have highlighted differences between various tools but have focused on early outcome, typically comparing two instruments' ability to assess change over 6-12 months post-operatively 10-13, 27 .
Complicating the interpretation of outcome assessment is the fact that the various scores have differing (sometimes substantial) ceiling effects, i.e., they fail to capture change because the scores lack discriminatory power rather than because no change occurred.
A particular strength of this study is the comprehensive assessment of various outcome tools over five time points, which allows a more detailed analysis of the behaviour of the different tools into the later recovery phase. There are scant data on PROM responsiveness for follow-up periods of 2 years and longer. Whereas Browne et al. 28 suggested following up patients until 1 year post-operatively, our data demonstrate the need for longer follow-up periods. We captured change between 12 and 24 months using responsive measures. The need for follow-up beyond one year has been recognised and is also reflected by journal author guidelines 29 . It is of note that orthopaedic journal author guidelines started to require 2-year outcome data for clinical studies involving new implants despite the fact that the ability of various PROMs to capture change over this time frame still needs further investigation.
The sex ratio we report in this study was unexpectedly equal; however, this reflects the patient throughput in our clinics on the 2 days/week the study nurse was present to recruit. To check that this had no confounding effect on our study findings we compared weighted and unweighted ES of the measures. We weighted the study cohort to reflect the sex ratio from our local arthroplasty database (59.8% female) and calculated the difference. This did not influence the results presented in the manuscript and we can therefore be confident in the analysis presented.
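One way such a weighting check could be carried out is sketched below. The manuscript does not describe the weighting procedure, so everything here is an assumption for illustration: the helper names are hypothetical, the weights are simple post-stratification weights (target share divided by sample share), and the weighted SD uses the weighted mean squared deviation without a small-sample correction.

```python
def poststratification_weights(groups, target_shares):
    """Weight each patient by target share / sample share of their group."""
    n = len(groups)
    sample_shares = {g: groups.count(g) / n for g in set(groups)}
    return [target_shares[g] / sample_shares[g] for g in groups]


def weighted_mean(values, weights):
    """Weighted arithmetic mean."""
    return sum(w * v for v, w in zip(values, weights)) / sum(weights)


def weighted_effect_size(earlier, later, weights):
    """Weighted mean change divided by the weighted SD at the earlier assessment."""
    diffs = [b - a for a, b in zip(earlier, later)]
    mean_change = weighted_mean(diffs, weights)
    mean_earlier = weighted_mean(earlier, weights)
    var_earlier = weighted_mean([(x - mean_earlier) ** 2 for x in earlier], weights)
    return mean_change / var_earlier ** 0.5


# Made-up cohort: half female in the sample, 59.8% female in the target population.
sex = ["f", "f", "m", "m"]
weights = poststratification_weights(sex, {"f": 0.598, "m": 0.402})
weighted_es = weighted_effect_size([40, 50, 60, 50], [20, 35, 35, 30], weights)
```

Comparing `weighted_es` with the unweighted ES for each measure would then show whether the sex ratio of the sample materially shifts the responsiveness estimates.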
The joint-specific scores (WOMAC score and the FJS-12) showed the highest responsiveness in terms of ES and SRMs compared to the KSS or the EQ-5D. The KSS and ROM measurement were able to detect change up to 1-year follow-up. However, the KSS barely improved (1.4 points) between the 1- and 2-year follow-up and ROM remained constant at 120° (Table II). The two parts of the KSS (Knee and Function score) also showed limited responsiveness in terms of ES and SRM between 1- and 2-year follow-up but performed substantially better during the first post-operative year. The decreasing function score of the KSS from pre-operatively to 2 months post-operatively is mainly due to the use of walking aids. The KSS is very sensitive to this item, as 20 points are subtracted if crutches are used. However, at our hospital we often recommend crutches to elderly patients for 2 months for safety reasons (especially in winter), so this may well have skewed the KSS in our study. McKay et al. 30
In this study, the EQ-5D performed very poorly in terms of responsiveness, which is related to the vast ceiling effect from 6-month follow-up onwards (e.g., 84.4% of the patients had the highest possible score at 1-year follow-up, Table V). Similarly, Ko et al. 6 found better responsiveness of joint-specific measures in TKA patients compared to the generic SF-36 and the clinician-reported KSS. These results highlight the importance of disease/joint-specific PROMs for orthopaedic outcome research, as they provide a valuable means to sensitively capture changes in patients' condition, especially once the post-operative rehabilitation phase has been completed.
However, joint-specific measures also show different responsiveness. Theiler et al. 12 compared the WOMAC with the clinician-reported Lequesne algo-functional index at baseline, 6 and 12 months in patients undergoing total hip arthroplasty (THA) and TKA and found superior responsiveness of the patient-reported WOMAC score. In a recent study Williams et al. 7 compared responsiveness of the WOMAC, the Knee Outcome Survey - Activities of Daily Living Scale (ADLS) and the Lower Extremity Functional Scale (LEFS) in patients with knee OA participating in a rehabilitation programme (2, 6 and 12 months after the start of the programme). In contrast to our study, TKA was an exclusion criterion and patients were in a better condition (baseline WOMAC total score of 28.1 points vs a pre-operative WOMAC total score of 48.4 points in our study). When comparing their baseline with 2-month follow-up, ES for change were 0.33 for the ADLS, 0.32 for the LEFS and 0.43 for the WOMAC (values we have calculated from summary tables in their manuscript). This suggests that in patients with a lower symptom burden, the responsiveness of these specific outcome measures is poor. Similarly, in the later post-operative phase after TKA in our study, ES for change were low for the WOMAC score (0.30 between 1- and 2-year follow-up). The FJS-12 was more responsive with an ES of 0.50 in that time period. Generic scores such as the EQ-5D failed to detect change after the early rehabilitation phase since 81% of TKA patients report good outcomes following surgery 8,31 . Therefore joint-specific PROMs are needed to capture change over time or to pick up differences between two groups in a cross-sectional study design. The FJS-12, a measure of patients' joint awareness during activities of daily living, performed best with regard to ES for change between 2 and 6 months, between 6 and 12 months and between 1 and 2 years of follow-up (Table III).
From a logistical and patient compliance point of view, it is notable that these advantageous measurement characteristics accompany a low number of questions asked.
In most of the outcome measures (WOMAC, ROM, KSS) SD decreased over time (halving between pre-op assessment and 2-year follow-up). This is very important in the interpretation of ES, as the SD is the denominator. In the early recovery phase, floor/ceiling effects are less pronounced because data are more normally distributed. This results in larger SDs (ES denominator). Thus, the same mean difference results in lower ES in the early recovery phase compared to the later phase. It is critical to consider this statistical artifact (which affects SRM in a similar manner) when interpreting the results in Table IV. Beyond ES, RV allows for comparative analysis of individual scores. According to Fayers and Hays 32 RV gives the ratio of sample sizes "that would be required to detect the known group difference using one measure versus the other". Therefore RV allows comparison of sample size needed for each instrument. Our data highlight that the EQ-5D requires 5 times as many patients as the WOMAC score to demonstrate baseline to 1-year change. It requires 10 times the number of patients compared to the WOMAC score to capture change between 1 and 2 years post-operatively. For longer term follow-up (2 years) the FJS-12 requires only one third of the number of patients compared to the WOMAC score. These are important considerations when powering outcome studies with PROMs.
The good responsiveness to change of the FJS-12 is perhaps because this score is based on a more discerning construct. A 'forgotten joint' (i.e., that the patient has no awareness of the affected joint during various activities of daily living) is very hard to accomplish. The relatively large ES of this score at 1- and 2-year follow-up are beneficial with regard to powering outcome studies over a longer time span, as substantial floor and ceiling effects compromise responsiveness.
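The link between RV and sample size can be illustrated with a short sketch. It relies on the fact that a repeated-measures analysis of variance comparing two time points yields an F-statistic equal to the square of the paired t-statistic. The function names and the example F-values are hypothetical, and reading 1/RV as an approximate sample-size multiplier is our interpretation of the Fayers and Hays quotation, not a result taken from Table IV.

```python
from statistics import mean, stdev


def paired_f_statistic(earlier, later):
    """F for a two-time-point repeated-measures ANOVA (equals the paired t squared)."""
    diffs = [b - a for a, b in zip(earlier, later)]
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / n ** 0.5)
    return t ** 2


def relative_validity(f_measure, f_reference):
    """RV: F of the measure of interest divided by F of the reference measure."""
    return f_measure / f_reference


def sample_size_multiplier(rv):
    """Approximate factor by which the sample must grow relative to the reference."""
    return 1.0 / rv


# Hypothetical F-statistics: a measure with one fifth of the reference F-value.
rv = relative_validity(f_measure=20.0, f_reference=100.0)
multiplier = sample_size_multiplier(rv)
```

Under these assumptions an RV of 0.2 corresponds to needing about five times as many patients as the reference measure, and an RV above 1 corresponds to needing fewer.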

Conclusion
Outcome measures differ considerably in responsiveness, especially beyond one year post-operatively (i.e., when comparing scores at 1- and 2-year follow-up). Joint-specific self-reported outcome measures are more responsive than clinician-reported or generic health outcome tools. The FJS-12 was the most responsive tool assessed. This suggests that joint awareness may be a more discerning measure of patient outcome than traditional PROMs.

Authors' contributions
KG and JMG conceived the study objective. All authors participated in the study design. KG coordinated data collection. JMG and KG performed the statistical analysis and interpreted the results. All authors helped to outline the manuscript. KG, JMG and DH drafted the manuscript. All authors read and approved the final version.

Competing interests
None.

Role of funding source
This study had no specific funding or sponsor.