## Indices of Interobserver Agreement Must Be Assessed

The results of this research have several implications. First, proportional reliability appears to be the preferred reliability index for time-based frequency recording because, unlike interval reliability, it does not yield falsely high scores for high-rate responding and, unlike exact reliability, it is not unduly affected by minor response discrepancies. Second, given the adverse effect of recording errors at the end of an interval, observer training should include procedures to improve prompt detection of responses, and this skill should be examined specifically as a performance criterion. Finally, the differences among the reliability indices are influenced to some extent by the number of intervals on which the calculation is based. For example, responding at the end of an interval has a greater impact on calculations based on a larger number of intervals than on a smaller number (for the same total session duration), simply because there are more opportunities for observers to score responses in adjacent intervals.

The use of 10-s intervals to calculate reliability derives from the traditional 10-s interval recording method common in applied research. If the dependent variable is the percentage of intervals in which the response occurred, the unit used to calculate reliability (the 10-s interval) is also the unit used to calculate the data, so agreement is assessed for the measure that is reported. When data are taken on response frequency, however, the dependent variable is usually expressed as responses per minute rather than responses per 10-s interval, and the relevant reliability question is whether observers agree on the number of responses recorded in a given minute. We therefore suggest that the most appropriate unit (interval) for calculating the exact and proportional reliability of frequency data is 1 min rather than 10 s.

Figure 1 shows the individual reliability values for each observer for each of the six session types, and Table 1 summarizes the results as mean percentages across observers.
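The suggestion to score frequency data in 1-min rather than 10-s units can be illustrated with a minimal Python sketch. The counts below are hypothetical, not data from the study, and the helper names `rebin` and `exact_agreement` are introduced here for illustration only. The second observer scores two end-of-interval responses in the following interval, which lowers exact agreement at the 10-s level but not at the 1-min level.

```python
from typing import List


def rebin(counts: List[int], factor: int) -> List[int]:
    """Sum consecutive groups of `factor` short-interval counts into longer intervals."""
    return [sum(counts[i:i + factor]) for i in range(0, len(counts), factor)]


def exact_agreement(a: List[int], b: List[int]) -> float:
    """Percentage of intervals in which both observers recorded the same count."""
    if len(a) != len(b) or not a:
        raise ValueError("records must be non-empty and of equal length")
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return 100.0 * matches / len(a)


# Hypothetical 2-min session scored in twelve 10-s intervals.
# Observer B places two responses one interval later than observer A.
obs_a = [1, 0, 2, 0, 1, 0, 0, 1, 0, 2, 0, 0]
obs_b = [0, 1, 2, 0, 1, 0, 0, 0, 1, 2, 0, 0]

ten_s = exact_agreement(obs_a, obs_b)                      # 8 of 12 intervals match
one_min = exact_agreement(rebin(obs_a, 6), rebin(obs_b, 6))  # both 1-min blocks match

print(f"10-s exact agreement: {ten_s:.1f}%")
print(f"1-min exact agreement: {one_min:.1f}%")
```

In this toy record the observers agree on every per-minute count, which is the measure that would be reported, even though a third of the 10-s intervals are scored as disagreements.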

All calculation methods yielded high reliability values for the low-rate session. Overall reliability remained high for the moderate-rate session, whereas interval, proportional, and exact reliability values decreased markedly for Observers 1 and 2 and slightly for Observers 3, 4, and 5. Interestingly, overall, interval, and proportional reliability for the high-rate session remained high (Ms = 98.9%, 100%, and 94.6%, respectively), whereas exact agreement decreased markedly for every observer (M = 77.3%).

Behavioral scientists have developed a sophisticated methodology for assessing behavior change that depends on accurate measurement of behavior. Direct observation of behavior has traditionally been the mainstay of behavioral measurement. Researchers therefore need to attend to the psychometric properties of observational measures, such as interobserver agreement, to ensure reliable and valid measurement. Among the many indices of interobserver agreement, percentage of agreement is the most popular.

Its use persists despite repeated admonitions and empirical evidence indicating that, because it cannot account for chance agreement, it is not the most psychometrically sound statistic for determining interobserver agreement. Cohen's (1960) kappa has long been proposed as the more psychometrically sound statistic for assessing interobserver agreement. Kappa is described and the calculation methods are presented.

Mudford et al. (2009) compared exact and proportional reliability (the latter called "block by block") with time-window analysis, which scores an agreement when the two observers' data streams contain a response within a specified time tolerance of each other. Twelve observers recorded data from six video samples of client-therapist interactions and scored one target response during each session; the samples varied in response rate (three samples) or duration (three samples). Response rates were 4.8, 11.3, and 23.5 per minute for low-, medium-, and high-rate responding, respectively. The results showed that exact and proportional reliability were similarly high for low-rate responding (Ms = 78.3% and 85.3%, respectively). However, exact reliability was considerably lower than proportional reliability for medium-rate responding (Ms = 59.5% and 76.8%, respectively) and high-rate responding (Ms = 50.3% and 88%, respectively).
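To make the contrast between percentage agreement and a chance-corrected statistic concrete, the following Python sketch computes both for two observers using the standard two-rater formula for Cohen's kappa, kappa = (p_o - p_e) / (1 - p_e). The interval records are hypothetical (1 = response scored in the interval, 0 = not scored), and the function names are illustrative rather than taken from any cited paper.

```python
from collections import Counter
from typing import Hashable, Sequence


def percent_agreement(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Proportion of intervals on which the two observers gave the same code."""
    if len(a) != len(b) or not a:
        raise ValueError("records must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohens_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two observers: (p_o - p_e) / (1 - p_e)."""
    n = len(a)
    p_o = percent_agreement(a, b)
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement: product of each observer's marginal proportion per category.
    p_e = sum(count_a[k] * count_b[k] for k in set(count_a) | set(count_b)) / (n * n)
    if p_e == 1.0:
        raise ValueError("kappa is undefined when both observers use one identical code")
    return (p_o - p_e) / (1.0 - p_e)


# Hypothetical records for ten intervals with a frequently occurring response.
obs_a = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
obs_b = [1, 1, 1, 1, 1, 1, 1, 0, 1, 0]

print(f"percent agreement: {percent_agreement(obs_a, obs_b):.2f}")
print(f"Cohen's kappa:     {cohens_kappa(obs_a, obs_b):.3f}")
```

Because both observers score the response in most intervals, much of the raw agreement in this example would be expected by chance, so kappa comes out well below the percentage-agreement figure.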

These results suggest that reliability calculations are influenced by response rate, but they do not show whether the lower scores obtained with exact agreement are a function of response rate per se or of some other characteristic of high-rate responding, such as periodic bursting. The purpose of this study was to extend the findings of Mudford et al. (2009) by comparing interobserver reliability scores based on four calculation methods with data sets that varied in response rate (Study 1), and then conducting a finer-grained analysis to identify characteristics of response distribution that may contribute to lower reliability scores (Study 2).

With interval length held constant, the response characteristic most likely to affect interobserver reliability is response rate. Although several studies have examined the effect of response rate on outcomes obtained with different data-collection methods (Powell & Rockinson, 1978; Repp, Roberts, Slack, Repp, & Berkler, 1976), only one study to date has examined the influence of response rate on interobserver reliability indices (Mudford, Martin, Hui, & Taylor, 2009).

A stricter variation of interval reliability is exact agreement (Repp, Deitz, Boles, Deitz, & Repp, 1976), in which an agreement is defined as both observers recording the same number of responses in an interval. This method is the most conservative reliability estimate because any difference in recording counts as complete disagreement. For example, an interval in which one observer records four responses and the other records five is scored as a disagreement.

Response measurement in applied research usually involves data collection by human observers, which is likely to be more prone to error than automated transduction. Assessing the consistency, or reliability, of observers has therefore become a standard feature of applied research and is accomplished by determining the extent of scoring agreement between the records of independent observers. A number of factors can affect reliability (see reviews by Kazdin, 1977; Page & Iwata, 1986); this study focuses on the methods of calculating reliability statistics and how they are affected by response rate and distribution.
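The four calculation methods compared in the text can be sketched in Python as follows. The exact-agreement rule follows the definition given above; the formulas for overall, interval, and proportional ("block by block") reliability are the conventional ones and are assumptions here, since the text does not spell them out. The per-interval counts are hypothetical and chosen to resemble high-rate responding.

```python
from typing import List


def overall_agreement(a: List[int], b: List[int]) -> float:
    """Smaller session total divided by larger session total (assumed definition)."""
    ta, tb = sum(a), sum(b)
    return 100.0 if max(ta, tb) == 0 else 100.0 * min(ta, tb) / max(ta, tb)


def interval_agreement(a: List[int], b: List[int]) -> float:
    """Intervals agreeing on occurrence vs. nonoccurrence (assumed definition)."""
    return 100.0 * sum((x > 0) == (y > 0) for x, y in zip(a, b)) / len(a)


def exact_agreement(a: List[int], b: List[int]) -> float:
    """Intervals in which both observers recorded the same number of responses."""
    return 100.0 * sum(x == y for x, y in zip(a, b)) / len(a)


def proportional_agreement(a: List[int], b: List[int]) -> float:
    """Mean of per-interval smaller/larger ratios; equal counts score 1 (assumed)."""
    ratios = [1.0 if x == y else min(x, y) / max(x, y) for x, y in zip(a, b)]
    return 100.0 * sum(ratios) / len(ratios)


# Hypothetical counts per interval for two observers during high-rate responding.
# The first interval reproduces the four-versus-five example from the text.
obs_a = [4, 5, 6, 5]
obs_b = [5, 5, 6, 4]

for name, fn in [("overall", overall_agreement), ("interval", interval_agreement),
                 ("proportional", proportional_agreement), ("exact", exact_agreement)]:
    print(f"{name:>12}: {fn(obs_a, obs_b):.1f}%")
```

With these counts, overall and interval reliability are both 100%, proportional reliability is 90%, and exact agreement is 50%, the same ordering of indices the studies report for high-rate sessions.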

Figure 2 shows the individual reliability values for each observer for each of the six session types, and Table 1 summarizes the results as mean percentages across observers. All calculation methods yielded high reliability values in both the moderate-rate and high-rate sessions, suggesting that higher response rates per se did not affect reliability. Similarly, all calculation methods yielded high reliability scores for the constant and bursting sessions, suggesting that bursts of responding did not affect reliability, at least for these observers. Reliability values for the mid-interval session were also uniformly high across all calculations. The only noticeable difference among the reliability indices was observed for the end-of-interval session.

Overall reliability remained high (M = 99.3%), whereas successively lower values were obtained with the interval, proportional, and exact methods (Ms = 86.3%, 71.9%, and 53.7%, respectively). This effect was highly consistent across observers but was more pronounced for Observers 6, 7, and 10.

Inter-observer agreement (IOA) is a key aspect of data quality in time-and-motion studies of clinical work. To date, such studies have used simple, ad hoc approaches to IOA assessment, often with minimal reporting of methodological details. The main methodological issues are the alignment of time-stamped task intervals, which rarely have matching start and end times, and the assessment of IOA for multiple nominal variables. We present a combination of methods that addresses both problems simultaneously and provides a more appropriate measure for assessing IOA in time-and-motion studies. The alignment problem is addressed by converting task-level data into small time windows and then aligning the data from different observers by time. A method applicable to multivariate nominal data, the iota score, is then applied to the time-windowed data. We illustrate the approach by comparing iota scores with mean univariate Cohen's kappa scores, applying both measures to existing data from an observational study of emergency department physicians.
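The alignment step described above, converting task-level records into small time windows, might look like the following Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the 1-s window size, the rule of labeling each window by the task active at its midpoint, the `"none"` placeholder, and the task names are all hypothetical. The iota score itself is not implemented here.

```python
from typing import List, Tuple

Task = Tuple[float, float, str]  # (start_s, end_s, label); end is exclusive


def to_windows(tasks: List[Task], session_len: float,
               window: float = 1.0, empty: str = "none") -> List[str]:
    """Label each fixed-width window with the task active at the window midpoint."""
    n_windows = int(session_len // window)
    labels = []
    for i in range(n_windows):
        mid = (i + 0.5) * window
        # First task whose interval contains the midpoint, else the placeholder.
        label = next((name for start, end, name in tasks if start <= mid < end), empty)
        labels.append(label)
    return labels


# Hypothetical records: the observers disagree by 1 s on when the task changed.
obs_a = [(0.0, 4.0, "charting"), (4.0, 10.0, "exam")]
obs_b = [(0.0, 5.0, "charting"), (5.0, 10.0, "exam")]

win_a = to_windows(obs_a, session_len=10.0)
win_b = to_windows(obs_b, session_len=10.0)

# Once aligned by time, the two label sequences can be compared window by window.
raw_agreement = sum(x == y for x, y in zip(win_a, win_b)) / len(win_a)
print(f"windows agreeing: {raw_agreement:.0%}")
```

The aligned label sequences have equal length and a shared time base, so they could be passed to a chance-corrected statistic such as Cohen's kappa for a single variable, or to iota when several nominal variables are coded per window.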

Although the two scores gave very similar results under certain conditions, iota was more robust to sparse-data problems. Our results suggest that iota applied to time windows substantially improves on previous methods of assessing IOA in time-and-motion studies, and that Cohen's kappa and other univariate measures should not be considered the gold standard. Rather, there is an urgent need for ongoing, explicit discussion of methodological issues and solutions to improve how data quality is assessed in time-and-motion studies, so that the conclusions drawn from these studies are sound.