Advances in Social Sciences Research Journal – Vol. 10, No. 7
Publication Date: July 25, 2023
DOI: 10.14738/assrj.107.14968
Iddrisu, S. A. (2023). Validation of Mentee-Teachers’ Assessment Tool within the Framework of Generalisability Theory at the Faculty of Education, University for Development Studies. Advances in Social Sciences Research Journal, 10(7), 103-128.
Validation of Mentee-Teachers’ Assessment Tool within the
Framework of Generalisability Theory at the Faculty of
Education, University for Development Studies
Simon Alhassan Iddrisu
University for Development Studies,
P. O. Box TL 1350, Tamale, Northern Region, Ghana
ABSTRACT
Practitioners in assessment and other researchers have over the years expressed
dissatisfaction with the lack of consistency in scores obtained from the use of multiple
measurement instruments. Scores derived from these largely inconsistent and unreliable
procedures are nonetheless relied upon by decision makers in making very important
decisions in education, health and other related fields. The purpose of this study,
therefore, was to apply Generalisability (G) theory procedures to validate the mentee
assessment tool used at the Faculty of Education, University for Development Studies. The
G study involved estimating the generalisability (reliability-like) coefficients of
mentees’ assessment scores and determining the level of acceptability (validity) of
these coefficients. A nested design was used because different sets of raters
assessed different student-mentees on different occasions in the field; the
relationship among these variables (facets) of students, raters and occasions
appropriately mirrored a nested relationship. Data obtained by raters on 300
students in the 2018/2019 off-campus teaching practice were entered into EDUG
software for analysis. The study found that the rater and student facets
accounted for the largest measurement errors in mentees’ observed scores, with an
estimated G coefficient of 0.62 (62%), representing a moderate positive relationship.
Based on these findings, the study concluded that the quality of mentees’ observed
scores could be improved for either relative or absolute decisions by varying the
number of levels of both raters and occasions. To achieve acceptable G coefficient
values of 0.83 and above, it is recommended that decision makers employ a model that
uses four raters per occasion over three occasions of assessment.
Keywords: Object of measurement, Relative decision, Absolute decision, Universe of
generalisation, Universe of admissible observations, Composite facet, Generalisability
study, Decision (D) study, Optimisation
INTRODUCTION
Educational assessors and researchers alike increasingly express concern about the reliability
and validity of scores produced from multiple measurement procedures such as tests, rating
scales, surveys and other forms of observation (Alkharusi, 2012). This is because scores
generated through the use of any measurement procedure in educational and psychometric
assessment often form the basis for very important decisions (Kolen & Brennan, 1995, 2014;
Hughes & Garrett, 2018). Kolen (2014) identified three levels of decision-making based
on assessment scores: the individual, institutional, and public policy levels. Individual-level
decisions based on results may involve a student opting to attend a certain tertiary or
non-tertiary institution, or electing to pursue a certain programme of study (Fendler, 2006;
Kolen & Brennan, 2014). Institutional-level decisions likewise rely on previous assessment
records, either to certify professionals or to admit applicants into tertiary programmes in
relevant institutions. Public policy-level decisions address general problems such as improving
the quality of, and access to, education in the nation for all to benefit from. Shavelson and
Webb (2004) submitted that the usefulness of any assessment score depends largely on the
extent to which we can generalise, with accuracy and precision, to a wider set of situations.
Allen and Yen (2011) also reckoned that assessment results generally have multiple purposes
and applications, such as in the selection of new employees, applicants or clients for
varied reasons. Yukawa, Gansky, O’Sullivan, Teherani and Fieldman (2020) maintained that in
the training of budding professionals in the fields of education, health, law, agriculture,
business and the like, assessment remains integral to the process, with relevant rating
scales or tools administered periodically in the conduct of these assessments.
Atilgan (2019) likewise indicated that the choice of an assessment tool in education depends
on the attribute to be measured. Essay-type instruments and tailored rubrics are among
several tools reviewed in the literature for assessing the writing skills and other
competencies of trainees (Atilgan, 2019; Atilgan, Kan & Aydin, 2017; Turgut & Baykul, 2010).
Graham, Harris, and Herbert (2011), for instance, used an essay-type writing rubric in
assessing students’ writing skills at the primary level. Fleming, House, Hanson, Garbutt,
Kroenke, Abedin and Rubio (2013) developed the Mentoring Competency Assessment (MCA) tool,
which they used in assessing the skills of mentors in clinical and translational science.
Estimates of the reliability and validity of scores obtained from the MCA tool showed a
high, positive relationship among the competencies examined.
Educational settings, schools and other similar institutions are arguably the largest
consumers of data emanating from multiple testing and other assessment procedures (Miller et
al., 2011). A major challenge associated with measurement in both the social sciences and
education is therefore the inconsistency (unreliability) of its measurements (Sirec, 2017;
Revelle, 2016; Brennan, 2005). When the same characteristic is measured on two different
occasions, the results obtained often differ (Steyer, 1999; Revelle, 2016). Steyer et al.
(1999) also intimated that irrespective of the measures an institution or body may put in
place to assure the sanctity of scores produced from measurement processes, many potential
sources of error persist and must be removed.
Many studies in the literature have used G theory to investigate the reliability of rating
scales and of the scores obtained from such ratings. Kim, Schatschneider, Wanzek, Gatlin and
Otaiba (2017) examined raters and tasks as facets of interest contributing to measurement
error in a generalisability study and reported that the rater facet was a major
contributor to measurement error. Sudweeks, Reeve and Bradshaw (2005) similarly estimated
the individual contributions to total variance in a G study with raters and occasions as the
variables (facets) of interest. They reported, however, that a rater’s years of teaching
experience contributed more to measurement error than the rater factor itself. This report on rater contribution
to measurement error contradicted that of Kim et al. (2017), who reported a substantial
contribution of the rater factor to measurement error.
Researchers such as Kan (2007), Graham, Hebert, Sandbank and Harris (2016), Bouwer, Beguin,
Sanders, and van den Bergh (2015), and Gebril (2009) variously conducted G studies on
guidance-tool rating scales, with the number of essay samples, types of essays, and types of
tasks set as factors of interest. Atilgan (2019) examined the reliability of essay rating
rubrics within a G theory framework, while Lin and Xiao (2018) investigated rater reliability
using holistic and analytic scoring keys within G theory procedures. Across these different
studies, the ultimate goal was to quantify the contribution of the individual facets and
their composites to total variance.
G theory was chosen over Classical Test Theory (CTT) and Item Response Theory (IRT) because of
its superiority in quantifying multiple sources of error in a single study (Brennan, 2005).
Whereas Classical Test Theory focuses on the measurement of reliability in order to
differentiate among individuals (Cardinet et al., 2010), G theory in addition enables the user
to evaluate the quality of measurements, not just among individuals but also among objects
(Cardinet, Johnson, & Pini, 2010). Again, while in CTT the coefficient values determined serve
as global indicators of the quality of measurement, G theory not only calculates the
coefficients but also provides information on the relative contributions and importance of the
different sources of measurement error. Through this unique function, G theory permits the
user to adjust the factors in a measurement procedure to improve measurements.
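To make this decomposition concrete, consider, for illustration only, a fully crossed design in which every student (s) is scored by every rater (r) on every occasion (o); the nested design used in the present study partitions the variance differently, but the logic is the same. The observed-score variance splits into components for the object of measurement, the facets, and their interactions, and the relative G coefficient compares universe-score variance with the error terms that affect relative decisions:

\sigma^2(X_{sro}) = \sigma^2_s + \sigma^2_r + \sigma^2_o + \sigma^2_{sr} + \sigma^2_{so} + \sigma^2_{ro} + \sigma^2_{sro,e}

E\rho^2 = \frac{\sigma^2_s}{\sigma^2_s + \sigma^2_{sr}/n_r + \sigma^2_{so}/n_o + \sigma^2_{sro,e}/(n_r n_o)}

For absolute decisions, the main effects of raters and occasions (and their interaction) are added to the error term as well, yielding the dependability coefficient \Phi, which can never exceed E\rho^2.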
Generalisability theory is applied, in practice, at two levels, namely the G study and the
decision (D) study levels (Heitman, Kovaleski & Pugh, 2009). Whereas the G study enables the
estimation of variance components and reliability coefficients, the D study enables the
investigator to determine the optimal number of levels of each facet and, possibly, to
‘positively impact interrater reliability’ for decision making (Moskal & Leydens, 2000, p. 28).
It allows the user to employ alternative numbers of levels of the variables involved so as to
improve the quality of measurements.
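As a minimal sketch of how a D study explores alternative designs, the following Python snippet projects relative G coefficients for varying numbers of raters and occasions, using the crossed-design formula given above. The variance components are hypothetical values chosen purely for illustration; they are not the EDUG estimates reported in this study.

    # D-study sketch: project relative G coefficients for alternative designs.
    # The variance components below are HYPOTHETICAL illustration values,
    # not the EDUG estimates obtained in this study.
    var_s = 0.62    # universe-score variance (students, the object of measurement)
    var_sr = 0.20   # student-by-rater interaction
    var_so = 0.08   # student-by-occasion interaction
    var_sro = 0.10  # residual (s x r x o interaction confounded with error)

    def relative_g(n_r: int, n_o: int) -> float:
        """Projected relative G coefficient with n_r raters and n_o occasions."""
        rel_error = var_sr / n_r + var_so / n_o + var_sro / (n_r * n_o)
        return var_s / (var_s + rel_error)

    # Tabulate projections, as a D study does when searching for a design
    # that reaches an acceptable threshold (e.g., G >= 0.83).
    for n_r in (1, 2, 3, 4):
        for n_o in (1, 2, 3):
            print(f"raters={n_r}, occasions={n_o}: G = {relative_g(n_r, n_o):.2f}")

With these assumed components, designs in the three-to-four-rater, two-to-three-occasion range push the projection past the 0.83 threshold; the study’s actual recommendation of four raters per occasion over three occasions rests on its own estimated components.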
Statement of the Problem
The Faculty of Education, University for Development Studies, trains professional teachers for
the various levels of education, in line with the national aim of producing quality teachers.
The Faculty continues to introduce new programmes and to create additional academic
departments. Like many curricula used for the training of professionals, the curriculum for
the training of professional teachers has two main components, namely the content and the
pedagogy (practical) aspects. Pedagogical training equips students with the relevant
professional skills and attitudes they require to teach proficiently in classrooms at all
levels of education.
The practical components of the training, which are implemented as school observation,
on-campus (peer) teaching, and off-campus teaching practice, are often assessed using a tool.
This assessment instrument has been changed at least twice over the years (since the
2012/2013 academic year) on account of the grossly unsatisfactory grades the Professional
Education Practice Unit (PEPU) receives on behalf of the Faculty from raters assigned to
evaluate mentees.