Advances in Social Sciences Research Journal – Vol. 10, No. 7
Publication Date: July 25, 2023
DOI: 10.14738/assrj.107.14968
Iddrisu, S. A. (2023). Validation of Mentee-Teachers’ Assessment Tool within the Framework of Generalisability Theory at the Faculty of Education, University for Development Studies. Advances in Social Sciences Research Journal, 10(7), 103-128.
Validation of Mentee-Teachers’ Assessment Tool within the
Framework of Generalisability Theory at the Faculty of
Education, University for Development Studies
Simon Alhassan Iddrisu
University for Development Studies,
P. O. Box TL 1350, Tamale, Northern Region, Ghana
ABSTRACT
Practitioners in assessment and other researchers have over the years expressed
dissatisfaction with the lack of consistency in scores obtained from the use of multiple
measurement instruments. Scores derived from these largely inconsistent and unreliable
procedures are nonetheless relied upon by decision makers in making very important
decisions in education, health and other related fields. The purpose of this study,
therefore, was to apply Generalisability (G) theory procedures to validate the mentee
assessment tool used at the Faculty of Education, University for Development Studies. The
G study involved estimating the generalisability (reliability-like) coefficients of
mentees’ assessment scores and determining the level of acceptability (validity) of
these coefficients. A nested design was used because different sets of raters
assessed different student-mentees on different occasions in the field; the
relationship among these variables (facets) of students, raters and occasions
appropriately mirrored a nested relationship. Data obtained by raters on 300
students in the 2018/2019 off-campus teaching practice were entered into EDUG
software for analysis. The study found that the rater and student facets
accounted for the largest measurement errors in mentees’ observed scores, with an
estimated G coefficient of 0.62 (62%), representing a moderate positive relationship.
Based on these findings, the study concluded that the quality of mentees’ observed
scores could be improved for either relative or absolute decisions by varying the
number of levels of both raters and occasions. To achieve acceptable G coefficient
values of 0.83 and above, it is recommended that decision makers employ a model that
uses four raters per occasion over three occasions of assessment.
Keywords: Object of measurement, Relative decision, Absolute decision, Universe of
generalisation, Universe of admissible observations, Composite facet, Generalisability
study, Decision (D) study, Optimisation
INTRODUCTION
Educational assessors and researchers alike increasingly express concern about the reliability
and validity of scores produced from multiple measurement procedures such as tests, rating
scales, surveys and other forms of observation (Alkharusi, 2012). This is because scores
generated through the use of any measurement procedure in educational and psychometric
assessment often form the basis for very important decisions (Kolen & Brennan, 1995, 2014;
Hughes & Garrett, 2018). Kolen (2014) identified three levels of decision-making based
on assessment scores: the individual, institutional, and public policy levels. Individual-level
decisions based on results may involve a student opting to attend a certain tertiary or
non-tertiary institution, or electing to pursue a certain programme of study (Fendler, 2006;
Kolen & Brennan, 2014). Institutional-level decisions likewise rely on previous assessment
records, either to certify professionals or to admit applicants into tertiary programmes in
relevant institutions. Public policy-level decisions address general problems such as improving
the quality of, and access to, education in the nation for all to benefit from. Shavelson and
Webb (2004) submitted that the usefulness of any assessment score depends largely on the
extent to which we can generalise, with accuracy and precision, to a wider set of situations.
Allen and Yen (2011) also reckoned that assessment results generally have multiple purposes
and applications, such as in the selection of new employees, applicants or clients for
varied reasons. Yukawa, Gansky, O’Sullivan, Teherani and Fieldman (2020) maintained that in
the training of budding professionals in the fields of education, health, law, agriculture,
business and the like, assessment remains integral to the process, with relevant rating
scales or tools administered periodically in the conduct of these assessments.
Atilgan (2019) likewise indicated that the choice of an assessment tool in education depends
on the attribute to be measured. Essay-type instruments and tailored rubrics are among
several tools reviewed in the literature for assessing the writing skills and other
competencies of trainees (Atilgan, 2019; Atilgan, Kan & Aydin, 2017; Turgut & Baykul, 2010).
Graham, Harris, and Herbert (2011), for instance, used an essay-type writing rubric in
assessing students’ writing skills at the primary level. Fleming, House, Hanson, Garbutt,
Kroenke, Abedin and Rubio (2013) developed the Mentoring Competency Assessment (MCA) tool,
which they used in assessing the skills of mentors in clinical and translational science.
Estimates of the reliability and validity of scores obtained from the MCA tool showed a
high, positive relationship among the competencies examined.
Educational settings, schools and other similar institutions are arguably the largest
consumers of data emanating from multiple testing and other assessment procedures (Miller et
al., 2011). A major challenge associated with measurement in both the social sciences and
education is therefore the inconsistency (unreliability) of its measurements (Sirec, 2017;
Revelle, 2016; Brennan, 2005). When the same characteristic is measured on two different
occasions, the results obtained often differ (Steyer, 1999; Revelle, 2016). Steyer et al.
(1999) also intimated that irrespective of the measures an institution or body may put in
place to assure the sanctity of scores produced from measurement processes, many potential
sources of error persist and must be removed.
Many studies in the literature have used G theory to investigate the reliability of rating
scales and of the scores obtained from such ratings. Kim, Schatschneider, Wanzek, Gatlin and
Otaiba (2017) examined raters and tasks as facets of interest contributing to measurement
error in a generalisability study and reported that the rater facet was a major
contributor to measurement error. Sudweeks, Reeve and Bradshaw (2005) similarly estimated
the individual contributions to total variance in a G study with raters and occasions as the
variables (facets) of interest. They reported, however, that a rater’s years of teaching
experience contributed more to measurement error than the rater factor itself. This report on rater contribution
to measurement error contradicted that of Kim et al. (2017), who reported a substantial
contribution of the rater factor to measurement error.
Researchers such as Kan (2007), Graham, Hebert, Sandbank and Harris (2016), Bouwer, Beguin,
Sanders, and van den Bergh (2015), and Gebril (2009) variously conducted G studies on
guidance-tool rating scales, with the number of essay samples, types of essays, and types of
tasks set as factors of interest. Atilgan (2019) examined the reliability of essay rating
rubrics within a G theory framework, while Lin and Xiao (2018) investigated rater reliability
using holistic and analytic scoring keys within G theory procedures. Across these different
studies, the ultimate goal was to quantify the contribution of the individual facets and
their composites to total variance.
G theory was chosen over Classical Test Theory (CTT) and Item Response Theory (IRT) because of
its superiority in quantifying multiple sources of error in a single study (Brennan, 2005).
Whereas Classical Test Theory focuses on the measurement of reliability in order to
differentiate among individuals (Cardinet et al., 2010), G theory in addition enables the user
to evaluate the quality of measurements, not just among individuals but also among objects
(Cardinet, Johnson, & Pini, 2010). Again, while in CTT the coefficient values determined serve
as global indicators of the quality of measurement, G theory not only calculates the
coefficients but also provides information on the relative contributions and importance of the
different sources of measurement error. Through this unique function, G theory permits the
user to adjust the factors in a measurement procedure to improve measurements.
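To make this decomposition concrete, consider, for illustration only, a fully crossed design in which every student (s) is scored by every rater (r) on every occasion (o); the nested design used in the present study partitions the variance differently, but the logic is the same. The observed-score variance splits into components for the object of measurement, the facets, and their interactions, and the relative G coefficient compares universe-score variance with the error terms that affect relative decisions:

\sigma^2(X_{sro}) = \sigma^2_s + \sigma^2_r + \sigma^2_o + \sigma^2_{sr} + \sigma^2_{so} + \sigma^2_{ro} + \sigma^2_{sro,e}

E\rho^2 = \frac{\sigma^2_s}{\sigma^2_s + \sigma^2_{sr}/n_r + \sigma^2_{so}/n_o + \sigma^2_{sro,e}/(n_r n_o)}

For absolute decisions, the main effects of raters and occasions (and their interaction) are added to the error term as well, yielding the dependability coefficient \Phi, which can never exceed E\rho^2.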
Generalisability theory is applied, in practice, at two levels, namely the G study and the
decision (D) study levels (Heitman, Kovaleski & Pugh, 2009). Whereas the G study enables the
estimation of variance components and reliability coefficients, the D study enables the
investigator to determine the optimal number of levels of each facet and, possibly, to
‘positively impact interrater reliability’ for decision making (Moskal & Leydens, 2000, p. 28).
It allows the user to employ alternative numbers of levels of the variables involved so as to
improve the quality of measurements.
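As a minimal sketch of how a D study explores alternative designs, the following Python snippet projects relative G coefficients for varying numbers of raters and occasions, using the crossed-design formula given above. The variance components are hypothetical values chosen purely for illustration; they are not the EDUG estimates reported in this study.

    # D-study sketch: project relative G coefficients for alternative designs.
    # The variance components below are HYPOTHETICAL illustration values,
    # not the EDUG estimates obtained in this study.
    var_s = 0.62    # universe-score variance (students, the object of measurement)
    var_sr = 0.20   # student-by-rater interaction
    var_so = 0.08   # student-by-occasion interaction
    var_sro = 0.10  # residual (s x r x o interaction confounded with error)

    def relative_g(n_r: int, n_o: int) -> float:
        """Projected relative G coefficient with n_r raters and n_o occasions."""
        rel_error = var_sr / n_r + var_so / n_o + var_sro / (n_r * n_o)
        return var_s / (var_s + rel_error)

    # Tabulate projections, as a D study does when searching for a design
    # that reaches an acceptable threshold (e.g., G >= 0.83).
    for n_r in (1, 2, 3, 4):
        for n_o in (1, 2, 3):
            print(f"raters={n_r}, occasions={n_o}: G = {relative_g(n_r, n_o):.2f}")

With these assumed components, designs in the three-to-four-rater, two-to-three-occasion range push the projection past the 0.83 threshold; the study’s actual recommendation of four raters per occasion over three occasions rests on its own estimated components.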
Statement of the Problem
The Faculty of Education, University for Development Studies, trains professional teachers for
the various levels of education, in line with the national aim of producing quality teachers.
The Faculty continues to introduce new programmes and to create additional academic
departments. Like many curricula used for the training of professionals, the curriculum for
the training of professional teachers has two main components, namely the content and the
pedagogy (practical) aspects. Pedagogical training equips students with the relevant
professional skills and attitudes they require to teach proficiently in classrooms at all
levels of education.
The practical components of the training, which are implemented as school observation,
on-campus (peer) teaching, and off-campus teaching practice, are often assessed using a tool.
This assessment instrument has been changed at least twice over the years (since the
2012/2013 academic year) on account of the grossly unsatisfactory grades the Professional
Education Practice Unit (PEPU) receives on behalf of the Faculty from raters assigned to
evaluate mentees.