Part 3: Critique and Evaluation of Psychoeducational Assessment Data– 15 points
Based on our lens of the needs of bilingual students, discuss the reliability, validity, authenticity, and washback of the data. Does the data communicate a realistic picture of the student’s abilities and needs? Discuss strong points and weak points of the assessment tools and data results, and provide recommendations for interpretation and improvement in relation to multicultural language learners. Also discuss data uses from raciolinguistic perspectives and their impact on equity for your bi/multilingual learner.
Guide for Part 3:
Critique and Evaluation of Psychoeducational Assessment Data
3.1. Psychometrics: 3 points (1 paragraph each)
1. Describe the test and the use of its results as it relates to our theories on:
Describe any factors that have negatively impacted any and all of the above-mentioned aspects of the assessments and its data.
3.2. Impact on Language Learners: 3 points (1 paragraph)
Does the data communicate a realistic picture of the student’s abilities and needs? Why or why not? Discuss strong points and weak points of the assessment tools and data results as it relates to bilingual and multicultural students.
3.3 Recommendations: 3 points (1-2 paragraphs)
a. Provide recommendations for improving the test to increase equity and for interpreting the results based on our theories and learning about multicultural language learners (e.g., items, assumptions, psychometrics).
b. Provide recommendations for the administration of the test to other multicultural and multilingual students similar to the student you are evaluating. What could have been done to make it more equitable for this student?
Additional points for Part 3: An additional 1 point is reserved for correct grammar and readability.
According to Mahoney (2017), “reliability is measured as a coefficient (a number between 0 and 1), which informs us empirically of how much contamination (or error) is part of the overall test score” (p. 130). It is also important to keep in mind that no test is 100 percent reliable (Mahoney, 2017). According to the Woodcock-Johnson IV Tests of Achievement Technical Manual (2014), “The standard error of the reliability coefficient provides a confidence band within which the true reliability coefficient would be expected to fall. Table 4-2 reports the 68% confidence band for several typical reliabilities and sample sizes” (p. 90). The reliability values listed in Table 4-2 of the manual are .85, .90, and .95, averaged for diverse testing samples (Woodcock-Johnson IV Technical Manual, 2014, p. 91). These values meet the threshold given by Mahoney (2017), who states that a test’s reliability coefficient needs to be .85 or higher for the test to be considered reliable.
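To make the idea of “contamination (or error)” concrete, the short sketch below converts a reliability coefficient into a standard error of measurement (SEM) using the classical test theory formula SEM = SD × √(1 − reliability). This is only an illustration under stated assumptions: it assumes a standard score scale with a standard deviation of 15, the observed score of 85 is hypothetical, and the function names are invented for this example. It also computes the error band around an individual score, which is a different quantity from the standard error of the reliability coefficient described in the manual.

```python
import math


def standard_error_of_measurement(sd: float, reliability: float) -> float:
    """Classical test theory: SEM = SD * sqrt(1 - reliability)."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must be between 0 and 1")
    return sd * math.sqrt(1.0 - reliability)


def score_band(observed: float, sd: float, reliability: float, z: float = 1.0):
    """Band of +/- z SEMs around an observed score (z = 1 is roughly a 68% band)."""
    sem = standard_error_of_measurement(sd, reliability)
    return observed - z * sem, observed + z * sem


if __name__ == "__main__":
    ASSUMED_SD = 15.0        # assumption: standard score scale (mean 100, SD 15)
    HYPOTHETICAL_SCORE = 85  # assumption: illustrative observed score only
    for r in (0.85, 0.90, 0.95):
        sem = standard_error_of_measurement(ASSUMED_SD, r)
        low, high = score_band(HYPOTHETICAL_SCORE, ASSUMED_SD, r)
        print(f"reliability {r:.2f}: SEM = {sem:.2f}, 68% band = {low:.1f} to {high:.1f}")
```

Under these assumptions, moving from a reliability of .85 to .95 narrows the error band around a single score from roughly six points to roughly three, which is one reason higher coefficients are preferred when scores inform individual decisions.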
According to the WISC-V Technical and Interpretive Manual (2015), the reliability of the WISC-V was determined using three methods: internal consistency, test-retest (stability), and interscorer agreement. Average coefficients across the 11 age groups of composite scores ranged from .88 (Processing Speed Index) to .96 (FSIQ and General Ability Index). The reliability estimates for the complementary subtests, process, and composite scores are outlined in Table 4.2 of the WISC-V Technical and Interpretive Manual. Based on tests of reliability, the average coefficients across the 11 age groups ranged from .90 to .94 for composite scores. Furthermore, the average coefficients for subtests and process scores were .82 to .89. Internal consistency reliability coefficients ≥ .90 have been recommended for making decisions related to diagnosis, as well as decisions about tailored instruction and interventions for children.
Mahoney (2017) states that “each use of the test must be considered for validity on a case-by-case basis” (p. 43). The Woodcock-Johnson IV Tests of Achievement Technical Manual (2014) states, “in an independent review, Braden and Niebling (2012) judged the quality of the WJ III content validity evidence, upon which the WJ IV continues to build, as near the strong end of their rating scale” (p. 119). The rating scale ranged from 0 to 5, and the test scored a 4. The content of the test covers a wide range of core curricular areas. According to the Woodcock-Johnson IV Tests of Achievement Technical Manual (2014), “the representativeness of the WJ IV test content, process, and construct validity was addressed through specification of a test revision blueprint informed by contemporary CHC theory and cognitive neuroscience research” (p. 219). The technical manual provides detailed graphs and data that support the validity of the test and what it intends to measure.
According to Canivez & Watkins (2016), the evidence for WISC-V validity was structured around standards that reflect Messick’s (1995) unified validity theory, which “prescribes evidence based on test content, response processes, internal structure, relations with other variables, and consequences of testing.” Furthermore, Canivez & Watkins (2016) state that for the WISC-V, test content was derived via a review of the literature and item and subtest review by “experts and advisory panel members (specialists in child psychology, neuropsychology, and/or learning disabilities).” A standardization study was conducted using a nationally representative sample to develop norms to support WISC-V score interpretation. Participants included 2,200 children ages 6-16, each of whom was closely matched to 2012 US census data on race/ethnicity, parent education level, and geographic region, and the sample was balanced with respect to gender. The WISC-V results showed that “composite and subtest scores demonstrate high levels of internal consistency … both primary index scores and subtest scores demonstrate moderate to high consistency over testing occasion, … [and] scoring of the WISC-V is highly consistent across raters” (Efficacy Research Report, 2018).
According to the WISC-V Technical and Interpretive Manual, the various subtests within the five primary indexes are moderately to highly correlated with one another, which suggests a high probability of construct and convergent validity. The WISC-V was also tested for validity with students in special populations, such as intellectually gifted, intellectual disability-mild severity, intellectual disability-moderate severity, borderline intellectual functioning, specific learning disorder-reading, specific learning disorder-reading and written expression, specific learning disorder-mathematics, attention-deficit/hyperactivity disorder, disruptive behavior, traumatic brain injury, English language learners, Autism Spectrum Disorder with language impairment, and Autism Spectrum Disorder without language impairment (Canivez & Watkins, 2016). Across that exhaustive list of specific student needs, the “evidence from these studies suggests that the WISC-V subtests are internally consistent for a wide variety of clinical groups, and their consistency is comparable to that for non-clinical test-takers” (Efficacy Research Report, 2018).
Overall, it was determined that the WISC-V is sensitive to the performance differences of learners in varying reference groups. Furthermore, the identified patterns of score differences were consistent within each diagnostic category, thus providing support for the diagnostic utility of the WISC-V in identifying children with learning disabilities, neurodevelopmental disorders, or intellectual giftedness.
The Woodcock-Johnson IV Tests of Achievement is a criterion-referenced test in that it “gathers information about student progress or achievement in relation to a specified criteria” (Gottlieb, 2016, p. 202). This type of testing allows teachers to understand the language abilities of a student and to develop accommodations appropriate for that child. As a result, the test results can be used in an authentic manner to help students progress in their language skills.
“[a]ll items from the new WJ IV tests underwent extensive pilot testing. After each test item pool was developed, project staff first administered the items to a restricted sample to try out the item format and verify that the item instructions were clear. After any necessary modifications were made, each test was administered to a convenience sample of approximately 100 to 200 examinees from a wide range of ages and abilities. The purpose of this round of pilot testing was to obtain preliminary item difficulty estimates and other item statistics to assess whether further item development or modifications were needed prior to the tryout study”
In addition, “[a] primary goal for the new tests and items was to capture the important aspects of the underlying constructs and cover a wide range of difficulty (construct-representation), while avoiding the measurement of other, confounding abilities (construct-irrelevant variance)” (Woodcock-Johnson IV Technical Manual, 2014, p. 43). Mahoney (2017) states that “[f]airness in testing is closely related to bias” (p. 108). The reviewers of the test looked at the content and format of the questions in order to evaluate any “potential bias or sensitivity issues for women, individuals with certain disabilities, and cultural or linguistic minorities” (Woodcock-Johnson IV Technical Manual, 2014, p. 43). The technical manual includes examples of questions for a reviewer to consider, such as whether an item contains language that may not be familiar to certain groups or whether an item assumes familiarity with concepts or relationships that may not be familiar to all groups (Woodcock-Johnson IV Technical Manual, 2014, p. 44). If an item was considered potentially biased, it was removed from the pool.
Although these tests seem able to adequately capture a student’s learning needs, they can also cause students to underperform because of limited familiarity with the person conducting the test. A student who takes the exam with someone they have no contact with outside of that session may feel shy or embarrassed and be unable to perform adequately. It is important that these tests be administered by people the student is familiar with, such as a teacher whom the student sees regularly.
Many factors were considered when administering the exam to specific clinical groups. According to the Woodcock-Johnson IV Technical Manual (2014), “[t]he comprehensiveness of the WJ IV battery made it impossible to administer all key tests and clusters to all clinical groups. To reduce examinee response burden, which is a significant concern in clinical groups, a diagnostic group-targeted approach to test selection was used” (p. 210). The clinical groups included were: “gifted, intellectual disabilities (ID)/mental retardation (MR), learning disabilities (LD; reading, math, and writing), language delay, attention deficit/hyperactivity disorder (ADHD), head injury, and autism spectrum disorders (ASD)” (Woodcock-Johnson IV Technical Manual, 2014, p. 209). This differentiation across groups allows for positive washback in that it can provide insight into how these different groups test and what accommodations can be made for their identified learning needs. Also, as stated earlier, because the test is criterion-referenced, evaluators can assess a student on an individual basis rather than comparing them to their peers, which can decrease anxiety and judgment of the student.
However, because the WJ IV and WISC-V are administered during school hours, they may cause negative washback when students miss important instruction time in the classroom. Many students who need to be in class all day to grasp the material will miss important information during this time, which can cause them to feel frustrated and fall behind. Being pulled out of class can also cause students to feel singled out or embarrassed in front of their peers.
As mentioned earlier, research was conducted to assess the validity of the WISC-V across varying subsets of the population, including English Language Learners (ELLs). For ELLs, the sample was “50% female, 88% Hispanic, and 13% Asian. 50% of participants had parents with at least 12 years of education, with 6% reporting at least 16 years of parental education. 38% of participants were drawn from the West, 31% from the South, 19% from the Midwest, and 13% from the Northeast” (Wechsler & Kaplan, 2015). Results showed that ELLs “scored significantly lower than their matched control counterparts on the Verbal Comprehension and Working Memory indices, as well as the Full-Scale IQ” (Wechsler & Kaplan, 2015). However, “index scores containing subtests requiring minimal expressive language and reduced receptive language abilities showed no significant differences between groups” (Wechsler & Kaplan, 2015).
It is important to remember that the WISC-V is an instrument normed on children whose primary language is English, although these children may come from a variety of cultural backgrounds. It is the responsibility of the individual administering the test to determine whether the student being assessed is similar enough to those represented in the normative sample. This requires familiarity with the WISC-V, its psychometric properties, and its sample, as well as familiarity with the child. Culture plays an important role in an individual’s development and identity formation, and it also influences the types of experiences to which an individual is exposed. Socioeconomic status (SES) is another factor that plays a role in the development of cognitive skills, and oftentimes the differences seen on measures of cognitive ability could be attributed to socioeconomic status, exposure, and access rather than culture.
The data gathered when administering the WJ IV provides a wide range of results in different content areas, such as oral language, reading, and writing, which is useful for obtaining a holistic assessment of a student. The detailed review of reliability and validity included in the Technical Manual explains how the test was developed and what considerations were applied to the student population, including disabilities, sex, age, and several other criteria. The WJ IV was also developed using a large, nationally representative sample, which helped improve the overall structure and reliability of the test. This assessment is a useful starting point for assessing English Language Learners. Although some parts are available in Spanish, it is important to include other languages as well in order to make the test more inclusive of other language cultures.
Due to the length of the WJ IV, it is important to take into consideration the mental state of students as they move through the different tests. If a student is fatigued toward the end of the test, the results may not be as accurate. Juan seemed to become frustrated during certain parts of the exam because he lacked confidence in his responses, which could have affected his scores, although his scores were consistently low. It is important to check in with students, and if needed, they should be able to take a break and resume testing at an appropriate time. In addition, it is important for the test to be administered by people who are part of the school environment so that students feel comfortable speaking and answering questions, which can reduce the probability of a student underperforming.