Use of Automated Writing Evaluation (AWE) for placement tests: Can AWE scores serve as criteria for placing students into language courses?
Zhi Li, Hyejin Yang, Stephanie Link, Volker Hegelheimer
Iowa State University
October 5-6, 2012, University of Illinois, Urbana-Champaign
English Placement Test
Ø To place international students into appropriate ESL writing classes
Ø Practical needs
- Low-cost options
- Immediate scoring
- Improved time management
Example: Time Management
Fall 2012 @ ISU: 500+ essays to score; 1 human rater: 5 min/essay
How long will it take for raters to score all the essays?
18 raters: ~28 essays each = 2.3 hours
Each essay rated 2 to 3 times
TOTAL: 4.6 hours
What can the computer do?
What can the computer do?
Human rating: 5 minutes/essay vs. computer rating: 1 minute/essay
Essays per rater: 23
2.8 hours to rate 500 essays vs. 4.6 hours
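As a sanity check on these figures, here is a minimal back-of-the-envelope sketch in Python, assuming the workload from the previous slide (500 essays, 18 raters, 5 minutes per human rating, each essay double-rated):

```python
# Back-of-the-envelope check of the human rating-time figures above.
# Assumptions from the slides: 500 essays, 18 raters,
# 5 minutes per human rating, each essay rated twice.

ESSAYS = 500
RATERS = 18
MINUTES_PER_RATING = 5
RATINGS_PER_ESSAY = 2

essays_per_rater = ESSAYS / RATERS                     # ~28 essays each
hours_single_pass = essays_per_rater * MINUTES_PER_RATING / 60
hours_total = hours_single_pass * RATINGS_PER_ESSAY

print(f"{essays_per_rater:.0f} essays per rater")      # 28
print(f"{hours_single_pass:.1f} hours per pass")       # 2.3
print(f"{hours_total:.1f} hours with double rating")   # 4.6
```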
Motivation and Purpose
Ø Low-cost options
Ø Immediate scoring
Ø Improved time management
Ø Purpose: to investigate whether Criterion scores can be used to help make placement decisions in an ESL program
AWE Validation Studies
Ø High level of correspondence with human ratings
- e-rater: .73–.93 (correlation), 87–97% (exact agreement) (Attali & Burstein, 2006; Burstein, Chodorow, & Leacock, 2004)
- IntelliMetric: .50–.90 (correlation), 56–88% (exact agreement) (Elliot, 2003; Vantage Learning, 1998, 1999, 2000, 2001)
- Intelligent Essay Assessor (IEA): .81–.83 (correlation) (Landauer, Laham, & Foltz, 2003)
Computer Scoring for Placement
Ø Concerns (ACCUPLACER OnLine: WritePlacer Plus, scored by IntelliMetric)
- Impersonal; distorts the nature of writing (Herrington & Moran, 2006)
- Discriminates according to length, grammar, and mechanics (Jones, 2006)
- Weak correlations may be due to lack of formal training/calibration of human evaluators (James, 2006)
Computer Scoring for Placement
Ø Validity (ACCUPLACER OnLine: WritePlacer Plus, scored by IntelliMetric)
- "not that much worse... than placement by readers" (Herrington & Moran, 2006, p. 126)
- Useful with spot-checking and retesting (Jones, 2006)
- "a valid tool for assessing writing samples and placing students in composition courses" (James, 2006)
Research Questions
Ø RQ1. What is the relationship between Criterion output and EPT decisions?
- Holistic scores
- Trait feedback
Ø RQ2. To what extent can Criterion holistic scores distinguish between different levels of ESL writing classes?
Participants
Ø 135 international undergraduate students
Ø Fall semester 2012 at ISU

Discipline       Number of participants
Engineering      48
LAS              37
Business         33
Design           10
Human Science    5
Agriculture      2
Setting
Ø Paper-based English Placement Test (30 min.)
Ø Topic: "Modern convenience" from Criterion
- Topic category: college level, first year
- Topic mode: persuasive

"Modern conveniences such as fast food, automatic teller machines, and labor-saving appliances promise to make life easier. Do these products and services actually make our lives more convenient or do they simply create new problems? Explain your position with reasons and examples from your own experience, observations, or reading."
Rating Procedure
Ø Number of raters: 9 experienced, 9 new
Ø Rubric based on ACTFL Proficiency Guidelines: General Description, Organization, Grammar & Vocabulary, Functional, Mechanics, and Comprehensibility
Ø Placement based on agreement between two raters
- Third rating for papers with discrepant ratings
- Inter-rater reliability: 62% exact agreement
EPT Scoring Criteria

Advanced Mid (Pass)
ü able to meet a range of work and/or academic writing needs
ü able to narrate and describe with detail in all major time frames
ü cohesive devices in texts up to several paragraphs
ü good control of the most frequently used target-language syntactic structures and a range of general vocabulary

Advanced Low (101C/D)
ü able to meet basic work and/or academic writing needs
ü able to narrate and describe in major time frames
ü a limited number of cohesive devices
ü some redundancy and awkward repetition
ü some additional effort may be required in the reading of the text

Intermediate High (101B)
ü able to write compositions and simple summaries related to work and/or school experiences
ü inconsistent in the use of appropriate major time markers, resulting in a loss of clarity
ü vocabulary, grammar, and style essentially correspond to those of the spoken language
Curriculum
Ø ESL Writing Curriculum: Placement Decisions
- Engl 101B
- Engl 101C
- Pass / Engl 150
Materials
Ø Writing samples from EPT (stratified random sampling; verbatim transcription)

EPT Level   Two-rater samples   Three-rater samples   Mean word count
101B        30                  15                    259
101C        30                  15                    260
Pass        30                  15                    302
Data Collection
Ø Entering essays into Criterion
Ø Data extraction
- Holistic scores
- Trait feedback (error counts are normalized):
  - Grammar (subject-verb agreement, fragments, etc.)
  - Usage (wrong article, preposition errors, etc.)
  - Mechanics (spelling, missing commas, etc.)
  - Style (repetition of words, short sentences, etc.)
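The slide notes that error counts were normalized but not how; below is a minimal sketch of one plausible approach, normalizing each trait count to errors per 100 words (the field names are hypothetical):

```python
# Sketch of length-normalizing Criterion trait-error counts.
# Assumption: normalization to errors per 100 words; the slide does not
# specify the actual procedure. Field names are hypothetical.

def errors_per_100_words(error_count: int, word_count: int) -> float:
    """Convert a raw trait-error count to errors per 100 words."""
    return 100 * error_count / word_count

essay = {"word_count": 260, "grammar": 6, "usage": 4, "mechanics": 9, "style": 3}
normalized = {trait: errors_per_100_words(essay[trait], essay["word_count"])
              for trait in ("grammar", "usage", "mechanics", "style")}
print(normalized)  # e.g. {'grammar': 2.31, 'usage': 1.54, ...}
```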
Criterion Scoring Rubric

Score 4
- Slights some parts of the task
- Treats the topic simplistically or repetitively
- Is organized adequately, but you need to support your position more fully with discussion, reasons, or examples
- Shows that you can say what you mean, but you could use language more precisely or vigorously
- Demonstrates control in terms of grammar, usage, or sentence structure, but you may have some errors

Score 3
- Neglects or misinterprets important parts of the topic or task
- Lacks focus or is simplistic or confused in interpretation
- Is not organized or developed carefully from point to point
- Provides examples without explanation, or generalizations without completely supporting them
- Uses mostly simple sentences or language that does not serve your meaning
- Demonstrates errors in grammar, usage, or sentence structure
Data Analysis
Ø RQ1: Criterion output vs. human ratings
- Descriptive statistics
- Correlation
- Regression
Ø RQ2: Criterion output differences between EPT levels
- ANOVA
RQ1: Criterion output vs. Human ratings
Ø Distribution of Criterion scores over EPT levels
[Figure: bar chart of Criterion holistic scores (1-5) by EPT level: B (N=43), C (N=45), Pass (N=44)]
RQ1: Criterion output vs. Human ratings
Ø Correlation (Spearman's rho)

                   EPT levels           EPT levels          EPT levels            Criterion scores
                   (complete, N=132)    (two-rater, N=89)   (three-rater, N=43)   (N=132)
Criterion scores    0.39**               0.47**              0.22                  1
Word count          0.25**               0.31**              0.11                  0.69**
Total errors       -0.40**              -0.48**             -0.21                 -0.43**
Grammar            -0.25**              -0.33**             -0.07                 -0.36**
Usage              -0.21*               -0.20               -0.22                 -0.47**
Mechanics          -0.28**              -0.30**             -0.22                 -0.34**
Style              -0.30**              -0.35**             -0.20                 -0.26**
(Grammar, Usage, Mechanics, and Style = GUMS trait feedback)
* significant at .05; ** significant at .01
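A minimal sketch of how coefficients like these could be computed with SciPy; the data file and column names are hypothetical, with EPT levels assumed coded ordinally (101B=1, 101C=2, Pass=3):

```python
# Sketch of the Spearman correlation analysis between Criterion output
# and EPT placement levels. File and column names are hypothetical;
# EPT levels are assumed coded ordinally (101B=1, 101C=2, Pass=3).

import pandas as pd
from scipy.stats import spearmanr

df = pd.read_csv("ept_criterion.csv")  # hypothetical data file

measures = ["criterion_score", "word_count", "total_errors",
            "grammar", "usage", "mechanics", "style"]
for measure in measures:
    rho, p = spearmanr(df[measure], df["ept_level"])
    print(f"{measure:>15}: rho = {rho:+.2f}, p = {p:.3f}")
```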
RQ1: Criterion output vs. Human ratings
Ø Regression analysis (dependent variable: Criterion scores)

Model          Beta      t         p-value
Constant                 7.963     0.000
Word count      0.604    11.933    0.000
Total errors    0.339    2.101     0.038
Grammar        -0.287   -4.914     0.000
Usage          -0.341   -6.330     0.000
Mechanics      -0.229   -3.117     0.002
Style          -0.383   -2.678     0.008

R² = 0.727
RQ1: Criterion output vs. Human ratings
Ø Regression analysis (dependent variable: EPT levels)

Model          Beta (standardized)   t         p-value
Constant                             6.039     0.000
Word count      0.117                1.372     0.173
Total errors    0.384                1.379     0.170
Grammar        -0.219               -2.099     0.038
Usage          -0.186               -1.988     0.049
Mechanics      -0.305               -2.489     0.014
Style          -0.549               -2.252     0.026

R² = 0.208
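A minimal sketch of both regressions with statsmodels; z-scoring all variables first yields standardized (Beta) coefficients like those reported in the two tables above. File and column names are hypothetical:

```python
# Sketch of the regression analyses. Z-scoring all variables makes the
# OLS coefficients standardized Betas, comparable to the tables above.
# File and column names are hypothetical.

import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("ept_criterion.csv")  # hypothetical data file
predictors = ["word_count", "total_errors", "grammar",
              "usage", "mechanics", "style"]

cols = predictors + ["criterion_score", "ept_level"]
z = (df[cols] - df[cols].mean()) / df[cols].std()  # standardize

X = sm.add_constant(z[predictors])
model = sm.OLS(z["criterion_score"], X).fit()  # use z["ept_level"] for the EPT model
print(model.summary())  # Betas, t values, p values, R-squared
```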
RQ2: Differences b/w EPT levels
Ø Post-hoc multiple comparisons in one-way ANOVA (N=135)

Measure            B-C       C-Pass      B-Pass
Criterion scores   -0.139    -0.580*     -0.719*
Word count         -1.489    -41.978*    -43.467*
Total errors        3.02*     1.9         4.92*
Grammar             0.379     0.38        0.761*
Usage              -0.088     0.474*      0.562*
Mechanics          -0.013     1.193*      1.180*
Style               2.740*    0.348       3.088*
* The mean difference is significant at the .05 level
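A minimal sketch of the ANOVA with pairwise post-hoc comparisons; the slides do not name the post-hoc procedure, so Tukey's HSD is shown here as one common choice (file and column names hypothetical):

```python
# Sketch of the one-way ANOVA and post-hoc pairwise comparisons across
# EPT levels. The slides do not name the post-hoc procedure; Tukey's HSD
# is shown as one common choice. File and column names are hypothetical.

import pandas as pd
from scipy.stats import f_oneway
from statsmodels.stats.multicomp import pairwise_tukeyhsd

df = pd.read_csv("ept_criterion.csv")  # hypothetical data file

# Overall F test for Criterion scores across the three EPT levels
groups = [g["criterion_score"].values for _, g in df.groupby("ept_level")]
print(f_oneway(*groups))

# Pairwise mean differences (B-C, C-Pass, B-Pass) at alpha = .05
print(pairwise_tukeyhsd(df["criterion_score"], df["ept_level"], alpha=0.05))
```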
Discussion
Ø RQ1: relatively low correlations, possibly due to:
- Different grading rubrics (EPT vs. Criterion)
- Essay lengths
- Essay prompt level on Criterion (college first year)
Ø RQ2: Criterion distinguished Pass from 101B/101C
- Can distinguish, because of the wide coverage of error categories
- Cannot distinguish, because of style measures (repetition and spelling)
Implications & Future Studies Ø Potential use for distinguishing PASS from Non-Pass confirming placement through diagnostic test Ø Future studies on The effects of different essay topic categories and mode The predictive evidence of Criterion output Paper-based writing vs. computer-based writing
References
Ø Attali, Y., & Burstein, J. (2006). Automated essay scoring with e-rater V.2. The Journal of Technology, Learning and Assessment, 4(3). Retrieved from http://www.jtla.org
Ø Fulcher, G. (1997). An English language placement test: Issues in reliability and validity. Language Testing, 14(2), 113–139.
Ø James, C. L. (2006). Validating a computerized scoring system for assessing writing and placing students in composition courses. Assessing Writing, 11, 167–178.
Ø Ware, P. D., & Warschauer, M. (2006). Electronic feedback and second language writing. In K. Hyland & F. Hyland (Eds.), Feedback in second language writing: Contexts and issues (pp. 105–122). Cambridge: Cambridge University Press.
Thank you! Questions and Comments?
Acknowledgements: Yoo Ree Chung

Zhi Li: zhili@iastate.edu
Hyejin Yang: hjyang@iastate.edu
Stephanie Link: smcross@iastate.edu
Volker Hegelheimer: volkerh@iastate.edu
Website: volkerh.public.iastate.edu/awe