A Computer-Aided Analysis of Word Form Errors in College English Writing – A Corpus-based Study

Based on contrastive analysis and computer-aided error analysis, this paper uses qualitative and quantitative methods to explore word form errors committed by Chinese non-English majors in their writing, as collected in the Chinese Learner English Corpus (CLEC). The aim is to offer English learners methods to help improve their English writing proficiency and to yield some suggestions for English language teaching. The main findings are as follows: (1) word form errors account for 29.42% of the total language errors; (2) there is a negative correlation between word form errors and writing quality; (3) there is a significant difference in word form errors committed by college learners of different writing ability. Finally, this paper analyzes the reasons for word form errors and puts forward some pedagogical suggestions.


I. INTRODUCTION
During the process of acquiring a foreign language, learners unavoidably make different kinds of errors. Corder noted that errors can be significant in three ways: they provide the teacher with information about how much the learner has learnt; they provide the researcher with evidence of how language is learnt; and they serve as devices by which the learner discovers the rules of the target language [1]. Mistakes, misjudgments, miscalculations, and erroneous assumptions form an important aspect of learning virtually any skill or acquiring information [2]. Therefore, researchers have tried many methodologies to analyze these errors so as to help teachers teach more efficiently and students learn better.
With the popularization of computers, corpus-based research on interlanguage and language errors has become prevalent in recent years. Leech pointed out that error analysis in the 1960s became data-oriented [3]. Since that time, and with current advancements in the use of computers in linguistics, their powerful storage and processing capacity has offered a new way of analyzing language errors, giving language researchers the rare opportunity to examine linguistic errors in their contexts. This paper attempts to study word form errors committed by Chinese non-English majors in their writing through computer-aided error analysis, in the hope of helping college learners improve their English proficiency.

II. LITERATURE REVIEW
In 1997, Professor Gui at Guangdong University of Foreign Studies launched a project to build a one-million-word corpus, the Chinese Learner English Corpus (CLEC). The data of this interlanguage (IL) corpus are gathered from written production on general subject matter by Chinese learners, ranging from middle school and high school students up to English majors in university. The project aims at conducting systematic tagging and analysis of learner errors and making a contrastive survey of IL features in various dimensions. In this corpus, variables including school, age, sex, level, time of learning, and type of performance are tracked in the process of data collection and are annotated to facilitate comparative analysis for various purposes. An error tagging system was also designed, since it is an important part of the preliminary work [4].
In China, much research has been conducted on the basis of the Chinese Learner English Corpus (CLEC). In 1999, Yang used CLEC for his doctoral dissertation research. He briefly analyzed all types of errors tagged in CLEC and found that intralingual and interlingual errors accounted for 64.18% and 35.82% respectively. The qualitative analysis of errors in his study focused on crosslinguistic influence, although L1 transfer was an important factor behind the errors [5]. Li and Cai studied the misuse of English articles in the "Non-English major college students" sub-corpus of CLEC. The authors found that the misuse of articles by Chinese learners mainly followed three patterns: omission of articles, overapplication of articles and confusion of articles, which account for 54.2%, 29.6% and 16.2% respectively [6]. Gui published his paper on errors in CLEC from a cognitive perspective. He classified all the errors into three levels: lexical perceptual level, lexical-grammatical level and syntactic level. Lexical perceptual errors were mainly related to memory, e.g. mortaility for mortality. Lexical-grammatical errors were related to the misunderstanding of the target language system, e.g. hitted for hit. The conclusion was that language transfer played a very important role at all three levels [7]. In 2006, Guo analyzed characteristics of spelling errors tagged in CLEC [8]. He found that lexical errors in a sub-corpus of CLEC comprised 30% of all written mistakes made by Chinese college students in English and that there was a negative correlation between lexical errors and writing quality [9], [10]. Xia discussed possible causes of word class errors in CLEC [11]. From the above-mentioned achievements, it is clear that Chinese researchers have made great efforts on error analysis (EA) and language transfer.
Most of this research is concerned with either common errors in general or one specific error type in CLEC. The present paper uses qualitative and quantitative methods to explore word form errors committed by Chinese non-English majors in their compositions collected in CLEC, all of which share the same title, "Health Gains in Developing Countries". The aim is to provide English learners with some methods to help improve their English writing proficiency and to put forward some suggestions on English language teaching.

III. RESEARCH METHODOLOGY
This study involves contrastive and computer-aided error analyses on the word form errors collected from written samples of non-English majors in the CLEC. Corder's procedures are adopted in this study, including corpus selection, error identification, error classification and error explanation [1].

A. Research Questions
The present study attempts to answer the following questions:
1) What is the percentage of word form errors out of all the language errors in the writing sample?
2) Are there any correlations between the writing scores and the word form errors in the writing sample?
3) Are there any differences in word form errors between the good writing sample and the poor writing sample?

B. Research Instruments
This study adopts the following instruments: This is a corpus software tool which can be used to search key words.

3) Detagging tool
Because CLEC is a tagged corpus, it is necessary to use a detagging tool to delete all information and tags in order to analyze students' raw samples.

4) Statistical software
Excel was employed to calculate the proportion of each type of error, and SPSS 21 was used to analyze all errors in detail.

5) Judges
Although the lower grade and higher grade non-English-major sub-corpora in CLEC provide a writing score for each composition, given by a rater (a college teacher) based on the Principles of Scoring Writing in CET4, two more college English teachers were chosen as judges to score the writing again in order to ensure the reliability of the evaluation. These two teachers were chosen because both have scored writing for the College English Test Band Four (CET4) for several years and are quite experienced raters.

C. Research Procedures
CLEC contains five sub-corpora, made up of written texts by five groups of Chinese learners of English: high school students, lower grade college students (non-English majors), higher grade college students (non-English majors), lower grade college students (English majors) and higher grade college students (English majors). It covers 1,207,879 words. This research takes the lower grade and higher grade non-English-major college students' sub-corpora as its research subjects. The two sub-corpora contain both different and shared writing tasks. For a more exact analysis, the shared writing task entitled "Health Gains in Developing Countries" was chosen. There were 290 compositions, including 130 by lower grade college students and 160 by higher grade college students. There are 40,320 word tokens in total, excluding all error tags and annotations.
Because CLEC is a tagged corpus, the information and annotations in the writing needed to be deleted by means of a detagging tool. The raw compositions were then given to two judges to be scored. The judges were asked to read each composition and rate it on a fifteen-point scale based on the Principles of Scoring Writing in CET4. According to the principles, a holistic method of scoring was employed, which means that the judges' attention was to be focused on the overall effect of the writing rather than on specific aspects of it (such as spelling, choice of words, grammar, etc.). The mean score of each composition, calculated from the two teachers' scores and the score offered by the corpus, was recorded as the writing proficiency level of the subject.
CLEC is a tagged corpus which tags 11 types of errors: word form error (fm), verb phrase error (vb), noun phrase error (np), pronoun error (pr), adjective phrase error (aj), adverb error (ad), preposition phrase error (pp), conjunction error (cj), wording error (wd), collocation error (cc) and syntax error (sn). The word form errors are retrieved by entering "fm", the tag for word form errors, into the search engine.
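The tag search and the detagging step can be illustrated with a minimal Python sketch. It is not the tool used in this study. The bracketed tag shape `[fm1,-]`, the sample sentence, and the function names `count_error_tags` and `detag` are all assumptions made for illustration; only the two-letter codes such as fm and vb come from the list above.

```python
import re
from collections import Counter

# Assumed tag shape: "[" + two-letter code + optional digits + "," + anything + "]".
# This shape is an illustrative assumption, not a specification of CLEC.
TAG = re.compile(r"\s*\[([a-z]{2})\d*,[^\]]*\]")


def count_error_tags(text):
    """Count error tags by their two-letter category code."""
    return Counter(match.group(1) for match in TAG.finditer(text))


def detag(text):
    """Remove all error tags (and the whitespace before them) from a text."""
    return TAG.sub("", text)


# Hypothetical tagged sentence, invented for this example.
sample = "The infant mortaility [fm1,-] has fallen as medecine [fm1,-] improved [vb6,-]."

counts = count_error_tags(sample)
raw = detag(sample)
print(counts["fm"], "word form errors;", "raw text:", raw)
```

Counting by category code in this way yields the per-type proportions directly, and the detagged text is what the judges would see.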
The total number of word form errors was counted by running the "Concordance" function of AntConc. Based on the scores that the raters had given each composition, the sample compositions were ranked into three levels. The compositions with the top fifty scores were considered high-level writing, those with the bottom fifty scores were considered low-level writing, and the remaining compositions were considered middle-level writing. Then a t-test was run in SPSS to see which type of errors most significantly distinguishes the good writing from the poor writing. The correlation between word form errors and the writing scores was computed with the Pearson correlation procedure. The statistical results were interpreted and analyzed with reference to related acquisition theories. Answers to the research questions were then derived and some conclusions of the present study were drawn. Finally, some pedagogical implications were suggested on the basis of these conclusions.
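The scoring and ranking steps described above can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the procedure actually run in SPSS or Excel. The function names, the composition identifiers and the scores are hypothetical, and ties are broken by input order, which the paper does not specify.

```python
def mean_score(rater_a, rater_b, corpus_score):
    """Mean of the two judges' scores and the score recorded in the corpus."""
    return (rater_a + rater_b + corpus_score) / 3


def assign_levels(mean_scores, k=50):
    """Label the top k compositions 'high', the bottom k 'low', the rest 'middle'.

    mean_scores maps a composition id to its mean score.
    """
    if 2 * k > len(mean_scores):
        raise ValueError("k is too large for the number of compositions")
    ranked = sorted(mean_scores, key=mean_scores.get, reverse=True)
    levels = {}
    for position, comp_id in enumerate(ranked):
        if position < k:
            levels[comp_id] = "high"
        elif position >= len(ranked) - k:
            levels[comp_id] = "low"
        else:
            levels[comp_id] = "middle"
    return levels


# Hypothetical scores on the fifteen-point scale, with k=2 instead of 50.
demo = {
    "c1": mean_score(13, 12, 14),
    "c2": mean_score(5, 6, 4),
    "c3": mean_score(9, 9, 9),
    "c4": mean_score(11, 10, 12),
    "c5": mean_score(7, 8, 6),
    "c6": mean_score(3, 4, 2),
}
demo_levels = assign_levels(demo, k=2)
print(demo_levels)
```

With the 290 compositions of this study and k=50, the same function would leave 190 compositions in the middle group.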

IV. RESULTS AND DISCUSSION
This study adopts quantitative and qualitative methods to analyze the word form errors in college English writing.

1) Word form error distribution
In the descriptive data of the language errors in the writing sample, there are 1349 word form errors, which account for 29.42% of all language errors, and most word form errors involve high-frequency words, e.g. medecine for medicine, levle for level. Word form errors occur on average 4.65 times in each student's writing sample. As for the minimum and maximum occurrences, 18 of the students made no word form errors and one student made 20.

2) The Relationship Between word form Error Occurrences and Writing Quality
Table II reports the correlation between word form error occurrences and writing quality. The result shows a negative correlation between writing score and word form errors, significant at the 0.01 level, with a correlation coefficient of -0.352. Though the coefficient is not very high, it is statistically significant. It indicates that the more word form errors the students made, the lower their writing quality tended to be.
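For readers unfamiliar with the statistic, the Pearson coefficient can be computed as in the following minimal sketch. The study itself used SPSS. The function name `pearson_r` and the five score and error pairs are hypothetical and do not come from the corpus.

```python
import math


def pearson_r(xs, ys):
    """Pearson product-moment correlation coefficient of two equal-length lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant data")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical writing scores and word form error counts for five compositions.
scores = [12.0, 10.5, 9.0, 7.5, 6.0]
fm_errors = [1, 3, 4, 8, 7]

r = pearson_r(scores, fm_errors)
print(f"r = {r:.3f}")
```

A negative value of r means that higher error counts go together with lower scores, which is the direction reported in Table II.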

3) Differences in word form Errors between the High-Score and the Low-Score Writing
Among the 290 sample compositions, the top 50 were labeled high-score writing, the bottom 50 were labeled low-score writing, and the rest were labeled intermediate-score writing. Table III shows the difference in word form errors between the high-score and the low-score writing. There are 353 and 157 word form errors in total in the low-score and high-score writing respectively. The low-score students made an average of 7.06 word form errors, while the high-score students made an average of 3.14. This means that low-score students made word form errors more than twice as often as high-score students. Further ANOVA analysis shows a highly significant difference in word form errors between the high-score and the low-score writing (F=18.401, p<0.001).
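The reported figures can be checked with simple arithmetic, as in the sketch below. It uses only the totals stated in this paper. The last line relies on a general statistical fact rather than on anything reported here: with exactly two groups, the one-way ANOVA F statistic equals the square of the pooled-variance t statistic, so the t-test mentioned in the research procedures and the ANOVA reported here are equivalent ways of testing the same difference.

```python
import math

# Figures reported in this paper.
N_COMPOSITIONS = 290
FM_ERRORS_TOTAL = 1349
GROUP_SIZE = 50
FM_ERRORS_LOW = 353
FM_ERRORS_HIGH = 157
F_REPORTED = 18.401

mean_all = FM_ERRORS_TOTAL / N_COMPOSITIONS   # errors per composition, whole sample
mean_low = FM_ERRORS_LOW / GROUP_SIZE         # errors per low-score composition
mean_high = FM_ERRORS_HIGH / GROUP_SIZE       # errors per high-score composition
ratio = mean_low / mean_high                  # how many times more errors in the low group

# For two groups, F = t^2, so the equivalent |t| is the square root of F.
t_equivalent = math.sqrt(F_REPORTED)

print(f"{mean_all:.2f} {mean_low:.2f} {mean_high:.2f} {ratio:.2f} {t_equivalent:.2f}")
```

The three means reproduce the reported 4.65, 7.06 and 3.14, and the ratio of the two group means is somewhat above 2.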

B. Qualitative Analysis of word form Errors
The above section has shown that word form errors account for 29.42% of total language errors and have a negative correlation with writing quality. Low-score college students made word form errors more than twice as often as high-score students. Word form errors therefore distinguish low-score college writing from high-score writing to some extent: as learners advance and their writing quality improves, they make fewer word form errors, so the number of word form errors can predict writing quality to some degree. However, high-score college students still made a large number of word form errors in writing, which hindered readers' comprehension. Therefore, it is very important for teachers and students themselves to find ways of reducing word form errors in writing.

1) Classification of word form errors
Gui and Yang [4] classify word form errors into three sub-types: errors in spelling, word formation and capitalization. Wang & Sun divided formal errors of lexes into errors of phonological deviation, graphemic deviation and morphological deviation [12]. Phonological deviation means that the incorrect word changes the pronunciation of the target word, that is to say, the incorrect word and the target word do not have the same pronunciation, e.g., developement and development. Graphemic deviation means that the incorrect word uses different letters without changing the pronunciation, that is to say, the incorrect word and the target word still have the same pronunciation, e.g., tecknology and technology. Morphological deviation means that words were coined in some way, e.g., grieveness. The present study combines the two classifications and selects a few typical examples from the sample writing to illustrate each error type (see Table IV). Table IV shows that the students made large numbers of errors in the first subtype, phonological deviation.

a) Phonological Deviation
As shown in Table IV, phonological deviation is quite common among Chinese English learners. Within this category, insertion and omission generally occur in the unstressed syllables in the middle of the word, e.g., development spelled as *developement, medicine as *medcine. This phenomenon indicates that word stress has a strong influence on spelling. Stressed syllables and the two ends of the word form have strong phonetic representation in the learner's mental lexicon, whereas the representation of unstressed syllables is comparatively vague, which contributes to a large number of spelling errors. The relative relationship of the syllables in a word forms the rhythm of the word, so rhythm should be included in graphemic knowledge in the mental lexicon. When substitutions in phonological deviation are analyzed, it is found that consonants and vowels seldom substitute for each other; almost always a consonant substitutes for a consonant, e.g., information as *imformation, and a vowel substitutes for a vowel, e.g., average as *avarage, and in each case the incorrect word changes the pronunciation of the target word.

b) Graphemic Deviation
Quite a large number of graphemic deviations are seen in the formal errors of lexes. In both high-score and low-score writing, considerable numbers of homophonic substitution errors are found, i.e., a letter or letters with the same pronunciation taking the place of another letter or letters, e.g., *tecknology for technology. This is because there is no one-to-one mapping between phonemes and graphemes in English. In fact, there are many more letters and letter clusters than phonemes in English, so the learner has to choose a grapheme carefully for a phoneme. Double letter errors are also included in graphemic deviation, e.g., especially as *especialy, insure as *inssure.
Aside from homophonic substitution and double letter errors, another kind of graphemic deviation is caused by silent letters, including silent letter insertion and silent letter omission, e.g., watch as *whatch, and rhythm as *rythm. The common feature shared by silent letter and double letter errors is their lack of phonological motivation: whether a silent letter is omitted or inserted, or whether a consonant is doubled or not, the pronunciation of the word stays the same. From these findings it can be concluded not only that the quality of a letter and its position are stored separately, but also that graphemic position and phonemic position should be differentiated. In some words, two graphemic positions correspond to only one phonemic position, e.g. the double letters "tt" in unforgettable. In other words, a graphemic position may have no phonemic position at all, e.g., the silent letter "w" in wreak.
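The insertion, omission and substitution patterns discussed above can be detected automatically by aligning each misspelling with its target word. The following minimal Python sketch does this with the standard-library `difflib` module. It is an illustration only, not the procedure used in this study, and the function name `classify_misspelling` is hypothetical. Note that it labels only the letter-level operation; deciding whether an error is a phonological or a graphemic deviation would additionally require pronunciation data, which this sketch does not use.

```python
from difflib import SequenceMatcher

# Map difflib opcode tags to the error labels used in the text.
LABELS = {"insert": "insertion", "delete": "omission", "replace": "substitution"}


def classify_misspelling(target, error):
    """Return the letter-level operations that turn the target into the misspelling."""
    matcher = SequenceMatcher(None, target, error, autojunk=False)
    return [LABELS[tag] for tag, *_ in matcher.get_opcodes() if tag != "equal"]


# Target and misspelling pairs taken from the examples in the text.
examples = [
    ("development", "developement"),
    ("medicine", "medcine"),
    ("information", "imformation"),
    ("average", "avarage"),
    ("technology", "tecknology"),
    ("watch", "whatch"),
]

for target, error in examples:
    print(target, "->", error, classify_misspelling(target, error))
```

Applied to a full list of target and error pairs, the labels could be tallied to estimate how often each operation occurs.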

c) Morphological Deviation
The study shows that as learners' English level increases, the percentage of morphological deviation errors rises too. The most likely reason is that higher-level learners more often begin to apply their word formation knowledge in attempts to produce compound words, inflectional forms and derivatives than lower-level learners do. Because they cannot yet use morphological strategies properly, they risk coining words that are logically feasible but do not exist in the English vocabulary. Low-score learners make far fewer such errors. This does not mean that they have mastered morphological strategies, but only that they have not yet reached the stage where new strategies are attempted and guesses are made on the basis of the words they already know.

2) Reason for word form errors
There are three main reasons why EFL learners make so many word form errors in their writing. The first reason is the essential difference between English and Chinese. The Chinese language has a different origin from that of western languages. Chinese belongs to the Sino-Tibetan language family, while English belongs to the Indo-European language family. Chinese is written in ideographic characters, while English is written with orthographic spelling. The two languages differ from each other in the ways the sounds are put together, the ways in which they influence each other, and especially the rhythm, stress, and pitch patterns [13]. The second reason is the complexity of English spelling. According to Ehri, there are 70 graphemes for 40 phonemes, e.g., the phoneme /k/ can be represented by the graphemes c, ch, ck, k, kh [14]. Therefore the learner has to choose a grapheme carefully for a phoneme. The third reason is the students' carelessness. The key words expectancy and mortality were given in English in both the writing instructions and the graph, but the students still made many errors in spelling these two words. The word expectancy was incorrectly written as *epectancy once, *expectacy twice, *expectany twice, *expenctancy four times and *expentancy twice. The word mortality was incorrectly written as *mortailiaty once, *mortility once, *mortarility once, and *mortaility 374 times.

V. CONCLUSION
The analysis of the word form errors in this study revealed that Chinese college learners of English have many problems in spelling, which enormously influence their writing quality. Thus it is of great importance to reduce word form errors to improve learners' writing ability.

A. Enlarging Vocabulary Size and Focusing on High-frequency Word Learning
The study shows that a major problem in the students' writing is word form errors. This is partly due to students' small vocabulary size as well as the improper use of high-frequency words in writing. According to the College English syllabus for non-English majors, students should have mastered at least 3000-4000 English words by the time they finish English learning in college. Knowing these words is quite enough for them to express their ideas. However, college students' receptive and productive vocabulary are both quite limited, as can be safely concluded from the current study and the previous ones [9], [10], [15].
iJET - Volume 11, Issue 03, 2016
These studies show that the high-frequency vocabulary size of college students was about 1966 words out of a list of 3000 high-frequency words after 7 years of English study, and that college students made many mistakes in using high-frequency words in their writing. Additionally, the problem with students' writing lies not only in whether they can write down a considerable number of words in English, but also in whether they can write correct English words and communicate in an efficient, appropriate and natural way. Therefore, students should not only enlarge their vocabulary, but also master the usage of the high-frequency words.

B. Promoting Phonemic Knowledge
To begin with, target words and the corresponding student spellings should be compared to determine whether phonemic awareness skills might give rise to the misspelling. A common example is an error of omission, in which a student fails to represent each phoneme in the target word with at least one letter, partly because they mistakenly believe that the incorrect word and the target word have the same pronunciation (e.g. *clim for climb). If impaired phonemic awareness appears to be involved, one goal of intervention would be to promote the learners' phonemic awareness skills that are relevant to the target error patterns. It has been found that many Chinese college students' English phonemic awareness is so weak that it greatly affects their speaking, listening, reading, and writing ability, a weakness attributed to primary and middle school English teachers' neglect of teaching English phonemic knowledge [16], [17]. Phonological knowledge is used to spell words early in the developmental process. Vowels, sonorants, and consonant sequences are typically more challenging for lower-level learners to represent orthographically, and they figure prominently in early developmental spelling errors. Thus students should build up their phonemic knowledge as early as possible.

C. Improving Orthographic Knowledge
Some error patterns might arise from a lack of orthographic knowledge. Orthographic knowledge refers to the ability to convert spoken phonemes to graphemes. Some formal errors may be due to a learner's failure to make the appropriate conversion. For instance, learners are supposed to learn that /æ/ is spelled with the letter a and that the /e/ sound is spelled with the letter e, and not vice versa. A more complicated example is that learners need to know when to double a stressed single final consonant preceded by a short vowel when adding a suffix beginning with a vowel (e.g. getting). Orthographic knowledge also covers an appreciation of orthotactic principles, which are positional constraints on the conversion of phonemes to graphemes. For instance, the initial /k/ is generally represented by the letter k if it precedes the letters e or i, whereas it is represented by the letter c if it precedes the letters a, o, or u. Moreover, /k/ is often represented by the letters ck in the medial or final position of words, but never in the initial position (e.g. back and bucket, but not *ckake for cake).
Poor orthographic knowledge was evidenced in spellings that were phonetically plausible (e.g. *consum for consume; *levle for level) yet did not follow conventional orthographic patterns (i.e. use of vowel-consonant-e to mark long vowels and deletion of the vowel preceding the syllabic l). Other misspellings suggesting a lack of orthographic knowledge included using incorrect vowel graphemes for stressed vowels (e.g., personality as *personility; extension as *icstention). Lastly, still other error patterns suggest violations of orthotactic principles, such as *disdrict for district and *swimm for swim. In *disdrict, the student's use of sd violated the spelling of s+consonant sequences. In *swimm, the student used a double m in word-final position, which is not permitted.

D. Enhancing Morphological Knowledge
English learners also must depend on their knowledge of inflectional and derivational morphology to spell some words. For instance, although the inflected words "jumped", "knitted", and "hugged" end with different phonemes, they are all spelled similarly because of the uniform spelling of the regular past tense inflectional marker. Therefore, a student's awareness of the morphophonemic relationship between the past tense morpheme and its various phonemic representations helps in spelling all occurrences of regular past tense verbs. The explicit use of morphological knowledge becomes an increasingly important spelling strategy as students face the demands of writing multi-morphemic words. As a result, some of the error patterns identified may require instruction that increases students' awareness of morphology to facilitate optimal change.
In summary, word form errors might be due to a number of different underlying factors. The percentage of occurrence differed across these factors. Several deficient linguistic areas may give rise to Chinese English learners' misspellings, including phonemic and morphological awareness, and orthographic knowledge. Therefore, it is necessary to improve students' phonemic, morphological and orthographic knowledge in teaching and learning in the hope of promoting their English writing proficiency.