Corpus-based Linguistics and Language Education | International Collaborative Project on Corpora of Learners’ Language Use

1. Project name: International Corpus of Crosslinguistic Interlanguage

2. Outline of the Project:
The main aim of this project is to collect in-class English essays written by beginning- and intermediate-level learners of English from different L1 backgrounds. We collaborate with learner corpus researchers in about ten countries. The project also aims to provide data in a “two-way format”: in return, we supply data from foreign studies university students who are learning the languages of the partner countries (e.g. German, Spanish, Chinese) as foreign languages.

3. Core Members in Charge: Tono, Yukio (GSACS); Negishi, Masashi (GSACS)

4. Collaborating Researchers:
Germany: Tom Rankin (Vienna University of Economics and Business Administration)
Hong Kong, China: Dr. David Lee (City University of Hong Kong)
Israel: Tammar Aviad (PhD student, University of Haifa)
Poland: Dr. Agnieszka Lenko-Szymanska (University of Warsaw)
Singapore: Dr. Huaqing Hong (National Institute of Education)
Spain: Dr. Pascual Perez-Paredes (Universidad de Murcia); Maria Belen Diez Bedmar (University of Jaen)
Taiwan: Austina Shih (The Language Training & Testing Center); May Ma (The Language Training & Testing Center)

5. Progress:
The Japanese learner corpus, the JEFLL Corpus (10,000 English compositions, 670,000 words), has already been compiled under the leadership of Tono, Yukio and is publicly available via the web. To collect data in each country in a format compatible with the JEFLL Corpus, we asked the collaborating researchers listed above to carry out pilot data collection. We have completed the pilot test phase for five data sets so far (German, Spanish, Taiwanese, Israeli, and Polish) and are now transcribing and formatting these data for further analysis. The remaining pilot data collection should be finished by April 2009, after which we will move on to the main data collection phase in 2009 and 2010. The corpus is to be completed by the end of 2010 and will be made publicly available.
