1.Project name: Collection of spontaneous conversations in French, Spanish and Turkish for creating spoken corpora
2.Outline of the Project
This project continues the research conducted under the 21st Century COE Program. We propose to create corpora of unprompted conversations in French, Spanish and Turkish, which will then be transcribed and annotated. It is an international collaborative research program with Marmara University, Istanbul University, Middle East Technical University and Mersin University for Turkish, the University of Aix-Marseille for French, and the Autonomous University of Madrid for Spanish.
3.Core member in charge: Kawaguchi, Yuji (GSACS)
4.Toshihiro Takagaki (GSACS)
Spoken French Corpus
Collaborative researchers:
Hisae Akihiro (University of Aix-Marseille)
Shunsuke Nakata (Doctoral Student, GSACS)
Mito Matsuzawa (Doctoral Student, GSACS)
Kaori Sugiyama (Doctoral Student, GSACS)
Kentaro Koga (Master’s Student, GSACS)
Academic Advisers:
José Deulofeu (University of Aix-Marseille)
André Valli (University of Aix-Marseille)
Frédéric Sabio (University of Aix-Marseille)
Spoken Spanish Corpus
Collaborative researchers:
Haruka Kuno (Doctoral Student, GSACS)
Ryo Tsutahara (Doctoral Student, GSACS)
Hideki Kumakura (Doctoral Student, GSACS)
Academic Advisers:
Chieko Kimura (The Autonomous University of Madrid)
Spoken Turkish Corpus
Collaborative researchers:
Selim Yılmaz (Marmara University)
Arsun Uras Yılmaz (Istanbul University)
Yuu Tsukui (Japanese Embassy, Turkey)
Academic Advisers:
Şükriye Ruhi (Middle East Technical University)
Yeşim Aksan (Mersin University)
Mustafa Aksan (Mersin University)
5.Accomplishments:
1) Spoken Turkish Corpus
In academic year 2007, only the Turkish corpus project was carried out. Selim Yılmaz recorded spontaneous conversations with 22 informants at Marmara University. The total recording length is 7 hours 17 minutes. Transcription has been completed.
In academic year 2008, we invited Professor Şükriye Ruhi, whose group at Middle East Technical University is constructing a one-million-token corpus of spoken Turkish, and Professors Yeşim Aksan and Mustafa Aksan of Mersin University, who are preparing the first National Corpus of Written Turkish, with fifty million tokens. We held a workshop with them and agreed to promote collaborative research in the future.
In academic year 2009, a new Turkish corpus was constructed. Selim Yılmaz and Arsun Uras Yılmaz recorded 23 spontaneous conversations at Marmara University and Istanbul University. The total recording length is 8 hours 47 minutes. Transcription has been completed.
In academic year 2010, a further Turkish corpus was added. Selim Yılmaz recorded 30 spontaneous conversations at Marmara University. The total recording length is 6 hours 48 minutes. Transcription has been completed. The Spoken Turkish Corpus now totals 330,048 tokens.
In academic year 2011, another Turkish corpus was under construction. Selim Yılmaz and Arsun Uras Yılmaz recorded 21 spontaneous conversations at Marmara University and Istanbul University. The total recording length is 7 hours 17 minutes. Transcription is ongoing.
References
Yuji Kawaguchi, “Predicate-final Structure of spoken Turkish: Corpus Analysis Note,” 2011, Working Papers in Corpus-based Linguistics and Language Education 7 ‘Field Research, Linguistic Corpus, and Linguistic Informatics IV,’ 171-197.
Şükriye Ruhi, “The Pragmatics of yani as a Parenthetical Marker in Turkish: Evidence from the METU Turkish Corpus,” ‘Working Papers in Corpus-based Linguistics and Language Education 3,’ Global COE Program, Tokyo University of Foreign Studies, Graduate School of Global Studies, 2009, 285-298.
Yeşim Aksan, Mustafa Aksan, “Building a National Corpus of Turkish: Design and Implementation,” ‘Working Papers in Corpus-based Linguistics and Language Education 3,’ Global COE Program, Tokyo University of Foreign Studies, Graduate School of Global Studies, 2009, 299-310.
Derya Çokal Karadaş, Şükriye Ruhi, “Features for an Internet Accessible Corpus of Spoken Turkish Discourse,” ‘Working Papers in Corpus-based Linguistics and Language Education 3,’ Global COE Program, Tokyo University of Foreign Studies, Graduate School of Global Studies, 2009, 311-320.
Yuji Kawaguchi, “A corpus-driven analysis of -r dropping in spoken Turkish,” 2009, ‘Corpus Analysis and Variation in Linguistics,’ John Benjamins, 281-297.
2) Spoken French and Spanish Corpora
In academic year 2009, we sent the Spanish team to the Autonomous University of Madrid for field recording of spontaneous Spanish conversations; they recorded 34 hours 56 minutes. Transcription has been completed. The 2009 Spanish corpus totals 462,248 tokens.
In academic year 2009, we also sent the French team to the University of Aix-Marseille in France. They recorded 34 different spontaneous dialogues in university classrooms or private homes. The dialogues transcribed so far total 484,232 tokens, and transcription is still ongoing. In academic year 2011, a new team will be sent to the University of Aix-Marseille again.