Corpus-based Linguistics and Language Education | Research on Lexicon/Language-use based on Subject-Categorized Corpora

1. Project name: Lexicon/Examination of Examples using a Subject-Categorized Corpus

2. Outline of the Project:
A search engine (JSC) for examining corpus examples has been developed. With JSC, search keywords are entered on the command line. To enable other GCOE members to use the textbook corpus, we are arranging subcontracted development of a mouse-operated Web search interface. A morpheme filter is also under development.
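As a rough illustration of what a keyword search over corpus examples involves, the following is a minimal keyword-in-context sketch. It is not the JSC implementation; the function name `kwic`, the `width` parameter, and the sample sentence are assumptions made for this example.

```python
import re

def kwic(lines, keyword, width=10):
    """Return (left context, match, right context) for every hit of keyword.

    Hypothetical helper for illustration only; JSC's actual behavior and
    options are not described in this document.
    """
    pattern = re.compile(re.escape(keyword))  # treat the keyword literally
    hits = []
    for line in lines:
        for m in pattern.finditer(line):
            left = line[max(0, m.start() - width):m.start()]
            right = line[m.end():m.end() + width]
            hits.append((left, m.group(), right))
    return hits

# Invented sample sentence, not taken from the project's corpora.
sample = ["電源を入れる前に確認する"]
print(kwic(sample, "前に", width=3))
# → [('入れる', '前に', '確認す')]
```

A command-line tool would read the keyword from its arguments and the lines from corpus files; a Web interface would call the same search function behind a form.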

3. Core Member in Charge: Sano, Hiroshi (GSACS)

4. Collaborating Researchers:
Kawaguchi, Yuji (GSACS); Tono, Yukio (GSACS)

5. Progress:
A search engine (JSC) for examining corpus examples has been developed. We ultimately plan to install it on the dedicated GCOE server and use it for corpus searches.
This year we plan to upgrade the JSC interface and are currently arranging the subcontracted work. A morpheme filter is also under development.
To produce the tagged corpus, Japanese morphological analysis was carried out last year with the ChaSen software. MeCab, which is free software, may also be used. The two differ in their morpheme recognition units: ChaSen is said to use relatively short units (close to morphemes), and MeCab longer units (close to words). We are developing a morpheme filter program that takes ChaSen's morphological analysis as input and changes the morpheme recognition units according to externally specified rules. The filter's morpheme conversion rules will be created at this university. We will start with a morpheme classification for Japanese language teaching, which will make it possible to add tags usable in Japanese language research, such as a morpheme classification in the style of Matsushita grammar.
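The idea of changing recognition units by external rules can be sketched as follows. This is a minimal illustration, not the project's filter: the rule format (a sequence of part-of-speech tags mapped to the tag of the merged unit), the English tag names, and the function name `apply_filter` are all assumptions, and real ChaSen output uses its own tag set.

```python
from typing import List, Tuple

Morpheme = Tuple[str, str]            # (surface form, part-of-speech tag)
Rule = Tuple[Tuple[str, ...], str]    # hypothetical: (tag sequence, merged tag)

def apply_filter(morphemes: List[Morpheme], rules: List[Rule]) -> List[Morpheme]:
    """Merge runs of short-unit morphemes into longer units according to rules."""
    # Ignore empty patterns and try longer patterns first (longest match wins).
    ordered = sorted((r for r in rules if r[0]), key=lambda r: len(r[0]), reverse=True)
    out: List[Morpheme] = []
    i = 0
    while i < len(morphemes):
        for pattern, new_tag in ordered:
            window = morphemes[i:i + len(pattern)]
            if len(window) == len(pattern) and tuple(t for _, t in window) == pattern:
                # Concatenate the surfaces and assign the tag given by the rule.
                out.append(("".join(s for s, _ in window), new_tag))
                i += len(pattern)
                break
        else:
            # No rule matched at this position: pass the morpheme through unchanged.
            out.append(morphemes[i])
            i += 1
    return out

# Invented short-unit analysis and rule, for illustration only.
short_units = [("勉強", "noun"), ("し", "verb"), ("ます", "aux"), ("。", "symbol")]
rules = [(("noun", "verb"), "verb")]
print(apply_filter(short_units, rules))
# → [('勉強し', 'verb'), ('ます', 'aux'), ('。', 'symbol')]
```

Because the rules are plain data passed in from outside, a different classification (for example, one designed for Japanese language teaching) could be applied by swapping the rule list without changing the program.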

6. Accomplishments:
(1) Corpus search engine
We analyzed a corpus of Japanese textbooks (14,000,000 words) and a corpus of Japanese-language home appliance instruction manuals (240,000 words). We developed a search engine system and carried out a trial search of the Japanese textbook corpus through a web browser.
(2) Morpheme filter
We performed a qualitative analysis of the corpus of Japanese-language home appliance instruction manuals and identified differences between Japanese and English in the use of subordinate clauses, which often appear in instruction manuals when describing how to handle appliances.
