(48) Automatic Extraction of Low Frequency Bilingual Word Pairs from
Parallel Corpora with Various Languages
Lecture Notes in Artificial Intelligence, Springer-Verlag, Vol. 3518, pp. 32-37, 2005-5
In this paper, we propose a new learning method for extracting low
frequency bilingual word pairs from parallel corpora in various languages.
We call this new learning method Adjacent Information Learning (AIL).
The essence of AIL is the hypothesis that, in local parts of bilingual
sentence pairs, the equivalents of the words adjacent to the source
language word of a bilingual word pair are also adjacent to its target
language word. Our system using AIL can extract not only high frequency
bilingual word pairs but also low frequency bilingual word pairs.
It is important to extract low frequency bilingual word pairs
because frequencies of many bilingual word pairs are very low when
large-scale parallel corpora are unobtainable. In addition, AIL is
a language-independent learning method.
Therefore, using AIL, our system can extract bilingual word
pairs from parallel corpora with various languages. Evaluation experiments
indicated that the extraction rate of our system using AIL was 60.1%
in parallel corpora with five different languages. This extraction
rate of our system using AIL was more than 8.0 percentage points higher
than the extraction rates of the system based on the Dice coefficient.
Moreover, the extraction rates of bilingual word pairs with
frequencies of 1 and 2 improved by 11.0 and 6.6 percentage points,
respectively, when AIL was used.
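
To illustrate why low frequency pairs are difficult for the Dice-based baseline mentioned above, the following is a minimal sketch of the standard Dice coefficient, 2 * f(s, t) / (f(s) + f(t)), computed over sentence-level co-occurrence. It is not the configuration used in the paper: the function name dice_scores, the counting of each word at most once per sentence, and the toy English-French corpus are all assumptions made for illustration.

```python
from collections import Counter
from itertools import product


def dice_scores(sentence_pairs):
    """Return {(src_word, tgt_word): Dice score} for tokenized sentence pairs.

    Hypothetical helper: frequencies are counted per sentence pair, so a
    word that occurs twice in one sentence is still counted once.
    """
    src_freq, tgt_freq, pair_freq = Counter(), Counter(), Counter()
    for src_tokens, tgt_tokens in sentence_pairs:
        src_set, tgt_set = set(src_tokens), set(tgt_tokens)
        src_freq.update(src_set)
        tgt_freq.update(tgt_set)
        # Every source word co-occurs with every target word in the pair.
        pair_freq.update(product(src_set, tgt_set))
    return {
        (s, t): 2 * n / (src_freq[s] + tgt_freq[t])
        for (s, t), n in pair_freq.items()
    }


# Toy corpus (illustrative data only, not from the paper).
corpus = [
    (["the", "cat", "sleeps"], ["le", "chat", "dort"]),
    (["the", "dog", "sleeps"], ["le", "chien", "dort"]),
    (["the", "bird", "sings"], ["le", "oiseau", "chante"]),
]
scores = dice_scores(corpus)

# "sleeps" and "dort" co-occur twice, so the evidence is consistent.
print(scores[("sleeps", "dort")])
# "bird" occurs once, so both candidates in its sentence tie.
print(scores[("bird", "oiseau")], scores[("bird", "chante")])
```

In this toy corpus, "bird", "oiseau", and "chante" each occur only once, so the correct pair (bird, oiseau) and the incorrect pair (bird, chante) receive the same score. A purely frequency-based measure cannot separate them, which is the kind of case where additional evidence, such as the adjacency information used by AIL, is needed.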