Lexical Ambiguity Resolution for Turkish in Direct Transfer Machine Translation Models

A. Cüneyd TANTUĞ 1, Eşref ADALI 1, Kemal OFLAZER 2

1 Istanbul Technical University, Faculty of Electrical-Electronics Engineering, Computer Engineering Department, 34469 Maslak, Istanbul, Türkiye {cuneyd, adali}@cs.itu.edu.tr
2 Sabancı University, Faculty of Engineering and Natural Sciences, 34956 Orhanlı, Tuzla, Türkiye oflazer@sabanciuniv.edu

Abstract. This paper presents a statistical lexical ambiguity resolution method for direct transfer machine translation models in which the target language is Turkish. Since direct transfer MT models do not have full syntactic information, most lexical ambiguity resolution methods are not very helpful. Our disambiguation model is based on statistical language models. We have investigated the performance of several statistical language model types and parameters in lexical ambiguity resolution for our direct transfer MT system.

1. Introduction

This paper presents a statistical lexical ambiguity resolution method for direct transfer machine translation models in which the target language is Turkish. The resolution method is based on statistical language models (LMs) which exploit collocational occurrence probabilities. Although lexical ambiguity resolution methods are generally required for NLP tasks such as accent restoration, word sense disambiguation, and homograph and homophone disambiguation, we focus only on the lexical ambiguity of word choice selection in machine translation (MT). The direct transfer model in MT transfers a sentence in the source language to a sentence in the target language on a word-by-word basis. While this model is the simplest technique for MT, it nevertheless works well for some close language pairs such as Czech-Slovak [1] and Spanish-Catalan [2]. We have been implementing an MT system between Turkic languages; the first stage of the project covers the Turkmen-Turkish language pair.
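The word-by-word transfer step described above can be sketched with a toy bilingual root dictionary. Everything here is illustrative (the dictionary entries and the `transfer` helper are ours, not the system's actual transfer lexicon); the point is only that a lookup may return several candidate target roots, which is exactly where lexical ambiguity enters.

```python
# Minimal sketch of word-by-word direct transfer (hypothetical dictionary).
# A bilingual root dictionary maps each source root to one or more target
# roots; ambiguity arises whenever a lookup returns several candidates.

TRANSFER_DICT = {                # toy Turkmen -> Turkish root entries
    "näme": ["ne"],
    "adam": ["adam", "insan"],
    "geple": ["konuş", "söyle"],  # two candidates -> lexical ambiguity
}

def transfer(source_roots):
    """Return, for each source root, the list of candidate target roots.
    Unknown roots fall back to themselves (a common direct-transfer default)."""
    return [TRANSFER_DICT.get(root, [root]) for root in source_roots]

candidates = transfer(["näme", "adam", "geple"])
print(candidates)  # [['ne'], ['adam', 'insan'], ['konuş', 'söyle']]
```

Each position with more than one candidate must later be resolved by the disambiguation module.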
The lexical ambiguity problem arises when the bilingual transfer dictionary produces more than one corresponding target language root word for a source language root word. Hence, the MT system has to choose the right target root word by means of some evaluation criteria. This process is called word choice selection in MT-related tasks. Most of the methods developed to overcome this problem are based on some syntactic analysis or domain knowledge. However, no such syntactic analysis or other deeper knowledge is available in MT systems that utilize direct transfer models.

2. Previous Work

The general task of word sense disambiguation (WSD) studies the assignment of correct sense labels to ambiguous words in naturally occurring free text. WSD has many application areas, such as MT, information retrieval, content and thematic analysis, grammatical analysis, spelling correction, and case changes. In general terms, WSD involves the association of a given word in a text or discourse with a definition or meaning (sense) which is distinguishable from other meanings potentially attributable to that word. In the history of the WSD field, supervised and unsupervised statistical machine learning techniques have been used, as well as other AI methods. A Bayesian classifier was used in the work of Gale et al. [5]. Leacock et al. [6] compared the performance of Bayesian, content vector and neural network classifiers on WSD tasks. Schütze's experiments showed that unsupervised clustering techniques can approach the results of supervised techniques by using a large-scale context (1000 neighbouring words) [7]. However, all these statistical WSD methods suffer from two major problems: the need for a manually sense-tagged corpus (certainly for supervised techniques) and data sparseness.
Lexical ambiguity resolution in MT is a task related to WSD which focuses on deciding the most appropriate word in the target language among a group of possible translations of a word in the source language. There are some general lexical ambiguity resolution techniques; unfortunately, most of the successful ones require complex information sources like lemmas, inflected forms, parts of speech, local and distant collocations, and trigrams [8]. Most disambiguation work uses syntactic information together with other information such as taxonomies, selectional restrictions and semantic networks. In recent work, complex processing is avoided by using partial (shallow) parsing instead of full parsing. For example, Hearst [9] segments text into noun and prepositional phrases and verb groups and discards all other syntactic information. Mostly, the syntactic information used is simply the part of speech, in conjunction with other information sources [10]. In our work, there is no syntactic transfer, so no syntactic-level knowledge beyond the part of speech is available. We have therefore developed a selection model based on statistical language models. Our disambiguation module performs not only lexical ambiguity resolution but also the disambiguation of source language morphological ambiguities. Our translation system is designed mainly for translation from Turkic languages to Turkish. One serious problem is the almost total lack of computational resources and tools for these languages to help with some of the problems on the source language side, so all the morphological ambiguities are transferred to the target language side. Apart from the lexical disambiguation problem, our disambiguation module should decide the right root word by considering the morphological ambiguity.
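Morphological analyses in this pipeline are encoded as strings of the form root+Tag+Tag…, with ^DB marking derivation boundaries. A minimal sketch (the `parse_analysis` helper name is ours) of extracting the two pieces the disambiguation module works with, the root word and the final part of speech, might look like:

```python
def parse_analysis(analysis):
    """Split a morphological analysis string into (root, final POS).

    Derivation groups are separated by the ^DB marker; the final POS is
    the first tag of the last derivation group.
    """
    first, *rest = analysis.split("^DB")
    parts = first.split("+")
    root = parts[0]
    if rest:
        # last derivation group starts with the derived POS, e.g. "+Noun+..."
        final_pos = rest[-1].lstrip("+").split("+")[0]
    else:
        final_pos = parts[1] if len(parts) > 1 else None
    return root, final_pos

print(parse_analysis("akmak+Adj"))  # ('akmak', 'Adj')
print(parse_analysis("ak+Verb+Pos^DB+Noun+Inf1+A3sg+Pnon+Nom"))  # ('ak', 'Noun')
```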
For example, the Turkmen word “akmak” (foolish) has two ambiguous morphological analyses:

akmak+Adj (fool)
ak+Verb+Pos^DB+Noun+Inf1+A3sg+Pnon+Nom (to flow)

These analyses cannot be disambiguated until the transfer phase because there is no POS tagger for the source language. The transfer module converts all the root words to the target language as stated below:

ahmak+Adj (fool)
budala+Adj (stupid)
ak+Verb+Pos^DB+Noun+Inf1+A3sg+Pnon+Nom (to flow)

Note that there are two possible translations of the source word for its “fool” sense. The disambiguation module therefore has to handle both the lexical ambiguity and the morphological ambiguity.

3. Language Models

Statistical language models define probability distributions on word sequences. By using a language model, one can compute the probability of a sentence S = w1 w2 w3 … wn by the following formula:

P(S) = P(w1) P(w2|w1) P(w3|w1 w2) … P(wn|w1 … wn-1)

This means that the probability of any word sequence can be calculated by decomposition using the chain rule; but usually, due to sparseness, most of the terms above would be 0, so n-gram approximations are used.

3.1. Training Corpus

Since Turkish has an agglutinative and highly inflectional morphology, the training corpus cannot be just texts collected from various sources. In order to calculate frequencies of the root words, these texts should be processed by a lemmatizer. However, some of our models require not only the root words but also other morphological structures, so the words in the training corpus should be processed with a morphological analyzer, and the morphological ambiguity should be resolved manually or by using a POS tagger. We have used such a corpus, composed of texts from a daily Turkish newspaper. Some statistics about the corpus are given in Table 1.
Table 1. Training Corpus Statistics

# of tokens: 948,000
root word vocabulary size: 25,787
root words occurring 1 or 2 times: 14,830
root words occurring more than 2 times: 10,957

3.2. Baseline Model

At first glance, the simplest way of word choice selection can be implemented by using word occurrence frequencies collected from a corpus. We took this model as our baseline model. Note that this is the same as a unigram (1-gram) language model.

3.3. Language Model Types

We have used two different types of language models for lexical ambiguity resolution. LM Type 1 is built by using only root words and dismissing all other morphological information. LM Type 2 uses the probability distributions of root words together with their part of speech information. Both language model types are back-off language models with Good-Turing discounting for smoothing. Additionally, we have used a cutoff of 2, which means that n-grams occurring fewer than two times are discarded. We have computed our language models by using the CMU-Cambridge Statistical Language Modeling Toolkit [11].

3.4. LM Parameters

Apart from the type of the LM, there are two major parameters in constructing a LM. The first one is the order of the model: the number of successive words to consider. The second parameter is the vocabulary size. It might seem better to have all the words in a LM; however, this is not practically possible, because a large LM is hard to handle and manage in real-world applications and is prone to sparseness problems. Therefore, reducing the vocabulary size is common practice, and deciding how many of the most frequent words will be used to build the LM becomes the second parameter to be determined.

4. Implementation

We have run our tests on a direct MT system which translates text in the Turkmen language into Turkish.
The main motivation of this MT system is the design of a generic MT system that performs automatic text translation between all Turkic languages. Although the system is currently under development and has some weaknesses (like the small coverage of the source language morphological analyzer and an insufficient multi-word transfer block), the preliminary results are at an acceptable level. The system has the following processing blocks:

1. Tokenization
2. Source Language (SL) Morphological Analysis
3. Multi-Word Detection
4. Root Word Transfer
5. Morphological Structure Transfer
6. Lexical & Morphological Ambiguity Resolution
7. Target Language (TL) Morphological Synthesis

Our word-to-word direct MT system generates all possible Turkish counterparts of each input source root word by using a bilingual transfer dictionary (while the morphological features are directly transferred). As the next step, all candidate Turkish words are used to generate a directed acyclic graph of possible word sequences, as shown in Figure 1.

Fig. 1. The process of decoding the most probable target language sentence (a directed acyclic graph of candidate Turkish root words, between <s> and </s> markers, for the source sentence “Näme üçin adamlar dürli dillerde gepleýärler ?”)

As seen in the figure, each root word of the sentence in the source language can have one or more Turkish translations, which produces lexically ambiguous output. The transition probabilities are determined by the language model. As an example, the probability of the transition between “dil” and “konuş” is determined by the probability P(“konuş”|“dil”), which is calculated from the corpus in the training stage of the bigram language model. The ambiguity is then resolved by finding the path with the maximum probability using the Viterbi algorithm [13].
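The decoding step just described, finding the highest-probability path through the candidate graph, can be sketched as follows. This is a minimal Viterbi pass over per-position candidate lists; log probabilities avoid underflow, and the transition table at the bottom is a made-up illustration, not values trained from our corpus.

```python
import math

def viterbi_decode(candidates, prob):
    """Pick the most probable target word sequence from per-position
    candidate lists, using bigram transition probabilities prob(w, prev)."""
    # map: candidate at current position -> (best log score, best path)
    best = {"<s>": (0.0, ["<s>"])}
    for position in candidates:
        new_best = {}
        for w in position:
            # choose the best predecessor for candidate w
            score, path = max(
                (s + math.log(prob(w, prev)), p + [w])
                for prev, (s, p) in best.items()
            )
            new_best[w] = (score, path)
        best = new_best
    score, path = max(best.values())
    return path[1:]  # drop the <s> marker

# toy transition probabilities (illustrative only)
P = {("<s>", "dil"): 0.9, ("dil", "konuş"): 0.6, ("dil", "söyle"): 0.2}
prob = lambda w, prev: P.get((prev, w), 1e-6)
print(viterbi_decode([["dil"], ["konuş", "söyle"]], prob))  # ['dil', 'konuş']
```

Because P(“konuş”|“dil”) outweighs P(“söyle”|“dil”), the sketch selects “konuş”, mirroring the example in the figure.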
5. Evaluation

In the evaluation phase of our work, the aim is to find the LM type, order and vocabulary size which perform best with our direct MT model. We have conducted our tests for n = 1, 2, 3, 4 and root word vocabulary sizes of 3K, 4K, 5K, 7K and 10K. Note that 10K is nearly the vocabulary size of the training corpus, which means that all the root words are used to construct the LM. The performance of the proposed resolution method is evaluated by means of the NIST scores achieved by each LM type for different parameters. NIST is a widely used, well-known method for evaluating the performance of MT systems [14]. We have used the BLEU-NIST evaluation tool mteval, which can be obtained from NIST. These evaluations are calculated on 255 sentences. In order to find out which LM type and parameters are better, we have run our direct transfer system with different LMs. For a fair evaluation, we have measured the performances of these LMs against the performance of our baseline model. The results are given in Figure 2 and Figure 3.

Fig. 2. LM Type 1 (Root Word) Results: NIST score improvement over the baseline for LM orders n=1..4 and vocabulary sizes 3K, 4K, 5K, 7K, 10K

Fig. 3. LM Type 2 (Root Word + POS Tag) Results: NIST score improvement over the baseline for LM orders n=1..4 and vocabulary sizes 3K, 4K, 5K, 7K, 10K

In our tests, the root word language model (Type 1, n=2, vocabulary size 7K) performs best. We observed that there is absolutely no difference between the 7K and 10K vocabulary selections; this is why the 7K and 10K lines in the graphs are superposed. An interesting and important result is the decrease of the score for n higher than 2 for all LM types and parameters (except Type 1 with a 3K vocabulary). This can be explained by the fact that most of the dependencies are between two consecutive words.
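A toy illustration of how such bigram probabilities are estimated, including the cutoff of 2 from Section 3.3, is given below. Note that this sketch substitutes a crude fixed back-off weight for the toolkit's Good-Turing back-off smoothing, and all names in it are ours; it only shows the structure of the computation.

```python
from collections import Counter

def train_bigram_lm(sentences, cutoff=2):
    """Toy bigram model: maximum-likelihood estimates with a count cutoff,
    backing off to unigram frequencies for discarded or unseen bigrams."""
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        words = ["<s>"] + sent + ["</s>"]
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    # cutoff: discard bigrams seen fewer than `cutoff` times
    bigrams = {bg: c for bg, c in bigrams.items() if c >= cutoff}
    total = sum(unigrams.values())

    def prob(w, prev):
        if (prev, w) in bigrams:
            return bigrams[(prev, w)] / unigrams[prev]
        # crude fixed back-off weight (real models use Good-Turing mass)
        return 0.4 * unigrams.get(w, 0.5) / total
    return prob

corpus = [["dil", "konuş"], ["dil", "konuş"], ["dil", "söyle"]]
p = train_bigram_lm(corpus)
print(p("konuş", "dil") > p("söyle", "dil"))  # True: only that bigram survives the cutoff
```

Here (“dil”, “söyle”) occurs once and is discarded by the cutoff, so “söyle” after “dil” receives only a small backed-off unigram probability.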
Also, as expected, the NIST score improvement increases for larger vocabulary sizes, but it is clear that even a LM with a 5K vocabulary is already useful. As an example, the translation of the Turkmen sentence of Figure 1 by our baseline model is given below:

Input : Näme üçin adamlar dürli dillerde gepleýärler ?
Output1: ne için insanlar türlü dillerde söylüyorlar ? (Type 1, n=1, 3K)
Output2: ne için insanlar türlü dillerde konuşuyorlar ? (Type 2, n=2, 3K)

Although Output1 is an acceptable translation, the quality of the translation can be improved by using a Type 1 LM with n=2. P(“söyle”) (to say) is larger than P(“konuş”) (to speak) in the baseline model. On the contrary, P(“konuş”|“dil”) (“dil” means language) is larger than P(“söyle”|“dil”) in the LM with n=2, which makes the translation more fluent. In the example below, one can see the positive effect of increasing the vocabulary size. The source root word “çöl” (desert) has three translations in the transfer dictionary: “seyrek” (scarce), “kıt” (inadequate) and “çöl” (desert). There is no probability entry for these words in a 3K vocabulary, so all three target words receive probabilities from the smoothing process. The system chooses the first one, “seyrek” (scarce), which results in a wrong translation. This error is corrected by the LM with a 7K vocabulary, which includes the word “çöl” (desert).

Input : Bir adam çölüñ içinde kompas tapypdyr .
Output1: bir insan seyreğin içinde pusula bulmuştur . (Type 1, n=1, 3K)
Output2: bir insan çölün içinde pusula bulmuştur . (Type 1, n=1, 7K)

There are also examples of erroneous translations. In the following translation instances, the LM with n=2 and a 5K vocabulary cannot correctly choose the word “dur” (to stay); instead, it chooses the verb “dikil” (to stand) because of the higher probability of P(“dikil”|“sonra”) (“sonra” means “after”). In this case, the baseline translation model outperforms all other LMs.
Input : Hamyr gyzgyn tamdyrda esli wagtlap durandan soñ , ondan tagamly bir zadyñ ysy çykyp ugrapdyr .
Output1: hamur kızgın tandırda epeyce süre durduktan sonra , ondan tatlı bir şeyin kokusu çıkıp başlamıştır . (Type 1, n=1, 3K)
Output2: hamur kızgın tandırda epeyce süre dikildikten sonra , ondan tatlı bir şeyin kokuyu çıkıp başlamıştır . (Type 1, n=2, 5K)

In some cases, POS information decreases NIST scores, which is the opposite of what we expected. For instance, the following translations show a situation where the baseline system produces the right result by choosing the verb “var” (to exist), whereas the more complex LM (Type 2 with n=2 and a 5K vocabulary) generates a false word choice by selecting the verb “git” (to go).

Input : Sebäbi meniñ içimde goşa ýumruk ýaly gyzyl bardy
Output1: nedeni benim içimde çift yumruk gibi altın vardı . (Type 1, n=1, 3K)
Output2: nedeni benim içimde çift yumruk gibi altın gitti . (Type 2, n=2, 5K)

6. Conclusions

Our major goal in this work is to propose a lexical ambiguity resolution method for Turkish to be used in our direct transfer MT model. The lexical ambiguity occurs mainly because of the transfer of the source language root words. We have built language models for the statistical resolution of this lexical ambiguity, and these LMs are then used to generate Hidden Markov Models. Finding the path with the highest probability in the HMM is done by using the Viterbi method. In this way, we can resolve the lexical ambiguity and choose the most probable word sequence in the target language. Two types of language models (root words, and root words + POS tags) are used, and we have investigated the effects of other parameters like LM order and vocabulary size. The LM built by using root words performs best with the parameters n=2 and a 7K vocabulary.
Since Turkish is an agglutinative language, taking NIST as the evaluation method may not be very meaningful, because NIST considers only surface-form matching of the words, not root word matching. Even though the model chooses the right root word, the other morphological structures can make the surface form of the translated word different from the surface forms of the words in the reference sentences. This means that the real performance is equal to, or probably higher than, the measured scores. We expect higher NIST score improvements with the further development and extension of our translation model; there remain problems such as SL morphological analyzer mistakes and transfer rules that are insufficient to handle some cases.

Acknowledgments

This project is partly supported by TÜBİTAK (The Scientific and Technical Research Council of Turkey) under contract no 106E048.

References

1. Hajič, J., Hric, J., Kubon, V.: Machine Translation of Very Close Languages. Applied NLP Processing, NAACL, Washington (2000)
2. Canals, R., et al.: interNOSTRUM: A Spanish-Catalan Machine Translation System. EAMT Machine Translation Summit VIII, Spain (2001)
3. Hirst, G., Charniak, E.: Word Sense and Case Slot Disambiguation. In AAAI-82, pp. 95-98 (1982)
4. Black, E.: An Experiment in Computational Discrimination of English Word Senses. IBM Journal of Research and Development 32(2), pp. 185-194 (1988)
5. Gale, W., Church, K. W., Yarowsky, D.: A Method for Disambiguating Word Senses in a Large Corpus. Statistical Research Report 104, Bell Laboratories (1992)
6. Leacock, C., Towell, G., Voorhees, E.: Corpus-based statistical sense resolution. Proceedings of the ARPA Human Language Technology Workshop, San Francisco, Morgan Kaufmann (1993)
7. Schütze, H.: Automatic word sense discrimination. Computational Linguistics, 24(1) (1998)
8. Yarowsky, D.: Unsupervised word sense disambiguation rivaling supervised methods. In Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics (ACL'95), Cambridge, MA (1995)
9. Hearst, M. A.: Noun homograph disambiguation using local context in large corpora. Proceedings of the 7th Annual Conf. of the University of Waterloo Centre for the New OED and Text Research, Oxford, United Kingdom, 1-19 (1991)
10. McRoy, S. W.: Using multiple knowledge sources for word sense discrimination. Computational Linguistics, 18(1), 1-30 (1992)
11. Clarkson, P. R., Rosenfeld, R.: Statistical Language Modeling Using the CMU-Cambridge Toolkit. Proceedings ESCA Eurospeech (1997)
13. Forney, G. D., Jr.: The Viterbi Algorithm. Proceedings of the IEEE, Vol. 61, pp. 268-278 (1973)
14. NIST Report (2002): Automatic Evaluation of Machine Translation Quality Using N-gram Co-Occurrence Statistics. http://www.nist.gov/speech/tests/mt/doc/ngramstudy.pdf