[EBOOK] Machine Learning Approaches For Dealing With Limited Bilingual Training Data In Statistical Machine Translation PDF Download

Machine translating

Machine Learning Approaches for Dealing with Limited Bilingual Training Data in Statistical Machine Translation

Book Details:

Author : Gholamreza Haffari
Publisher :
Release : 2009
ISBN :
Pages : 0 pages

Download or read book Machine Learning Approaches for Dealing with Limited Bilingual Training Data in Statistical Machine Translation written by Gholamreza Haffari. This book was released in 2009. Available in PDF, EPUB and Kindle. Book excerpt: Statistical Machine Translation (SMT) models learn how to translate by examining a bilingual parallel corpus containing sentences aligned with their human-produced translations. However, high quality translation output is dependent on the availability of massive amounts of parallel text in the source and target languages. There are a large number of languages that are considered low-density, either because the population speaking the language is not very large or because, even if millions of people speak the language, insufficient online resources are available in it. This thesis covers machine learning approaches for dealing with such situations in statistical machine translation where the amount of available bilingual data is limited. The problem of learning from insufficient labeled training data has been dealt with in the machine learning community under two general frameworks: (i) Semi-supervised Learning, and (ii) Active Learning. The complex nature of the machine translation task poses severe challenges to most of the algorithms developed in the machine learning community for these two learning scenarios. In this thesis, I develop semi-supervised learning as well as active learning algorithms to deal with the shortage of bilingual training data for the Statistical Machine Translation task. This dissertation provides two approaches, unified in what is called the bootstrapping framework, to this problem. I assume that we are given access to a monolingual corpus containing a large number of sentences in the source language, in addition to a small or moderate sized bilingual corpus.
The idea is to take advantage of this readily available monolingual data in building a better SMT model in an iterative manner: by selecting an important subset of these monolingual sentences, preparing their translations, and using them together with the original sentence pairs to re-train the SMT model. When preparing the translation of the selected sentences, if we use a human annotator, then the framework fits into the active learning scenario in machine learning. If we instead use the translations generated by the SMT system, then we get the self-training framework, which fits into the semi-supervised learning scenario in machine learning. The key points that I address throughout this thesis are (1) how to choose the important sentences, (2) how to provide their translations (with as little effort as possible), and (3) how to use the newly collected information in training the SMT model. As a result, we have a fully automatic and general method to improve the phrase-based SMT models for the situation where the amount of bilingual training data is small. The success of self-training in SMT and many other NLP problems raises the question of why self-training works. I investigate this question by giving a theoretical analysis of self-training for decision lists. I provide objective functions which are motivated by information theory for the resulting semi-supervised learning algorithms. These objective functions provide us with: (1) insights about why and when we should expect self-training to work well, and (2) proofs of the convergence of their corresponding algorithms.
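The three-step loop described in this excerpt (select sentences, obtain their translations, re-train) can be sketched as follows. This is only an illustration under strong assumptions, not the thesis's method: the "model" is a toy word-for-word dictionary rather than a phrase-based SMT system, and the names `train`, `unknown_words`, `model_translate`, and `bootstrap` are hypothetical. Passing a human oracle as `translate` gives the active learning variant; passing the model's own output gives self-training.

```python
from typing import Callable, Dict, List, Tuple

Pair = Tuple[str, str]
Model = Dict[str, str]


def train(pairs: List[Pair]) -> Model:
    """Toy stand-in for SMT training (an assumption, not a real SMT model):
    a word-for-word dictionary built from position-aligned sentence pairs."""
    table: Model = {}
    for src, tgt in pairs:
        for s, t in zip(src.split(), tgt.split()):
            table.setdefault(s, t)
    return table


def unknown_words(model: Model, sentence: str) -> int:
    """Importance score: number of words the current model cannot translate."""
    return sum(w not in model for w in sentence.split())


def model_translate(model: Model, sentence: str) -> str:
    """Translate with the current model, copying unknown words unchanged."""
    return " ".join(model.get(w, w) for w in sentence.split())


def bootstrap(
    bitext: List[Pair],
    monolingual: List[str],
    translate: Callable[[Model, str], str],
    rounds: int = 2,
    batch: int = 1,
) -> Tuple[Model, List[Pair]]:
    """Iteratively grow the bilingual corpus and re-train the model."""
    bitext = list(bitext)
    pool = list(monolingual)
    model = train(bitext)
    for _ in range(rounds):
        if not pool:
            break
        # (1) Choose the sentences the current model handles worst.
        pool.sort(key=lambda s: unknown_words(model, s), reverse=True)
        selected, pool = pool[:batch], pool[batch:]
        # (2) Obtain translations: human oracle or the model itself.
        bitext.extend((s, translate(model, s)) for s in selected)
        # (3) Re-train on the enlarged corpus.
        model = train(bitext)
    return model, bitext


# Active learning: a lookup table plays the role of the human annotator.
oracle = {
    "el perro come": "the dog eats",
    "un gato": "a cat",
    "el gato come": "the cat eats",
}
model, bitext = bootstrap(
    [("el gato", "the cat")], list(oracle), lambda m, s: oracle[s]
)
print(model_translate(model, "un perro come"))
```

In this toy setting self-training (`translate=model_translate`) cannot learn new words, because the dictionary only copies what it does not know; with a real SMT system the re-trained model can still shift its phrase statistics, which is the effect the thesis studies.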

Computers

Machine Learning in Translation Corpora Processing

Book Details:

Author : Krzysztof Wolk
Publisher : CRC Press
Release : 2019-02-25
ISBN : 0429588836
Pages : 205 pages

Download or read book Machine Learning in Translation Corpora Processing written by Krzysztof Wolk and published by CRC Press. This book was released on 2019-02-25 with total page 205 pages. Available in PDF, EPUB and Kindle. Book excerpt: This book reviews ways to improve statistical machine speech translation between Polish and English. Research has been conducted mostly on dictionary-based, rule-based, and syntax-based machine translation techniques. Most popular methodologies and tools are not well-suited for the Polish language and therefore require adaptation, and language resources are lacking in parallel and monolingual data. The main objective of this volume is to develop an automatic and robust Polish-to-English translation system to meet specific translation requirements and to develop bilingual textual resources by mining comparable corpora.

Computers

Machine Translation and Transliteration involving Related, Low-resource Languages

Book Details:

Author : Anoop Kunchukuttan
Publisher : CRC Press
Release : 2021-09-08
ISBN : 1000422410
Pages : 215 pages

Download or read book Machine Translation and Transliteration involving Related, Low-resource Languages written by Anoop Kunchukuttan and published by CRC Press. This book was released on 2021-09-08 with total page 215 pages. Available in PDF, EPUB and Kindle. Book excerpt: Machine Translation and Transliteration involving Related, Low-resource Languages discusses an important aspect of natural language processing that has received less attention: translation and transliteration involving related languages in a low-resource setting. This is a very relevant real-world scenario for people living in neighbouring states/provinces/countries who speak similar languages and need to communicate with each other, but training data to build supporting MT systems is limited. The book discusses different characteristics of related languages with rich examples and draws connections between two problems: translation for related languages and transliteration. It shows how linguistic similarities can be utilized to learn MT systems for related languages with limited data. It comprehensively discusses the use of subword-level models and multilinguality to utilize these linguistic similarities. The second part of the book explores methods for machine transliteration involving related languages based on multilingual and unsupervised approaches. Through extensive experiments over a wide variety of languages, the efficacy of these methods is established. Features: novel methods for machine translation and transliteration between related languages, supported with experiments on a wide variety of languages; an overview of past literature on machine translation for related languages; a case study about machine translation for related languages between 10 major languages from India, which is one of the most linguistically diverse countries in the world. The book presents important concepts and methods for machine translation involving related languages.
In general, it serves as a good reference to NLP for related languages. It is intended for students, researchers and professionals interested in Machine Translation, Translation Studies, Multilingual Computing Machine and Natural Language Processing. It can be used as reference reading for courses in NLP and machine translation. Anoop Kunchukuttan is a Senior Applied Researcher at Microsoft India. His research spans various areas on multilingual and low-resource NLP. Pushpak Bhattacharyya is a Professor at the Department of Computer Science, IIT Bombay. His research areas are Natural Language Processing, Machine Learning and AI (NLP-ML-AI). Prof. Bhattacharyya has published more than 350 research papers in various areas of NLP.

Cross-language information retrieval

Learning Transfer Rules for Machine Translation with Limited Data

Book Details:

Author : Katharina Probst
Publisher :
Release : 2005
ISBN :
Pages : 297 pages

Download or read book Learning Transfer Rules for Machine Translation with Limited Data written by Katharina Probst. This book was released in 2005 with a total of 297 pages. Available in PDF, EPUB and Kindle. Book excerpt: Abstract: "The transfer-based approach to machine translation (MT) captures structural transfers between the source language and the target language, with the goal of producing grammatical translations. The major drawback of the approach is the development bottleneck, requiring many human-years of rule development. On the other hand, data-driven approaches such as example-based and statistical MT achieve fast system development by deriving mostly non-structural translation information from bilingual corpora. This thesis aims at striking a balance between both approaches by inferring transfer rules automatically from bilingual text, aiming specifically at scenarios where bilingual data is in sparse supply. The rules are learned using a variety of information, such as parses that are available for one of the languages, and morphological information that is available for both languages. They are learned in three stages, first producing an initial hypothesis, then capturing the syntactic structure, and finally adding appropriate unification constraints. The learned rules are used in a run-time translation system, a statistical transfer system which is a combination of a transfer engine and a statistical decoder. We demonstrate the effectiveness of the learned rules on Hebrew -> English and Hindi -> English translation tasks. The main contribution of this thesis is a new framework for inferring structural information with feature constraints from bilingual text, as well as an investigation of the taxonomy of learnable rules and their effectiveness. The framework is designed to be applicable for any language pair, and the inferred rules can be used in conjunction with a statistical decoder.
In addition to presenting methods to integrate syntactic and statistical information, the thesis makes a case for inferring information from very small training corpora, and provides methods to do so."

Computers

Statistical Machine Translation

Book Details:

Author : Philipp Koehn
Publisher : Cambridge University Press
Release : 2010
ISBN : 0521874157
Pages : 447 pages

Download or read book Statistical Machine Translation written by Philipp Koehn and published by Cambridge University Press. This book was released on 2010 with total page 447 pages. Available in PDF, EPUB and Kindle. Book excerpt: The dream of automatic language translation is now closer thanks to recent advances in the techniques that underpin statistical machine translation. This class-tested textbook, from an active researcher in the field, provides a clear and careful introduction to the latest methods and explains how to build machine translation systems for any two languages. It introduces the subject's building blocks from linguistics and probability, then covers the major models for machine translation: word-based, phrase-based, and tree-based, as well as machine translation evaluation, language modeling, discriminative training and advanced methods to integrate linguistic annotation. The book also reports the latest research, presents the major outstanding challenges, and enables novices as well as experienced researchers to make novel contributions to this exciting area. Ideal for students at undergraduate and graduate level, or for anyone interested in the latest developments in machine translation.

Computers

Neural Machine Translation

Book Details:

Author : Philipp Koehn
Publisher : Cambridge University Press
Release : 2020-06-18
ISBN : 1108601766
Pages : 410 pages

Download or read book Neural Machine Translation written by Philipp Koehn and published by Cambridge University Press. This book was released on 2020-06-18 with total page 410 pages. Available in PDF, EPUB and Kindle. Book excerpt: Deep learning is revolutionizing how machine translation systems are built today. This book introduces the challenge of machine translation and evaluation, including historical, linguistic, and applied context, then develops the core deep learning methods used for natural language applications. Code examples in Python give readers a hands-on blueprint for understanding and implementing their own machine translation systems. The book also provides extensive coverage of machine learning tricks, issues involved in handling various forms of data, model enhancements, and current challenges and methods for analysis and visualization. Summaries of the current research in the field make this a state-of-the-art textbook for undergraduate and graduate classes, as well as an essential reference for researchers and developers interested in other applications of neural methods in the broader field of human language processing.

Computers

Learning Machine Translation

Book Details:

Author : Cyril Goutte
Publisher : MIT Press
Release : 2009
ISBN : 0262072971
Pages : 329 pages

Download or read book Learning Machine Translation written by Cyril Goutte and published by MIT Press. This book was released on 2009 with total page 329 pages. Available in PDF, EPUB and Kindle. Book excerpt: How Machine Learning can improve machine translation: enabling technologies and new statistical techniques.

Paraphrases for Statistical Machine Translation

Book Details:

Author : Ramtin Mehdizadeh Seraj
Publisher :
Release : 2015
ISBN :
Pages : 47 pages

Download or read book Paraphrases for Statistical Machine Translation written by Ramtin Mehdizadeh Seraj. This book was released in 2015 with a total of 47 pages. Available in PDF, EPUB and Kindle. Book excerpt: Statistical Machine Translation (SMT) is the task of automatic translation between two natural languages (source language and target language) by using bilingual corpora. To accomplish this goal, machine learning models try to capture human translation patterns inside a bilingual corpus. An open challenge for SMT is finding translations for phrases which are missing in the training data (out-of-vocabulary phrases). We propose to use paraphrases to provide translations for out-of-vocabulary (OOV) phrases. We compare two major approaches to automatically extract paraphrases from corpora: distributional profile (DP) and bilingual pivoting. The multilingual Paraphrase Database (PPDB) is a freely available, automatically created (using bilingual pivoting) resource of paraphrases in multiple languages. We show that a graph propagation approach that uses PPDB paraphrases can be used to improve overall translation quality. We provide an extensive comparison with previous work and show that our PPDB-based method improves the BLEU score by up to 1.79 percentage points. We show that our approach improves on the state of the art in three different settings: when faced with a limited amount of parallel training data; a domain shift between training and test data; and handling a morphologically complex source language.

Language Arts & Disciplines

Linguistically Motivated Statistical Machine Translation

Book Details:

Author : Deyi Xiong
Publisher : Springer
Release : 2015-02-11
ISBN : 9812873562
Pages : 159 pages

Download or read book Linguistically Motivated Statistical Machine Translation written by Deyi Xiong and published by Springer. This book was released on 2015-02-11 with total page 159 pages. Available in PDF, EPUB and Kindle. Book excerpt: This book provides a wide variety of algorithms and models to integrate linguistic knowledge into Statistical Machine Translation (SMT). It helps advance conventional SMT to linguistically motivated SMT by enhancing the following three essential components: translation, reordering and bracketing models. It also serves the purpose of promoting the in-depth study of the impacts of linguistic knowledge on machine translation. Finally it provides a systematic introduction of Bracketing Transduction Grammar (BTG) based SMT, one of the state-of-the-art SMT formalisms, as well as a case study of linguistically motivated SMT on a BTG-based platform.

Exploration and Exploitation of Multilingual Data for Statistical Machine Translation

Book Details:

Author :
Publisher :
Release : 2012
ISBN : 9789461821973
Pages : 179 pages

Download or read book Exploration and Exploitation of Multilingual Data for Statistical Machine Translation. This book was released in 2012 with a total of 179 pages. Available in PDF, EPUB and Kindle. Book excerpt: "Shortly after the birth of computer science, researchers realised the importance of machine translation as a task worthy of concentrated effort, but it is only recently that algorithms are able to provide automatic translations usable by the masses. Modern translation systems are dependent on bilingual corpora, a modern Rosetta Stone, from which they learn cross-lingual relationships that can be used to translate sentences which are not in the training corpus. This data is crucial. If it is insufficient, or out-of-domain, then translation quality degrades. To improve quality, we need to both perfect methods that extract usable translations from additional multilingual resources, and improve the constituent models of a translation system to better exploit existing multilingual data sets. In this thesis, we focus on these dual problems. Our approach is two-fold, and the thesis is structured accordingly. In part I we study the problem of extracting translations from the web, with a focus on exploiting the growing predominance of microblog platforms. We present novel methods for the language identification of microblog posts, and conduct a thorough analysis of existing methods that explore these microblog posts for new translations. In part II we study the orthogonal problem of improving language models for the tasks of reranking and source side morphological analysis. We begin by analysing a plethora of syntactic features for reranking n-best lists output from an automatic translation system. We then present a novel algorithm that allows for exact inference from high-order hidden Markov models, which we use to segment source text input.
In this way, the thesis gives insight into the retrieval of relevant training data, and introduces novel methods that better utilise existing multilingual corpora."--Omslag.

Using Linguistic Knowledge in Statistical Machine Translation

Book Details:

Author : Rabih Mohamed Zbib
Publisher :
Release : 2010
ISBN :
Pages : 162 pages

Download or read book Using Linguistic Knowledge in Statistical Machine Translation written by Rabih Mohamed Zbib. This book was released in 2010 with a total of 162 pages. Available in PDF, EPUB and Kindle. Book excerpt: In this thesis, we present methods for using linguistically motivated information to enhance the performance of statistical machine translation (SMT). One of the advantages of the statistical approach to machine translation is that it is largely language-agnostic. Machine learning models are used to automatically learn translation patterns from data. SMT can, however, be improved by using linguistic knowledge to address specific areas of the translation process, where translations would be hard to learn fully automatically. We present methods that use linguistic knowledge at various levels to improve statistical machine translation, focusing on Arabic-English translation as a case study. In the first part, morphological information is used to preprocess the Arabic text for Arabic-to-English and English-to-Arabic translation, which reduces the gap in the complexity of the morphology between Arabic and English. The second method addresses the issue of long-distance reordering in translation to account for the difference in the syntax of the two languages. In the third part, we show how additional local context information on the source side is incorporated, which helps reduce lexical ambiguity. Two methods are proposed for using binary decision trees to control the amount of context information introduced. These methods are successfully applied to the use of diacritized Arabic source in Arabic-to-English translation. The final method combines the outputs of an SMT system and a Rule-based MT (RBMT) system, taking advantage of the flexibility of the statistical approach and the rich linguistic knowledge embedded in the rule-based MT system.

Computers

Progress in Machine Translation

Book Details:

Author : Sergei Nirenburg
Publisher : IOS Press
Release : 1993
ISBN : 9789051990744
Pages : 338 pages

Download or read book Progress in Machine Translation written by Sergei Nirenburg and published by IOS Press. This book was released on 1993 with total page 338 pages. Available in PDF, EPUB and Kindle.

Computers

Joint Training for Neural Machine Translation

Book Details:

Author : Yong Cheng
Publisher : Springer Nature
Release : 2019-08-26
ISBN : 9813297484
Pages : 78 pages

Download or read book Joint Training for Neural Machine Translation written by Yong Cheng and published by Springer Nature. This book was released on 2019-08-26 with total page 78 pages. Available in PDF, EPUB and Kindle. Book excerpt: This book presents four approaches to jointly training bidirectional neural machine translation (NMT) models. First, in order to improve the accuracy of the attention mechanism, it proposes an agreement-based joint training approach to help the two complementary models agree on word alignment matrices for the same training data. Second, it presents a semi-supervised approach that uses an autoencoder to reconstruct monolingual corpora, so as to incorporate these corpora into neural machine translation. It then introduces a joint training algorithm for pivot-based neural machine translation, which can be used to mitigate the data scarcity problem. Lastly it describes an end-to-end bidirectional NMT model to connect the source-to-target and target-to-source translation models, allowing the interaction of parameters between these two directional models.

Machine learning

A Machine Learning Approach to Word Alignment in Statistical Machine Translation

Book Details:

Author : Michael Camilleri (M.Sc.)
Publisher :
Release : 2009
ISBN :
Pages : 100 pages

Download or read book A Machine Learning Approach to Word Alignment in Statistical Machine Translation written by Michael Camilleri (M.Sc.). This book was released in 2009 with a total of 100 pages. Available in PDF, EPUB and Kindle.

Computers

Quality Estimation for Machine Translation

Book Details:

Author : Lucia Specia
Publisher : Springer Nature
Release : 2022-05-31
ISBN : 3031021681
Pages : 148 pages

Download or read book Quality Estimation for Machine Translation written by Lucia Specia and published by Springer Nature. This book was released on 2022-05-31 with total page 148 pages. Available in PDF, EPUB and Kindle. Book excerpt: Many applications within natural language processing involve performing text-to-text transformations, i.e., given a text in natural language as input, systems are required to produce a version of this text (e.g., a translation), also in natural language, as output. Automatically evaluating the output of such systems is an important component in developing text-to-text applications. Two approaches have been proposed for this problem: (i) to compare the system outputs against one or more reference outputs using string matching-based evaluation metrics and (ii) to build models based on human feedback to predict the quality of system outputs without reference texts. Despite their popularity, reference-based evaluation metrics are faced with the challenge that multiple good (and bad) quality outputs can be produced by text-to-text approaches for the same input. This variation is very hard to capture, even with multiple reference texts. In addition, reference-based metrics cannot be used in production (e.g., online machine translation systems), when systems are expected to produce outputs for any unseen input. In this book, we focus on the second set of metrics, so-called Quality Estimation (QE) metrics, where the goal is to provide an estimate on how good or reliable the texts produced by an application are without access to gold-standard outputs. QE enables different types of evaluation that can target different types of users and applications. Machine learning techniques are used to build QE models with various types of quality labels and explicit features or learnt representations, which can then predict the quality of unseen system outputs. 
This book describes the topic of QE for text-to-text applications, covering quality labels, features, algorithms, evaluation, uses, and state-of-the-art approaches. It focuses on machine translation as application, since this represents most of the QE work done to date. It also briefly describes QE for several other applications, including text simplification, text summarization, grammatical error correction, and natural language generation.

Computers

Introduction to Google Translate

Book Details:

Author : Gilad James, PhD
Publisher : Gilad James Mystery School
Release :
ISBN : 7819501847
Pages : 90 pages

Download or read book Introduction to Google Translate written by Gilad James, PhD and published by Gilad James Mystery School. This book has a total of 90 pages. Available in PDF, EPUB and Kindle. Book excerpt: Google Translate is a multilingual translation service provided by Google. It allows users to translate words, phrases, and entire documents between multiple languages. The service was launched in April 2006 and has since been constantly updated to provide more accurate translations. Initially offering translations in only two languages, Google Translate now supports over 100 languages. The translation process works by analyzing the text or document input by the user, breaking it up into smaller segments, and then using statistical algorithms to match these segments with translations from its database. Google Translate has been a helpful tool for people to communicate across different languages, whether it be for business or personal use. However, it must be noted that automated translations often carry a high risk of inaccuracies due to the complexities inherent in language and the nuances of different cultures and contexts. It is always recommended to use translations as a starting point, and then have a native speaker review and refine the language to ensure accuracy.

Machine translating

Leveraging Diverse Sources in Statistical Machine Translation

Book Details:

Author : Majid Razmara
Publisher :
Release : 2013
ISBN :
Pages : 117 pages

Download or read book Leveraging Diverse Sources in Statistical Machine Translation written by Majid Razmara. This book was released in 2013 with a total of 117 pages. Available in PDF, EPUB and Kindle. Book excerpt: Statistical machine translation (SMT) is often faced with the problem of having insufficient training data for many language pairs. We propose several approaches to leveraging other available sources in SMT systems to enhance the quality of translation. Particularly, we propose approaches suitable in these four scenarios: 1. when an additional parallel corpus is available; 2. when parallel corpora between the source language and a third language and between that language and the target language are available; 3. when an abundant source-language monolingual corpus is available; 4. when no additional resource is available. At the heart of these solutions lie two novel approaches: ensemble decoding and a graph propagation approach for paraphrasing out-of-vocabulary words. Ensemble decoding combines a number of translation systems dynamically at the decoding step. Our experimental results show that ensemble decoding outperforms various strong baselines including mixture models, the current state-of-the-art for domain adaptation in machine translation. We extend ensemble decoding to do triangulation on-the-fly when there exist parallel corpora between the source language and one or multiple pivot languages and between those and the target language. These triangulated systems are dynamically combined together and possibly with a direct source-target system. Experiments on 12 different language pairs show significant improvements over the baselines in terms of BLEU scores. Ensemble decoding can also be used to apply stacking to statistical machine translation. Stacking is an ensemble learning approach that enhances the bias of the models.
We show that stacking can consistently and significantly improve over the conventional SMT systems in two different language pairs and three different training sizes. In addition to ensemble decoding, we propose a novel approach to mining translations for OOV words using a monolingual corpus on the source-side language. We induce a lexicon by constructing a graph on the source language phrases and employ a graph propagation technique in order to find translations for those phrases. Experimental results in two different settings show that our graph propagation method significantly improves performance over two strong baselines under intrinsic and extrinsic evaluation metrics.
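The triangulation idea mentioned in this excerpt can be illustrated with the common phrase-table formulation, which marginalizes over pivot phrases: p(t|s) ≈ Σ_p p(p|s) · p(t|p). The sketch below shows only that standard offline computation and is not the thesis's on-the-fly ensemble decoding; the nested-dictionary table layout, the function name `triangulate`, and the example probabilities are all hypothetical.

```python
from collections import defaultdict
from typing import Dict

# Hypothetical table layout: table[source_phrase][target_phrase] = probability
PhraseTable = Dict[str, Dict[str, float]]


def triangulate(src_to_pivot: PhraseTable, pivot_to_tgt: PhraseTable) -> PhraseTable:
    """Build a source-to-target table by summing over shared pivot phrases."""
    src_to_tgt: PhraseTable = {}
    for src, pivots in src_to_pivot.items():
        scores: Dict[str, float] = defaultdict(float)
        for pivot, p_pivot_given_src in pivots.items():
            # Pivot phrases missing from the second table contribute nothing.
            for tgt, p_tgt_given_pivot in pivot_to_tgt.get(pivot, {}).items():
                scores[tgt] += p_pivot_given_src * p_tgt_given_pivot
        if scores:
            src_to_tgt[src] = dict(scores)
    return src_to_tgt


# Made-up French -> English -> German example.
fr_en = {"chat": {"cat": 0.9, "chat": 0.1}}
en_de = {"cat": {"Katze": 0.8, "Kater": 0.2}, "chat": {"Chat": 1.0}}
print(triangulate(fr_en, en_de))
```

Because the sum runs only over pivot phrases present in both tables, the resulting probabilities may not sum to one when coverage is incomplete; renormalizing is a design choice left out of this sketch.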