EBookClubs

Read Books & Download eBooks Full Online

Book Exploration and Exploitation of Multilingual Data for Statistical Machine Translation

Download or read book Exploration and Exploitation of Multilingual Data for Statistical Machine Translation written by and published by . This book was released on 2012 with total page 179 pages. Available in PDF, EPUB and Kindle. Book excerpt: "Shortly after the birth of computer science, researchers realised the importance of machine translation as a task worthy of concentrated effort, but it is only recently that algorithms are able to provide automatic translations usable by the masses. Modern translation systems are dependent on bilingual corpora, a modern Rosetta Stone, from which they learn cross-lingual relationships that can be used to translate sentences which are not in the training corpus. This data is crucial. If it is insufficient, or out-of-domain, then translation quality degrades. To improve quality, we need to both perfect methods that extract usable translations from additional multilingual resources, and improve the constituent models of a translation system to better exploit existing multilingual data sets. In this thesis, we focus on these dual problems. Our approach is two-fold, and the thesis is structured accordingly. In part I we study the problem of extracting translations from the web, with a focus on exploiting the growing predominance of microblog platforms. We present novel methods for the language identification of microblog posts, and conduct a thorough analysis of existing methods that explore these microblog posts for new translations. In part II we study the orthogonal problem of improving language models for the tasks of reranking and source side morphological analysis. We begin by analysing a plethora of syntactic features for reranking n-best lists output from an automatic translation system. We then present a novel algorithm that allows for exact inference from high-order hidden Markov models, which we use to segment source text input.
In this way, the thesis gives insight into the retrieval of relevant training data, and introduces novel methods that better utilise existing multilingual corpora."--Cover.

Book Linguistically Motivated Statistical Machine Translation

Download or read book Linguistically Motivated Statistical Machine Translation written by Deyi Xiong and published by Springer. This book was released on 2015-02-11 with total page 159 pages. Available in PDF, EPUB and Kindle. Book excerpt: This book provides a wide variety of algorithms and models to integrate linguistic knowledge into Statistical Machine Translation (SMT). It helps advance conventional SMT to linguistically motivated SMT by enhancing the following three essential components: translation, reordering and bracketing models. It also serves the purpose of promoting the in-depth study of the impacts of linguistic knowledge on machine translation. Finally it provides a systematic introduction of Bracketing Transduction Grammar (BTG) based SMT, one of the state-of-the-art SMT formalisms, as well as a case study of linguistically motivated SMT on a BTG-based platform.

Book Syntax based Statistical Machine Translation

Download or read book Syntax based Statistical Machine Translation written by Philip Williams and published by Morgan & Claypool Publishers. This book was released on 2016-08-01 with total page 211 pages. Available in PDF, EPUB and Kindle. Book excerpt: This unique book provides a comprehensive introduction to the most popular syntax-based statistical machine translation models, filling a gap in the current literature for researchers and developers in human language technologies. While phrase-based models have previously dominated the field, syntax-based approaches have proved a popular alternative, as they elegantly solve many of the shortcomings of phrase-based models. The heart of this book is a detailed introduction to decoding for syntax-based models. The book begins with an overview of synchronous context-free grammar (SCFG) and synchronous tree-substitution grammar (STSG) along with their associated statistical models. It also describes how three popular instantiations (Hiero, SAMT, and GHKM) are learned from parallel corpora. It introduces and details hypergraphs and associated general algorithms, as well as algorithms for decoding with both tree and string input. Special attention is given to efficiency, including search approximations such as beam search and cube pruning, data structures, and parsing algorithms. The book consistently highlights the strengths (and limitations) of syntax-based approaches, including their ability to generalize phrase-based translation units, their modeling of specific linguistic phenomena, and their function of structuring the search space.

Book Statistical Machine Translation

Download or read book Statistical Machine Translation written by Philipp Koehn and published by Cambridge University Press. This book was released on 2010 with total page 447 pages. Available in PDF, EPUB and Kindle. Book excerpt: The dream of automatic language translation is now closer thanks to recent advances in the techniques that underpin statistical machine translation. This class-tested textbook from an active researcher in the field provides a clear and careful introduction to the latest methods and explains how to build machine translation systems for any two languages. It introduces the subject's building blocks from linguistics and probability, then covers the major models for machine translation: word-based, phrase-based, and tree-based, as well as machine translation evaluation, language modeling, discriminative training and advanced methods to integrate linguistic annotation. The book also reports the latest research, presents the major outstanding challenges, and enables novices as well as experienced researchers to make novel contributions to this exciting area. Ideal for students at undergraduate and graduate level, or for anyone interested in the latest developments in machine translation.

Book Machine Translation and Transliteration involving Related Low resource Languages

Download or read book Machine Translation and Transliteration involving Related Low resource Languages written by Anoop Kunchukuttan and published by CRC Press. This book was released on 2021-08-12 with total page 220 pages. Available in PDF, EPUB and Kindle. Book excerpt: Machine Translation and Transliteration involving Related, Low-resource Languages discusses an important aspect of natural language processing that has received less attention: translation and transliteration involving related languages in a low-resource setting. This is a very relevant real-world scenario for people living in neighbouring states/provinces/countries who speak similar languages and need to communicate with each other, but training data to build supporting MT systems is limited. The book discusses different characteristics of related languages with rich examples and draws connections between two problems: translation for related languages and transliteration. It shows how linguistic similarities can be utilized to learn MT systems for related languages with limited data. It comprehensively discusses the use of subword-level models and multilinguality to utilize these linguistic similarities. The second part of the book explores methods for machine transliteration involving related languages based on multilingual and unsupervised approaches. Through extensive experiments over a wide variety of languages, the efficacy of these methods is established. Features: Novel methods for machine translation and transliteration between related languages, supported with experiments on a wide variety of languages. An overview of past literature on machine translation for related languages. A case study about machine translation for related languages between 10 major languages from India, which is one of the most linguistically diverse countries in the world. The book presents important concepts and methods for machine translation involving related languages.
In general, it serves as a good reference to NLP for related languages. It is intended for students, researchers and professionals interested in Machine Translation, Translation Studies, Multilingual Computing Machine and Natural Language Processing. It can be used as reference reading for courses in NLP and machine translation. Anoop Kunchukuttan is a Senior Applied Researcher at Microsoft India. His research spans various areas on multilingual and low-resource NLP. Pushpak Bhattacharyya is a Professor at the Department of Computer Science, IIT Bombay. His research areas are Natural Language Processing, Machine Learning and AI (NLP-ML-AI). Prof. Bhattacharyya has published more than 350 research papers in various areas of NLP.

Book Machine Translation and Transliteration Involving Related and Low resource Languages

Download or read book Machine Translation and Transliteration Involving Related and Low resource Languages written by Anoop Kunchukuttan and published by Chapman & Hall/CRC. This book was released on 2021-08-12 with total page 0 pages. Available in PDF, EPUB and Kindle. Book excerpt: Machine Translation and Transliteration involving Related, Low-resource Languages discusses an important aspect of natural language processing that has received less attention: translation and transliteration involving related languages in a low-resource setting. This is a very relevant real-world scenario for people living in neighbouring states/provinces/countries who speak similar languages and need to communicate with each other, but training data to build supporting MT systems is limited. The book discusses different characteristics of related languages with rich examples and draws connections between two problems: translation for related languages and transliteration. It shows how linguistic similarities can be utilized to learn MT systems for related languages with limited data. It comprehensively discusses the use of subword-level models and multilinguality to utilize these linguistic similarities. The second part of the book explores methods for machine transliteration involving related languages based on multilingual and unsupervised approaches. Through extensive experiments over a wide variety of languages, the efficacy of these methods is established. Features: Novel methods for machine translation and transliteration between related languages, supported with experiments on a wide variety of languages. An overview of past literature on machine translation for related languages. A case study about machine translation for related languages between 10 major languages from India, which is one of the most linguistically diverse countries in the world. The book presents important concepts and methods for machine translation involving related languages.
In general, it serves as a good reference to NLP for related languages. It is intended for students, researchers and professionals interested in Machine Translation, Translation Studies, Multilingual Computing Machine and Natural Language Processing. It can be used as reference reading for courses in NLP and machine translation. Anoop Kunchukuttan is a Senior Applied Researcher at Microsoft India. His research spans various areas on multilingual and low-resource NLP. Pushpak Bhattacharyya is a Professor at the Department of Computer Science, IIT Bombay. His research areas are Natural Language Processing, Machine Learning and AI (NLP-ML-AI). Prof. Bhattacharyya has published more than 350 research papers in various areas of NLP.

Book Machine Learning in Translation Corpora Processing

Download or read book Machine Learning in Translation Corpora Processing written by Krzysztof Wolk and published by CRC Press. This book was released on 2019-02-25 with total page 205 pages. Available in PDF, EPUB and Kindle. Book excerpt: This book reviews ways to improve statistical machine speech translation between Polish and English. Research has been conducted mostly on dictionary-based, rule-based, and syntax-based machine translation techniques. Most popular methodologies and tools are not well-suited for the Polish language and therefore require adaptation, and language resources are lacking in parallel and monolingual data. The main objective of this volume is to develop an automatic and robust Polish-to-English translation system to meet specific translation requirements and to develop bilingual textual resources by mining comparable corpora.

Book Machine Learning Approaches for Dealing with Limited Bilingual Training Data in Statistical Machine Translation

Download or read book Machine Learning Approaches for Dealing with Limited Bilingual Training Data in Statistical Machine Translation written by Gholamreza Haffari and published by . This book was released on 2009 with total page 0 pages. Available in PDF, EPUB and Kindle. Book excerpt: Statistical Machine Translation (SMT) models learn how to translate by examining a bilingual parallel corpus containing sentences aligned with their human-produced translations. However, high quality translation output is dependent on the availability of massive amounts of parallel text in the source and target languages. There are a large number of languages that are considered low-density, either because the population speaking the language is not very large, or because, even if millions of people speak the language, insufficient online resources are available in that language. This thesis covers machine learning approaches for dealing with such situations in statistical machine translation where the amount of available bilingual data is limited. The problem of learning from insufficient labeled training data has been dealt with in the machine learning community under two general frameworks: (i) Semi-supervised Learning, and (ii) Active Learning. The complex nature of the machine translation task poses severe challenges to most of the algorithms developed in the machine learning community for these two learning scenarios. In this thesis, I develop semi-supervised learning as well as active learning algorithms to deal with the shortage of bilingual training data for the Statistical Machine Translation task. This dissertation provides two approaches, unified in what is called the bootstrapping framework, to this problem. I assume that we are given access to a monolingual corpus containing a large number of sentences in the source language, in addition to a small or moderate sized bilingual corpus.
The idea is to take advantage of this readily available monolingual data in building a better SMT model in an iterative manner: by selecting an important subset of these monolingual sentences, preparing their translations, and using them together with the original sentence pairs to re-train the SMT model. When preparing the translation of the selected sentences, if we use a human annotator, then the framework fits into the active learning scenario in machine learning. If instead we use the translations generated by the SMT system, then we get the self-training framework, which fits into the semi-supervised learning scenario in machine learning. The key points that I address throughout this thesis are (1) how to choose the important sentences, (2) how to provide their translations (possibly with as little effort as possible), and (3) how to use the newly collected information in training the SMT model. As a result, we have a fully automatic and general method to improve the phrase-based SMT models for the situation where the amount of bilingual training data is small. The success of self-training in SMT and many other NLP problems raises the question why self-training works. I investigate this question by giving a theoretical analysis of the self-training for decision lists. I provide objective functions which are motivated by information theory for the resulting semi-supervised learning algorithms. These objective functions provide us with: (1) insights about why and when we should expect self-training to work well, and (2) proofs of the convergence of their corresponding algorithms.
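
The select, translate and re-train loop described in this excerpt can be sketched roughly as follows. This is a minimal illustration and not the thesis's implementation: the `bootstrap` function, its parameters, and the toy word-for-word "model" are all hypothetical stand-ins chosen so that the example runs on its own. Passing the model's own output as `translate` corresponds to self-training; passing a function that asks a human annotator would correspond to active learning.

```python
from typing import Callable, Dict, List, Tuple

SentencePair = Tuple[str, str]
Lexicon = Dict[str, str]  # toy stand-in for an SMT model (assumption)


def bootstrap(
    bitext: List[SentencePair],
    monolingual: List[str],
    train: Callable[[List[SentencePair]], Lexicon],
    select: Callable[[Lexicon, List[str], int], List[str]],
    translate: Callable[[Lexicon, str], str],
    rounds: int = 1,
    batch_size: int = 2,
) -> Tuple[Lexicon, List[SentencePair]]:
    """Iteratively select source sentences, obtain translations, and re-train."""
    pool = list(monolingual)
    data = list(bitext)
    model = train(data)
    for _ in range(rounds):
        if not pool:
            break
        for src in select(model, pool, batch_size):
            data.append((src, translate(model, src)))  # step (2): provide a translation
            pool.remove(src)
        model = train(data)  # step (3): re-train on the enlarged corpus
    return model, data


def toy_train(pairs: List[SentencePair]) -> Lexicon:
    """Toy 'training': align words by position (illustration only)."""
    lexicon: Lexicon = {}
    for src, tgt in pairs:
        for s, t in zip(src.split(), tgt.split()):
            lexicon.setdefault(s, t)
    return lexicon


def toy_translate(lexicon: Lexicon, src: str) -> str:
    """Word-for-word lookup; unknown words are copied through."""
    return " ".join(lexicon.get(w, w) for w in src.split())


def toy_select(lexicon: Lexicon, pool: List[str], k: int) -> List[str]:
    """Step (1): prefer sentences the current model covers best."""
    def coverage(sentence: str) -> float:
        words = sentence.split()
        return sum(w in lexicon for w in words) / max(len(words), 1)
    return sorted(pool, key=coverage, reverse=True)[:k]


bitext = [("das haus", "the house"), ("ein buch", "a book")]
monolingual = ["das buch", "ein haus", "der hund"]
model, data = bootstrap(bitext, monolingual, toy_train, toy_select, toy_translate)
print(data)
```

The selection criterion here (lexical coverage) is only one possible choice; an active learning setup would more plausibly pick the sentences the model is least sure about.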

Book Leveraging Diverse Sources in Statistical Machine Translation

Download or read book Leveraging Diverse Sources in Statistical Machine Translation written by Majid Razmara and published by . This book was released on 2013 with total page 0 pages. Available in PDF, EPUB and Kindle. Book excerpt: Statistical machine translation (SMT) is often faced with the problem of having insufficient training data for many language pairs. We propose several approaches to leveraging other available sources in SMT systems to enhance the quality of translation. Particularly, we propose approaches suitable in these four scenarios: 1. when an additional parallel corpus is available; 2. when parallel corpora between the source language and a third language and between that language and the target language are available; 3. when an abundant source-language monolingual corpus is available; 4. when no additional resource is available. At the heart of these solutions lie two novel approaches: ensemble decoding and a graph propagation approach for paraphrasing out-of-vocabulary words. Ensemble decoding combines a number of translation systems dynamically at the decoding step. Our experimental results show that ensemble decoding outperforms various strong baselines including mixture models, the current state-of-the-art for domain adaptation in machine translation. We extend ensemble decoding to do triangulation on-the-fly when there exist parallel corpora between the source language and one or multiple pivot languages and between those and the target language. These triangulated systems are dynamically combined together and possibly with a direct source-target system. Experiments in 12 different language pairs show significant improvements over the baselines in terms of BLEU scores. Ensemble decoding can also be used to apply stacking to statistical machine translation. Stacking is an ensemble learning approach that enhances the bias of the models.
We show that stacking can consistently and significantly improve over the conventional SMT systems in two different language pairs and three different training sizes. In addition to ensemble decoding, we propose a novel approach to mining translations for OOV words using a monolingual corpus on the source-side language. We induce a lexicon by constructing a graph on the source language phrases and employ a graph propagation technique in order to find translations for those phrases. Experimental results in two different settings show that our graph propagation method significantly improves performance over two strong baselines under intrinsic and extrinsic evaluation metrics.

Book Data Selection for Statistical Machine Translation

Download or read book Data Selection for Statistical Machine Translation written by Amittai Axelrod and published by . This book was released on 2014 with total page 124 pages. Available in PDF, EPUB and Kindle. Book excerpt: Machine translation, the computerized translation of one human language to another, could be used to communicate between the thousands of languages used around the world. Statistical machine translation (SMT) is an approach to building these translation engines without much human intervention, and large-scale implementations by Google, Microsoft, and Facebook in their products are used by millions daily. The quality of SMT systems depends on the example translations used to train the models. Data can come from a variety of sources, many of which are not optimal for common specific tasks. The goal is to be able to find the right data to use to train a model for a particular task. This work determines the most relevant subsets of these large datasets with respect to a translation task, enabling the construction of task-specific translation systems that are more accurate and easier to train than the large-scale models. Three methods are explored for identifying task-relevant translation training data from a general data pool. The first uses only a language model to score the training data according to lexical probabilities, improving on prior results by using a bilingual score that accounts for differences between the target domain and the general data. The second is a topic-based relevance score that is novel for SMT, using topic models to project texts into a latent semantic space. These semantic vectors are then used to compute similarity of sentences in the general pool to the target task. This work finds that what the automatic topic models capture for some tasks is actually the style of the language, rather than task-specific content words. This motivates the third approach, a novel style-based data selection method. 
Hybrid word and part-of-speech (POS) representations of the two corpora are constructed by retaining the discriminative words and using POS tags as a proxy for the stylistic content of the infrequent words. Language models based on these representations can be used to quantify the underlying stylistic relevance between two texts. Experiments show that style-based data selection can outperform the current state-of-the-art method for task-specific data selection, in terms of SMT system performance and vocabulary coverage. Taken together, the experimental results indicate that it is important to characterize corpus differences when selecting data for statistical machine translation.
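
To make the first, language-model-based method more concrete, here is a small sketch of one common way to score sentence pairs against a target task: the difference in cross-entropy between an in-domain language model and a general-pool language model, summed over the source and target sides. Whether this matches the thesis's exact scoring is an assumption, as the excerpt does not give the formula; the add-one-smoothed unigram model and all names (`UnigramLM`, `select_pairs`) are hypothetical simplifications.

```python
import math
from collections import Counter
from typing import Iterable, List, Tuple


class UnigramLM:
    """Add-one-smoothed unigram language model (toy stand-in for a real LM)."""

    def __init__(self, sentences: Iterable[str]):
        self.counts = Counter(w for s in sentences for w in s.split())
        self.total = sum(self.counts.values())
        self.vocab = len(self.counts) + 1  # one extra slot for unseen words

    def cross_entropy(self, sentence: str) -> float:
        """Per-word cross-entropy in bits; lower means a better fit."""
        words = sentence.split()
        if not words:
            return 0.0
        log_prob = sum(
            math.log2((self.counts[w] + 1) / (self.total + self.vocab))
            for w in words
        )
        return -log_prob / len(words)


def ce_difference(sentence: str, in_lm: UnigramLM, gen_lm: UnigramLM) -> float:
    """Negative values mean the sentence looks more in-domain than general."""
    return in_lm.cross_entropy(sentence) - gen_lm.cross_entropy(sentence)


def select_pairs(
    pool: List[Tuple[str, str]],
    in_src: UnigramLM, gen_src: UnigramLM,
    in_tgt: UnigramLM, gen_tgt: UnigramLM,
    k: int,
) -> List[Tuple[str, str]]:
    """Keep the k pairs with the lowest bilingual cross-entropy difference."""
    def score(pair: Tuple[str, str]) -> float:
        src, tgt = pair
        return (ce_difference(src, in_src, gen_src)
                + ce_difference(tgt, in_tgt, gen_tgt))
    return sorted(pool, key=score)[:k]


# Tiny made-up corpora, for illustration only.
in_domain = [("the patient has a fever", "der patient hat fieber"),
             ("the doctor treats the patient", "der arzt behandelt den patienten")]
pool = [("the patient needs a doctor", "der patient braucht einen arzt"),
        ("the match ended in a draw", "das spiel endete unentschieden"),
        ("parliament passed the budget", "das parlament verabschiedete den haushalt")]

in_src = UnigramLM(s for s, _ in in_domain)
in_tgt = UnigramLM(t for _, t in in_domain)
gen_src = UnigramLM(s for s, _ in pool)
gen_tgt = UnigramLM(t for _, t in pool)

selected = select_pairs(pool, in_src, gen_src, in_tgt, gen_tgt, k=1)
print(selected)
```

A realistic setup would use higher-order n-gram models with a shared vocabulary; the unigram version only shows the shape of the scoring.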

Book Paraphrases for Statistical Machine Translation

Download or read book Paraphrases for Statistical Machine Translation written by Ramtin Mehdizadeh Seraj and published by . This book was released on 2015 with total page 47 pages. Available in PDF, EPUB and Kindle. Book excerpt: Statistical Machine Translation (SMT) is the task of automatic translation between two natural languages (source language and target language) by using bilingual corpora. To accomplish this goal, machine learning models try to capture human translation patterns inside a bilingual corpus. An open challenge for SMT is finding translations for phrases which are missing in the training data (out-of-vocabulary phrases). We propose to use paraphrases to provide translations for out-of-vocabulary (OOV) phrases. We compare two major approaches to automatically extract paraphrases from corpora: distributional profile (DP) and bilingual pivoting. The multilingual Paraphrase Database (PPDB) is a freely available automatically created (using bilingual pivoting) resource of paraphrases in multiple languages. We show that a graph propagation approach that uses PPDB paraphrases can be used to improve overall translation quality. We provide an extensive comparison with previous work and show that our PPDB-based method improves the BLEU score by up to 1.79 percentage points. We show that our approach improves on the state of the art in three different settings: when faced with a limited amount of parallel training data; a domain shift between training and test data; and handling a morphologically complex source language.

Book Use of Source Language Context in Statistical Machine Translation

Download or read book Use of Source Language Context in Statistical Machine Translation written by Rejwanul Haque and published by LAP Lambert Academic Publishing. This book was released on 2012-02 with total page 228 pages. Available in PDF, EPUB and Kindle. Book excerpt: The translation features typically used in state-of-the-art statistical machine translation (SMT) model dependencies between the source and target phrases, but not among the phrases in the source language themselves. A swathe of research has demonstrated that integrating source context modelling directly into log-linear phrase-based SMT (PB-SMT) and hierarchical PB-SMT (HPB-SMT) can positively influence the weighting and selection of target phrases, and thus improve translation quality. In this book we present novel approaches to incorporate source-language contextual modelling into the state-of-the-art SMT models in order to enhance the quality of lexical selection. We investigate the effectiveness of a range of contextual features, including lexical features of neighbouring words, part-of-speech tags, supertags, sentence-similarity features, dependency information, and semantic roles. We explored a series of language pairs featuring typologically different languages, and examined the scalability of our research to larger amounts of training data.

Book Natural Language Processing and Chinese Computing

Download or read book Natural Language Processing and Chinese Computing written by Chengqing Zong and published by Springer. This book was released on 2014-11-26 with total page 491 pages. Available in PDF, EPUB and Kindle. Book excerpt: This book constitutes the refereed proceedings of the Third CCF Conference, NLPCC 2014, held in Shenzhen, China, in December 2014. The 35 revised full papers presented together with 8 short papers were carefully reviewed and selected from 110 English submissions. The papers are organized in topical sections on fundamentals on language computing; applications on language computing; machine translation and multi-lingual information access; machine learning for NLP; NLP for social media; NLP for search technology and ads; question answering and user interaction; web mining and information extraction.

Book Typologically Robust Statistical Machine Translation

Download or read book Typologically Robust Statistical Machine Translation written by Joachim Daiber and published by . This book was released on 2018 with total page 172 pages. Available in PDF, EPUB and Kindle. Book excerpt: "Machine translation systems often incorporate modeling assumptions motivated by properties of the language pairs they initially target. When such systems are applied to language families with considerably different properties, translation quality can deteriorate. Phrase-based machine translation systems, for instance, are ill-equipped to handle the challenges caused by relaxed word order constraints and productive word formation processes in morphologically rich languages. In this thesis, we ask what role the properties of languages, as studied in the field of linguistic typology, play in how well machine translation systems perform. We focus in particular on word order and morphology, and show that typological differences in these areas can be bridged by making certain linguistic phenomena overt to the translation system. Understanding and exploiting typological differences between languages enables improvements to the typological robustness of translation systems without significantly changing the assumptions of the underlying translation models. In the area of word order, we examine the influence of word order freedom on preordering, a popular technique to model word order in phrase-based machine translation, and propose a method to improve its typological robustness. For morphological complexity, we show that reducing the dissimilarity between the source and target language improves phrase-based machine translation for typologically diverse language pairs. 
Finally, we show that besides helping to bridge the performance gaps between typologically diverse languages, linguistic typology can also serve as a source of knowledge to guide reordering models and to facilitate universal reordering models applicable to multiple target languages."--Author's summary.

Book Using Linguistic Knowledge in Statistical Machine Translation

Download or read book Using Linguistic Knowledge in Statistical Machine Translation written by Rabih Mohamed Zbib and published by . This book was released on 2010 with total page 162 pages. Available in PDF, EPUB and Kindle. Book excerpt: In this thesis, we present methods for using linguistically motivated information to enhance the performance of statistical machine translation (SMT). One of the advantages of the statistical approach to machine translation is that it is largely language-agnostic. Machine learning models are used to automatically learn translation patterns from data. SMT can, however, be improved by using linguistic knowledge to address specific areas of the translation process, where translations would be hard to learn fully automatically. We present methods that use linguistic knowledge at various levels to improve statistical machine translation, focusing on Arabic-English translation as a case study. In the first part, morphological information is used to preprocess the Arabic text for Arabic-to-English and English-to-Arabic translation, which reduces the gap in the complexity of the morphology between Arabic and English. The second method addresses the issue of long-distance reordering in translation to account for the difference in the syntax of the two languages. In the third part, we show how additional local context information on the source side is incorporated, which helps reduce lexical ambiguity. Two methods are proposed for using binary decision trees to control the amount of context information introduced. These methods are successfully applied to the use of diacritized Arabic source in Arabic-to-English translation. The final method combines the outputs of an SMT system and a Rule-based MT (RBMT) system, taking advantage of the flexibility of the statistical approach and the rich linguistic knowledge embedded in the rule-based MT system.

Book Machine Translation Summit

Download or read book Machine Translation Summit written by Makoto Nagao and published by IOS Press. This book was released on 1989 with total page 248 pages. Available in PDF, EPUB and Kindle. Book excerpt:

Book Multilinguality in Knowledge Graphs

Download or read book Multilinguality in Knowledge Graphs written by L.-A. Kaffee and published by IOS Press. This book was released on 2023-11-14 with total page 218 pages. Available in PDF, EPUB and Kindle. Book excerpt: Content on the web is predominantly written in English, making it inaccessible to those who only speak other languages. Knowledge graphs can store multilingual information, facilitate the creation of multilingual applications, and make content accessible to multiple language communities. This book, Multilinguality in Knowledge Graphs, presents studies which assess and improve the state of labels and languages in knowledge graphs and the application of multilingual information. The author proposes ways of using multilingual knowledge graphs to reduce the gaps in coverage between languages, and the book explores the current state of language distribution in knowledge graphs by developing a framework based on existing standards, frameworks, and guidelines to measure label and language distribution in knowledge graphs. Applying this framework to a dataset representing the web of data, and to Wikidata, both a lack of labeling on the web and a bias towards a small set of languages were found. The book explores how a knowledge of labels and languages can be used in the domain of answering questions, and demonstrates how the framework can be applied to the task of ranking and selecting knowledge graphs for a set of user questions. Transliteration and translation of knowledge graph labels and aliases are also covered, as is the automatic classification of labels into one or the other to train a model for each task. The book provides a wide range of information on working with data and knowledge graphs in less-resourced languages.