Machine translation techniques

Journal issue: Vestnik KASU No. 2 - 2012
Author: M.V. Larina

Technology represents one major strategy that agencies use to address and manage their foreign language shortfalls. Computer translation - often referred to as machine translation - has been under development for decades. Machine translation is the automated translation of text by a computer from one written language to another without human oversight or intervention. It has distinct problems. One of its main difficulties is adapting to context; its main advantage is speed. Machine translation is cheaper and faster than using human translators, although it is also less accurate. To be translated effectively, the source text must be grammatically correct and cannot include colloquialisms. Even then, vendors claim only 60 to 70 percent accuracy.

One of the great accessibility problems of the Web affects people who do not speak its major language, English, the lingua franca. The demand for translation between English and other languages is driving a steady improvement in machine translation. A variety of third-party Web sites specialize in translation software, making it possible to read someone's site through that intermediate site.

Information technology has made remarkable advances in recent years. The private sector (without the same kinds of security concerns as the Intelligence Community) has led the adoption of technologies that are also critical to intelligence. The Community will never be able to hire enough linguists to meet its needs. It is difficult for the Community to predict which languages will be most in demand and to hire the necessary linguists in advance. And even an aggressive hiring and training effort would not produce an analytic workforce that can absorb the huge quantity of unclassified foreign language material available today.

Eventually, all analysts will have basic foreign-language processing tools easily available to them so that even those who are not language-qualified can pull pieces of interest and get a quick, rough translation. NSA has done pioneering work on machine translation and is pursuing a number of separate initiatives; the military services, CIA (including In-Q-Tel), and other agencies sponsor largely independent projects. There is an abundance of activity, but not a concerted, coherent effort, which has led to steady but slow development.

The general idea behind machine translation is that computers have the patience, stamina and speed to quickly parse through gigabytes of text, matching text terms with equivalent terms from an external vocabulary. Human translators often scoff at the output of machine translators, noting the high rate of comical errors. An often cited, perhaps apocryphal, example of poor machine translation is the English to Russian transformation of "out of sight, out of mind" to the Russian equivalent of "invisible idiot." Despite limitations, machine translation is the only way to transform gigabytes and terabytes of text. As long as people continue to type messages, reports, manuscripts and notes into electronic documents, they will need computers to parse and organize the resulting text.

Although many machine-translation programs are currently available, few evaluation methods of such translation exist for any given application area. It is difficult to evaluate machine-translation systems objectively because the quality of a translation depends on the combination of three factors: the translation program, the dictionary, and the original document.

During the 1950s, enthusiasts voiced extraordinary claims for the new machine translation technology. It had lofty goals, promising quick and cheap translation. DARPA funded a computer program to translate Soviet documents into English. The difficulties of machine translation became clear when the Russian term for "hydraulic ram" was translated as "water goat." A backlash of skepticism followed the disastrous failure of the machine translation effort in the 1950s.

One hallmark of the Air Force Foreign Technology Division (FTD) was (and continues to be for HQ NASIC) its machine translation (MT) capabilities. In 1955, the Rome Air Development Center at Griffiss AFB, New York, was tasked to develop an MT system for the center. The IBM Mark I Translating Device produced its first automated translation in 1959, and, in October 1963, FTD installed the Mark II, which provided word-for-word Russian language translations at the rate of about 5,000 words per hour.

The National Air and Space Intelligence Center (NASIC) has been developing, operating, and maintaining Systran [MT] systems since 1969. In July 1970, FTD upgraded to an IBM 360 Systran system. Translation speed increased 20-fold and the system analyzed the Russian text sentence-by-sentence to provide improved grammar and syntax. In October 1982, an optical character reader was added to the system to more fully automate text translation.

In September 1971, Air Force Rome Air Development Center developed an English-to-Vietnamese automated translator. Designed to operate on the IBM 360/67 computer, the translation system had an output rate of 80,000 to 100,000 words per hour. As part of the overall "Vietnamization Program," RADC produced in May an automated translation from English to Vietnamese of AF Manual 51-37, Instrument Flying. The translation was accomplished using the LOGOS I System for English-to-Vietnamese machine translation.

By the late 1970s, projects fell into three types: those relying on "brute force" methods involving larger and faster computers; those based on a linguistic tradition which asserts that the knowledge required for machine translation can be assimilated to the structure of a grammar-based system with a semantic component; and those stemming from artificial intelligence research, with an emphasis on knowledge structures. At that time the artificial intelligence approach seemed to have the best chance of simulating the communicative abilities necessary for realistic machine translation, and it offered an account of how knowledge structures might cope with one of the classic problems of machine translation: that of metaphor, or "semantic boundary breaking".

Machine translation efforts at RADC concluded on 27 October 1980 upon completion of a German/English translation system, dubbed METAL. Developed in conjunction with the University of Texas at Austin, the third-generation machine translated with an accuracy rate of 83 percent. Over the 25 years since the effort began as an in-house research and development project, the Center had designed translation machines for the Russian, Chinese, and Vietnamese languages.

Today's MT capabilities provide translation "on-the-fly." Within seconds after receiving text, the computer begins providing the translation. Also, almost all HQ NASIC personnel have access to the interactive machine translation system. Russian is the most "robust" language, with built-in Russian translation dictionaries containing more than 350,000 words and expressions.

The Systran MT systems are the only known MT systems that cover the wide range of systems of interest to NASIC and which employ the context-sensitive language analysis that is compatible with NASIC's systems. In addition, Systran MT systems have been identified as the only Department of Defense Intelligence Information System (DODIIS) migration MT System by the DODIIS Migration Board. Existing Systran MT systems include Russian-English, French-English, German-English, Chinese-English, Spanish-English, Korean-English, Slovak-English, Albanian-English, Ukrainian-English, Serbo-Croatian-English, Japanese-English, Polish-English, English-Chinese, English-Japanese, English-Korean, Czech-English, Arabic-English, Urdu-English, and Farsi-English.

Over the past few years there has been a significant research program funded by ARPA, NSA and other government agencies to develop and test automatic machine translation algorithms. While this research program has been constrained to a limited source of documents and a limited set of languages, results so far have been very promising. However, a follow-on program is needed to transfer the results of this research into operational use. NSA sponsored work to extend the applicability of the best language translation algorithms to more languages and more general domains; to improve the computational efficiency of those algorithms; to port those algorithms to networked workstations; and to develop good human-machine interfaces to allow easy control and operation of the system.

For textual information, there are ongoing research programs for document retrieval by topic, for data extraction and for machine translation. For several years, ARPA, NSA and other agencies conducted and sponsored research programs to develop algorithms for large vocabulary, continuous speech recognition. A follow-on to this research program was needed to further improve the recognition algorithms and to build a prototype speech recognition system and a system capable of processing continuous speech dictation of arbitrary text.

NSA sponsored work to extend the applicability of the best large vocabulary continuous speech recognition systems to vocabularies with sizes up to 50,000 words and to languages other than English; to improve the computational efficiency of those algorithms; to port those algorithms to networked workstations; and to develop effective human-machine interfaces to allow easy training, testing and general use of the system. The goal of the program is to deliver a usable prototype system for taking dictation on arbitrary topics using continuous speech input.

A major effort was initiated for development of efficient and reliable text summarization technology. Text summarization will combine existing text generation systems with a new understanding of how to identify key points of information in a text to reduce the volume of text an analyst needs to review. Prototype development for text summarization and relevance feedback from users is a near-term goal of the program.

By 2000 many projects to develop and use technology, including machine translation tools, for foreign language training and processing were under way in the Intelligence Community with funding from the National Foreign Intelligence Program, Joint Military Intelligence Program, and the Tactical Intelligence and Related Activities budget. A number of pilot projects were under way that could eventually help IC analysts and information processors deal with the increasing volume of foreign language material.

But humans remained a key part of this equation. The trend was toward development of tools that are intended to assist rather than replace the human language specialist and the instructor. Still, though this capability was not intended to replace humans, it was increasingly useful in niche areas, such as technical publications.

By 2003 the performance of machine translation technology on Arabic news feeds had vastly improved from essentially garbled output to nearly edit-worthy text, often understandable down to the level of individual sentences. This work pointed the way to unprecedented capabilities for exploiting huge volumes of speech and text in multiple languages.

Historically, three different approaches to MT have been used: direct translation, interlingual translation and transfer-based translation. In the 1980s and early 1990s a few new approaches were also introduced: knowledge-based, corpus-based, hybrid and human-in-loop methods.

Direct translation is the oldest approach to MT. A system that uses direct translation usually does not analyze the source language text structurally beyond morphology. The translation is based on large dictionaries and word-by-word translation with some simple grammatical adjustments, e.g. to word order and morphology. A direct translation system is designed for a specific source and target language pair. The translation unit of the approach is usually a word.
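The word-by-word idea can be illustrated with a minimal sketch. The tiny English-French lexicon, the word lists, and the function name direct_translate are assumptions made up for this illustration; they are not taken from any system described in this article.

```python
# Toy bilingual dictionary for one language pair (English -> French).
# All entries are illustrative assumptions, not a real lexicon.
LEXICON = {"the": "le", "cat": "chat", "black": "noir", "sleeps": "dort"}
ADJECTIVES = {"black"}
NOUNS = {"cat"}


def direct_translate(sentence: str) -> str:
    """Translate word by word, with one simple word-order adjustment."""
    words = sentence.lower().split()
    reordered = []
    i = 0
    while i < len(words):
        # Grammatical adjustment: English ADJ NOUN becomes NOUN ADJ in French.
        if i + 1 < len(words) and words[i] in ADJECTIVES and words[i + 1] in NOUNS:
            reordered.extend([words[i + 1], words[i]])
            i += 2
        else:
            reordered.append(words[i])
            i += 1
    # Dictionary lookup; unknown words pass through unchanged.
    return " ".join(LEXICON.get(w, w) for w in reordered)


print(direct_translate("The black cat sleeps"))
```

Because the lexicon and the reordering rule are both tied to one source and one target language, a second language pair would need its own dictionary and its own adjustments.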

The lexicon is normally conceived of as the repository of word-specific information. Traditional lexical resources, such as machine readable dictionaries, therefore contain lists of words. These lists might delineate senses of a word, represent the meaning of a word, or specify the syntactic frames in which a word can appear, but the level of granularity with which they are concerned is the individual word. There are many linguistic phenomena which pose a challenge to this "word focus" in the lexicon. The incorporation of elements at a higher level of abstraction -- at the phrasal level, where particular words are grouped together into fixed phrases -- provides a basis for improved computational processing of language.
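One way to picture a lexicon that works at the phrasal level is a greedy longest-match pass that groups known fixed phrases into single units before any word-level lookup. The sketch below is a hypothetical illustration; the phrase list and the name group_phrases are assumptions.

```python
# Fixed phrases stored as token tuples (illustrative assumptions).
FIXED_PHRASES = {
    ("kick", "the", "bucket"),
    ("out", "of", "sight"),
    ("hydraulic", "ram"),
}
MAX_PHRASE_LEN = max(len(p) for p in FIXED_PHRASES)


def group_phrases(tokens):
    """Greedy longest match: merge known fixed phrases into one unit."""
    out = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate first, down to two-word phrases.
        for n in range(min(MAX_PHRASE_LEN, len(tokens) - i), 1, -1):
            candidate = tuple(tokens[i:i + n])
            if candidate in FIXED_PHRASES:
                out.append(" ".join(candidate))
                i += n
                break
        else:
            # No phrase starts here; keep the single word.
            out.append(tokens[i])
            i += 1
    return out


print(group_phrases("he will kick the bucket".split()))
```

Each merged unit can then be looked up as a whole, so an idiom is not translated one word at a time.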

Systran, one of the oldest MT systems still in use today, is basically a direct translation system. Its first version was released in 1969. The system has been developed considerably over the years, but its translation capability is still based mainly on very large bilingual dictionaries. Direct translation does not require a general linguistic theory or parsing principles to work; these systems depend instead on well-developed dictionaries, morphological analysis, and text processing software.

The interlingua approach was historically the next step in the development of MT. Esperanto, for example, has been used as an interlingua for translating between languages. In an interlingua-based MT approach, translation is done via an intermediary (semantic) representation of the source language (SL) text. The interlingua is supposed to be a language-independent representation from which translations can be generated into different target languages. The approach assumes that it is possible to convert source texts into representations common to more than one language. From such interlingual representations texts are generated into other languages. Translation is thus in two stages: from the source language to the interlingua (IL) and from the IL to the target language.
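The two stages can be sketched as an analysis function and a generation function that communicate only through a language-independent structure. Everything here is a toy assumption: the Frame class, the concept identifiers, and the two-word "subject verb" sentences are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Frame:
    """Toy interlingua: language-independent concept identifiers."""
    agent: str
    action: str


# Per-language lexicons mapping surface words to concepts (assumed entries).
TO_CONCEPT = {
    "en": {"dog": "DOG", "cat": "CAT", "sleeps": "SLEEP", "eats": "EAT"},
    "es": {"perro": "DOG", "gato": "CAT", "duerme": "SLEEP", "come": "EAT"},
}
# Inverse lexicons used for generation.
FROM_CONCEPT = {
    lang: {concept: word for word, concept in lexicon.items()}
    for lang, lexicon in TO_CONCEPT.items()
}


def analyze(text: str, lang: str) -> Frame:
    """Stage 1: source language -> interlingua."""
    subject, verb = text.lower().split()
    lexicon = TO_CONCEPT[lang]
    return Frame(agent=lexicon[subject], action=lexicon[verb])


def generate(frame: Frame, lang: str) -> str:
    """Stage 2: interlingua -> target language."""
    lexicon = FROM_CONCEPT[lang]
    return f"{lexicon[frame.agent]} {lexicon[frame.action]}"


def translate(text: str, src: str, tgt: str) -> str:
    return generate(analyze(text, src), tgt)


print(translate("dog sleeps", "en", "es"))
print(translate("gato come", "es", "en"))
```

In this sketch, adding a third language only requires one new lexicon entry in TO_CONCEPT; no pair-specific component is written, which is the appeal of the approach.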

Transfer systems divide translation into steps which clearly differentiate source language and target language parts. The first stage converts source texts into abstract representations; the second stage converts these into equivalent target language-oriented representations; and the third generates the final target language texts. Whereas the interlingua approach necessarily requires complete resolution of all ambiguities in the SL text so that translation into any other language is possible, in the transfer approach only those ambiguities inherent in the language in question are tackled.
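The three stages can be sketched as three small functions. The part-of-speech table, the English-Spanish word list, and the function names are assumptions for illustration only; a real transfer system would use a parser and far richer representations.

```python
# Toy resources for one language pair (illustrative assumptions).
EN_POS = {"the": "DET", "red": "ADJ", "car": "NOUN", "stops": "VERB"}
EN_ES = {"the": "el", "red": "rojo", "car": "coche", "stops": "para"}


def analyze_en(sentence):
    """Stage 1: source text -> source-oriented representation (tagged words)."""
    # Raises KeyError for words outside the toy vocabulary.
    return [(w, EN_POS[w]) for w in sentence.lower().split()]


def transfer_en_es(src_repr):
    """Stage 2: source representation -> target-oriented representation."""
    tgt = [(EN_ES[w], pos) for w, pos in src_repr]
    i = 0
    while i + 1 < len(tgt):
        # Structural rule for this pair: ADJ NOUN -> NOUN ADJ.
        if tgt[i][1] == "ADJ" and tgt[i + 1][1] == "NOUN":
            tgt[i], tgt[i + 1] = tgt[i + 1], tgt[i]
            i += 2
        else:
            i += 1
    return tgt


def generate_es(tgt_repr):
    """Stage 3: target representation -> target text."""
    return " ".join(w for w, _ in tgt_repr)


print(generate_es(transfer_en_es(analyze_en("the red car stops"))))
```

Only the middle function knows about both languages, so the analysis and generation components could in principle be reused with other transfer modules.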

Knowledge-based machine translation follows the linguistic and computational instructions supplied to it by human researchers in linguistics and programming. The texts to be translated have to be presented to the computer in machine-readable form. The machine translation process may be unidirectional between a pair of languages: the translation is possible only from Russian to English, for example, and not vice versa, in one system. Or it may be bidirectional.

The dominant approach since around 1970 has been to use handcrafted linguistic rules, but rule-based systems are very expensive to build, requiring the manual entry of large numbers of "rules" by trained linguists. The approach does not scale up well to a general system. Such systems also tend to produce translations that are awkward and hard to understand.

Corpus-based approaches to machine translation (statistical or example-based) set out to replace traditional rule-based approaches and partially succeeded, beginning in the mid-1990s, following developments in language technology. The main advantage of corpus-based machine translation systems is that they are self-customising in the sense that they can learn the translations of terminology and even stylistic phrasing from previously translated materials.
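A very reduced picture of the example-based flavour is a translation memory: a new sentence is matched against previously translated sentences and the closest stored example is reused. The sketch below uses Python's standard difflib module for fuzzy matching; the memory contents, the cutoff value, and the name lookup are assumptions, and real example-based systems also recombine fragments rather than returning whole sentences.

```python
import difflib

# Previously translated material (English -> French); assumed examples.
MEMORY = {
    "press the start button": "appuyez sur le bouton marche",
    "close the cover": "fermez le couvercle",
}


def lookup(sentence, cutoff=0.6):
    """Return (matched_source, stored_translation) or None if nothing is close."""
    matches = difflib.get_close_matches(
        sentence.lower(), list(MEMORY), n=1, cutoff=cutoff
    )
    if not matches:
        return None
    return matches[0], MEMORY[matches[0]]


print(lookup("Press the stop button"))
```

Adding newly approved translations to MEMORY is what makes such a system self-customising: its coverage grows with the material it is given, without new handwritten rules.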

One of the many problems in the field of machine translation is that expressions (multi-word terms) convey ideas that transcend the meanings of the individual words in the expression. A sentence may have unambiguous meaning, but each word in the sentence can have many different meanings. Autocoding (or automatic concept indexing) occurs when a software program extracts terms contained within text and maps them to a standard list of concepts contained in a nomenclature. The purpose of autocoding is to provide a way of organizing large documents by the concepts represented in the text. Autocoders transform text into an index of coded nomenclature terms (sometimes called a "concept index" or "concept signature").
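A minimal sketch of autocoding follows, assuming a tiny made-up nomenclature in which several terms may map to the same concept code. The codes, terms, and the function name autocode are hypothetical, and plain substring matching is used only to keep the example short.

```python
# Hypothetical nomenclature: term -> concept code. Synonyms share a code.
NOMENCLATURE = {
    "machine translation": "C001",
    "automatic translation": "C001",
    "speech recognition": "C002",
    "text summarization": "C003",
}


def autocode(text):
    """Return the sorted concept codes whose terms occur in the text."""
    # Normalize case and whitespace so multi-word terms match reliably.
    normalized = " ".join(text.lower().split())
    return sorted(
        {code for term, code in NOMENCLATURE.items() if term in normalized}
    )


print(autocode("Automatic translation and   speech recognition research"))
```

The returned list plays the role of the "concept index": two documents that use different wording for the same concept receive the same code.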

Word sense disambiguation is a technique for assigning the most appropriate meaning to a polysemous word within a given context. Word sense disambiguation is considered essential for applications that use knowledge of word meanings in open text, such as machine translation, knowledge acquisition, information retrieval, and information extraction. Accordingly, word sense disambiguation may be used by many commercial applications, such as automatic machine translation (e.g. see the translation services offered by www.altavista.com, www.google.com), intelligent information retrieval (helping the users of search engines find information that is more relevant to their search), text classification, and others.
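One well-known baseline for this task is the simplified Lesk method, which picks the sense whose dictionary gloss shares the most words with the surrounding context. The sketch below is an illustration of that baseline, not the method of any service named above; the sense inventory, glosses, and stopword list are assumptions.

```python
# Toy sense inventory with glosses (illustrative assumptions).
SENSES = {
    "bank": {
        "financial_institution": "an institution that accepts deposits and lends money",
        "river_side": "sloping land beside a river or lake",
    }
}
STOPWORDS = {"a", "an", "the", "of", "and", "or", "that", "to", "on", "he", "she"}


def content_words(text):
    """Lowercase, strip punctuation, and drop stopwords."""
    return {w.strip(".,;:!?").lower() for w in text.split()} - STOPWORDS


def disambiguate(word, context):
    """Pick the sense whose gloss overlaps most with the context words."""
    ctx = content_words(context)
    best_sense, best_overlap = None, -1
    for sense, gloss in SENSES[word].items():
        overlap = len(ctx & content_words(gloss))
        # Ties keep the earlier sense, so the first listed sense is the default.
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense


print(disambiguate("bank", "She sat on the bank of the river"))
print(disambiguate("bank", "He went to the bank to deposit money"))
```

In a translation setting, the chosen sense would then select which target-language word to use for the polysemous source word.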

By the turn of the century, this newer approach based on statistical models - a word or phrase is translated to one of a number of possibilities based on the probability that it would occur in the current context - had achieved marked success. The best examples substantially outperform rule-based systems. Statistics-based machine translation (SMT) also may prove easier and less expensive to expand, if the system can be taught new knowledge domains or languages by giving it large samples of existing human-translated texts.
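The selection step can be sketched as scoring each candidate translation by a base probability multiplied by how well it fits the words around it. All probabilities below are invented for illustration; in a real SMT system they would be estimated from large human-translated corpora, and the model would be considerably more elaborate.

```python
# P(target | source word): assumed values for the English word "spring".
TRANSLATION_PROB = {
    "spring": {"printemps": 0.6, "ressort": 0.4},
}
# P(candidate fits | context word): assumed values.
CONTEXT_PROB = {
    ("steel", "ressort"): 0.9,
    ("steel", "printemps"): 0.1,
    ("flowers", "printemps"): 0.9,
    ("flowers", "ressort"): 0.1,
}


def best_translation(word, context_words):
    """Return the candidate with the highest combined score."""
    scores = {}
    for candidate, prob in TRANSLATION_PROB[word].items():
        score = prob
        for ctx in context_words:
            # Unseen (context, candidate) pairs get a neutral 0.5 in this toy model.
            score *= CONTEXT_PROB.get((ctx, candidate), 0.5)
        scores[candidate] = score
    return max(scores, key=scores.get)


print(best_translation("spring", ["steel"]))
print(best_translation("spring", ["flowers"]))
```

With no context the most frequent translation wins; a single informative context word is enough to overturn it, which is the behaviour the paragraph above describes.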

Despite some success, however, severe problems still exist: outputs are often ungrammatical and the quality and accuracy of translation fall well below those of a human linguist - and well below the demands of all but highly specialized commercial markets.

Hybrid methods are still fundamentally statistics-based, but incorporate higher level abstract syntax rules to arrive at the final translation. Such hybrids have been explored in the research community, but without any real success because it was difficult to merge the fundamentally different approaches. New algorithms exploit knowledge of how words, phrases and patterns should be translated; knowledge of how syntax-based and non-syntax based translation rules should be applied; and knowledge of how syntactically based target structures should be generated. Cross-lingual parsers of increasing complexity provide methods to choose different syntactic orderings in different situations.

Human-in-loop approaches respond to the difficulties in translation from one language to another that are inherent in machine translation. Languages are not symmetrically translatable word for word - greatly complicating software design and making perfect translation impossible. The greater the differences between languages' structure and culture, the greater the difficulty of accurately translating the intent of the speaker. As with any machine translation, conversions are normally not context-sensitive and may not fully convert text into its intended meaning. Language experts noted that machine translation software will never be able to replace a human translator's ability to interpret fine nuances, cultural references, and the use of slang terms or idioms.

Machine translation is not perfect, and may create some poor translations (which can be corrected). Computers, however limited for aiding nonlinguists, are powerful tools for linguists in intelligence and special operations to sort through tons of untranslated information or "triage" documents, sorting contents by priority. Machine "gisting" (reviewing intelligence documents to determine if they contain target key words or phrases) helps linguists manage their workloads and target the information that they need to review in depth. An automated translation system can be used for translation of technical terms and consistent translation of stock phrases in diplomatic and legal documents to help human translators work more efficiently.
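Keyword-based gisting and triage can be sketched as counting target terms in each document and ranking the documents by that count. The document identifiers, sample texts, and function names below are hypothetical; a working system would operate on rough machine translations or on source-language keyword lists.

```python
def gist_score(document, keywords):
    """Count occurrences of the target key words or phrases in one document."""
    text = document.lower()
    return sum(text.count(k.lower()) for k in keywords)


def triage(documents, keywords):
    """Rank documents by score, highest first, dropping those with no hits."""
    scored = [
        (gist_score(text, keywords), doc_id) for doc_id, text in documents.items()
    ]
    hits = [item for item in scored if item[0] > 0]
    # Sort by descending score, then by document id for a stable order.
    return sorted(hits, key=lambda item: (-item[0], item[1]))


docs = {
    "a": "Missile test near the border, second missile launch planned",
    "b": "Weather report for the weekend",
    "c": "Border crossing reopened",
}
print(triage(docs, ["missile", "border"]))
```

The ranked list tells a linguist which documents to read in depth first, while documents with no hits can be set aside.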

Both the private and the public sectors are exploring advances in machine translation of spoken and written communications. Off-the-shelf commercial software is designed for commercially viable languages, but not for the less-commonly taught, low-density languages. Numerous demonstration projects are under way, and early results show some promise for this type of technology.

