Machine translation techniques
Journal issue contents: Вестник КАСУ №2 - 2012
Author: Ларина М.В.
Technology
represents one major strategy that agencies use to address and manage their
foreign language shortfalls. Computer translation - often referred to as
machine translation - has been under development for decades. Machine
translation is the automated translation of text by a computer from one written
language to another without human oversight or intervention, and it has distinct
problems. One of its chief difficulties is adjusting to context; its chief
advantage is speed. Machine translation is cheaper and faster than using human
translators, although it is also less accurate. To be translated effectively,
the source text must be grammatically correct and free of colloquialisms.
Because of this, vendors claim only 60 to 70 percent accuracy.
One of the great
accessibility problems of the Web is that much of it is closed to people who do
not speak its major language, English, the lingua franca. The demand for
translation of English into other languages and vice versa drives a steady
improvement in machine translation. A variety of third-party Web sites
specialize in translation software, making it possible to read another site's
content through such an intermediate site.
Information
technology has made remarkable advances in recent years. The private sector
(without the same kinds of security concerns as the Intelligence Community) has
led the adoption of technologies that are also critical to intelligence. The
Community will never be able to hire enough linguists to meet its needs. It is
difficult for the Community to predict which languages will be most in demand
and to hire the necessary linguists in advance. And even an aggressive hiring
and training effort would not produce an analytic workforce that can absorb the
huge quantity of unclassified foreign language material available today.
Eventually, all
analysts will have basic foreign-language processing tools easily available to
them so that even those who are not language-qualified can pull pieces of
interest and get a quick, rough translation. NSA has done pioneering work on
machine translation and is pursuing a number of separate initiatives; the
military services, CIA (including In-Q-Tel), and other agencies sponsor largely
independent projects. There is an abundance of activity, but not a concerted,
coherent effort, which has led to steady but slow development.
The general idea
behind machine translation is that computers have the patience, stamina and
speed to quickly parse through gigabytes of text, matching text terms with
equivalent terms from an external vocabulary. Human translators often scoff at
the output of machine translators, noting the high rate of comical errors. An
often cited, perhaps apocryphal, example of poor machine translation is the
English to Russian transformation of "out of sight, out of mind" to
the Russian equivalent of "invisible idiot." Despite limitations,
machine translation is the only way to transform gigabytes and terabytes of
text. As long as people continue to type messages, reports, manuscripts and
notes into electronic documents, they will need computers to parse and organize
the resulting text.
Although many
machine-translation programs are currently available, few methods exist for
evaluating such translations in any given application area. It is difficult to
evaluate machine-translation systems objectively because the quality of a
translation depends on the combination of three factors: the translation
program, the dictionary, and the original document.
During the 1950s,
enthusiasts voiced extraordinary claims for the new machine translation
technology, which had lofty goals, promising quick and cheap translation. DARPA
funded a computer program to translate Soviet documents into English. The
difficulties of machine translation became clear when the Russian term for
"hydraulic ram" was translated as "water goat." There was a backlash of
skepticism following the disastrous failure of the machine translation effort
in the 1950s.
One hallmark of
the Air Force Foreign Technology Division (FTD) was (and continues to be for HQ
NASIC) its machine translation (MT) capabilities. In 1955, the Rome Air Development Center at Griffiss AFB, New York, was tasked to develop an MT system for
the center. The IBM Mark I Translating Device produced its first automated
translation in 1959, and, in October 1963, FTD installed the Mark II, which
provided word-for-word Russian language translations at the rate of about 5,000
words per hour.
The National Air
and Space Intelligence Center (NASIC) has been developing, operating, and
maintaining Systran [MT] systems since 1969. In July 1970, FTD upgraded to an IBM 360 Systran system. Translation speed increased 20-fold and the system analyzed
the Russian text sentence-by-sentence to provide improved grammar and syntax.
In October 1982, an optical character reader was added to the system to more
fully automate text translation.
In September 1971,
Air Force Rome Air Development Center developed an English-to-Vietnamese
automated translator. Designed to operate on the IBM 360/67 computer, the
translation system had an output rate of 80,000 to 100,000 words per hour. As
part of the overall "Vietnamization Program," RADC produced in May an
automated English-to-Vietnamese translation of AF Manual 51-37, Instrument
Flying. The translation was accomplished using the LOGOS I system for
English-to-Vietnamese machine translation.
By the late 1970s,
three types of projects existed: those relying on "brute force" methods
involving larger and faster computers; those based on a linguistic tradition
asserting that the knowledge required for machine translation can be captured
in a grammar-based system with a semantic component; and those stemming from
artificial intelligence research, with an emphasis on knowledge structures. At
the time, the artificial intelligence approach seemed to have the best chance
of simulating the communicative abilities necessary for realistic machine
translation, and it offered an account of how knowledge structures might cope
with one of the classic problems of machine translation: metaphor, or
"semantic boundary breaking".
Machine
translation efforts at RADC concluded on 27 October 1980 upon completion of a
German/English translation system, dubbed METAL. Developed in conjunction with
the University of Texas at Austin, the third-generation system translated with
an accuracy rate of 83 percent. From its beginnings 25 years earlier as an
in-house research and development project, the Center had designed translation
machines for the Russian, Chinese, and Vietnamese languages.
Today's MT
capabilities provide translation "on the fly." Within seconds after
receiving text, the computer begins providing the translation. Also, almost all
HQ NASIC personnel have access to the interactive machine translation system.
Russian is the most "robust" language, with built-in Russian translation
dictionaries containing more than 350,000 words and expressions.
The Systran MT
systems are the only known MT systems that cover the wide range of languages of
interest to NASIC and that employ context-sensitive language analysis
compatible with NASIC's systems. In addition, Systran MT systems have been
identified by the DODIIS Migration Board as the only Department of Defense
Intelligence Information System (DODIIS) migration MT system. Existing Systran
MT systems include Russian-English, French-English, German-English,
Chinese-English, Spanish-English, Korean-English, Slovak-English,
Albanian-English, Ukrainian-English, Serbo-Croatian-English, Japanese-English,
Polish-English, English-Chinese, English-Japanese, English-Korean,
Czech-English, Arabic-English, Urdu-English, and Farsi-English.
Over the past few
years there has been a significant research program funded by ARPA, NSA and
other government agencies to develop and test automatic machine translation
algorithms. While this research program has been constrained to a limited set
of documents and a limited set of languages, results so far have been very
promising. However, a follow-on program is needed to transfer the results of
this research into operational use. NSA sponsored work to extend the applicability
of the best language translation algorithms to more languages and more general
domains; to improve the computational efficiency of those algorithms; to port
those algorithms to networked workstations; and to develop good human-machine
interfaces to allow easy control and operation of the system.
For textual
information, there are ongoing research programs for document retrieval by
topic, for data extraction and for machine translation. For several years,
ARPA, NSA, and other agencies conducted and sponsored research programs to
develop algorithms for large vocabulary, continuous speech recognition. A
follow-on to this research program was needed to further improve the
recognition algorithms and to build a prototype speech recognition system and a
system capable of processing continuous speech dictation of arbitrary text.
NSA sponsored work
to extend the applicability of the best large vocabulary continuous speech
recognition systems to vocabularies with sizes up to 50,000 words and to
languages other than English; to improve the computational efficiency of those
algorithms; to port those algorithms to networked workstations; and to develop
effective human-machine interfaces to allow easy training, testing and general
use of the system. The goal of the program was to deliver a usable prototype
system for taking dictation on arbitrary topics using continuous speech input.
A major effort was
initiated for development of efficient and reliable text summarization
technology. Text summarization would combine existing text generation systems
with a new understanding of how to identify the key points of information in a
text, reducing the volume of text an analyst needs to review. Prototype
development for text summarization and relevance feedback from users was a
near-term goal of the program.
By 2000 many
projects to develop and use technology, including machine translation tools,
for foreign language training and processing were under way in the Intelligence
Community with funding from the National Foreign Intelligence Program, Joint
Military Intelligence Program, and the Tactical Intelligence and Related
Activities budget. A number of pilot projects were under way that could
eventually help IC analysts and information processors deal with the increasing
volume of foreign language material.
But humans
remained a key part of this equation. The trend was toward development of tools
that are intended to assist rather than replace the human language specialist
and the instructor. Still, though this capability was not intended to replace
humans, it was increasingly useful in niche areas, such as technical
publications.
By 2003 the
performance of machine translation technology on Arabic news feeds had vastly
improved from essentially garbled output to nearly edit-worthy text, often understandable
down to the level of individual sentences. This work pointed the way to unprecedented
capabilities for exploiting huge volumes of speech and text in multiple
languages.
Historically, three
different approaches to MT have been used: direct translation, interlingual
translation, and transfer-based translation. From the 1980s and early 1990s, a
few new approaches were also introduced. These more recent approaches to
machine translation are knowledge-based methods, corpus-based methods, hybrid
methods, and human-in-the-loop methods.
Direct translation
is the oldest approach to MT. If an MT system uses direct translation, it
usually means that the source-language text is not analyzed structurally
beyond morphology. The translation is based on large dictionaries and
word-by-word translation with some simple grammatical adjustments, e.g. to word
order and morphology. A direct translation system is designed for a specific
source and target language pair. The translation unit of the approach is
usually a word.
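As a rough illustration of this word-by-word strategy, the following Python sketch uses an invented five-entry bilingual dictionary and a single adjective-noun reordering rule (both hypothetical, not drawn from any actual system):

```python
# A minimal sketch of the direct-translation approach: word-by-word dictionary
# lookup plus one simple grammatical adjustment. The tiny dictionary and the
# adjective/noun reordering rule are illustrative assumptions only.

BILINGUAL_DICT = {  # hypothetical source word -> target word entries
    "the": "la", "red": "roja", "house": "casa", "is": "es", "big": "grande",
}
ADJECTIVES = {"red", "big"}  # hypothetical word classes used by the rule
NOUNS = {"house"}

def direct_translate(sentence: str) -> str:
    """Translate word by word, then swap adjective-noun pairs."""
    words = sentence.lower().split()
    # simple grammatical adjustment: adjective + noun -> noun + adjective
    reordered, i = [], 0
    while i < len(words):
        if i + 1 < len(words) and words[i] in ADJECTIVES and words[i + 1] in NOUNS:
            reordered.extend([words[i + 1], words[i]])
            i += 2
        else:
            reordered.append(words[i])
            i += 1
    # word-by-word dictionary lookup; unknown words pass through unchanged
    return " ".join(BILINGUAL_DICT.get(w, w) for w in reordered)

print(direct_translate("the red house is big"))  # -> "la casa roja es grande"
```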
The lexicon is
normally conceived of as the repository of word-specific information.
Traditional lexical resources, such as machine readable dictionaries, therefore
contain lists of words. These lists might delineate senses of a word, represent
the meaning of a word, or specify the syntactic frames in which a word can
appear, but the level of granularity with which they are concerned is the
individual word. There are many linguistic phenomena which pose a challenge to
this "word focus" in the lexicon. The incorporation of elements at a
higher level of abstraction -- at the phrasal level, where particular words are
grouped together into fixed phrases -- provides a basis for improved
computational processing of language.
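A minimal sketch of how phrasal entries might be consulted before individual words is given below; the lexicon entries and the greedy longest-match strategy are illustrative assumptions only:

```python
# A brief sketch of moving beyond a purely word-level lexicon: greedy
# longest-match lookup of fixed phrases before falling back to single words.
# The entries below are hypothetical examples, not a real lexical resource.

LEXICON = {
    ("kick", "the", "bucket"): "die",          # fixed, non-compositional phrase
    ("machine", "translation"): "machine_translation",
    ("kick",): "strike_with_foot",
    ("the",): "the",
    ("bucket",): "pail",
}
MAX_PHRASE_LEN = max(len(key) for key in LEXICON)

def lookup_phrases(tokens):
    """Greedily match the longest phrase in the lexicon at each position."""
    i, out = 0, []
    while i < len(tokens):
        for n in range(min(MAX_PHRASE_LEN, len(tokens) - i), 0, -1):
            key = tuple(tokens[i:i + n])
            if key in LEXICON:
                out.append(LEXICON[key])
                i += n
                break
        else:  # no entry at all: keep the token as-is
            out.append(tokens[i])
            i += 1
    return out

print(lookup_phrases("he will kick the bucket".split()))
# -> ['he', 'will', 'die']
```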
One of the oldest
still used MT systems today, Systran, is basically a direct translation system.
Its first version was released in 1969. Over the years the system has been
developed considerably, but its translation capability is still based mainly on
very large bilingual dictionaries. No general linguistic theory or parsing
principles are necessarily required for direct translation to work; such
systems depend instead on well-developed dictionaries, morphological analysis,
and text-processing software.
The interlingua
approach was historically the next step in the development of MT. (Esperanto,
for example, has served as an interlingua for translating between languages.)
In an interlingua-based MT approach, translation is done via an intermediary
(semantic) representation of the source-language (SL) text. The interlingua is
supposed to be a language-independent representation from which translations
can be generated into different target languages. The approach assumes that it
is possible to convert source texts into representations common to more than
one language, from which texts are then generated in other languages.
Translation is thus in two stages: from the source language to the interlingua
(IL) and from the IL to the target language.
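The two-stage character of the approach can be sketched as follows; the semantic frame format, the toy lexicons, and the generation templates are all invented for illustration and do not reflect any particular interlingua system:

```python
# A toy sketch of interlingua-based translation: analyze the source sentence
# into a language-independent semantic frame, then generate any target language
# from that single representation.

VERB_CONCEPTS = {"drink": "DRINK"}   # hypothetical verb-to-concept mapping

def analyze_to_interlingua(sentence: str) -> dict:
    """Stage 1: map a simple subject-verb-object sentence to a semantic frame."""
    subj, verb, obj = sentence.lower().split()          # e.g. "cats drink milk"
    return {"event": VERB_CONCEPTS[verb], "agent": subj, "theme": obj}

# Stage 2 resources: a per-target-language lexicon plus a word-order template.
TARGETS = {
    "es": ({"cats": "los gatos", "milk": "leche", "DRINK": "beben"},
           "{agent} {event} {theme}"),
    "de": ({"cats": "Katzen", "milk": "Milch", "DRINK": "trinken"},
           "{agent} {event} {theme}"),
}

def generate(frame: dict, target: str) -> str:
    """Stage 2: generate target-language text from the interlingua frame."""
    lexicon, template = TARGETS[target]
    return template.format(agent=lexicon[frame["agent"]],
                           event=lexicon[frame["event"]],
                           theme=lexicon[frame["theme"]])

frame = analyze_to_interlingua("cats drink milk")
print(generate(frame, "es"))   # -> "los gatos beben leche"
print(generate(frame, "de"))   # -> "Katzen trinken Milch"
```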
Transfer systems
divide translation into steps which clearly differentiate source language and
target language parts. The first stage converts source texts into abstract representations;
the second stage converts these into equivalent target language-oriented representations;
and the third generates the final target language texts. Whereas the
interlingua approach necessarily requires complete resolution of all
ambiguities in the SL text so that translation into any other language is
possible, in the transfer approach only those ambiguities inherent in the
language in question are tackled.
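A compact sketch of the three transfer stages, using an invented toy grammar, one structural transfer rule, and a three-word lexical transfer dictionary (none of them from a real system), might look like this:

```python
# A sketch of transfer-based translation in three stages: analyze the source
# sentence into an abstract (source-oriented) structure, apply transfer rules
# to obtain a target-oriented structure, then generate the target text.

from typing import List, Tuple

LEX = {"red": "roja", "house": "casa", "a": "una"}   # hypothetical lexical transfer

def analyze(sentence: str) -> List[Tuple[str, str]]:
    """Stage 1: tag each word with a crude part of speech (toy grammar)."""
    pos = {"red": "ADJ", "house": "N", "a": "DET"}
    return [(w, pos.get(w, "X")) for w in sentence.lower().split()]

def transfer(tree: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Stage 2: structural transfer (ADJ + N -> N + ADJ) plus lexical transfer."""
    out = list(tree)
    for i in range(len(out) - 1):
        if out[i][1] == "ADJ" and out[i + 1][1] == "N":
            out[i], out[i + 1] = out[i + 1], out[i]
    return [(LEX.get(w, w), tag) for w, tag in out]

def generate(tree: List[Tuple[str, str]]) -> str:
    """Stage 3: linearize the target-oriented structure into text."""
    return " ".join(w for w, _ in tree)

print(generate(transfer(analyze("a red house"))))   # -> "una casa roja"
```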
Knowledge-based
machine translation follows the linguistic and computational instructions
supplied to it by human researchers in linguistics and programming. The texts
to be translated have to be presented to the computer in machine-readable form.
The machine translation process may be unidirectional between a pair of
languages - in one system, for example, translation is possible only from
Russian to English and not vice versa - or it may be bidirectional.
The dominant
approach since around 1970 has been to use handcrafted linguistic rules, but
this approach is very expensive to build, requiring the manual entry of large
numbers of "rules" by trained linguists. This approach does not scale
up well to a general system. Such systems also produce translations that are
awkward and hard to understand.
Corpus-based
approaches to machine translation (statistical or example-based) have tried,
and partially succeeded, to replace traditional rule-based approaches,
beginning in the mid-1990s and following developments in language technology.
The main advantage of corpus-based machine translation systems is that they are
self-customising, in the sense that they can learn the translations of
terminology and even stylistic phrasing from previously translated materials.
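The self-customising idea can be illustrated with a small sketch that learns word translations from a three-sentence parallel corpus by counting co-occurrences and scoring them with a Dice-style association measure; the corpus and the scoring choice are assumptions made purely for illustration:

```python
# A minimal sketch of learning translations from previously translated material:
# count word co-occurrences in aligned sentence pairs and keep the most strongly
# associated target word for each source word.

from collections import Counter, defaultdict

parallel_corpus = [  # hypothetical previously translated sentence pairs
    ("the treaty was signed", "el tratado fue firmado"),
    ("the treaty was ratified", "el tratado fue ratificado"),
    ("the agreement was signed", "el acuerdo fue firmado"),
]

src_count, tgt_count = Counter(), Counter()
cooc = defaultdict(Counter)
for src, tgt in parallel_corpus:
    src_words, tgt_words = set(src.split()), set(tgt.split())
    src_count.update(src_words)
    tgt_count.update(tgt_words)
    for s in src_words:
        for t in tgt_words:
            cooc[s][t] += 1

def learned_translation(source_word: str) -> str:
    """Pick the target word with the highest Dice association score."""
    scores = {t: 2 * c / (src_count[source_word] + tgt_count[t])
              for t, c in cooc[source_word].items()}
    return max(scores, key=scores.get) if scores else source_word

print(learned_translation("treaty"))   # -> "tratado"
print(learned_translation("signed"))   # -> "firmado"
```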
One of the many
problems in the field of machine translation is that expressions (multi-word
terms) convey ideas that transcend the meanings of the individual words in the
expression. A sentence may have unambiguous meaning, but each word in the
sentence can have many different meanings. Autocoding (or automatic concept
indexing) occurs when a software program extracts terms contained within text
and maps them to a standard list of concepts contained in a nomenclature. The
purpose of autocoding is to provide a way of organizing large documents by the
concepts represented in the text. Autocoders transform text into an index of
coded nomenclature terms (sometimes called a "concept index" or
"concept signature").
Word sense
disambiguation is a technique for assigning the most appropriate meaning to a
polysemous word within a given context. Word sense disambiguation is considered
essential for applications that use knowledge of word meanings in open text,
such as machine translation, knowledge acquisition, information retrieval, and
information extraction. Accordingly, word sense disambiguation may be used by
many commercial applications, such as automatic machine translation (e.g. see
the translation services offered by www.altavista.com, www.google.com), intelligent
information retrieval (helping the users of search engines find information
that is more relevant to their search), text classification, and others.
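One classic, dictionary-overlap way to disambiguate a word (a simplified Lesk-style method, named here as context, not something described in the text above) can be sketched as follows, with an invented two-sense entry for "bank":

```python
# A small sketch of word sense disambiguation by gloss overlap: choose the sense
# whose dictionary gloss shares the most words with the surrounding context.
# The two-sense mini-dictionary is an illustrative assumption.

SENSES = {
    "bank": {
        "bank.financial": "an institution that accepts deposits and lends money",
        "bank.river": "sloping land beside a body of water such as a river",
    }
}

def disambiguate(word: str, context: str) -> str:
    """Return the sense whose gloss overlaps most with the context words."""
    context_words = set(context.lower().split())
    best_sense, best_overlap = None, -1
    for sense, gloss in SENSES[word].items():
        overlap = len(context_words & set(gloss.split()))
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense

print(disambiguate("bank", "he sat on the bank of the river and watched the water"))
# -> "bank.river"
```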
By the turn of the
century, a newer approach based on statistical models - in which a word or
phrase is translated to one of a number of possibilities based on the
probability that it would occur in the current context - had achieved marked
success. The best examples substantially outperform rule-based systems.
Statistics-based machine translation (SMT) also may prove easier and less
expensive to expand, because the system can be taught new knowledge domains or
languages simply by giving it large samples of existing human-translated texts.
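The probability-in-context idea can be sketched with a toy translation model and a bigram language model; all of the candidate words and probabilities below are made-up assumptions:

```python
# A toy sketch of statistical translation choice: each source word has several
# candidate translations with probabilities, and a simple bigram language model
# over the target side picks the candidate that best fits the current context.

TRANSLATION_MODEL = {
    # source word -> {candidate target word: P(target | source)}
    "bank": {"banco": 0.6, "orilla": 0.4},
    "river": {"río": 1.0},
    "the": {"el": 0.5, "la": 0.5},
}

BIGRAM_LM = {
    # P(word | previous target word); unseen pairs get a small default
    ("la", "orilla"): 0.3, ("el", "banco"): 0.3,
    ("orilla", "del"): 0.4, ("del", "río"): 0.5,
}

def best_translation(prev_target: str, source_word: str) -> str:
    """Choose the candidate maximizing P(t | s) * P(t | previous target word)."""
    candidates = TRANSLATION_MODEL[source_word]
    return max(candidates,
               key=lambda t: candidates[t] * BIGRAM_LM.get((prev_target, t), 0.01))

print(best_translation("la", "bank"))   # context favours "orilla" (river bank)
print(best_translation("el", "bank"))   # context favours "banco" (financial)
```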
Despite some
success, however, severe problems still exist: outputs are often ungrammatical
and the quality and accuracy of translation falls well below that of a human
linguist - and well below the demands of all but highly specialized commercial
markets.
Hybrid methods are
still fundamentally statistics-based, but incorporate higher level abstract
syntax rules to arrive at the final translation. Such hybrids have been
explored in the research community, but without any real success because it was
difficult to merge the fundamentally different approaches. New algorithms
exploit knowledge of how words, phrases and patterns should be translated;
knowledge of how syntax-based and non-syntax based translation rules should be
applied; and knowledge of how syntactically based target structures should be
generated. Cross-lingual parsers of increasing complexity provide methods to
choose different syntactic orderings in different situations.
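A rough sketch of this combination - statistical word choice followed by an abstract syntactic reordering rule - is given below; the translation table, tags, and the single rule are hypothetical:

```python
# A sketch of a hybrid method: translations are chosen statistically (here a
# stubbed "most probable translation" table), and a higher-level syntax rule is
# then applied to reorder the output for the target language.

TRANSLATIONS = {
    # source word -> (highest-probability target word, crude part-of-speech tag)
    "a": ("una", "DET"),
    "red": ("roja", "ADJ"),
    "door": ("puerta", "N"),
}

def statistical_pick(word):
    """Statistical stage (stubbed): take the most probable translation."""
    return TRANSLATIONS.get(word, (word, "X"))

def apply_syntax_rule(tagged):
    """Abstract syntax rule for the target language: ADJ + N -> N + ADJ."""
    out = list(tagged)
    for i in range(len(out) - 1):
        if out[i][1] == "ADJ" and out[i + 1][1] == "N":
            out[i], out[i + 1] = out[i + 1], out[i]
    return [word for word, _ in out]

source = "a red door".split()
print(" ".join(apply_syntax_rule([statistical_pick(w) for w in source])))
# -> "una puerta roja"
```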
Human-in-the-loop
approaches respond to the difficulties of translating from one language to
another that are inherent in machine translation. Languages are not
symmetrically translatable word for word, which greatly complicates software
design and makes perfect translation impossible. The greater the differences
between languages' structures and cultures, the greater the difficulty of
accurately conveying the intent of the speaker. As with any machine translation,
conversions are normally not context-sensitive and may not fully convert text
into its intended meaning. Language experts noted that machine translation
software will never be able to replace a human translator's ability to
interpret fine nuances, cultural references, and the use of slang terms or
idioms.
Machine
translation is not perfect, and may create some poor translations (which can be
corrected). Computers, however limited as aids for nonlinguists, are powerful
tools for linguists in intelligence and special operations, helping them sort
through tons of untranslated information and "triage" documents, sorting
contents by priority. Machine "gisting" (reviewing intelligence documents to
determine whether they contain target key words or phrases) is used to help
linguists manage their workloads and target the information they need to
review in depth. An automated translation system can be used for translation of
technical terms and consistent translation of stock phrases in diplomatic and
legal documents to help human translators work more efficiently.
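The triage and gisting workflow can be sketched as a simple keyword-weighted scoring of a document queue; the keywords, weights, and sample documents are invented for illustration:

```python
# A sketch of document "triage"/gisting: score each document by the target key
# words or phrases it contains and sort the queue so that linguists see the
# highest-priority items first.

KEYWORDS = {"missile": 5, "launch site": 4, "treaty": 2, "weather": 1}

def gist_score(text: str) -> int:
    """Sum the weights of every target keyword or phrase found in the text."""
    lowered = text.lower()
    return sum(weight for kw, weight in KEYWORDS.items() if kw in lowered)

documents = [
    "Routine weather report for the coastal region",
    "New launch site observed near the northern border",
    "Commentary on the trade treaty negotiations",
]

# highest-scoring documents first: these go to trained linguists for full review
for doc in sorted(documents, key=gist_score, reverse=True):
    print(gist_score(doc), doc)
```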
Both the private
and the public sectors are exploring advances in machine translation of spoken
and written communications. Off-the-shelf commercial software is designed for
commercially viable languages, but not for the less-commonly taught,
low-density languages. Numerous demonstration projects are under way, and early
results show some promise for this type of technology.