USING TOPOLOGICAL INFORMATION FOR DETECTING
IDIOMATIC VERB PHRASES IN GERMAN
Dimitra Anastasiou, Oliver Čulo
Saarland University
1 Introduction
METIS-II¹ (Dirix et al. 2005; Dologlou et al. 2003) is a hybrid statistical machine
translation (SMT) system. Its goal is to generate free-text translations based on
statistical methods and on pattern matching in a large monolingual target language (TL)
corpus, the British National Corpus (BNC)². The project runs from 2004 to 2007 and has
Dutch, German, Greek and Spanish as source languages (SL) and British English as TL.
METIS-II uses language-specific resources for both SL and TL, such as bilingual dictionaries,
a tokenizer, a part-of-speech (PoS) tagger, a chunker, a lemmatizer/morphological generator
and manually constructed mapping rules. The function of the mapping rules is to map the TL
structures onto SL structures. A language model for the target language acquired from the
BNC helps to disambiguate between different translation possibilities and is used to retrieve
the TL word order (Vandeghinste et al., 2005). A translation example is shown in figure 1
below. Clearly, context plays a significant role; in the example, the choice of the adjective
in the TL depends on the noun it is combined with.
SL: Ich betrachte Churchill als einen großen Politiker.
Literally: I consider Churchill as a tall / great politician.
TL: I consider Churchill to be a great politician.
The SL sentence must be tokenized, tagged, lemmatized and chunked. Once one or more TL
translations have been found for every lemma, the statistical language model and the mapping
rules help to find the right TL lexemes and word order.
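The lexical choice in the Churchill example can be sketched as follows. This is a toy illustration under stated assumptions, not METIS-II code: the candidate lists and the co-occurrence counts are invented for the example, whereas a real language model would be estimated from the BNC.

```python
# Toy sketch of target-language lexical choice (not METIS-II code).
# CANDIDATES and BIGRAM_COUNTS are invented for illustration only.

# Dictionary lookup: each SL lemma has one or more TL candidates.
CANDIDATES = {
    "groß": ["tall", "great"],
    "Politiker": ["politician"],
}

# Hypothetical adjective-noun co-occurrence counts from a TL corpus.
BIGRAM_COUNTS = {
    ("great", "politician"): 12,
    ("tall", "politician"): 1,
}


def choose_adjective(adj_lemma: str, noun_lemma: str) -> str:
    """Pick the adjective candidate that co-occurs most often with the noun."""
    noun = CANDIDATES[noun_lemma][0]
    return max(
        CANDIDATES[adj_lemma],
        key=lambda adj: BIGRAM_COUNTS.get((adj, noun), 0),
    )


print(choose_adjective("groß", "Politiker"))
```

With these invented counts the sketch prefers "great" over "tall" for "großen Politiker", which mirrors the point that the adjective depends on the noun it combines with.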
2 Identification of idiomatic expressions and discontinuities
Idioms are often referred to as “long words”. They are expressions whose syntactic or
semantic properties cannot be derived from their component parts. They always involve a
lexical head and frozen and/or flexible complements.
Volk (1998) lists what an MT system needs to know in order to identify an idiom:
1. The contiguous parts of the idiom (vor die Hunde – to the dogs)
2. The discontinuous parts of the idiom (gehen – go) in any of their inflected forms
3. The syntactic requirements of the idiom. The idiom in question often takes an animate
subject and a physical object.
4. The clause boundaries. The idiom is usually contained in one clause. There are rare cases
where the idiom is spread over two clauses, but then it is often used with its non-idiomatic
meaning.
¹ METIS-II is sponsored by the EU under the FET-STREP scheme of FP6 (METIS-II, IST-FP6-003768).
² http://www.natcorp.ox.ac.uk/
The term “discontinuous strings” has been used to describe strings whose verb is conjugated.
Such a string usually follows the subject-verb-object (SVO) sequence, e.g. die Welt geht
(ständig) vor die Hunde – the world (constantly) goes to the dogs, or its verb is in the
participle form, e.g. Die Welt ist vor die Hunde gegangen – the world went to the dogs.
3 Types of idiomatic verb phrases
An idiomatic expression can be a noun phrase (das A und O – the be-all and end-all), a
prepositional phrase (auf Biegen und Brechen – by hook or by crook), an adverb (steif und fest
– firmly), a noun phrase plus a prepositional phrase (Hals über Kopf – head over heels) or a
verb phrase. In this paper, we examine German idiomatic verb phrases in detail, because
they exhibit the most discontinuities. They basically fall into one of the following three
categories. The entries listed below are used for our experiment and are included in our
bilingual (German – English) lexicon. For our corpus, we used real-world examples from
newspaper texts containing these entries, both in continuous and in discontinuous order.
1. Noun phrase (NP) plus verb
ein Auge zudrücken – turn a blind eye
einen kapitalen Bock schießen – drop a real clanger
2. Prepositional phrase (PP) plus verb
auf die falsche Karte setzen – back the wrong horse
ins Fettnäpfchen treten – put one’s foot in it
mit den Wölfen heulen – run with the pack
um den heißen Brei herumreden – beat about the bush
vor die Hunde gehen – go to the dogs
[OC: Move these examples to the back, as future work, since we do not handle this at all
in our matching!]
3. NP plus PP plus verb
das Kind mit dem Bade ausschütten – throw out the baby with the bathwater
den Bock zum Gärtner machen – set the fox to keep the geese
ein Wolf im Schafspelz sein – be a wolf in sheep’s clothing
jmdn. auf den Arm nehmen – pull s.o.'s leg
4 Permutations and topological distributions
German is a language with relatively free word order. However, it does obey some ordering
principles, as described in the topological field model for German (Drach 1963; Duden 1998).
Making use of this model, we can describe the patterns in which the subparts of an idiomatic
expression can appear while potentially carrying an idiomatic reading. The topological field
model states that the German main clause³ can be divided into five fields, each of which can
hold a certain number and/or a certain type of constituents. A basic description of the fields
is as follows:
• The pre-field (VF) contains only one syntactic constituent, be it NP, PP or subordinate
clause;
• the left bracket (LK) holds the conjugated syntactic head verb;
• the middle field (MF) can consist of several permutations of various kinds of syntactic
constituents and subordinate clauses;
• the right bracket (RK) holds participles or infinitive forms in case the syntactic head
verb is an auxiliary or a modal;
• finally, the post-field (NF) contains subordinate clauses or coordinated main clauses.
As Drach (1963) points out, the sentence bracket is a typical feature of German clause
construction. Syntactic units that have been separated often appear as a bracket around the
middle field, which is the main container of the main clause. This bracketing construction
occurs, for instance, with a modal verb: the modal appears in the left bracket, and the
infinitive belonging to it is placed in the right bracket. The same goes for discontinuous
idiomatic verb phrases (iVPs). The verb may appear in the left bracket, while the NP or PP
belonging to it is placed at the end of the middle field⁴. Following this observation, we
defined the pattern for discontinuous appearances of iVPs:
iV_LK (NP|PP|subclause)*_MF iNP_MF [V*_RK | subclause*_NF]
as in:
SL: Der Mann nimmt_LK mich, obwohl ich mich darüber ärgere, ständig [auf den Arm]_MF.
Literally: The man takes me, although I me about it annoy, constantly, on the arm.
TL: The man is constantly pulling my leg, although I am annoyed about it.
We define the ordering in which the verb appears to the left as non-canonical order, because
in the lexicon the verb is situated to the right of the iNP/iPP, which we simply define as
canonical order.
Besides the pattern given above, (at least) three other configurations are possible, all of them
being variations of continuous appearances of the iVP.
The iVP can appear as a continuous string at the end of a clause. The pattern for this is:
iNP_MF iV_RK
stating that the iV is in the right bracket and is preceded by the iNP or iPP in the middle
field. This is usually the case in subordinate clauses, as in:
SL: Ich mag ihn nicht, weil er mich [auf den Arm]_MF nimmt_RK.
Literally: I like him not, because he me on the arm takes.
TL: I don't like him, because he pulls my leg.
³ Subordinate clauses can be divided into fields, too, but a detailed inspection of a complete
theory of German topological fields is outside the scope of this paper.
⁴ We cannot say that it is in the right bracket, as by the definition we use only verbal forms
can be contained there.
The same configuration occurs in main clauses when an auxiliary or modal verb occupies the
left bracket as syntactic head, as in:
SL: Er hat_LK mich [auf den Arm]_MF genommen_RK.
Literally: He has me on the arm taken.
TL: He pulled my leg.
When the iVP is topicalized, it appears as a continuous string in the pre-field, described by
the pattern:
(iNP iV)_VF
as in:
SL: [Auf den Arm nehmen]_VF lasse_LK ich mich nicht.
Literally: [On the arm take] let I me not.
TL: I won’t let anyone pull my leg.
While it may look superfluous to define all topological variations for the continuous
appearances instead of simply trying to match all iVP constituents ‘en bloc’, it must be
pointed out that some idiomatic expressions can also be varied slightly, for instance by
inserting adjectives. In these cases we still want to make sure that an idiomatic reading of
the whole expression is possible, which is the case if the slightly modified idiom parts are
distributed as in the above examples.
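The continuous configurations described in this section can be sketched as checks over a sequence of field-annotated chunks. This is a minimal sketch under assumptions: the CATEGORY:FIELD encoding, the names CONTINUOUS_PATTERNS and classify_continuous, the label ADV and the placement of the complementizer weil in the left bracket (C:LK) are all illustrative and not taken from the METIS-II system.

```python
import re
from typing import Optional

# Each chunk is written CATEGORY:FIELD. Idiom parts carry an "i" prefix
# (iNP, iPP, iV); this encoding is an assumption made for the sketch.
CONTINUOUS_PATTERNS = {
    # iNP_MF iV_RK: idiom NP/PP in the middle field, idiom verb in the right bracket
    "clause-final": re.compile(r"(?<!\S)(?:iNP|iPP):MF iV:RK(?!\S)"),
    # (iNP iV)_VF: the whole iVP is topicalized into the pre-field
    "topicalized": re.compile(r"^(?:iNP|iPP):VF iV:VF(?!\S)"),
}


def classify_continuous(chunks: list) -> Optional[str]:
    """Return the name of the first continuous pattern found, or None."""
    tagged = " ".join(chunks)
    for name, pattern in CONTINUOUS_PATTERNS.items():
        if pattern.search(tagged):
            return name
    return None


# ... weil er mich auf den Arm nimmt
print(classify_continuous(["C:LK", "NP:MF", "NP:MF", "iPP:MF", "iV:RK"]))
# Er hat mich auf den Arm genommen
print(classify_continuous(["NP:VF", "V:LK", "NP:MF", "iPP:MF", "iV:RK"]))
# Auf den Arm nehmen lasse ich mich nicht
print(classify_continuous(["iPP:VF", "iV:VF", "V:LK", "NP:MF", "NP:MF", "ADV:MF"]))
# Der Mann nimmt mich ... ständig auf den Arm (discontinuous, not covered here)
print(classify_continuous(["NP:VF", "iV:LK", "NP:MF", "SUB:MF", "ADV:MF", "iPP:MF"]))
```

Because the patterns are stated over chunks and fields rather than over word strings, an adjective inserted inside the idiom's NP or PP leaves the chunk sequence, and hence the match, unchanged.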
5 Pattern matching of discontinuous VPs
The detection of a fit between a given sentence and a particular tree structure is called
pattern matching. Matching is useful for identifying the permutations of discontinuous phrases.
For this task, a procedure was developed that makes it possible to match discontinuous strings
(Carl and Rascu, 2006). The dictionary contains only canonical forms, so special dictionary
preprocessing and lookup as well as discontinuous matching are necessary for detecting the
permutations.
Section 5.1 describes the structure of the dictionary, and Section 5.2 describes the rules
that guide the pattern matching process with the help of topological information.
5.1 The dictionary
[OC: Link the generation of variants to the description of the discontinuous pattern in
Section 4.]
The German – English lexicon used in METIS-II has been developed by the IAI⁵ at Saarland
University. It contains more than 600,000 entries, which have been collected over the past
20 years. The bilingual dictionary describes the raw lemma-to-lemma translation: a lemma and
a PoS tag are taken as input, and a TL lemma and a partial TL tag are returned. Each entry
has a German and an English side. During the compilation of the lexicon, both sides of the
entries are lemmatized and tagged, so that they can be matched more easily against the SL
string. The features are represented and used for TL generation. Lexicon entries are
represented in the form of attribute-value pairs. A single German word can be translated
either into a single English word or into a phrase; that is, the two language sides are
independent. The entries also contain additional PoS information for the German and for the
English side. The two sides of a dictionary entry do not necessarily have the same number of
arguments. The German side may carry the meta-information jemanden <jdn>, whereas the English
side may lack the corresponding meta-information somebody <sb>. Take the following lexicon
entry, for instance:
<jdn> auf den Arm nehmen
<=> pull <sos> leg
⁵ Institut der Gesellschaft zur Förderung der Angewandten Informationsforschung
The distance between SL and TL is often large for idioms and other fixed expressions, as
several lemmas of the idiom do not occur as possible translations of the corresponding
entries in the dictionary. Listing such fixed expressions as entries of their own in the
dictionary can solve the problem.
Let us take the above entry auf den Arm nehmen as an example. The compilation process
generates one database entry for the canonical form and one for the permutation nehmen auf
den Arm, where the verb moves to the front of the phrase. The canonical form NP + V and the
permutation V + NP will appear in different kinds of sentences, as we have described above.
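The generation of the two database entries can be sketched as follows. This is a minimal sketch under assumptions, not the actual compilation process: the function name compile_entry and the entry layout are illustrative, the verb is assumed to be the last token of the canonical form, and placeholders such as <jdn> are simply dropped on the German side.

```python
def compile_entry(sl: str, tl: str) -> list:
    """Return entries for the canonical form and the verb-first permutation."""
    # Drop argument placeholders such as <jdn> (an assumption of this sketch).
    tokens = [t for t in sl.split() if not (t.startswith("<") and t.endswith(">"))]
    # Assume the canonical form ends with the verb: complement + V.
    *complement, verb = tokens
    return [
        {"sl": " ".join(complement + [verb]), "tl": tl, "order": "canonical"},
        {"sl": " ".join([verb] + complement), "tl": tl, "order": "non-canonical"},
    ]


for entry in compile_entry("<jdn> auf den Arm nehmen", "pull <sos> leg"):
    print(entry["order"], "->", entry["sl"])
```

For the entry above, the sketch yields the canonical form auf den Arm nehmen and the permutation nehmen auf den Arm, both pointing to the same English side.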
5.2 Topological rules guiding the matching process
[OC: Shorten this and relate it to Section 4.]
During the matching process, the system checks for each word of a multi-word entry whether
it is present in the sentence. However, this alone is not enough. While the words may be
there, they can still appear in a distribution which does not make them an idiomatic
expression. As we have already shown, the parts of an idiomatic VP can appear only in
certain places in the sentence. The matching process can therefore be guided by rules that
check whether the matched words appear in the right place. Otherwise, for an expression like
auf_den_Arm_nehmen, any preposition auf appearing in the sentence could be matched. We want
to match only the preposition that is part of the PP auf_den_Arm. If there is no such
preposition, we can rule out that the sentence contains the idiomatic expression and thus
avoid false positives, called noise.
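The difference between checking mere word presence and checking that the preposition belongs to the right PP can be illustrated with a small sketch. The chunk representation (chunk type plus lemmas), the function names and the second example sentence are assumptions made for illustration; they do not reflect the METIS-II data format.

```python
# Hypothetical chunked sentences: (chunk type, lemmas). Not the METIS-II format.
IDIOM_PP = ["auf", "der", "Arm"]  # lemmatized PP "auf den Arm"
IDIOM_VERB = "nehmen"


def naive_match(chunks: list) -> bool:
    """True if every idiom lemma occurs somewhere in the sentence."""
    lemmas = {lemma for _, chunk_lemmas in chunks for lemma in chunk_lemmas}
    return set(IDIOM_PP + [IDIOM_VERB]) <= lemmas


def chunk_aware_match(chunks: list) -> bool:
    """True only if the idiom lemmas form one PP chunk and the verb is present."""
    has_pp = any(kind == "PP" and lemmas == IDIOM_PP for kind, lemmas in chunks)
    has_verb = any(kind == "V" and IDIOM_VERB in lemmas for kind, lemmas in chunks)
    return has_pp and has_verb


# Er nimmt mich auf den Arm.
idiomatic = [("NP", ["er"]), ("V", ["nehmen"]), ("NP", ["ich"]),
             ("PP", ["auf", "der", "Arm"])]
# Sie nimmt den Koffer und legt den Arm auf den Tisch.
literal = [("NP", ["sie"]), ("V", ["nehmen"]), ("NP", ["der", "Koffer"]),
           ("CONJ", ["und"]), ("V", ["legen"]), ("NP", ["der", "Arm"]),
           ("PP", ["auf", "der", "Tisch"])]

print(naive_match(literal), chunk_aware_match(literal))
print(naive_match(idiomatic), chunk_aware_match(idiomatic))
```

In the second sentence all idiom lemmas are present, so the naive check produces noise, while the chunk-aware check rejects it because auf heads the PP auf den Tisch. The exact comparison of the PP lemmas is a simplification; it would have to be relaxed to allow inserted adjectives.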
The rules used for guiding the matching process are written in the KURD⁶ formalism. A
KURD rule consists of a name, a condition and an action part. A KURD test consists of a list
of nodes, each node being a set of attribute-value features. If the test matches the current
node(s) in the input object, it returns true and the given action is performed; otherwise the
action is skipped or, if one is given, an alternative action is performed.
A schematic representation of a matching rule containing topological information is as
follows:
VerbPattern_lk_mf =
VP [
    V: field = left bracket
    X*: field = middle field OR subordinate clause
    NP: field = middle field
    Y*: End_Of_Sentence OR field = right bracket OR field = post-field
]
: Mark_As_iVP.
⁶ KURD stands for the operations kill, unify, replace and delete. It is capable of other
operations, too.
This rule describes a pattern for a discontinuous iVP, as we have described above. The first
element that is matched should be a verb in the left bracket of the sentence. After that, a
number of elements in the middle field or an inserted subordinate clause can follow. At the
end of the middle field, there is the NP part of the iVP. Either the sentence finishes at that
point, or the iNP is followed by an element in the right bracket or the post-field. This
condition matches sentences like “Der Mann nimmt mich, obwohl ich mich darüber ärgere,
ständig auf den Arm”.
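The logic of this rule can be sketched in Python as a rule object with a name, a condition and an action. This is not KURD syntax and not the METIS-II implementation: the class Rule, the attribute names cat, field, idiom and ivp, and the treatment of an inserted subordinate clause as a middle-field node are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Rule:
    name: str
    condition: Callable[[list], Optional[list]]
    action: Callable[[list, list], None]

    def apply(self, nodes: list) -> bool:
        """Perform the action on the matched nodes if the condition holds."""
        matched = self.condition(nodes)
        if matched is None:
            return False
        self.action(nodes, matched)
        return True


def lk_mf_condition(nodes: list) -> Optional[list]:
    """Return the indices of idiom verb and idiom NP/PP, or None."""
    for i, node in enumerate(nodes):
        # V: idiom verb in the left bracket
        if not (node.get("idiom") and node["cat"] == "V" and node["field"] == "LK"):
            continue
        j = i + 1
        # X*: non-idiom material in the middle field (incl. inserted subclauses)
        while j < len(nodes) and nodes[j]["field"] == "MF" and not nodes[j].get("idiom"):
            j += 1
        # NP: nominal part of the idiom, still in the middle field
        if j == len(nodes) or not (nodes[j].get("idiom") and nodes[j]["field"] == "MF"):
            continue
        # Y*: end of sentence, or only right-bracket / post-field material
        if all(n["field"] in ("RK", "NF") for n in nodes[j + 1:]):
            return [i, j]
    return None


def mark_as_ivp(nodes: list, matched: list) -> None:
    """Action part: flag the matched nodes as belonging to an iVP."""
    for k in matched:
        nodes[k]["ivp"] = True


rule = Rule("VerbPattern_lk_mf", lk_mf_condition, mark_as_ivp)

# Der Mann nimmt mich, obwohl ich mich darüber ärgere, ständig auf den Arm.
sentence = [
    {"cat": "NP", "field": "VF"},
    {"cat": "V", "field": "LK", "idiom": True},
    {"cat": "NP", "field": "MF"},
    {"cat": "SUB", "field": "MF"},
    {"cat": "ADV", "field": "MF"},
    {"cat": "PP", "field": "MF", "idiom": True},
]
print(rule.apply(sentence))
```

If another middle-field constituent followed the idiom's PP, the final check would fail and no node would be marked.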
6 Evaluation of the discontinuous matching
We constructed two test corpora consisting of
• 32 sentences with canonical order, e.g.
de: Grünes Licht geben müssen die Eltern den Kindern.
en (literally): Green light give must the parents the children.
en: The parents must give the green light to the children.
• 19 sentences with non-canonical order, e.g.
de: Die Welt geht nicht vor die Hunde.
en (literally): The world goes not before the dogs.
en: The world does not go to the dogs.
We ran the matching procedure on the given sentences and manually evaluated right and
wrong matches of idiomatic expressions. In addition, we compared it to an earlier version of
matching rules which would simply test what kind of clause the iVP appears in, assuming that
the canonical order should be more or less reserved for subordinate clauses. The following
table reflects our findings:
Order          Rules      Precision (%)  Recall (%)  Noise  Miss
non-canonical  old match  100            89          0      2
non-canonical  new match  100            89          0      2
canonical      old match   77            53          5      15
canonical      new match  100            91          0      3
Wrongly matched sentences fall into two categories, misses and noise. A miss is a sentence
in which the phrase has not been matched at all; noise is a sentence that has been matched,
but in a wrong way. A sentence may show both a miss and noise at the same time.
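The formulas behind the reported precision and recall are not stated explicitly. One reading that reproduces all reported figures, and which is therefore only an assumption, counts a hit for every sentence without a miss and computes recall as hits over sentences and precision as hits over hits plus noise. The function name scores and the dictionary RUNS are illustrative.

```python
def scores(sentences: int, miss: int, noise: int) -> tuple:
    """Return (precision, recall) in percent, rounded to whole numbers."""
    hits = sentences - miss  # assumption: every sentence without a miss is a hit
    precision = round(100 * hits / (hits + noise))
    recall = round(100 * hits / sentences)
    return precision, recall


# (order, rules) -> (number of test sentences, misses, noise), as reported
RUNS = {
    ("non-canonical", "old match"): (19, 2, 0),
    ("non-canonical", "new match"): (19, 2, 0),
    ("canonical", "old match"): (32, 15, 5),
    ("canonical", "new match"): (32, 3, 0),
}

for (order, rules), (n, miss, noise) in RUNS.items():
    precision, recall = scores(n, miss, noise)
    print(f"{order:14} {rules:10} precision={precision} recall={recall}")
```

Under this reading, the 19 non-canonical and 32 canonical test sentences yield exactly the precision and recall values given in the table.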
[OC: brief evaluation and discussion of the figures]
[OC: we had solved the problem described in the following paragraph shortly before PALC]
A problem that we faced very often was that the same article occurred twice in the
sentence, so that the matcher picked the article that was not part of the idiom. The
morphological program could not identify whether the article belonged to the discontinuous
phrase or to another common noun in the sentence. We solved the problem with a KURD rule
that sets appropriate constraints.
7 Conclusion and further work
Idioms concern many researchers and have always been a difficult task for MT. Our matching
procedure within METIS-II, a hybrid SMT system, has been improved since last year. We
amended the existing IAI rules, and in the evaluation the new rules raised precision from
77 to 100 and recall from 53 to 91 for sentences with canonical order, while the results for
non-canonical order remained unchanged. METIS-II is now able to match not only continuous
but also discontinuous phrases.
There are still more complex patterns to be researched, such as Der Bock macht sich selbst
zum Gärtner – The fox is set by itself to keep the geese.
[OC: summarize the following two paragraphs briefly as future work, insert one or two
examples from the beginning]
Abeillé and Schabes (1989) examine various discontinuities occurring with both “fixed” and
“flexible” idioms. The former are those idioms whose frozen arguments no syntactic or lexical
rule can be applied to. In other words, they cannot be relativized, passivized etc. Even
idioms of this kind cannot be syntactically reanalyzed; rather, they need to be assigned an
internal syntactic structure. All insertions are regularly predictable from the syntactic
category of the idiomatic element that is modified, for example the adjective proverbial in
He kicked the proverbial bucket. Unbounded discontinuities can arise from the insertion of
auxiliaries or adverbials between the frozen subject and the verb: All hell seemed to be
likely to break loose.
The “flexible” idioms show the same discontinuities as free sentences and are often assigned
the same syntactic structures. Unbounded discontinuities may arise from the passivization of
an idiom with a frozen complement, e.g. The beans were spilled by him. Adverbials and
auxiliaries can be inserted regularly into an idiom with a frozen subject: The beans continue
to appear to be certain to be spilled (Wasow et al., 1983).
It should also be checked whether the patterns applied so far are applicable to all
support-verb constructions (SVCs).
REFERENCES
[OC: Volk (1998) is still missing here!]
Abeillé, A. and Y. Schabes. (1989). “Parsing idioms in lexicalized TAGs”. Proceedings of the
fourth conference on European chapter of the Association for Computational Linguistics.
Manchester, England: 1-9.
Carl, M. and E. Rascu. (2006). “A Dictionary Lookup Strategy for Translating Discontinuous
Phrases”. Proceedings of EAMT. Oslo: 16-21.
Dirix, P., Schuurman, I. and V. Vandeghinste. (2005). “Example-based machine translation
using monolingual corpora: System description”. Proceedings of MT Summit X,
Workshop on EBMT. Phuket, Thailand: 43-50.
Dologlou, Y., Markantonatou, S., Tambouratzis, G., Yannoutsou, O., Fourla, A. and N.
Ioannou. (2003). “Using Monolingual Corpora for Statistical Machine Translation: The
METIS System”. Proceedings of EAMT - CLAW 2003, Dublin: 61-68.
Drach, E. (1963)[1940]. Grundgedanken der deutschen Satzlehre. Wissenschaftliche
Buchgesellschaft, Darmstadt, Germany.
DUDEN Redaktion. (1998). Grammatik der deutschen Gegenwartssprache. Mannheim,
Germany.
Vandeghinste, V., Dirix, P. and I. Schuurman. (2005). “Example-based Translation without
Parallel Corpora: First experiments on a prototype”. Proceedings of MT Summit X,
Workshop on EBMT. Phuket, Thailand: 135-142.
Wasow, T., Sag, I., and G. Nunberg. (1983). “Idioms: An interim report”. In: Hattori, S. and
K. Inoue (eds.) Proceedings of the XIIIth International Congress of Linguistics. Tokyo:
102-115.
