Automatically Enriching a Thesaurus with Information from Dictionaries

Hugo Gonçalo Oliveira (supported by FCT scholarship grant SFRH/BD/44955/2008)
Paulo Gomes
{hroliv,pgomes}@dei.uc.pt
Cognitive & Media Systems Group
CISUC, Universidade de Coimbra
KDBI, EPIA 2011
October 11, 2011
Index
1. Introduction
2. Proposed approach
3. Enriching TeP with synonymy in PAPEL
4. Evaluation
5. Concluding remarks
Introduction
Lexical knowledge bases
- Thesaurus, lexical networks, lexical ontologies, ...
- Structured on words and their meanings
- Try to cover the whole language
- No specific domain
- Essential for developing NLP tools for a language
  - Useful for NLP tasks (e.g. word-sense disambiguation, question-answering, determining similarities, ...)
  - See Princeton WordNet [Fellbaum, 1998]
Introduction
Free lexical knowledge bases for Portuguese
- Public domain thesaurus:
  - TeP [Maziero et al., 2008]
  - OpenThesaurus.PT (http://openthesaurus.caixamagica.pt/)
- Collaborative dictionary
  - Portuguese Wiktionary (http://pt.wiktionary.org/)
- Public domain lexical network
  - PAPEL [Gonçalo Oliveira et al., 2010]
- Lexical ontology [coming soon]
  - Onto.PT [Gonçalo Oliveira and Gomes, 2010]
- More complementary than overlapping
- Fruitful to merge some of them into a single, broader resource
Introduction
This work
Integrate synonymy information from dictionaries in a thesaurus
1. Extraction of synpairs from dictionaries
2. Assigning synpairs to synsets
3. Clustering remaining pairs
Apply the procedure in the enrichment of TeP with PAPEL
Proposed approach
Extracting synpairs from dictionaries
mente, n: cérebro, cabeça, intelecto
[mind, n: brain, head, intellect]
- (cérebro, mente) (cabeça, mente) (intelecto, mente)
  [(brain, mind) (head, mind) (intellect, mind)]
máquina, n: o mesmo que computador
[machine, n: the same as computer]
- (computador, máquina)
  [(computer, machine)]
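The two definition patterns above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `extract_synpairs`, the regular expression for "o mesmo que X", and the rule that single-word items in a comma-separated definition are synonyms of the headword are all hypothetical. The slides do not detail the actual extraction rules.

```python
import re

# Hypothetical pattern for definitions of the form "o mesmo que X" ("the same as X").
SAME_AS = re.compile(r"^o mesmo que\s+(\w+)$")

def extract_synpairs(headword, definition):
    """Return the synpairs found in one definition, each as a sorted tuple."""
    definition = definition.strip()
    match = SAME_AS.match(definition)
    if match:
        synonyms = [match.group(1)]
    else:
        # Assumption: single-word items of a comma-separated definition are
        # treated as synonyms of the headword; multi-word items are ignored.
        parts = [part.strip() for part in definition.split(",")]
        synonyms = [part for part in parts if part and " " not in part]
    return {tuple(sorted((headword, syn))) for syn in synonyms if syn != headword}

synpairs = extract_synpairs("mente", "cérebro, cabeça, intelecto")
synpairs |= extract_synpairs("máquina", "o mesmo que computador")
```

Storing each pair as a sorted tuple makes (cérebro, mente) and (mente, cérebro) the same synpair, so the set holds every pair once.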
Proposed approach
$p = (w_x, w_y) + S_a = (w_1, w_2, \ldots, w_n) \rightarrow S_a = (w_1, w_2, \ldots, w_n, w_x, w_y)$
Synonymy graph $G$
- All the extracted synpairs
- Nodes represent words (e.g. $w_x$, $w_y$)
- $p = (w_x, w_y)$ establishes an edge between $w_x$ and $w_y$
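As a rough sketch, the synonymy graph can be built from a list of synpairs and turned into a 0/1 adjacency matrix. The function names are hypothetical, and setting the diagonal to 1 (each word adjacent to itself) is an assumption; the slides do not say how the diagonal is handled.

```python
from collections import defaultdict

def build_synonymy_graph(synpairs):
    """Map each word to the set of words it shares a synpair with."""
    graph = defaultdict(set)
    for wx, wy in synpairs:
        graph[wx].add(wy)
        graph[wy].add(wx)
    return dict(graph)

def adjacency_matrix(graph):
    """Return the sorted word list and a symmetric 0/1 matrix as a list of rows."""
    words = sorted(graph)
    # Assumption: the diagonal is 1, i.e. a word is adjacent to itself.
    matrix = [[1 if u == v or v in graph[u] else 0 for v in words] for u in words]
    return words, matrix

synpairs = [("cérebro", "mente"), ("cabeça", "mente"),
            ("intelecto", "mente"), ("computador", "máquina")]
graph = build_synonymy_graph(synpairs)
words, M = adjacency_matrix(graph)
```

Because the graph is undirected, the matrix is symmetric, so a column and the corresponding row carry the same adjacency vector.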
Proposed approach
For each synpair $p = (w_x, w_y)$:
1. If $\exists S_i \in T : w_x \in S_i \wedge w_y \in S_i$, nothing is done.
2. Select all synsets $C_j \in C : C \subset T$, $C = \{C_1, C_2, \ldots, C_n\}$, $\forall (C_j \in C) : w_x \in C_j \vee w_y \in C_j$.
3. If $|C| = 1$, $p + C_1$.
4. Compute the adjacency vector $[p] = [w_x] + [w_y]$. The adjacency vector of a word is a column of the matrix $M$, $[w_j] = [M_j]$;
5. Compute the adjacency vector of each $C_j \in C$: $[C_j] = \sum_{k=1}^{|C_j|} [w_k] : w_k \in C_j$;
6. Select the most similar synset $C_{best} : sim(p, C_{best}) = \max(sim(p, C_j))$ (a);
7. $p + C_{best}$.
(a) Any measure for computing the similarity of two vectors can be used.
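The assignment steps can be sketched as below. This is a minimal illustration, not the authors' implementation: the thesaurus is a list of Python sets, adjacency vectors are `Counter` objects instead of matrix columns, the cosine is used as `sim`, each word is assumed adjacent to itself, and the toy words and edges are invented for the example.

```python
import math
from collections import Counter, defaultdict

def build_graph(synpairs):
    """Synonymy graph; assumption: every word is also adjacent to itself."""
    graph = defaultdict(set)
    for wx, wy in synpairs:
        graph[wx] |= {wx, wy}
        graph[wy] |= {wx, wy}
    return graph

def adjacency_vector(word, graph):
    # Column of M for `word`: 1 for each adjacent word.
    return Counter({neighbour: 1 for neighbour in graph.get(word, ())})

def synset_vector(synset, graph):
    # [Cj]: sum of the adjacency vectors of the words in the synset (step 5).
    return sum((adjacency_vector(w, graph) for w in synset), Counter())

def cosine(u, v):
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

def assign(pair, thesaurus, graph):
    """Add the pair to one synset; return its index, or None if nothing is done."""
    wx, wy = pair
    if any(wx in s and wy in s for s in thesaurus):           # step 1
        return None
    candidates = [i for i, s in enumerate(thesaurus) if wx in s or wy in s]  # step 2
    if not candidates:
        return None                                            # pair is left for clustering
    if len(candidates) == 1:                                   # step 3
        best = candidates[0]
    else:
        p_vec = adjacency_vector(wx, graph) + adjacency_vector(wy, graph)    # step 4
        best = max(candidates,                                 # steps 5 and 6
                   key=lambda i: cosine(p_vec, synset_vector(thesaurus[i], graph)))
    thesaurus[best] |= {wx, wy}                                # step 7
    return best

# Toy data, invented for illustration only.
graph = build_graph([("plano", "gizamento"), ("gizamento", "projeto"),
                     ("gizamento", "esquema"), ("projeto", "esquema"),
                     ("plano", "planície"), ("planície", "planura")])
thesaurus = [{"plano", "planície", "planura"}, {"plano", "projeto", "esquema"}]
chosen = assign(("gizamento", "plano"), thesaurus, graph)
```

In this toy graph "gizamento" is linked to "projeto" and "esquema", so the second synset shares more adjacencies with the pair and is selected.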
Proposed approach
$G'$ is established by the remaining pairs
1. Sparse matrix $M'$ ($|N| \times |N|$)
2. $M'_{ij} = sim([w_i], [w_j])$
3. Normalise the columns of $M$, so that $\sum_{k=1}^{|M_j|} M_{jk} = 1$
4. Extract cluster $S_i$ from each row $M'_i$, with the words $w_j$ where $M'_{ij} > \theta$
5. For each $S_i$ with $S_i \cup S_j = S_j$ and $S_i \cap S_j = S_i$ (i.e. $S_i \subseteq S_j$), $S_i$ is discarded.
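A small sketch of the clustering steps follows, under several assumptions: the similarity is the cosine between binary adjacency vectors, each word is adjacent to itself, the normalisation is applied to the similarity matrix, and a dense list-of-lists matrix stands in for the sparse one. The function name `cluster` and the toy pairs are hypothetical.

```python
import math
from collections import defaultdict

def cluster(synpairs, theta):
    """Cluster the words of the remaining synpairs; returns a list of frozensets."""
    graph = defaultdict(set)
    for wx, wy in synpairs:
        # Assumption: every word is adjacent to itself.
        graph[wx] |= {wx, wy}
        graph[wy] |= {wx, wy}
    words = sorted(graph)
    n = len(words)

    def sim(a, b):
        # Cosine between two binary adjacency vectors.
        return len(graph[a] & graph[b]) / math.sqrt(len(graph[a]) * len(graph[b]))

    matrix = [[sim(a, b) for b in words] for a in words]             # step 2
    col_sums = [sum(matrix[i][j] for i in range(n)) for j in range(n)]
    matrix = [[matrix[i][j] / col_sums[j] for j in range(n)]         # step 3
              for i in range(n)]
    clusters = {frozenset(words[j] for j in range(n) if matrix[i][j] > theta)
                for i in range(n)}                                   # step 4
    # Step 5: discard any cluster that is contained in a larger one.
    return [c for c in clusters if not any(c < other for other in clusters)]

# Toy pairs, invented for illustration only.
remaining = [("carro", "automóvel"), ("automóvel", "viatura"),
             ("carro", "viatura"), ("cão", "cachorro")]
clusters = cluster(remaining, theta=0.2)
```

Since each column is divided by its sum, words with many similar neighbours get smaller entries, so the value chosen for the threshold has to take cluster size into account.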
Coverage of the synpairs by TeP
POS         Synpairs  In TeP   |C| = 0  |C| = 1  |C| > 1  Avg. |C|
Nouns       37,452    27.38%   14.98%   12.01%   45.63%   3.86
Verbs       21,465    43.01%   1.34%    4.04%    51.66%   6.64
Adjectives  19,073    37.60%   5.58%    8.22%    48.60%   4.26
|C|: number of candidate synsets
Experimentation was performed using the cosine similarity
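The four coverage categories in the table can be computed with a short routine like the one below. It is a sketch under assumptions: the thesaurus is a list of sets, the category labels are hypothetical names, and the toy data is invented rather than taken from TeP or PAPEL.

```python
from collections import Counter

def coverage(synpairs, thesaurus):
    """Percentage of synpairs per category: already covered, |C| = 0, |C| = 1, |C| > 1."""
    counts = Counter()
    for wx, wy in synpairs:
        if any(wx in s and wy in s for s in thesaurus):
            counts["in thesaurus"] += 1
            continue
        # Candidate synsets contain exactly one of the two words at this point.
        n_candidates = sum(1 for s in thesaurus if wx in s or wy in s)
        if n_candidates == 0:
            counts["|C| = 0"] += 1
        elif n_candidates == 1:
            counts["|C| = 1"] += 1
        else:
            counts["|C| > 1"] += 1
    return {label: 100.0 * count / len(synpairs) for label, count in counts.items()}

# Toy data, invented for illustration only.
toy_thesaurus = [{"a", "b"}, {"c"}, {"c", "d"}]
toy_pairs = [("a", "b"), ("a", "x"), ("c", "y"), ("p", "q")]
shares = coverage(toy_pairs, toy_thesaurus)
```

Pairs with no candidate synset are the ones left for the clustering step.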
Results – words
Thesaurus          POS         Total words  Ambiguous  Avg(senses)  Most ambig.
TeP 2.0            Nouns       17,158       5,805      1.71         20
TeP 2.0            Verbs       10,827       4,905      2.08         41
TeP 2.0            Adjectives  14,586       3,735      1.46         19
After assignments  Nouns       23,775       10,418     2.09         37
After assignments  Verbs       12,818       7,094      2.64         42
After assignments  Adjectives  17,158       6,294      1.83         22
Clusters           Nouns       8,546        701        1.15         8
Clusters           Verbs       502          8          1.02         3
Clusters           Adjectives  1,858        39         1.03         4
Final thesaurus    Nouns       30,369       12,045     1.96         38
Final thesaurus    Verbs       13,090       7,221      2.62         42
Final thesaurus    Adjectives  18,525       6,550      1.80         23
Results – synsets
Thesaurus          POS         Total synsets  Avg(size)  size = 2  size > 25  max(size)
TeP 2.0            Nouns       8,254          3.56       3,079     0          21
TeP 2.0            Verbs       3,978          5.67       939       48         53
TeP 2.0            Adjectives  6,066          3.50       3,033     19         43
After assignments  Nouns       8,254          6.01       1,930     179        150
After assignments  Verbs       3,978          8.50       702       217        148
After assignments  Adjectives  6,066          5.17       2,369     120        110
Clusters           Nouns       3,524          2.78       2,247     0          13
Clusters           Verbs       220            2.34       174       0          6
Clusters           Adjectives  820            2.33       656       0          10
Final thesaurus    Nouns       11,778         5.05       4,177     179        150
Final thesaurus    Verbs       4,198          8.18       876       217        148
Final thesaurus    Adjectives  6,886          4.84       3,025     120        110
Evaluation
Assignments evaluation
- Manual evaluation of sample assignments
- Two judges for each assignment

POS         Sample            Correct       Incorrect    Agreement
Nouns       100 assigns. × 2  153 (76.50%)  47 (23.50%)  77.00%
Verbs       100 assigns. × 2  142 (71.00%)  58 (29.00%)  74.00%
Adjectives  100 assigns. × 2  151 (75.50%)  49 (24.50%)  75.00%

POS         Synpair                 Synset                                           Judge 1  Judge 2
Nouns       (escrutínio, votação)   votação; voto; sufrágio                          1        1
Nouns       (decisão, desempate)    resolução; objetivação; tenção; intenção         0        1
Nouns       (plano, gizamento)      planície; chã; chanura; plaino; plano; planura   0        0
Verbs       (venerar, homenagear)   venerar; cultuar; adorar; idolatrar              1        1
Verbs       (atacar, combater)      atacar; inciar                                   0        1
Verbs       (obter, rapar)          depilar; despelar; pelar; raspar; rapar; rascar  0        0
Adjectives  (grandioso, épico)      admirável; fabuloso; grandioso                   1        1
Adjectives  (delicado, requintado)  difícil; complicado; delicado                    0        1
Adjectives  (falido, queimado)      queimado; incendiado                             0        0
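The figures in the evaluation table can be reproduced with simple arithmetic, sketched below. Two assumptions are made: "Correct" counts judgments pooled over both judges (which matches 153 of 200 being 76.50%), and "Agreement" is the share of items on which the two judges gave the same label. The input lists are the Judge 1 and Judge 2 values of the nine example assignments, read column by column, which is itself an assumption about the flattened table.

```python
def evaluate(judge1, judge2):
    """Return (percentage of correct judgments, percentage of agreement)."""
    judgments = list(judge1) + list(judge2)
    correct = 100.0 * sum(judgments) / len(judgments)
    same = sum(1 for a, b in zip(judge1, judge2) if a == b)
    agreement = 100.0 * same / len(judge1)
    return correct, agreement

# 1 = assignment judged correct, 0 = judged incorrect.
judge1 = [1, 0, 0, 1, 0, 0, 1, 0, 0]
judge2 = [1, 1, 0, 1, 1, 0, 1, 1, 0]
correct, agreement = evaluate(judge1, judge2)
```

The nine examples are not a random sample, so these two values only illustrate the computation and say nothing about the quality of the method.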
Evaluation
Clustering
- Manual evaluation of clusters
- Two judges for each cluster
- Cluster is correct if, in some context, all its words might have the same meaning

Table: Evaluation of clustering
POS         Sample   Correct       Incorrect    Agreement
Nouns       105 × 2  179 (85.24%)  31 (14.76%)  91.43%
Verbs       105 × 2  193 (91.90%)  17 (8.10%)   87.62%
Adjectives  105 × 2  189 (90.00%)  21 (10.00%)  85.71%

Figure: Examples of connected subgraphs and resulting clusters.
Concluding remarks
Update: computing similarity
- Sum the adjacencies
  - One vector per synset: $[C_j] = \sum_{k=1}^{|C_j|} [w_k] : w_k \in C_j$
  - $sim(p, C_j) = sim([p], [C_j])$
- Average similarity of the pair with each synset element
  - One vector per synset element: $[C_j] = ([w_1], \ldots, [w_n]), n = |C_j|$
  - $sim(p, C_j) = \frac{\sum_{k=1}^{|C_j|} \cos([p], [M_{w_k}])}{|C_j|}, w_k \in C_j$
- Gold resource of 220 synpairs and possible assignments
- Variable cut point $\theta$ on similarity
  - Possible to assign the same synpair to $0 \leq n \leq |C|$ synsets
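The two ways of computing $sim(p, C_j)$, together with the cut point, can be compared in a short sketch. It assumes that adjacency vectors are given as `Counter` objects, that the cosine is used in both variants, and that a synset is selected when its similarity is strictly greater than $\theta$. The function names and the toy vectors are hypothetical.

```python
import math
from collections import Counter

def cosine(u, v):
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

def sim_sum(p_vec, word_vecs):
    """Sum the adjacencies: one vector per synset, compared with [p]."""
    return cosine(p_vec, sum(word_vecs, Counter()))

def sim_avg(p_vec, word_vecs):
    """Average similarity of the pair with each synset element."""
    return sum(cosine(p_vec, vec) for vec in word_vecs) / len(word_vecs)

def assign_above(p_vec, candidates, theta, sim=sim_avg):
    """Names of all candidate synsets whose similarity exceeds the cut point."""
    return [name for name, vecs in candidates.items() if sim(p_vec, vecs) > theta]

# Toy vectors, invented for illustration only.
p_vec = Counter(a=1, b=1)
candidates = {
    "C1": [Counter(a=1, b=1), Counter(a=1)],
    "C2": [Counter(c=1)],
}
selected = assign_above(p_vec, candidates, theta=0.5)
```

With a cut point instead of a single maximum, the result may be empty or may hold several synsets, which corresponds to assigning a synpair to between 0 and $|C|$ synsets.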
Concluding remarks
Final remarks
- Flexible method for enriching a thesaurus with synonymy information from dictionaries
- Applied to the enrichment of a Portuguese thesaurus
- This work was carried out within the scope of Onto.PT
  - Automatic creation of a lexical ontology for Portuguese
  - Extraction + integration of lexical information from textual sources
  - Soon freely available!
  - Check http://ontopt.dei.uc.pt
Thank you!
References
[Fellbaum, 1998] Fellbaum, C., editor (1998).
WordNet: An Electronic Lexical Database (Language, Speech, and Communication).
The MIT Press.
[Gonçalo Oliveira and Gomes, 2010] Gonçalo Oliveira, H. and Gomes, P. (2010).
Onto.PT: Automatic Construction of a Lexical Ontology for Portuguese.
In Proc. 5th European Starting AI Researcher Symposium (STAIRS 2010). IOS Press.
[Gonçalo Oliveira et al., 2010] Gonçalo Oliveira, H., Santos, D., and Gomes, P. (2010).
Extracção de relações semânticas entre palavras a partir de um dicionário: o PAPEL e sua avaliação.
Linguamática, 2(1):77–93.
[Maziero et al., 2008] Maziero, E. G., Pardo, T. A. S., Felippo, A. D., and Dias-da-Silva, B. C. (2008).
A Base de Dados Lexical e a Interface Web do TeP 2.0 - Thesaurus Eletrônico para o Português do Brasil.
In VI Workshop em Tecnologia da Informação e da Linguagem Humana (TIL), pages 390–392.