Onto.PT: Automatic construction of a Lexical Ontology for Portuguese

Hugo Gonçalo Oliveira (1), Paulo Gomes
{hroliv,pgomes}@dei.uc.pt
Cognitive & Media Systems Group
CISUC, University of Coimbra
STAIRS 2010, Lisbon, August 16, 2010
(1) Supported by FCT scholarship grant SFRH/BD/44955/2008
Outline
1. Introduction
   Lexical ontologies
   Goals
2. Approach
   Information extraction from text
   Synset discovery
   Merging synset-based resources
   Weighting triples
   Assigning terms to synsets
   Knowledge organisation
3. Current results
   Relation extraction
   Synset discovery
   Wordnet establishment
4. Concluding remarks
Introduction
Today's applications
Need to understand information conveyed by natural language
Therefore, they demand better access to knowledge about words and their meanings!
This knowledge is encoded in lexical ontologies
Introduction
Lexical ontologies
Such as Princeton WordNet [Fellbaum, 1998]
▶ Ontology + lexicon [Hirst, 2004]
▶ Knowledge structured on words and their meanings
▶ Cover the whole language
▶ Not based on a specific domain
Typically handcrafted...
▶ Construction and maintenance involve time-consuming human effort!
Introduction
Goals
Onto.PT
Automatic construction of a lexical ontology for Portuguese
Extracted from different sources
▶ Manually created thesauri
▶ Language dictionaries/encyclopedias
▶ Corpora
Modelled after Princeton WordNet
▶ Synsets: groups of synonymous words
▶ Synset-based relational triples
Word sense disambiguation (WSD) based on the knowledge already extracted, not on the context
Approach
Information extraction from text
Examples
From dictionaries:
▶ tenreiro, n -- terneiro, novilho ou bezerro. [all three are words for a calf or young bull]
  → terneiro SYNONYM OF tenreiro
  → novilho SYNONYM OF tenreiro
  → bezerro SYNONYM OF tenreiro
▶ ébola, n -- virose que provoca febres e hemorragias [a viral disease that causes fevers and haemorrhages]
  → ébola CAUSATION OF febres
  → ébola CAUSATION OF hemorragias
From textual corpora:
▶ O automobilismo (também conhecido como corridas de automóveis ou desporto motorizado) é um desporto... [Motor racing (also known as car racing or motorsport) is a sport...]
  → automobilismo SYNONYM OF corridas de automóveis
  → automobilismo SYNONYM OF desporto motorizado
  → desporto HYPERNYM OF automobilismo
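
The dictionary examples can be reproduced with a small pattern-based extractor. This is only a minimal sketch: the two regular expressions, the function names and the `headword, pos -- definition` entry layout are assumptions made for illustration, and the grammars actually used in Onto.PT are not shown in these slides.

```python
import re

Triple = tuple[str, str, str]


def split_enumeration(text: str) -> list[str]:
    """Split an enumeration such as 'x, y ou z' or 'x e y' into its items."""
    return [item for item in re.split(r"\s*,\s*|\s+(?:e|ou)\s+", text) if item]


def extract_triples(entry: str) -> list[Triple]:
    """Extract term-based triples from one 'headword, pos -- definition' entry."""
    head, definition = entry.split(" -- ", 1)
    headword = head.split(",")[0].strip()
    definition = definition.strip().rstrip(".")

    # Hypothetical causation pattern: "... que provoca X e Y"
    cause = re.search(r"\bque provoca\s+(.+)$", definition)
    if cause:
        return [(headword, "CAUSATION OF", effect)
                for effect in split_enumeration(cause.group(1))]

    # Hypothetical synonymy pattern: the definition is a plain list of single words
    items = split_enumeration(definition)
    if items and all(re.fullmatch(r"\w+", item) for item in items):
        return [(item, "SYNONYM OF", headword) for item in items]
    return []


synonyms = extract_triples("tenreiro, n -- terneiro, novilho ou bezerro.")
causes = extract_triples("ébola, n -- virose que provoca febres e hemorragias")
```

Under these assumptions, `synonyms` holds the three SYNONYM OF triples for tenreiro and `causes` holds the two CAUSATION OF triples for ébola listed on the slide.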
Approach
Synset discovery
[Figure: synonymy lexical network – example]
Synonymy networks tend to have a clustered structure
Goal: identify synsets by taking advantage of clusters
Approach: clustering algorithm over the synonymy lexical network (see poster [Oliveira and Gomes, 2010])
Keep ambiguity: clusters might overlap!
[Figure: clustering – example]
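
To make the idea of overlapping clusters concrete, the sketch below enumerates the maximal cliques of a small synonymy network, so that an ambiguous word ends up in more than one cluster. Clique enumeration (Bron–Kerbosch) is a stand-in chosen for brevity and is not the clustering algorithm of the poster; the synonymy pairs are also invented for illustration.

```python
from collections import defaultdict


def build_adjacency(pairs):
    """Build an undirected adjacency map from synonymy pairs."""
    adj = defaultdict(set)
    for x, y in pairs:
        adj[x].add(y)
        adj[y].add(x)
    return dict(adj)


def maximal_cliques(adj):
    """Enumerate maximal cliques with the basic Bron-Kerbosch recursion."""
    cliques = []

    def expand(clique, candidates, excluded):
        if not candidates and not excluded:
            cliques.append(frozenset(clique))
            return
        for v in list(candidates):
            expand(clique | {v}, candidates & adj[v], excluded & adj[v])
            candidates = candidates - {v}
            excluded = excluded | {v}

    expand(set(), set(adj), set())
    return cliques


# Hypothetical synonymy pairs (not taken from the actual resources)
pairs = [
    ("diva", "beldade"), ("diva", "beleza"), ("beldade", "beleza"),
    ("diva", "deusa"), ("diva", "divindade"), ("deusa", "divindade"),
    ("divindade", "deus"), ("divindade", "nume"), ("deus", "nume"),
]
clusters = maximal_cliques(build_adjacency(pairs))
```

Here `diva` and `divindade` each belong to two clusters, which is how ambiguity is kept rather than resolved at this stage.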
Approach
Merging synset-based resources
Merging synsets from different thesauri
For each synset $T_i \in T$, select the $B_j \in B$ with the highest Jaccard coefficient $c(T_i, B_j) = \frac{|T_i \cap B_j|}{|T_i \cup B_j|}$
B1 = (diva, beldade, beleza, deidade, deusa, divindade)
B2 = (divindade, deidade, deus, nume)
T1 = (divindade, diva, deusa)
▶ $c(T_1, B_1) = \frac{3}{6} = \frac{1}{2}$
▶ $c(T_1, B_2) = \frac{1}{6}$
N = B1 ∪ T1 = (diva, beldade, beleza, deidade, deusa, divindade)
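
The merging step can be sketched in a few lines. The function names are illustrative, and the handling of a synset with no overlapping counterpart (it is appended unchanged) is an assumption, since the slide does not cover that case.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard coefficient |a ∩ b| / |a ∪ b|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def merge_thesauri(T: list[set], B: list[set]) -> list[set]:
    """Merge each synset of T into the most similar synset of B."""
    merged = [set(b) for b in B]
    for t in T:
        best = max(range(len(B)), key=lambda j: jaccard(t, B[j]))
        if jaccard(t, B[best]) > 0:
            merged[best] |= t
        else:
            # Assumption: a synset with no overlap is kept as a new synset
            merged.append(set(t))
    return merged


B1 = {"diva", "beldade", "beleza", "deidade", "deusa", "divindade"}
B2 = {"divindade", "deidade", "deus", "nume"}
T1 = {"divindade", "diva", "deusa"}

merged = merge_thesauri([T1], [B1, B2])
```

With the sets from the slide, T1 is closer to B1 than to B2, so `merged[0]` is B1 ∪ T1 and B2 is left untouched.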
Approach
Weighting triples
Frequency of extraction
Corpus distributional similarity metrics (e.g. LSA [Deerwester et al., 1990], PMI [Turney, 2001])
Web distributional similarity metrics (e.g. WebJaccard, WebOverlap [Bollegala et al., 2007])
Define filters based on weights
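
As an illustration of how such weights might be computed from raw counts, the sketch below implements PMI, WebJaccard and WebOverlap in their usual count-based form, followed by a threshold filter. All counts, the `min_hits` cut-off and the threshold value are hypothetical; the slides do not state which settings were used.

```python
import math


def pmi(count_xy: int, count_x: int, count_y: int, total: int) -> float:
    """Pointwise mutual information from co-occurrence counts (log base 2)."""
    if count_xy == 0:
        return float("-inf")
    return math.log2((count_xy * total) / (count_x * count_y))


def web_jaccard(hits_p: int, hits_q: int, hits_pq: int, min_hits: int = 0) -> float:
    """Page-count Jaccard: hits(P and Q) / (hits(P) + hits(Q) - hits(P and Q))."""
    if hits_pq <= min_hits:
        return 0.0
    return hits_pq / (hits_p + hits_q - hits_pq)


def web_overlap(hits_p: int, hits_q: int, hits_pq: int, min_hits: int = 0) -> float:
    """Page-count overlap: hits(P and Q) / min(hits(P), hits(Q))."""
    if hits_pq <= min_hits:
        return 0.0
    return hits_pq / min(hits_p, hits_q)


def filter_triples(weighted_triples, threshold: float):
    """Keep only the triples whose weight reaches the threshold."""
    return [triple for triple, weight in weighted_triples if weight >= threshold]


# Hypothetical page counts for the two arguments of two triples
weighted = [
    (("desporto", "HYPERNYM OF", "automobilismo"), web_jaccard(100, 50, 25)),
    (("cauda", "PART OF", "cometa"), web_jaccard(100, 50, 0)),
]
kept = filter_triples(weighted, threshold=0.1)
```

A filter of this kind simply discards low-weight triples; choosing the threshold is a separate tuning decision.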
Approach
Assigning terms to synsets
Mapping methods
Input:
▶ Thesaurus T, containing synsets
▶ Term-based semantic network N, where each edge has a type R
Goal: map each term-based triple a R b ∈ N to a synset-based triple A R B, with A, B ∈ T
Output: semantic network W whose nodes are synsets, related to other synsets by means of semantic relations (a wordnet)
Mapping procedures
Baseline
▶ A and B are random synsets containing a and b, respectively
Related proportion [Gonçalo Oliveira and Gomes, 2010]
Cosine similarity
Related proportion
Assignment of a (in a R b) to A:
1. Fix b
2. Collect $S_a \subset T$, the synsets that contain a: $\forall S_{ai} \in S_a,\ a \in S_{ai}$
   ▶ a is not in T? Create synset A = (a), a → A
3. For each $S_{ai} \in S_a$:
   ▶ $p_{ai} = \frac{n_{ai}}{|S_{ai}|}$, where $n_{ai}$ = number of terms $t_j \in S_{ai}$ such that $(t_j\ R\ b)$
     ★ $S_{a1} = (a, c, d, e)$, $p_{a1} = \frac{3}{4}$
     ★ $S_{a2} = (a, f, g)$, $p_{a2} = \frac{2}{3}$
     ★ $S_{a3} = (a, h, i, j)$, $p_{a3} = \frac{1}{4}$
   ▶ a → $S_{a1}$ (the synset with the highest proportion)
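
The steps above can be sketched as follows. The data layout (synsets as frozensets, the network as a set of triples) is an assumption, as is counting the term a itself among the related terms, which is the reading under which the example proportions work out. Ties are broken by thesaurus order here, which the slide does not specify.

```python
def related_proportion(a, b, relation, thesaurus, network):
    """Pick the synset for `a` in the triple (a, relation, b)."""
    candidates = [synset for synset in thesaurus if a in synset]
    if not candidates:
        # a is not in the thesaurus: create a new single-term synset
        return frozenset({a})

    def proportion(synset):
        related = sum(1 for term in synset if (term, relation, b) in network)
        return related / len(synset)

    return max(candidates, key=proportion)


Sa1 = frozenset({"a", "c", "d", "e"})
Sa2 = frozenset({"a", "f", "g"})
Sa3 = frozenset({"a", "h", "i", "j"})
thesaurus = [Sa1, Sa2, Sa3]

# Hypothetical term-based triples that yield the proportions 3/4, 2/3 and 1/4
network = {("a", "R", "b"), ("c", "R", "b"), ("d", "R", "b"), ("f", "R", "b")}

chosen = related_proportion("a", "b", "R", thesaurus, network)
```

With this toy network the function selects Sa1, as in the slide. The same procedure would then be applied to b with a fixed.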
Cosine similarity
1. M = term-term matrix based on the adjacencies of the lexical network
2. Collect all the synsets containing a, $S_a \subset T$, and all the synsets containing b, $S_b \subset T$
3. For each $A \in S_a$ and $B \in S_b$, with terms $A_i \in A$ and $B_j \in B$:
   $$sim(A, B) = \frac{\sum_{i=1}^{|A|} \sum_{j=1}^{|B|} \cos(A_i, B_j)}{|A||B|}$$
4. Select the pair of synsets with the highest similarity
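
A compact sketch of this procedure is given below. The toy network, the synsets and the choice of marking each term as adjacent to itself (so that directly connected terms get a non-zero cosine) are assumptions for illustration only; how the matrix M is actually filled is not detailed in the slides.

```python
import math
from itertools import product


def adjacency_vectors(terms, edges):
    """Rows of a binary term-term matrix; each term is also adjacent to itself."""
    index = {term: i for i, term in enumerate(terms)}
    vectors = {term: [0] * len(terms) for term in terms}
    for term in terms:
        vectors[term][index[term]] = 1
    for x, y in edges:
        vectors[x][index[y]] = 1
        vectors[y][index[x]] = 1
    return vectors


def cosine(u, v):
    """Cosine of the angle between two vectors (0.0 for a zero vector)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def synset_similarity(A, B, vectors):
    """Average pairwise cosine between the terms of two synsets."""
    total = sum(cosine(vectors[ai], vectors[bj]) for ai, bj in product(A, B))
    return total / (len(A) * len(B))


def best_pair(a, b, thesaurus, vectors):
    """Select the pair of synsets (A, B), with a in A and b in B, that is most similar."""
    Sa = [s for s in thesaurus if a in s]
    Sb = [s for s in thesaurus if b in s]
    return max(product(Sa, Sb), key=lambda p: synset_similarity(p[0], p[1], vectors))


# Hypothetical network and thesaurus
terms = ["a", "b", "c", "d", "e"]
edges = [("a", "b"), ("c", "b"), ("c", "e")]
A1, A2, B1 = frozenset({"a", "c"}), frozenset({"a", "d"}), frozenset({"b", "e"})

vectors = adjacency_vectors(terms, edges)
pair = best_pair("a", "b", [A1, A2, B1], vectors)
```

In this toy case A1 is preferred over A2 because its terms share more neighbours with the terms of B1.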
Approach
Knowledge organisation
Transitivity
▶ if R is transitive (e.g. SYNONYMY, HYPERNYMY, ...): (A R B) ∧ (B R C) → (A R C)
Inheritance
▶ if R is not a HYPERNYMY or HYPONYMY relation: (A HYPERNYM OF B) ∧ (A R C) → (B R C)
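
Both rules can be applied repeatedly until no new triple appears. The following is a naive fixed-point sketch over synset identifiers; the relation labels, the set of transitive relations and the guard against self-relations are assumptions, and the example triples are invented.

```python
TRANSITIVE = {"SYNONYM OF", "HYPERNYM OF"}
NOT_INHERITED = {"HYPERNYM OF", "HYPONYM OF"}


def infer(triples):
    """Close a set of (A, R, B) triples under transitivity and inheritance."""
    known = set(triples)
    changed = True
    while changed:
        new = set()
        for a, r, b in known:
            for c, r2, d in known:
                # Transitivity: (A R B) and (B R C) imply (A R C)
                if r in TRANSITIVE and r2 == r and b == c and a != d:
                    new.add((a, r, d))
                # Inheritance: (A HYPERNYM OF B) and (A R C) imply (B R C)
                if r == "HYPERNYM OF" and c == a and r2 not in NOT_INHERITED:
                    new.add((b, r2, d))
        changed = not new <= known
        known |= new
    return known


# Hypothetical synset-based triples
facts = {
    ("A", "HYPERNYM OF", "B"),
    ("B", "HYPERNYM OF", "C"),
    ("A", "PART OF", "D"),
}
closure = infer(facts)
```

The quadratic loop is only meant to show the rules at work; a resource of realistic size would need indexing by relation and argument.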
Current results
Relation extraction
Triples extracted from dictionaries
▶ Dicionário da Língua Portuguesa (PAPEL 2.0)
▶ Dicionário Aberto (DA)

| Relation  | Arguments | PAPEL 2.0 | DA     | Example                |
|-----------|-----------|-----------|--------|------------------------|
| Synonymy  | noun,noun | 37,452    | 20,910 | auxílio, contributo    |
| Synonymy  | verb,verb | 21,465    | 8,715  | tributar, colectar     |
| Synonymy  | adj,adj   | 19,073    | 7,353  | flexível, moldável     |
| Synonymy  | adv,adv   | 1,171     | 605    | após, seguidamente     |
| Hypernymy | noun,noun | 62,591    | 59,887 | planta, salva          |
| Part-of   | noun,noun | 2,805     | 1,795  | cauda, cometa          |
| Part-of   | noun,adj  | 3,721     | 4,902  | tampa, coberto         |
| Member-of | noun,noun | 5,929     | 1,564  | ervilha, Leguminosas   |
| Member-of | adj,noun  | 883       | 59     | celular, célula        |
| Causation | noun,noun | 1,013     | 264    | fricção, assadura      |
| Causation | adj,noun  | 498       | 166    | reactivo, reacção      |
| Causation | verb,noun | 6,399     | 5,714  | limpar, purgação       |
| Purpose   | noun,noun | 2,886     | 1,760  | defesa, armadura       |
| Purpose   | verb,noun | 5,192     | 3,383  | fazer rir, comédia     |
| Purpose   | verb,adj  | 260       | 186    | corrigir, correccional |

Table: Examples of triples
Relations extracted from Wikipedia abstracts

| Relation  | Quantity | Example                | Sample | Correct   | Agreement |
|-----------|----------|------------------------|--------|-----------|-----------|
| Synonymy  | 11,862   | estupro, violação      | 286    | 86.1%     | 91.2%     |
| Hypernymy | 29,563   | estilo de música, folk | 322    | 59.1% (3) | 93.1%     |
| Part-of   | 1,287    | jejuno, intestino      | 268    | 52.6%     | 78.4%     |
| Causation | 520      | parasita, doença       | 244    | 49.6%     | 79.5%     |
| Purpose   | 743      | construção, terracota  | 264    | 57.0%     | 82.2%     |

Table: Examples and validation of relations
(3) In 30% of cases, the grammars/tagger could not identify complete scientific names, as in:
O Iriatherina werneri é uma espécie de peixe de aquário [Iriatherina werneri is a species of aquarium fish]
→ peixe de aquário HYPERNYM OF werneri
Current results
Synset discovery
Thesaurus
▶ TeP thesaurus (4)
▶ OpenThesaurus.PT (OT) (5)
▶ Clustered PAPEL (CLIP)
▶ TeP merged with OT, merged with CLIP (TOP)

| Thesaurus | Words: quantity | Words: ambiguous | Words: most ambiguous | Synsets: quantity | Synsets: avg. size | Synsets: biggest |
|-----------|-----------------|------------------|-----------------------|-------------------|--------------------|------------------|
| TeP       | 17,158          | 5,867            | 20                    | 8,254             | 3.51               | 21               |
| OT        | 5,819           | 442              | 4                     | 1,872             | 3.37               | 14               |
| CLIP      | 23,741          | 12,196           | 47                    | 7,468             | 12.57              | 103              |
| TOP       | 30,554          | 13,294           | 21                    | 9,960             | 6.6                | 277              |

Table: (Noun) thesauri in numbers.
(4) http://www.nilc.icmc.usp.br/tep2/index.htm
(5) http://openthesaurus.caixamagica.pt/
Manual validation

| Thesaurus | Sample   | Correct | Incorrect | N/A  | Agreement |
|-----------|----------|---------|-----------|------|-----------|
| CLIP      | 519 sets | 65.8%   | 31.7%     | 2.5% | 76.1%     |
| CLIP'     | 310 sets | 81.1%   | 16.9%     | 2.0% | 84.2%     |
| TOP       | 480 sets | 83.2%   | 15.8%     | 1.0% | 82.3%     |
| TOP'      | 448 sets | 86.8%   | 12.3%     | 0.9% | 83.0%     |

Table: Results of manual synset validation.
CLIP' and TOP' only consider synsets with 10 or fewer words.
▶ The quality is higher for smaller synsets.
Current results
Wordnet establishment
Resulting WordNet – related proportion

| Phase | Count                | Hypernym of | Part of | Member of |
|-------|----------------------|-------------|---------|-----------|
|       | Term-based triples   | 62,591      | 2,805   | 5,929     |
| 1st   | Mapped               | 27,750      | 1,460   | 3,962     |
| 1st   | Same synset          | 233         | 5       | 12        |
| 1st   | Already present      | 3,970       | 40      | 167       |
| 1st   | Semi-mapped triples  | 7,952       | 262     | 357       |
| 2nd   | Mapped               | 88          | 1       | 0         |
| 2nd   | Could be inferred    | 50          | 0       | 0         |
| 2nd   | Already present      | 13          | 0       | 0         |
|       | Synset-based triples | 23,572      | 1,416   | 3,783     |

Table: Results of triples mapping
Automatic validation
For each triple A R B:
1. Compile a set of textual patterns denoting R, e.g.:
   ▶ (hypo) é um|uma (tipo|forma|variedade|...)* de (hyper)
   ▶ (whole/group) é um (grupo|conjunto|...) de (part/member)
2. Score the triple with the help of Google:
   $$score = \frac{\sum_{i=1}^{|A|} \sum_{j=1}^{|B|} found(A_i, B_j, R)}{|A| \cdot |B|}$$

| Relation    | Sample size | Validation |
|-------------|-------------|------------|
| Hypernym of | 419 synsets | 44.1%      |
| Member of   | 379 synsets | 24.3%      |
| Part of     | 290 synsets | 24.8%      |

Table: Automatic validation of triples
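
The scoring formula can be sketched independently of any search engine by passing the `found` check in as a function. In the sketch below, `found` is a stub that looks pairs up in a small invented set of attested triples; in the setting of the slide it would instead query the Web with the textual patterns. Treating `found` as a 0/1 indicator is also an assumption.

```python
from itertools import product


def score(A, B, relation, found):
    """Fraction of term pairs (Ai, Bj) for which a pattern denoting `relation` is found."""
    hits = sum(1 for ai, bj in product(A, B) if found(ai, bj, relation))
    return hits / (len(A) * len(B))


# Hypothetical evidence standing in for search-engine results
attested = {("planta", "salva", "HYPERNYM OF")}


def found_stub(ai, bj, relation):
    """Return True if the pair is in the toy evidence set."""
    return (ai, bj, relation) in attested


A = {"planta", "vegetal"}
B = {"salva"}
triple_score = score(A, B, "HYPERNYM OF", found_stub)
```

With one attested pair out of two possible pairs, the toy triple scores 0.5; a validation rate like those in the table would then follow from a threshold on this score, which is not specified here.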
Concluding remarks
Answer to the growing demand for semantically aware applications
Lack of public domain lexico-semantic resources for Portuguese
Export resources to different data formats
WSD without a context:
▶ Clustering for establishing synsets
▶ Disambiguation of terms based on the extracted knowledge
▶ Further organisation
Check http://ontopt.dei.uc.pt for updates and available resources
References
Bollegala, D., Matsuo, Y., and Ishizuka, M. (2007). Measuring semantic similarity between words using web search engines. In Proc. 16th International Conference on World Wide Web (WWW’07), pages 757–766, New York, NY, USA. ACM.
Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., and Harshman, R. (1990). Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41:391–407.
Fellbaum, C., editor (1998). WordNet: An Electronic Lexical Database (Language, Speech, and Communication). The MIT Press.
Gonçalo Oliveira, H. and Gomes, P. (2010). Towards the automatic creation of a wordnet from a term-based lexical network. In Proceedings of the ACL Workshop TextGraphs-5: Graph-based Methods for Natural Language Processing.
Hirst, G. (2004). Ontology and the lexicon. In Staab, S. and Studer, R., editors, Handbook on Ontologies, International Handbooks on Information Systems, pages 209–230. Springer.
Oliveira, H. G. and Gomes, P. (2010). Automatic creation of a conceptual base for Portuguese using clustering techniques. In Proc. 19th European Conference on Artificial Intelligence (ECAI 2010).
Turney, P. D. (2001). Mining the web for synonyms: PMI–IR versus LSA on TOEFL. In Proc. 12th European Conference on Machine Learning (ECML-2001), volume 2167, pages 491–502. Springer.
The end
Thank you!