Onto.PT: Automatic construction of a Lexical Ontology for Portuguese
Hugo Gonçalo Oliveira (1), Paulo Gomes
{hroliv,pgomes}@dei.uc.pt
Cognitive & Media Systems Group, CISUC, University of Coimbra
STAIRS 2010, Lisbon, August 16, 2010
(1) Supported by FCT scholarship grant SFRH/BD/44955/2008

Outline
1 Introduction
  Lexical ontologies
  Goals
2 Approach
  Information extraction from text
  Synset discovery
  Merging synset-based resources
  Weighting triples
  Assigning terms to synsets
  Knowledge organisation
3 Current results
  Relation extraction
  Synset discovery
  Wordnet establishment
4 Concluding remarks

Introduction
Today's applications
Need to understand information conveyed by natural language
Therefore, demand better access to knowledge on words and their meanings!
Encoded in lexical ontologies

Introduction: Lexical ontologies
Lexical ontologies
Such as Princeton WordNet [Fellbaum, 1998]
▶ Ontology + lexicon [Hirst, 2004]
▶ Knowledge structured on words and their meanings
▶ Cover the whole language
▶ Not based on a specific domain
Typically handcrafted...
▶ Construction and maintenance involve time-consuming human effort!

Introduction: Goals
Onto.PT
Automatic construction of a lexical ontology for Portuguese
Extracted from different sources
▶ Manually created thesauri
▶ Language dictionaries/encyclopedias
▶ Corpora
Modelled after Princeton WordNet
▶ Synsets: groups of synonymous words
▶ Synset-based relational triples
Word sense disambiguation (WSD) based on the knowledge already extracted, not on the context

Approach: Information extraction from text
Examples
From dictionaries:
▶ tenreiro, n -- terneiro, novilho ou bezerro. [gloss: three words for a calf or young bull]
  → terneiro SYNONYM OF tenreiro
  → novilho SYNONYM OF tenreiro
  → bezerro SYNONYM OF tenreiro
▶ ébola, n -- virose que provoca febres e hemorragias [gloss: "viral disease that causes fevers and haemorrhages"]
  → ébola CAUSATION OF febres
  → ébola CAUSATION OF hemorragias
From textual corpora:
▶ O automobilismo (também conhecido como corridas de automóveis ou desporto motorizado) é um desporto... [gloss: "Motor racing (also known as car racing or motorsport) is a sport..."]
  → automobilismo SYNONYM OF corridas de automóveis
  → automobilismo SYNONYM OF desporto motorizado
  → desporto HYPERNYM OF automobilismo

Approach: Synset discovery
[Figure: Synonymy lexical network – example]
Synonymy networks tend to have a clustered structure
Goal: Identify synsets taking advantage of clusters
Approach: Clustering algorithm over the synonymy lexical network (see poster [Oliveira and Gomes, 2010])
Keep ambiguity: clusters might be overlapping!
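
The slides defer the actual clustering algorithm to the poster, so the following is only an illustrative stand-in and not the authors' method. It treats the maximal cliques of a synonymy network as candidate synsets, which is one simple way to see how an ambiguous word can end up in more than one cluster. The toy pairs and the names `build_graph` and `maximal_cliques` are assumptions made for this sketch.

```python
from collections import defaultdict


def build_graph(pairs):
    """Build an undirected synonymy network: term -> set of synonyms."""
    adj = defaultdict(set)
    for a, b in pairs:
        adj[a].add(b)
        adj[b].add(a)
    return dict(adj)


def maximal_cliques(adj):
    """Enumerate maximal cliques with the basic Bron-Kerbosch recursion."""
    found = []

    def expand(clique, candidates, excluded):
        if not candidates and not excluded:
            found.append(frozenset(clique))  # nothing can extend it: maximal
            return
        for v in list(candidates):
            expand(clique | {v}, candidates & adj[v], excluded & adj[v])
            candidates = candidates - {v}
            excluded = excluded | {v}

    expand(set(), set(adj), set())
    return found


# Hypothetical synonymy pairs, loosely based on words used in the slides.
PAIRS = [
    ("divindade", "deus"), ("deus", "nume"), ("divindade", "nume"),
    ("divindade", "deusa"), ("deusa", "diva"), ("divindade", "diva"),
]

if __name__ == "__main__":
    for synset in maximal_cliques(build_graph(PAIRS)):
        print(sorted(synset))
```

In this toy network "divindade" belongs to both resulting groups, so the ambiguity of the word is kept rather than resolved.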
[Figure: Clustering – example]

Approach: Merging synset-based resources
Merging synsets from different thesauri
For each synset Ti ∈ T, select the Bj ∈ B with the highest c = |Ti ∩ Bj| / |Ti ∪ Bj| (2)
B1 = (diva, beldade, beleza, deidade, deusa, divindade)
B2 = (divindade, deidade, deus, nume)
T1 = (divindade, diva, deusa)
▶ c(T1, B1) = |T1 ∩ B1| / |T1 ∪ B1| = 3/6 = 1/2
▶ c(T1, B2) = |T1 ∩ B2| / |T1 ∪ B2| = 1/6
N = B1 ∪ T1 = (diva, beldade, beleza, deidade, deusa, divindade)
(2) Jaccard coefficient
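
The merging step can be sketched as follows, assuming synsets are plain sets of strings. The function names, the choice to compare against the progressively merged synsets, and the handling of a synset with no overlap at all (it is kept as a new synset) are assumptions of this sketch; the slides only state the selection criterion and the union.

```python
def jaccard(a, b):
    """Jaccard coefficient |a ∩ b| / |a ∪ b| of two collections of terms."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def merge_thesauri(T, B):
    """Merge each synset of T into the most similar synset of B."""
    merged = [set(b) for b in B]
    for t in map(set, T):
        best = max(range(len(merged)),
                   key=lambda j: jaccard(t, merged[j]),
                   default=None)
        if best is not None and jaccard(t, merged[best]) > 0:
            merged[best] |= t      # N = Bj ∪ Ti
        else:
            merged.append(t)       # assumption: no overlap, keep as new synset
    return merged


B1 = {"diva", "beldade", "beleza", "deidade", "deusa", "divindade"}
B2 = {"divindade", "deidade", "deus", "nume"}
T1 = {"divindade", "diva", "deusa"}

if __name__ == "__main__":
    print(jaccard(T1, B1), jaccard(T1, B2))
    print(merge_thesauri([T1], [B1, B2]))
```

With the sets from the slide, T1 is merged into B1, and because T1 is a subset of B1 the merged synset equals B1.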

Approach: Weighting triples
Frequency of extraction
Corpus distributional similarity metrics (e.g. LSA [Deerwester et al., 1990], PMI [Turney, 2001])
Web distributional similarity metrics (e.g. WebJaccard, WebOverlap [Bollegala et al., 2007])
Define filters based on weights

Approach: Assigning terms to synsets
Mapping methods
Input:
▶ Thesaurus T, containing synsets
▶ Term-based semantic network N, where each edge has a type R
Goal: map a R b ∈ N to A R B, with A, B ∈ T
Output: semantic network W, whose nodes are synsets, which relate to other synsets by means of semantic relations (a wordnet)

Mapping procedures
Baseline
▶ A and B are random synsets containing a and b, respectively
Related proportion [Gonçalo Oliveira and Gomes, 2010]
Cosine similarity
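
To make the input, goal and baseline above concrete, here is a minimal sketch. It assumes a thesaurus is a list of frozensets of terms and a term-based triple is an `(a, relation, b)` tuple; these representations, the function names and the toy data are illustrative assumptions. Creating a singleton synset for a term missing from the thesaurus is also an assumption for the baseline.

```python
import random


def synsets_containing(term, thesaurus):
    """All synsets of the thesaurus that include the given term."""
    return [s for s in thesaurus if term in s]


def baseline_map(triple, thesaurus, rng=random):
    """Map a term-based triple (a, R, b) to a synset-based triple (A, R, B).

    A and B are picked at random among the synsets containing a and b.
    """
    a, relation, b = triple
    candidates_a = synsets_containing(a, thesaurus) or [frozenset([a])]
    candidates_b = synsets_containing(b, thesaurus) or [frozenset([b])]
    return (rng.choice(candidates_a), relation, rng.choice(candidates_b))


# Hypothetical toy thesaurus: "divindade" is ambiguous, "ser" is not.
THESAURUS = [
    frozenset({"divindade", "deus", "nume"}),
    frozenset({"divindade", "deusa", "diva"}),
    frozenset({"ser"}),
]

if __name__ == "__main__":
    print(baseline_map(("ser", "HYPERNYM_OF", "divindade"), THESAURUS))
```

Because the choice is random, the baseline ignores everything else the network says about a and b, which is exactly what the other two procedures try to exploit.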

Related proportion
Assignment of a (in a R b) to A:
1 Fix b
2 Sa ⊂ T: Sai ∈ Sa, a ∈ Sai
  ▶ a is not in T? Create synset A = (a), a → A
3 For each Sai ∈ Sa,
  ▶ pai = nai / |Sai|, nai = number of terms tj ∈ Sai : (tj R b)
    ★ Sa1 = (a, c, d, e), pa1 = 3/4
    ★ Sa2 = (a, f, g), pa2 = 2/3
    ★ Sa3 = (a, h, i, j), pa3 = 1/4
  ▶ a → Sa1

Cosine similarity
1 M = term-term matrix based on the adjacencies of the lexical network
2 Collect all the synsets with a, Sa ⊂ T, and all synsets with b, Sb ⊂ T
3 For each A ∈ Sa and B ∈ Sb, with terms Ai ∈ A and Bj ∈ B:
  sim(A, B) = \frac{\sum_{i=1}^{|A|} \sum_{j=1}^{|B|} \cos(A_i, B_j)}{|A| \, |B|}
4 Select the pair of synsets with the highest similarity

Approach: Knowledge organisation
Transitivity
▶ if R is transitive (e.g. SYNONYMY, HYPERNYMY, ...): (A R B) ∧ (B R C) → (A R C)
Inheritance
▶ if R is not a HYPERNYMY or HYPONYMY relation: (A HYPERNYM OF B) ∧ (A R C) → (B R C)

Current results: Relation extraction
Triples extracted from dictionaries
Dicionário da Língua Portuguesa (PAPEL 2.0)
Dicionário Aberto (DA)

Relation  | Arguments | PAPEL 2.0 | DA     | Examples
Synonymy  | noun,noun | 37,452    | 20,910 | auxílio, contributo
Synonymy  | verb,verb | 21,465    | 8,715  | tributar, colectar
Synonymy  | adj,adj   | 19,073    | 7,353  | flexível, moldável
Synonymy  | adv,adv   | 1,171     | 605    | após, seguidamente
Hypernymy | noun,noun | 62,591    | 59,887 | planta, salva
Part-of   | noun,noun | 2,805     | 1,795  | cauda, cometa
Part-of   | noun,adj  | 3,721     | 4,902  | tampa, coberto
Member-of | noun,noun | 5,929     | 1,564  | ervilha, Leguminosas
Member-of | adj,noun  | 883       | 59     | celular, célula
Causation | noun,noun | 1,013     | 264    | fricção, assadura
Causation | adj,noun  | 498       | 166    | reactivo, reacção
Causation | verb,noun | 6,399     | 5,714  | limpar, purgação
Purpose   | noun,noun | 2,886     | 1,760  | defesa, armadura
Purpose   | verb,noun | 5,192     | 3,383  | fazer rir, comédia
Purpose   | verb,adj  | 260       | 186    | corrigir, correccional
Table: Examples of triples
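
Looking back at the related-proportion procedure described under "Assigning terms to synsets", the steps can be sketched as below. The sketch assumes the term-based network is a set of `(term, relation, term)` tuples and the thesaurus a list of frozensets; the function names, the toy triples (chosen so that the proportions of the abstract Sa1, Sa2, Sa3 example are reproduced) and the tie-breaking rule (first candidate wins) are assumptions, not details given in the slides.

```python
from fractions import Fraction


def related_proportion(synset, b, relation, network):
    """p = n / |S|, where n counts the terms t in S with (t R b) in the network."""
    related = sum(1 for t in synset if (t, relation, b) in network)
    return Fraction(related, len(synset))


def assign_term(a, b, relation, thesaurus, network):
    """Choose the synset for a (in a R b), keeping b fixed."""
    candidates = [s for s in thesaurus if a in s]
    if not candidates:
        return frozenset([a])  # a is not in T: create synset A = (a)
    return max(candidates,
               key=lambda s: related_proportion(s, b, relation, network))


# Hypothetical network in which a, c, d and f are each related to b.
NETWORK = {("a", "R", "b"), ("c", "R", "b"), ("d", "R", "b"), ("f", "R", "b")}
SA1, SA2, SA3 = frozenset("acde"), frozenset("afg"), frozenset("ahij")

if __name__ == "__main__":
    for s in (SA1, SA2, SA3):
        print(sorted(s), related_proportion(s, "b", "R", NETWORK))
    print(sorted(assign_term("a", "b", "R", [SA1, SA2, SA3], NETWORK)))
```

Note that a itself always counts towards n, since a R b is in the network, so a candidate synset of size k can never score below 1/k.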

Relations extracted from Wikipedia abstracts

Relation  | Quantity | Example                | Sample | Correct   | Agreement
Synonymy  | 11,862   | estupro, violação      | 286    | 86.1%     | 91.2%
Hypernymy | 29,563   | estilo de música, folk | 322    | 59.1% (3) | 93.1%
Part-of   | 1,287    | jejuno, intestino      | 268    | 52.6%     | 78.4%
Causation | 520      | parasita, doença       | 244    | 49.6%     | 79.5%
Purpose   | 743      | construção, terracota  | 264    | 57.0%     | 82.2%
Table: Examples and validation of relations
(3) In 30%, grammars/tagger could not identify complete scientific names, as in:
    O Iriatherina werneri é uma espécie de peixe de aquário → peixe de aquário HYPERNYM OF werneri

Current results: Synset discovery
Thesaurus
▶ TeP thesaurus (4)
▶ OpenThesaurus.PT (OT) (5)
▶ Clustered PAPEL (CLIP)
▶ TeP merged with OT, merged with CLIP (TOP)

     | Words: Quantity | Words: Ambiguous | Words: Most ambiguous | Synsets: Quantity | Synsets: Avg. size | Synsets: Biggest
TeP  | 17,158          | 5,867            | 20                    | 8,254             | 3.51               | 21
OT   | 5,819           | 442              | 4                     | 1,872             | 3.37               | 14
CLIP | 23,741          | 12,196           | 47                    | 7,468             | 12.57              | 103
TOP  | 30,554          | 13,294           | 21                    | 9,960             | 6.6                | 277
Table: (Noun) thesauruses in numbers.
(4) http://www.nilc.icmc.usp.br/tep2/index.htm
(5) http://openthesaurus.caixamagica.pt/

Manual validation

      | Sample   | Correct | Incorrect | N/A  | Agreement
CLIP  | 519 sets | 65.8%   | 31.7%     | 2.5% | 76.1%
CLIP' | 310 sets | 81.1%   | 16.9%     | 2.0% | 84.2%
TOP   | 480 sets | 83.2%   | 15.8%     | 1.0% | 82.3%
TOP'  | 448 sets | 86.8%   | 12.3%     | 0.9% | 83.0%
Table: Results of manual synset validation. CLIP' and TOP' only consider synsets with 10 or fewer words.
▶ The quality is higher for smaller synsets.

Current results: Wordnet establishment
Resulting WordNet – related proportion

                     | Hypernym of | Part of | Member of
Term-based triples   | 62,591      | 2,805   | 5,929
Mapped 1st           | 27,750      | 1,460   | 3,962
Same synset          | 233         | 5       | 12
Already present      | 3,970       | 40      | 167
Semi-mapped triples  | 7,952       | 262     | 357
Mapped 2nd           | 88          | 1       | 0
Could be inferred    | 50          | 0       | 0
Already present      | 13          | 0       | 0
Synset-based triples | 23,572      | 1,416   | 3,783
Table: Results of triples mapping

Automatic validation
For each triple, A R B
1 Compile a set of textual patterns denoting R, e.g.:
  ▶ (hypo) é um|uma (tipo|forma|variedade|...)* de (hyper)
  ▶ (whole/group) é um (grupo|conjunto|...) de (part/member)
2 Score the triple with the help of Google:
  score = \frac{\sum_{i=1}^{|A|} \sum_{j=1}^{|B|} found(A_i, B_j, R)}{|A| \cdot |B|}
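
The scoring step can be sketched as below. The slides do not say how `found` is computed beyond "with the help of Google", so this sketch replaces the web search with a lookup in a small in-memory list of sentences; that substitution, the simplified regular expression (which makes the "tipo/forma/variedade de" part optional), the relation label `HYPERNYM_OF` and all function names are assumptions.

```python
import re

# Simplified form of the hypernymy pattern quoted in the slides.
HYPERNYMY_TEMPLATE = (
    r"\b{hypo} é (?:um|uma) (?:(?:tipo|forma|variedade) de )?{hyper}\b"
)


def make_found(sentences):
    """Return a found(a, b, relation) predicate backed by local sentences."""
    def found(hyper, hypo, relation):
        if relation != "HYPERNYM_OF":
            return False  # other relations are not sketched here
        pattern = re.compile(
            HYPERNYMY_TEMPLATE.format(hypo=re.escape(hypo),
                                      hyper=re.escape(hyper)),
            re.IGNORECASE,
        )
        return any(pattern.search(s) for s in sentences)
    return found


def validation_score(A, B, relation, found):
    """Fraction of term pairs (Ai, Bj) for which a pattern occurrence is found."""
    if not A or not B:
        return 0.0
    hits = sum(1 for a in A for b in B if found(a, b, relation))
    return hits / (len(A) * len(B))


if __name__ == "__main__":
    found = make_found(["O automobilismo é um desporto.",
                        "A valsa é uma forma de dança."])
    print(validation_score(["desporto", "esporte"], ["automobilismo"],
                           "HYPERNYM_OF", found))
```

Keeping `found` as a parameter means the same scoring function could be backed by a search engine, a corpus index or a stub in tests.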

Relation    | Sample size | Validation
Hypernym of | 419 synsets | 44.1%
Member of   | 379 synsets | 24.3%
Part of     | 290 synsets | 24.8%
Table: Automatic validation of triples

Concluding remarks
Answer to the growing demand for semantically aware applications
Lack of public domain lexico-semantic resources for Portuguese
Export resources to different data formats
WSD without a context:
▶ Clustering for establishing synsets
▶ Disambiguation of terms based on the extracted knowledge
▶ Further organisation
Check http://ontopt.dei.uc.pt for updates and available resources

References
Bollegala, D., Matsuo, Y., and Ishizuka, M. (2007). Measuring semantic similarity between words using web search engines. In Proc. 16th International Conference on World Wide Web (WWW'07), pages 757–766, New York, NY, USA. ACM.
Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., and Harshman, R. (1990). Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41:391–407.
Fellbaum, C., editor (1998). WordNet: An Electronic Lexical Database (Language, Speech, and Communication). The MIT Press.
Gonçalo Oliveira, H. and Gomes, P. (2010). Towards the automatic creation of a wordnet from a term-based lexical network. In Proceedings of the ACL Workshop TextGraphs-5: Graph-based Methods for Natural Language Processing.
Hirst, G. (2004). Ontology and the lexicon. In Staab, S. and Studer, R., editors, Handbook on Ontologies, International Handbooks on Information Systems, pages 209–230. Springer.
Oliveira, H. G. and Gomes, P. (2010). Automatic creation of a conceptual base for Portuguese using clustering techniques. In Proc. 19th European Conference on Artificial Intelligence (ECAI 2010).
Turney, P. D. (2001). Mining the web for synonyms: PMI–IR versus LSA on TOEFL. In Proc. 12th European Conference on Machine Learning (ECML-2001), volume 2167, pages 491–502. Springer.

Thank you!