Verbetes v3
Task 3 – Web Community Sensing & Task 6 – Query and Visualization
REACTION Workshop – January 31st, 2013

Luis Rei
[email protected]
@lmrei
http://luisrei.com

Jorge Teixeira
[email protected]

Context: Verbetes
• Associate an office or profession (ergo) with a person’s name
• Temporal information
• Applied to Portuguese news
Example: "Portuguese prime-minister Pedro Passos Coelho ..." yields the name "Pedro Passos Coelho" with the ergo "Portuguese prime-minister".

Context: Voxx

Context: O Mundo Visto Daqui (MVDI)

Verbetes v3: Motivation
• Extract new types/categories of entities (Organizations, Products, Locations)
• Enrich entities with new descriptors:
  • Corporate title
  • Dates of Birth/Death
  • Photos
  • Company foundation date
  • ...

Verbetes v3: Plan
1. NER (Identification, Classification)
2. Disambiguation
3. Descriptor Extraction
4. Fusion & Automatic Cross-Validation (Wikipedia, Freebase, ...)
5. Web Service

1. Named Entity Recognition
Input: Apple’s CEO, Tim Cook, said the iPhone has cannibalized some iPod business.
Output: [Apple]’s CEO, [Tim Cook], said the [iPhone] has cannibalized some [iPod] business.
• Apple: ORG
• Tim Cook: Person
• iPhone, iPod: Product

2. Disambiguation
• Same category: Vitor Pereira (Person)
  • Coach
  • President of the Referee Commission
• Different category: Francisco Sá Carneiro
  • Person
  • Location (Airport)
  • Organization (Institute)

3. Descriptor Extraction
[Apple]’s CEO, [Tim Cook], said the [iPhone] has cannibalized some [iPod] business.
• ERGO: CEO

4. Fusion
• Entity: Muammar Gaddafi
  • First-name variants: Muammar, Moammar, Mu'ammar, Moamar, ...
  • Surname variants: Gaddafi, Gathafi, Kadafi, Qaddafi, ...
• Descriptor: Presidente da Liga Portuguesa de Futebol (President of the Portuguese Football League)
  • Variants: Presidente da Liga de Futebol, Presidente da Liga Portuguesa de Futebol Profissional

5. Web Services

NER: State of The Art Approaches
1. Get an annotated corpus
2. Train a model
3. Use it to extract entities

Research Problems
• How to annotate the corpus
• Corpus age effect on training data
• Language dependency of all the tools
• Feature extraction, Tokenization, ...
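To make the "Feature extraction, Tokenization" problem concrete, here is a minimal sketch of what per-token features for a sequence model might look like. The regex tokenizer, the function names, and the feature set are illustrative assumptions, not the features used in Verbetes v3.

```python
import re

def tokenize(text):
    # Assumption: a naive tokenizer that splits into runs of word
    # characters (accented letters included) or single punctuation marks.
    return re.findall(r"\w+|[^\w\s]", text)

def token_features(tokens, i):
    # Hypothetical feature set: surface form, capitalization, and the
    # neighbouring tokens. Real systems usually add POS tags, lemmas, etc.
    tok = tokens[i]
    return {
        "lower": tok.lower(),
        "is_title": tok.istitle(),
        "is_upper": tok.isupper(),
        "prev": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }

sentence = "Apple's CEO, Tim Cook, said the iPhone has cannibalized some iPod business."
tokens = tokenize(sentence)
features = [token_features(tokens, i) for i in range(len(tokens))]
```

Even on this one sentence the language dependency shows: the tokenizer splits "Apple's" into three tokens, and capitalization features flag "Tim" but not "iPhone".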
Our Approach: Bootstrapping CRFs
1. Annotate News: the names list (Barack Obama, Hillary Clinton, ...) is matched against the news to produce annotated news, e.g. "[Barack Obama] nomeou..." ("[Barack Obama] appointed...")
2. Extract Features: build the trainset, e.g. for the token "nomeou": verbo (verb), singular, nomear (lemma), ...
3. Training: produce the model
4. Test, Extract, Add: run the model on the un-annotated testset, extract names, and add them to the names list

Corpus & Evaluation
• News articles from Sapo (~3M)
• Select 20,000 that have names in the seed list
• Precision - 100 articles/test
  • For each name extracted from each news article: correct or not
• Recall - 40 articles/test
  • For each article, calculate recall

context                        extracted entity     correct
... Cristiano Ronaldo ...      Cristiano Ronaldo    TRUE
... Pedro Passos Coelho ...    Pedro Passos         FALSE

Preliminary Results
• Precision (avg of 3 tests, 1 bootstrap iteration): 0.97
• Recall (single test, 1 bootstrap iteration): 0.56 (std = 0.34)
• F = 0.71

Possible Applications: O Mundo Numa Rede

Q&A

Thank You, Luis Rei
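As a check on the preliminary results, the reported F value is consistent with the balanced F1 score (the harmonic mean of precision and recall). The slides do not say which F variant was used, so treating it as F1 is an assumption, and `f_score` is a name chosen here for illustration.

```python
def f_score(precision, recall):
    # Balanced F-measure (F1): harmonic mean of precision and recall.
    # Assumption: the slides' "F" is F1; they do not name the variant.
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Values reported on the Preliminary Results slide.
f = f_score(0.97, 0.56)
print(round(f, 2))
```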