Introduction
Text Analytics: Practical Course (Praktikum)
Ulf Leser

Idea
• This course has a larger practical component than a usual Halbkurs (half-course)
• Components
  – Each student gives a 30-minute presentation at some point during the semester
    • The topics are mostly practical (presentation of a tool, etc.)
    • Topics are assigned during the semester (starting today)
  – We will form groups
    • Each group has to solve 5-6 exercises
    • All exercises must be solved by all groups
  – Solutions are presented by one of the group members in a 5-15 minute talk
    • Each student has to present at least one solution

Ulf Leser: Text Analytics, Praktikum, Sommersemester 2008

Before we start
• You need to …
  – have finished your Vorstudium (basic studies) or have special permission
  – be somewhat experienced in Java
  – work well with others and contribute to your group's solutions
  – be willing to invest considerable time

Tentative Schedule
• Today: Group formation, assignment 1, assignment of topics for talks
• 26.10.: Presentation of solutions, assignment 2
  – Implement fast term search in a 100 MB corpus
• 2.11.: No program
• 9.11.: No program
• 16.11.: Presentation of solutions, assignment 3
  – Installation of a text mining framework; gene name tagging with a dictionary, using the chosen framework
• 23.11.: No program
• …

Challenge (Voluntary)
• Most assignments can be solved more or less well
• Speed or precision
• The best groups get points for each assignment
  – Best: 5 points; second: 2; third: 1
• The overall best group receives a small prize at the end of the semester

Questions?

Topics for Talks (in Order of Schedule)
1. Lucene
2. UIMA
3. GATE
4. LingPipe
5. OpenNLP
6. NLTK
7. Deep Web
8. TREC
9. PubMed Related Articles
10. WordNet
11. Geographical IR
12. Recognizing locations in text
13. Web-Scale Information Extraction
14. Declarative Information Extraction
15. Entity Search

Lucene
• Open-source, Java-based full-text indexing system
  – History, versions
  – Availability, installation
  – Functionality for IR
  – Index structures used
  – Integrated text preprocessing
  – Extensibility
  – Speed (benchmarks?)
  – Linguistic features?
  – …

UIMA
http://incubator.apache.org/uima/
• Unstructured Information Management Architecture
• Initiated by IBM, open source since 2006
• Framework-oriented; everything must be plugged in
  – Annotators and format (CAS: Common Analysis Structure)
• Comes with a set of Eclipse plug-ins for development
• Project
  – History
  – Installation
  – Usage / success
  – Features
  – Special focus or strengths
  – Short demo?

GATE
http://gate.ac.uk/
• General Architecture for Text Engineering
• Oldest of the NLP / text mining frameworks (~1996)
• Java, component-based, open source (version 4.0, 2007)
• Programming framework and (nice) development GUI
• Project
  – …

LingPipe
http://alias-i.com/lingpipe/
• Commercial tool with a license for research
• Java tool for linguistic analysis of human language
• Quite comprehensive (classification, NER, clustering, …)
• Project
  – …

OpenNLP
http://opennlp.sourceforge.net/
• Open-source library of NLP tools
• No GUI, Java-based
• Used to build processing pipelines
• Project
  – …

NLTK
http://www.nltk.org/
• Open-source Natural Language Toolkit in Python
• Used a lot for university teaching
• Has modules for everything
  – Accessing corpora, string processing, sentence tokenizers, stemmers, collocations, part-of-speech tagging, classification, chunking, regular expressions, named-entity parsing, semantic interpretation, evaluation metrics, frequency distributions, applications, …
• Project
  – …

The Deep Web
• Most information on the web is hidden behind a web interface
  – Which fraction?
  – How can we crawl it?
• Papers
  – He et al. (2007). "Accessing the Deep Web: A Survey". CACM
  – Madhavan et al. (2008). "Google's Deep-Web Crawl". VLDB
  – Raghavan, Garcia-Molina (2001). "Crawling the Hidden Web". VLDB

TREC
http://trec.nist.gov/
• "Text REtrieval Conference (TREC) … was started in 1992 … to support research within the information retrieval community by providing the infrastructure necessary for large-scale evaluation of text retrieval methodologies. …"
  – encourage research in IR based on large test collections;
  – increase communication among industry, academia, government …
  – speed the transfer of technology …
  – increase the availability of appropriate evaluation techniques …
• What are the specific tasks?
• How good are the results and how did they develop?
• Which resources are available?
• …

PubMed's "Related Articles" Feature
• Finds abstracts similar to a given one
  – Google has a similar feature
• Based on clustering / classification
• Papers
  – Wilbur, W. J. and Yang, Y. (1996). "An analysis of statistical term strength …." Comput Biol Med
  – Lin et al. (2007). "PubMed related articles: a probabilistic topic-based model for content similarity". BMC Bioinformatics

WordNet
http://wordnet.princeton.edu/
• "WordNet® is a large lexical database of English, …. Nouns, verbs, adjectives and adverbs are grouped into … synonyms. Synsets are interlinked by means of conceptual-semantic and lexical relations."
• A resource used by many to compute the relationship between words (as distance in the WordNet graph)
• Contains ~200,000 word-sense pairs
• What is the content? How can it be accessed? Exemplary applications? Format? Tools?

Geospatial IR
• Searches that combine classical predicates with geospatial predicates
  – Good hotels under 100 euros near Wuppertal
  – A Chinese restaurant in Kreuzberg
• Both predicates may be matched to a varying degree
  – Terms: semantic, vector space
  – Geo: physical distance
• Efficient and sensible ranking?
• Papers
  – Vaid et al. (2005). "Spatio-textual Indexing for Geographical Search on the Web". Symposium on Advances in Spatial and Temporal Databases
  – Chen, Y.-Y., Suel, T. and Markowetz, A. (2006). "Efficient query processing in geographic web search engines". SIGMOD

Geo NER: Recognizing locations in text
• Find all references to locations in a given text
  – "The post office branch on Brunnenstrasse was robbed yesterday"
  – "In Springfield, the snow was two meters deep in December"
• Problems
  – Find locations (streets, cities, countries, villages, places, buildings, etc.)
  – Find the right location (ambiguity)
• Approach
  – Build large dictionaries
  – Web: Look at linked web pages
• Papers
  – Amitay et al. (2004). "Web-a-where: geotagging web content". SIGIR
  – McCurley, K. S. (2001). "Geospatial mapping and navigation of the web". WWW
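
Sketch: dictionary-based location tagging
• A minimal illustration of the "build large dictionaries" approach listed for Geo NER, written in Python as an assumption (the course itself asks for Java experience)
• The gazetteer entries, the name GAZETTEER and the function find_locations are hypothetical and only serve the example; a real dictionary would hold many thousands of names
• The sketch only finds candidate locations with a longest-match lookup; it reports ambiguity (several candidates per name) but does not resolve it

```python
import re

# Hypothetical toy gazetteer: lower-cased surface form -> candidate locations.
GAZETTEER = {
    "springfield": ["Springfield, Illinois, USA", "Springfield, Massachusetts, USA"],
    "brunnenstrasse": ["Brunnenstrasse, Berlin, Germany"],
    "new york": ["New York City, USA"],
}

# Longest dictionary entry, measured in words; bounds the phrase length we try.
MAX_WORDS = max(len(name.split()) for name in GAZETTEER)


def find_locations(text):
    """Return (token_index, surface_form, candidates) for each dictionary hit."""
    tokens = re.findall(r"\w+", text)
    hits = []
    i = 0
    while i < len(tokens):
        # Try the longest phrase starting at token i first, then shorter ones.
        for n in range(min(MAX_WORDS, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n])
            candidates = GAZETTEER.get(phrase.lower())
            if candidates:
                hits.append((i, phrase, candidates))
                i += n  # skip the tokens consumed by this match
                break
        else:
            i += 1  # no dictionary entry starts at this token
    return hits


if __name__ == "__main__":
    for index, name, candidates in find_locations(
        "In Springfield, the snow was two meters deep in December"
    ):
        print(index, name, candidates)
```

• A name with more than one candidate (here "Springfield") is exactly the "find the right location" problem from the slide; context such as linked web pages would be needed to pick one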

Web-Scale Information Extraction
• Domain-independent fact extraction
• Applied to "the Web"
  – Scalability is a must
• Uses Hearst patterns
  – NP1 "such as" NPList2, NP1 "including" NPList2, NP1 "is a" NP2 …
• (Initial) assessment of the extracted facts
• A mix of (simple) NLP, machine learning, and information extraction
• Paper
  – Banko, M., Cafarella, M. J., Soderland, S., Broadhead, M. and Etzioni, O. (2007). "Open information extraction from the web". IJCAI, Hyderabad, India. pp 2670-2676.

Declarative Information Extraction
• Query language for formulating IE tasks
• Query optimization (cost model, cost estimates)
• Papers
  – Reiss et al. (2008). "An Algebraic Approach to Rule-Based IE". ICDE
  – Shen et al. (2007). "Declarative Information Extraction Using Datalog with Embedded Extraction Predicates". VLDB

Entity Search with EntityRank
• Web searches often look for specific pieces of information
  – The email of Luis Gravano?
  – What profs are doing databases at UIUC?
• Recognition of entities in web pages
• Ranking of all instances found
  – Number of occurrences, PageRank, context, …
• Papers
  – Brauer et al. (2010). "Graph-Based Concept Identification and Disambiguation for Enterprise Search". WWW
  – Cheng et al. (2007). "EntityRank: searching entities directly and holistically". VLDB

Who does what?
Topic                                 Date     Presenter(s)
Lucene                                2.11.    Kulagina
UIMA                                  2.11.    Kipar, Hartkopp
GATE                                  2.11.    Kaase, Wermke
LingPipe                              16.11.   Heideklang, Bethge
OpenNLP                               16.11.   Fajerski
NLTK                                  16.11.   Lelis, Frenzel
Deep Web                              2.12.
TREC                                  2.12.
PubMed Related Articles               2.12.
WordNet                               11.1.
Geographical IR                       11.1.
Recognizing locations in text         11.1.
Web-Scale Information Extraction      1.2.
Declarative Information Extraction    1.2.
Entity Search                         1.2.

Groups
• Group 1: Kipar, Hartkopp
• Group 2: Heideklang, Bethge
• Group 3: Fajerski, Severin
• Group 4: Lehmann, Weber, Kulagina
• Group 5: Frenzel, Mosolf
• Group 6: Kunkel, Isberner, Lilienthal
• Group 7: Wermke, Kaase
• Group 8: Minor, Lelis
• Group 9: Rocktäschel, Stoltmann
• Group 10: Brettschneider, Sarischeva, Kutin
• Please create your groups in GOYA as soon as possible
  – Use the group name "GroupX"