Introduction

Text Analytics
Praktikum
Ulf Leser
Idea
• The practical part of this course is larger than in a
usual Halbkurs (half course)
• Components
– Each student gives a 30-minute presentation at some point during the
semester
• The topics are mostly rather practical (presentation of a tool etc.)
• Topics are assigned during the semester (starting today)
– We will build groups
• Each group has to solve 5-6 exercises
• All exercises must be solved by all groups
– Solutions are presented by one of the group members in talks of
5-15 minutes
• Each student has to present at least one solution
Ulf Leser: Text Analytics, Praktikum, Sommersemester 2008
Before we start
• You need to …
– have finished your Vorstudium (or have special permission)
– be somewhat experienced in Java
– cooperate well and contribute to your group's solutions
– be willing to invest considerable time
Tentative Schedule
• Today: Group formation, assignment 1, assignment of
topics for talks
• 26.10.: Presentation of solutions, assignment 2
– Implement fast term search in a 100 MB corpus
• 2.11.: No program
• 9.11.: No program
• 16.11.: Presentation of solutions, assignment 3
– Installation of a text mining framework; gene name tagging with a
dictionary, using the chosen framework
• 23.11.: No program
• …
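As an illustration of what "fast term search" in assignment 2 can mean, here is a minimal sketch of an inverted index in Python. It is not the expected solution: the tokenizer, the function names, and the three sample documents are assumptions made for this example, and a real 100 MB corpus would need a more careful design (compact postings, disk layout).

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"\w+", text.lower())


def build_index(docs):
    """Map each term to the sorted list of ids of documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for term in tokenize(text):
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


def search(index, query):
    """Return ids of documents that contain every query term (AND search)."""
    terms = tokenize(query)
    if not terms:
        return []
    result = set(index.get(terms[0], []))
    for term in terms[1:]:
        result &= set(index.get(term, []))
    return sorted(result)


# Hypothetical mini corpus, for illustration only.
docs = [
    "BRCA1 is a human tumor suppressor gene",
    "Lucene is a full-text indexing library",
    "The gene TP53 encodes a tumor protein",
]
index = build_index(docs)
print(search(index, "tumor gene"))  # [0, 2]
```

The index is built once; each query then only touches the postings of its own terms instead of scanning the whole corpus.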
Challenge - Voluntary
• Most assignments can be solved more or less well
• Criteria: speed or precision
• The best groups will get points for each assignment
– Best: 5 points; second: 2; third: 1
• The overall best group will get a little present at the end of
the semester
Questions?
Topics for Talks (in Order of Schedule)
1. Lucene
2. UIMA
3. GATE
4. LingPipe
5. OpenNLP
6. NLTK
7. Deep Web
8. TREC
9. PubMed Related Articles
10. WordNet
11. Geographical IR
12. Recognizing locations in text
13. Web-Scale Information Extraction
14. Declarative Information Extraction
15. Entity Search
Lucene
• Open-source, Java-based full-text indexing system
– History, versions
– Availability, installation
– Functionality for IR
– Index structures used
– Integrated text preprocessing
– Extensibility
– Speed (benchmarks?)
– Linguistic features?
– …
UIMA http://incubator.apache.org/uima/
• Unstructured Information Management Architecture
• Initiated by IBM, open source since 2006
• Framework-oriented; everything must be plugged in
– Annotators and a common format (CAS: Common Analysis Structure)
• Comes with a set of Eclipse plug-ins for development
• Project
– History
– Installation
– Usage / success
– Features
– Special focus or strengths
– Short demo?
GATE http://gate.ac.uk/
• General Architecture for Text Engineering
• Oldest of the NLP / text mining frameworks (~1996)
• Java, component-based, open source (version 4.0, 2007)
• Programming framework and (nice) development GUI
• Project
– …
LingPipe: http://alias-i.com/lingpipe/
• Commercial tool with a license for research
• Java tool for linguistic analysis of human language
• Quite comprehensive (classification, NER, clustering, …)
• Project
– …
OpenNLP, http://opennlp.sourceforge.net/
• Open-source library of NLP tools
• No GUI, Java-based
• Used to build processing pipelines
• Project
– …
NLTK http://www.nltk.org/
• Open-source Natural Language Toolkit in
Python
• Used a lot for university teaching
• Has modules for a wide range of tasks
– Accessing corpora, string processing, sentence tokenizers,
stemmers, collocations, part-of-speech tagging, classification,
chunking, regular expressions, named-entity parsing, semantic
interpretation, evaluation metrics, frequency distributions,
applications, …
• Project
– …
The Deep Web
• Most information on the web is hidden behind a web interface
– Which fraction?
– How can we crawl it?
• Papers
– He et al. (2007). “Accessing the Deep Web: A Survey”. CACM
– Madhavan et al. (2008). “Google's Deep-Web Crawl”. VLDB
– Raghavan, Garcia-Molina (2001). “Crawling the Hidden Web”. VLDB
TREC: http://trec.nist.gov/
• Text REtrieval Conference (TREC) … was started in 1992 …
to support research within the information retrieval
community by providing the infrastructure necessary for
large-scale evaluation of text retrieval methodologies. …
– encourage research in IR based on large test collections;
– increase communication among industry, academia, government …
– speed the transfer of technology …
– increase the availability of appropriate evaluation techniques …
• What are the specific tasks?
• How good are the results and how did they develop?
• Which resources are available?
• …
PubMed's “Related Articles” Feature
• Abstracts similar to a given one
– Google has a similar feature
• Based on clustering / classification
• Papers
– Wilbur, W. J. and Yang, Y. (1996). “An analysis of statistical term strength ….” Comput Biol Med
– Lin et al. (2007). “PubMed related articles: a probabilistic topic-based model for content similarity.” BMC Bioinformatics
WordNet http://wordnet.princeton.edu/
• “WordNet® is a large lexical database of English, …. Nouns,
verbs, adjectives and adverbs are grouped into … synonyms.
Synsets are interlinked by means of conceptual-semantic
and lexical relations.”
• A resource many use to compute how closely words are related
(measured as distance in the WordNet graph)
• Contains ~200,000 word-sense pairs
• What is the content? How can it be accessed? Exemplary
applications? Format? Tools?
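The idea of "distance in the WordNet graph" can be sketched as a shortest-path search. The following Python example uses breadth-first search over a tiny hand-made "is-a" graph. The edges are invented for illustration and are not taken from WordNet, and real applications typically use more refined similarity measures than the raw path length.

```python
from collections import deque

# Toy, hand-made "is-a" edges; NOT real WordNet content.
EDGES = [
    ("dog", "canine"), ("wolf", "canine"), ("canine", "carnivore"),
    ("cat", "feline"), ("feline", "carnivore"), ("carnivore", "animal"),
]


def build_graph(edges):
    """Build an undirected adjacency map from a list of edges."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def distance(graph, start, goal):
    """Number of edges on a shortest path, or None if there is no path."""
    if start not in graph or goal not in graph:
        return None
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for neighbor in graph[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    return None


graph = build_graph(EDGES)
print(distance(graph, "dog", "wolf"))  # 2 (dog - canine - wolf)
print(distance(graph, "dog", "cat"))   # 4 (via carnivore)
```

In this toy graph, "dog" is closer to "wolf" than to "cat", which matches the intuition that a shorter path means more closely related words.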
Geospatial IR
• Searches that combine classical (textual) with geospatial predicates
– Good hotels under 100 euros near Wuppertal
– A Chinese restaurant in Kreuzberg
• Both predicates may be matched to a varying degree
– Terms: semantic, vector space
– Geo: physical distance
• Efficient and sensible ranking?
• Papers
– Vaid et al. (2005). "Spatio-textual Indexing for Geographical Search on the
Web". Symposium on Advances in Spatial and Temporal Databases
– Chen, Y.-Y., Suel, T. and Markowetz, A. (2006). "Efficient query
processing in geographic web search engines". SIGMOD
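One simple way to picture a ranking in which both predicates match "to a varying degree" is a weighted sum of a text score and a distance-based score. The Python sketch below is an assumption made for illustration, not the method of the cited papers: the scoring functions, the weight `alpha`, the decay constant `scale_km`, the sample documents, and the approximate coordinates are all hypothetical.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points in kilometers."""
    radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(a))


def text_score(query, text):
    """Fraction of query terms that occur in the text (between 0 and 1)."""
    query_terms = set(query.lower().split())
    text_terms = set(text.lower().split())
    return len(query_terms & text_terms) / len(query_terms) if query_terms else 0.0


def geo_score(dist_km, scale_km=10.0):
    """1.0 at distance 0, decaying toward 0 (assumed decay function)."""
    return 1.0 / (1.0 + dist_km / scale_km)


def rank(query, center, docs, alpha=0.5):
    """Order document names by alpha * text score + (1 - alpha) * geo score."""
    scored = []
    for name, text, lat, lon in docs:
        dist = haversine_km(center[0], center[1], lat, lon)
        score = alpha * text_score(query, text) + (1 - alpha) * geo_score(dist)
        scored.append((score, name))
    return [name for score, name in sorted(scored, reverse=True)]


# Approximate coordinates, for illustration only.
WUPPERTAL = (51.26, 7.15)
DOCS = [
    ("hotel_wuppertal", "good hotel with breakfast", 51.26, 7.18),
    ("hotel_berlin", "good cheap hotel", 52.50, 13.40),
    ("restaurant_wuppertal", "chinese restaurant", 51.27, 7.15),
]
print(rank("good hotel", WUPPERTAL, DOCS))
```

With `alpha=0.5` the nearby hotel ranks first; with `alpha=0.0` only distance counts and the nearby restaurant moves to the top. Choosing this trade-off sensibly, and evaluating it efficiently, is the open question the slide raises.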
Geo NER: Recognizing locations in text
• Find all references to locations in a given text
– “The post office branch on Brunnenstrasse was robbed yesterday”
– “In Springfield, the snow was two meters deep in December”
• Problems
– Find locations (streets, cities, countries, villages, places, buildings, etc.)
– Find the right location (ambiguity)
• Approach
– Build large dictionaries
– Web: Look at linked web pages
• Papers
– Amitay et al. (2004). "Web-a-where: geotagging web content". SIGIR
– McCurley, K. S. (2001). "Geospatial mapping and navigation of the web". WWW
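The dictionary approach and the ambiguity problem can be sketched together in a few lines of Python. The example below performs a greedy longest match of token sequences against a small gazetteer. The gazetteer entries and candidate lists are invented for this illustration, and the sketch only lists candidates; it does not resolve which one is meant.

```python
import re

# Tiny hand-made gazetteer; keys are lowercased token tuples.
# Entries and candidate lists are illustrative only.
GAZETTEER = {
    ("springfield",): ["Springfield, Illinois", "Springfield, Massachusetts"],
    ("new", "york"): ["New York City"],
    ("york",): ["York, England"],
}
MAX_LEN = max(len(key) for key in GAZETTEER)


def tag_locations(text):
    """Return (surface form, candidate locations) pairs, longest match first."""
    tokens = re.findall(r"\w+", text)
    lowered = [token.lower() for token in tokens]
    found = []
    i = 0
    while i < len(tokens):
        # Try the longest possible span first, then shorter ones.
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            key = tuple(lowered[i:i + n])
            if key in GAZETTEER:
                found.append((" ".join(tokens[i:i + n]), GAZETTEER[key]))
                i += n
                break
        else:
            i += 1  # no dictionary entry starts at this token
    return found


for surface, candidates in tag_locations(
        "In Springfield the snow was deep, unlike in New York"):
    print(surface, "->", candidates)
```

"Springfield" yields two candidates, which is exactly the "find the right location" problem; longest match keeps "New York" from being tagged as "York".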
Web-Scale Information Extraction
• Domain-independent fact extraction
• Applied to “the Web”: scalability is a must
• Use of Hearst patterns
– NP1 “such as” NPList2, NP1 “including” NPList2, NP1 “is a” NP2 …
• (Initial) assessment of the extracted facts
• Mix of (simple) NLP, machine learning, and
information extraction
• Paper
– Banko, M., Cafarella, M. J., Soderland, S., Broadhead, M. and
Etzioni, O. (2007). "Open information extraction from the web".
IJCAI, Hyderabad, India. pp 2670-2676.
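To make the first Hearst pattern concrete, here is a minimal Python sketch that extracts "is-a" pairs from NP1 “such as” NPList2 with a regular expression. It is a strong simplification and an assumption of this example: noun phrases are approximated by single words instead of being found with a chunker, and the function name and sample sentences are hypothetical.

```python
import re

# Simplification: every noun phrase is approximated by a single word.
ITEM = r"\w+"
SEPARATOR = r"(?:,\s*(?:and\s+|or\s+)?|\s+(?:and|or)\s+)"
NP_LIST = rf"{ITEM}(?:{SEPARATOR}{ITEM})*"
SUCH_AS = re.compile(rf"(\w+),?\s+such\s+as\s+({NP_LIST})")


def extract_isa(text):
    """Return (instance, class) pairs found via the 'such as' pattern."""
    pairs = []
    for match in SUCH_AS.finditer(text):
        hypernym = match.group(1)
        for item in re.split(SEPARATOR, match.group(2)):
            pairs.append((item, hypernym))
    return pairs


print(extract_isa("We cover frameworks such as UIMA, GATE, and LingPipe in class."))
# [('UIMA', 'frameworks'), ('GATE', 'frameworks'), ('LingPipe', 'frameworks')]
```

A pattern this naive also produces wrong facts (for example when "and" starts a new clause), which is one reason the extracted facts need an assessment step.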
Declarative Information Extraction
• Query language for formulating IE tasks
• Query optimization (cost model, cost estimates)
• Papers
– Reiss et al. (2008). "An Algebraic Approach to Rule-Based IE". ICDE
– Shen et al. (2007). "Declarative Information Extraction Using Datalog with
Embedded Extraction Predicates". VLDB
Entity Search with EntityRank
• Web searches are often for specific pieces of information
– The email of Luis Gravano?
– What profs are doing databases at UIUC?
• Recognition of entities in web pages
• Ranking of all instances found
– Number of occurrences, PageRank, context …
• Papers
– Brauer et al. (2010). "Graph-Based Concept Identification and Disambiguation for
Enterprise Search". WWW
– Cheng et al. (2007). "EntityRank: searching entities directly and holistically". VLDB
Who does what?
Topic                                Date    Presenter(s)
Lucene                               2.11.   Kulagina
UIMA                                 2.11.   Kipar, Hartkopp
GATE                                 2.11.   Kaase, Wermke
LingPipe                             16.11.  Heideklang, Bethge
OpenNLP                              16.11.  Fajerski
NLTK                                 16.11.  Lelis, Frenzel
Deep Web                             2.12.
TREC                                 2.12.
PubMed Related Articles              2.12.
WordNet                              11.1.
Geographical IR                      11.1.
Recognizing locations in text        11.1.
Web-Scale Information Extraction     1.2.
Declarative Information Extraction   1.2.
Entity Search                        1.2.
Groups
• Group 1: Kipar, Hartkopp
• Group 2: Heideklang, Bethge
• Group 3: Fajerski, Severin
• Group 4: Lehmann, Weber, Kulagina
• Group 5: Frenzel, Mosolf
• Group 6: Kunkel, Isberner, Lilienthal
• Group 7: Wermke, Kaase
• Group 8: Minor, Lelis
• Group 9: Rocktäschel, Stoltmann
• Group 10: Brettschneider, Sarischeva, Kutin
• Please build groups in GOYA asap
– Using group name “GroupX”