Foundations of Information Retrieval

Transcrição

Foundations of Information Retrieval
Text Analytics
Introduction to Information Retrieval
Ulf Leser
Content of this Lecture
• Information Retrieval
– Introduction
– Documents
– Queries
• A first idea: Boolean queries and vector space model
Ulf Leser: Text Analytics, Vorlesung, Sommersemester 2008
2
Information Retrieval
• Information Retrieval is about helping a user in finding
information
• IR is about finding information, not data [BYRN99]
• IR builds systems for end users, not for programmers
– No SQL
– IR today is used by almost everybody, databases are not
• IR searches text
– >90% of all information is available in text, not in databases
Ulf Leser: Text Analytics, Vorlesung, Sommersemester 2008
3
History of IR
• ~300 ad. Library of Alexandria , ~700.000 „documents“
• 1450: Bookprint
• 19th century: Indices and concordances
– Extremely expensive and laborious
•
•
•
•
•
•
•
KWIC and concordances: H.P. Luhn, IBM (1958)
Probabilistic models: Maron & Kuhns (1960)
Boolean queries: Lockheed (~1960)
Vector Space Model: Salton, Cornell (1965)
Digital libraries, SGML, metadata standards
Mid 90s: The web, web search engines
End 90s: Personalized search engines, enterprise search etc.
• 90% of all information is believed to be in unstructured form
Ulf Leser: Text Analytics, Vorlesung, Sommersemester 2008
4
Information Retrieval: The Problem
• A bit more concrete: Help user in quickly finding
information from a given set of documents
– A corpus (linguistics) or document collection
– Quickly means: The right documents and possibly the right
passages in documents
Query
IR System
(Ranked)
Results
Ulf Leser: Text Analytics, Vorlesung, Sommersemester 2008
Document base
Corpus
5
Why is it hard?
• Homonyms and context
– Fenster (Glas, Computer), Berlin (BRD, USA), Boden (Dach,
Fussboden), …
• Synonyms
– Computer, PC, Rechner, Server, …
• Specific queries
– What was the score of Bayern München versus Stuttgart im DFB
Pokal in 1998? Who scored the first goal for Stuttgart?
– How many hours of sunshine on average has a day in Krete in
May?
– What are the last lines of “To be or not to be” in the third stage?
• Typical web queries have 1,6 terms
• “Information broker” is a profession
Ulf Leser: Text Analytics, Vorlesung, Sommersemester 2008
6
… to Quickly Find Information
• Speed
– Time to execute a query (should be fast)
• Indexing, parallelization, compression, cluster, …
– Time to answer the query (should be fast too)
• Reducing the time to “answer the query” is harder
–
–
–
–
–
Success of search engines: Better results (and fast!)
Guess users need and return (only) relevant information
Process-orientation: User feedback, query history, …
Reliable information extraction
Ranking
• Information overload
– “We are drowing in data, but starving for knowledge”
– In the corpus is large, ranking is top priority
– Different search modes: What’s new? What’s certain?
Ulf Leser: Text Analytics, Vorlesung, Sommersemester 2008
7
Document or Passage
• Searching only metadata or searching within documents
• Returning only documents or returning relevant passages in document
Ulf Leser: Text Analytics, Vorlesung, Sommersemester 2008
8
KWIC: KeyWord In Context
Ulf Leser: Text Analytics, Vorlesung, Sommersemester 2008
9
Documents
• Natural language text
• Might be grammatical correct (books, newspapers) or not
(BLOGs, newsgroups, spoken language)
• May have structure (abstract, summary, chapters, …) or
not (plain text)
• May have (explicit or in-text) metadata (title, author, year,
…) or not
• May refer to other documents (hyperlinks)
• Various formats (ASCII, PDF, DOC, XML, …)
Ulf Leser: Text Analytics, Vorlesung, Sommersemester 2008
10
IR and XML
Ulf Leser: Text Analytics, Vorlesung, Sommersemester 2008
11
XML and IR
• XML has two flavors
– Data oriented: Highly structured, except from database, short
documents, high volume, no inline tags, structure-oriented search
– Document oriented: Less structured, large documents, inline tags,
text-oriented search
• This was the original purpose of XML (and SGML!)
• XQuery/XPath is strictly targeting data
– Focus on structural parts of a query
– No full-text search, few search options within a value
– Semantics: Exact matching
Ulf Leser: Text Analytics, Vorlesung, Sommersemester 2008
12
Queries
• Users formulate queries
– Keywords
– Phrases
– Logical operations (AND, OR, NOT, ...)
• Web search: “-ulf +leser”
– True questions (e.g. MS-Word help)
– Structured queries (author=… and title contains …)
• Querying with document
– Find documents similar to this one
• Query refinement
– Refine my query based on first results
– Find documents within the result set of the previous search
Ulf Leser: Text Analytics, Vorlesung, Sommersemester 2008
13
IR: An Iterative, Multi-Stage, Complex Process
Q2
Q1
Q4
Q3
Q5
Q0
• “moving through many actions towards a general goal of
satisfactory completion of research related to an
information need.”
– “Berry-picking”, Bates 89
Ulf Leser: Text Analytics, Vorlesung, Sommersemester 2008
14
Searching with Metadata (PubMed/Medline)
Ulf Leser: Text Analytics, Vorlesung, Sommersemester 2008
15
Query Refinement
Ulf Leser: Text Analytics, Vorlesung, Sommersemester 2008
16
Metadata Standard: Dublin Core
•
•
•
Dublin Core Metadata Initiative (W3C), 1995
15 attributes
–
–
–
–
–
–
–
–
–
–
–
–
–
–
–
identifier:
format:
type:
language
title
subject:
coverage:
description:
creator:
publisher:
contributor:
rights:
source:
relation:
date:
E.g. ISBN/ISSN, URL/PURL, DOI, …
MIME-Typ, media type,
collection, image, text, …
Keywords
Scope of doc in space and/or time
free text
Last person manipulating the doc
Copyright, licenses, …
Other doc
to other docs
Date or period
Plus refinements
Ulf Leser: Text Analytics, Vorlesung, Sommersemester 2008
17
Example
• Usage in HTML
• <head profile="http://dublincore.org/documents/dcq-html/">
<title>Dublin Core</title>
<link rel="schema.DC" href="http://purl.org/dc/..." />
<link rel="schema.DCTERMS" href="http://purl.org/..." />
<meta name="DC.format" scheme=„…" content="text/htm" />
<meta name="DC.type" scheme=„…" content="Text" />
<meta name="DC.publisher" content="Jimmy Whales" />
<meta name="DC.subject" content="Dublin Core Metadata" />
<meta name="DC.creator" content="Björn G. Kulms" />
<meta name="DCTERMS.license" scheme="DCTERMS.URI„
content="http://www.gnu.org/copyleft/fdl.html" />
</head>
Ulf Leser: Text Analytics, Vorlesung, Sommersemester 2008
18
Historic Texts
• Sachsenspiegel, ~1250
– ”Swerlenrecht kůnnen wil•d~
volge dis buches lere.alrest sul
wi mer ken, daz ...”
• Multiple representations
– Facsimile
– Digitalization / diplomacy
• How well can the facsimile be
reproduced?
– Differences in individual writers
(proliferating errors)
– Different translations
– Different editions
Ulf Leser: Text Analytics, Vorlesung, Sommersemester 2008
19
Text Encoding Initiative
• www.tei-c.org:
– „The Text Encoding Initiative (TEI) is a consortium which
collectively develops and maintains a standard for the
representation of texts in digital form. Its chief deliverable is a set
of guidelines which specify encoding methods for machine-readable
texts, chiefly in the humanities, social sciences and linguistics.“
• Very complex schema for representing text and annotation
– Graphical features, linguistic annotation, editions, metadata, …
• Widely used in linguistics
• Based on XML stand-off architecture, extensible
Ulf Leser: Text Analytics, Vorlesung, Sommersemester 2008
20
Note
• This lecture is on analyzing / searching in pure text
– No structure, no metadata etc.
• See lectures on databases, XML, semi-structured data,
semantic web, …
Ulf Leser: Text Analytics, Vorlesung, Sommersemester 2008
21
Components of an IR System
Corpus
Parsing
Document IDs
Normalization
Term Weighting
IR
Database
Normalization
Filtering
Parsing
Ranking
Query-Interface
Feedback
User
Ulf Leser: Text Analytics, Vorlesung, Sommersemester 2008
Frakes et al. 1992
22
Other Buzzwords
• Document management systems (DMS)
– Large commercial market, links to OCR, workflow systems, print
management systems, mail management, etc.
– Legal issues
– Every DMS includes an IR system
• Knowledge management
– “More sophisticated” DMS with semantic searching
• Ontologies, thesauri, topic maps, …
– Social aspects: incentives, standard procedures, enterprise
vocabulary
• “if Siemens knew what Siemens knows”
• Digital libraries
– Somewhat broader and less technical
• Includes social aspects, archiving, multimedia, …
Ulf Leser: Text Analytics, Vorlesung, Sommersemester 2008
23
Enterprise Content Management
• „The technologies used to capture, manage, store, deliver,
and preserve information to support business processes. „
• Authorization and authentication
• Business process management
and document flow
• Compliance: legal requirements
– Record management
– Pharma, Finance, …
• Collaboration and sharing
– Inter and intra organizations
– Transactions, locks, …
• Publishing: What, when, where
Quelle: AIIM International
– Web, catalogues, mail push, …
• …
Ulf Leser: Text Analytics, Vorlesung, Sommersemester 2008
24
Technique versus Content
• IR is about techniques for searching document collections
• Creating doc collections is a business: Content provider
– Pre-selected collections: classified business news, new patents, …
– Example for content
• Medline: >5000 Journals, 16 Mill citations, >500K additions every year
• Journals not in Medline don’t exist
• Thompson: patent search and indexing
– Example for classifying content
• Institute for Scientific Information (ISI)
• Impact factors: which journals count?
– Added value: Categorization, classification, filtering of docs
• Example: MeSH terms, ACM keywords, …
• Sometimes, content provider buy tech companies and vice
versa
Ulf Leser: Text Analytics, Vorlesung, Sommersemester 2008
25
Content of this Lecture
• Information Retrieval
• A first idea: Boolean queries and vector space model
Ulf Leser: Text Analytics, Vorlesung, Sommersemester 2008
26
Logical View
• The logical view on a document determines what we query
– E.g.: Only metadata, only title, only abstract, full text, …
• Logical view on the raw text
– Stemming
– Stopword removal
– Transformation of special characters (Umlaute, greek letters,
XML/HTML encodings, …)
– Removal of formatting information (HTML), tags (XML), …
• Bag of words (BoW)
– View on the text which ignores order of words
– Extremely popular
– Used form now on when not indicated otherwise
Ulf Leser: Text Analytics, Vorlesung, Sommersemester 2008
27
Preprocessing
Quelle: [BYRN99]
•
•
•
•
•
Tokenization can be a real issue
Special characters can be a real issue
Format conversion (PDF, OCR, PostScript, …) can be a real issue
Effect of normalization is not clear
Must be applied to all docs and to query (partially)
Ulf Leser: Text Analytics, Vorlesung, Sommersemester 2008
28
A First Approach: Boolean Queries (rough idea)
• Search all docs which “match” a logical expression of words
–
–
–
–
“bono AND u2”
“bono AND (Arkansas OR Indiana)”
“bono AND NOT u2”
Note: Search engines don’t support “OR”
• Evaluation on BoW b
– An atom (term) w in the expression evaluates to true if w∈b
– Compute values of all atoms
– Compute value of expression
• Advantages: Fast, simple
• Disadvantages
– No ranking of documents: Either relevant or not
– Logic is hard to understand for ordinary users
– Typical queries are too simple – too many results (without ranking)
Ulf Leser: Text Analytics, Vorlesung, Sommersemester 2008
29
Vector Space Model - Concept
Quelle: Henrich, Information Retrieval 1
Ulf Leser: Text Analytics, Vorlesung, Sommersemester 2008
30
Vector Space Model – Maths (rough idea)
• Let d by a doc, D be a collection of docs with d∈D
• Let w() map a doc or a collection to its bag of words
• Let vd by a vector for d with
– vd[i]=0 if the i’th word in w(D) is not contained in w(d)
– vd[i]=1 if the i’th word in w(DL) is contained in w(d)
• Let vq by the corresponding vector for query q
• Compute a relevance score for each document by
| w ( D )|
rel (d , q ) =
v d • vq
| vd | * | v q |
=
∑ v [i] * v [i]
i =1
d
2
[
]
v
i
∑d *
Ulf Leser: Text Analytics, Vorlesung, Sommersemester 2008
q
2
[
]
v
i
∑q
31
Angel between Doc and Query
Star
• The relevance is the angle
Astronomy
between the doc and the
query in |w(D)| - dimensional
space
• Compute relevance score for all d∈D
• Return docs in order of decreasing relevance
• Advantages
–
–
–
–
Movie stars
Mammals
Diet
Still quite simple
“Meaningful” ranking – better results than Boolean model
Allows approximate answers
Dominant model today
• Disadvantages
– Words are assumed to be statistically independent
Ulf Leser: Text Analytics, Vorlesung, Sommersemester 2008
32
Outlook: Latent Semantic Indexing
• We usually also want to find documents which don‘t even
contain the words we are searching for, but which contain
the concepts we are trying to express by our query
– But what is a concept?
• LSI approach
– Map query and docs into a concept space
– Compare and rank in concept space
– Technical: Dimensionality reduction (like PCA etc.)
• Other approach
– Find relevant docs using VSM with score above threshold
– Find consensus (e.g. words in all docs)
– Use consensus as new query and iterate
Ulf Leser: Text Analytics, Vorlesung, Sommersemester 2008
33