Text Analytics
Models for Information Retrieval 1
Ulf Leser
Processing Pipeline
[Figure: processing pipeline after Frakes et al. 1992. Indexing path: Corpus → Parsing / Format Conversion → Text Preprocessing (Normalization, Filtering) → Linguistic Annotation (Parsing, Named Entity Recognition, Named Entity Normalization, Relationship Extraction) → Term Weighting → IR Database (with Document IDs). Retrieval path: User → Interface → Query Normalization → Retrieval → Ranking → Feedback.]
Format Conversion
[Example: two-column PDF text interleaved by naive extraction – "New generation of e-commerce applications require data schemas In relational database systems, data objects are conventionally that are constantly evolving and sparsely populated. The conven- stored using a horizontal scheme. A data object is represented as tional horizontal row representation fails to meet these require- …"]
• Transform text into ASCII / Unicode
• Formats
– PDF, XML, DOC, ODP, images, …
• Problems
– Formatting instructions, special characters, formulas, figures and captions, tables, section headers, footnotes, page numbers, margin text, …
• Diplomacy: To what extent can one reconstruct the original document
from the normalized representation?
Assumptions
• A document is a sequence of sentences
• A sentence is a sequence of tokens
• A token is the smallest unit of text
– E.g. words, numbers
• A term is a token or a set of tokens with a semantic meaning
– E.g. single tokens, proper names
– “San” is a token, but not a term
– “San Francisco” is two tokens but one term
– Dictionaries usually contain terms, not tokens
• A word is either a token or a term – depends on the context
• A concept is the mental representation of a real-world thing
– Concepts are represented by terms/words
– Terms may represent multiple concepts: homonyms
– Concepts may be represented by multiple terms: synonyms
Case – A Difficult Case
• Should text be converted to lower case letters?
• Advantages
– Makes queries simpler
– Decreases index size a lot
• Disadvantages
– Abbreviations can no longer be recognized (“US” vs. “us”)
– Loses important hints for sentence splitting
• Sentences start with upper case letters
– Loses important hints for tokenization, NER, …
• Often
– Convert to lower case after all other steps
Sentence Splitting
• Most linguistic analysis works on sentence level
• Sentences group together objects and statements
– But mind references: “What is this white thing? It is a house.”
• Naive approach: Reg-Exp search for “[.?!;] “
– (note the blank)
– Abbreviations
• “C. Elegans is a worm which …”; “This does not hold for the U.S.”
– Errors (due to previous normalization steps)
• “is not clear.Conclusions.We reported on …”
– Proper names
• “DOT.NET is a technique for …”
– Direct speech
• “By saying “It is the economy, stupid!”, B. Clinton meant that …”
– …
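A minimal sketch of the naive reg-exp approach and its abbreviation problem (the second example sentence is invented for illustration):

import re

# Naive splitter: break after '.', '?', '!' or ';' followed by a blank.
def split_sentences(text):
    return re.split(r'(?<=[.?!;]) ', text)

print(split_sentences("C. Elegans is a worm. It lives in soil."))
# ['C.', 'Elegans is a worm.', 'It lives in soil.']  <- abbreviation 'C.' wrongly ends a sentence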
Tokenization
• Split a sentence into tokens
– But in a way that allows for recognition of multi-token terms later
• Simple approach: search for “ ” (blanks)
– “In San Francisco it rained a lot this year.”
– “A state-of-the-art Z-9 Firebird was purchased on 3/12/1995.”
– “SQL commands comprise SELECT … FROM … WHERE clauses; the latter may contain functions such as leftstr(STRING, INT).”
– “This LCD-TV-Screen cost 3.100,99 USD.”
– “[Bis[1,2-cyclohexanedionedioximato(1-)-O][1,2-cyclohexanedione dioximato(2-)-O]methyl-borato(2-)-N,N′,N″,N‴,N′′′′,N′′′′′]-chlorotechnetium] belongs to a family of …”
• Typical approach
– Treat hyphens/parentheses as blanks
– Remove “.” (of abbreviations; after sentence splitting)
– Hope that errors are rare and don’t matter much
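A minimal sketch of this typical approach (one plausible realization; the exact rule set is an assumption):

import re

def tokenize(sentence):
    # Treat hyphens and parentheses as blanks ...
    sentence = re.sub(r'[-()\[\]]', ' ', sentence)
    # ... and strip trailing periods left over after sentence splitting.
    return [t.rstrip('.') for t in sentence.split()]

print(tokenize("A state-of-the-art Z-9 Firebird was purchased on 3/12/1995."))
# ['A', 'state', 'of', 'the', 'art', 'Z', '9', 'Firebird', 'was', 'purchased', 'on', '3/12/1995']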
Stems versus Lemmas
• Common idea: Normalize words to a basal, “normal” form
– car, cars -> car; gives, gave, give -> give
• Stemming reduces words to a base form
– Base form often is not a word in the language
– Ignores detached verb prefixes (“Ich kaufe ein” -> “einkaufen” or just “kauf”)
– Rather simple
• Lemmatization reduces words to their lemma
– Finds the linguistic root of a word
– Must be a proper word of the language
– Rather difficult, requires morphological analysis
Stop Words
• Words that are so frequent that their removal (hopefully) does not
change the meaning of a doc for retrieval/mining purposes
– English: Top-2: 10% of all tokens; Top-6: 20%; Top-50: 50%
– English (top10; LOB corpus): the, of, and, to, a, in, that, is, was, it
– German: aber, als, am, an, auch, auf, aus, bei, bin, …
• Removing stop words reduces a positional token index by ~40%
– Search engines do remove stop words (but which?)
• Hope: Increase in precision due to less spurious hits
– Clearly, this also depends on the usage of stop words in the query
• Variations
– Remove top 10, 100, 1000, … words
– Use language-specific or corpus-specific stop word list
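A minimal sketch of corpus-specific stop word removal (the cutoff of 50 is an arbitrary illustrative choice):

from collections import Counter

def remove_stop_words(docs, top_n=50):
    # docs: list of token lists; stop word list = the top_n most frequent tokens
    freq = Counter(tok for doc in docs for tok in doc)
    stop = {tok for tok, _ in freq.most_common(top_n)}
    return [[tok for tok in doc if tok not in stop] for doc in docs]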
Zipf‘s Law
• Let f be the frequency of a word and r its rank in the list of
all words sorted by frequency
• Zipf’s law: f ~ k/r for some constant k
• Example
– Word ranks in Moby Dick
– Good fit to Zipf’s law
– Some domain dependency (“whale”)
• Zipf’s law is a fairly good approximation in most corpora
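A small sketch to eyeball the fit on one's own corpus (reading from “corpus.txt” is an assumption):

from collections import Counter

tokens = open('corpus.txt', encoding='utf-8').read().lower().split()
ranked = Counter(tokens).most_common()
# Under Zipf's law, f * r should be roughly constant (the k above).
for r, (word, f) in enumerate(ranked[:10], start=1):
    print(r, word, f, f * r)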
Proper Names – Named Entity Recognition
• Proper names are different from ordinary words
– Very often comprise more than one token
– Often not contained in lexica
– Appear and disappear all the time
– Often contain special characters
– Are very important for information retrieval and text mining
• Search for “Bill Clinton” (not “Clinton ordered the bill”)
• Extract all relationships between “a disease” and the DMD gene
• Recognizing proper names is difficult – more later
– Named entity recognition
• “dark” is a proper name
• “broken wing” is not a proper name, but “broken-wing gene” possibly is
– Disambiguation
• “dark” is a gene of the fruitfly (and a common English word)
Document Summarization
• Preprocessing may go far beyond what we discussed
• Statistical summarization
– Remove all tokens that are not representative of the text
– Only keep words which are more frequent than expected by chance
• “Chance”: Over all documents in the collection (corpus-view), in the
language (naïve-view)
– Makes amount of text much smaller, hopefully doesn’t influence IR
performance
• Semantic summarization
– “Understand” text and create summary of content
• Annotation / classification
– Have a human understand the text and categorize it / describe it
with words from a controlled dictionary
– That’s what libraries are crazy about and Yahoo is famous for
Content of this Lecture
• IR Models
• Boolean Model
• Vector Space Model
• Relevance Feedback in the VSM
• Probabilistic Model
• Latent Semantic Indexing
• Other IR Models
Information Retrieval Core
• The core question in IR:
Which of a given set of (normalized) documents is relevant
for a given query?
• Ranking:
How relevant is each document of a given set for a given
query?
[Figure: the query and the document base are each normalized; a match / relevance-scoring step compares them and produces the result]
IR Models
• Modeling: How is relevance judged?
– What model of relevancy is assumed?
[Figure: taxonomy of IR models [BYRN99]]
• User task: retrieval (ad hoc, filtering) or browsing
• Classic models: Boolean, Vector-Space, Probabilistic
– Set-theoretic extensions: Fuzzy, Extended Boolean
– Algebraic extensions: Generalized Vector, Latent Semantic Indexing, Neural Networks
– Probabilistic extensions: Inference Network, Belief Network
• Structured models: Non-Overlapping Lists, Proximal Nodes
• Browsing models: Flat, Structure Guided, Hypertext
Notation
• All of the models we discuss use the “Bag of Words” view
• Definition
– Let K be the set of all terms, k∈K is a term
– Let D be the set of all documents, d∈D is a document
– Let vd be a vector for d with
• vd[i]=0 if the i’th term ki of K is not contained in d
• vd[i]=1 if the i’th term ki of K is contained in d
• Let vq be a vector for q in the same sense
– Often, we shall use weights instead of 0/1 memberships
– Let wij≥0 be the weight of term ki in document dj
• To what extent does ki contribute to the content of dj?
• wij=0 if ki∉dj
• vd and vq are defined accordingly
Content of this Lecture
• IR Models
• Boolean Model
• Vector Space Model
• Relevance Feedback in the VSM
• Probabilistic Model
• Latent Semantic Indexing
• Other IR Models
Boolean Model
• Simple model based on set theory
• Queries are specified as Boolean expressions
– Terms are atoms
– Terms are connected by AND, OR, NOT (XOR, ...)
• There are no real weights; wij∈{0,1}
– All terms contribute equally to the content of a doc
• Relevance of a document is either 0 or 1
– Computation by Boolean operations on the doc and query vector
– Represent q as set S of vectors over all terms
• Example: q=k1 ∧(k2∨¬k3) => S={(1,1,1),(1,1,0),(1,0,0)}
• S describes all possible combinations of keywords that make a doc
relevant
– Document d is relevant for q iff vd∈S
Properties
• Simple, clear semantics, widely used in (early) systems
• Disadvantages
– No partial matching (expression must evaluate to true)
• Suppose query k1∧k2∧…∧k9
• A doc d containing k1,…,k8 but not k9 is rejected just like a doc containing none of the terms
– No ranking (extensions exist, but not satisfactory)
– Query terms cannot be weighted
– Average users don’t like Boolean expressions
• Search engines: “+bill +clinton –lewinsky tax”
• Results: Often unsatisfactory
– Too many documents (too few restrictions)
• Without ranking
– Too few documents (too many restrictions)
A Note on Implementation
• Our definition only gave one way of defining the semantics of a
Boolean query; this should not be used for implementations
• The set S of vectors can be extremely large
– Suppose q=k1 and |K|=1000 => S has 2^999 elements
• Improvement: Evaluate only those terms that are used in the query
• Usual way of implementation: Indexing
– Assume we have an index with fast operation find: K → P(D)
– Search each term ki of the query, resulting in a set Di of docs with Di⊆D
– Evaluate query in the usual order using set operations on Di’s
• ki ∧ kj : Di ∩ Dj
• ki ∨ kj : Di ∪ Dj
• NOT ki: D\Di
• Use cost-based optimization
– Evaluate those subexpressions first which hopefully result in small results
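A minimal sketch of this index-based evaluation with Python sets (the tiny index is made up for illustration):

# Inverted index: term -> set of doc IDs, i.e. the operation find: K -> P(D)
index = {'bill': {1, 2, 5}, 'clinton': {2, 5, 7}, 'tax': {2, 3}}
all_docs = {1, 2, 3, 4, 5, 6, 7}

def find(term):
    return index.get(term, set())

# q = bill AND (clinton OR NOT tax), evaluated with set operations
result = find('bill') & (find('clinton') | (all_docs - find('tax')))
print(result)  # {1, 2, 5}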
Negation in the Boolean Model
• Evaluating “NOT ki: D\Di” is expensive
– Result often is very, very large (|D\Di|≈|D|)
• Because almost all terms appear in almost no documents
– Recall Zipf’s Law – the tail of the distribution
• If ki is not a stop word, which we removed anyway
• Solution 1: Disallow negation
– This is what web search engines do (exception?)
• Solution 2: Allow only in the form “ki ∧ NOT kj”
– Evaluation: Compute Di and remove docs which do not contain kj
Content of this Lecture
• IR Models
• Boolean Model
• Vector Space Model
• Relevance Feedback in the VSM
• Probabilistic Model
• Latent Semantic Indexing
• Other IR Models
Vector Space Model
• Salton, G., Wong, A. and Yang, C. S. (1975). "A Vector
Space Model for Automatic Indexing." Communications of
the ACM 18(11): 613-620.
– Historically, a breakthrough in IR
– Still most popular model today
• General idea
– Fix a vocabulary K
• Typically the set of all different terms in D
– View each doc and each query as a point in a |K|-dim space
– Rank docs according to distance from the query
• Main advantages
– Ability to rank docs (according to relevance)
– Natural capability for partial matching
Vector Space
• Each term in K is one dimension
– Different suggestions for determining co-ordinates (= weights)
– Different suggestions for determining the distance between two points
• The closest docs are the most relevant ones
– Rationale: vectors correspond to themes which are loosely related to sets of terms
– Distance between vectors ~ distance between themes
[Figure: two-dimensional sketch with themes “Star” and “Diet”; astronomy and movie-star docs lie along “Star”, mammal docs along “Diet”. Quelle: http://www.vilceanu.ne]
Preliminaries 1
• Definition
Let D be a document collection, K be the set of all terms in
D, d∈D and k∈K.
– The term frequency tfdk is the frequency of k in d
– The document frequency dfk is the number of docs in D containing k
• Or sometimes the number of occurrences of k in D
– The inverse document frequency idfk = |D| / dfk
• Actually, it should rather be called the inverse relative document
frequency
• In practice, one usually uses idfk = log(|D| / dfk)
• Remarks
– idf is a popular weighting factor, hence this proper definition
– Clearly, tfdk is a natural measure for the weight wdk
• But not the only or best one
Preliminaries 2
• Recall
– The scalar product between vectors v and w: v ∘ w = |v| · |w| · cos(v, w)
– with |v| = √(Σ vi²)
– and v ∘ w = Σi=1..n vi · wi
The Simplest Approach
• Co-ordinates are set as term frequency
• Distance is measured as the cosine of the angle between doc d and query q:

sim(d, q) = cos(vd, vq) = (vd ∘ vq) / (|vd| · |vq|) = Σ(vd[i] · vq[i]) / (√(Σ vd[i]²) · √(Σ vq[i]²))

– The factor |vq| is constant for a given query and can be dropped for ranking
– But wait for the discussion of Euclidean distance below
• Properties
– Including term weights increases retrieval performance
– Computes a (seemingly reasonable) ranking
– Also returns partial matches
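A minimal sketch of this scoring over a fixed vocabulary (plain term-frequency vectors; values taken from the example on the next slide, vocabulary order verkauf, haus, italien, gart, miet, blüh, woll):

from math import sqrt

def cosine(vd, vq):
    # Assumes non-zero vectors of equal length.
    dot = sum(d * q for d, q in zip(vd, vq))
    return dot / (sqrt(sum(d * d for d in vd)) * sqrt(sum(q * q for q in vq)))

print(cosine([1, 1, 1, 0, 0, 0, 0], [0, 1, 1, 1, 1, 0, 1]))  # ≈ 0.52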
Example Data
• Assume stop word removal (wir, in, …) and stemming
(verkaufen -> kauf, wollen -> woll, …)
Text                                                verkauf  haus  italien  gart  miet  blüh  woll
1 Wir verkaufen Häuser in Italien                      1      1      1       –     –     –     –
2 Häuser mit Gärten zu vermieten                       –      1      –       1     1     –     –
3 Häuser: In Italien, um Italien, um Italien herum     –      1      3       –     –     –     –
4 Die italienschen Gärtner sind im Garten              –      –      1       2     –     –     –
5 Der Garten in unserem italienschen Haus blüht        –      1      1       1     –     1     –
Q Wir wollen ein Haus mit Garten in Italien mieten     –      1      1       1     1     –     1
Ranking without Length Normalization
sim(d, q) = vd ∘ vq = Σi=1..n vd[i] · vq[i]

(term weights as in the table on the previous slide)

• sim(d1,q) = 1·0 + 1·1 + 1·1 + 0·1 + 0·1 + 0·0 + 0·1 = 2
• sim(d2,q) = 1+1+1 = 3
• sim(d3,q) = 1+3 = 4
• sim(d4,q) = 1+2 = 3
• sim(d5,q) = 1+1+1 = 3

No good?
Rg  Q: Wir wollen ein Haus mit Garten in Italien mieten
1   d3: Häuser: In Italien, um Italien, um Italien herum
2   d5: Der Garten in unserem italienschen Haus blüht
2   d4: Die italienschen Gärtner sind im Garten
2   d2: Häuser mit Gärten zu vermieten
5   d1: Wir verkaufen Häuser in Italien
Ranking with Length Normalization
sim(d, q) = Σ(vd[i] · vq[i]) / √(Σ vd[i]²)

• sim(d1,q) = (1·0 + 1·1 + 1·1 + 0·1 + 0·1 + 0·0 + 0·1) / √3 ≈ 1.15
• sim(d2,q) = (1+1+1) / √3 ≈ 1.73
• sim(d3,q) = (1+3) / √10 ≈ 1.26
• sim(d4,q) = (1+2) / √5 ≈ 1.34
• sim(d5,q) = (1+1+1) / √4 = 1.5

No good?
Rg  Q: Wir wollen ein Haus mit Garten in Italien mieten
1   d2: Häuser mit Gärten zu vermieten
2   d5: Der Garten in unserem italienschen Haus blüht
3   d4: Die italienschen Gärtner sind im Garten
4   d3: Häuser: In Italien, um Italien, um Italien herum
5   d1: Wir verkaufen Häuser in Italien
Improved Scoring: TF*IDF
• One obvious problem: The longer the document, the
higher the chances of being ranked high
– Solution: Normalize on document length (also yields 0 ≤ wij ≤ 1)
wij = tfij / |di|   with   |di| = Σk=1..|di| tfik
• Second obvious problem: Some terms are just everywhere
in D and should be scored less
– Solution: Use IDF scores
wij = (tfij / |di|) · idfj
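A minimal sketch computing these weights for a tokenized corpus (the data layout is an assumption; idf is used without the log, as in the formula above):

def tfidf_weights(docs):
    # docs: list of token lists
    D = len(docs)
    df = {}
    for doc in docs:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    weights = []
    for doc in docs:
        tf = {}
        for term in doc:
            tf[term] = tf.get(term, 0) + 1
        # w = tf / |d| * idf, with idf = |D| / df
        weights.append({t: (f / len(doc)) * (D / df[t]) for t, f in tf.items()})
    return weights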
TF*IDF Short Form
• Give terms in a doc d high weights which are …
– frequent in d and
– infrequent in D
• IDF deals with the consequences of Zipf’s law
– The few very frequent (and unspecific) terms get low scores
– The many infrequent (and specific) terms get higher scores
Example TF*IDF
wij = (tfij / |di|) · idfj = (tfij / |di|) · (|D| / dfj)

sim(d, q) = Σ(vd[i] · vq[i]) / √(Σ vd[i]²)

        verkauf  haus  italien  gart  miet  blüh  woll
IDF        5     5/4    5/4     5/3    5     5     0
1 (tf)    1/3    1/3    1/3      –     –     –     –
2 (tf)     –     1/3     –      1/3   1/3    –     –
3 (tf)     –     1/4    3/4      –     –     –     –
4 (tf)     –      –     1/3     2/3    –     –     –
5 (tf)     –     1/4    1/4     1/4    –    1/4    –
Q          –      1      1       1     1     –     1

• sim(d1,q) = (5/4·1/3 + 5/4·1/3) / √0.3 ≈ 1.51
• sim(d2,q) = (5/4·1/3 + 5/3·1/3 + 5·1/3) / √0.3 ≈ 4.80
• sim(d3,q) = (5/4·1/4 + 5/4·3/4) / √0.63 ≈ 1.57
• sim(d4,q) = (5/4·1/3 + 5/3·2/3) / √0.56 ≈ 2.08
• sim(d5,q) = (5/4·1/4 + 5/4·1/4 + 5/3·1/4) / √0.25 ≈ 2.08

Rg  Q: Wir wollen ein Haus mit Garten in Italien mieten
1   d2: Häuser mit Gärten zu vermieten
2   d5: Der Garten in unserem italienschen Haus blüht
3   d4: Die italienschen Gärtner sind im Garten
4   d3: Häuser: In Italien, um Italien, um Italien herum
5   d1: Wir verkaufen Häuser in Italien
Further Improvements
• Scoring the query in the same way as the documents
– Note: Repeating words in a query then makes a difference
• Empirically estimated “best” scoring function [BYRN99]
– Different length normalization:

wij = tfij / maxk(tfkj)    (maximum over all terms k in dj)

– Use the log of a slightly different IDF measure (ni: number of docs containing ki); rare terms then no longer totally distort the score:

wij = (tfij / maxk(tfkj)) · log(|D| / ni)
Why not Simpler?
• Why not use simple Euclidean distance?
• Length of vectors would be much more important
• Since queries usually are very short, very short documents
would always win
• The cosine measure normalizes by the length of both vectors
Metaphor: Comparison to Clustering
• Clustering: Find groups of objects that are close to each
other and far apart from all others
• Rephrase the IR problem
– A query partitions D into two clusters: relevant R / not relevant N
– q should be at the heart of R: similarity
– Docs in R should be close to each other: large wij·qi values
– Docs in R should be far apart from other points: idf
– Goal of scoring: balance between intra- and inter-cluster similarity
Shortcomings
• We assumed that all terms are independent, i.e., that their
vectors are orthogonal
– Clearly wrong: terms are semantically close or not
• The appearance of “red” in a doc with “wine” doesn’t mean much
• But “wine” is a non-zero match for “drinks”
– Extension: Topic-based Vector Space Model
• No treatment of homonyms (as with most other methods)
– We would need to split words into their senses
– See word disambiguation (later)
• Order independent
– But order carries semantic meaning (object? subject?)
– Order important for disambiguation
A Note on Implementation
• Again: Assume we have an index with operation find: K → P(D)
• Assume we want to retrieve the top-r docs
• Core algorithm
– Look up all terms ki of the query
– Build the union of all documents which contain at least one ki
• Hold them in a list sorted by score (initialized with 0)
– Walk through terms ki in order of decreasing IDF-weights
• For each document dj: Add wji*IDFi to current score sj
• Go through docs in order of current score
– Early stop
• Look at the score gap between ranks r and r+1: Can the doc at rank r+1 still overtake the doc at rank r?
• Can be estimated from distribution of TF and IDF values
• Early stop might produce the wrong order of top-r docs, but not wrong
docs
– Why?
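A minimal sketch of the core loop (index layout and names are assumptions; the early-stop test is only indicated):

from collections import defaultdict

def top_r(query_terms, index, idf, r):
    # index: term -> {doc_id: tf_weight}; process terms by decreasing IDF
    scores = defaultdict(float)
    for k in sorted(query_terms, key=lambda t: idf.get(t, 0.0), reverse=True):
        for doc_id, w in index.get(k, {}).items():
            scores[doc_id] += w * idf.get(k, 0.0)
        # The early-stop test would go here: break once the score gap between
        # ranks r and r+1 exceeds the best possible remaining contribution.
    return sorted(scores.items(), key=lambda x: -x[1])[:r]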
A Different View on VSD
[Figure: the two-dimensional “Star” / “Diet” vector space from above, now searched for the nearest neighbors of the query]
• Query evaluation actually searches for the top-r nearest
neighbors
• This can be approached using techniques from
multidimensional indexing
– kd-trees, grid files, etc.
– Hope: No sequential scan of (all, many) docs, but directed search
according to distances
• But severe problem: High dimensionality
Content of this Lecture
• IR Models
• Boolean Model
• Vector Space Model
• Relevance Feedback in the VSM
• Probabilistic Model
• Latent Semantic Indexing
Interactive IR
• Recall: IR is the process of finding the right information
– This includes more than a single query
• Relevance feedback
– User poses query
– System answers
– User judges the relevance
of top-k returned docs
– System considers feedback
and generates new
(improved) answer
[Figure: the user sends a query to the IR system, which returns (ranked) results]
• User never needs to pose another query
• The new query is generated by the system
– Loop until satisfaction
Relevance Feedback
• User judges the current answers
• Basic assumptions
– Relevant docs are somewhat similar to each other – the common
core should be emphasized
– Irrelevant docs are different from relevant docs – the differences
should be deemphasized
• System adapts to feedback by
– Query expansion: Add new terms to the query
• From the relevant documents
– Term re-weighting: Assign new weights to terms
• From the relevant documents
Rationale
• Assume we know the set R of all relevant docs
• How would a good query for R look like?
vq = 1/|R| · Σd∈R d − 1/(|D|−|R|) · Σd∉R d
• Idea
– Weight terms high that are in R
– Weight terms low that are not in R, i.e., that are in N=D\R
• Remark
– It is not clear that this is the best of all queries
• Relevance feedback is a heuristic
Rocchio Algorithm
• Usually we do not know the real R
• Best bet: Let R (N) be the set docs marked as relevant
(irrelevant)
• Adapt query vector as follows
vqnew = α · v(q) + β · 1/|R| · Σd∈R d − γ · 1/|N| · Σd∈N d
• This implies query expansion and term re-weighting
– How to choose α, β, γ?
• Alternative
vqnew = α · v(q) + β · 1/|R| · Σd∈R d − γ · d*   with   d* = argmaxd∈N sim(q, d)
– Intuition: Non-relevant docs are heterogeneous and pull in every direction – better to take only the worst one
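A minimal numpy sketch of the first variant (the α, β, γ defaults are common textbook choices, not prescribed by the slide):

import numpy as np

def rocchio(q, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    # q: query vector; relevant/nonrelevant: lists of doc vectors marked by the user
    q_new = alpha * q
    if relevant:
        q_new += beta * np.mean(relevant, axis=0)
    if nonrelevant:
        q_new -= gamma * np.mean(nonrelevant, axis=0)
    return np.maximum(q_new, 0.0)  # negative weights are usually clipped to 0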
Example (simplified)
α=0.5, β=0.5, γ=0
d1 = “information science” = (0.2, 0.8)
d2 = “retrieval systems” = (0.9, 0.1)
q = “retrieval of information” = (0.7, 0.3)

If d1 is marked relevant: q′ = ½·q + ½·d1 = (0.45, 0.55)
If d2 is marked relevant: q″ = ½·q + ½·d2 = (0.80, 0.20)

[Figure: d1, d2, q, q′ and q″ plotted in the term space, x-axis “Retrieval”, y-axis “Information”. Quelle: A. Nürnberger: IR]
Results
• Advantages
– Improved results (many studies) compared to single queries
– Users need not generate new queries themselves
– Iterative process converging to the best possible answer
– Especially helpful for increasing recall (due to query expansion)
• Disadvantages
– Requires some work by the user
• Excite: Only 4% used relevance feedback (“more of this” button)
– Writing a new query based on returned results might be faster (and
easier and more successful) than classifying results
– Based on the assumption that relevant docs are similar
• What if user searches for all meanings of “jaguar”?
– The query becomes very long after just one iteration – makes retrieval slow
Collaborative Filtering
• What other information could we use for improving
retrieval performance?
• Collaborative filtering: Give the user what other, similar
users liked
– “Customers who bought this book also bought …”
– In IR: Find users posing similar queries and look at which documents they eventually looked at
• In e-Commerce, it is the buying in itself (very reliable)
• In IR, we need to approximate
– Documents a searcher clicked on (if recordable)
– Did the user look at the second page? (Low credit for first results)
– Did the user pose a “refinement query” next?
– …
– All are not very reliable
• Problems: What is a similar user? What is a similar query?
Thesaurus-based Query Expansion
[M07, CS276]
• Expand query with synonyms and hyponyms of each term
– feline → feline cat
– One may weight added terms less than original query terms
• Often used in scientific IR systems (Medline)
• Requires a high quality thesaurus
• General observation
– Increases recall
– May significantly decrease precision
• “interest rate” → “interest rate fascinate evaluate”
– Do synonyms really exist?
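A minimal sketch of such an expansion with down-weighted added terms (the tiny thesaurus is made up; real systems would use e.g. MeSH or WordNet):

thesaurus = {'feline': ['cat']}  # hypothetical entry

def expand(query_terms, weight=0.5):
    expanded = {t: 1.0 for t in query_terms}
    for t in query_terms:
        for syn in thesaurus.get(t, []):
            expanded.setdefault(syn, weight)  # added terms weighted less than originals
    return expanded

print(expand(['feline']))  # {'feline': 1.0, 'cat': 0.5}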