D4.4 Recommender Data Models - ViSTA-TV
ViSTA-TV: Video Stream Analytics for Viewers in the TV Industry
FP7 STREP ICT-296126 | co-funded by the European Commission
ICT-2011-SME-DCL | SME Initiative on Digital Content and Languages

D4.4 Recommender Data Models
Thomas Scharrenbach (UZH), Yamuna Krishnamurthy (TUDo), Christian Bockermann (TUDo)

Project start date:   June 1st, 2012        Project duration: 24 months
Document identifier:  ViSTA-TV/2012/D4.4    Version:          v1.0
Date due:             30 November 2012      Status:           Final
Submission date:      30 November 2012      Distribution:     PU
www.vista-tv.eu

This document is part of a collaborative research project funded by the FP7 ICT Programme of the Commission of the European Communities, grant number 296126. The following partners are involved in the project:

University of Zurich (UZH) - Coordinator
Dynamic and Distributed Information Systems Group (DDIS)
Binzmühlstrasse 14, 8050 Zürich, Switzerland
Contact person: Abraham Bernstein, E-mail: [email protected]

Technische Universität Dortmund (TUDo)
Computer Science VIII: Artificial Intelligence Unit
D-44221 Dortmund, Germany
Contact person: Katharina Morik, E-mail: [email protected]

Rapid-I GmbH (RAPID-I)
Stockumer Strasse 475, 44227 Dortmund, Germany
Contact person: Ingo Mierswa, E-mail: [email protected]

Zattoo Europa AG (Zattoo)
Eggbühlstrasse 28, CH-8050 Zürich, Switzerland
Contact person: Bea Knecht, E-mail: [email protected]

Vrije Universiteit Amsterdam (VUA)
Web & Media Group, Department of Computer Science, Faculty of Sciences (FEW)
De Boelelaan 1081a, NL-1081 HV Amsterdam, The Netherlands
Contact person: Guus Schreiber, E-mail: [email protected]

The British Broadcasting Corporation (BBC)
56 Wood Lane / Centre House - Broadcasting House
UK-W12 7SB Northampton, United Kingdom
Contact person: Chris Newell, E-mail: [email protected]

Copyright © 2012 The ViSTA-TV Consortium

Executive Overview

This document reports on the achievements made for Deliverable 4.4, "Recommender Data Models", for the EU FP7
project ViSTA-TV. We give an overview of how the data models for the recommendations engine are defined, determined, and applied. The data models are presented along with the components of the recommendations engine to which they belong. Recommendations are computed in the recommendations rules engine, which consumes the output of the complex event processing engine and the show-user-model engine (both part of the recommendations engine) as well as of the data enrichment engine (described in document ViSTA-TV/2012/D2.1). We give definitions of the actual processes by which the show-user-model is determined, how the complex event processing engine works, and how recommendations are computed. For an overview of how these components will be implemented we refer to document ViSTA-TV/2012/D4.1.

Contents

1 Introduction
  1.1 Outline
2 The ViSTA-TV Project Setup
  2.1 Scope of this document
  2.2 ViSTA-TV Complex Event Processing and Temporal Facts
  2.3 ViSTA-TV Recommendation Rules
  2.4 ViSTA-TV Show-User-Model
3 Complex Events and Temporal Facts
  3.1 Feature Events
  3.2 Show-User-Model Events and Temporal Facts
4 The ViSTA-TV Show-User-Model
  4.1 Frequent Item Set Mining
  4.2 Subspace Clustering
  4.3 Data format
    4.3.1 Input
    4.3.2 Output
5 The ViSTA-TV Recommendation Rules
6 Conclusion

1 Introduction

This document reports on the achievements made for Deliverable 4.4, "Recommender Data Models", for the EU FP7 project ViSTA-TV. These recommender data models are defined within the context of the recommendations engine of the ViSTA-TV Stream Processing Engine. For information about the former we refer to document ViSTA-TV/2012/D4.1; for an overview of the latter please refer to ViSTA-TV/2012/D3.1. In this document we give an overview of how the data models for the recommendations engine are defined, determined, and applied. The data models are presented along with the components of the recommendations engine to which they belong. Recommendations are computed in the recommendations rules engine, which consumes the output of the complex event processing engine and the show-user-model engine (both part of the above-mentioned recommendations engine) as well as of the data enrichment engine (described in document ViSTA-TV/2012/D2.1). We give definitions of the actual processes by which the show-user-model is determined, how the complex event processing engine works, and how recommendations are computed.

1.1 Outline

This document is structured as follows: In Section 2 we give an overview of the ViSTA-TV project as given in the Description of Work (DOW). In the subsequent sections we give precise definitions of the data formats of the CEP engine (Section 3), the SUM engine (Section 4), and the RERULES engine (Section 5). We conclude with a summary in Section 6.

2 The ViSTA-TV Project Setup

Please note that much of the content of this section was re-used from the Description of Work document (DOW) to make this document self-contained.
In the past, video content was consumed through broadcast TV delivered over the air or by cable. Broadcasters had information—both meta-data and the actual video stream—about their own programming as well as coarse usage information supplied by survey companies such as Nielsen or GfK.1 Cable providers viewed themselves as pure bundling and transportation channels, whilst viewers used secondary sources to inform themselves about the desired programming. Fueled by Internet TV (IPTV) offers of telecommunication giants, cable providers, and pure online providers such as YouTube or Hulu, as well as digital rental companies such as NetFlix and iTunes, people increasingly watch video content delivered over IP networks. The technological issues arising from IPTV as well as the opportunities for interactive, personalized TV have been amply investigated in EU projects such as NoTube. However, the use of behavioral information accruing during the process of viewing live IPTV, in combination with the actual video streams and the electronic program guide (EPG) information, as a foundation for a market research data marketplace has been largely overlooked.

Figure 1: The ViSTA-TV universe (logos are owned by respective companies).

The ViSTA-TV project proposes to view IPTV as a two-way channel, where the viewer can take advantage of the video streams whilst the ViSTA-TV platform employs behavioral information about the viewer, gathered by IPTV transmitters, to improve the experience for all participants in the TV supply chain (see Figure 1).
Specifically, ViSTA-TV proposes to gather consumers' viewing behavior and the actual video streams from broadcasters and IPTV providers and to combine them with enhanced EPG information as the input for a holistic live-stream analysis. The analysis—comprising personalized feature construction, supplementary tag generation, and real-time recommendation generation—will provide the input for a triplication procedure that generates data in the Resource Description Framework (RDF)—a data format for storing typed graphs1—and enables the reuse of the resulting data for three different data pools:

1. Linked Open Data as a basis for analytic processing,
2. Viewing Behavior Data, and
3. Recommendation Data.

Each of these data pools with its production pipeline will provide the foundation for a data marketplace with its own commercial as well as societal rationale, such as bootstrapping innovative applications relying on TV data, selling services relying on detailed viewing-behavior analysis beyond today's capabilities, or helping viewers find the shows that best match their interests at any given point in time.

The heart of the ViSTA-TV platform will be an online analysis system that operates on streams of content data in real time. This system will be developed on the basis of the STORM stream processing framework.2 We will extend existing methods for feature extraction from content streams, as well as methods for querying and matching events on streams of background information, to make them stream-ready. Last but not least, we will develop methods for real-time recommendation based on complex event processing and hybrid user/show profiles. For learning and evaluating the corresponding data models we will rely upon existing offline machine learning methods.
2.1 Scope of this document

This document describes the data models for the first version of the Recommendations Engine of the ViSTA-TV stream-processing engine, the former being described in document ViSTA-TV/2012/D4.1, the latter in document ViSTA-TV/2012/D3.1. The ViSTA-TV stream-processing engine comprises the following components:

Data Enrichment Engine (DE)
Complex Event Processing Engine (CEP)
Show-User-Model Engine (SUM)
Recommendations Rules Engine (RERULES)

The DE, CEP and SUM engines provide input for the RERULES engine, which in turn provides input for the end-user apps and the ViSTA-TV data warehouse. In its current version the data models comprise a real-valued normalized feature vector, with each feature assigned a positive real number referred to as its weight. The features will be multiplied with their weights and a recommendation given if the weighted sum exceeds a pre-defined threshold. Note that this document does not provide any information about the actual implementation and data formats of the input and the output; these are covered in document ViSTA-TV/2012/D4.1.

1 http://nielsen.com, http://gfk.com
2 http://storm-project.net/

2.2 ViSTA-TV Complex Event Processing and Temporal Facts

The data enrichment process will generate event-streams of feature-value changes that describe shows or simple usage events such as "Jon Doe" watches Sports Today. Apart from their direct usage, these event-streams provide the basis for constructing more complex events as aggregated/constructed features (e.g., the complex event after-work-TV-night could be characterized as watching the sequence "news", "weather", "crime", "show"). The Complex Event Processing (CEP) engine performs the detection of such complex events. Furthermore, the CEP engine will also be able to construct temporal facts, i.e.
things that are known to be true for a certain amount of time—in contrast to events that take place at a certain point in time and are over in the next instant. The difference between events and temporal facts is best illustrated by an example: the event that a thousand football fans switch to the match on channel no. 5 may trigger the temporal fact that a lot of people interested in football are watching the match to become true. If a comparable number of users switch channels again (or even turn off their IPTV consuming device) and only a handful of football fans remain watching the match, this may trigger the above temporal fact to end. In the ViSTA-TV Stream Processing engine, temporal facts and complex events serve as an input for the recommendations rules engine.

To construct aggregated tags, ViSTA-TV will develop a mechanism that efficiently detects complex event patterns on tag (or RDF-triple) streams. State-of-the-art event processing frameworks such as SASE+ [9] or TelegraphCQ [7] are defined on a relational data model. The relational model, however, makes it difficult to incorporate background knowledge efficiently. Linked Data was designed to overcome this limitation of the relational model. EP-SPARQL [2] was recently presented for defining and querying complex events on streams of Linked Data or, more specifically, RDF. Yet, EP-SPARQL does not make assumptions on real-time processing or complex time constraints as automata-based approaches like SASE+ do. C-SPARQL [5] is an extension of the relational stream querying language CSQL [3]. Both approaches process triples/tuples within a fixed window, which can be based on time intervals, the number of data triples, or partitions based on aggregates. C-SPARQL considers the triples in the window as a static knowledge base. An extension making this approach more dynamic was proposed recently [15].
However, complex events whose components may go beyond the window limits (such as long-term TV usage behavior) are not yet taken into account. Real-time complex event matching requires efficient and effective methods for pruning the current matching hypotheses. Current approaches prune by implementing memory eviction schemes such as first-in-first-out (essentially a time window) or least-recently-used (i.e., keeping only active patterns, not the most useful ones). These schemes essentially introduce a time window through the backdoor. ViSTA-TV, in contrast, will employ the SASE+ automata-based model and the EP-SPARQL querying capabilities, but will develop methods for estimating the probability that an automaton (representing a complex event pattern) will match. Hence, this novel approach will base its eviction decisions on inherent data statistics, which is expected to minimize the number of false positives and false negatives. Consequently, ViSTA-TV will provide novel methods for matching time-window-agnostic complex events on streams of Linked Data in real time, which will be used for feature/tag construction.

In the present version, however, the focus lies on integrating the components of the recommendations engine into the overall framework of the ViSTA-TV Stream Processing Engine. Therefore, the actual implementation does not yet extend SPARQL or implement a complex event processor. To meet the goal of this deliverable, the components of the CEP engine are implemented as dummies.

2.3 ViSTA-TV Recommendation Rules

Recommender systems have long been in the focus of e-commerce; the famous NetFlix challenge for movie recommendation is a prominent example. The techniques for recommender systems are generally categorized as content-based, collaborative filtering, or hybrid methods. Content-based methods focus on recommending objects that are similar to the ones a user has previously chosen [19].
Methods based on collaborative filtering identify recommendable items based on user similarity [12, 10, 18]. Hybrid approaches combine content-based and collaborative methods and have been reported to raise the accuracy of recommendations [17]. Another distinction made in the recommender system literature is memory- versus model-based approaches. Memory-based approaches work directly on the user/content data and produce recommendations based on the k-nearest neighbors. Hence, every new user or item needs to be compared to all existing users/items, a computational cost that is unacceptable for our setting. More recently, model-based approaches, which try to model the relation between users and content mathematically, have been shown to not only speed up the recommendation process but may even improve accuracy (e.g., matrix factorization techniques have outperformed neighborhood-based methods in the NetFlix challenge [13]). Furthermore, tensor and matrix factorization techniques, which have also recently gained interest in social network analysis [1, 4, 16], support this trend. Unfortunately, however, these model-based techniques are still computationally intensive and therefore do not fit the scenario of online recommendation of shows and movies as intended by ViSTA-TV.

ViSTA-TV will ensure real-time recommendations by addressing the computational complexity of current approaches whilst improving recommendation quality through personalized models: While some systems work on simple events, none of the state-of-the-art recommender systems operates on complex events. ViSTA-TV will make it possible to use complex events for the actual recommendations. This has not been investigated so far, in particular not for real-time processing. In addition, current recommendation approaches work with handcrafted and manually selected features to describe items. In contrast, ViSTA-TV will integrate dynamic feature extraction and selection within the recommendation process.
The current implementation of the recommendation rules will be a linear combination of the features that are fed into this engine. A recommendation will then be given according to the outcome of applying the linear combination within a classification system such as a perceptron or a decision tree. The dummy engine will output recommendations via an XMPP server. Yet, the STORM framework provides the necessary flexibility to add further output mechanisms as required. It is clear that methods like nearest neighbor (where all computations take place in the model application phase and none in the model training phase) are not suitable for real-time recommendations. ViSTA-TV therefore distinguishes offline training of recommendation models and real-time application of the models. The application will be done by compiling the models into sets of recommendation rules, where each rule is a linear combination of the features fed into this engine. A recommendation will then be given if the linear combination exceeds a threshold, is the highest recommendation among all for a user, and, in particular, is significantly higher than the recommendation value for the show the user currently watches.

2.4 ViSTA-TV Show-User-Model

To address the computational effort of current approaches we intend to take a two-pronged approach. First, TUDo and UZH will not apply the recommendation algorithms to the actual video/audio streams but rather to the feature-event streams extracted in WP2 (see above). This will massively reduce the dimensionality of the streams. Second, we will use a personalized heterogeneous clustering approach developed by TUDo that exploits frequent term set clustering ([20, 11]). To speed it up, TUDo will combine this approach with online item set mining [8].
The combination will result in a novel approach for reducing the computational complexity of recommendation systems, which has the added benefit of providing improved recommendations due to its underlying personalized clustering.

Collaborative filtering suffers from the so-called large-scale cold-start problem: new items are not recommended because not enough users have rated them, and they do not get ratings because they are not recommended. Both hybrid and model-based approaches have been shown to reduce the cold-start problem [22]; still, recommender systems have mostly been used where new content is the exception and not the rule. In the broadcast TV world of ViSTA-TV, many shows are first-time transmissions. Hence, TUDo and UZH will explore a novel hybrid two-way clustering approach: based on available content descriptions (WP1) and external data (WP6) (e.g. this movie is a TV premiere, but it is already rated as a cinema movie) on one side, as well as user viewing information on the other side, clusters of shows and users are discovered in a mutually reinforcing process. Once these clusters are available, the recommendation can be viewed as a prediction task [6] using techniques such as Bayesian classifiers, decision trees or artificial neural networks. Since the recommendations are now based not only on single shows but also on show clusters, new shows can simply be assigned to a cluster based on their content characteristics and then be recommended along with other similar shows, overcoming the cold-start problem. We refer to this two-way hierarchical clustering approach as determining the show-user-model (SUM). While the clustering clearly is an offline task, the application of its learned model will be an integral part of the ViSTA-TV Stream Processing Engine. Well before a show is broadcast, the SUM engine will query the data warehouse for the shows that will actually be broadcast.
This information, along with its application to the show-user-model, will serve as an input to the CEP engine as well as the RERULES engine. Note that feeding these engines with the data can be performed some time before the actual broadcasting takes place. The SUM engine can hence be seen as an initializer of the CEP and the RERULES engines, allowing them to prepare and optimize their processing for upcoming shows and the stream of feature events before they take place. Note that the SUM engine currently does not receive input from any other stream-processing component in the ViSTA-TV stream-processing engine. Fortunately, the STORM framework underlying the ViSTA-TV stream-processing engine provides means for easily adding input from other stream-processing components, e.g. the data enrichment engine, should that become a necessity.

3 Complex Events and Temporal Facts

In ViSTA-TV, events are expressed in the Resource Description Framework (RDF). The data model underlying RDF is a directed graph with typed edges. The nodes of the graph are referred to as resources and values, whereas the edges are referred to as predicates (aka properties). Hence a statement in RDF is a triple of the form ⟨subject, predicate, object⟩, where the predicate must be an IRI, the subject may also be an anonymous identifier (aka blank node), and the object may also be a data value (aka literal).

In ViSTA-TV, RDF triples additionally carry time information. We distinguish between events and facts. Events happen at a certain point in time. Events that are represented by a single RDF triple are called atomic events, in contrast to complex events that comprise a set of atomic events. Both atomic and complex events are therefore asserted with two timestamps: ts is the earliest time at which an atomic event in the set happened and te the latest time at which an atomic event in the set happened.
Only at time te do we know that an event E definitely took place. Note that for atomic events we always have ts = te, whereas for complex events we can only guarantee ts ≤ te. In contrast, temporal facts F are facts for which we know the interval [ts, te] during which they are considered valid. Note that the fact as such is always considered true but not necessarily valid at every time instant. Events may trigger a fact to become valid, and the validity state of a fact may trigger events to happen.

3.1 Feature Events

The CEP engine will—besides other input—consume events on features from the ViSTA-TV data enrichment engine. The information provided comprises a 4-tuple ⟨t, id, key, value⟩ where t is the time at which the change of the feature happened, id is the id of the context (e.g. channel, show, user) related to the feature, key is the identifier of the feature, and value is the new value of the feature. The CEP engine considers these 4-tuples as atomic events where ts = te = t, subject = id, predicate = key, and object = value.

3.2 Show-User-Model Events and Temporal Facts

The SUM engine will push information about which users are potentially interested in the shows that are about to be broadcast to the CEP engine. This information will be presented as feature change events, as in the case of the data enrichment engine (see above). Since we know the duration of a show beforehand, we may pre-compute temporal facts from this information in the CEP engine. Yet, the computation of these facts is done by the CEP engine and not by the SUM engine. Note that SUM features that are solely based on EPG data and past behavior are not time-critical and may be pushed to the recommendations rules engine some time before the show is broadcast; EPG data are available weeks or even months before transmission. We plan, for example, to provide the ViSTA-TV end-user apps with recommendations that are solely based on these SUM features.
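The event and temporal-fact model of this section can be sketched in a few lines of Python. The class and field names below are our own illustration and not the ViSTA-TV implementation; only the structure (timestamped triples, ts/te semantics, and the 4-tuple mapping of Section 3.1) follows the text:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AtomicEvent:
    """A single timestamped RDF triple; for atomic events ts == te."""
    subject: str
    predicate: str
    object: str
    t: int  # time of the feature change

    @property
    def ts(self): return self.t
    @property
    def te(self): return self.t

@dataclass(frozen=True)
class ComplexEvent:
    """A set of atomic events; ts/te span the earliest/latest member."""
    events: frozenset

    @property
    def ts(self): return min(e.ts for e in self.events)
    @property
    def te(self): return max(e.te for e in self.events)

@dataclass(frozen=True)
class TemporalFact:
    """A triple considered valid during the interval [ts, te]."""
    subject: str
    predicate: str
    object: str
    ts: int
    te: int

    def valid_at(self, t):
        return self.ts <= t <= self.te

def from_feature_event(t, id_, key, value):
    """Map a feature event 4-tuple (t, id, key, value) onto an atomic
    event: ts = te = t, subject = id, predicate = key, object = value."""
    return AtomicEvent(subject=id_, predicate=key, object=value, t=t)
```

The `from_feature_event` helper mirrors the mapping stated in Section 3.1 for the CEP engine's input; a complex event built from several atomic events only guarantees ts ≤ te, as described above.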
Assume, for example, that we know that a certain cluster of users is likely to watch a show that is to be broadcast in the evening. Applying the SUM engine, we can recommend that very show to the corresponding users even the day before (given that the EPG data is available then). While the goal of ViSTA-TV is to provide real-time recommendations, the pre-computed recommendations, first, give us default recommendations: the recommendations engine is not required to provide a real-time recommendation if, for example, there is currently no show to recommend with respect to the real-time data. Second, they allow us to compare whether users actually follow the real-time recommendations or the pre-computed ones. The pre-computed recommendations are hence an important component for evaluating the quality and the need of real-time recommendations.

4 The ViSTA-TV Show-User-Model

The TUDo SUM engine provides user→show-group and show→user-group mappings. These mappings provide insight into the interest relationship between users and shows: we know what kind of shows a user is interested in and which users are interested in a specific show. The knowledge of this two-way relationship can be used by the recommendation system to predict the shows that a user will be interested in, or the users that need to be targeted for a particular show. The SUM engine provides these mappings as input to the CEP and the RERULES engines so they can prepare and optimize their processing. The show groups and user groups are clusters that are determined offline. TUDo is looking into the following two approaches for the offline clustering, described in Section 4.1 and Section 4.2, respectively. The two approaches have provided promising results from which useful features and groups can be determined, but further analysis is necessary for a more comprehensive assessment of both approaches.
Frequent Item Set Mining
Subspace Clustering

The SUM engine uses the user groups and show groups that have been determined offline to associate a user or show with the groups online. The engine can be triggered to do this mapping by some event, for example the system time indicating the beginning of a new show, or the detection of an event such as a user logging in, which it receives from the data enrichment engine. It can then poll the data warehouse for existing user→show-group or show→user-group mappings discovered by either frequent item set mining or subspace clustering and provide this information to the CEP engine. The SUM engine will be implemented using the TUDo streams framework described in ViSTA-TV/2012/D2.1 and will provide its output through a STORM Spout.

4.1 Frequent Item Set Mining

Some initial experiments were conducted to identify frequent item sets in order to discover interesting groups of items, where the items could be channels viewed, genres, and programs. The experiment was done with both BBC and Zattoo data for channels and genres. The FP-Growth operator from RapidMiner was used to extract the frequent item sets, after some initial pre-processing of the input data, as shown in Figure 2.

Figure 2: RapidMiner operators for frequent item set mining

The frequent item set results for BBC channels and genres are shown in Table 1 and Table 2, respectively. The support of an item set is the proportion of transactions in the data set that contain the item set. We see from the tables that some frequent item sets have a very low support value, and we can choose to ignore them. But it is still important to study all the item sets discovered, as the more unusual ones could have a low support. Table 1 shows that bbcone and cbeebies are popular channels. It also indicates that bbcone and bbcnews are popularly watched together; that is, people watching bbcone also watch bbcnews.
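As a small illustration of the support measure used in these experiments, the following sketch computes support values over a handful of invented viewing transactions (the channel names follow Table 1, but the transactions and resulting numbers are made up; FP-Growth itself is not reproduced here, this is plain counting):

```python
from itertools import combinations
from collections import Counter

# Each transaction is the set of channels one user watched in a session
# (invented example data; the real input comes from the BBC user logs).
transactions = [
    {"bbcone"}, {"bbcone", "bbcnews"}, {"cbeebies"},
    {"bbcone"}, {"bbcone", "bbctwo"}, {"bbcnews"},
]

def support(itemset, transactions):
    """Proportion of transactions that contain all items of the item set."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

# Enumerate every item set of size 1 and 2 that actually occurs and
# count its transactions, analogous to the Support column of Table 1.
counts = Counter()
for t in transactions:
    for k in (1, 2):
        for combo in combinations(sorted(t), k):
            counts[frozenset(combo)] += 1

for itemset, c in counts.most_common():
    print(sorted(itemset), c / len(transactions))
```

An itemset miner such as FP-Growth avoids this exhaustive enumeration by pruning item sets below a minimum support, which is what makes it feasible on the full BBC and Zattoo logs.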
Channels                      Support
bbcone                        0.67553
cbeebies                      0.11074
bbcnews                       0.10055
bbcthree                      0.05847
cbbc                          0.04917
bbctwo                        0.03754
bbcone, bbcnews               0.01041
bbcone, bbctwo                0.0088
bbcfour                       0.00842
bbcone, cbeebies              0.00437
bbcone, bbcthree              0.00377
cbeebies, cbbc                0.00293
bbcone, cbbc                  0.00227
bbcnews, bbctwo               0.00216
bbcone, bbcfour               0.00177
cbeebies, bbctwo              0.0015
bbcthree, bbctwo              0.00133
bbctwo, bbcfour               0.00116
cbbc, bbctwo                  0.00105
bbcnews, bbcfour              0.00072
bbcone, bbcthree, bbctwo      0.00066
bbcnews, bbcthree             0.00066
bbcone, bbcnews, bbctwo       0.00061
bbcthree, bbcfour             0.00061
cbeebies, bbcnews             0.00055
cbeebies, cbbc, bbctwo        0.00044
bbcone, bbctwo, bbcfour       0.00039
bbcone, bbcthree, bbcfour     0.00033
bbcone, cbbc, bbctwo          0.00028
bbcone, cbeebies, cbbc        0.00028

Table 1: Frequent item sets of BBC channels

Similarly, we see from Table 2 that news and childrens entertainment and comedy are popular genres. We also see that some genres are popular together among viewers, such as news and drama, which is rather intuitive since these two genres are hugely popular in the average age group of 35-40. We are hoping to discover more unusual groups with further frequent item set mining after filtering out the more obvious item sets. We also generated frequent item sets for Zattoo shows and genres, which are shown in Table 3 and Figure 3, respectively. For the shows we noticed that those that were aired consecutively were usually seen together.

Figure 3: Frequent item sets of Zattoo genres

The frequent item sets data provides insight into the channels and genres that are most popular and shows which channels and genres are viewed together. This information is useful for generating recommendations. The above is a preliminary analysis. We will continue with further analysis of these item sets and also work on generating more interesting and unusual ones.
D4.4 Recommender Data Models

Genres                                                                  Support
news                                                                    0.683007102
childrens_entertainmentandcomedy                                        0.112766326
factual_familiesandrelationships                                        0.076043651
drama                                                                   0.068941625
childrens_drama                                                         0.054131301
factual                                                                 0.045643513
childrens_entertainmentandcomedy, childrens_drama                       0.036462844
childrens                                                               0.032652001
childrens_entertainmentandcomedy, childrens                             0.029014377
childrens_factual                                                       0.018101507
weather                                                                 0.014204053
factual_artscultureandthemedia                                          0.013078122
drama, weather                                                          0.012385242
comedy_sitcoms                                                          0.01021999
childrens_drama, childrens_factual                                      0.009613719
childrens_entertainmentandcomedy, childrens_factual                     0.008920838
drama, factual                                                          0.006842196
news, drama                                                             0.005976096
news, childrens_entertainmentandcomedy                                  0.005716265
childrens_entertainmentandcomedy, childrens_drama, childrens_factual    0.005543045
news, factual                                                           0.003984064
childrens_drama, childrens                                              0.003551013
childrens_entertainmentandcomedy, childrens_drama, childrens            0.003551013
learning_preschool                                                      0.002771523
factual_familiesandrelationships, drama                                 0.002771523
factual_familiesandrelationships, factual                               0.002425082
drama, factual_artscultureandthemedia                                   0.002251862
childrens_entertainmentandcomedy, learning_preschool                    0.002165252
news, weather                                                           0.002165252
news, childrens_drama                                                   0.001992032
childrens, learning_preschool                                           0.001818812
factual, factual_artscultureandthemedia                                 0.001818812
news, childrens_factual                                                 0.001818812
news, drama, weather                                                    0.001818812
factual_familiesandrelationships, comedy_sitcoms                        0.001645592
childrens_entertainmentandcomedy, childrens, learning_preschool         0.001558981
news, factual_artscultureandthemedia                                    0.001558981
drama, comedy_sitcoms                                                   0.001472371
news, factual_familiesandrelationships                                  0.001472371
childrens_factual, learning_preschool                                   0.001299151
childrens_entertainmentandcomedy, childrens_factual, learning_preschool 0.001299151
news, childrens_entertainmentandcomedy, childrens_drama                 0.001299151
childrens, childrens_factual                                            0.001125931

Table 2: Frequent item sets of BBC genres

Support  Shows
0.370    {Tagesschau, Wetter, Börse im Ersten}
0.328    {heute, Börse im Ersten}
0.319    {Tagesschau, Tatort}
0.299    {Tagesschau, ZDF-History}
0.284    {heute, ZDF-History}
0.283    {Tagesschau, heute-show}
0.278    {heute, Tatort}
0.275    {Tagesschau, hallo deutschland}
0.273    {Tagesschau, Markus Lanz}
0.256    {Tagesschau, Glücksspirale}
0.255    {Tagesschau, Günther Jauch}
0.251    {Tagesschau, Tatort: Hinkebein}
0.251    {Tagesschau, Brisant}
0.218    {ZDF-History, heute 100 sec}
0.211    {Wetter, heute, Günther Jauch}

Table 3: Frequent item sets of Zattoo shows

4.2 Subspace Clustering

As part of the subspace clustering analysis we performed data-driven discretization. Most of the earlier experiments used user-driven discretization based on attributes such as age group and gender. Data-driven discretization provides additional insights into viewership behavior: how long most people watch a channel for, which programs are more popular, and which programs trigger the most channel switching. It discretizes the data based on the frequency of the items. The two discretization methods used for the analysis were Neighborhood Discretization and MAFIA Discretization. Neighborhood Discretization forms intervals by examining differences in frequency in the neighborhood of the examples. MAFIA Discretization is based on the MAFIA [21] subspace clustering algorithm, which uses an adaptive grid size for clustering and merges the resulting intervals to explore clusters in higher dimensions. Our experiments show that the results of the two methods are comparable. We used the Neighborhood and MAFIA discretization implementations that are part of the Subspace Clustering Extension for RapidMiner [?].

Channel Duration Results

The channel duration attribute was discretized on the user-log dataset from the BBC. Figure 4a and Figure 4b show that most viewers watch for a duration of 1-180 minutes.
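The deliverable relies on the RapidMiner implementations for this step. Purely as an illustration, a frequency-based, neighborhood-style discretization can be sketched as follows; the bin width, the jump threshold and the cut heuristic are our assumptions, not the algorithm from the Subspace Clustering Extension:

```python
from collections import Counter

def neighborhood_discretize(values, bin_width=5, jump=2.0):
    """Rough sketch of frequency-based discretization: place a cut
    wherever the frequency of neighboring bins changes sharply."""
    # Histogram of the binned values.
    bins = Counter(v // bin_width for v in values)
    ordered = sorted(bins)
    cuts = [ordered[0] * bin_width]
    for prev, cur in zip(ordered, ordered[1:]):
        f_prev, f_cur = bins[prev], bins[cur]
        ratio = max(f_prev, f_cur) / max(1, min(f_prev, f_cur))
        # A sharp change in frequency (or a gap) starts a new interval.
        if ratio >= jump or cur - prev > 1:
            cuts.append(cur * bin_width)
    cuts.append((ordered[-1] + 1) * bin_width)
    return cuts  # interval boundaries, here in minutes

# Example: channel durations in minutes, mostly short views.
durations = [2, 3, 3, 4, 5, 5, 6, 30, 31, 32, 90, 91, 170]
print(neighborhood_discretize(durations))
```

The returned boundaries correspond to the duration ranges plotted per-viewer-count in Figures 4 and 5.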
Figure 4: Discretized channel duration
(a) Count of viewers for each range of channel duration, discretized using neighborhood discretization
(b) Count of viewers for each range of channel duration, discretized using MAFIA discretization

We then filtered the dataset for all entries with a channel duration of 1-180 minutes and discretized the channel duration of these entries only. From Figure 5a and Figure 5b we see that there are more viewers actually watching programs than switching channels.

Figure 5: Discretized channel duration (between 1-180 mins)
(a) Count of viewers for each range of channel duration (between 1-180 mins), discretized using neighborhood discretization
(b) Count of viewers for each range of channel duration (between 1-180 mins), discretized using MAFIA discretization

Program Duration Results

We did further analysis to see which programs lead to channel switching. For this we filtered the dataset for all entries with a channel duration of 1-5 minutes and counted the resulting entries grouped by the program that was watched during that duration. Table 4 and Table 5 show the programs that were watched for 5 minutes or less. We can conclude that the programs with the higher counts are the ones that triggered a switch.
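The filter-and-count step just described can be sketched in a few lines. The log entries below are hypothetical; the real BBC user-log schema differs, and this only illustrates the grouping:

```python
from collections import Counter

# Hypothetical user-log entries: (program_title, channel_duration_minutes).
log = [
    ("Breakfast", 2), ("Breakfast", 4), ("BBC News", 3),
    ("Pingu", 28), ("Breakfast", 1), ("Waybuloo", 5),
]

# Keep only short views (1-5 minutes), then count views per program.
switch_counts = Counter(title for title, mins in log if 1 <= mins <= 5)

# Programs with the highest counts are the likely switch triggers.
for title, count in switch_counts.most_common():
    print(title, count)
```

Sorting the counts in descending order yields exactly the kind of ranking shown in Table 4.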
CHANNEL   PROGRAM ID  PROGRAM TITLE                        COUNT
bbcone    b01cr8wm    Breakfast                            1954
bbcthree  b01cpqbr    Sun, Sex and Suspicious Parents      501
bbcone    b01cx90j    Racing for Time                      415
bbcnews   b01cpfng    Breakfast (BBC News Channel)         332
bbcone    b01177tb    Homes Under the Hammer               332
bbcone    b01d0f6g    A Picture of Health                  271
bbctwo    b008s9l9    Storyville                           241
bbcnews   b01cpfnn    BBC News                             226
bbcnews   b01cpfnx    BBC News                             220
bbcnews   b01cpfns    BBC News                             201
bbcone    b01cwwd8    Cowboy Trap                          177
bbcone    b01d09q6    Doorstep Crime 999                   154
cbeebies  b00rq2yv    Something Special                    134
bbcnews   b01cpfnj    BBC News                             130
cbeebies  b0121qyp    Same Smile                           127
cbeebies  b017ldgw    Mister Maker Comes to Town           119
cbeebies  b01cyvmz    Mr Bloom's Nursery                   108
cbeebies  b00nk1bz    Waybuloo                             100

Table 4: Programs that have a high count of being watched for less than 5 mins

CHANNEL   PROGRAM ID  PROGRAM TITLE                        COUNT
bbctwo    b00tdypd    Chuggington: Badge Quest             5
bbctwo    b00xc5yy    Dipdap                               5
cbbc      b00sllbn    Shaun the Sheep                      5
bbctwo    b00jhhz0    Louie                                4
bbctwo    b00qzkwn    Garth and Bev                        4
bbctwo    b01cvm38    Newsround                            4
bbctwo    b00qhvyk    Alphablocks                          3
bbctwo    b00qtrrg    Zigby                                3
bbctwo    b00t5z0f    Harry and Toto's World of Opposites  3
bbctwo    b0078j82    Little Robots                        2
bbctwo    b00j4hwy    Poetry Pie                           1

Table 5: Programs that have a low count of being watched for less than 5 mins

We then analyzed one of the programs with the highest count, "Breakfast", to see how the program duration was distributed across all viewers. From Figure 6a and Figure 6b we see that most viewers switched channels during this program. The Breakfast show is a three-hour program with segments of news and weather; hence people switch channels quite often during this program, possibly during either the weather or the news segments. We then analyzed a program with a lower count, "Pingu", to see how the program duration was distributed across all viewers.
From Figure 7a and Figure 7b we see that most viewers watched the complete program or a significant portion of it, indicating that this program is more popular and does not trigger as much channel switching.

Figure 6: Discretization of the program duration for the Breakfast show (http://www.bbc.co.uk/programmes/b01cr8wm.xml)
(a) Count of viewers for each range of program duration, discretized using neighborhood discretization
(b) Count of viewers for each range of program duration, discretized using MAFIA discretization

Figure 7: Discretization of the program duration for the Pingu show (http://www.bbc.co.uk/programmes/b007809m.xml)
(a) Count of viewers for each range of program duration, discretized using neighborhood discretization
(b) Count of viewers for each range of program duration, discretized using MAFIA discretization

The above analysis indicates that subspace clustering and data-driven discretization can provide very insightful information. We have seen that longer-running programs see more switching activity. This also depends on the type of program: for example, news shows are more prone to switches than, say, movies. Further in-depth analysis of this viewership behavior is needed. Some additional results can be found on the TUDo ViSTA-TV website [14].

4.3 Data format

4.3.1 Input

The SUM engine gets its data from the DEE engine and the data warehouse. The input from the DEE engine is provided in the form of triples <time, key, value> where time > 0 and key ∈ Σ_DEE. The vocabulary Σ_DEE is a finite set of strings and comprises the dictionary of the feature names generated by the DEE engine. The SUM engine can also receive inputs from other sources such as time and event triggers.

4.3.2 Output

The SUM engine will provide its output through a STORM Spout.
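The triple format of Section 4.3.1 (also used, with a timestamp, for the output) could be modelled as a small record type. This is a minimal sketch; the field names, the sample vocabulary Σ_DEE and the validation helper are our assumptions, not the ViSTA-TV wire format:

```python
from typing import NamedTuple, Union

class Triple(NamedTuple):
    """A <time, key, value> triple as exchanged between engines.
    Field names and types are illustrative only."""
    time: int                 # event time, required to be > 0
    key: str                  # feature name from the DEE vocabulary
    value: Union[float, str]

# A hypothetical DEE vocabulary (Sigma_DEE): a finite set of feature names.
SIGMA_DEE = {"show_user_affinity", "viewers", "genre_news"}

def valid_input(t: Triple) -> bool:
    """Check the constraints stated in Section 4.3.1."""
    return t.time > 0 and t.key in SIGMA_DEE

print(valid_input(Triple(1354233600, "viewers", 1000000.0)))  # → True
```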
The output provided is shown in Table 6. It will be delivered by the Spout as serialized triples <timestamp, key, value>.

Output                   Type           Description
User Id/Show Id          String         The Id that uniquely identifies the show or user of current interest
User Groups/Show Groups  Integer Array  A list of the integers that uniquely identify the user or show groups
Timestamp                long

Table 6: SUM Engine Output

5 The ViSTA-TV Recommendation Rules

The recommendation rules engine (RERULES) will determine whether and to whom to give a recommendation. It consumes real-valued normalized features. A recommendation rule thus comprises a weight vector ⟨w_0, w_1, ..., w_N⟩ ∈ R^(N+1) and a feature vector ⟨x_1, ..., x_N⟩ ∈ [0, 1]^N, where [0, 1] is a closed real-valued interval. The CEP engine, the SUM engine and the DER engine may contribute features to the RERULES engine. Table 7 illustrates the nature of these features.

Feature                         Nature   Values  Source engine
Temporal fact                   Boolean  {0, 1}  CEP
Show-User affinity              Real     [0, 1]  SUM
Real video content feature      Real     [0, 1]  SUM
Symbolic video content feature  Boolean  {0, 1}  DER

Table 7: The features for the RERULES engine and their nature

Symbolic video content features, for example weekdays, need to be binarized by the RERULES engine. Any temporal fact F will be sent to the RERULES engine twice: first at the time it starts being valid and second at the time it becomes invalid. By treating the temporal fact F as a feature we convey the information that the feature changed its value from valid to invalid or vice versa. Computed statistics, in particular on the Show-User assertions, will be handled by the CEP engine, whereas statistics considered to be features are handled by the RERULES engine directly. Hence it is the CEP engine, not the RERULES engine, that will compute, for example, that a football match has attracted more than a million viewers over the last ten minutes.
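The rule structure introduced above, a weight vector applied to normalized features and compared against a threshold, can be sketched as a simple linear threshold rule. This is an illustrative sketch with f taken to be the identity; all names and weights are hypothetical, not the RERULES implementation:

```python
def fire_recommendation(weights, features):
    """Evaluate one recommendation rule: fire iff the weighted sum of the
    normalized features exceeds the threshold weight w_0 (here f = identity).

    weights  -- [w_0, w_1, ..., w_N]; w_0 acts as the threshold
    features -- [x_1, ..., x_N], each normalized into [0, 1]
    """
    w0, ws = weights[0], weights[1:]
    assert len(ws) == len(features), "one weight per feature"
    assert all(0.0 <= x <= 1.0 for x in features), "features are normalized"
    score = sum(w * x for w, x in zip(ws, features))
    return score > w0

# Hypothetical features: show-user affinity, a temporal fact, a genre flag.
weights = [0.9, 1.0, 0.5, 0.3]   # w_0 = 0.9 is the threshold
print(fire_recommendation(weights, [0.8, 1.0, 0.0]))  # → True
```

Because the decision is a thresholded linear combination, the weights can later be fitted by any linear-classifier learner once training data is available.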
The RERULES engine, in contrast, may define a feature "viewers" whose value may then directly influence the decision whether to fire a recommendation. The recommendations themselves will be computed by applying a function f to a linear combination of the features, f(w_1 x_1 + ... + w_N x_N), and evaluating the outcome against the threshold w_0. This allows us to implement the rules with classifiers such as decision trees, perceptrons, support vector machines, etc. Learning these rules will become possible once features from the data enrichment (WP2) and the show-user-model are available.

6 Conclusion

For Deliverable 4.4, UZH and TUDo have created the fundamental data models for all components of the ViSTA-TV recommendation engine. Recommendations will be implemented as linear classifiers and make their decisions based on feature events, temporal facts and show-user-model information. The temporal facts are computed by an extension of SPARQL which is under development; it will essentially take feature events and show-user-model data as input and match these against extended SPARQL queries. The show-user-model will provide, for a given show, the cluster of users assigned to it, and vice versa. Being the only one of the above components that is not time-critical, applying the actual model involves queries to the data warehouse about shows that will be broadcast in the future. The show-user-model will be determined by a two-way hierarchical clustering performed offline. The recommendation rules will be learned offline on previous user behavior, and the complex events will initially be hand-crafted and later learned as the feature data and the show-user-model become available. First investigations into promising features and events have been carried out and will result in actual models for the next milestone.
Acknowledgements

The research leading to these results has received funding from the European Union Seventh Framework Programme FP7/2007-2011 under grant agreement n° 296126. We would like to thank Jörg-Uwe Kietz, Libby Miller and the reviewers for their helpful comments on this text.

References

[1] Evrim Acar, Seyit Çamtepe, and Bülent Yener. Collective Sampling and Analysis of High Order Tensors for Chatroom Communications. In Sharad Mehrotra, Daniel Zeng, Hsinchun Chen, Bhavani Thuraisingham, and Fei-Yue Wang, editors, Intelligence and Security Informatics, volume 3975 of Lecture Notes in Computer Science, pages 213-224. Springer Berlin / Heidelberg, 2006.

[2] Darko Anicic, Paul Fodor, Sebastian Rudolph, and Nenad Stojanovic. EP-SPARQL: A unified language for event processing and stream reasoning. In Proceedings of the 20th International Conference on World Wide Web (WWW '11), pages 635-644, New York, NY, USA, 2011. ACM.

[3] Arvind Arasu, Shivnath Babu, and Jennifer Widom. The CQL continuous query language: Semantic foundations and query execution. The VLDB Journal, 15(2):121-142, July 2005.

[4] Arindam Banerjee, S. Basu, and S. Merugu. Multi-way clustering on relation graphs. In Proceedings of the SIAM International Conference on Data Mining (SDM). SIAM, 2007.

[5] D. F. Barbieri, D. Braga, S. Ceri, E. Della Valle, and M. Grossniklaus. C-SPARQL: SPARQL for continuous querying. In Proceedings of the 18th International Conference on World Wide Web, pages 1061-1062. ACM, 2009.

[6] Daniel Billsus and Michael J. Pazzani. Learning Collaborative Information Filters. In Proceedings of the Fifteenth International Conference on Machine Learning (ICML '98), pages 46-54, 1998.

[7] Sirish Chandrasekaran, Owen Cooper, Amol Deshpande, Michael J. Franklin, Joseph M. Hellerstein, Wei Hong, Sailesh Krishnamurthy, Sam Madden, Vijayshankar Raman, Fred Reiss, and Mehul A. Shah. TelegraphCQ: Continuous Dataflow Processing for an Uncertain World.
In Michael Stonebraker, Jim Gray, and David DeWitt, editors, CIDR 2003, First Biennial Conference on Innovative Data Systems Research, Asilomar, CA, USA, January 5-8, 2003, Online Proceedings. www.cidrdb.org, 2003.

[8] Graham Cormode and Marios Hadjieleftheriou. Methods for finding frequent items in data streams. The VLDB Journal, 19(1):3-20, December 2010.

[9] Yanlei Diao, Neil Immerman, and Daniel Gyllstrom. SASE+: An agile language for Kleene closure over event streams. Technical report, University of Massachusetts Amherst, Department of Computer Science, 2008.

[10] Oliver Flasch, Andreas Kaspari, Katharina Morik, and Michael Wurst. Aspect-based tagging for collaborative media organization. In From Web to Social Web: ..., 2007.

[11] B. C. M. Fung, Ke Wang, and Martin Ester. Hierarchical document clustering using frequent itemsets. In Proceedings of the SIAM International Conference on Data Mining 2003 (SDM 2003), 2003.

[12] Thomas Hofmann. Latent semantic models for collaborative filtering. ACM Transactions on Information Systems, 22(1):89-115, January 2004.

[13] Yehuda Koren, Robert Bell, and Chris Volinsky. Matrix factorization techniques for recommender systems. Computer, pages 42-49, 2009.

[14] Yamuna Krishnamurthy. Data driven discretization. Website, 2012. http://kirmes.cs.uni-dortmund.de/vista/data_driven_discretization.html.

[15] Danh Le-Phuoc, Minh Dao-Tran, Josiane Xavier Parreira, and Manfred Hauswirth. A Native and Adaptive Approach for Unified Processing of Linked Streams and Linked Data. In Lora Aroyo, Chris Welty, Harith Alani, Jamie Taylor, Abraham Bernstein, Lalana Kagal, Natasha Noy, and Eva Blomqvist, editors, The Semantic Web - ISWC 2011: 10th International Semantic Web Conference, Bonn, Germany, October 23-27, 2011, Proceedings, Part I, volume 7031 of Lecture Notes in Computer Science, pages 370-388. Springer Berlin / Heidelberg, 2011.

[16] Y.-R. Lin, J. Sun, and P. Castro. MetaFac: Community discovery via relational hypergraph factorization.
In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '09), pages 527-536, 2009.

[17] Prem Melville, Raymond J. Mooney, and Ramadass Nagarajan. Content-Boosted Collaborative Filtering for Improved Recommendations. In Proceedings of the Eighteenth National Conference on Artificial Intelligence (AAAI-2002), Edmonton, Canada, pages 187-192, July 2002.

[18] Ingo Mierswa, Katharina Morik, and Michael Wurst. Collaborative Use of Features in a Distributed System for the Organization of Music Collections. In Shephard Shen and Liu Cui, editors, Intelligent Music Information Systems: Tools and Methodologies, pages 147-176. Idea Group Publishing, 2007.

[19] Raymond J. Mooney and Loriene Roy. Content-Based Book Recommending Using Learning for Text Categorization. In Proceedings of the Fifth ACM Conference on Digital Libraries, San Antonio, TX, pages 195-240, June 2000.

[20] Katharina Morik, Andreas Kaspari, Michael Wurst, and Marcin Skirzynski. Multi-objective frequent termset clustering. Knowledge and Information Systems, 30(3):715-738, July 2011.

[21] Harsha Nagesh, Sanjay Goil, and Alok Choudhary. Adaptive grids for clustering massive data sets. In Proceedings of the 1st SIAM International Conference on Data Mining, 2001.

[22] Andrew I. Schein, Alexandrin Popescul, Lyle H. Ungar, and David M. Pennock. Methods and Metrics for Cold-Start Recommendations. In Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2002), 2002.