D4.4 Recommender Data Models - ViSTA-TV
ViSTA-TV: Video Stream Analytics for Viewers in the TV Industry
FP7 STREP ICT-296126 | co-funded by the European Commission
ICT-2011-SME-DCL | SME Initiative on Digital Content and Languages

D4.4 Recommender Data Models
Thomas Scharrenbach (UZH), Yamuna Krishnamurthy (TUDo), Christian Bockermann (TUDo)

Project start date:   June 1st, 2012        Project duration: 24 months
Document identifier:  ViSTA-TV/2012/D4.4    Version:          v1.0
Date due:             30 November 2012      Status:           Final
Submission date:      30 November 2012      Distribution:     PU
www.vista-tv.eu

This document is part of a collaborative research project funded by the FP7 ICT Programme of the Commission of the European Communities, grant number 296126. The following partners are involved in the project:

University of Zurich (UZH) - Coordinator
Dynamic and Distributed Information Systems Group (DDIS)
Binzmühlstrasse 14, 8050 Zürich, Switzerland
Contact person: Abraham Bernstein, E-mail: [email protected]

Technische Universität Dortmund (TUDo)
Computer Science VIII: Artificial Intelligence Unit
D-44221 Dortmund, Germany
Contact person: Katharina Morik, E-mail: [email protected]

Rapid-I GmbH (RAPID-I)
Stockumer Strasse 475, 44227 Dortmund, Germany
Contact person: Ingo Mierswa, E-mail: [email protected]

Zattoo Europa AG (Zattoo)
Eggbühlstrasse 28, CH-8050 Zürich, Switzerland
Contact person: Bea Knecht, E-mail: [email protected]

Vrije Universiteit Amsterdam (VUA)
Web & Media Group, Department of Computer Science, Faculty of Sciences (FEW)
De Boelelaan 1081a, NL-1081 HV Amsterdam, The Netherlands
Contact person: Guus Schreiber, E-mail: [email protected]

The British Broadcasting Corporation (BBC)
56 Wood Lane / Centre House - Broadcasting House
UK-W12 7SB Northampton, United Kingdom
Contact person: Chris Newell, E-mail: [email protected]

Copyright © 2012 The ViSTA-TV Consortium

Executive Overview

This document reports on the achievements made for Deliverable 4.4, "Recommender Data Models", for the EU FP7
project ViSTA-TV. We give an overview of how the data models for the recommendations engine are defined, determined, and applied. The data models are presented along with the components of the recommendations engine to which they belong. Recommendations are computed in the recommendations rules engine, which consumes the output of the complex event processing engine and the show-user-model engine (both part of the recommendations engine) as well as of the data enrichment engine (described in document ViSTA-TV/2012/D2.1). We give definitions of the actual processes by which the show-user-model is determined, how the complex event processing engine works, and how recommendations are computed. For an overview of how these components will be implemented we refer to document ViSTA-TV/2012/D4.1.

Contents

1 Introduction
  1.1 Outline
2 The ViSTA-TV Project Setup
  2.1 Scope of this document
  2.2 ViSTA-TV Complex Event Processing and Temporal Facts
  2.3 ViSTA-TV Recommendation Rules
  2.4 ViSTA-TV Show-User-Model
3 Complex Events and Temporal Facts
  3.1 Feature Events
  3.2 Show-User-Model Events and Temporal Facts
4 The ViSTA-TV Show-User-Model
  4.1 Frequent Item Set Mining
  4.2 Subspace Clustering
  4.3 Data format
    4.3.1 Input
    4.3.2 Output
5 The ViSTA-TV Recommendation Rules
6 Conclusion

1 Introduction

This document reports on the achievements made for Deliverable 4.4, "Recommender Data Models", for the EU FP7 project ViSTA-TV. These recommender data models are defined within the context of the recommendations engine of the ViSTA-TV Stream Processing Engine. For information about the former we refer to document ViSTA-TV/2012/D4.1; for an overview of the latter please refer to ViSTA-TV/2012/D3.1. In this document we give an overview of how the data models for the recommendations engine are defined, determined, and applied. The data models are presented along with the components of the recommendations engine to which they belong. Recommendations are computed in the recommendations rules engine, which consumes the output of the complex event processing engine and the show-user-model engine (both part of the above-mentioned recommendations engine) as well as of the data enrichment engine (described in document ViSTA-TV/2012/D2.1). We give definitions of the actual processes by which the show-user-model is determined, how the complex event processing engine works, and how recommendations are computed.

1.1 Outline

This document is structured as follows: In Section 2 we give an overview of the ViSTA-TV project as given in the Description of Work (DOW). In the subsequent sections we give precise definitions of the data formats of the CEP engine (Section 3), the SUM engine (Section 4), and the RERULES engine (Section 5). We conclude with a summary in Section 6.

2 The ViSTA-TV Project Setup

Please note that much of the content of this section was re-used from the Description of Work document (DOW) to make this document self-contained.
In the past, video content was consumed through broadcast TV delivered over the air or by cable. Broadcasters had information—both meta-data and the actual video stream—about their own programming as well as coarse usage information supplied by survey companies such as Nielsen or GfK.1 Cable providers viewed themselves as pure bundling and transportation channels, whilst viewers used secondary sources to inform themselves about the desired programming. Fueled by Internet TV (IPTV) offers of telecommunication giants, cable providers, and pure online providers such as YouTube or Hulu, as well as digital rental companies such as NetFlix and iTunes, people increasingly watch video content delivered over IP networks. The technological issues arising from IPTV as well as the opportunities for interactive, personalized TV have been amply investigated in EU projects such as NoTube. However, the use of behavioral information accruing during the process of viewing live IPTV, in combination with the actual video streams and the electronic program guide (EPG) information, as a foundation for a market research data marketplace has been largely overlooked.

Figure 1: The ViSTA-TV universe (logos are owned by respective companies).

The ViSTA-TV project proposes to view IPTV as a two-way channel, where the viewer can take advantage of the video streams whilst the ViSTA-TV platform employs behavioral information about the viewer, gathered by IPTV transmitters, to improve the experience for all participants in the TV supply chain (see Figure 1).
Specifically, ViSTA-TV proposes to gather consumers' viewing behavior and the actual video streams from broadcasters and IPTV providers and to combine them with enhanced EPG information as the input for a holistic live-stream analysis. The analysis—comprising personalized feature construction, supplementary tag generation, and real-time recommendation generation—will provide the input for a triplication procedure that generates data in the Resource Description Framework (RDF)—a data format for storing typed graphs1—and enables the reuse of the resulting data for three different data pools:

1. Linked Open Data as a basis for analytic processing,
2. Viewing Behavior Data, and
3. Recommendation Data.

Each of these data pools with its production pipeline will provide the foundation for a data marketplace with its own commercial as well as societal rationale, such as bootstrapping innovative applications relying on TV data, selling services relying on detailed viewing-behavior analysis beyond today's capabilities, or helping viewers find the shows that best match their interests at any given point in time.

The heart of the ViSTA-TV platform will be an online analysis system that operates on streams of content data in real time. This system will be developed on the basis of the STORM stream processing framework.2 We will extend existing methods for feature extraction from content streams, as well as methods for querying and matching events on streams of background information, to make them stream-ready. Last but not least, we will develop methods for real-time recommendation based on complex event processing and hybrid user/show profiles. For learning and evaluating the corresponding data models we will rely upon existing offline machine learning methods.
2.1 Scope of this document

This document describes the data models for the first version of the Recommendations Engine of the ViSTA-TV stream-processing engine, the former being described in document ViSTA-TV/2012/D4.1, the latter in document ViSTA-TV/2012/D3.1. The ViSTA-TV stream-processing engine comprises the following components:

Data Enrichment Engine (DE)
Complex Event Processing Engine (CEP)
Show-User-Model Engine (SUM)
Recommendations Rules Engine (RERULES)

The DE, CEP and SUM engines provide input for the RERULES engine, which in turn provides input for the end-user apps and the ViSTA-TV data warehouse. In its current version the data models comprise a real-valued normalized feature vector, with each feature assigned a positive real number referred to as its weight. The features will be multiplied with their weights and a recommendation given if the weighted sum exceeds a pre-defined threshold. Note that this document does not provide any information about the actual implementation and data formats of the input and the output; these are covered in document ViSTA-TV/2012/D4.1.

1 http://nielsen.com, http://gfk.com
2 http://storm-project.net/

2.2 ViSTA-TV Complex Event Processing and Temporal Facts

The data enrichment process will generate event-streams of feature-value changes that describe shows or simple usage events such as "Jon Doe" watches Sports Today. Apart from their direct usage, these event-streams provide the basis for constructing more complex events as aggregated/constructed features (e.g., the complex event after-work-TV-night could be characterized as watching the sequence "news", "weather", "crime", "show"). The Complex Event Processing (CEP) engine performs the detection of such complex events. Furthermore, the CEP engine will also be able to construct temporal facts, i.e.
things that are known to be true for a certain amount of time—in contrast to events that take place at a certain point in time and are over in the next instant. The difference between events and temporal facts is best illustrated by an example: the event that a thousand football fans switch to the match on channel no. 5 may trigger the temporal fact that a lot of people interested in football are watching the match to become true. If a comparable number of users switch channels again (or even turn off their IPTV consuming device) and only a handful of football fans remain watching the match, this may trigger the above temporal fact to end. In the ViSTA-TV Stream Processing engine, temporal facts and complex events serve as an input for the recommendations rules engine.

To construct aggregated tags, ViSTA-TV will develop a mechanism that efficiently detects complex event patterns on tag (or RDF-triple) streams. State-of-the-art event processing frameworks such as SASE+ [9] or TelegraphCQ [7] are defined on a relational data model. The relational model, however, makes it difficult to incorporate background knowledge efficiently. Linked Data was designed to overcome this limitation of the relational model. EP-SPARQL [2] was recently presented for defining and querying complex events on streams of Linked Data or, more specifically, RDF. Yet, EP-SPARQL does not make assumptions on real-time processing or complex time constraints as automata-based approaches like SASE+ do. C-SPARQL [5] is an extension of the relational stream querying language CSQL [3]. Both approaches process triples/tuples within a fixed window, which can be based on time intervals, the number of data triples, or partitions based on aggregates. C-SPARQL considers the triples in the window as a static knowledge base. An extension making this approach more dynamic was proposed recently [15].
However, complex events whose components may go beyond the window limits (such as long-term TV usage behavior) are not yet taken into account. Real-time complex event matching requires efficient and effective methods for pruning the current matching hypotheses. Current approaches prune by implementing memory eviction schemes such as first-in-first-out (essentially a time window) or least-recently-used (i.e., keeping only active patterns, not the most useful ones). These schemes essentially introduce a time window through the backdoor. ViSTA-TV, in contrast, will employ the SASE+ automata-based model and the EP-SPARQL querying capabilities, but will develop methods for estimating the probability that an automaton (representing a complex event pattern) will match. Hence, this novel approach will base its eviction decisions on inherent data statistics, which is expected to minimize the number of false positives and false negatives. Consequently, ViSTA-TV will provide novel methods for matching time-window-agnostic complex events on streams of Linked Data in real time, which will be used for feature/tag construction.

In the present version, however, the focus lies on integrating the components of the recommendations engine into the overall framework of the ViSTA-TV Stream Processing Engine. Therefore, the actual implementation does not yet extend SPARQL or implement a complex event processor. To meet the goal of this deliverable, the components of the CEP engine are implemented as dummies.

2.3 ViSTA-TV Recommendation Rules

Recommender systems have long been in the focus of e-commerce; the famous NetFlix challenge for movie recommendation is a prominent example. The techniques for recommender systems are generally categorized as content-based, collaborative filtering, or hybrid methods. Content-based methods focus on recommending objects that are similar to the ones a user has previously chosen [19].
Methods based on collaborative filtering identify recommendable items based on user similarity [12, 10, 18]. Hybrid approaches combine content-based and collaborative methods and have been reported to raise the accuracy of recommendations [17]. Another distinction made in the recommender system literature is memory- versus model-based approaches. Memory-based approaches work directly on the user/content data and produce recommendations based on the k-nearest neighbors. Hence, every new user or item needs to be compared to all existing users/items, a computational cost that is unacceptable for our setting. More recently, model-based approaches, which try to model the relation between users and content mathematically, have been shown to not only speed up the recommendation process but may even improve accuracy (e.g., matrix factorization techniques have outperformed neighborhood-based methods in the NetFlix challenge [13]). Furthermore, tensor and matrix factorization techniques, which have also recently gained interest in social network analysis [1, 4, 16], support this trend. Unfortunately, however, these model-based techniques are still computationally intensive and therefore do not fit the scenario of online recommendation of shows and movies as intended by ViSTA-TV.

ViSTA-TV will ensure real-time recommendations by addressing the computational complexity of current approaches whilst improving recommendation quality through personalized models: While some systems work on simple events, none of the state-of-the-art recommender systems operates on complex events. ViSTA-TV will make it possible to use complex events for the actual recommendations. This has not been investigated so far, in particular not for real-time processing. In addition, current recommendation approaches work with handcrafted and manually selected features to describe items. In contrast, ViSTA-TV will integrate dynamic feature extraction and selection within the recommendation process.
The current implementation of the recommendation rules will be a linear combination of the features that are fed into this engine. A recommendation will then be given according to the outcome of applying the linear combination within a classification system such as a perceptron or a decision tree. The dummy engine will output recommendations via an XMPP server. Yet, the STORM framework provides the necessary flexibility to add further output mechanisms as required. It is clear that methods like nearest neighbor (where all computations take place in the model application phase and none in the model training phase) are not suitable for real-time recommendations. ViSTA-TV therefore distinguishes offline training of recommendation models and real-time application of the models. The application will be done by compiling the models into sets of recommendation rules, where each rule is a linear combination of the features fed into this engine. A recommendation will then be given if the linear combination exceeds a threshold, is the highest recommendation among all for a user, and, in particular, is significantly higher than the recommendation value for the show the user currently watches.

2.4 ViSTA-TV Show-User-Model

To address the computational effort of current approaches we intend to take a two-pronged approach. First, TUDo and UZH will not apply the recommendation algorithms to the actual video/audio streams but rather to the feature-event streams extracted in WP2 (see above). This will massively reduce the dimensionality of the streams. Second, we will use a personalized heterogeneous clustering approach developed by TUDo that exploits frequent term set clustering ([20, 11]). To speed it up, TUDo will combine this approach with online item set mining [8].
The combination will result in a novel approach for reducing the computational complexity of recommendation systems, which has the added benefit of providing improved recommendations due to its underlying personalized clustering.

Collaborative filtering suffers from the so-called large-scale cold-start problem: new items are not recommended because not enough users have rated them, and they do not get ratings because they are not recommended. Both hybrid and model-based approaches have been shown to reduce the cold-start problem [22]; still, recommender systems have mostly been used where new content is the exception and not the rule. In the broadcast TV world of ViSTA-TV, many shows are first-time transmissions. Hence, TUDo and UZH will explore a novel hybrid two-way clustering approach: based on available content descriptions (WP1) and external data (WP6) (e.g. this movie is a TV premiere, but it is already rated as a cinema movie) on one side, as well as user viewing information on the other side, clusters of shows and users are discovered in a mutually reinforcing process. Once these clusters are available, the recommendation can be viewed as a prediction task [6] using techniques such as Bayesian classifiers, decision trees or artificial neural networks. Since the recommendations are now based not only on single shows but also on show clusters, new shows can simply be assigned to a cluster based on their content characteristics and then be recommended along with other similar shows, overcoming the cold-start problem. We refer to this two-way hierarchical clustering approach as determining the show-user-model (SUM). While the clustering clearly is an offline task, the application of its learned model will be an integral part of the ViSTA-TV Stream Processing Engine. Well before a show is broadcast, the SUM engine will query the data warehouse for the shows that will actually be broadcast.
This information, along with its application to the show-user-model, will serve as an input to the CEP engine as well as the RERULES engine. Note that feeding these engines with the data can be performed some time before the actual broadcasting takes place. The SUM engine can hence be seen as an initializer of the CEP and the RERULES engines, allowing them to prepare and optimize their processing for upcoming shows and the stream of feature events before they take place. Note that the SUM engine currently does not receive input from any other stream-processing component in the ViSTA-TV stream-processing engine. Fortunately, the STORM framework underlying the ViSTA-TV stream-processing engine provides means for easily adding input from other stream-processing components, e.g. the data enrichment engine, should that become a necessity.

3 Complex Events and Temporal Facts

In ViSTA-TV, events are expressed in the Resource Description Framework (RDF). The data model underlying RDF is a directed graph with typed edges. The nodes of the graph are referred to as resources and values, whereas the edges are referred to as predicates (aka properties). Hence a statement in RDF is a triple of the form ⟨subject, predicate, object⟩, where the predicate must be an IRI, the subject may also be an anonymous identifier (aka blank node), and the object may also be a data value (aka literal).

In ViSTA-TV, RDF triples additionally carry time information. We distinguish between events and facts. Events happen at a certain point in time. Events that are represented by a single RDF triple are called atomic events, in contrast to complex events that comprise a set of atomic events. Both atomic and complex events are therefore asserted with two timestamps: ts is the earliest time at which an atomic event in the set happened and te the latest time at which an atomic event in the set happened.
Only at time te do we know that an event E definitely took place. Note that for atomic events we always have ts = te, whereas for complex events we can only guarantee ts ≤ te. In contrast, temporal facts F are facts for which we know the interval [ts, te] during which they are considered valid. Note that the fact as such is always considered true but not necessarily valid at every time instant. Events may trigger a fact to become valid, and the validity state of a fact may trigger events to happen.

3.1 Feature Events

The CEP engine will—besides other input—consume events on features from the ViSTA-TV data enrichment engine. The information provided comprises a 4-tuple ⟨t, id, key, value⟩ where t is the time at which the change of the feature happened, id is the id of the context (e.g. channel, show, user) related to the feature, key is the identifier of the feature, and value is the new value of the feature. The CEP engine considers these 4-tuples as atomic events where ts = te = t, subject = id, predicate = key, and object = value.

3.2 Show-User-Model Events and Temporal Facts

The SUM engine will push information about which users are potentially interested in the shows that are about to be broadcast to the CEP engine. This information will be presented as feature change events, as in the case of the data enrichment engine (see above). Since we know the duration of a show beforehand, we may pre-compute temporal facts from this information in the CEP engine. Yet, the computation of these facts is done by the CEP engine and not by the SUM engine. Note that SUM features that are solely based on EPG data and past behavior are not time-critical and may be pushed to the recommendations rules engine some time before the show is broadcast; EPG data are available weeks or even months before transmission. We plan, for example, to provide the ViSTA-TV end-user apps with recommendations that are solely based on these SUM features.
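The event and temporal-fact model of this section can be sketched in a few lines of Python. The class and field names below are our own illustration and not the ViSTA-TV implementation; only the structure (timestamped triples, ts/te semantics, and the 4-tuple mapping of Section 3.1) follows the text:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AtomicEvent:
    """A single timestamped RDF triple; for atomic events ts == te."""
    subject: str
    predicate: str
    object: str
    t: int  # time of the feature change

    @property
    def ts(self): return self.t
    @property
    def te(self): return self.t

@dataclass(frozen=True)
class ComplexEvent:
    """A set of atomic events; ts/te span the earliest/latest member."""
    events: frozenset

    @property
    def ts(self): return min(e.ts for e in self.events)
    @property
    def te(self): return max(e.te for e in self.events)

@dataclass(frozen=True)
class TemporalFact:
    """A triple considered valid during the interval [ts, te]."""
    subject: str
    predicate: str
    object: str
    ts: int
    te: int

    def valid_at(self, t):
        return self.ts <= t <= self.te

def from_feature_event(t, id_, key, value):
    """Map a feature event 4-tuple (t, id, key, value) onto an atomic
    event: ts = te = t, subject = id, predicate = key, object = value."""
    return AtomicEvent(subject=id_, predicate=key, object=value, t=t)
```

The `from_feature_event` helper mirrors the mapping stated in Section 3.1 for the CEP engine's input; a complex event built from several atomic events only guarantees ts ≤ te, as described above.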
Assume, for example, that we know that a certain cluster of users is likely to watch a show that is to be broadcast in the evening. Applying the SUM engine, we can recommend that very show to the corresponding users even the day before (given that the EPG data is available then). While the goal of ViSTA-TV is to provide real-time recommendations, the pre-computed recommendations, first, give us default recommendations: the recommendations engine is not required to provide a real-time recommendation if, for example, there is currently no show to recommend with respect to the real-time data. Second, they allow us to compare whether users actually follow the real-time recommendations or the pre-computed ones. The pre-computed recommendations are hence an important component for evaluating the quality and the need of real-time recommendations.

4 The ViSTA-TV Show-User-Model

The TUDo SUM engine provides user→show-group and show→user-group mappings. These mappings provide insight into the interest relationship between users and shows: we know what kind of shows a user is interested in and which users are interested in a specific show. The knowledge of this two-way relationship can be used by the recommendation system to predict the shows that a user will be interested in, or the users that need to be targeted for a particular show. The SUM engine provides these mappings as input to the CEP and the RERULES engines so they can prepare and optimize their processing. The show groups and user groups are clusters that are determined offline. TUDo is looking into the following two approaches for the offline clustering, described in Section 4.1 and Section 4.2, respectively. The two approaches have provided promising results from which useful features and groups can be determined, but further analysis is necessary for a more comprehensive assessment of both approaches.
Frequent Item Set Mining
Subspace Clustering

The SUM engine uses the user groups and show groups that have been determined offline to associate a user or show with the groups online. The engine can be triggered to do this mapping by some event, for example the system time indicating the beginning of a new show, or the detection of an event such as a user logging in, which it receives from the data enrichment engine. It can then poll the data warehouse for existing user→show-group or show→user-group mappings discovered by either frequent item set mining or subspace clustering and provide this information to the CEP engine. The SUM engine will be implemented using the TUDo streams framework described in ViSTA-TV/2012/D2.1 and will provide its output through a STORM Spout.

4.1 Frequent Item Set Mining

Some initial experiments were conducted to identify frequent item sets in order to discover interesting groups of items, where the items could be channels viewed, genres, and programs. The experiment was done with both BBC and Zattoo data for channels and genres. The FP-Growth operator from RapidMiner was used to extract the frequent item sets, after some initial pre-processing of the input data, as shown in Figure 2.

Figure 2: RapidMiner operators for frequent item set mining

The frequent item set results for BBC channels and genres are shown in Table 1 and Table 2, respectively. The support of an item set is the proportion of transactions in the data set that contain the item set. We see from the tables that some frequent item sets have a very low support value, and we can choose to ignore them. But it is still important to study all the item sets discovered, as the more unusual ones could have a low support. Table 1 shows that bbcone and cbeebies are popular channels. It also indicates that bbcone and bbcnews are popularly watched together; that is, people watching bbcone also watch bbcnews.
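As a small illustration of the support measure used in these experiments, the following sketch computes support values over a handful of invented viewing transactions (the channel names follow Table 1, but the transactions and resulting numbers are made up; FP-Growth itself is not reproduced here, this is plain counting):

```python
from itertools import combinations
from collections import Counter

# Each transaction is the set of channels one user watched in a session
# (invented example data; the real input comes from the BBC user logs).
transactions = [
    {"bbcone"}, {"bbcone", "bbcnews"}, {"cbeebies"},
    {"bbcone"}, {"bbcone", "bbctwo"}, {"bbcnews"},
]

def support(itemset, transactions):
    """Proportion of transactions that contain all items of the item set."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

# Enumerate every item set of size 1 and 2 that actually occurs and
# count its transactions, analogous to the Support column of Table 1.
counts = Counter()
for t in transactions:
    for k in (1, 2):
        for combo in combinations(sorted(t), k):
            counts[frozenset(combo)] += 1

for itemset, c in counts.most_common():
    print(sorted(itemset), c / len(transactions))
```

An itemset miner such as FP-Growth avoids this exhaustive enumeration by pruning item sets below a minimum support, which is what makes it feasible on the full BBC and Zattoo logs.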
Channels                      Support
bbcone                        0.67553
cbeebies                      0.11074
bbcnews                       0.10055
bbcthree                      0.05847
cbbc                          0.04917
bbctwo                        0.03754
bbcone, bbcnews               0.01041
bbcone, bbctwo                0.0088
bbcfour                       0.00842
bbcone, cbeebies              0.00437
bbcone, bbcthree              0.00377
cbeebies, cbbc                0.00293
bbcone, cbbc                  0.00227
bbcnews, bbctwo               0.00216
bbcone, bbcfour               0.00177
cbeebies, bbctwo              0.0015
bbcthree, bbctwo              0.00133
bbctwo, bbcfour               0.00116
cbbc, bbctwo                  0.00105
bbcnews, bbcfour              0.00072
bbcone, bbcthree, bbctwo      0.00066
bbcnews, bbcthree             0.00066
bbcone, bbcnews, bbctwo       0.00061
bbcthree, bbcfour             0.00061
cbeebies, bbcnews             0.00055
cbeebies, cbbc, bbctwo        0.00044
bbcone, bbctwo, bbcfour       0.00039
bbcone, bbcthree, bbcfour     0.00033
bbcone, cbbc, bbctwo          0.00028
bbcone, cbeebies, cbbc        0.00028

Table 1: Frequent item sets of BBC channels

Similarly, we see from Table 2 that news and childrens entertainment and comedy are popular genres. We also see that some genres are popular together among viewers, such as news and drama, which is rather intuitive since these two genres are hugely popular in the average age group of 35-40. We are hoping to discover more unusual groups with further frequent item set mining after filtering out the more obvious item sets. We also generated frequent item sets for Zattoo shows and genres, which are shown in Table 3 and Figure 3, respectively. For the shows we noticed that those that were aired consecutively were usually seen together.

Figure 3: Frequent item sets of Zattoo genres

The frequent item sets data provides insight into the channels and genres that are most popular and shows which channels and genres are viewed together. This information is useful for generating recommendations. The above is a preliminary analysis. We will continue with further analysis of these item sets and also work on generating more interesting and unusual ones.
D4.4 Recommender Data Models

Genres                                                                  Support
news                                                                    0.683007102
childrens_entertainmentandcomedy                                        0.112766326
factual_familiesandrelationships                                        0.076043651
drama                                                                   0.068941625
childrens_drama                                                         0.054131301
factual                                                                 0.045643513
childrens_entertainmentandcomedy, childrens_drama                       0.036462844
childrens                                                               0.032652001
childrens_entertainmentandcomedy, childrens                             0.029014377
childrens_factual                                                       0.018101507
weather                                                                 0.014204053
factual_artscultureandthemedia                                          0.013078122
drama, weather                                                          0.012385242
comedy_sitcoms                                                          0.01021999
childrens_drama, childrens_factual                                      0.009613719
childrens_entertainmentandcomedy, childrens_factual                     0.008920838
drama, factual                                                          0.006842196
news, drama                                                             0.005976096
news, childrens_entertainmentandcomedy                                  0.005716265
childrens_entertainmentandcomedy, childrens_drama, childrens_factual    0.005543045
news, factual                                                           0.003984064
childrens_drama, childrens                                              0.003551013
childrens_entertainmentandcomedy, childrens_drama, childrens            0.003551013
learning_preschool                                                      0.002771523
factual_familiesandrelationships, drama                                 0.002771523
factual_familiesandrelationships, factual                               0.002425082
drama, factual_artscultureandthemedia                                   0.002251862
childrens_entertainmentandcomedy, learning_preschool                    0.002165252
news, weather                                                           0.002165252
news, childrens_drama                                                   0.001992032
childrens, learning_preschool                                           0.001818812
factual, factual_artscultureandthemedia                                 0.001818812
news, childrens_factual                                                 0.001818812
news, drama, weather                                                    0.001818812
factual_familiesandrelationships, comedy_sitcoms                        0.001645592
childrens_entertainmentandcomedy, childrens, learning_preschool         0.001558981
news, factual_artscultureandthemedia                                    0.001558981
drama, comedy_sitcoms                                                   0.001472371
news, factual_familiesandrelationships                                  0.001472371
childrens_factual, learning_preschool                                   0.001299151
childrens_entertainmentandcomedy, childrens_factual, learning_preschool 0.001299151
news, childrens_entertainmentandcomedy, childrens_drama                 0.001299151
childrens, childrens_factual                                            0.001125931

Table 2: Frequent item sets of BBC genres

Support  Shows
0.370    {Tagesschau, Wetter, Börse im Ersten}
0.328    {heute, Börse im Ersten}
0.319    {Tagesschau, Tatort}
0.299    {Tagesschau, ZDF-History}
0.284    {heute, ZDF-History}
0.283    {Tagesschau, heute-show}
0.278    {heute, Tatort}
0.275    {Tagesschau, hallo deutschland}
0.273    {Tagesschau, Markus Lanz}
0.256    {Tagesschau, Glücksspirale}
0.255    {Tagesschau, Günther Jauch}
0.251    {Tagesschau, Tatort: Hinkebein}
0.251    {Tagesschau, Brisant}
0.218    {ZDF-History, heute 100 sec}
0.211    {Wetter, heute, Günther Jauch}

Table 3: Frequent item sets of Zattoo shows

4.2 Subspace Clustering

As part of the subspace clustering analysis we performed data-driven discretization. Most of the earlier experiments used user-driven discretization based on attributes such as age group and gender. Data-driven discretization provides additional insights into viewership behavior: how long most people watch a channel for, which programs are more popular, and which programs trigger the most channel switching. It discretizes the data based on the frequency of the items. The two discretization methods used for the analysis were Neighborhood Discretization and MAFIA Discretization. Neighborhood Discretization forms intervals by examining differences in frequency in the neighborhood of the examples. MAFIA Discretization is based on the MAFIA [21] subspace clustering algorithm, which uses an adaptive grid size for clustering and merges the resulting intervals to explore clusters in higher dimensions. Our experiments show that the results of the two methods are comparable. We used the Neighborhood and MAFIA discretization implementations that are part of the Subspace Clustering Extension for RapidMiner [?].

Channel Duration Results

The channel duration attribute was discretized on the user-log dataset from the BBC. Figure 4a and Figure 4b show that most viewers watch for a duration of 1-180 minutes.
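The deliverable relies on the RapidMiner implementations for this step. Purely as an illustration, a frequency-based, neighborhood-style discretization can be sketched as follows; the bin width, the jump threshold and the cut heuristic are our assumptions, not the algorithm from the Subspace Clustering Extension:

```python
from collections import Counter

def neighborhood_discretize(values, bin_width=5, jump=2.0):
    """Rough sketch of frequency-based discretization: place a cut
    wherever the frequency of neighboring bins changes sharply."""
    # Histogram of the binned values.
    bins = Counter(v // bin_width for v in values)
    ordered = sorted(bins)
    cuts = [ordered[0] * bin_width]
    for prev, cur in zip(ordered, ordered[1:]):
        f_prev, f_cur = bins[prev], bins[cur]
        ratio = max(f_prev, f_cur) / max(1, min(f_prev, f_cur))
        # A sharp change in frequency (or a gap) starts a new interval.
        if ratio >= jump or cur - prev > 1:
            cuts.append(cur * bin_width)
    cuts.append((ordered[-1] + 1) * bin_width)
    return cuts  # interval boundaries, here in minutes

# Example: channel durations in minutes, mostly short views.
durations = [2, 3, 3, 4, 5, 5, 6, 30, 31, 32, 90, 91, 170]
print(neighborhood_discretize(durations))
```

The returned boundaries correspond to the duration ranges plotted per-viewer-count in Figures 4 and 5.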
Figure 4: Discretized channel duration
(a) Count of viewers for each range of channel duration, discretized using neighborhood discretization
(b) Count of viewers for each range of channel duration, discretized using MAFIA discretization

We then filtered the dataset for all entries with a channel duration of 1-180 minutes and discretized the channel duration of these entries only. From Figure 5a and Figure 5b we see that there are more viewers actually watching programs than switching channels.

Figure 5: Discretized channel duration (between 1-180 mins)
(a) Count of viewers for each range of channel duration (between 1-180 mins), discretized using neighborhood discretization
(b) Count of viewers for each range of channel duration (between 1-180 mins), discretized using MAFIA discretization

Program Duration Results

We did further analysis to see which programs lead to channel switching. For this we filtered the dataset for all entries with a channel duration of 1-5 minutes and counted the resulting entries grouped by the program that was watched during that duration. Table 4 and Table 5 show the programs that were watched for 5 minutes or less. We can conclude that the programs with the higher counts are the ones that triggered a switch.
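The filter-and-count step just described can be sketched in a few lines. The log entries below are hypothetical; the real BBC user-log schema differs, and this only illustrates the grouping:

```python
from collections import Counter

# Hypothetical user-log entries: (program_title, channel_duration_minutes).
log = [
    ("Breakfast", 2), ("Breakfast", 4), ("BBC News", 3),
    ("Pingu", 28), ("Breakfast", 1), ("Waybuloo", 5),
]

# Keep only short views (1-5 minutes), then count views per program.
switch_counts = Counter(title for title, mins in log if 1 <= mins <= 5)

# Programs with the highest counts are the likely switch triggers.
for title, count in switch_counts.most_common():
    print(title, count)
```

Sorting the counts in descending order yields exactly the kind of ranking shown in Table 4.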
CHANNEL   PROGRAM ID  PROGRAM TITLE                        COUNT
bbcone    b01cr8wm    Breakfast                            1954
bbcthree  b01cpqbr    Sun, Sex and Suspicious Parents      501
bbcone    b01cx90j    Racing for Time                      415
bbcnews   b01cpfng    Breakfast (BBC News Channel)         332
bbcone    b01177tb    Homes Under the Hammer               332
bbcone    b01d0f6g    A Picture of Health                  271
bbctwo    b008s9l9    Storyville                           241
bbcnews   b01cpfnn    BBC News                             226
bbcnews   b01cpfnx    BBC News                             220
bbcnews   b01cpfns    BBC News                             201
bbcone    b01cwwd8    Cowboy Trap                          177
bbcone    b01d09q6    Doorstep Crime 999                   154
cbeebies  b00rq2yv    Something Special                    134
bbcnews   b01cpfnj    BBC News                             130
cbeebies  b0121qyp    Same Smile                           127
cbeebies  b017ldgw    Mister Maker Comes to Town           119
cbeebies  b01cyvmz    Mr Bloom's Nursery                   108
cbeebies  b00nk1bz    Waybuloo                             100

Table 4: Programs that have a high count of being watched for less than 5 mins

CHANNEL   PROGRAM ID  PROGRAM TITLE                        COUNT
bbctwo    b00tdypd    Chuggington: Badge Quest             5
bbctwo    b00xc5yy    Dipdap                               5
cbbc      b00sllbn    Shaun the Sheep                      5
bbctwo    b00jhhz0    Louie                                4
bbctwo    b00qzkwn    Garth and Bev                        4
bbctwo    b01cvm38    Newsround                            4
bbctwo    b00qhvyk    Alphablocks                          3
bbctwo    b00qtrrg    Zigby                                3
bbctwo    b00t5z0f    Harry and Toto's World of Opposites  3
bbctwo    b0078j82    Little Robots                        2
bbctwo    b00j4hwy    Poetry Pie                           1

Table 5: Programs that have a low count of being watched for less than 5 mins

We then analyzed one of the programs with the highest count, "Breakfast", to see how the program duration was distributed across all viewers. From Figure 6a and Figure 6b we see that most viewers switched channels during this program. The Breakfast show is a three-hour program with segments of news and weather; hence people switch channels quite often during this program, possibly during either the weather or the news segments. We then analyzed a program with a lower count, "Pingu", to see how the program duration was distributed across all viewers.
From Figure 7a and Figure 7b we see that most viewers watched the complete program or a significant portion of it, indicating that this program is more popular and does not trigger as much channel switching.

Figure 6: Discretization of the program duration for the Breakfast show (http://www.bbc.co.uk/programmes/b01cr8wm.xml)
(a) Count of viewers for each range of program duration, discretized using neighborhood discretization
(b) Count of viewers for each range of program duration, discretized using MAFIA discretization

Figure 7: Discretization of the program duration for the Pingu show (http://www.bbc.co.uk/programmes/b007809m.xml)
(a) Count of viewers for each range of program duration, discretized using neighborhood discretization
(b) Count of viewers for each range of program duration, discretized using MAFIA discretization

The above analysis indicates that subspace clustering and data-driven discretization can provide very insightful information. We have seen that longer-running programs see more switching activity. This also depends on the type of program: for example, news shows are more prone to switches than, say, movies. Further in-depth analysis of this viewership behavior is needed. Some additional results can be found on the TUDo ViSTA-TV website [14].

4.3 Data format

4.3.1 Input

The SUM engine gets its data from the DEE engine and the data warehouse. The input from the DEE engine is provided in the form of triples <time, key, value> where time > 0 and key ∈ Σ_DEE. The vocabulary Σ_DEE is a finite set of strings and comprises the dictionary of the feature names generated by the DEE engine. The SUM engine can also receive inputs from other sources such as time and event triggers.

4.3.2 Output

The SUM engine will provide its output through a STORM Spout.
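The triple format of Section 4.3.1 (also used, with a timestamp, for the output) could be modelled as a small record type. This is a minimal sketch; the field names, the sample vocabulary Σ_DEE and the validation helper are our assumptions, not the ViSTA-TV wire format:

```python
from typing import NamedTuple, Union

class Triple(NamedTuple):
    """A <time, key, value> triple as exchanged between engines.
    Field names and types are illustrative only."""
    time: int                 # event time, required to be > 0
    key: str                  # feature name from the DEE vocabulary
    value: Union[float, str]

# A hypothetical DEE vocabulary (Sigma_DEE): a finite set of feature names.
SIGMA_DEE = {"show_user_affinity", "viewers", "genre_news"}

def valid_input(t: Triple) -> bool:
    """Check the constraints stated in Section 4.3.1."""
    return t.time > 0 and t.key in SIGMA_DEE

print(valid_input(Triple(1354233600, "viewers", 1000000.0)))  # → True
```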
The output provided is shown in Table 6. It will be delivered by the Spout as serialized triples <timestamp, key, value>.

Output                   Type           Description
User Id/Show Id          String         The Id that uniquely identifies the show or user of current interest
User Groups/Show Groups  Integer Array  A list of the integers that uniquely identify the user or show groups
Timestamp                long

Table 6: SUM Engine Output

5 The ViSTA-TV Recommendation Rules

The recommendation rules engine (RERULES) will determine whether and to whom to give a recommendation. It consumes real-valued normalized features. A recommendation rule thus comprises a weight vector ⟨w_0, w_1, ..., w_N⟩ ∈ R^(N+1) and a feature vector ⟨x_1, ..., x_N⟩ ∈ [0, 1]^N, where [0, 1] is a closed real-valued interval. The CEP engine, the SUM engine and the DER engine may contribute features to the RERULES engine. Table 7 illustrates the nature of these features.

Feature                         Nature   Values  Source engine
Temporal fact                   Boolean  {0, 1}  CEP
Show-User affinity              Real     [0, 1]  SUM
Real video content feature      Real     [0, 1]  SUM
Symbolic video content feature  Boolean  {0, 1}  DER

Table 7: The features for the RERULES engine and their nature

Symbolic video content features, for example weekdays, need to be binarized by the RERULES engine. Any temporal fact F will be sent to the RERULES engine twice: first at the time it starts being valid and second at the time it becomes invalid. By treating the temporal fact F as a feature we convey the information that the feature changed its value from valid to invalid or vice versa. Computed statistics, in particular on the Show-User assertions, will be handled by the CEP engine, whereas statistics considered to be features are handled by the RERULES engine directly. Hence it is the CEP engine, not the RERULES engine, that will compute, for example, that a football match has attracted more than a million viewers over the last ten minutes.
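The rule structure introduced above, a weight vector applied to normalized features and compared against a threshold, can be sketched as a simple linear threshold rule. This is an illustrative sketch with f taken to be the identity; all names and weights are hypothetical, not the RERULES implementation:

```python
def fire_recommendation(weights, features):
    """Evaluate one recommendation rule: fire iff the weighted sum of the
    normalized features exceeds the threshold weight w_0 (here f = identity).

    weights  -- [w_0, w_1, ..., w_N]; w_0 acts as the threshold
    features -- [x_1, ..., x_N], each normalized into [0, 1]
    """
    w0, ws = weights[0], weights[1:]
    assert len(ws) == len(features), "one weight per feature"
    assert all(0.0 <= x <= 1.0 for x in features), "features are normalized"
    score = sum(w * x for w, x in zip(ws, features))
    return score > w0

# Hypothetical features: show-user affinity, a temporal fact, a genre flag.
weights = [0.9, 1.0, 0.5, 0.3]   # w_0 = 0.9 is the threshold
print(fire_recommendation(weights, [0.8, 1.0, 0.0]))  # → True
```

Because the decision is a thresholded linear combination, the weights can later be fitted by any linear-classifier learner once training data is available.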
The RERULES engine, in contrast, may define a feature "viewers" whose value may then directly influence the decision whether to fire a recommendation. The recommendations themselves will be computed by applying a function f to a linear combination of the features, f(w_1 x_1 + ... + w_N x_N), and evaluating the outcome against the threshold w_0. This allows us to implement the rules with classifiers such as decision trees, perceptrons, support vector machines, etc. Learning these rules will become possible once features from the data enrichment (WP2) and the show-user-model are available.

6 Conclusion

For Deliverable 4.4, UZH and TUDo have created the fundamental data models for all components of the ViSTA-TV recommendation engine. Recommendations will be implemented as linear classifiers and make their decisions based on feature events, temporal facts and show-user-model information. The temporal facts are computed by an extension of SPARQL which is under development; it will essentially take feature events and show-user-model data as input and match these against extended SPARQL queries. The show-user-model will provide, for a given show, the cluster of users assigned to it, and vice versa. Being the only one of the above components that is not time-critical, applying the actual model involves queries to the data warehouse about shows that will be broadcast in the future. The show-user-model will be determined by a two-way hierarchical clustering performed offline. The recommendation rules will be learned offline on previous user behavior, and the complex events will initially be hand-crafted and later learned as the feature data and the show-user-model become available. First investigations into promising features and events have been carried out and will result in actual models for the next milestone.
Acknowledgements

The research leading to these results has received funding from the European Union Seventh Framework Programme FP7/2007-2011 under grant agreement n° 296126. We would like to thank Jörg-Uwe Kietz, Libby Miller and the reviewers for their helpful comments on this text.

References

[1] Evrim Acar, Seyit Çamtepe, and Bülent Yener. Collective Sampling and Analysis of High Order Tensors for Chatroom Communications. In Sharad Mehrotra, Daniel Zeng, Hsinchun Chen, Bhavani Thuraisingham, and Fei-Yue Wang, editors, Intelligence and Security Informatics, volume 3975 of Lecture Notes in Computer Science, pages 213-224. Springer Berlin / Heidelberg, 2006.

[2] Darko Anicic, Paul Fodor, Sebastian Rudolph, and Nenad Stojanovic. EP-SPARQL: A unified language for event processing and stream reasoning. In Proceedings of the 20th International Conference on World Wide Web (WWW '11), pages 635-644, New York, NY, USA, 2011. ACM.

[3] Arvind Arasu, Shivnath Babu, and Jennifer Widom. The CQL continuous query language: Semantic foundations and query execution. The VLDB Journal, 15(2):121-142, July 2005.

[4] Arindam Banerjee, S. Basu, and S. Merugu. Multi-way clustering on relation graphs. In Proceedings of the SIAM International Conference on Data Mining (SDM). SIAM, 2007.

[5] D. F. Barbieri, D. Braga, S. Ceri, E. Della Valle, and M. Grossniklaus. C-SPARQL: SPARQL for continuous querying. In Proceedings of the 18th International Conference on World Wide Web, pages 1061-1062. ACM, 2009.

[6] Daniel Billsus and Michael J. Pazzani. Learning Collaborative Information Filters. In Proceedings of the Fifteenth International Conference on Machine Learning (ICML '98), pages 46-54, 1998.

[7] Sirish Chandrasekaran, Owen Cooper, Amol Deshpande, Michael J. Franklin, Joseph M. Hellerstein, Wei Hong, Sailesh Krishnamurthy, Sam Madden, Vijayshankar Raman, Fred Reiss, and Mehul A. Shah. TelegraphCQ: Continuous Dataflow Processing for an Uncertain World.
In Michael Stonebraker, Jim Gray, and David DeWitt, editors, CIDR 2003, First Biennial Conference on Innovative Data Systems Research, Asilomar, CA, USA, January 5-8, 2003, Online Proceedings. www.cidrdb.org, 2003.

[8] Graham Cormode and Marios Hadjieleftheriou. Methods for finding frequent items in data streams. The VLDB Journal, 19(1):3-20, December 2010.

[9] Yanlei Diao, Neil Immerman, and Daniel Gyllstrom. SASE+: An agile language for Kleene closure over event streams. Technical report, University of Massachusetts Amherst, Department of Computer Science, 2008.

[10] Oliver Flasch, Andreas Kaspari, Katharina Morik, and Michael Wurst. Aspect-based tagging for collaborative media organization. In From Web to Social Web: ..., 2007.

[11] B. C. M. Fung, Ke Wang, and Martin Ester. Hierarchical document clustering using frequent itemsets. In Proceedings of the SIAM International Conference on Data Mining 2003 (SDM 2003), 2003.

[12] Thomas Hofmann. Latent semantic models for collaborative filtering. ACM Transactions on Information Systems, 22(1):89-115, January 2004.

[13] Yehuda Koren, Robert Bell, and Chris Volinsky. Matrix factorization techniques for recommender systems. Computer, pages 42-49, 2009.

[14] Yamuna Krishnamurthy. Data driven discretization. Website, 2012. http://kirmes.cs.uni-dortmund.de/vista/data_driven_discretization.html.

[15] Danh Le-Phuoc, Minh Dao-Tran, Josiane Xavier Parreira, and Manfred Hauswirth. A Native and Adaptive Approach for Unified Processing of Linked Streams and Linked Data. In Lora Aroyo, Chris Welty, Harith Alani, Jamie Taylor, Abraham Bernstein, Lalana Kagal, Natasha Noy, and Eva Blomqvist, editors, The Semantic Web - ISWC 2011: 10th International Semantic Web Conference, Bonn, Germany, October 23-27, 2011, Proceedings, Part I, volume 7031 of Lecture Notes in Computer Science, pages 370-388. Springer Berlin / Heidelberg, 2011.

[16] Y.-R. Lin, J. Sun, and P. Castro. MetaFac: Community discovery via relational hypergraph factorization.
In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '09), pages 527-536, 2009.

[17] Prem Melville, Raymond J. Mooney, and Ramadass Nagarajan. Content-Boosted Collaborative Filtering for Improved Recommendations. In Proceedings of the Eighteenth National Conference on Artificial Intelligence (AAAI-2002), Edmonton, Canada, pages 187-192, July 2002.

[18] Ingo Mierswa, Katharina Morik, and Michael Wurst. Collaborative Use of Features in a Distributed System for the Organization of Music Collections. In Shephard Shen and Liu Cui, editors, Intelligent Music Information Systems: Tools and Methodologies, pages 147-176. Idea Group Publishing, 2007.

[19] Raymond J. Mooney and Loriene Roy. Content-Based Book Recommending Using Learning for Text Categorization. In Proceedings of the Fifth ACM Conference on Digital Libraries, San Antonio, TX, pages 195-240, June 2000.

[20] Katharina Morik, Andreas Kaspari, Michael Wurst, and Marcin Skirzynski. Multi-objective frequent termset clustering. Knowledge and Information Systems, 30(3):715-738, July 2011.

[21] Harsha Nagesh, Sanjay Goil, and Alok Choudhary. Adaptive grids for clustering massive data sets. In Proceedings of the 1st SIAM International Conference on Data Mining, 2001.

[22] Andrew I. Schein, Alexandrin Popescul, Lyle H. Ungar, and David M. Pennock. Methods and Metrics for Cold-Start Recommendations. In Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2002), 2002.