Decisions in Milliseconds!
Real-Time Analytics with InfoSphere Streams 3.0
Stephan Reimann, Big Data Specialist
[email protected]
© 2013 IBM Corporation

InfoSphere Streams Represents a Paradigm Shift
[Timeline figure contrasting data at rest with data in motion. Captions: reporting and human analysis on historical data; analysis of current data to improve business transactions. Timeline labels, in order: Operational Databases; 1968 hierarchical; OLTP; 1970 relational, “System R”; Data Warehousing; 1983 DB2 v1; OLAP; Real Time Analytic Processing (RTAP) to improve business response; Stream Computing; 2003 “System S”; 2009 InfoSphere Streams.]

What does Streams do?
Relational databases and warehouses find information stored on disk
Streams analyzes data before you store it
Databases find the needle in the haystack
– Possibly with indices, cubes, or hardware acceleration
Streams finds the needle as it’s blowing by

Streams Analyzes All Kinds of Data
[Figure: analytics types, variously labeled as included with Streams, from IBM Research, or open source, with some marked as new: mining in microseconds, acoustic, simple and advanced text, advanced mathematical models, statistics, predictive, geospatial, image and video.]

Use Cases: Video Processing (Contour Detection)
[Figure: original picture alongside the contour-detection result]

IBM InfoSphere Streams v3.0
A platform for real-time analytics on BIG data
Volume
– Terabytes per second
– Petabytes per day
Variety
– All kinds of data
– All kinds of analytics
Velocity
– Insights in microseconds
Agility
– Dynamically responsive
– Rapid application development
[Diagram: sensor, video, audio, text, and relational data sources feed powerful analytics (millions of events per second, microsecond latency) to produce just-in-time decisions]

Big Data in Motion: Real world examples
– Financial Services: analyzes and correlates 5M+ market messages/sec to execute algorithmic option trades with an average
latency of 30 microseconds.
– Telco: collects and summarizes 12M CDRs/sec for RTAP decision making.
– Telco: 500K/sec and 6B+ IPDRs analyzed per day, on more than 4 PB/yr, sustaining 1 GBps.
– Utilities: analyzes space weather at 6 GB/sec (over 21.6 TB/hour) to predict how plasma clouds travel in space and lessen their effect on power grids, and more.

Categories of Problems Solved by Streams
Applications that require on-the-fly processing, filtering and analysis of streaming data
– Sensors: environmental, industrial, surveillance video, GPS, …
– “Data exhaust”: network/system/web server/app server log files
– High-rate transaction data: financial transactions, call detail records
Criteria: two or more of the following
– Messages are processed in isolation or in limited data windows
– Sources include non-traditional data (spatial, imagery, text, …)
– Sources vary in connection methods, data rates, and processing requirements, presenting integration challenges
– Data rates/volumes require the resources of multiple processing nodes
– Analysis and response are needed with sub-millisecond latency
– Data rates and volumes are too great for store-and-mine approaches

How Streams Works
– Continuous ingestion
– Continuous analysis
[Diagram: analytic steps along the flow: filter / sample, annotate, transform, correlate, classify]
Infrastructure provides services for scheduling analytics across hardware hosts and establishing streaming connectivity
Achieve scale:
– By partitioning applications into software components
– By distributing across stream-connected hardware hosts
Where appropriate:
– Elements can be fused together for lower communication latency

Scalable Stream Processing
Streams programming model: construct a graph
– Mathematical concept
• Not a line, bar, or pie chart!
• Also called a network
• Familiar: for example, a tree structure is a graph
– Consisting of operators and the streams that connect them
• The vertices (or nodes) and edges of the mathematical graph
• A directed graph: the edges have a direction (arrows)
Streams runtime model: distributed processes
– Single or multiple operators form a Processing Element (PE)
– Compiler and runtime services make it easy to deploy PEs
• On one machine
• Across multiple hosts in a cluster when scaled-up processing is required
– All links and data transport are handled by runtime services
• Automatically
• With manual placement directives where required
[Diagram: operators (OP) connected by streams]

From Operators to Running Jobs
Streams application graph:
– A directed, possibly cyclic, graph
– A collection of operators
– Connected by streams
Each complete application is a potentially deployable job
Jobs are deployed to a Streams runtime environment, known as a Streams Instance (or simply, an instance)
An instance can include a single processing node (hardware)
Or multiple processing nodes
[Diagrams: an application graph of sources (Src), operators (OP), and sinks connected by streams; a Streams instance spanning several hardware nodes]

Streams Runtime Illustrated
– Optimizing scheduler assigns PEs to hosts, and continually manages resource allocation
– Commodity hardware: laptop, blades or high performance clusters
– Dynamically add hosts and jobs
– New jobs work with existing jobs
[Diagram: operators such as Meters, Company Filter, Usage Model, Usage Contract, Text Extract, Temp Action, Season Adjust, Daily Adjust, Degree History, Compare History, and Store History placed across x86 hosts]
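The programming and runtime model described in the preceding slides (operators as vertices, streams as directed edges, operators grouped into PEs, PEs placed on hosts) can be pictured with a small Python sketch. This is not Streams code: the operator names, the `fuse` grouping rule, and the round-robin `place` function are all assumptions made for illustration, and the real optimizing scheduler is certainly more sophisticated.

```python
from collections import defaultdict

# Hypothetical application graph; operator names are illustrative only.
# Each pair is one stream flowing from an upstream to a downstream operator.
streams = [
    ("Src", "Filter"),
    ("Filter", "Aggregate"),
    ("Filter", "Join"),
    ("Aggregate", "Join"),
    ("Join", "Sink"),
]


def build_graph(edges):
    """Return an adjacency list: operator -> list of downstream operators."""
    graph = defaultdict(list)
    for upstream, downstream in edges:
        graph[upstream].append(downstream)
        graph.setdefault(downstream, [])  # make sure sinks appear as vertices
    return dict(graph)


def fuse(operators, group_size):
    """Group operators into PEs of at most group_size operators (assumed rule)."""
    return [operators[i:i + group_size] for i in range(0, len(operators), group_size)]


def place(pes, hosts):
    """Assign PEs to hosts round-robin, a stand-in for the real scheduler."""
    placement = defaultdict(list)
    for index, pe in enumerate(pes):
        placement[hosts[index % len(hosts)]].append(pe)
    return dict(placement)


graph = build_graph(streams)
pes = fuse(list(graph), group_size=2)
placement = place(pes, ["host1", "host2"])

for host, assigned in placement.items():
    print(host, assigned)
```

Fusing more operators into one PE trades parallelism for lower communication latency, which is the trade-off the slides mention under "Where appropriate".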
InfoSphere Streams Objects: Development View
Operator
– The fundamental building block of the Streams Processing Language
– Operators process data from streams and may produce new streams
Stream
– An infinite sequence of structured tuples
– Can be consumed by operators on a tuple-by-tuple basis or through the definition of a window
Tuple
– A structured list of attributes and their types. Each tuple on a stream has the form dictated by its stream type
Stream type
– Specification of the name and data type of each attribute in the tuple
Window
– A finite, sequential group of tuples
– Based on count, time, attribute value, or punctuation marks
[Diagram: a Streams application of operators connected by streams. Example tuples carry attributes such as height, width, and data (height 640, width 480; height 1280, width 1024; height 640, width 480) and directory and filename ("/img" "farm"; "/img" "bird"; "/opt" "java"; "/img" "cat")]

Streams Core Analytical Capabilities - Examples
Examples include the following operators, and more:
– The Split operator is used for dividing incoming tuples into separate streams for parallel processing
– The Functor operator is used for performing tuple-level manipulations
– The Delay operator is used to “artificially” slow down a stream
– The Aggregate operator is used for grouping and summarization of incoming tuples
– The Join operator is used for correlating two streams
– The Punctor operator is used for inserting punctuation marks in streams
– The Sort operator is used for imposing an order on incoming tuples in a stream
– The Barrier operator is used as a synchronization point

IBM InfoSphere Streams 3.0
Comprehensive tooling
• Eclipse IDE
• Web-based console
• Drag & Drop editor
• Instance graph
• Streams visualization
• Streams debugger
Scale-out architecture
• Clustered runtime for near-limitless capacity
• RHEL v5.3 and above
• CentOS v6.0 and above
• x86 & Power multicore hardware
• InfiniBand support
• Ethernet support
Sophisticated analytics with toolkits & accelerators
• Big Data, CEP, Database, Data Explorer (Big Data), DataStage, Finance, Geospatial, Internet, Messaging, Mining, SPSS, Standard, Text, TimeSeries toolkits
• Telco & Social Media accelerators

Streams Console: Web-Based Administration
[Screenshot]

Streams Studio Task Launcher
[Screenshot]

Graphical Editor: Drag & Drop Editing
Quickly configure the application graph
– Property views to configure individual operators
– Round trip between graphical view and SPL code

Improved Visual Application Monitoring
Enhanced Instance Graph
– Available in Streams Studio and web-based Streams Console
– Visual monitoring of application health and metrics
– Quickly identify issues using customizable views
• Job, PE, Operator and Host containment views
• Configurable, metric-based coloring schemes
[Screenshots: instance graph in Studio, showing tuple flow rate; application graph in Console, for the same job]

Stream Data Visualization
New in the Streams Console
Easily visualize stream data
– Dynamically add new views to running applications
Charts provided out of the box
– Line graph, bar chart, table
Views can filter and buffer data
– Minimal performance impact
– Views started on demand

Streams and Warehouse: Complementary
[Chart: capability by unit of analysis and latency]
– Warehouse / Hadoop: scalable processing of huge data stores
– Streams: scalable low-latency processing of stream data
[Chart axes: unit of analysis from bytes to 1 PB; latency from microseconds to years; capability low, medium, high. A sweet spot is marked for each system.]

What Are People Doing With Streams?
– Stock market
• Impact of weather on securities prices
• Analyze market data at ultra-low latencies
– Telephony
• CDR processing
• Social analysis
• Churn prediction
• Geomapping
– Law Enforcement, Defense & Cyber-Security
• Real-time multimodal surveillance
• Situational awareness
• Cyber security detection
– Transportation
• Intelligent traffic management
– Fraud prevention
• Detecting multi-party fraud
• Real-time fraud prevention
– Smart Grid & Energy
• Transactive control
• Phasor Monitoring Unit
– e-Science
• Space weather prediction
• Detection of transient events
• Synchrotron atomic research
– Health & Life Sciences
• Neonatal ICU monitoring
• Epidemic early warning system
• Remote healthcare monitoring
– Natural Systems
• Wildfire management
• Water management
– Other
• Manufacturing
• Text Analysis
• Who’s Talking to Whom?
• ERP for Commodities
• FPGA Acceleration

IBM InfoSphere Streams - Summary
Real Time Analytic Processing:
– Descriptive and predictive analytics
– Data mining as records arrive for immediate results
Streams allows you to:
– Capture and analyze all your data.
– All the time.
– Just in time.
[Diagram: sensor, video, audio, text and relational data sources feed powerful analytics (millions of events per second, microsecond latency) to produce just-in-time decisions]
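To make the deck's vocabulary of streams, tuples, stream types, windows, and operators concrete, here is a minimal Python sketch. It is an analogy, not SPL: the `Reading` type, the `functor` and `aggregate` functions, and the sample values are hypothetical, and the finite sample stands in for what would be an infinite stream. `functor` mimics tuple-level manipulation, and `aggregate` summarizes tuples over a count-based window that is emptied after each result.

```python
from typing import NamedTuple


class Reading(NamedTuple):
    """Hypothetical stream type: name and data type of each tuple attribute."""
    meter: str
    usage: float


def functor(stream, keep, transform):
    """Tuple-by-tuple processing: drop tuples failing keep, transform the rest."""
    for tup in stream:
        if keep(tup):
            yield transform(tup)


def aggregate(stream, count):
    """Count-based window: emit the mean usage of every `count` tuples."""
    window = []
    for tup in stream:
        window.append(tup)
        if len(window) == count:
            yield sum(t.usage for t in window) / count
            window = []  # start a fresh window; partial windows are never emitted


# A finite sample standing in for an unbounded stream of meter readings.
source = (Reading("m1", u) for u in [1.0, -5.0, 2.0, 3.0, 4.0, 6.0])

# Discard negative readings, then double each remaining usage value.
cleaned = functor(
    source,
    keep=lambda t: t.usage >= 0,
    transform=lambda t: t._replace(usage=t.usage * 2),
)

averages = list(aggregate(cleaned, count=2))
print(averages)
```

Because both stages are generators, each tuple is processed as it arrives and nothing is stored beyond the current window, which mirrors the "analyze data before you store it" idea from the earlier slides.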