Entscheidungen in Millisekunden!

Transcrição

Entscheidungen in Millisekunden!
Entscheidungen in Millisekunden!
Echtzeit Analysen mit InfoSphere Streams 3.0
Stephan Reimann
Big Data Specialist
[email protected]
InfoSphere Streams Represents a Paradigm Shift
D ata
in M
o tio
n
Data at
rest
Reporting and human
analysis on historical
data
Analysis of current data
to improve business
transactions
Operational
Databases
1968
Hierarchical
OLTP
1970
Relational
“System R”
Data
Warehousing
1983
DB2 v1
OLAP
Real Time Analytic
Processing (RTAP) to
improve business response
Stream
Computing
2003 “System S”
2009 InfoSphere Streams
RTAP
What does Streams do?
Relational databases and warehouses find information stored
on disk
Streams analyzes data before you store it
Databases find the needle in the haystack
– Possibly with indices, cubes, or hardware
acceleration
Streams finds the needle as it’s blowing by
Streams Analyzes All Kinds of Data
Mining in Microseconds
(included with Streams)
Acoustic
(IBM Research)
(Open Source)
***New
Text
(listen, verb),
(radio, noun)
Advanced
Mathematical
Models
Simple & Advanced Text
(included with Streams)
(IBM Research)
***New
Statistics
Predictive
∑ R( s , a )
t
(included with
Streams)
t
population
***New
Geospatial
(included with
Streams)
Image & Video
(Open Source)
(included with
Streams)
4
© 2013 IBM Corporation
Use Cases: Video Processing (Contour Detection)
Original Picture
Contour Detection
5
© 2013 IBM Corporation
IBM InfoSphere Streams v3.0
A platform for real-time analytics on BIG data
Volume
Just-in-time decisions
– Terabytes per second
– Petabytes per day
Variety
– All kinds of data
– All kinds of analytics
Velocity
Powerful
Analytics
Millions of
events per
second
Microsecond
Latency
– Insights in microseconds
Agility
– Dynamically responsive
– Rapid application development
6
Sensor, video, audio, text,
and relational data sources
© 2013 IBM Corporation
Big Data in Motion: Real world examples
Financial Services
Analyzes and
correlates 5M+
market
messages/sec to
execute algorithmic
option trades with
average latency of
30 micro-secs.
Telco
Collects and
summarizes
12M CDRs/sec for
RTAP decision
making.
Telco
500K/sec, 6B+ IPDRs
analyzed per day on
more than 4 PBs/yr.
sustaining 1GBps.
Utilities
Analyzing space
weather at 6GB/sec
and over 21.6TB/hour
to predict how plasma
clouds travel in space
and lesson effect on
power grids, +++.
Categories of Problems Solved by Streams
Applications that require on-the-fly processing, filtering and analysis
of streaming data
– Sensors: environmental, industrial, surveillance video, GPS, …
– “Data exhaust”: network/system/web server/app server log files
– High-rate transaction data: financial transactions, call detail records
Criteria: two or more of the following
– Messages are processed in isolation or in limited data windows
– Sources include non-traditional data (spatial, imagery, text, …)
– Sources vary in connection methods, data rates, and processing requirements,
presenting integration challenges
– Data rates/volumes require the resources of multiple processing nodes
– Analysis and response are needed with sub-millisecond latency
– Data rates and volumes are too great for store-and-mine approaches
8
© 2013 IBM Corporation
How Streams Works
ingestion
continuous
Continuous
ingestion
Continuous analysis
9
© 2013 IBM Corporation
How Streams Works
Continuous ingestion
Continuous analysis
Filter /
Sample
Infrastructure provides services for
scheduling analytics across hardware hosts,
establishing streaming connectivity
Annotate
Transform
Correlate
Classify
Achieve scale:
By partitioning applications into software components
By distributing across stream-connected hardware hosts
10
Where appropriate:
Elements can be fused together
for lower communication latency
© 2013 IBM Corporation
Scalable Stream Processing
Streams programming model: construct a graph
– Mathematical concept
• not a line -, bar -, or pie chart!
• Also called a network
• Familiar: for example,
a tree structure is a graph
OP
OP
OP
OP
OP
OP
stream
OP
– Consisting of operators and the streams that connect them
• The vertices (or nodes) and edges of the mathematical graph
• A directed graph: the edges have a direction (arrows)
Streams runtime model: distributed processes
– Single or multiple operators form a Processing Element (PE)
– Compiler and runtime services make it easy to deploy PEs
• On one machine
• Across multiple hosts in a cluster when scaled-up processing is required
– All links and data transport are handled by runtime services
• Automatically
• With manual placement directives where required
11
© 2013 IBM Corporation
From Operators to Running Jobs
Streams application graph:
Src
– A directed, possibly cyclic, graph
– A collection of operators
– Connected by streams
OP
Sink
OP
Src
OP
stream
Sink
Each complete application is a potentially deployable job
Jobs are deployed to a Streams runtime environment, known as a
Streams Instance (or simply, an instance)
An instance can include a single processing node (hardware)
Or multiple processing nodes
node
node
h/w node
node
node
node
node
node
Streams instance
12
© 2013 IBM Corporation
Streams Runtime Illustrated
Meters
Company
Filter
Usage
Model
Temp
Action
Usage
Contract
Text
Extract
Season
Adjust
Optimizing scheduler assigns jobs
to hosts, and continually manages
resource allocation
Daily
Adjust
Commodity hardware – laptop,
blades or high performance clusters
x86 host
13
x86 host
x86 host
x86 host
© 2013 IBM Corporation
Streams Runtime Illustrated
Optimizing scheduler assigns PEs
to hosts, and continually manages
resource allocation
Dynamically add
hosts and jobs
Commodity hardware – laptop,
blades or high performance clusters
New jobs work
with existing jobs
Meters
Company
Filter
Meters
Usage
Contract
Text
Extract
Text
Extract
x86 host
14
Usage
Model
x86 host
Temp
Action
Season
Adjust
Degree
History
Compare
History
x86 host
Daily
Adjust
Store
History
x86 host
x86 host
© 2013 IBM Corporation
InfoSphere Streams Objects: Development View
Operator
– The fundamental building block of the
Streams Processing Language
– Operators process data from streams
and may produce new streams
Streams Application
operator
stream
Stream
– An infinite sequence of structured tuples
– Can be consumed by operators on a
tuple-by-tuple basis or through the
definition of a window
Tuple
– A structured list of attributes and their
types. Each tuple on a stream has the
form dictated by its stream type
height:
640
width:
480
data:
height:
1280
width:
1024
data:
height:
640
width:
480
data:
Stream type
– Specification of the name and data type
of each attribute in the tuple
Window
– A finite, sequential group of tuples
– Based on count, time, attribute value,
or punctuation marks
15
directory: directory: directory: directory:
"/img" "/img"
"/opt" "/img"
filename: filename: filename: filename:
"farm" "bird" "java" "cat"
tuple
© 2013 IBM Corporation
Streams Core Analytical Capabilities - Examples
The Split operator is
used for dividing
incoming tuples into
separate streams for
parallel processing
The Functor operator is
used for performing tuplelevel manipulations
The Delay operator is
used to “artificially”
slowdown a stream
The Aggregate
operator is used for
grouping and
summarization of
incoming tuples
The Join operator is
used for correlating two
streams
The Punctor operator
is for inserting
punctuation marks in
streams
And more!
The Sort operator is used for
imposing an order on
incoming tuples in a stream
The Barrier operator
is used as a
synchronization point
IBM InfoSphere Streams 3.0
Comprehensive
tooling
Scale-out architecture
Sophisticated analytics
with toolkits & accelerators
Front Office 3.0
• Eclipse IDE
• Clustered runtime for nearlimitless capacity
• Web-based console
• RHEL v5.3 and above
• Drag & Drop editor
• CentOS v6.0 and above
• Instance graph
• X86 & Power multicore
hardware
• Streams visualization
• Streams debugger
17
• InfiniBand support
• Big Data, CEP, Database,
Data Explorer (Big Data),
DataStage, Finance, Geospatial,
Internet, Messaging, Mining, SPSS,
Standard, Text, TimeSeries toolkits
• Telco & Social Media accelerators
• Ethernet support
© 2013 IBM Corporation
Streams Console: Web-Based Administration
18
© 2013 IBM Corporation
Streams Studio Task Launcher
19
© 2013 IBM Corporation
Graphical Editor: Drag & Drop Editing
Quickly configure the application graph
– Property views to configure individual operators
– Round trip between graphical view and SPL code
20
© 2013 IBM Corporation
Improved Visual Application Monitoring
Enhanced Instance Graph
– Available in Streams Studio and web-based Streams Console
– Visual monitoring of application health and metrics
– Quickly identify issues using customizable views
• Job, PE, Operator and Host containment views
• Configurable, metric-based coloring schemes
Instance graph
in Studio,
showing tuple
flow rate
Application graph
in Console, for
the same job
21
© 2013 IBM Corporation
Stream Data Visualization
New in the Streams Console
Easily visualize stream data
– Dynamically add new views to running applications
Charts provided out of the box
– Line graph, bar chart, table
Views can filter and buffer data
– Minimal performance impact
– Views started on demand
22
© 2013 IBM Corporation
Streams and Warehouse: Complementary
Unit of analysis
1PB
Warehouse
Warehouse
/Hadoop
100TB
High
-Scalable processing of
huge data stores
10TB
Sweet
spot
1TB
100GB
Capability
Warehouse
Med
Streams
10GB
- scalable low-latency
processing of stream data
GB
Sweet
spot
Streams
MB
Low
Streams
KB
Latency
B
µs
ms
Low
…
sec
min
hr
Med
day
wk
mo
High
yr
Capability
What Are People Doing With Streams?
Stock market
Telephony
CDR processing
Social analysis
Churn prediction
Geomapping
Impact of weather on securities prices
Analyze market data at ultra-low latencies
Law Enforcement,
Defense & Cyber-Security
Real-time multimodal surveillance
Situational awareness
Cyber security detection
Transportation
Intelligent traffic
management
Fraud prevention
Detecting multi-party fraud
Real-time fraud prevention
Smart Grid & Energy
e-Science
Transactive control
Phasor Monitoring Unit
Space weather prediction
Detection of transient events
Synchrotron atomic research
Health & Life
Sciences
Neonatal ICU
monitoring
Epidemic early
warning system
Remote healthcare
monitoring
24
Other
Natural Systems
Wildfire management
Water management
Manufacturing
Text Analysis
Who’s Talking to Whom?
ERP for Commodities
FPGA Acceleration
© 2013 IBM Corporation
IBM InfoSphere Streams - Summary
Just-in-time decisions
Real Time Analytic Processing:
–Descriptive and predictive
analytics
–Data mining as records
arrive for immediate results
Streams allows you to:
–Capture and analyze all
your data.
–All the time.
–Just in time.
Powerful
Analytics
Millions of
events per
second
Microsecond
Latency
Sensor, video, audio, text and relational
data sources
26
© 2013 IBM Corporation