75ms - CloudConf
1
Infrastructure overview
Marlon Dutra
Production Engineer, Traffic
October, 2013
2
Physical infra
3
Data centers
4
Edge locations
5
Prineville, OR
6
Organization
[Diagram: suites -> clusters -> services (back end, front end, etc.)]
7
Triplet racks
8
Thousands of them...
9
Clusters
•Just a big group of servers
in a network topology
•No special software coordination
•We call “logical clusters” “tiers” (to avoid miscommunication)
10
Servers
•Very efficient servers
•Designed in house (opencompute.org)
•Vanity free, open cabinets, no paint
•No fancy boxes, manuals, CDs, etc
•10G network card
•Few hardware variants (CPU, memory, storage, IOPS...)
11
opencompute.org
12
Logical infra
13
Cloud management
•We don’t use virtual machines
•We don’t care about servers or OSes
•We do care about services
•VMs are meant to share resources
•We want the opposite of that
•Every 1-2% matters, a lot
14
Cloud management [2]
•Remote hardware control
•Console, restart, power on/off, etc
•Same base OS everywhere
•Chef for host setup
•Automatic provisioning, via PXE
•We provision thousands of servers in
a few hours. All plug and play.
15
Cloud management [3]
•We buy fully assembled triplet racks
•Connect the rack switch to cluster switches
•Connect main and backup power
•Walk away
•In 1-2 hours, we can SSH into the hosts
16
Service management
•Services packaged with all dependencies
•They can run anywhere
•Everything built to scale
•Services must run on multiple machines,
in multiple data centers, etc
•Binaries deployed with BitTorrent
•No bottlenecks in the distribution
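A back-of-envelope model of why peer-to-peer distribution removes the bottleneck. The binary size, NIC speed, and host count below are invented for illustration; this is not the deck's data, just a sketch of the scaling argument:

import math

# Toy model (numbers invented): pushing one binary to N hosts from a
# single source is limited by that source's uplink, while a
# BitTorrent-style swarm roughly doubles the set of seeded hosts each
# round, finishing in about log2(N) copy times.
size_gb = 1.5        # binary plus bundled dependencies
nic_gbps = 10        # matches the 10G NICs mentioned earlier
hosts = 10_000

copy_seconds = size_gb * 8 / nic_gbps      # time for one full copy
single_source = hosts * copy_seconds       # serialized on one uplink
swarm = math.ceil(math.log2(hosts)) * copy_seconds

print(f"single source: ~{single_source / 3600:.1f} h")
print(f"swarm:         ~{swarm:.0f} s")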
17
Service management [2]
• Services run with LXC (Linux containers)
• chroot for filesystem isolation
• Process namespace isolation
• Routing isolation
• Similar to FreeBSD jails
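To make the isolation layers above concrete, here is a minimal sketch of the same building blocks (chroot plus namespaces) using only the Python standard library. It assumes Linux, Python 3.12+, root privileges, and a hypothetical unpacked service rootfs at /srv/packages/myservice; it is not how the real container runtime works, just the underlying primitives:

import os

ROOTFS = "/srv/packages/myservice"   # hypothetical self-contained package

# Give the next child its own PID and network namespaces (requires root).
os.unshare(os.CLONE_NEWPID | os.CLONE_NEWNET)

pid = os.fork()
if pid == 0:
    os.chroot(ROOTFS)                # filesystem isolation
    os.chdir("/")
    # Hypothetical entry point shipped inside the package.
    os.execv("/bin/run-service", ["/bin/run-service"])
else:
    os.waitpid(pid, 0)               # parent waits for the "container"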
18
Shared pool of servers
• Utilization example
• 250 instances of service A (not shared, multiple racks and clusters)
• 100 instances of service B (can be shared, needs 1 CPU, 4 GB memory)
• 700 instances of service C (can be shared, needs 2 CPUs, 16 GB memory)
• The automatic scheduler takes care of the allocation
• Not everything can use a shared pool, of course (e.g. databases)
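A toy first-fit sketch of the allocation idea above. The host shape (16 CPUs, 144 GB) and the placement policy are invented; the real scheduler also has to spread instances across racks, clusters, and data centers:

from dataclasses import dataclass

@dataclass
class Host:
    name: str
    free_cpu: int = 16      # invented host shape
    free_mem: int = 144     # GB

def schedule(instances, hosts):
    """First-fit placement of (service, cpu, mem_gb) instance requests."""
    placement = {}
    for n, (service, cpu, mem) in enumerate(instances):
        for host in hosts:
            if host.free_cpu >= cpu and host.free_mem >= mem:
                host.free_cpu -= cpu
                host.free_mem -= mem
                placement[f"{service}/{n}"] = host.name
                break
        else:
            raise RuntimeError(f"no capacity left for {service}/{n}")
    return placement

pool = [Host(f"host{i:03d}") for i in range(150)]
demand = [("B", 1, 4)] * 100 + [("C", 2, 16)] * 700
print(len(schedule(demand, pool)), "instances placed")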
19
Service management [3]
• A broken server is not a big deal
• The scheduler moves the services
somewhere else
• Auto remediation system for common issues
• Canary ability for services and configs
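A hedged sketch of what a canary gate can look like. The host names, the metric, and the threshold are all invented; the real canary and auto-remediation systems are obviously far richer:

import random

def error_rate(hosts):
    # Stand-in for a query to the monitoring system.
    return sum(random.uniform(0.0, 0.01) for _ in hosts) / len(hosts)

def canary_ok(canary_hosts, baseline_hosts, max_ratio=1.25):
    """Allow the rollout only if the canary slice is not clearly worse."""
    return error_rate(canary_hosts) <= max_ratio * error_rate(baseline_hosts)

tier = [f"web{i:04d}" for i in range(1000)]
canary, baseline = tier[:20], tier[20:]
if canary_ok(canary, baseline):
    print("canary looks healthy, continue the rollout")
else:
    print("canary regressed, roll back")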
20
Inter-service communication
• Apache Thrift
• http://thrift.apache.org/
Tip: always avoid XML
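A minimal Python client sketch for Apache Thrift. The profile_service module, the ProfileService.getName method, and the host/port are hypothetical stand-ins for whatever the Thrift compiler generates from your own IDL:

from thrift.transport import TSocket, TTransport
from thrift.protocol import TBinaryProtocol

# Hypothetical module generated by the Thrift compiler from an IDL file.
from profile_service import ProfileService

socket = TSocket.TSocket("profile-tier.example.com", 9090)
transport = TTransport.TFramedTransport(socket)
protocol = TBinaryProtocol.TBinaryProtocol(transport)
client = ProfileService.Client(protocol)

transport.open()
print(client.getName(1234))   # compact binary RPC instead of XML payloads
transport.close()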
21
Storage management
• Large objects (photos, videos...)
• BLOB store
• Computing nodes with lots of disks
• Small objects (text, numbers...)
• Databases (MySQL, HBase, Hive, etc)
• Huge cache infra between apps and dbs
• All highly distributed and replicated
• Tip: never use disk arrays for big loads
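A tiny sketch of the look-aside caching idea above (a cache tier sitting between the applications and the databases). The in-memory dicts stand in for a memcached-style tier and for the replicated databases; they are not the real systems:

CACHE = {}   # stand-in for the distributed cache tier
DB = {}      # stand-in for the replicated databases

def read(key):
    value = CACHE.get(key)
    if value is None:          # cache miss: fall through to the database
        value = DB.get(key)
        CACHE[key] = value     # populate the cache for later readers
    return value

def write(key, value):
    DB[key] = value            # update the authoritative copy...
    CACHE.pop(key, None)       # ...and invalidate the stale cached one

write("user:42:name", "Marlon")
print(read("user:42:name"))    # first read misses, later reads hit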
22
Network management
• L3 everywhere
• Each rack has a /24 (IPv4) and a /64 (IPv6)
• Rack switches talk BGP-ECMP to CSWs
• CSWs talk BGP-ECMP to big routers...
• All the routing is BGP based
• 10G fiber links to each server
• Most services are behind load balancers
• Tip: say goodbye to L2/VLANs
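A small sketch of the per-rack addressing scheme with Python's ipaddress module. The parent prefixes are RFC 1918 / documentation examples, not real allocations:

import ipaddress

cluster_v4 = ipaddress.ip_network("10.16.0.0/16")      # example pool
cluster_v6 = ipaddress.ip_network("2001:db8:1::/48")   # example pool

racks_v4 = cluster_v4.subnets(new_prefix=24)   # one /24 per rack
racks_v6 = cluster_v6.subnets(new_prefix=64)   # one /64 per rack

for rack, (v4, v6) in enumerate(zip(racks_v4, racks_v6)):
    if rack == 3:
        break
    print(f"rack {rack:02d}: {v4}  {v6}")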
23
Traffic
24
Weekly cycle
[Chart: egress and ingress traffic over a 7-day week, Monday through Sunday]
25
Daily cycle (global)
[Chart: global traffic over 24 hours; markers at 11 AM and 3 PM, Pacific time (UTC-8)]
26
Daily cycle (global), mapped
27
Daily cycle (Brazil)
[Chart: Brazil traffic over 24 hours; markers at 1 PM and 10 PM, Brasilia time (UTC-3)]
28
Some numbers
•Peak HTTP/SPDY rps: ~12.5M
•Peak TCP conns: ~260M
•MAU Global: 1.15 billion
•MAU Brazil: 73 million (March 2013)
29
Network/LB topology
[Diagram: Internet -> datacenter routers (DR) -> cluster switches (CSW) -> rack switches (RSW), BGP/ECMP throughout; the L4LB tier (IPv4 /32s, IPv6 /64s) forwards to the L7LB tier via DSR/WRR, which fronts the web servers]
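A hedged sketch of the WRR half of the "DSR/WRR" label in the diagram: weighted round-robin spreading of new connections across L7LBs. The backend names and weights are invented, and DSR (direct server return) is a packet-level forwarding technique that this user-space sketch does not model:

import itertools
import random

l7lbs = {"l7lb-01": 3, "l7lb-02": 3, "l7lb-03": 1}   # weight ~ capacity

def weighted_round_robin(backends):
    """Yield backends in a repeating order proportional to their weights."""
    expanded = [name for name, weight in backends.items()
                for _ in range(weight)]
    random.shuffle(expanded)        # avoid sending bursts to one backend
    return itertools.cycle(expanded)

pick = weighted_round_robin(l7lbs)
for _ in range(7):
    print(next(pick))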
30
Proportion
[Diagram: singles to tens, tens to hundreds, thousands]
31
cont.
[Diagram: x10 or more]
32
Porto Alegre <-> Forest City, NC
75ms
[Sequence diagram:
SYN ->, <- SYN+ACK, ACK ->: TCP conn established at 150 ms
ClientHello ->, <- ServerHello, ChangeCipherSpec exchanged: SSL session established at 450 ms
GET ->, <- HTTP 1.1 200: response received at 600 ms]
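The numbers in the diagram fall out of simple round-trip arithmetic with a 75 ms one-way delay: one RTT for the TCP handshake, two more for a full (non-resumed) SSL handshake, and one more for the GET and its response. A quick check:

one_way_ms = 75
rtt = 2 * one_way_ms                          # Porto Alegre <-> Forest City

tcp_established = rtt                         # SYN, SYN+ACK, ACK
ssl_established = tcp_established + 2 * rtt   # hellos + ChangeCipherSpec
response_received = ssl_established + rtt     # GET -> HTTP 1.1 200

print(tcp_established, ssl_established, response_received)   # 150 450 600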
33
Edge rack
[Diagram: x1 L4LB, x2 L7LB, x20 PHP servers]
34
POA - GRU - Forest City, NC
15ms (POA-GRU) / 60ms (GRU-Forest City)
[Sequence diagram:
Sessions established (against the GRU edge): 90 ms (vs 450 ms)
GET from the client, request received at the edge, GET forwarded to Forest City
HTTP 1.1 200 back to the edge and on to the client: response received at 240 ms]
35
POA - GRU - Forest City, NC
15ms (POA-GRU) / 60ms (GRU-Forest City)
TCP Connect: 150ms -> 30ms
SSL Session: 450ms -> 90ms
HTTP Response: 600ms -> 240ms
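The improved numbers follow from the same arithmetic once TCP and SSL terminate at the GRU edge 15 ms away, with the GET presumably riding a connection the edge already holds open to the Forest City data center. A quick check:

client_edge_rtt = 2 * 15     # POA <-> GRU
edge_dc_rtt = 2 * 60         # GRU <-> Forest City, NC

tcp = client_edge_rtt                           # 30 ms
ssl = tcp + 2 * client_edge_rtt                 # 90 ms
response = ssl + client_edge_rtt + edge_dc_rtt  # 240 ms

print(tcp, ssl, response)    # vs 150, 450, 600 ms going direct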
36
Intl RTT, before and after
37
Conclusion
38
Tips
• Never have single points of failure
• Don't protect only against equipment failures
• Human failures are the worst ones
• Make data-driven decisions
• Invest in analytics and instrumentation
• More data, better decisions. Don't fly blind.
39
Tips [2]
• There's no right or wrong here
• This is just the way we solve our problem today
• This will probably be different next year or so, maybe
tomorrow
• Your problem might need a different solution
40
You can push the buttons too
http://www.facebook.com/careers
41
42
(c) 2009 Facebook, Inc. or its licensors. "Facebook" is a registered trademark of Facebook, Inc. All rights reserved. 1.0
43
