75ms - CloudConf
Transcript
Infrastructure overview
Marlon Dutra, Production Engineer, Traffic
October 2013

Physical infra

Data centers

Edge locations

Prineville, OR

Organization
[diagram: suites → clusters → services (back end, front end, etc.)]

Triplet racks

Thousands of them...

Clusters
• Just a big group of servers in a network topology
• No special software coordination
• We call logical clusters "tiers" (to avoid miscommunication)

Servers
• Very efficient servers
• Designed in house (opencompute.org)
• Vanity free, open cabinets, no paint
• No fancy boxes, manuals, CDs, etc.
• 10G network card
• Few hardware variants: CPU, memory, storage, IOPS...

opencompute.org

Logical infra

Cloud management
• We don't use virtual machines
• We don't care about servers or OSes
• We do care about services
• VMs are meant to share resources
• We want the opposite of that
• Every 1-2% matters, a lot

Cloud management [2]
• Remote hardware control
• Console, restart, power on/off, etc.
• Same base OS everywhere
• Chef for host setup
• Automatic provisioning, via PXE
• We provision thousands of servers in a few hours. All plug and play.

Cloud management [3]
• We buy fully assembled triplet racks
• Connect the rack switch to cluster switches
• Connect main and backup power
• Walk away
• In 1-2 hours, we can SSH into the hosts

Service management
• Services packaged with all dependencies
• They can run anywhere
• Everything built to scale
• Services must run on multiple machines, data centers, etc.
• Binaries deployed with BitTorrent
• No bottlenecks in the distribution

Service management [2]
• Services run with LXC (Linux containers)
• chroot for filesystem isolation
• Process namespace isolation
• Routing isolation
• Similar to FreeBSD jails

Shared pool of servers
• Utilization example
  • 250 instances of service A (not shared, multiple racks and clusters)
  • 100 instances of service B (can be shared, needs 1 CPU, 4 GB memory)
  • 700 instances of service C (can be shared, needs 2 CPU, 16 GB memory)
• The automatic scheduler takes care of the allocation
• Not everything can use a shared pool, of course (e.g. databases)

Service management [3]
• A broken server is not a big deal
• The scheduler moves the services somewhere else
• Auto remediation system for common issues
• Canary ability for services and configs

Inter-service communication
• Apache Thrift
• http://thrift.apache.org/
• Tip: always avoid XML

Storage management
• Large objects (photos, videos...)
  • BLOB store
  • Computing nodes with lots of disks
• Small objects (text, numbers...)
  • Databases (MySQL, HBase, Hive, etc.)
  • Huge cache infra between apps and DBs
• All highly distributed and replicated
• Tip: never use disk arrays for big loads

Network management
• L3 everywhere
• Each rack has a /24 (IPv4) and a /64 (IPv6)
• Rack switches talk BGP-ECMP to CSWs
• CSWs talk BGP-ECMP to big routers...
• All the routing is BGP based
• 10G fiber links to each server
• Most services are behind load balancers
• Tip: say goodbye to L2/VLANs

Traffic

Weekly cycle
[chart: egress and ingress traffic over 7 days, Monday through Sunday]

Daily cycle (global)
[chart: traffic over 24 hours, with markers at 11 AM and 3 PM, Pacific time (UTC-8)]

Daily cycle (global), mapped

Daily cycle (Brazil)
[chart: traffic over 24 hours, with markers at 1 PM and 10 PM, Brasilia time (UTC-3)]

Some numbers
• Peak HTTP/SPDY rps: ~12.5M
• Peak TCP conns: ~260M
• MAU global: 1.15 billion
• MAU Brazil: 73 million (March 2013)

Network/LB topology
[diagram: Internet → datacenter routers (DR) → cluster switches (CSW) → rack switches (RSW), all speaking BGP/ECMP; L4LBs announce IPv4 /32s and IPv6 /64s and forward via DSR/WRR to L7LBs, which front the web servers]

Proportion
[diagram: per-layer counts go from singles to tens, tens to hundreds, and thousands — roughly x10 or more at each step]

Porto Alegre <-> Forest City, NC: 75 ms
• SYN → / ← SYN+ACK: TCP conn established at 150 ms
• ACK, ClientHello → / ← ServerHello, ChangeCipherSpec ↔: SSL session established at 450 ms
• GET → / ← HTTP 1.1 200: response received at 600 ms

Edge rack
• x1 L4LB
• x2 L7LB
• x20 PHP

POA - GRU - Forest City, NC: 15 ms + 60 ms
• Sessions established at the edge: 90 ms (vs 450 ms)
• GET relayed by the edge to the data center; HTTP 1.1 200 response received at 240 ms

POA - GRU - Forest City, NC: before and after the edge
• TCP connect: 150 ms → 30 ms
• SSL session: 450 ms → 90 ms
• HTTP response: 600 ms → 240 ms
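Those timings are just round-trip counting: one round trip for the TCP handshake, two more for the full SSL handshake as drawn in the slides, and one for the HTTP request/response, with the edge terminating TCP/SSL on the short leg. Here is a minimal sketch of that arithmetic; the function name and the per-round-trip assumptions are mine, not from the deck:

```python
# Back-of-the-envelope model of the handshake timings shown in the slides.
# Assumptions (mine): 1 round trip for the TCP handshake, 2 more round trips
# for the full SSL handshake (no resumption), and 1 round trip for the GET.

def handshake_timings(edge_one_way_ms, edge_to_origin_one_way_ms=0):
    """Milliseconds until TCP connect, SSL session, and HTTP response, when
    TCP/SSL terminate edge_one_way_ms away from the client and the request
    still has to travel edge_to_origin_one_way_ms further to the origin."""
    rtt = 2 * edge_one_way_ms
    tcp = rtt                    # SYN -> / <- SYN+ACK
    ssl = tcp + 2 * rtt          # ClientHello/ServerHello + ChangeCipherSpec
    http = ssl + rtt + 2 * edge_to_origin_one_way_ms  # GET -> / <- 200
    return tcp, ssl, http

# Direct to the data center: Porto Alegre <-> Forest City, 75 ms one way
print(handshake_timings(75))      # (150, 450, 600)

# Through an edge POP in GRU: 15 ms to the edge, 60 ms edge <-> data center
print(handshake_timings(15, 60))  # (30, 90, 240)
```

The edge does not move the origin any closer; it only keeps the chatty handshakes on the 15 ms leg, which is where the 600 ms to 240 ms improvement comes from.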
Intl RTT, before and after
[chart]

Conclusion

Tips
• Never have single points of failure
• Don't protect only against equipment failure
• Human failures are the worst ones
• Make data-driven decisions
• Invest in analytics and instrumentation
• More data, better decisions. Don't fly blind.

Tips [2]
• There's no right or wrong here
• This is just the way we solve our problem today
• This will probably be different next year or so, maybe tomorrow
• Your problem might need a different solution

You can push the buttons too
http://www.facebook.com/careers

(c) 2009 Facebook, Inc. or its licensors. "Facebook" is a registered trademark of Facebook, Inc. All rights reserved.