Scaling Tokopedia Past, Present, Future


Once Upon a Time in Jakarta, Jan 2009

1 Product Guy and 1 Half Engineer as co-founders

No experience managing a high-traffic website

No business background AT ALL

Perl as back end; we built our own Perl framework

Apache mod_perl

Oracle Express Edition

Hm… looks like we need a better front end designer

AWStats and a little bit of Google Analytics

Network Topology (diagram): CBN: Apache server, Oracle server.

Apps Topology (diagram): Internet <-> Apache server (HTTP request / HTTP response) <-> Oracle server (SQL).

2 co-founders, 1 real engineer, 1 customer care

Hooray, WE LAUNCHED!!

IDR 33 million of GMV in the first month

WE ARE SLOW!!!

WHY??

* We didn’t have dedicated storage

* Uploaded pictures were stored on the same machine

* Web pages and static content were served by a single Apache

* We didn’t use a CDN

* We didn’t even know what a CDN was

Network Topology (diagram): CBN: Apache app server, Apache static server, Oracle server.

Apps Topology (diagram):

* Access web page: Internet <-> Apache app server (HTTP request / HTTP response) <-> Oracle server (SQL)

* Upload pictures: Internet <-> Apache upload / static server (HTTP upload / HTTP response) <-> Oracle server (SQL)

* Read static pictures, CSS + JS: Internet <-> Apache upload / static server (HTTP request / HTTP response)
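The split above can be sketched as a small routing rule. This is only an illustration: the host names, the `/upload` path prefix and the list of file suffixes are assumptions, since the slides name only the two server roles.

```python
from urllib.parse import urlparse

# Hypothetical host names; the slides only name the two server roles.
APP_SERVER = "app.internal"
STATIC_SERVER = "static.internal"

# Assumed set of static file types (pictures, CSS, JS).
STATIC_SUFFIXES = (".css", ".js", ".jpg", ".jpeg", ".png", ".gif")


def pick_server(method: str, url: str) -> str:
    """Return the server role that should handle this request."""
    path = urlparse(url).path.lower()
    # Picture uploads go to the upload / static server (assumed /upload prefix).
    if method == "POST" and path.startswith("/upload"):
        return STATIC_SERVER
    # Reads of pictures, CSS and JS also go to the static server.
    if method == "GET" and path.endswith(STATIC_SUFFIXES):
        return STATIC_SERVER
    # Everything else is a web page handled by the app server.
    return APP_SERVER
```

With this rule, a page request stays on the app server, while picture uploads and static reads no longer compete with it for the same Apache.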

We are back in business

BUT WE ARE SLOW AGAIN!!!

WHY??

* Oracle Express Edition reached its limit

* No partitioning

* No replication

* Poor indexing

* Reads, writes and queries all hit the same master DB

SO WE MIGRATED TO PostgreSQL

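Network Topology (diagram): CBN: Apache app server, Apache static server, PostgreSQL master, PostgreSQL slave.

Apps Topology (diagram):

* Internet <-> Apache app server (HTTP request / HTTP response)

* Apache app server -> PostgreSQL master (SQL insert, SQL update, SQL delete)

* Apache app server -> PostgreSQL slave (SQL query)

* PostgreSQL master -> PostgreSQL slave (WAL streaming replication)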
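A minimal sketch of the read/write split in the diagram: writes go to the master, queries go to the slave. The function name `route_sql` and the "first keyword" rule are assumptions for illustration; the slides do not show how the application chose a connection.

```python
# Statements that change data must go to the master.
WRITE_VERBS = ("insert", "update", "delete")


def route_sql(statement: str) -> str:
    """Return "master" for writes and "slave" for everything else."""
    words = statement.strip().split(None, 1)
    verb = words[0].lower() if words else ""
    return "master" if verb in WRITE_VERBS else "slave"
```

Streaming replication in PostgreSQL is asynchronous by default, so a query sent to the slave right after a write may briefly see old data. A real router would need a rule for reads that must be fresh.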

We did it again!!!!

DAMN SEARCH IS SLOW!!!

WHY??

* We have a lot of new products every second

* We have to show search results in real time

* But every second the sorting keeps changing

* PostgreSQL load is just too much!!!

And Many More……..

SEARCH IS EASY !!!!

Come on, Man… SLOW AGAIN??

WHY??

* We were using Apache + mod_perl

* Apache consumed a lot of resources

* Our code had a lot of memory leaks

SOLUTION

* We found out that NginX is very light and fast

* We use NginX as the load balancer

* We replaced Apache mod_perl with nginx-perl

* We have 1 NginX load balancer with several nginx-perl servers

* For the load balancing method, we mix round robin and clustering
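The round-robin half of that load balancing method can be sketched as below. This is an illustration of the idea only, not the NginX implementation; the class name and backend names are hypothetical, and the "clustering" part is not modeled because the slides do not describe it.

```python
from itertools import cycle


class RoundRobinBalancer:
    """Hand out backends one after another, wrapping around at the end."""

    def __init__(self, backends):
        backends = list(backends)
        if not backends:
            raise ValueError("at least one backend is required")
        self._backends = cycle(backends)

    def next_backend(self):
        return next(self._backends)


# Hypothetical backend names standing in for the nginx-perl servers.
balancer = RoundRobinBalancer(["nginx-perl-1", "nginx-perl-2", "nginx-perl-3"])
picks = [balancer.next_backend() for _ in range(5)]
# picks == ["nginx-perl-1", "nginx-perl-2", "nginx-perl-3",
#           "nginx-perl-1", "nginx-perl-2"]
```

Each server gets the same share of requests, which fits a pool of identical nginx-perl servers.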

siege -c100 -t5s -i -b -q 'http://www.tokopedia.com/ebenhaezer'
siege: invalid option -- 'q'
siege: invalid option -- 'q'
** SIEGE 2.72
** Preparing 100 concurrent users for battle.
The server is now under siege...
Lifting the server siege... done.

Transactions: 14788 hits
Availability: 100.00 %
Elapsed time: 4.59 secs
Data transferred: 63.50 MB
Response time: 0.03 secs
Transaction rate: 3221.79 trans/sec
Throughput: 13.83 MB/sec
Concurrency: 87.52
Successful transactions: 7481
Failed transactions: 0
Longest transaction: 0.43
Shortest transaction: 0.00
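As a quick check, the derived figures in the report can be recomputed from the raw ones. The sketch uses only numbers printed above and takes the "Transactions" count (14788 hits) as the numerator. It also assumes that siege reports concurrency as total transaction time divided by elapsed time, so the average response time is concurrency divided by the transaction rate.

```python
hits = 14788          # "Transactions" line
elapsed_s = 4.59      # "Elapsed time" line
data_mb = 63.50       # "Data transferred" line
concurrency = 87.52   # "Concurrency" line

transaction_rate = hits / elapsed_s   # transactions per second
throughput = data_mb / elapsed_s      # MB per second
# Assumed relation: concurrency = transaction rate * average response time.
avg_response_s = concurrency / transaction_rate

print(round(transaction_rate, 2))  # 3221.79
print(round(throughput, 2))        # 13.83
print(round(avg_response_s, 2))    # 0.03
```

All three values match the report.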

Apps Topology (diagram):

* Internet <-> NginX load balancer (HTTP request / HTTP response)

* NginX load balancer -> nginx-perl #1, #2, #3 … #n (proxy_pass)

* nginx-perl servers -> PostgreSQL master (SQL insert, SQL update, SQL delete)

* nginx-perl servers -> PostgreSQL slave (SQL query)

* PostgreSQL master -> PostgreSQL slave (WAL streaming replication)

* SOLR (import; SOLR query)

Now what….Storage??

WHY??

* Hardware limitations

* We used SATA HDDs, not SSDs

* Disk utilization at 100%

* No backup, no failover

* Capacity is critical

* Users keep uploading pictures


We also use CDN

AFTER ALL, WE ARE STILL SLOW!!!

SOLUTION (diagram):

* Internet <-> NginX load balancer (HTTP request / HTTP response)

* NginX load balancer -> nginx-perl #1, #2, #3 … #n (proxy_pass)

* Data stores (query & update): PostgreSQL, MongoDB, SOLR, Redis

* PostgreSQL master -> PostgreSQL slave (replication)

* MongoDB primary -> MongoDB secondary (replication)

* 3rd party APIs over the Internet, such as logistics, banks, payment gateways, etc.

Lessons Learned??

We got to know NginX, NoSQL, in-memory storage, GlusterFS storage, scaling out (not scaling up), and many more…

Thanks to our Awesome Engineers

and many more…

We are back in business

BUT …………..

For the first time in our life we were doomed!!!

WHY??

* One of our GlusterFS servers broke. Image reads and writes were super slow.

* We were using a version of PostgreSQL that had some indexing bugs.

More Awesome Engineers

Mixed with an International Team

Current State

New VP of Engineering

FUTURE

* Mobile First Company

* Zero Downtime

* Fully on the Cloud

* Re-architect to SOA

* Open API to the Public

* Deploy New Tech, such as replacing Perl with Go

* Advanced Alerting & Monitoring

* Redundancy and Failover

* Multiple 3rd parties

* Data warehouse, such as Cubes, Pentaho, etc.

* Machine Learning, Business Intelligence

* Build things that can be shared with others

* Really pay attention to security

* and many more……

What if the problems come from the ISP?

Unsolved Issues

FACTS

* Users cannot access Tokopedia

* Pictures are not showing

* CSS and JS are not loaded

* Sometimes it just shows a blank page

* Some ISPs do ad injection

* ALL WITHOUT REASON

WHY?? WE DON’T KNOW, BUT SOMETHING HAPPENS ON THE ISP SIDE

What we’ve done

Works well:

* Using the NginX Geo module

* All HTTPS since Q4 2014

* Trying CDN load balancing

Doesn’t work at all:

* Talking to the ISPs

* “Fight” in idEA

Don’t think “someone else will join and take care of this” — Mike Krieger of Instagram

Whether you think you can, or you think you can’t, you’re right — Henry Ford

THANK YOU. ANY QUESTIONS?