The ins and outs of caching

by | 26.11.2020 | Engineering

Whether it’s watching your favourite movie or following a yummy recipe, we don’t like things to buffer or load slowly. With ever-decreasing attention spans and ever-increasing content diversity (people from all around the world are uploading interesting content), fast delivery is becoming an extremely challenging problem. This article dives into the pros and cons of caching your content and how you can avoid costly traps along the way.

Why Is Content Diversity a Problem?

For the sake of this article, suppose you have a video-sharing platform like YouTube, but smaller. Let’s call the platform smalltube. You’re using your savings to run the company, so you have just four servers, at the following geographical locations:

  1. Germany
  2. United States
  3. India
  4. Brazil

One of your customers in Germany wants to explore a new Indian dish because his favourite television show recommended it. He found a content creator on smalltube who makes delicious Indian curry. As you know, transferring data takes time and resources (i.e. money), and your customers want to see the video as fast as possible. You and your company have to choose between two options:

  1. Fast Delivery – To deliver the content fast, you replicate it to servers near your customers before it’s even requested. The user then fetches the content from a nearby server rather than from a server on another continent. But how do you decide which servers will be near the people who request it? You can’t know in advance, so you have to replicate the data to all of your servers.
  2. Economical Delivery – In this approach you wait until the data is requested and only then replicate it to the customer’s nearest server, i.e. the server in Germany. As more people see the television show and decide to learn about the recipe, you serve them the replicated copy. The drawback is that economical delivery is slow for the first customer.
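
The trade-off between the two options can be sketched in a few lines of Python. This is a toy model, not a real CDN: the `Origin` and `EdgeServer` classes, their methods and the transfer counter are hypothetical names introduced for illustration, and the origin is assumed to be the server where the curry video was uploaded.

```python
class Origin:
    """Hypothetical main server that holds every uploaded video."""

    def __init__(self):
        self.videos = {}
        self.transfers = 0  # number of cross-region copies made

    def upload(self, video_id, data):
        self.videos[video_id] = data

    def send(self, video_id):
        self.transfers += 1
        return self.videos[video_id]


class EdgeServer:
    """Hypothetical regional server close to the customers."""

    def __init__(self, region, origin):
        self.region = region
        self.origin = origin
        self.store = {}

    def request(self, video_id):
        # Economical delivery: copy on the first request only.
        if video_id not in self.store:
            self.store[video_id] = self.origin.send(video_id)
        return self.store[video_id]


def replicate_everywhere(origin, edges, video_id):
    # Fast delivery: push a copy to every region before anyone asks.
    for edge in edges:
        edge.store[video_id] = origin.send(video_id)


regions = ["Germany", "United States", "Brazil"]

# Fast delivery: three transfers up front, even if only Germany watches.
fast_origin = Origin()
fast_origin.upload("curry", b"video-bytes")
fast_edges = [EdgeServer(r, fast_origin) for r in regions]
replicate_everywhere(fast_origin, fast_edges, "curry")
fast_edges[0].request("curry")
fast_transfers = fast_origin.transfers

# Economical delivery: one transfer, triggered by the first German viewer.
lazy_origin = Origin()
lazy_origin.upload("curry", b"video-bytes")
lazy_edges = [EdgeServer(r, lazy_origin) for r in regions]
lazy_edges[0].request("curry")
lazy_edges[0].request("curry")  # second viewer is served locally
lazy_transfers = lazy_origin.transfers

print(fast_transfers, lazy_transfers)
```

In this sketch the eager strategy pays for a copy in every region, while the lazy strategy pays only where a viewer actually shows up, at the price of a slow first request.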

Why is Fast Delivery not a good option?

Fast delivery is heavy on cost. You shouldn’t replicate all the content to all the servers you own because that is expensive in storage and bandwidth, which translates to money, i.e. your savings. If you were to scale smalltube to the size of YouTube, where 500 hours of video content are uploaded every minute, and copy all of it to every server, you would face a few challenges. Firstly, you would need a storage supplier close to your data centres, and even then you might fail to keep up with your growing storage requirements over time. Secondly, the huge transfers would saturate your data connection even if you had speeds ten times better than those of a good data centre today. The worst part is that you don’t even know whether the data will ever be requested by customers living near the other data centres, so much of that storage and bandwidth would be wasted.
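
To get a feel for the numbers, here is a back-of-the-envelope calculation. The 500 hours per minute figure comes from the paragraph above. The size of 1 GB per hour of video is purely an assumption for illustration, since real sizes vary widely with resolution and encoding.

```python
HOURS_UPLOADED_PER_MINUTE = 500   # figure quoted in the article
MINUTES_PER_DAY = 24 * 60
GB_PER_VIDEO_HOUR = 1             # assumption, for illustration only
SERVERS = 4                       # Germany, United States, India, Brazil

hours_per_day = HOURS_UPLOADED_PER_MINUTE * MINUTES_PER_DAY
new_data_gb_per_day = hours_per_day * GB_PER_VIDEO_HOUR

# Every upload lands on one server and is copied to the other three.
storage_gb_per_day = new_data_gb_per_day * SERVERS
transfer_gb_per_day = new_data_gb_per_day * (SERVERS - 1)

print(f"video uploaded per day: {hours_per_day:,} hours")
print(f"storage added per day:  {storage_gb_per_day / 1000:,.0f} TB")
print(f"copied between servers: {transfer_gb_per_day / 1000:,.0f} TB")
```

Even with this modest size assumption, full replication across four servers adds thousands of terabytes of storage every day, most of it for videos that may never be watched in a given region.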

You can see that the economical approach is the better one, and in practice it is the approach that large content platforms generally follow.

What is caching?

When I said we replicate data, that was not completely true. We cache data on a server. Caching is a way to store immutable data (data which doesn’t change quickly, like an uploaded video) on servers, together with a specific TTL (time to live) and a hash, so that it can be accessed faster.

For simplicity, remember that every file results in a different hash. If you change the content slightly, the hash changes completely. The hash acts as a safety net and helps detect forgery when sharing data. The TTL tells the server how long it should keep the file before discarding it.
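
A minimal sketch of both ideas, using SHA-256 from Python’s standard `hashlib` module as one possible hash function. The cache entry layout, the `is_fresh` helper and the one-hour TTL are assumptions made for illustration, not a description of any particular cache server.

```python
import hashlib
import time

original = b"Indian curry recipe, take 1"
tampered = b"Indian curry recipe, take 2"  # one character differs

hash_original = hashlib.sha256(original).hexdigest()
hash_tampered = hashlib.sha256(tampered).hexdigest()

print(hash_original)
print(hash_tampered)
print(hash_original == hash_tampered)  # False: the digests differ

# A hypothetical cache entry: the content, its hash, and an expiry time.
TTL_SECONDS = 3600  # assumed value; real TTLs depend on the content
entry = {
    "content": original,
    "hash": hash_original,
    "expires_at": time.time() + TTL_SECONDS,
}


def is_fresh(entry, now=None):
    """Return True while the entry's TTL has not run out."""
    now = time.time() if now is None else now
    return now < entry["expires_at"]
```

Changing a single character of the input produces an entirely different digest, which is what lets a server notice that a file was altered.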

Replication vs caching

Replication is costly because no TTL is issued, so the server has to keep fetching the file in the background. This is expensive in bandwidth, as the whole file is fetched each time. And if the file is mistakenly deleted on the main server, or the main server malfunctions, the copies on the other servers are deleted almost immediately as well.

With caching you don’t need to fetch the whole file every time. Before the TTL expires, you can compare hashes, which are short strings (a SHA-256 hash, for example, is 256 bits, usually written as 64 hexadecimal characters), and conclude whether you need to fetch the file at all. After the TTL expires, you fetch a new copy of the file from the main server or from a peer server that has a fresher copy. And if your main data centre has issues, the cached copies provide a way to recover the file fully.
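
The hash comparison can be sketched as follows. `MainServer`, `CacheServer` and their methods are hypothetical names for this example, and SHA-256 is just one possible choice of hash. The point is that only the short hash crosses the network when the file has not changed.

```python
import hashlib


def sha256_hex(data):
    return hashlib.sha256(data).hexdigest()


class MainServer:
    """Hypothetical origin that can report a file's hash without sending it."""

    def __init__(self, files):
        self.files = files
        self.bytes_sent = 0

    def current_hash(self, name):
        return sha256_hex(self.files[name])

    def download(self, name):
        self.bytes_sent += len(self.files[name])
        return self.files[name]


class CacheServer:
    def __init__(self, main):
        self.main = main
        self.cached = {}  # name -> (hash, content)

    def refresh(self, name):
        """Re-download the file only if its hash changed on the main server."""
        remote_hash = self.main.current_hash(name)
        local = self.cached.get(name)
        if local is not None and local[0] == remote_hash:
            return "unchanged"
        self.cached[name] = (remote_hash, self.main.download(name))
        return "fetched"


main = MainServer({"curry.mp4": b"x" * 1000})
cache = CacheServer(main)

first = cache.refresh("curry.mp4")    # nothing cached yet
second = cache.refresh("curry.mp4")   # hashes match, no download
main.files["curry.mp4"] = b"y" * 1200
third = cache.refresh("curry.mp4")    # content changed, download again

print(first, second, third, main.bytes_sent)
```

HTTP offers a comparable mechanism through the `ETag` and `If-None-Match` headers, where the server can answer that the content is unchanged instead of resending the body.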

How does TTL expire?

A TTL expires with time. Every cached file is valid only up to a time limit, and after that the cached file is invalidated. Another way is forced expiry: the main server instructs the cache server to invalidate the cached file. If the instruction says to update the file, a new copy is fetched from the main server or a peer server. If the instruction says nothing more, the file must be discarded.
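
Both kinds of expiry can be sketched with a small toy cache. The `TTLCache` class, its `get` and `invalidate` methods and the 60-second TTL are assumptions for illustration. Time is passed in explicitly as `now` so that the example behaves the same on every run.

```python
class TTLCache:
    """Toy cache with time-based and forced expiry (hypothetical API)."""

    def __init__(self, fetch, ttl_seconds):
        self.fetch = fetch          # callable that loads a file from the main server
        self.ttl = ttl_seconds
        self.entries = {}           # name -> (content, expires_at)

    def get(self, name, now):
        entry = self.entries.get(name)
        if entry is None or now >= entry[1]:
            # Missing or expired: load a fresh copy and restart the TTL.
            content = self.fetch(name)
            self.entries[name] = (content, now + self.ttl)
            return content
        return entry[0]

    def invalidate(self, name, now, update=False):
        """Forced expiry, triggered by the main server."""
        self.entries.pop(name, None)
        if update:
            self.get(name, now)     # refill immediately


fetch_log = []


def fetch_from_main(name):
    fetch_log.append(name)
    return f"contents of {name}"


cache = TTLCache(fetch_from_main, ttl_seconds=60)

cache.get("curry.mp4", now=0)     # miss: fetched
cache.get("curry.mp4", now=30)    # hit: still within the TTL
cache.get("curry.mp4", now=61)    # TTL expired: fetched again
cache.invalidate("curry.mp4", now=62, update=True)   # forced update: fetched
cache.invalidate("curry.mp4", now=63)                # forced discard

print(len(fetch_log), "curry.mp4" in cache.entries)
```

Out of five operations, only three reach the main server, and after the final forced discard the file is no longer in the cache.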

What Data is Cached?

Data which doesn’t change fast is cached, because fast-changing data needs to be updated instantaneously and caching doesn’t help there. Think of a live stream or the comment section on a Twitch stream. Caching is done for data that is effectively immutable, like a video, the set of images on a webpage or a complete search result in Google. Yes, Google caches whole search results to save computational power.

To conclude

I hope you now understand how caching works and how important it is to the smooth functioning of the internet. Next time a website or video loads fast, remember how many things are happening in the background. I am adding a few references in case you want to know more about caching.

Reference Links

TTL – https://www.cloudflare.com/en-in/learning/cdn/glossary/time-to-live-ttl/

Caching – https://www.cloudflare.com/en-in/learning/cdn/what-is-caching/

Search cache – https://purevisibility.com/cached-pages-google-mean/#:~:text=Search results on Google often come with a,last visited the site and indexed its content.
