
The challenges of big data

When a computing system needs to store massive amounts of data, there are two primary considerations: space and time. Or, more specifically:
  • How will the data be stored?
  • How can the data be processed efficiently?

Storage

In 2020, a standard laptop might have a 256 GB hard drive. That could fit:
  • 840,000,000 tweets (280 characters, username, timestamp)
  • 96,000 photos (compressed JPEGs)
  • 66,418 songs (compressed MP3s)
  • 224 movies (compressed MP4s)
For the average user, 256 gigabytes is quite a lot. But for a company operating at a global scale, it's barely anything.
Twitter users post 500 million tweets a day, and many of those tweets include photos. Twitter would need more than 500 of those 256 GB hard drives to store the data for a single day of usage.
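The back-of-the-envelope arithmetic behind those figures can be checked with a few lines of Python. The per-item sizes below (roughly 300 bytes for a tweet's text and metadata, and roughly 256 KB per tweet once attached photos are averaged in) are illustrative assumptions, not Twitter's actual numbers:

```python
# Back-of-the-envelope storage estimates; the per-item sizes are assumptions.
DRIVE_BYTES = 256 * 10**9           # one 256 GB laptop hard drive

TWEET_TEXT_BYTES = 305              # ~280 characters plus a short username and timestamp (assumed)
AVG_TWEET_BYTES = 256 * 10**3       # rough average per tweet once photos are included (assumed)
TWEETS_PER_DAY = 500 * 10**6        # tweets posted per day

# Text-only tweets that fit on one drive: roughly 840 million.
print(DRIVE_BYTES // TWEET_TEXT_BYTES)

# 256 GB drives needed to hold one day of tweets, photos included: roughly 500.
print(TWEETS_PER_DAY * AVG_TWEET_BYTES / DRIVE_BYTES)
```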
Dozens of hard drives can be connected together using a disk array or disk enclosure.
In this HP disk array, each shelf can store up to 12 hard drives:
A disk array with a controller component at the top and four shelves of hard drives below it. Each shelf has 12 slots, and many of the slots have a hard drive in them.
HP EVA4400 storage array. Image source: Redline
The German Climate Computing Center stores more than 40 petabytes of climate data using disk enclosures like the one pictured below:
A gloved individual removes a shelf from a disk enclosure containing many hard drives.
When an organization has thousands of hard drives to manage, they can house them in a data center, a building dedicated entirely to housing computers and data storage devices.
The inside of a data center, like this IBM Cloud Data Center, contains multiple aisles of computing equipment, plus the necessary infrastructure to provide electricity and prevent device overheating.
The inside of a data center with racks of disk arrays.
Image source: IBM
Data centers are often highly networked, so that data and computations can be shared across multiple machines.
All that networking requires a whole lot of networking cable:
The back of an aisle of racks in a data center, with a large number of networking cables coming out of each component.
Image source: IBM

Processing

A large data set can take a long time to process, regardless of whether the data set can fit on a single hard drive.
Let's imagine that engineers at Twitter want to determine how many tweets contain a particular hashtag (e.g. "#ClimateCrisis").
The code to determine whether a single tweet contains the hashtag requires only a tenth of a millisecond, or 0.0001 seconds.
The code to analyze 500 million tweets (the amount posted each day) would require this much time:
0.0001 * 500,000,000 = 50,000 seconds ≈ 13.9 hours
It would take half a day just to process a day's worth of tweets!
The engineers have two options at this point:
  1. Come up with a faster per-tweet algorithm
  2. Use parallel computing to process the data in parallel
The engineers can probably figure out some ways to improve the efficiency of the hashtag check, but even if they managed to reduce the time by a factor of 10, it would still take nearly an hour and a half to analyze a day's worth of tweets. If they hope to analyze more than that (like a month of tweets, a year of tweets, or all tweets ever), they will need to use parallel computing.
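To see why a faster per-tweet check alone doesn't solve the problem, here is a rough Python sketch of the timing arithmetic. The per-tweet time and tweet volume are the figures from this article, and the 10x speedup is hypothetical:

```python
# Timing arithmetic for the single-machine hashtag scan (a sketch using the article's figures).
SECONDS_PER_TWEET = 0.0001          # ~0.1 ms to check one tweet for the hashtag
TWEETS_PER_DAY = 500 * 10**6

one_day = SECONDS_PER_TWEET * TWEETS_PER_DAY
print(one_day / 3600)               # ~13.9 hours to scan one day of tweets

# Even a hypothetical 10x faster per-tweet check still leaves a long wait:
print(one_day / 10 / 3600)          # ~1.4 hours per day of tweets

# ...and a year of tweets at that improved speed would take about three weeks:
print(one_day / 10 * 365 / 86400)   # ~21 days
```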
🧠 Don't remember how parallel computing works? Review it here.
Each tweet can be analyzed independently of other tweets, so this type of data processing can be easily parallelized. The work can also be distributed, with multiple machines working on a subset of the data in parallel.
For example, five machines could each process 100 million tweets and send back a count of how many tweets contained "#ClimateCrisis" to a central machine. Once that machine received the count from each of the five machines, it could sum them up and report the total count.
Diagram of communication in a parallel and distributed computing system. The managing computer sends 100 million tweets to each of five workers, and each worker sends a count back.
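Here is a minimal Python sketch of that idea, using a multiprocessing pool on one machine to stand in for the five worker machines. The tiny tweet list and the hashtag-counting helper are made up for illustration:

```python
# A sketch of the distributed counting idea, simulated on one machine:
# a multiprocessing pool stands in for five separate worker machines.
from multiprocessing import Pool

HASHTAG = "#ClimateCrisis"

def count_hashtag(tweets):
    """Worker task: count how many tweets in this chunk contain the hashtag."""
    return sum(1 for tweet in tweets if HASHTAG in tweet)

if __name__ == "__main__":
    # A tiny stand-in data set; the real system would split 500 million tweets
    # into five chunks of 100 million each.
    tweets = [
        "Wildfires again this summer #ClimateCrisis",
        "Lovely weather today",
        "We need action now #ClimateCrisis",
        "Lunch was great",
    ] * 1000

    num_workers = 5
    chunk_size = len(tweets) // num_workers
    chunks = [tweets[i * chunk_size:(i + 1) * chunk_size] for i in range(num_workers)]

    with Pool(num_workers) as pool:
        counts = pool.map(count_hashtag, chunks)   # each worker returns its own count

    print(sum(counts))   # the central machine sums the per-worker counts
```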

Responsible use

"With great power, comes great responsibility." - Uncle Ben
Much of the data in these large data sets is related to people in some way: health records, application data, geolocations. Whenever an organization stores and processes massive data sets that either represent or affect humans, it must be extremely careful.
Here are just a few ethical considerations:
  • If the data includes PII (personally identifiable information), is it necessary? If so, is it secured via encryption?
  • For any personal information in the data, are the persons aware that their data is being collected and stored?
  • Are people allowed to request deletion of their personal data?
  • Is there a plan to automatically delete the data when it is no longer needed?
  • If the data will be analyzed to come up with conclusions, was there bias in the way the data was collected?
  • If the data will be used to justify a change in a user-facing product, will there be monitoring to ensure no users are harmed by the change?
We will dive deep into the ethics of using machine learning algorithms in the next lesson, since machine learning is a popular technique for analyzing big data but is too often used irresponsibly.
