Where does big data come from?
Sometimes, big data is data coming from one very large source. Most of the time, big data is a collection of data from lots of little sources. With 7.5 billion people in the world and even more computing devices, there's a lot of data out there to collect.
Let's explore a variety of sources.
The Large Hadron Collider, the world's largest particle accelerator, is used by physicists around the world to study the nature of matter. LHC experiments produce around 50-75 petabytes of data each year, the equivalent of 15-20 million high-definition movies.
The earth is surrounded by thousands of satellites. NASA EOSDIS is one of the groups collecting imagery and sensor reports from those satellites, adding 23 terabytes of data to its archive every day.
Thanks to government funding of scientific research projects, a lot of the data collected by research projects is openly available in standard formats. That enables researchers and hobbyists everywhere to turn that data into valuable insights and opportunities.
Digital libraries archive vast numbers of historical documents, artifacts, and media.
The Internet Archive is a non-profit that attempts to archive every webpage at multiple points in its history. Our own website, Khan Academy, has been captured more than 8000 times, so we can reflect fondly on our early days in 2008. A single copy of their archive takes up more than 30 petabytes of space, and since they certainly don't want to lose that data, there are multiple copies of that 30 petabyte archive.
Google Books is a related project that has scanned over 25 million books and hopes to eventually scan every book in the world.
The scanning algorithms use optical character recognition (OCR) to turn the scanned book pages into text, so you may find results from books in Google search queries. The Google Ngram Viewer uses the scanned text database to visualize how often words were used by authors over the last few hundred years.
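To make the idea of tracking word usage over time concrete, here is a minimal Python sketch that computes how often a word appears in texts grouped by publication year. It is only an illustration of the general technique, not how the Google Ngram Viewer is actually built. The function name `word_frequency_by_year` and the two tiny sample "books" are made up for this example.

```python
from collections import Counter, defaultdict
import re


def word_frequency_by_year(books, word):
    """Return {year: share of all words in that year equal to `word`}.

    `books` is a list of (year, text) pairs. This is a hypothetical,
    simplified stand-in for a scanned-text database.
    """
    counts_by_year = defaultdict(Counter)
    for year, text in books:
        # Lowercase the text and split it into simple word tokens.
        counts_by_year[year].update(re.findall(r"[a-z']+", text.lower()))

    frequencies = {}
    for year, counts in sorted(counts_by_year.items()):
        total = sum(counts.values())
        # Avoid dividing by zero when a year has no recognizable words.
        frequencies[year] = counts[word.lower()] / total if total else 0.0
    return frequencies


# Made-up sample data, for illustration only.
sample_books = [
    (1900, "The telegraph and the railway"),
    (2000, "The internet and the web and the phone"),
]

print(word_frequency_by_year(sample_books, "the"))
```

A real system would work with millions of books and would need a much more careful way to split text into words, but the core step of counting occurrences per year is the same.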
An increasing number of health care providers are storing patient data in an electronic health record (EHR). An electronic health record includes the patient's demographics, medical issues, medications ordered/taken, laboratory results, and imagery results.
Medical imagery is the bulkiest of the data in an EHR, since images take up so much more space than text. Hospitals often use imagery to diagnose internal injuries and tumors, and they may use different technologies like magnetic resonance imaging (MRI), positron emission tomography (PET), and X-ray computed tomography (CT).
A CT scan creates cross-sectional images of a body part or the entire body. The animation below shows 34 slices from a CT brain scan, from the top of the skull to the base:
A typical CT scan produces images of 512 x 512 pixels and stores each pixel using 16 bits. The 34-slice brain scan above would take up about 18 MB of storage space, and a more detailed scan or a scan of a longer body part would take up even more space. A single hospital can easily generate terabytes of imagery data each year.
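The storage estimate for the brain scan can be checked with a little arithmetic. The Python sketch below assumes the images are stored uncompressed with no extra metadata, which is a simplification. The function name `ct_scan_bytes` is made up for this example.

```python
def ct_scan_bytes(slices, width=512, height=512, bits_per_pixel=16):
    """Estimate the raw size of a CT scan in bytes.

    Assumes uncompressed pixel data and ignores file headers and metadata.
    """
    total_bits = slices * width * height * bits_per_pixel
    return total_bits // 8  # 8 bits per byte


size = ct_scan_bytes(34)  # the 34-slice brain scan
print(size, "bytes")
print(round(size / 1_000_000, 1), "MB")  # using 1 MB = 1,000,000 bytes
```

Doubling the number of slices doubles the storage, which is why scans of longer body parts take up more space.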
In the US, health care providers need to store all that patient data in a way that's compliant with the Health Insurance Portability and Accountability Act (HIPAA). Their data storage mechanism must have privacy safeguards, to ensure only authorized health care providers can access the data. It also needs to have a backup copy and a disaster recovery strategy, to ensure the data isn't accidentally destroyed.
Any application with millions of users is also collecting big data about its users' interactions.
Back in 2014, Facebook reportedly generated 4 new petabytes of data every single day.
That amount of data presents huge challenges for processing, storage, and privacy.
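To compare the sources mentioned in this article, it helps to convert them all to the same unit. The Python sketch below converts each reported rate to petabytes per year. It assumes 1 petabyte = 1,000 terabytes and a 365-day year, and it simply reuses the figures quoted above, which come from different years and are only rough estimates.

```python
TB_PER_PB = 1000      # assumption: decimal units
DAYS_PER_YEAR = 365   # assumption: ignore leap years

# Figures quoted earlier in this article, converted to petabytes per year.
petabytes_per_year = {
    "LHC experiments (low estimate)": 50,
    "LHC experiments (high estimate)": 75,
    "NASA EOSDIS (23 TB per day)": 23 / TB_PER_PB * DAYS_PER_YEAR,
    "Facebook in 2014 (4 PB per day)": 4 * DAYS_PER_YEAR,
}

# Print the sources from smallest to largest.
for name, pb in sorted(petabytes_per_year.items(), key=lambda item: item[1]):
    print(f"{name}: about {pb:,.1f} PB per year")
```

Under these assumptions, the Facebook figure works out to well over a thousand petabytes per year, far more than the scientific sources listed here.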
We'll look at some of the challenges of dealing with large data sets in the next article.
Want to join the conversation?
- Is it possible to keep track of the data you may be sending from your PC?
- Yes, but expect a lot of it. Microsoft offers a tool called Microsoft Network Monitor that you can download. You start a capture, let your computer run for a while, then end the capture and look at all the data your computer sent during that time.