C1ay Posted September 4, 2009

As the amount of data stored by databases around the world enters the realm of the petabyte (roughly the amount of data in a mile-high stack of CD-ROMs), efficient data management is becoming more and more important. Now computer scientists at Yale University have developed a new database system that combines the best features of several approaches into an open-source hybrid called HadoopDB.

Traditional approaches to managing data at this scale typically fall into one of two categories. The first is parallel database management systems (DBMSs), which are good at working with structured data, such as tables with trillions of rows. The second is the kind of approach taken by MapReduce, the software framework Google uses to process data on the Web, which gives the user more control over how the data is retrieved.

"In essence, HadoopDB is a hybrid of MapReduce and parallel DBMS technologies," said Daniel Abadi, assistant professor of computer science at Yale and one of the system's designers. "It's designed to take the best features of both worlds. We get the performance of parallel database systems with the scalability and ease of use of MapReduce."

HadoopDB was announced on Abadi's blog last month. Yale graduate students and co-creators Azza Abouzeid and Kamil Bajda-Pawlikowski will present more in-depth details of the new system at the VLDB conference in Lyon, France on August 27. They will also present results of a detailed performance analysis conducted with Abadi, Avi Silberschatz, chair of computer science at Yale, and Alexander Rasin of Brown University. At the conference the team will demonstrate the system's performance on a range of representative queries, over both structured and unstructured data, and will outline HadoopDB's characteristics along the dimensions of run-time performance, loading time, fault tolerance and scalability.

With the huge amounts of data being collected and used in today's databases, from consumer information used by retail chains to improve buying experiences and reduce customer churn, to financial information collected by banks to reduce risk and avoid another catastrophic financial collapse, being able to store and analyze such vast amounts of data will only grow in importance, Abadi said. HadoopDB reduces the time it takes to perform some typical tasks from days to hours, making more complicated analysis possible, the kind that could be used to find patterns in the stock market, earthquakes, consumer behavior and even disease outbreaks, Abadi said. "People have all this data, but they're not using it in the most efficient or useful way."

Source: Yale University
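The mile-high comparison checks out with round numbers. A minimal back-of-the-envelope sketch in Python (the 700 MB capacity and 1.2 mm thickness per disc are assumptions, not figures from the article):

    # How tall is a petabyte stacked as CD-ROMs?
    # Assumed round numbers: 700 MB capacity, 1.2 mm thickness per disc.
    PETABYTE = 10**15            # bytes
    CD_CAPACITY = 700 * 10**6    # bytes per disc
    CD_THICKNESS_M = 0.0012      # metres per disc

    discs = PETABYTE / CD_CAPACITY
    height_m = discs * CD_THICKNESS_M
    print(f"{discs:,.0f} discs, {height_m:,.0f} m ({height_m / 1609.34:.2f} miles)")
    # ~1.43 million discs, ~1,714 m -- just over a mile, as the article says.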
Pyrotex Posted September 9, 2009

A petabyte (PB) is one million gigabytes, or 10^15 bytes, according to Wikipedia. According to that article, Google and other agencies are handling, archiving or streaming well over one PB per day. Per DAY.

It takes an ounce (or so) of chips or hard drive to hold 1 GB of data. Let's say 50 GB = 1 pound of mass. That means 1 PB of data requires about 20,000 pounds, or 10 tons, of mass storage.

Problem: Assume that storing data requires 10 tons of mass per PB from now on and never improves. Assume that the entire planet needs to store 1 PB of data per day on Sept. 9, 2009. Assume that storage requirements grow exponentially over time, such that on Sept. 10, 2009, one day later, they are up by 1/100 of 1%. On what date will the entire mass of the Earth be required to hold all stored data?
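Stated symbolically, the puzzle is this (a minimal formulation; $t$ is counted in days from Sept. 9, 2009, masses are in short tons, and the Earth's mass of about $5.97 \times 10^{24}$ kg works out to roughly $6.6 \times 10^{21}$ short tons):

$$m(t) = 10 \cdot 1.0001^{\,t}, \qquad m(T) = M_\oplus \;\Longrightarrow\; T = \frac{\ln(M_\oplus / 10)}{\ln 1.0001},$$

and for the accumulated total one instead solves $\sum_{t=0}^{T} m(t) = M_\oplus$ using the geometric series.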
TheBigDog Posted September 9, 2009

(Quoting Pyrotex:) "On what date will the entire mass of the Earth be required to hold all stored data?"

At that rate of growth, in 1312.558 years it will take the mass of the entire Earth to record just that day's data. Or... April 11, 3322. That is when a single day's data alone requires the whole of the Earth.

Accumulating the mass of the Earth would happen much sooner. (I had to figure this one out.) On February 2, 3070 the whole of the Earth's mass would be dedicated to data storage.

Bill
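Both of Bill's dates check out numerically. A minimal sketch (assuming short tons and an Earth mass of 5.972 × 10^24 kg; slightly different constants shift the dates by a few weeks):

    import math
    from datetime import date, timedelta

    EARTH_TONS = 5.972e24 / 907.185   # Earth's mass in short tons, ~6.6e21
    DAILY_TONS = 10.0                 # mass for day 0's data (1 PB = 10 tons)
    GROWTH = 1.0001                   # +0.01% per day
    START = date(2009, 9, 9)

    # Day on which a single day's data alone needs the whole Earth:
    t_single = math.log(EARTH_TONS / DAILY_TONS) / math.log(GROWTH)
    print(START + timedelta(days=round(t_single)))
    # ~March 3322, within a few weeks of Bill's April 11, 3322

    # Day on which the accumulated total reaches the Earth's mass,
    # from the geometric series 10 * (G**(t+1) - 1) / (G - 1):
    t_total = math.log(EARTH_TONS * (GROWTH - 1) / DAILY_TONS + 1) / math.log(GROWTH) - 1
    print(START + timedelta(days=round(t_total)))
    # ~January 3070, within a few weeks of Bill's February 2, 3070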
Boerseun Posted September 10, 2009

Jeez... imagine that. Everybody is obsessed with recording stuff. Kids walk around with a couple of terabytes of hard-drive space in external drives, going from mate to mate and swapping all the cool vid clips, audio clips, songs, games, software, etc. - but does anybody actually use, watch or listen to any of it?

I have checked: my song collection would take almost three years to listen to, back-to-back, 24/7. I will clearly never do it, and I will never get around to listening to all the music I have. Why do I have it? It takes up the best part of a terabyte as it is, yet I can't get myself to delete it.

How many other people do the same, and how is this contributing to global warming and pollution in general? That hard drive didn't grow on a tree, you know, and it's doing absolutely nothing productive or even necessary. Maybe the world will eventually turn into one single big hard drive for storing information about the fact that it is one single big hard drive.
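The three-year figure is plausible arithmetic. A rough estimate (the library size and average bitrate are assumptions, not anything Boerseun stated):

    # Continuous playback time for ~1 TB of compressed audio.
    LIBRARY_BYTES = 1e12      # assumed ~1 TB of music
    BITRATE_BPS = 128_000     # assumed average bitrate, bits per second

    seconds = LIBRARY_BYTES * 8 / BITRATE_BPS
    years = seconds / (3600 * 24 * 365.25)
    print(f"{years:.1f} years of continuous playback")
    # ~2 years at 128 kbps; closer to 3 at lower average bitrates.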