Spark: Lighting a Fire Under Hadoop


Hadoop has come a long way since its introduction as an open source project that emerged from Yahoo. It is moving from pilot and test stages into production at many firms, and the ecosystem of companies supporting it in one way or another is growing daily.

It has some flaws, however, that limit the kinds of Big Data projects people can do with it. The Hadoop ecosystem uses a specialized distributed file system, the Hadoop Distributed File System (HDFS), to store large files across multiple servers and keep track of everything.

While this helps manage terabytes of data, processing at the speed of spinning hard drives makes Hadoop prohibitively slow for anything exceptionally large or anything real-time. Unless you were prepared to go to an all-SSD array – and who has that kind of money? – you were at the mercy of your 7,200 RPM hard drives.

The power of Hadoop centers on distributed computing, but Hadoop has primarily been used for batch processing. It uses the MapReduce framework to execute batch jobs, often overnight, to get your answer. Because of this slow process, Big Data might have promised real-time analytics, but it often couldn’t deliver.

Enter Spark. It moves the processing side of MapReduce into memory, giving Hadoop a massive speed boost. Its developers claim it runs certain workloads up to 100 times faster than MapReduce, and in the process it opens Hadoop up to many more kinds of Big Data projects, thanks to the speed and the potential for real-time processing.
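To make the in-memory point concrete, here is a minimal PySpark sketch, with a hypothetical HDFS path and log format: the first action reads from disk as usual, but cache() keeps the filtered records in cluster memory, so later passes skip the disk entirely.

```python
from pyspark import SparkContext

sc = SparkContext(appName="cache-demo")

# Lazily read log lines from HDFS (path is hypothetical), keep only
# the errors, and ask Spark to hold the result in executor memory.
logs = sc.textFile("hdfs:///data/logs")
errors = logs.filter(lambda line: "ERROR" in line).cache()

print(errors.count())  # first action: scans disk once and fills the cache
print(errors.filter(lambda l: "timeout" in l).count())  # served from memory
```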

Spark started as a project in the University of California, Berkeley AMPLab in 2009 and was donated to the Apache Software Foundation as an open source project in 2013. A company, Databricks, was spun out of the AMPLab to lead Spark’s development.

Patrick Wendell, co-founder and engineering manager at Databricks, was a part of the team that made Spark at Berkeley. He says that Spark was focused on three things:

1) Speed: MapReduce was based on an old Google technology and is disk-based, while Spark runs in memory.

2) Ease of use: “MapReduce was really hard to program. Very few people wrote programs against it. Developers spent so much time trying to write their program in MapReduce and it was a huge waste of time. Spark has a developer-friendly API,” he said. It supports several languages, including Python, Java, Scala, and R; the word-count sketch after this list shows how compact that API is.

3) Broad compatibility: Spark can run on Amazon EC2, on Apache Mesos, and in various cloud environments. It can read and write data to a variety of databases, such as PostgreSQL, Oracle, and MySQL, as well as all Hadoop file formats; a database read is sketched after the word-count example below.
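To put the ease-of-use point in perspective, the canonical word count takes only a handful of lines in Spark’s Python API, where the same job in classic MapReduce needs separate mapper and reducer classes plus driver boilerplate. The input and output paths below are placeholders:

```python
from pyspark import SparkContext

sc = SparkContext(appName="wordcount")

counts = (sc.textFile("hdfs:///data/input.txt")   # read lines from HDFS
            .flatMap(lambda line: line.split())   # split each line into words
            .map(lambda word: (word, 1))          # emit (word, 1) pairs
            .reduceByKey(lambda a, b: a + b))     # sum the counts per word

counts.saveAsTextFile("hdfs:///data/wordcount-out")
```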
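And for the compatibility point, a hedged sketch of pulling a table out of PostgreSQL over JDBC. The hostname, credentials, and table name are invented for illustration, and the PostgreSQL JDBC driver must already be on Spark’s classpath:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("jdbc-demo").getOrCreate()

# Load an existing relational table into a distributed DataFrame.
orders = (spark.read.format("jdbc")
          .option("url", "jdbc:postgresql://db.example.com:5432/shop")
          .option("dbtable", "orders")
          .option("user", "reader")
          .option("password", "secret")
          .load())

# From here the data can be joined, aggregated, or written back out.
orders.groupBy("status").count().show()
```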

“Many people have moved to Spark because they are performance-sensitive and time is money for them,” said Wendell. “So this is a key selling point. A lot of the original Hadoop code was focused on offline batch processing, often run overnight. There, latency and performance don’t matter much.”

Because Spark is not a storage system, you can keep your existing storage layer; Spark plugs right into Hadoop and gets going, and governance and security are already taken care of. “We just speed up the actual crunching of what you are trying to do,” said Wendell. Of course, that is predicated on giving your distributed servers all the memory they need to run everything in memory, as the sketch below suggests.
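That caveat is worth making concrete. Here is an illustrative sketch of sizing executor memory when creating the context; the figures are invented and would need tuning to the real working set:

```python
from pyspark import SparkConf, SparkContext

# Illustrative sizing only: executors need enough RAM to hold the
# working set, since Spark's speed advantage depends on the data
# fitting in memory.
conf = (SparkConf()
        .setAppName("memory-demo")
        .set("spark.executor.memory", "8g")      # per-executor heap
        .set("spark.executor.instances", "10"))  # number of executors

sc = SparkContext(conf=conf)
```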
