Hadoop is an open-source project from the Apache Foundation, introduced in 2006 and developed in Java. Its objective is to provide a working environment suited to the demands of Big Data (the 4 V’s). As such, Hadoop is designed to work with large Volumes of data, both structured and unstructured (Variety), and to process them in a secure and efficient way (Veracity and Velocity).
In the beginning, as the volume of data grew, the traditional solution was to invest in more powerful computers, with greater storage and processing capacity, which were of course more expensive. It soon became clear that this approach could not scale and that there had to be a better way. The key was distribution, both for storing information and for processing it: data is spread across multiple computers working together in “clusters”. Each cluster has one or more master nodes in charge of managing the distributed files, where the information is stored in separate blocks, and of coordinating and executing the different tasks among the cluster’s members.
When working in a distributed way, the key challenges are accessing the data, processing it quickly, and avoiding information loss if one of the nodes fails.
Hadoop answered these problems by providing:
- HDFS, a distributed file system that splits files into blocks and replicates each block on several nodes, so data remains available even if a node fails;
- MapReduce, a distributed processing model that sends the computation to the nodes that hold the data, so large volumes can be processed in parallel.
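To illustrate how the file system hides this distribution from the application, here is a minimal sketch using the Java FileSystem API to write and then read a file on HDFS. The class name and file path are hypothetical, and the cluster address is assumed to come from the usual configuration files (fs.defaultFS) on the classpath.

```java
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsRoundTrip {
    public static void main(String[] args) throws Exception {
        // Assumes fs.defaultFS (the NameNode address) is set in the
        // cluster's core-site.xml, e.g. hdfs://namenode:8020 (hypothetical).
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/data/example.txt"); // hypothetical path

        // Write: the client streams bytes; HDFS splits them into blocks
        // and replicates each block across several DataNodes automatically.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("hello hadoop\n".getBytes(StandardCharsets.UTF_8));
        }

        // Read: the NameNode tells the client which DataNodes hold the
        // blocks; unavailable replicas are skipped transparently.
        try (FSDataInputStream in = fs.open(file)) {
            IOUtils.copyBytes(in, System.out, 4096, false);
        }

        fs.close();
    }
}
```

The application only ever sees a single logical file; block placement, replication and recovery from failed nodes are handled by the cluster itself.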
Hadoop’s ability to distribute storage and processing across a large number of machines, and to provide redundancy in software, is another key advantage. There is no need to buy specialized hardware or costly RAID systems; “commodity hardware” can be used instead, which offers great flexibility and significant savings.
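To make the software-based redundancy concrete, the sketch below shows how replication is controlled: the cluster-wide default normally lives in hdfs-site.xml (the dfs.replication property), and a client can also request a different replication factor for an individual file. The path and the values used here are only illustrative, and the file is assumed to already exist.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Cluster-wide default, usually set in hdfs-site.xml;
        // overridden here on the client side for illustration.
        conf.setInt("dfs.replication", 3);

        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/data/example.txt"); // hypothetical path

        // Ask HDFS to keep one extra copy of this particular file.
        // Each block then lives on 4 different machines, so the loss of
        // a single commodity node does not lose any data.
        fs.setReplication(file, (short) 4);

        System.out.println("Replication now: "
                + fs.getFileStatus(file).getReplication());
        fs.close();
    }
}
```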
Another advantage of Hadoop is its scalability, especially when it runs on cloud platforms such as Microsoft Azure, Amazon Web Services and Google Compute Engine, which allow users to add and remove resources as business needs change. In these cases, the platforms’ own storage systems are normally used in order to separate computing from storage. In this way, the compute nodes focus on processing and analyzing data instead of performing system maintenance tasks.
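A common way to achieve this separation is to point Hadoop’s FileSystem API at a cloud object store instead of HDFS, for example Amazon S3 through the s3a connector in the hadoop-aws module. The sketch below assumes that module is on the classpath and that credentials are supplied through the environment or core-site.xml; the bucket name is a placeholder.

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CloudStorageListing {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Credentials are normally provided via environment variables,
        // instance roles, or core-site.xml; nothing is hard-coded here.

        // "my-bucket" is a placeholder bucket name.
        FileSystem fs = FileSystem.get(URI.create("s3a://my-bucket/"), conf);

        // The compute nodes stay stateless: data lives in the object
        // store, so cluster nodes can be added or removed at will.
        for (FileStatus status : fs.listStatus(new Path("s3a://my-bucket/input/"))) {
            System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
        }
        fs.close();
    }
}
```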
As with other open-source projects, some vendors offer stable distributions that include their own tools and support. The most common are Cloudera Hadoop, Hortonworks, MapR, Microsoft HDInsight, IBM InfoSphere BigInsights and AWS EMR (Elastic MapReduce).