By LaertesCTB

Hadoop is an enterprise-ready cloud computing technology and the de facto standard for Big Data.


Introduction to Hadoop

Hadoop is a Big Data platform for storing and processing large data sets. It offers two important services: storing ANY kind of data from ANY source inexpensively and at very large scale, and performing very sophisticated analysis of that data easily and quickly. It effectively resolves the difficulty traditional systems face when they cannot capture, store, and process Big Data.

Rather than relying on hardware to deliver high availability, Hadoop runs on industry-standard hardware and detects and handles hardware failures automatically, delivering a highly available service on top of a cluster of computers. This also means that the cost per terabyte, for both storage and processing, is much lower than on older systems, making it extremely cost effective. Adding or removing storage capacity is simple: you can dedicate new hardware to a cluster incrementally.

More importantly, Hadoop allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from a single server to thousands of machines, each offering local computation and storage. It provides a full set of Big Data analysis tools, from a Big Data database and data warehouse to a parallel processing engine and a machine learning library.
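To give a flavor of that "simple programming model", here is a minimal word-count sketch in the MapReduce style: the mapper emits (word, 1) pairs, a shuffle step groups them by key, and the reducer sums each group. The function names and the single-process shuffle are ours, for illustration only; a real Hadoop job runs the same two phases distributed across the cluster.

```python
from collections import defaultdict

def mapper(line):
    """Map phase: emit a (word, 1) pair for every word in a line of text."""
    for word in line.lower().split():
        yield word, 1

def reducer(word, counts):
    """Reduce phase: sum all counts emitted for one word."""
    return word, sum(counts)

def run_job(lines):
    """Simulate the shuffle that Hadoop performs between map and reduce:
    group intermediate values by key, then reduce each group."""
    groups = defaultdict(list)
    for line in lines:
        for word, count in mapper(line):
            groups[word].append(count)
    return dict(reducer(w, c) for w, c in groups.items())

result = run_job(["big data is big", "hadoop processes big data"])
print(result)  # → {'big': 3, 'data': 2, 'is': 1, 'hadoop': 1, 'processes': 1}
```

Because the mapper and reducer are pure functions over key/value pairs, Hadoop can run many copies of each in parallel on different blocks of the input — that independence is what lets the same program scale from one machine to thousands.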


Hadoop includes the following exceptional technologies:

  • Hadoop Distributed File System (HDFS), a distributed file system that provides high-throughput access to application data
  • MapReduce, a parallel processing system for large data sets
  • YARN, a framework for job scheduling and cluster resource management
  • Hadoop Common, the common utilities that support the other Hadoop modules
  • Hive, the Hadoop data warehouse, which facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets
  • HBase, the Hadoop database, offering high fault tolerance, built-in scalability, built-in load balancing, automatic failover, and versioning
  • Pig, an engine for executing data flows in parallel on Hadoop that automatically generates MapReduce programs
  • Oozie, a scalable, reliable, and extensible workflow scheduler system to manage Hadoop jobs
  • Mahout, Hadoop's powerful machine learning library
  • Sqoop, an important tool for transferring data between relational databases and Hadoop
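To illustrate the data-flow style that a tool such as Pig expresses on top of MapReduce, the sketch below chains load → filter → group → count steps in plain Python. The click-log records and field layout are invented for illustration; real Pig scripts are written in Pig Latin and compiled into MapReduce jobs.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical click-log records: (user, url, http_status)
records = [
    ("alice", "/home",  200),
    ("bob",   "/login", 500),
    ("alice", "/home",  200),
    ("carol", "/home",  200),
    ("bob",   "/home",  200),
]

# FILTER: keep only successful requests (status == 200)
ok = [r for r in records if r[2] == 200]

# GROUP BY url, then COUNT per group -- the shape of a typical Pig script
ok.sort(key=itemgetter(1))
hits = {url: len(list(group)) for url, group in groupby(ok, key=itemgetter(1))}
print(hits)  # → {'/home': 4}
```

Each step (filter, group, count) maps naturally onto map and reduce phases, which is why Pig can turn such a pipeline into parallel MapReduce jobs automatically.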

In today’s Internet world, more and more big data is hitting businesses, and Hadoop is widely used across the business world (Who uses Hadoop). When your traditional software cannot handle large data within tolerable elapsed times, you may consider Hadoop.


How to start your Hadoop project

Data-driven decisions and applications create immense value from Big Data. You may not have petabytes of data to analyze today; nevertheless, you can deploy Hadoop with confidence: it is proven at scale, and the user community of Hadoop and HBase is global, active, and diverse.

There are three key steps to starting a Hadoop project successfully:

  • Define the problem domain and your business use cases: start with an inventory of business challenges and goals, then narrow them down to those expected to provide the highest return with Hadoop.
  • Define the technical requirements: determine the characteristics of your data in terms of volume, velocity, and variety, and identify how Hadoop will store and process the data.
  • Plan the Big Data project: construct concrete objectives and goals in measurable business terms, identify the time to business value and the expected outcome, and plan the project approach, cost by category, measures, project activities, and timing.

Please feel free to contact us if you have any queries.
