Big Data refers to the new technologies and applications introduced to handle increasing Volume, Velocity, and Variety of data while enhancing data utilization capabilities such as Variability, Veracity, and Value.
The large Volume of data available forces horizontal scalability of storage and processing and has implications for all the other V-attributes. The increasing Velocity of data ingestion and change implies the need for stream processing, filtering, and processing optimizations. The Variety of data types being generated (e.g., multimedia) requires the use of non-traditional data stores and processing.
| Category | Project / Component | Description | Built through Bigtop |
|---|---|---|---|
| Core Components | Apache Bigtop | Bigtop is a project for Infrastructure Engineers and Data Scientists looking for comprehensive packaging, testing, and configuration of the leading open source big data components. The list of projects/components included in it is here. It packages RPMs and DEBs and also provides Vagrant recipes, raw images, and Docker images. | Yes |
| | Apache Hadoop | Hadoop is a framework that allows for distributed processing of large data sets across clusters of computers using simple programming models (see the MapReduce sketch after this table). It includes HDFS, MapReduce, YARN, and HCatalog. | Yes |
| | Elastic Stack | The Elastic Stack, or ELK, contains the following components: Elasticsearch, Logstash, Kibana, and Beats. Elasticsearch is a search and analytics engine. Logstash is a server-side data processing pipeline that ingests data from multiple sources simultaneously, transforms it, and then sends it to a "stash" like Elasticsearch. Kibana lets users visualize data in Elasticsearch with charts and graphs. | No |
| Operations Components | Apache Ambari | The Apache Ambari project is aimed at making Hadoop management simpler by developing software for provisioning, managing, and monitoring Apache Hadoop clusters. Ambari provides an intuitive, easy-to-use Hadoop management web UI backed by its RESTful APIs. | Yes |
| | Apache ZooKeeper | ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services. | Yes |
| Streaming Components | Apache Spark | Apache Spark is a unified analytics engine for large-scale data processing (see the Spark SQL sketch after this table). | |
| | Apache Flink | Apache Flink is a stream processing framework for distributed, high-performing, always-available, and accurate data streaming applications. | |
| | Apache Kafka | Apache Kafka is a distributed streaming platform capable of handling trillions of events a day. Initially conceived as a messaging queue, Kafka is based on an abstraction of a distributed commit log (see the producer/consumer sketch after this table). | |
| | Apache Beam | Apache Beam is a unified programming model to define and execute data processing pipelines, including ETL, batch, and stream (continuous) processing. | |
| | Apache Flume | Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of streaming event data. | |
| | Apache Storm | Apache Storm is a distributed stream processing computation framework written predominantly in the Clojure programming language. | |
| Database and Data Warehousing Components | Apache Hive | Hive is data warehouse software that facilitates reading, writing, and managing large datasets residing in distributed storage and queried using SQL syntax. | Yes |
| | Apache HBase | Apache HBase is a distributed, scalable big data store modeled after Google's Bigtable, providing random, real-time read/write access to data stored in Hadoop. | |
| | SparkSQL | SparkSQL is Apache Spark's module for working with structured data using SQL or the DataFrame API. | |
| | Apache Pig | Apache Pig is a platform for analyzing large data sets, using the high-level Pig Latin language to express data analysis programs. | |
| | Apache Phoenix | Apache Phoenix provides a relational SQL layer over HBase, enabling low-latency OLTP and operational analytics on Hadoop. | |
| | Apache Impala | Apache Impala is a query engine and native analytic database for Apache Hadoop. Impala is shipped by Cloudera, MapR, Oracle, and Amazon. | No |
| | Apache Drill | Apache Drill is a framework that supports data-intensive distributed applications for interactive analysis of large-scale datasets. | No |
| | PrestoDB | Presto is an open source distributed SQL query engine for running interactive analytic queries against data sources of all sizes, ranging from gigabytes to petabytes. | No |
| Governance and Security | | | |
| File Formats | Apache Avro | Avro is a remote procedure call and data serialization framework developed within Apache's Hadoop project (see the serialization sketch after this table). | |
| | Parquet | Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language (see the pyarrow sketch after this table). | |
| | ORC | The Optimized Row Columnar (ORC) file format provides a highly efficient way to store Hive data. It was designed to overcome limitations of the other Hive file formats. Using ORC files improves performance when Hive is reading, writing, and processing data. | |
| | RC | RCFile (Record Columnar File) is a data placement structure that determines how to store relational tables on computer clusters. | |
| Data Science Notebooks | | | |
| Analytics Components | Apache Arrow | Apache Arrow is a development platform for in-memory analytics (see the pyarrow sketch after this table). | |
| Machine Learning | SparkML | SparkML (MLlib) is Apache Spark's scalable machine learning library. | |
| | Mahout | Apache Mahout is a library of scalable machine-learning algorithms, implemented on top of Apache Hadoop and using the MapReduce paradigm. | |
| | H2O | H2O is an open source, distributed, in-memory machine learning platform. | |
| | TensorFlow | TensorFlow is an open source machine learning framework developed by Google. | |
| Dependencies | | | |
| Smoke Tests | | | |
| Benchmarking | | | |
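
The sketches below illustrate a few of the components catalogued above; they are minimal, hedged examples rather than reference implementations. First, Hadoop's "simple programming model": word count expressed as map, shuffle, and reduce phases. It runs locally on an invented in-memory dataset purely to show the model; a real Hadoop job distributes the same phases across a cluster.

```python
# Word count as the MapReduce model: map -> shuffle -> reduce.
# Local, in-memory illustration only; Hadoop runs these phases on a cluster.
from itertools import groupby
from operator import itemgetter

documents = ["big data", "big clusters", "data pipelines"]  # toy input

# Map phase: emit a (word, 1) pair for every word in every document.
pairs = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle phase: bring identical keys together (the framework does this
# between the map and reduce phases in a real job).
pairs.sort(key=itemgetter(0))

# Reduce phase: sum the counts for each key.
for word, group in groupby(pairs, key=itemgetter(0)):
    print(word, sum(count for _, count in group))
```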
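
Next, a minimal Spark / SparkSQL sketch: building a DataFrame and querying it with SQL through a temporary view. It assumes a local pyspark installation (`pip install pyspark`); the view name and sample rows are made up for the example.

```python
# Minimal PySpark sketch: a DataFrame queried via SparkSQL.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sparksql-sketch").getOrCreate()

# Invented sample data: (user, event count) rows.
df = spark.createDataFrame(
    [("alice", 3), ("bob", 5), ("alice", 2)], ["user", "events"]
)

# SparkSQL: expose the DataFrame as a temporary view and query it with SQL.
df.createOrReplaceTempView("activity")
spark.sql(
    "SELECT user, SUM(events) AS total FROM activity GROUP BY user"
).show()

spark.stop()
```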
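
Kafka's distributed commit-log abstraction in miniature: a producer appends records to a topic, and a consumer replays them by offset. A sketch assuming a broker on localhost:9092 and the kafka-python client (`pip install kafka-python`); the topic name is illustrative.

```python
# Kafka commit-log sketch: append records to a topic, replay them by offset.
from kafka import KafkaConsumer, KafkaProducer

producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("events", b"user signed in")  # append to the log
producer.flush()

consumer = KafkaConsumer(
    "events",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",  # replay the log from the beginning
    consumer_timeout_ms=5000,      # stop iterating once caught up
)
for record in consumer:
    print(record.offset, record.value)
```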
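
An Avro serialization round trip, sketched here with the fastavro library (`pip install fastavro`) rather than the reference implementation; the User record schema is invented for the example.

```python
# Avro round trip: the schema travels with the data in the file header.
from fastavro import parse_schema, reader, writer

schema = parse_schema({
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "age", "type": "int"},
    ],
})

# Serialize records against the schema.
with open("users.avro", "wb") as out:
    writer(out, schema, [{"name": "alice", "age": 31}, {"name": "bob", "age": 27}])

# Deserialize: Avro resolves the data against the schema stored in the file.
with open("users.avro", "rb") as f:
    for record in reader(f):
        print(record)
```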
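
Finally, Arrow and Parquet together, sketched with pyarrow (`pip install pyarrow`): an in-memory Arrow table is persisted to a Parquet file and read back one column at a time, which is where the columnar layout pays off. File and column names are illustrative.

```python
# Arrow in memory, Parquet on disk.
import pyarrow as pa
import pyarrow.parquet as pq

# Build an in-memory Arrow table (columnar layout).
table = pa.table({"user": ["alice", "bob", "alice"], "events": [3, 5, 2]})

# Persist it in the Parquet columnar file format.
pq.write_table(table, "activity.parquet")

# Columnar payoff: read back only the column a query actually needs.
print(pq.read_table("activity.parquet", columns=["events"]))
```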