Big Data Projects / Components v1.0

Big Data refers to the new technologies and applications introduced to handle increasing Volume, Velocity and Variety of data while enhancing data utilization capabilities such as Variability, Veracity, and Value.

The large Volume of data available forces horizontal scalability of storage and processing and has implications for all the other V-attributes. The increasing Velocity of data ingestion and change implies the need for stream processing, filtering, and processing optimizations. The Variety of data types being generated (e.g. multimedia) requires the use of non-traditional data stores and processing.
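
As a minimal sketch of what horizontal scalability means in practice, the following Python snippet hash-partitions records across a number of nodes, lets each node aggregate its own shard independently, and then merges the partial results. The function names (`partition`, `local_sum`, `merge`), the sample records, and the choice of `crc32` as a stable hash are illustrative assumptions and are not taken from any project listed on this page.

```python
from collections import Counter
from zlib import crc32


def partition(records, num_nodes):
    """Assign each (key, value) record to a shard by a stable hash of its key."""
    shards = [[] for _ in range(num_nodes)]
    for key, value in records:
        index = crc32(key.encode("utf-8")) % num_nodes
        shards[index].append((key, value))
    return shards


def local_sum(shard):
    """Aggregate one shard; this needs no data from any other shard."""
    totals = Counter()
    for key, value in shard:
        totals[key] += value
    return totals


def merge(partials):
    """Combine the per-shard partial results into one final result."""
    combined = Counter()
    for partial in partials:
        combined.update(partial)  # Counter.update adds counts together
    return combined


# Hypothetical sample data: (sensor id, reading count)
records = [("sensor-a", 1), ("sensor-b", 2), ("sensor-a", 3), ("sensor-c", 4)]
shards = partition(records, num_nodes=3)
result = merge(local_sum(shard) for shard in shards)
```

Because every record with the same key lands on the same shard, adding nodes changes only how the work is spread out, not the final totals.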

Category	Project / Component	Description	Built through Bigtop
Core Components	Apache Bigtop	Bigtop is a project for infrastructure engineers and data scientists looking for comprehensive packaging, testing, and configuration of the leading open source big data components. The list of projects/components it includes is here. It packages RPMs and DEBs and also provides Vagrant recipes, raw images, and Docker images.	Yes
	Apache Hadoop	Hadoop is a framework that allows for distributed processing of large data sets across clusters of computers using simple programming models. It includes HDFS, MapReduce, YARN, and HCatalog.	Yes




Operations Components
	Apache Ambari	The Apache Ambari project is aimed at making Hadoop management simpler by developing software for provisioning, managing, and monitoring Apache Hadoop clusters. Ambari provides an intuitive, easy-to-use Hadoop management web UI backed by its RESTful APIs.	Yes
	Apache ZooKeeper	ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services.	Yes
Streaming Components
	Apache Spark	Apache Spark is a unified analytics engine for large-scale data processing.
	Apache Flink	Apache Flink is a stream processing framework for distributed, high-performing, always-available, and accurate data streaming applications.
	Apache Kafka	Apache Kafka is a distributed streaming platform capable of handling trillions of events a day. Initially conceived as a messaging queue, Kafka is based on an abstraction of a distributed commit log.
	Apache Beam	Apache Beam is a unified programming model to define and execute data processing pipelines, including ETL, batch, and stream (continuous) processing.
	Apache Flume	Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of streaming event data.
	Apache Storm	Apache Storm is a distributed stream processing computation framework written predominantly in the Clojure programming language.
Database and Data Warehousing Components
	Apache Hive	Hive is data warehouse software that facilitates reading, writing, and managing large datasets residing in distributed storage using SQL syntax.	Yes
	Apache HBase	HBase is an open-source, non-relational, distributed database modeled after Google's Bigtable.	Yes
	Spark SQL	Spark SQL is a Spark module for structured data processing.
	Apache Pig	Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turn enables them to handle very large data sets.	Yes
	Apache Phoenix	Apache Phoenix is a massively parallel, relational database engine supporting OLTP for Hadoop using Apache HBase as its backing store.
	Apache Impala	Apache Impala is a query engine and native analytic database for Apache Hadoop. Impala is shipped by Cloudera, MapR, Oracle, and Amazon.	No
	Apache Drill	Apache Drill is a framework that supports data-intensive distributed applications for interactive analysis of large-scale datasets.	No
	PrestoDB	Presto is an open source distributed SQL query engine for running interactive analytic queries against data sources of all sizes ranging from gigabytes to petabytes.	No
	Apache Cassandra	Apache Cassandra is a massively scalable open source non-relational database that offers continuous availability, linear scale performance, operational simplicity and easy data distribution across multiple data centers and cloud availability zones.	No
	ScyllaDB	Scylla is a distributed NoSQL data store. It supports the same protocols as Cassandra (CQL and Thrift) and the same file formats (SSTable), but it is a completely rewritten implementation: it is written in C++14 instead of Cassandra's Java, and it uses the Seastar asynchronous programming library in place of threads, shared memory, mapped files, and other classic Linux programming techniques.	No
	Apache Ignite	Ignite is a memory-centric distributed database, caching, and processing platform for transactional, analytical, and streaming workloads delivering in-memory speeds at petabyte scale with support for JCache, SQL99, ACID transactions, and machine learning.
	Apache Sqoop	Sqoop is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases.
Governance and Security

File Formats
	Apache Avro	Avro is a remote procedure call and data serialization framework developed within Apache's Hadoop project.
	Parquet	Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language.
	ORC	The Optimized Row Columnar (ORC) file format provides a highly efficient way to store Hive data. It was designed to overcome limitations of the other Hive file formats. Using ORC files improves performance when Hive is reading, writing, and processing data.
	RC	RCFile (Record Columnar File) is a data placement structure that determines how to store relational tables on computer clusters.
Data Science Notebooks

Analytics Components
	Apache Arrow	Apache Arrow is a development platform for in-memory analytics.
	Apache Solr	Solr is highly reliable, scalable and fault tolerant, providing distributed indexing, replication and load-balanced querying, automated failover and recovery, centralized configuration and more. Solr powers the search and navigation features of many of the world's largest internet sites.
	Elastic Stack	The Elastic Stack, or ELK, contains the following components: Elasticsearch, Logstash, Kibana, and Beats. Elasticsearch is a search and analytics engine. Logstash is a server‑side data processing pipeline that ingests data from multiple sources simultaneously, transforms it, and then sends it to a "stash" like Elasticsearch. Kibana lets users visualize Elasticsearch data with charts and graphs.	No
Machine Learning
	Spark MLlib	MLlib is Spark’s machine learning (ML) library. Its goal is to make practical machine learning scalable and easy. It consists of common learning algorithms and utilities, including classification, regression, clustering, collaborative filtering, and dimensionality reduction, as well as lower-level optimization primitives and higher-level pipeline APIs.	Yes
	Mahout	Apache Mahout is a library of scalable machine-learning algorithms, implemented on top of Apache Hadoop and using the MapReduce paradigm.
	H2O	Open Source Fast Scalable Machine Learning Platform For Smarter Applications (Deep Learning, Gradient Boosting, Random Forest, Generalized Linear Modeling (Logistic Regression, Elastic Net), K-Means, PCA, Stacked Ensembles, Automatic Machine Learning (AutoML), ...)	No
	TensorFlow	TensorFlow is a software library for numerical computation using data flow graphs.
Dependencies

Smoke Tests

Benchmarking