Big Data Projects / Components

Big Data refers to the new technologies and applications introduced to handle increasing Volume, Velocity and Variety of data while enhancing data utilization capabilities such as Variability, Veracity, and Value.

The large Volume of data available forces horizontal scalability of storage and processing and has implications for all the other V-attributes. The increasing Velocity of data ingestion and change implies the need for stream processing, filtering, and processing optimizations. The Variety of data types (e.g. multimedia) being generated requires the use of non-traditional data stores and processing.
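
The two ideas above, horizontal scalability and stream filtering, can be illustrated with a minimal sketch in plain Python. It is not taken from any of the components listed below; the function names (`partition_for`, `distribute`, `filter_stream`) and the record layout (`key`, `value`) are hypothetical and chosen only for illustration. Real systems such as HDFS, Kafka, or Flink use their own partitioning and streaming APIs.

```python
import zlib
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List


def partition_for(key: str, num_nodes: int) -> int:
    """Map a key to a node index with a stable hash (assumed scheme: CRC32 mod N)."""
    return zlib.crc32(key.encode("utf-8")) % num_nodes


def distribute(records: Iterable[dict], num_nodes: int) -> Dict[int, List[dict]]:
    """Spread records over nodes so storage and processing can scale horizontally."""
    shards: Dict[int, List[dict]] = defaultdict(list)
    for record in records:
        shards[partition_for(record["key"], num_nodes)].append(record)
    return dict(shards)


def filter_stream(events: Iterable[dict], threshold: float) -> Iterator[dict]:
    """Lazily drop low-value events as they arrive, without buffering the stream."""
    for event in events:
        if event["value"] >= threshold:
            yield event


if __name__ == "__main__":
    records = [{"key": f"user-{i % 4}", "value": i} for i in range(10)]
    shards = distribute(records, num_nodes=3)
    for node, items in sorted(shards.items()):
        print(node, [r["value"] for r in items])
    print([e["value"] for e in filter_stream(records, threshold=7)])
```

Because the hash is stable, every record with the same key lands on the same node, which is what lets each node process its share independently. The generator in `filter_stream` handles one event at a time, so memory use does not grow with the length of the stream.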

Category	Description	Link	Built through Bigtop	Upstream Build	CI	Smoke tests complete
Core Components
Apache Bigtop	Bigtop is a project for infrastructure engineers and data scientists looking for comprehensive packaging, testing, and configuration of the leading open source big data components. The list of projects/components included in it is here. It packages RPMs and DEBs and also provides Vagrant recipes, raw images and Docker images.	collaborate JIRA			link	Apache Hadoop Apache Zookeeper Apache Spark Apache Hive Apache HBase Apache Ambari
Apache Hadoop	Hadoop is a framework that allows for distributed processing of large data sets across clusters of computers using simple programming models. It includes HDFS, MapReduce, YARN, and HCatalog.	collaborate JIRA	RPM DEB	Tarball DEB
Operational Components
Apache Ambari	Apache Ambari makes Hadoop cluster provisioning, managing, and monitoring dead simple. Ambari provides an intuitive, easy-to-use Hadoop management web UI backed by its RESTful APIs.	collaborate JIRA	RPM DEB
Apache Zookeeper	ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services.	collaborate JIRA	RPM DEB
Apache Oozie	Oozie is a workflow scheduler system to manage Apache Hadoop jobs.	collaborate JIRA	RPM DEB
Streaming Components and Pipelines
Apache Spark	Apache Spark is a unified analytics engine for large-scale data processing.	collaborate JIRA	RPM DEB	Tarball
Apache Tez	The Tez project is aimed at building an application framework that allows for a complex directed acyclic graph of tasks for processing data.	collaborate JIRA	RPM DEB
Apache Flink	Apache Flink is a stream processing framework for distributed, high-performing, always-available, and accurate data streaming applications.	collaborate JIRA	RPM DEB
Apache Kafka	Apache Kafka is a distributed streaming platform capable of handling trillions of events a day. Initially conceived as a messaging queue, Kafka is based on an abstraction of a distributed commit log	collaborate JIRA	RPM DEB
Apache Beam	Apache Beam is a unified programming model to define and execute data processing pipelines, including ETL, batch and stream (continuous) processing.	collaborate JIRA
Apache Flume	Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of streaming event data.	collaborate JIRA	RPM DEB
Apache Storm	Apache Storm is a distributed stream processing computation framework written predominantly in the Clojure programming language.	collaborate JIRA
Apache NiFi	Apache NiFi (short for NiagaraFiles) supports powerful and scalable directed graphs of data routing, transformation, and system mediation logic	collaborate JIRA
Apache MiNiFi	Apache NiFi - MiNiFi is a complementary data collection approach that supplements the core tenets of NiFi in dataflow management, focusing on the collection of data at the source of its creation.	collaborate JIRA
Apache Apex	Apache Apex is a unified platform for big data stream and batch processing. Use cases include ingestion, ETL, real-time analytics, alerts and real-time actions.	collaborate JIRA	RPM DEB
Apache Crunch	The Apache Crunch Java library provides a framework for writing, testing, and running MapReduce pipelines. Its goal is to make pipelines that are composed of many user-defined functions simple to write, easy to test, and efficient to run.		RPM DEB
Apache Edgent	Apache Edgent is a programming model and micro-kernel style runtime that can be embedded in gateways and small footprint edge devices enabling local, real-time, analytics on the continuous streams of data coming from equipment, vehicles, systems, appliances, devices and sensors of all kinds (for example, Raspberry Pis or smart phones). Working in conjunction with centralized analytic systems, Apache Edgent provides efficient and timely analytics across the whole IoT ecosystem: from the center to the edge.
Database and Data Warehousing Components
Apache Hive	Hive is data warehouse software that facilitates reading, writing, and managing large datasets residing in distributed storage, queried using SQL syntax.	collaborate JIRA	RPM DEB
Apache HBase	HBase is an open-source, non-relational, distributed database modeled after Google's Bigtable	collaborate JIRA	RPM DEB
SparkSQL	Spark SQL is a Spark module for structured data processing		RPM DEB
Apache Pig	Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turn enables them to handle very large data sets.	collaborate JIRA	RPM DEB
Apache DataFu	Apache DataFu consists of two libraries: Apache DataFu Pig is a collection of useful user-defined functions for data analysis in Apache Pig. Apache DataFu Hourglass is a library for incrementally processing data using Apache Hadoop MapReduce. This library was inspired by the prevalence of sliding window computations over daily tracking data. Computations such as these typically happen at regular intervals (e.g. daily, weekly), and therefore the sliding nature of the computations means that much of the work is unnecessarily repeated. DataFu’s Hourglass was created to make these computations more efficient, yielding sometimes 50-95% reductions in computational resources.		RPM DEB
Apache Phoenix	Apache Phoenix is a massively parallel, relational database engine supporting OLTP for Hadoop using Apache HBase as its backing store.		RPM DEB
Apache Impala	Apache Impala is a query engine and native analytic database for Apache Hadoop. Impala is shipped by Cloudera, MapR, Oracle, and Amazon
Apache Drill	Apache Drill is a framework that supports data-intensive distributed applications for interactive analysis of large-scale datasets
PrestoDB	Presto is an open source distributed SQL query engine for running interactive analytic queries against data sources of all sizes ranging from gigabytes to petabytes.
Apache Cassandra	Apache Cassandra is a massively scalable open source non-relational database that offers continuous availability, linear scale performance, operational simplicity and easy data distribution across multiple data centers and cloud availability zones.	collaborate JIRA
ScyllaDB	Scylla is a distributed NoSQL data store. It supports the same protocols as Cassandra (CQL and Thrift) and the same file formats (SSTable), but is a completely rewritten implementation, with the C++14 language replacing Cassandra's Java, and the Seastar asynchronous programming library replacing threads, shared memory, mapped files, and other classic Linux programming techniques.	collaborate JIRA
Apache CouchDB	Apache CouchDB is a database that completely embraces the web. Store your data with JSON documents. Access your documents with your web browser, via HTTP. Query, combine, and transform your documents with JavaScript. Apache CouchDB works well with modern web and mobile apps. You can even serve web apps directly out of Apache CouchDB. And you can distribute your data, or your apps, efficiently using Apache CouchDB’s incremental replication. Apache CouchDB supports master-master setups with automatic conflict detection
Apache Ignite	Ignite is a memory-centric distributed database, caching, and processing platform for transactional, analytical, and streaming workloads delivering in-memory speeds at petabyte scale with support for JCache, SQL99, ACID transactions, and machine learning.	collaborate JIRA	RPM DEB
Apache Alluxio	Alluxio, formerly Tachyon, enables any application to interact with any data from any storage system at memory speed.		RPM DEB
Postgres	Postgres is an object-relational database management system (ORDBMS) with an emphasis on extensibility and standards compliance.	collaborate JIRA
Memcached	Memcached is a high performance multi-threaded event-based in-memory key-value store for small chunks of arbitrary data (strings, objects) from results of database calls, API calls, or page rendering.	collaborate JIRA
Apache Sqoop	Sqoop is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases.	collaborate JIRA	RPM DEB
Apache Falcon	Falcon is a feed processing and feed management system aimed at making it easier for end consumers to onboard their feed processing and feed management on Hadoop clusters.
Redis	Redis is an in-memory data structure store, used as a database, cache and message broker.	collaborate JIRA
Apache Kudu	Kudu provides a combination of fast inserts/updates and efficient columnar scans to enable multiple real-time analytic workloads across a single storage layer. As a new complement to HDFS and Apache HBase, Kudu gives architects the flexibility to address a wider variety of use cases without exotic workarounds.	collaborate JIRA
Apache Greenplum	Greenplum, a PostgreSQL-based database, competes in the same market as Teradata, Exadata, and Redshift. Greenplum is a data warehousing solution with built-in analytics.		RPM DEB
ArangoDB	A native multi-model database.	collaborate JIRA
Governance and Security
Apache Ranger	Apache Ranger is a framework to enable, monitor and manage comprehensive data security across the Hadoop platform.	collaborate JIRA
Apache Atlas	Atlas is a scalable and extensible set of core foundational governance services, enabling enterprises to effectively and efficiently meet their compliance requirements within Hadoop and allowing integration with the whole enterprise data ecosystem.	collaborate JIRA
Apache Knox	The Apache Knox Gateway (“Knox”) provides perimeter security so that the enterprise can confidently extend Hadoop access to more of those new users while also maintaining compliance with enterprise security policies	collaborate JIRA
Apache Sentry	Apache Sentry is a system for enforcing fine grained role based authorization to data and metadata stored on a Hadoop cluster.	collaborate JIRA
File Formats
Apache Avro	Avro is a remote procedure call and data serialization framework developed within Apache's Hadoop project.	collaborate JIRA
Parquet	Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language.
ORC	The Optimized Row Columnar (ORC) file format provides a highly efficient way to store Hive data. It was designed to overcome limitations of the other Hive file formats. Using ORC files improves performance when Hive is reading, writing, and processing data
RC	RCFile (Record Columnar File) is a data placement structure that determines how to store relational tables on computer clusters
Data Science Notebooks
Jupyter	The Jupyter Notebook is a web application that allows you to create and share documents that contain live code, equations, visualizations and narrative text. Uses include: data cleaning and transformation, numerical simulation, statistical modeling, data visualization, machine learning, and much more.	collaborate JIRA
Zeppelin	Zeppelin is a Web-based notebook that enables data-driven, interactive data analytics and collaborative documents with SQL, Scala and more	collaborate JIRA	RPM DEB
Analytics Components
Apache Arrow	Apache Arrow is a development platform for in-memory analytics.	collaborate JIRA
Apache Solr	Solr is highly reliable, scalable and fault tolerant, providing distributed indexing, replication and load-balanced querying, automated failover and recovery, centralized configuration and more. Solr powers the search and navigation features of many of the world's largest internet sites.	collaborate JIRA	RPM DEB
Elastic Stack	The Elastic Stack, or ELK, contains the following components: Elasticsearch, Logstash, Kibana and Beats. Elasticsearch is a search and analytics engine. Logstash is a server-side data processing pipeline that ingests data from multiple sources simultaneously, transforms it, and then sends it to a "stash" like Elasticsearch. Kibana lets users visualize data in Elasticsearch with charts and graphs.	collaborate JIRA
Ganglia	Ganglia is a scalable distributed monitoring system for high-performance computing systems such as clusters and Grids.
Machine Learning
Spark MLlib	MLlib is Spark’s machine learning (ML) library. Its goal is to make practical machine learning scalable and easy. It consists of common learning algorithms and utilities, including classification, regression, clustering, collaborative filtering, dimensionality reduction, as well as lower-level optimization primitives and higher-level pipeline APIs.		RPM DEB
Mahout	Apache Mahout is a library of scalable machine-learning algorithms, implemented on top of Apache Hadoop and using the MapReduce paradigm.		RPM DEB
H2O	Open Source Fast Scalable Machine Learning Platform For Smarter Applications (Deep Learning, Gradient Boosting, Random Forest, Generalized Linear Modeling (Logistic Regression, Elastic Net), K-Means, PCA, Stacked Ensembles, Automatic Machine Learning (AutoML), ...)	collaborate JIRA
TensorFlow	TensorFlow is a software library for numerical computation using data flow graphs.	collaborate JIRA
Dependencies

Smoke Tests

Benchmarking
TPC Benchmarking Tools	TPC Benchmarks consist of the TPC-C, TPC-DI, TPC-DS, TPC-E, TPC-H, TPCx-BB, TPCx-HS, and TPCx-V benchmarking tools.	collaborate JIRA
Big Bench	The BigBench specification comprises two key components: a data model specification, and a workload/query specification.	collaborate JIRA
HiBench	HiBench is a big data benchmark suite that helps evaluate different big data frameworks in terms of speed, throughput and system resource utilizations.	collaborate JIRA
AMPLab	Compares performance of five popular SQL query engines in the Big Data ecosystem on common types of queries, which can be reproduced through publicly available scripts and datasets.	collaborate JIRA