Big Data refers to the new technologies and applications introduced to handle increasing Volume, Velocity and Variety of data while enhancing data utilization capabilities such as Variability, Veracity, and Value.

The large Volume of data available forces horizontal scalability of storage and processing and has implications for all the other V-attributes. The increasing Velocity of data ingestion and change implies the need for stream processing, filtering, and processing optimizations. The Variety of data types being generated (e.g. multimedia) requires the use of non-traditional data stores and processing.
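As a rough illustration of the first two points, the sketch below spreads records across a fixed number of partitions by hashing a key (the basic idea behind horizontal scaling) and filters an event stream lazily, one record at a time. The names `partition_for`, `filter_stream`, and `route`, the partition count, and the sample events are all hypothetical and are not taken from any component listed on this page.

```python
import hashlib
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List

NUM_PARTITIONS = 4  # hypothetical cluster size, chosen only for illustration


def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a key to a partition with a stable hash (same key, same partition)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


def filter_stream(events: Iterable[dict], min_value: int) -> Iterator[dict]:
    """Lazily drop events below a threshold, one event at a time."""
    for event in events:
        if event["value"] >= min_value:
            yield event


def route(events: Iterable[dict]) -> Dict[int, List[dict]]:
    """Group events by the partition that owns each event's key."""
    partitions: Dict[int, List[dict]] = defaultdict(list)
    for event in events:
        partitions[partition_for(event["key"])].append(event)
    return dict(partitions)


if __name__ == "__main__":
    sample = [
        {"key": "sensor-a", "value": 3},
        {"key": "sensor-b", "value": 12},
        {"key": "sensor-a", "value": 15},
    ]
    kept = list(filter_stream(sample, min_value=10))
    for partition, batch in sorted(route(kept).items()):
        print(partition, batch)
```

Because the hash is stable, every event for a given key lands on the same partition, so each partition can be stored and processed independently of the others.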

...

Category | Project / Component | Description | Built through Bigtop | Built and Ported

Core Components

Apache Bigtop | Bigtop is a project for infrastructure engineers and data scientists looking for comprehensive packaging, testing, and configuration of the leading open source big data components. The list of projects and components it includes is here. It packages RPMs and DEBs and also provides Vagrant recipes, raw images, and Docker images. | Yes | Yes

Apache Hadoop | Hadoop is a framework that allows for distributed processing of large data sets across clusters of computers using simple programming models. It includes HDFS, MapReduce, YARN, and HCatalog. | Yes | Yes

Operations Components

Apache Ambari | The Apache Ambari project is aimed at making Hadoop management simpler by developing software for provisioning, managing, and monitoring Apache Hadoop clusters. Ambari provides an intuitive, easy-to-use Hadoop management web UI backed by its RESTful APIs. | Yes | Yes

Apache ZooKeeper | ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services. | Yes | Yes

Apache Oozie | Oozie is a workflow scheduler system to manage Apache Hadoop jobs.

Streaming Components

Apache Spark | Apache Spark is a unified analytics engine for large-scale data processing. | Yes

Apache Tez | The Tez project is aimed at building an application framework that allows for a complex directed acyclic graph of tasks for processing data. | Yes | Yes

Apache Flink | Apache Flink is a stream processing framework for distributed, high-performing, always-available, and accurate data streaming applications. | Yes

Apache Kafka | Apache Kafka is a distributed streaming platform capable of handling trillions of events a day. Initially conceived as a messaging queue, Kafka is based on an abstraction of a distributed commit log. | Yes

Apache Beam | Apache Beam is a unified programming model to define and execute data processing pipelines, including ETL, batch, and stream (continuous) processing. | Yes

Apache Flume | Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of streaming event data.

Apache Storm | Apache Storm is a distributed stream processing computation framework written predominantly in the Clojure programming language.

Apache NiFi | Apache NiFi (short for NiagaraFiles) supports powerful and scalable directed graphs of data routing, transformation, and system mediation logic. | No | Yes

Apache MiNiFi | Apache NiFi - MiNiFi is a complementary data collection approach that supplements the core tenets of NiFi in dataflow management, focusing on the collection of data at the source of its creation. | No | Yes

Database and Data Warehousing Components

Apache Hive | Hive is data warehouse software that facilitates reading, writing, and managing large datasets residing in distributed storage, queried using SQL syntax. | Yes | Yes

Apache HBase | HBase is an open-source, non-relational, distributed database modeled after Google's Bigtable. | Yes | Yes

Spark SQL | Spark SQL is a Spark module for structured data processing. | Yes | Yes

Apache Pig | Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turn enables them to handle very large data sets. | Yes | Yes

Apache Phoenix | Apache Phoenix is a massively parallel, relational database engine supporting OLTP for Hadoop using Apache HBase as its backing store.


Apache Impala | Apache Impala is a query engine and native analytic database for Apache Hadoop. Impala is shipped by Cloudera, MapR, Oracle, and Amazon. | No

Apache Drill | Apache Drill is a framework that supports data-intensive distributed applications for interactive analysis of large-scale datasets. | No

PrestoDB | Presto is an open source distributed SQL query engine for running interactive analytic queries against data sources of all sizes, ranging from gigabytes to petabytes. | No

Apache Cassandra | Apache Cassandra is a massively scalable open source non-relational database that offers continuous availability, linear scale performance, operational simplicity, and easy data distribution across multiple data centers and cloud availability zones. | No | Yes

ScyllaDB | Scylla is a distributed NoSQL data store. It supports the same protocols as Cassandra (CQL and Thrift) and the same file formats (SSTable), but is a completely rewritten implementation, using the C++14 language in place of Cassandra's Java and the Seastar asynchronous programming library in place of threads, shared memory, mapped files, and other classic Linux programming techniques. | No

Apache Ignite | Ignite is a memory-centric distributed database, caching, and processing platform for transactional, analytical, and streaming workloads, delivering in-memory speeds at petabyte scale with support for JCache, SQL99, ACID transactions, and machine learning.


Apache Alluxio | Alluxio, formerly Tachyon, enables any application to interact with any data from any storage system at memory speed. | Yes | Yes

Postgres | Postgres is an object-relational database management system (ORDBMS) with an emphasis on extensibility and standards compliance. | No

Memcached | Memcached is a high-performance, multi-threaded, event-based in-memory key-value store for small chunks of arbitrary data (strings, objects) from results of database calls, API calls, or page rendering. | No | Yes

Apache Sqoop | Sqoop is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases.

Apache Falcon | Falcon is a feed processing and feed management system aimed at making it easier for end consumers to onboard their feed processing and feed management on Hadoop clusters.

Redis | Redis is an in-memory data structure store, used as a database, cache, and message broker. | No

Apache Kudu | Kudu provides a combination of fast inserts/updates and efficient columnar scans to enable multiple real-time analytic workloads across a single storage layer. As a new complement to HDFS and Apache HBase, Kudu gives architects the flexibility to address a wider variety of use cases without exotic workarounds.

Governance and Security

Apache Ranger | Apache Ranger is a framework to enable, monitor, and manage comprehensive data security across the Hadoop platform.

Apache Atlas | Atlas is a scalable and extensible set of core foundational governance services that enables enterprises to effectively and efficiently meet their compliance requirements within Hadoop and allows integration with the whole enterprise data ecosystem.

Apache Knox | The Apache Knox Gateway (“Knox”) provides perimeter security so that the enterprise can confidently extend Hadoop access to more users while also maintaining compliance with enterprise security policies.

Apache Sentry | Apache Sentry is a system for enforcing fine-grained, role-based authorization to data and metadata stored on a Hadoop cluster.

File Formats

Apache Avro | Avro is a remote procedure call and data serialization framework developed within Apache's Hadoop project.

Parquet | Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language.

ORC | The Optimized Row Columnar (ORC) file format provides a highly efficient way to store Hive data. It was designed to overcome limitations of the other Hive file formats. Using ORC files improves performance when Hive is reading, writing, and processing data.

RC | RCFile (Record Columnar File) is a data placement structure that determines how to store relational tables on computer clusters.

Data Science Notebooks

Jupyter | The Jupyter Notebook is a web application that allows you to create and share documents that contain live code, equations, visualizations, and narrative text. Uses include data cleaning and transformation, numerical simulation, statistical modeling, data visualization, machine learning, and much more.

Zeppelin | Zeppelin is a web-based notebook that enables data-driven, interactive data analytics and collaborative documents with SQL, Scala, and more.

Analytics Components

Apache Arrow | Apache Arrow is a development platform for in-memory analytics.

Apache Solr | Solr is highly reliable, scalable, and fault tolerant, providing distributed indexing, replication and load-balanced querying, automated failover and recovery, centralized configuration, and more. Solr powers the search and navigation features of many of the world's largest internet sites. | Yes

Elastic Stack | The Elastic Stack, or ELK, contains the following components: Elasticsearch, Logstash, Kibana, and Beats. Elasticsearch is a search and analytics engine. Logstash is a server-side data processing pipeline that ingests data from multiple sources simultaneously, transforms it, and then sends it to a "stash" like Elasticsearch. Kibana lets users visualize data with charts and graphs in Elasticsearch. | No | Yes

Ganglia | Ganglia is a scalable distributed monitoring system for high-performance computing systems such as clusters and grids.

Machine Learning

Spark MLlib | MLlib is Spark’s machine learning (ML) library. Its goal is to make practical machine learning scalable and easy. It consists of common learning algorithms and utilities, including classification, regression, clustering, collaborative filtering, and dimensionality reduction, as well as lower-level optimization primitives and higher-level pipeline APIs. | Yes | Yes

Mahout | Apache Mahout is a library of scalable machine-learning algorithms, implemented on top of Apache Hadoop and using the MapReduce paradigm.

H2O | H2O is an open source, fast, scalable machine learning platform for smarter applications (Deep Learning, Gradient Boosting, Random Forest, Generalized Linear Modeling (Logistic Regression, Elastic Net), K-Means, PCA, Stacked Ensembles, Automatic Machine Learning (AutoML), ...). | No | Yes

TensorFlow | TensorFlow is a software library for numerical computation using data flow graphs. | No | Yes

Dependencies

Smoke Tests

Benchmarking

...
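
The Hadoop entry in the table above refers to "simple programming models", of which MapReduce is the one it names. The following single-process sketch shows only the shape of that model (map, shuffle by key, reduce) using a word count. It does not use the Hadoop API, and the function names `map_phase`, `shuffle`, `reduce_phase`, and `word_count` are made up for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group all emitted values by key, as the framework would between phases."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Collapse all values for one key into a single count."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle, and reduce in sequence inside one process."""
    pairs = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data big clusters", "data streams"]))
```

On a real cluster the map and reduce calls would run on different machines and the shuffle would move data over the network; here everything runs in one process so the data flow is easy to follow.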