Big Data Projects / Components

Big Data refers to the new technologies and applications introduced to handle increasing Volume, Velocity and Variety of data while enhancing data utilization capabilities such as Variability, Veracity, and Value.

The large Volume of data available forces horizontal scalability of storage and processing and has implications for all the other V-attributes. The increasing Velocity of data ingestion and change implies the need for stream processing, filtering, and processing optimizations. The Variety of data types (e.g. multimedia) being generated requires the use of non-traditional data stores and processing.

 

Category

Project / Component

Description

Link

Built through Bigtop

Built and Ported

Upstream Build

CI

Smoke tests complete

Testing in Clustered environment

Core Components

Apache Bigtop

Bigtop is a project for Infrastructure Engineers and Data Scientists looking for comprehensive packaging, testing, and configuration of the leading open source big data components. The list of projects/components it includes is here. It packages RPMs and DEBs and also provides Vagrant recipes, raw images, and Docker images.

collaborate

JIRA

link

Apache Hadoop

Apache Zookeeper

Apache Spark

Apache Hive

Apache HBase

Apache Ambari

 

Apache Hadoop

Hadoop is a framework that allows for distributed processing of large data sets across clusters of computers using simple programming models. It includes HDFS, MapReduce, YARN, and HCatalog.

collaborate

JIRA

RPM DEB

Tarball DEB

 

 

 

Operational Components

Apache Ambari

Apache Ambari makes Hadoop cluster provisioning, managing, and monitoring dead simple. Ambari provides an intuitive, easy-to-use Hadoop management web UI backed by its RESTful APIs.

collaborate

JIRA

RPM DEB

 

 

 


Apache Zookeeper

ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services. 

collaborate

JIRA

RPM DEB

 

 

 

 


Apache Oozie

Oozie is a workflow scheduler system to manage Apache Hadoop jobs.

collaborate

JIRA

RPM DEB

 

 

 

 

Streaming Components and Pipelines

Apache Spark

Apache Spark is a unified analytics engine for large-scale data processing.

collaborate

JIRA

RPM DEB

Tarball

 

 

 

Apache Tez

The Apache Tez project aims to build an application framework that allows a complex directed acyclic graph (DAG) of tasks for processing data.

collaborate

JIRA

RPM DEB
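The directed-acyclic-graph idea in the Tez description can be illustrated with a minimal sketch. This is plain Python using the standard-library `graphlib` module, not the Tez API; the task names and the `run_dag` helper are hypothetical and exist only for this example. Each task runs once all of its upstream tasks have finished and receives their results.

```python
from graphlib import TopologicalSorter


def run_dag(tasks, deps):
    """Run every task after all of its upstream tasks.

    tasks: name -> function taking a dict of upstream results
    deps:  name -> set of upstream task names
    Returns the execution order and the results keyed by task name.
    """
    results = {}
    # static_order() yields each node only after all of its predecessors.
    order = list(TopologicalSorter(deps).static_order())
    for name in order:
        upstream = {d: results[d] for d in deps.get(name, ())}
        results[name] = tasks[name](upstream)
    return order, results


# Hypothetical four-task graph: two independent reads feed a join, then a count.
tasks = {
    "read_users": lambda up: [("u1", "Ann"), ("u2", "Bob")],
    "read_clicks": lambda up: [("u1", "/home"), ("u1", "/cart"), ("u2", "/home")],
    "join": lambda up: [
        (name, page)
        for uid, name in up["read_users"]
        for cid, page in up["read_clicks"]
        if uid == cid
    ],
    "count": lambda up: len(up["join"]),
}
deps = {
    "read_users": set(),
    "read_clicks": set(),
    "join": {"read_users", "read_clicks"},
    "count": {"join"},
}

order, results = run_dag(tasks, deps)
print(order)
print(results["count"])
```

The two read tasks have no dependency on each other, so a real engine could run them in parallel; this sketch runs them sequentially for simplicity.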

 

 

 

 

Apache Flink

Apache Flink is a stream processing framework for distributed, high-performing, always-available, and accurate data streaming applications.

collaborate

JIRA

RPM DEB

 

 

 

Apache Kafka

Apache Kafka is a distributed streaming platform capable of handling trillions of events a day. Initially conceived as a messaging queue, Kafka is based on an abstraction of a distributed commit log.

collaborate

JIRA

RPM DEB
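The commit-log abstraction mentioned in the Kafka description can be sketched in a few lines. This is a single-process, in-memory illustration, not Kafka's client API; the `CommitLog` and `Consumer` classes are hypothetical names chosen for this example. The point it shows is that records are only ever appended, each record gets a sequential offset, and every consumer tracks its own position independently.

```python
class CommitLog:
    """Append-only sequence of records addressed by offset (illustrative only)."""

    def __init__(self):
        self._records = []

    def append(self, record):
        """Store a record at the end of the log and return its offset."""
        self._records.append(record)
        return len(self._records) - 1

    def read(self, offset, max_records=10):
        """Return up to max_records records starting at the given offset."""
        return self._records[offset:offset + max_records]


class Consumer:
    """Reads from a log and remembers how far it has read."""

    def __init__(self, log):
        self.log = log
        self.offset = 0

    def poll(self, max_records=10):
        batch = self.log.read(self.offset, max_records)
        self.offset += len(batch)
        return batch


log = CommitLog()
offsets = [log.append(event) for event in ("signup", "login", "purchase")]

fast = Consumer(log)
slow = Consumer(log)

fast_batch = fast.poll()                # reads all three records
slow_batch = slow.poll(max_records=1)   # reads only the first record
print(offsets, fast_batch, slow_batch, slow.offset)
```

Because reading never removes anything from the log, the slow consumer can catch up later from its own offset without affecting the fast one.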

 

 

 

Apache Beam

Apache Beam is a unified programming model to define and execute data processing pipelines, including ETL, batch, and stream (continuous) processing.

collaborate

JIRA

 

 

 

Apache Flume

Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of streaming event data.

collaborate

JIRA

RPM DEB

 

 

 

 

Apache Storm

Apache Storm is a distributed stream processing computation framework written predominantly in the Clojure programming language. 

collaborate

JIRA

 

 

 

 

 

Apache NiFi

Apache NiFi (short for NiagaraFiles) supports powerful and scalable directed graphs of data routing, transformation, and system mediation logic.

collaborate

JIRA

 

 

 

Apache MiNiFi

Apache NiFi - MiNiFi is a complementary data collection approach that supplements the core tenets of NiFi in dataflow management, focusing on the collection of data at the source of its creation.

collaborate

JIRA

 

 

 

Apache Apex

Apache Apex is a unified platform for big data stream and batch processing. Use cases include ingestion, ETL, real-time analytics, alerts and real-time actions. 

collaborate

JIRA

RPM DEB

 

 

 

 

Apache Crunch

The Apache Crunch Java library provides a framework for writing, testing, and running MapReduce pipelines. Its goal is to make pipelines that are composed of many user-defined functions simple to write, easy to test, and efficient to run.

 

RPM DEB

 

 

 

 

 

Apache Edgent

Apache Edgent is a programming model and micro-kernel style runtime that can be embedded in gateways and small footprint edge devices enabling local, real-time, analytics on the continuous streams of data coming from equipment, vehicles, systems, appliances, devices and sensors of all kinds (for example, Raspberry Pis or smart phones). Working in conjunction with centralized analytic systems, Apache Edgent provides efficient and timely analytics across the whole IoT ecosystem: from the center to the edge.

 

 

 

 

 

 

Database and Data Warehousing Components

Apache Hive

Hive is data warehouse software that facilitates reading, writing, and managing large datasets residing in distributed storage, queried using SQL syntax.

collaborate

JIRA

RPM DEB

 

 

 

 

Apache HBase

HBase is an open-source, non-relational, distributed database modeled after Google's Bigtable.

collaborate

JIRA

RPM DEB

 

 

 

 

SparkSQL

Spark SQL is a Spark module for structured data processing.

 

RPM DEB

 

 

 

 

Apache Pig

Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turn enables them to handle very large data sets.

collaborate

JIRA

RPM DEB

 

 

 

 

Apache DataFu

Apache DataFu consists of two libraries: Apache DataFu Pig is a collection of useful user-defined functions for data analysis in Apache Pig. Apache DataFu Hourglass is a library for incrementally processing data using Apache Hadoop MapReduce. This library was inspired by the prevalence of sliding window computations over daily tracking data. Computations such as these typically happen at regular intervals (e.g. daily, weekly), and therefore the sliding nature of the computations means that much of the work is unnecessarily repeated. DataFu’s Hourglass was created to make these computations more efficient, yielding sometimes 50-95% reductions in computational resources.

 

RPM DEB
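The repeated work that the DataFu Hourglass description refers to can be illustrated with a small in-memory sketch. This is not the Hourglass API, which runs on Hadoop MapReduce; the `SlidingWindowSum` class and the sample daily counts are hypothetical. It shows only the general idea: when the window slides by one day, the previous total is reused by adding the new day and subtracting the day that dropped out, rather than re-summing the whole window.

```python
from collections import deque


class SlidingWindowSum:
    """Maintain a sum over the last `window_days` daily values incrementally."""

    def __init__(self, window_days):
        self.window_days = window_days
        self._days = deque()
        self.total = 0

    def add_day(self, value):
        """Add one day's value and return the updated window total."""
        self._days.append(value)
        self.total += value
        if len(self._days) > self.window_days:
            # Drop the day that just left the window instead of re-summing.
            self.total -= self._days.popleft()
        return self.total


def naive_window_sums(daily, window_days):
    """Recompute every window from scratch (the repeated work to avoid)."""
    return [
        sum(daily[max(0, i - window_days + 1): i + 1])
        for i in range(len(daily))
    ]


daily_counts = [5, 3, 8, 2, 7]  # hypothetical daily event counts
window = SlidingWindowSum(window_days=3)
incremental = [window.add_day(count) for count in daily_counts]
print(incremental)
print(naive_window_sums(daily_counts, 3))
```

Both approaches give the same totals, but the incremental version touches two values per day regardless of the window length, while the naive version re-reads the entire window each time.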

 

 

 

 

 

Apache Phoenix

Apache Phoenix is a massively parallel, relational database engine supporting OLTP for Hadoop using Apache HBase as its backing store.

 

RPM DEB

 

 

 

 

 

Apache Impala

Apache Impala is a query engine and native analytic database for Apache Hadoop. Impala is shipped by Cloudera, MapR, Oracle, and Amazon.

 

 

 

 

 

 

Apache Drill

Apache Drill is a framework that supports data-intensive distributed applications for interactive analysis of large-scale datasets.

 

 

 

 

 

 

PrestoDB

Presto is an open source distributed SQL query engine for running interactive analytic queries against data sources of all sizes ranging from gigabytes to petabytes.

 

 

 

 

 

 

Apache Cassandra

Apache Cassandra is a massively scalable open source non-relational database that offers continuous availability, linear scale performance, operational simplicity and easy data distribution across multiple data centers and cloud availability zones. 

collaborate

JIRA

 

 

 

 

ScyllaDB

Scylla is a distributed NoSQL data store. It supports the same protocols as Cassandra (CQL and Thrift) and the same file formats (SSTable), but is a completely rewritten implementation: it uses C++14 in place of Cassandra's Java, and the Seastar asynchronous programming library in place of threads, shared memory, mapped files, and other classic Linux programming techniques.

collaborate

JIRA

 

 

 

 

 

Apache CouchDB

Apache CouchDB is a database that completely embraces the web. Store your data with JSON documents. Access your documents with your web browser, via HTTP. Query, combine, and transform your documents with JavaScript. Apache CouchDB works well with modern web and mobile apps. You can even serve web apps directly out of Apache CouchDB. And you can distribute your data, or your apps, efficiently using Apache CouchDB’s incremental replication. Apache CouchDB supports master-master setups with automatic conflict detection.

 

 

 

 

 

 

Apache Ignite

Ignite is a memory-centric distributed database, caching, and processing platform for transactional, analytical, and streaming workloads delivering in-memory speeds at petabyte scale with support for JCache, SQL99, ACID transactions, and machine learning. 

collaborate

JIRA

RPM DEB

 

 

 

 

Alluxio

Alluxio, formerly Tachyon, enables any application to interact with any data from any storage system at memory speed.

 

RPM DEB

 

 

 

 

Postgres

Postgres is an object-relational database management system (ORDBMS) with an emphasis on extensibility and standards compliance.

collaborate

JIRA

 

 

 

 

 

Memcached

Memcached is a high performance multi-threaded event-based in-memory key-value store for small chunks of arbitrary data (strings, objects) from results of database calls, API calls, or page rendering.

collaborate

JIRA