Spark can be downloaded from the Apache Spark website. It provides a number of services via an API:

  • Spark SQL for SQL and structured data processing
  • MLlib for machine learning
  • GraphX for graph processing
  • Spark Streaming for stream processing


You can start a standalone master server by executing:

./sbin/start-master.sh
By default, the master serves its web UI at http://localhost:8080 (a port that is often already in use by other services).

Once started, the master will print out a spark://HOST:PORT URL for itself, which you can use to connect workers to it, or pass as the “master” argument to SparkContext.
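As a sketch of that hand-off (the hostname here is an assumption, and 7077 is only the default master port — substitute the spark:// URL your master actually prints), the URL can be captured in a variable and passed to spark-shell or a SparkContext:

```shell
# Assumed host and port for illustration; use the URL printed by your master.
SPARK_MASTER_HOST=localhost
SPARK_MASTER_PORT=7077   # default standalone master port
MASTER_URL="spark://${SPARK_MASTER_HOST}:${SPARK_MASTER_PORT}"
echo "$MASTER_URL"

# Then pass it as the "master" argument, e.g.:
#   ./bin/spark-shell --master "$MASTER_URL"
```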

Similarly, you can start one or more workers and connect them to the master via:

./sbin/start-worker.sh <master-spark-URL>

(In Spark releases before 3.1 this script was named start-slave.sh.)

Once you have started a worker, look at the master’s web UI (http://localhost:8080 by default). You should see the new node listed there, along with its number of CPUs and memory (minus one gigabyte left for the OS).
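If the defaults (all cores, total memory minus 1 GB) are too generous, a worker's advertised resources can be capped at launch with the start script's --cores and --memory options. A hedged sketch — the 4-core / 8 GB figures and the master URL are arbitrary examples, and the command is only assembled and printed here, not executed:

```shell
# Illustrative values only; defaults are all cores and total RAM minus 1 GB.
WORKER_CORES=4
WORKER_MEMORY=8g
CMD="./sbin/start-worker.sh --cores ${WORKER_CORES} --memory ${WORKER_MEMORY} spark://localhost:7077"
echo "$CMD"
```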



Lightweight ETL Framework for Java from northConcepts


The Cascading Ecosystem is a collection of applications, languages, and APIs for developing data-intensive applications.

Cascading 4 has a new sub-project that provides Cascading local-mode Tap implementations for Amazon S3 and Apache Kafka. Cascading local mode (without Hadoop) can be used in constrained, lightweight environments such as AWS Lambda.

Project on github:

Apache Oozie

Apache Oozie is a Java web application used for scheduling Apache Hadoop jobs.

It is a reliable, scalable tool that combines multiple jobs sequentially into a single logical unit of work. It comes with built-in support for various Hadoop job types such as Java MapReduce, Streaming MapReduce, Pig, Hive, Sqoop, and DistCp.

Apache Oozie also supports scheduling jobs specific to a system, such as shell scripts and Java programs.
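The "single logical unit" Oozie builds is defined in a workflow XML file. A minimal hedged sketch of a workflow with one shell action — the workflow name, action name, and myscript.sh are made-up placeholders, while jobTracker and nameNode are the conventional parameters supplied from a job.properties file:

```xml
<workflow-app xmlns="uri:oozie:workflow:0.5" name="demo-wf">
  <start to="run-script"/>
  <action name="run-script">
    <shell xmlns="uri:oozie:shell-action:0.2">
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <exec>myscript.sh</exec>
      <file>myscript.sh</file>
    </shell>
    <ok to="end"/>
    <error to="fail"/>
  </action>
  <kill name="fail">
    <message>Shell action failed</message>
  </kill>
  <end name="end"/>
</workflow-app>
```

Multi-job pipelines chain further actions by pointing each action's ok transition at the next one.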

comparison_of_hadoop_based_tools.txt · Last modified: 2018/01/29 01:29 by root
© CrosswireDigitialMedia Ltd