Spark

Spark can be downloaded from https://spark.apache.org/. On top of its core API, Spark provides several libraries:

  • Spark SQL for SQL and structured data processing
  • MLlib for machine learning
  • GraphX for graph processing
  • Spark Streaming for stream processing

Starting

You can start a standalone master server by executing:

./sbin/start-master.sh

By default the master serves its web UI at http://localhost:8080, a port that is often already in use by other services.
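If port 8080 is taken, the master's web UI can be moved with the --webui-port option of start-master.sh. A minimal sketch, assuming 8081 is free on your machine (the port number is only an illustration):

```shell
# Hypothetical alternative port; pick any port that is free on your machine.
WEBUI_PORT=8081

# Run from the Spark installation directory; report and carry on elsewhere.
if [ -x ./sbin/start-master.sh ]; then
    ./sbin/start-master.sh --webui-port "$WEBUI_PORT"
else
    echo "start-master.sh not found; run this from the Spark home directory" >&2
fi
```

The same setting can be made permanent with the SPARK_MASTER_WEBUI_PORT variable in conf/spark-env.sh.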

Once started, the master prints a spark://HOST:PORT URL for itself, which you can use to connect workers to it, or pass as the “master” argument to SparkContext.
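As an illustration of passing the master URL to a client, the sketch below builds the URL and prints a spark-shell command that uses it. The helper name master_url and the host "localhost" are assumptions made for this example; 7077 is the default port of a standalone master.

```shell
# Build a standalone master URL from a host and an optional port.
# 7077 is Spark's default standalone master port.
master_url() {
    host="$1"
    port="${2:-7077}"
    echo "spark://${host}:${port}"
}

# "localhost" is a placeholder; use the host the master printed at startup.
MASTER_URL="$(master_url localhost)"

# Printed rather than executed so the sketch runs without a Spark install.
echo "./bin/spark-shell --master ${MASTER_URL}"
```

In practice, copy the URL from the master's log or web UI rather than constructing it, since the host name it advertises may differ from the one you expect.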

Similarly, you can start one or more workers and connect them to the master (the script is named start-worker.sh from Spark 3.1 onward):

./sbin/start-slave.sh <master-spark-URL>

Once you have started a worker, look at the master’s web UI (http://localhost:8080 by default). You should see the new node listed there, along with its number of CPUs and memory (minus one gigabyte left for the OS).
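The memory figure shown for a worker follows from simple arithmetic: the machine's total RAM minus the one gigabyte left for the OS. A small sketch of that calculation, where default_worker_memory_gb is a hypothetical helper written for this example and not part of Spark:

```shell
# Memory a worker offers by default: total RAM minus 1 GB reserved for the OS.
# Takes the machine's total RAM in whole gigabytes (hypothetical helper).
default_worker_memory_gb() {
    total_gb="$1"
    echo $(( total_gb - 1 ))
}

# A machine with 16 GB of RAM offers 15 GB to Spark by default.
default_worker_memory_gb 16
```

To offer less than the default, the worker script accepts --cores and --memory, for example: ./sbin/start-slave.sh spark://HOST:PORT --cores 2 --memory 4G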

StreamSets

DataPipeline

A lightweight ETL framework for Java from North Concepts.

Cascading

The Cascading Ecosystem is a collection of applications, languages, and APIs for developing data-intensive applications.

Cascading 4 has a new sub-project that provides Cascading local mode Tap implementations for Amazon S3 and Apache Kafka. Cascading local mode (without Hadoop) can be used in constrained and lightweight environments like AWS Lambda.

Project on GitHub: https://github.com/cwensel/cascading-local

Apache Oozie

Apache Oozie is a Java web application used for scheduling Apache Hadoop jobs.

It is a reliable, scalable tool that combines multiple jobs sequentially into a single logical unit of work. It has built-in support for several Hadoop job types, such as Java MapReduce, Streaming MapReduce, Pig, Hive, Sqoop and DistCp.

Oozie can also schedule system-specific jobs such as shell scripts and Java programs.

 
comparison_of_hadoop_based_tools.txt · Last modified: 2018/01/29 01:29 by root
 