== For ETL ==

Use AWS with an EMR cluster reading from and writing to an S3 bucket. To create and run this ETL method, administrators can use the AWS Management Console, the AWS command line, an SDK, or the web service API. The execution path depends on your data needs and programming resources.
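
As a sketch of the SDK route, the Python snippet below uses boto3 (the AWS SDK for Python) to launch a small EMR cluster with Pig and Hive installed. The bucket name, key pair, region and instance sizes are placeholders, and the default EMR IAM roles are assumed to already exist in the account.

<code python>
# Sketch: launch an EMR cluster with Pig and Hive via boto3.
# Bucket, region and key-pair names are placeholders; the default
# EMR roles (EMR_DefaultRole / EMR_EC2_DefaultRole) are assumed to exist.
import boto3

emr = boto3.client('emr', region_name='us-east-1')

response = emr.run_job_flow(
    Name='etl-cluster',
    ReleaseLabel='emr-5.11.0',              # an EMR release current at the time of writing
    Applications=[{'Name': 'Hadoop'}, {'Name': 'Pig'}, {'Name': 'Hive'}],
    LogUri='s3://my-etl-bucket/emr-logs/',  # placeholder bucket
    Instances={
        'InstanceGroups': [
            {'Name': 'Master', 'InstanceRole': 'MASTER',
             'InstanceType': 'm4.large', 'InstanceCount': 1},
            {'Name': 'Core', 'InstanceRole': 'CORE',
             'InstanceType': 'm4.large', 'InstanceCount': 2},
        ],
        'Ec2KeyName': 'my-key-pair',        # placeholder key pair
        'KeepJobFlowAliveWhenNoSteps': True,
    },
    JobFlowRole='EMR_EC2_DefaultRole',
    ServiceRole='EMR_DefaultRole',
)

cluster_id = response['JobFlowId']
print('Started cluster', cluster_id)
</code>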

This method sets up a data pipeline between your source and destination data stores. Once log files are available in your Amazon S3 bucket, you can begin. EMR is a service for processing and analyzing big data: an elastic computing alternative to running an on-premises cluster for processing power. You create an Amazon EMR cluster, then update the EMRFS metadata with the contents of the S3 bucket to keep the source and destination stores in sync. EMR processes the data across a Hadoop cluster in AWS. Using Apache Pig, the integrated dataflow language, you can submit a transformation script to clean the data, then generate reports with Apache Hive's SQL-like query language. When the pipeline runs on a scheduler, the older EMRFS metadata needs to be cleaned out of the cluster on a regular basis.
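
Continuing the sketch above, the steps below could then be submitted to the running cluster: an EMRFS metadata sync, the Pig transformation, and the Hive report. The script paths, bucket and cluster id are placeholders, and the emrfs, pig-script and hive-script commands invoked through command-runner.jar are assumed to be available on the cluster's master node, as on standard EMR releases.

<code python>
# Sketch: sync EMRFS metadata with the bucket, run a Pig transformation,
# then a Hive report query, as steps on the running cluster.
# Script locations, bucket and cluster id are placeholders.
import boto3

emr = boto3.client('emr', region_name='us-east-1')

steps = [
    {
        'Name': 'Sync EMRFS metadata with S3',
        'ActionOnFailure': 'CONTINUE',
        'HadoopJarStep': {
            'Jar': 'command-runner.jar',
            # The emrfs CLI runs on the master node; an 'emrfs delete' step
            # can be scheduled the same way to purge stale metadata.
            'Args': ['emrfs', 'sync', 's3://my-etl-bucket/logs/'],
        },
    },
    {
        'Name': 'Pig transformation (clean the raw logs)',
        'ActionOnFailure': 'CONTINUE',
        'HadoopJarStep': {
            'Jar': 'command-runner.jar',
            'Args': ['pig-script', '--run-pig-script', '--args',
                     '-f', 's3://my-etl-bucket/scripts/clean_logs.pig'],
        },
    },
    {
        'Name': 'Hive report (SQL-like query over the cleaned data)',
        'ActionOnFailure': 'CONTINUE',
        'HadoopJarStep': {
            'Jar': 'command-runner.jar',
            'Args': ['hive-script', '--run-hive-script', '--args',
                     '-f', 's3://my-etl-bucket/scripts/report.q'],
        },
    },
]

emr.add_job_flow_steps(JobFlowId='j-XXXXXXXXXXXXX', Steps=steps)
</code>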

== Links and Reference ==

https://www.rittmanmead.com/blog/2016/12/etl-offload-with-spark-and-amazon-emr-part-3-running-pyspark-on-emr/
 