Big data in practice using Spark
Nowadays everybody seems to be working with "big data" and data science. No doubt you too would like to interrogate your voluminous data sources (click streams, social media, relational data, cloud data, sensor data, ...) and are running into the shortcomings of traditional data analytics tools. Maybe you want the processing power of a cluster, with its parallel processing capabilities, to interrogate your distributed data stores.
If fast prototyping and processing speed are a priority, Spark will most likely be the platform of your choice. Apache Spark is an open source processing engine focusing on low latency, ease of use, flexibility and analytics. It is an alternative to the MapReduce approach offered by Hadoop with Hive (cf. our course Big data in practice using Hadoop). Spark has complemented, and in fact superseded, traditional Hadoop, thanks to the higher level of abstraction of its APIs and its faster, in-memory processing.
More specifically, Spark lets you easily interrogate data sources on HDFS, in a NoSQL database (e.g. Cassandra or HBase), in a relational database, in the cloud (e.g. AWS) or in local files. Independently of the data source, a Spark job can run on your local machine (i.e., in development mode), on a Hadoop cluster (with YARN), in a Mesos environment, on Kubernetes, or in the cloud. And all this through a simple Spark script, through a more complex (Java or Python) program, or through a web-based notebook (e.g. Zeppelin).
This course is situated in the framework set forth by the Big data architecture and infrastructure overview course. You will get hands-on practice on Linux with Spark and its libraries. You will learn how to implement robust data processing (in Scala, Python, Java or R) with an SQL-style interface.
After successful completion of the course, you will have sufficient basic expertise to set up a Spark development environment and use it to interrogate your data. You will be able to write simple Spark SQL scripts and programs (with the Scala-based Spark shell or with PySpark) that use the MLlib, GraphX, and Streaming libraries.
Whoever wants to start practising Spark: developers, data architects, and anyone who needs to work with data science technology.
- Motivation for Spark & base concepts
- The Apache Spark project and its components
- Getting to know the Spark architecture and programming model
- The principles of Data Analytics
- Data sources
- Learn how to access data residing in Hadoop HDFS, Cassandra, AWS, or a relational database
- Working with the various programming interfaces and the web interface (specifically: Spark-shell and PySpark)
- Writing and debugging programs for simple data analytic problems
- Data Frames and RDDs
- A short introduction to the use of the Spark libraries
- Machine learning (MLlib)
- Streaming (i.e., processing "volatile" data)
- Parallel computations in trees and graphs (GraphX)
Classroom instruction, supported by practical examples and extensive practical exercises.
It was good, very useful to understand how the Spark objects are used.
(Alejandro del Valle Ponce)
Obtained an overview of Spark and its capabilities. Some try-out exercises to get to know better how Spark works.
I thought the first day went a bit too slow. I guess the content is quite broad, as was the audience, so many things were explained in ample detail and in a lengthy way. The second day was much nicer as it was more to the point of Spark.
(Pinar Kahraman, ING - Haarlerbergpark)
I learned a lot from this training. Quite useful knowledge that can guide my further self-study.
(N.N., ING - Financial Plaza)
The course could have moved on to practical exercises sooner. The audience was technical, so my feeling was that the course could have been covered in one day.
(N.N., Continuum Consulting)
Satisfactory; the examples were clear and relevant. However, I would have liked to spend more time on concrete practice with the material.