Spark 2016

Introduction

Apache Spark is a fast and general-purpose cluster computing system. It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs.[1] Spark comes with Apache Hadoop, a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.[2]
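
As a brief illustration of the high-level API, here is a minimal word-count sketch using PySpark. The script name and the input path "input.txt" are placeholders for this example, not files provided by the module:

# minimal_wordcount.py -- a minimal PySpark sketch; "input.txt" is a placeholder path
from pyspark.sql import SparkSession

# Create (or reuse) a SparkSession, the entry point to the Spark API
spark = SparkSession.builder.appName("MinimalWordCount").getOrCreate()

# Read a plain text file into an RDD of lines
lines = spark.sparkContext.textFile("input.txt")

# Split lines into words, map each word to (word, 1), then sum the counts per word
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))

# Collect the (word, count) pairs to the driver and print them
for word, count in counts.collect():
    print(word, count)

spark.stop()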

Installed version

The currently installed version is:

on hpc-env/6.4
Spark/2.4.0-intel-2018a-Hadoop-2.7


Using Spark

If you want to find out more about Spark on the HPC Cluster, you can use the command

module spider Spark

This will show you basic information, e.g. a short description and the currently installed version.

To load the desired version of the module, use the following commands:

module load hpc-env/6.4
module load Spark

Always remember: these commands are case sensitive!
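
Once the module is loaded, the Spark command line tools should be available in your session. Assuming they are on your PATH, you can verify the installation and run a script such as the word-count sketch above with

spark-submit --version
spark-submit --master local[4] minimal_wordcount.py

Here --master local[4] runs Spark locally with four worker threads on the current node; for runs across several nodes, please refer to the Spark documentation linked below.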


Documentation

An informative quick start guide can be found here, and the documentation page can be found here. If you need more information about Hadoop, consider visiting Apache's website.