
What is Apache Spark?

Apache Spark is a free, open-source, parallel and distributed data processing framework that enables you to process all kinds of data at massive scale.


Features and benefits of Apache Spark

Spark delivers an advanced palette of capabilities for today's data engineers, data scientists and analysts:


  • Write custom parallel, distributed data processing applications with popular languages including Python, Scala and Java.
  • Use Spark MLlib to perform all manner of common machine learning tasks at scale, and use Spark GraphX to perform graph processing over vast graphs with enormous numbers of vertices and edges.
  • Write SQL queries and process big data using common data warehousing techniques with Spark SQL.
  • Write continuous data processing applications that reliably process streaming data using the Spark Streaming API.

Spark is widely used to build sophisticated data pipelines, which can be continuous event stream processing applications or batch based jobs that run on a schedule. Spark is also widely adopted by data scientists for data analysis as well as for machine learning tasks including data preparation.


Why choose Spark?

  • Flexible and versatile

  • Processes big data

  • Support for multiple programming languages


Why do companies use Spark?


Proven solution

Apache Spark has been battle-tested in real-world deployments around the globe for more than a decade.


Massive scalability

Spark is designed to run at web scale – with horizontal scaling capabilities built in.


Large user community

Spark is a mature, actively developed project and has a vast user community.


How do companies use Spark?


Streaming data

You can use Spark to develop applications that process continuous data streams – for example, analysing web clickstream data to provide real-time insights and alerts at web scale.


Data science

Spark is a popular choice for empowering data scientists. Spark helps data scientists to prepare data, train models and explore data sets with a compute cluster. And with support for Python and R, data scientists can work with the tooling that they already know.


Analytics

With Spark, you can use industry-standard Structured Query Language (SQL) to query your big data. Or you can use the Spark DataFrame API to analyse massive datasets in Python in a familiar way.


How does Spark work?

Spark runs as an application on a compute cluster, under a resource scheduler. You can use the popular Kubernetes scheduler, use YARN, or set up a standalone Spark cluster.

Users write Spark applications using the language of their choice – Python, SQL, Scala, R or Java – and submit them to the cluster, where the Spark application can be run on many compute nodes at once, dividing the workload into tasks in order to parallelise efforts and complete processing faster.

Spark makes use of in-memory caching in order to accelerate data processing even further, but falls back to persistent storage media if memory is insufficient.

Spark applications are composed of a Driver and Executors. The driver component can run on the cluster, or on the user's local machine when using an interactive session like spark-shell. The driver acts as the task coordinator; executors perform the actual processing tasks. Data is partitioned and distributed amongst the executors for processing.


Feature breakdown

  • Horizontally scalable

    Spark applications can be scaled to increase processing capacity by adding additional executors, which enables Spark to work at petabyte dataset scale.
  • Distributed processing

    Spark workloads are distributed across many executors. Executors can be distributed across many servers in a compute cluster, which can massively accelerate data processing times by dividing data processing tasks and distributing them across the cluster.
  • Data warehousing

    When used in conjunction with solutions like Apache Kyuubi, Spark makes an effective lakehouse engine for data warehousing at data lake scale.

Installing Spark

Spark is a distributed system developed in Scala for the Java virtual machine (JVM), designed for computers running Ubuntu and other Linux distributions.

You can use Canonical's Charmed Spark solution to deploy a fully supported Spark solution on Kubernetes.


Canonical's Charmed Spark

Charmed Spark delivers up to 10 years of support and security maintenance for Apache Spark as an integrated, turnkey solution with advanced management features.


Charmed Spark

Included in Ubuntu Pro + Support

When you purchase an Ubuntu Pro + Support plan, you also get support for the full Charmed Spark solution.


  • Up to 10 years of Spark support per release track
  • 24/7 or weekday phone and ticket support
  • Up to 10 years of security maintenance for Spark covering critical and high severity CVEs

Charmed Spark allows you to automate deployment and operation of Spark at web scale in the environment of your choice – on the cloud or in your data centre. It supports deployment to the most popular clouds or to any CNCF-conformant Kubernetes.


Spark Rock container image

Included in Ubuntu Pro

Ubuntu Pro also includes support for Canonical's container image for Spark in GitHub Container Registry (GHCR) – the slimmest, fastest, niftiest Spark container image on the market, based on Ubuntu LTS. So solid and secure, we call it a Rock.


  • Up to 10 years of support per release track
  • Same 24/7 or weekday phone and ticket support commitment
  • Same 10 years of security maintenance covering critical and high severity CVEs in the image

Spark consultancy and support

Advanced professional services for Spark, when you need them

Get help designing, planning, building and even operating a hyper-automated production Spark solution that perfectly fits your needs, with Canonical's expert services.


  • Help with design and build of both production and non-production Spark environments with Charmed Spark
  • Managed services for Spark lakehouses in your cloud tenancy or data centre, backed by an SLA
  • Firefighting support with a Spark operations expert, who works alongside your team when crisis hits

Learn more about Spark

Get an introduction to Apache Spark and how to prioritise your security requirements.


Spark resources


Apache®, Apache Spark, Spark®, and the Spark logo are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries.