
What is Apache Spark?

Apache Spark is a free, open-source, parallel and distributed data processing framework that enables you to process all kinds of data at massive scale.


Features and benefits of Apache Spark

Spark delivers an advanced palette of capabilities for today's data engineers, data scientists and analysts:


  • Write custom parallel, distributed data processing applications with popular languages including Python, Scala and Java.
  • Use Spark MLlib to perform all manner of common machine learning tasks at scale, and use Spark GraphX to perform graph processing over vast graphs with enormous numbers of vertices and edges.
  • Write SQL queries and process big data using common data warehousing techniques with Spark SQL.
  • Write continuous data processing applications that reliably process streaming data using the Spark Streaming API.

Spark is widely used to build sophisticated data pipelines, which can be continuous event stream processing applications or batch based jobs that run on a schedule. Spark is also widely adopted by data scientists for data analysis as well as for machine learning tasks including data preparation.


Why choose Spark?

  • Flexible and versatile

  • Processes big data

  • Support for multiple programming languages


Why do companies use Spark?


Proven solution

Apache Spark has been battle-tested in real-world deployments around the globe for more than a decade.


Massive scalability

Spark is designed to run at web scale – with horizontal scaling capabilities built in.


Large user community

Spark is a mature, actively developed project and has a vast user community.


How do companies use Spark?


Streaming data

You can use Spark to develop applications that process continuous data streams – for example, analysing web clickstream data to provide real-time insights and alerts at web scale.


Data science

Spark is a popular choice for empowering data scientists. Spark helps data scientists to prepare data, train models and explore data sets with a compute cluster. And with support for Python and R, data scientists can work with the tooling that they already know.


Analytics

With Spark, you can use industry-standard Structured Query Language (SQL) to query your big data. Or you can use the Spark DataFrame API to analyse massive datasets in Python in a familiar way.


How does Spark work?

Spark runs as an application on a compute cluster, under a resource scheduler. You can use the popular Kubernetes scheduler, use YARN, or set up a standalone Spark cluster.

Users write Spark applications using the language of their choice – Python, SQL, Scala, R or Java – and submit them to the cluster, where the Spark application can be run on many compute nodes at once, dividing the workload into tasks in order to parallelise efforts and complete processing faster.

Spark makes use of in-memory caching in order to accelerate data processing even further, but falls back to persistent storage media if memory is insufficient.

Spark applications are composed of a Driver and Executors. The driver component can run on the cluster, or on the user's local machine when using an interactive session like spark-shell. The driver acts as the task coordinator; executors perform the actual processing tasks. Data is partitioned and distributed amongst the executors for processing.


Feature breakdown

  • Horizontally scalable

    Spark applications can be scaled to increase processing capacity by adding additional executors, which enables Spark to work at petabyte dataset scale.
  • Distributed processing

    Spark workloads are distributed across many executors. Executors can be distributed across many servers in a compute cluster, which can massively accelerate data processing times by dividing data processing tasks and distributing them across the cluster.
  • Data warehousing

    When used in conjunction with solutions like Apache Kyuubi, Spark makes an effective lakehouse engine for data warehousing at data lake scale.

Installing Spark

Spark is a distributed system developed in Scala for the Java virtual machine (JVM), designed for computers running Ubuntu and other Linux distributions.

You can use Canonical's Charmed Spark solution to deploy a fully supported Spark solution on Kubernetes.


Canonical's Charmed Spark

Charmed Spark delivers up to 10 years of support and security maintenance for Apache Spark as an integrated, turnkey solution with advanced management features.


Charmed Spark

Included in Ubuntu Pro + Support

When you purchase an Ubuntu Pro + Support plan, you also get support for the full Charmed Spark solution.


  • Up to 10 years of Spark support per release track
  • 24/7 or weekday phone and ticket support
  • Up to 10 years of security maintenance for Spark covering critical and high severity CVEs

Charmed Spark allows you to automate deployment and operation of Spark at web scale in the environment of your choice – on the cloud or in your data centre. It supports deployment to the most popular clouds or to any CNCF-conformant Kubernetes.


Spark Rock container image

Included in Ubuntu Pro

Ubuntu Pro also includes support for Canonical's container image for Spark in GitHub Container Registry (GHCR) – the slimmest, fastest, niftiest Spark container image on the market, based on Ubuntu LTS. So solid and secure, we call it a Rock.


  • Up to 10 years of support per release track
  • Same 24/7 or weekday phone and ticket support commitment
  • Same 10 years of security maintenance covering critical and high severity CVEs in the image

Spark consultancy and support

Advanced professional services for Spark, when you need them

Get help designing, planning, building and even operating a hyper-automated production Spark solution that perfectly fits your needs, with Canonical's expert services.


  • Help with design and build of both production and non-production Spark environments with Charmed Spark
  • Managed services for Spark lakehouses in your cloud tenancy or data centre, backed by an SLA
  • Firefighting support with a Spark operations expert, who works alongside your team when crisis hits

Learn more about Spark

Get an introduction to Apache Spark and how to prioritise your security requirements.


Spark resources


Apache®, Apache Spark, Spark®, and the Spark logo are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries.