What is Apache Spark?
Apache Spark is a free, open source framework for parallel, distributed data processing that enables you to process all kinds of data at massive scale.
Features and benefits of Apache Spark
Spark delivers an advanced palette of capabilities for today's data engineers, data scientists and analysts:
- Write custom parallel, distributed data processing applications in popular languages, including Python, Scala and Java.
- Use Spark MLlib to perform all manner of common machine learning tasks at scale, and use Spark GraphX to run graph algorithms over vast graphs with enormous numbers of vertices.
- Write SQL queries and process big data using common data warehousing techniques with Spark SQL (see the sketch after this list).
- Write continuous data processing applications that reliably process streaming data using the Spark Streaming API.
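As a flavour of the Spark SQL item above, here is a minimal PySpark sketch. The people dataset and view name are invented for illustration; a real job would read from files or tables.

```python
from pyspark.sql import SparkSession

# Start (or reuse) a Spark session
spark = SparkSession.builder.appName("spark-sql-demo").getOrCreate()

# A small illustrative dataset; real workloads would read from files or tables
people = spark.createDataFrame(
    [("alice", 34), ("bob", 36), ("carol", 29)],
    ["name", "age"],
)

# Register the DataFrame as a temporary view and query it with standard SQL
people.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 30").show()
```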
Spark is widely used to build sophisticated data pipelines, whether continuous event stream processing applications or batch-based jobs that run on a schedule. Spark is also widely adopted by data scientists for data analysis and for machine learning tasks such as data preparation.
Why choose Spark?
- Flexible and versatile
- Processes big data
- Support for multiple programming languages
Why do companies use Spark?
Proven solution
Apache Spark has been battle-tested in real-world deployments around the globe for more than a decade.
Massive scalability
Spark is designed to run at web scale – with horizontal scaling capabilities built in.
Large user community
Spark is a mature, actively developed project and has a vast user community.
How do companies use Spark?
Streaming data
You can use Spark to develop applications that process continuous data streams – for example, analysing web clickstream data to provide real-time insights and alerts at web scale.
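As a sketch of what such an application can look like, the following minimal Structured Streaming job counts events arriving on a socket. The localhost:9999 source is a stand-in; a production pipeline would typically read from a system like Kafka.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("clickstream-demo").getOrCreate()

# Read a continuous stream of lines of text; localhost:9999 is a stand-in
# for a production source such as Kafka
lines = (
    spark.readStream.format("socket")
    .option("host", "localhost")
    .option("port", 9999)
    .load()
)

# Maintain a running count of identical events as new data arrives
counts = lines.groupBy("value").count()

# Emit the running counts to the console; a real pipeline would write to a
# durable sink instead
query = counts.writeStream.outputMode("complete").format("console").start()
query.awaitTermination()
```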
Data science
Spark is a popular choice for empowering data scientists. Spark helps data scientists to prepare data, train models and explore datasets on a compute cluster. And with support for Python and R, data scientists can work with the tooling they already know.
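To illustrate the model training step, here is a minimal MLlib sketch in Python. The tiny synthetic dataset and column names are invented for the example.

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("mllib-demo").getOrCreate()

# A tiny synthetic dataset standing in for real feature data
train = spark.createDataFrame(
    [(0.0, 1.2, 0.7), (1.0, 3.1, 2.4), (0.0, 0.9, 0.3), (1.0, 2.8, 2.9)],
    ["label", "x1", "x2"],
)

# Assemble the raw columns into the single feature vector MLlib expects
assembler = VectorAssembler(inputCols=["x1", "x2"], outputCol="features")
model = LogisticRegression().fit(assembler.transform(train))

print(model.coefficients)
```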
Analytics
With Spark, you can use industry-standard Structured Query Language (SQL) to query your big data. Or you can use Spark's DataFrame API to analyse massive datasets using Python in a familiar way.
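The sketch below shows the same aggregation expressed both ways, over a hypothetical sales dataset:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import avg

spark = SparkSession.builder.appName("analytics-demo").getOrCreate()

# Hypothetical sales data; a real job would read from Parquet, CSV or a table
sales = spark.createDataFrame(
    [("EMEA", 120.0), ("EMEA", 80.0), ("APAC", 200.0)],
    ["region", "amount"],
)
sales.createOrReplaceTempView("sales")

# The same aggregation expressed with SQL ...
spark.sql(
    "SELECT region, AVG(amount) AS avg_amount FROM sales GROUP BY region"
).show()

# ... and with the DataFrame API
sales.groupBy("region").agg(avg("amount").alias("avg_amount")).show()
```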
How does Spark work?
Spark runs as an application on a compute cluster, under a resource scheduler. You can use the popular Kubernetes scheduler, use YARN, or set up a standalone Spark cluster.
Users write Spark applications in the language of their choice – Python, SQL, Scala, R or Java – and submit them to the cluster, where the Spark application can run on many compute nodes at once, dividing the workload into tasks that run in parallel so that processing completes faster.
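As a sketch of how an application names its cluster manager and resources, the Python snippet below sets these options in code; in practice they are more often passed on the spark-submit command line. The master URL and resource figures are placeholders.

```python
from pyspark.sql import SparkSession

# The master URL and resource figures below are placeholders; in practice
# these options are usually passed on the spark-submit command line
spark = (
    SparkSession.builder
    .appName("cluster-demo")
    .master("yarn")  # or "k8s://https://<api-server>:6443", or "spark://host:7077"
    .config("spark.executor.instances", "4")  # number of executors
    .config("spark.executor.memory", "4g")    # memory per executor
    .config("spark.executor.cores", "2")      # cores per executor
    .getOrCreate()
)

# The computation is split into tasks and spread across the executors
print(spark.range(100_000_000).selectExpr("sum(id)").collect())
```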
Spark makes use of in-memory caching in order to accelerate data processing even further, but falls back to persistent storage media if memory is insufficient.
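A minimal sketch of that behaviour, using a storage level that spills to disk when memory runs short:

```python
from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cache-demo").getOrCreate()

df = spark.range(10_000_000).selectExpr("id", "id % 100 AS bucket")

# MEMORY_AND_DISK keeps partitions in memory where possible and spills the
# remainder to disk, matching the fallback behaviour described above
df.persist(StorageLevel.MEMORY_AND_DISK)

# Both actions below reuse the cached data instead of recomputing it
print(df.count())
df.groupBy("bucket").count().show(5)
```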
Spark applications are composed of a driver and executors. The driver can run on the cluster, or on the user's local machine when using an interactive session like spark-shell. The driver acts as the task coordinator, while the executors perform the actual processing tasks. Data is partitioned and distributed amongst the executors for processing.
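That partitioning is visible even in a small session, where the driver schedules one task per partition. The sketch below uses a local master, so the executors are threads in the same process rather than separate cluster nodes.

```python
from pyspark.sql import SparkSession

# The builder runs in the driver process; with a local master, executors are
# threads in the same JVM rather than separate cluster processes
spark = (
    SparkSession.builder.master("local[*]").appName("partitions-demo").getOrCreate()
)

# Split the data into 8 partitions; each partition becomes a task that the
# driver schedules onto an executor
rdd = spark.sparkContext.parallelize(range(1000), 8)
print(rdd.getNumPartitions())          # 8
print(rdd.map(lambda x: x * x).sum())  # work runs in parallel across partitions
```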
Feature breakdown
- Horizontally scalable
Spark applications can be scaled to increase processing capacity by adding executors, which enables Spark to work with petabyte-scale datasets.
- Distributed processing
Spark workloads are distributed across many executors, which can in turn run on many servers in a compute cluster. Dividing processing tasks and spreading them across the cluster can massively accelerate data processing times.
- Data warehousing
When used in conjunction with solutions like Apache Kyuubi, Spark makes an effective lakehouse engine for data warehousing at data lake scale.
Installing Spark
Spark is a distributed system written in Scala that runs on the Java Virtual Machine (JVM), designed for computers running Ubuntu and other Linux distributions.
You can use Canonical's Charmed Spark solution to deploy a fully supported Spark solution on Kubernetes.
Canonical's Charmed Spark
Charmed Spark delivers up to 10 years of support and security maintenance for Apache Spark as an integrated, turnkey solution with advanced management features.
Charmed Spark
Included in Ubuntu Pro + Support
When you purchase an Ubuntu Pro + Support plan, you also get support for the full Charmed Spark solution.
- Up to 10 years of Spark support per release track
- 24/7 or weekday phone and ticket support
- Up to 10 years of security maintenance for Spark covering critical and high severity CVEs
Charmed Spark allows you to automate deployment and operation of Spark at web scale in the environment of your choice – on the cloud or in your data centre. It supports deployment to the most popular clouds or to any CNCF-conformant Kubernetes.
Spark Rock container image
Included in Ubuntu Pro
Ubuntu Pro also includes support for Canonical's container image for Spark in the GitHub Container Registry (GHCR) – the slimmest, fastest, niftiest Spark container image on the market, based on Ubuntu LTS. So solid and secure, we call it a Rock.
- Up to 10 years of support per release track
- Same 24/7 or weekday phone and ticket support commitment
- Same 10 years of security maintenance covering critical and high severity CVEs in the image
Spark consultancy and support
Advanced professional services for Spark, when you need them
Get help designing, planning, building and even operating a hyper-automated production Spark solution that perfectly fits your needs, with Canonical's expert services.
- Help with design and build of both production and non-production Spark environments with Charmed Spark
- Managed services for Spark lakehouses in your cloud tenancy or data centre, backed by an SLA
- Firefighting support with a Spark operations expert, who works alongside your team when crisis hits
Learn more about Spark
Get an introduction to Apache Spark and learn how to prioritise your security requirements.
Spark resources
- Charmed Spark reference architecture guide
Read the reference architecture guide for Charmed Spark.
- Make better decisions with open source Big Data and AI solutions
Learn how to build a smarter enterprise with a secure, integrated open source stack.
- Building an online data hub with Spark
Building an effective, online data hub to facilitate access to enterprise data means ensuring solution scalability and reliability. Read the guide to gain insights into the value, use cases and challenges associated with building an enterprise data hub – whether on the public cloud or on-premise.
Apache®, Apache Spark, Spark®, and the Spark logo are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries.