The Databricks Data Engineer

Jakub Lasak

Technology

Released: 2026-06-15

Free, new

12 episodes

Audio

-39

No. 124 in Top Podcasts > Technology

Latest episode

The Spark Shuffle is baggage claim: why your job waits instead of computes (and more workers won't fix it)

Duration: 11:09

Your Spark job has been running for forty minutes. The dashboard shows your cluster isn't even busy. So you do the obvious thing: add more workers. And it changes nothing.

Here's why. During a shuffle, Spark is barely computing at all. It's tagging every row by destination, piling rows together, spilling the overflow to disk, and hauling data across the network between executors. It's an airport rerouting every passenger's bag to a new carousel, and more baggage handlers can't speed up a single overloaded belt.
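As a rough illustration of the overloaded-belt point, the sketch below simulates hash partitioning in plain Python. It is not Spark code: the `hash(key) % num_partitions` rule, the function names, and the row counts are all assumptions chosen for illustration, and Spark's real partitioner uses its own hash function.

```python
from collections import Counter


def partition_sizes(keys, num_partitions):
    """Return the row count that lands in each shuffle partition.

    Assumption: destination = hash(key) % num_partitions, a simplified
    stand-in for a hash partitioner.
    """
    counts = Counter(hash(key) % num_partitions for key in keys)
    return [counts.get(p, 0) for p in range(num_partitions)]


def largest_partition(keys, num_partitions):
    """Size of the busiest partition, which bounds how long the stage takes."""
    return max(partition_sizes(keys, num_partitions))


# Hypothetical skewed dataset: one hot key (0) owns 9,000 of 10,000 rows.
skewed_keys = [0] * 9_000 + list(range(1, 1_001))

for n in (8, 64, 512):
    even_share = len(skewed_keys) / n
    print(f"{n:>3} partitions: even share {even_share:8.1f} rows, "
          f"largest partition {largest_partition(skewed_keys, n)} rows")
```

However many partitions this toy dataset is split into, every row for the hot key is routed to the same partition, so the largest partition never drops below 9,000 rows while the even share keeps shrinking.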

In this episode:

- Why your slowest wide transformation spends most of its time on logistics, not computing

- The four-step model that lets you explain the shuffle to a teammate in sixty seconds

- Why adding workers can make a skewed job slower, not faster

- The two numbers in the Spark UI that tell you whether it's skew, partition count, or spill

- The one diagnostic to run before you ever resize the cluster again

This episode is for Databricks data engineers whose joins and aggregations crawl for reasons the cluster size never seems to fix. Whether you're mid-level and tired of guessing, or senior and tired of paying for compute that doesn't help, you'll walk away able to read a slow shuffle instead of throwing hardware at it.

---

Helping 18,000+ Databricks data engineers become seniors: interview like seniors, execute like seniors, think like seniors.

Follow The Databricks Data Engineer for new episodes every Monday, Wednesday, and Friday.

LinkedIn: linkedin.com/in/jrlasak

Newsletter: dataengineer.wiki

#DataEngineering #Databricks #DataEngineer #CareerGrowth #ApacheSpark #DeltaLake

Episode ID: 1000772782700

GUID: 5477e284-0a3f-4b63-9283-855bac881095

Release date: 2026-06-15 11:00:00

Description

Helping 18k+ Databricks data engineers become seniors: interview like seniors, execute like seniors, think like seniors.

Feed URL

https://anchor.fm/s/110cfe0fc/podcast/rss
