Michael Burkhardt’s Weblog

What is Amazon SageMaker Processing?

When AWS announced Amazon SageMaker Processing back in 2019, I was fairly busy working out how to get SageMaker Tuning, Training, and Batch Inference jobs and AWS Glue pre- and postprocessing jobs orchestrated to run at scale using an S3 data lake. I remember seeing the new item in the SageMaker service menu appear one day and thinking, “What the heck is Processing? Ugh, I’ll look at it later.”

Now I wish I hadn’t kicked the can down the road. I’ve recently begun to learn about Processing and I think it’ll solve some important problems as I begin migrating some analytics workloads to the cloud. It seems like a good time to share what I’m learning as I go.

If you haven't read about it already, the introduction in the Developer Guide is worth a look. It says, in part:

With Processing, you can use a simplified, managed experience on SageMaker to run your data processing workloads, such as feature engineering, data validation, model evaluation, and model interpretation. You can also use the Amazon SageMaker Processing APIs during the experimentation phase and after the code is deployed in production to evaluate performance.

So what is Amazon SageMaker Processing really?

That sounds like a lot of marketing malarkey to me. I’d describe Processing like this: it’s sort of like AWS Glue, but different/better. Some key features are that you can

The SageMaker SDK provides several processors for launching Processing jobs:

Some Introductory Examples

If you haven't used Processing before, I recommend the following examples provided as part of the huge Amazon SageMaker Examples GitHub repo:

More to come…

I’m especially excited about the ScriptProcessor because I’m interested in scheduling R scripts in the cloud. I’ll continue to share what I learn in future articles.
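For a rough idea of where I'm headed, here is a minimal sketch of launching an R script with the SageMaker Python SDK's ScriptProcessor. Treat it as an assumption-laden outline rather than a tested recipe: the function names, the instance type, and every image URI, role ARN, script path, and S3 location you pass in are placeholders of my own, and the container image is assumed to already have R installed.

```python
def script_processor_settings(image_uri, role_arn):
    """Keyword arguments for sagemaker.processing.ScriptProcessor.

    image_uri and role_arn are placeholders: the image is assumed to be
    a container (for example, one you pushed to ECR) with R installed.
    """
    return {
        "image_uri": image_uri,
        "role": role_arn,
        "command": ["Rscript"],           # run the script with Rscript
        "instance_count": 1,
        "instance_type": "ml.m5.xlarge",  # assumed size; pick what fits
    }


def launch_r_job(settings, script_path, input_s3_uri, output_s3_uri):
    """Start a Processing job that runs the R script at script_path."""
    # Imported lazily so the settings helper works without the SDK installed.
    from sagemaker.processing import (
        ProcessingInput,
        ProcessingOutput,
        ScriptProcessor,
    )

    processor = ScriptProcessor(**settings)
    processor.run(
        code=script_path,
        # Data under input_s3_uri is copied into the container before the run.
        inputs=[
            ProcessingInput(
                source=input_s3_uri,
                destination="/opt/ml/processing/input",
            )
        ],
        # Files the script writes to this folder are uploaded to S3 afterwards.
        outputs=[
            ProcessingOutput(
                source="/opt/ml/processing/output",
                destination=output_s3_uri,
            )
        ],
    )
    return processor
```

Calling launch_r_job needs the sagemaker package, AWS credentials, and an IAM role that SageMaker can assume, so I've kept the settings in a separate helper that can be inspected on its own.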
