Innovative and Vetted AWS Technology to Consider for 2022, Part 2

AWS SageMaker Data Wrangler – Aggregate and Prepare Data for Machine Learning

by Philip Horwitz

Machine learning (ML), a subset of artificial intelligence, is rapidly gaining popularity across business sectors and verticals. ML models can also be used to speed up the software development lifecycle, and they offer a new paradigm that can be built into a host of emerging applications.

ML is becoming more mainstream but is still evolving at a rapid clip. One tech professional compared the progressive development of ML as a technology to the stages of a baby learning to crawl, walk, and run. He made it quite clear that ML is still only in the “crawling” stage.

Because ML is a data-intensive process, developers and data scientists must prepare large sets of training data that are fed into learning algorithms to find important patterns hidden within the data. These algorithms are powerful enough to discover patterns that developers have not even contemplated.

To meet the growing demand for ML models, Amazon Web Services (AWS) launched a cloud machine learning platform called SageMaker in 2017. SageMaker is a fully managed service that removes challenges from each stage of the ML process, making it easier and faster for everyday developers and data scientists to build, train, deploy, and manage machine learning models.

Moreover, AWS describes SageMaker Studio as the first fully integrated development environment (IDE) for ML. It enables developers to create, train, and deploy ML models in the cloud, and it also lets them implement ML models on embedded systems and edge devices.

With the increased popularity of this platform, AWS introduced a complementary new service in 2020 to make it substantially easier to prepare data for machine learning training. This cloud-based service, Amazon SageMaker Data Wrangler, simplifies the process of data preparation and feature engineering for ML. It provides a single visual interface to complete each step in the data preparation workflow.

Wrangling Data into a Suitable State for Machine Learning

One of the most requested features from SageMaker customers was the ability to perform data preparation in the SageMaker Studio IDE. During a keynote speech, Andy Jassy, now president and CEO of Amazon, said, “Data preparation is hard.” He added, “The topic that seems to come up first and foremost every time is how can we make data preparation for machine learning easier.” SageMaker Data Wrangler is the direct result of that question from Jassy, who at the time headed up AWS.

SageMaker Data Wrangler offers a faster and more efficient way to clean and prepare data specifically for use in ML with SageMaker Studio. Other use cases may be better served by a more general-purpose data preparation tool such as AWS Glue DataBrew.

As might be expected, SageMaker Data Wrangler uses its own ML algorithms to recognize the type of data that a user is preparing for model training. Based on that, the service can recommend one or more of its hundreds of built-in transformations to apply to the data.

This automation sharply reduces the time it normally takes to aggregate and prepare data for ML. It also relieves developers and data scientists of the significant burden associated with this tedious and lengthy activity. SageMaker Data Wrangler manages all of the processing infrastructure under the hood.

In fact, the whole idea behind the SageMaker Studio IDE, beyond allowing users to collect data and build, train, and deploy ML models through AWS, was to help them get their models to production faster while saving time, effort, and labor costs. SageMaker Data Wrangler was designed as an auxiliary tool to “grease the wheels” of that process, making ML development and deployment even faster.

A Look Inside SageMaker Data Wrangler

This tool, designed specifically for data scientists and developers who must aggregate and prepare data for ML, has a straightforward workflow with rich features for data preparation tasks. The general flow is as follows:

Select and query data
Transform data
Understand and interpret data with visualizations
Fix any problems with the data
Automate data preparation workflows

Select and Query Data

The data selection tool built into SageMaker Data Wrangler allows users to rapidly select data from a variety of data sources. Data sources can include, among others, Amazon S3, Amazon Athena, Amazon Redshift, AWS Lake Formation, and various data stores. It is also easy to write queries for these data sources.

Data can also be imported directly into the tool from file formats such as CSV and Parquet, as well as from database tables. Users can choose the data they want and import it with a single click.
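As a rough illustration of what selecting and importing data involves, the sketch below parses a small CSV sample with Python's standard library and keeps only the rows and columns of interest. It does not use Data Wrangler or any AWS API. The sample data, the column names, and the `load_rows` helper are hypothetical and exist only for this example.

```python
import csv
import io

# Hypothetical sample standing in for a CSV file exported from a data source.
SAMPLE_CSV = """customer_id,country,age,monthly_spend
1,US,34,120.50
2,DE,,80.00
3,US,51,310.25
"""


def load_rows(text, columns, where=None):
    """Parse CSV text and return the chosen columns of rows that pass `where`."""
    reader = csv.DictReader(io.StringIO(text))
    selected = []
    for row in reader:
        if where is None or where(row):
            selected.append({name: row[name] for name in columns})
    return selected


# Roughly equivalent to: SELECT customer_id, monthly_spend WHERE country = 'US'
us_rows = load_rows(
    SAMPLE_CSV,
    columns=["customer_id", "monthly_spend"],
    where=lambda row: row["country"] == "US",
)
print(us_rows)
```

Note that every value is still a string at this point. Converting types is part of the transformation step, not the import step.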

Transform Data

A major difficulty of preparing data for ML arises from the fact that data attributes—frequently referred to as features—often come from different sources and exist in a variety of formats. Because of this, developers and data scientists typically spend considerable time extracting and cleansing data so it is ready and consistent for use in ML and pattern recognition.

To simplify this process, SageMaker Data Wrangler includes 300+ built-in and pre-configured data transformations that help data scientists and other users quickly normalize, transform, and combine features without writing any code. These data transformations or “transformers” include functions such as convert column type, one-hot encoding, impute missing data with mean or median, rescale columns, split by delimiter, and custom transform, to name a few.
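To make a few of these transformations concrete, here is a minimal sketch in plain Python of what type conversion, mean imputation, rescaling, one-hot encoding, and splitting by delimiter do to a column. This is not Data Wrangler's implementation. The function names and sample values are hypothetical, and min-max scaling is just one of several possible rescaling methods.

```python
from statistics import mean


def to_float(values):
    """Convert strings to floats; empty strings become None (missing)."""
    return [float(v) if v != "" else None for v in values]


def impute_mean(values):
    """Replace missing entries (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def min_max_rescale(values):
    """Rescale values linearly into the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def one_hot(values):
    """Encode a categorical column as one 0/1 column per distinct category."""
    categories = sorted(set(values))
    return {c: [1 if v == c else 0 for v in values] for c in categories}


def split_by_delimiter(values, delimiter):
    """Split each string into a list of parts."""
    return [v.split(delimiter) for v in values]


ages = impute_mean(to_float(["30", "", "50"]))   # [30.0, 40.0, 50.0]
scaled = min_max_rescale(ages)                    # [0.0, 0.5, 1.0]
countries = one_hot(["US", "DE", "US"])           # {"DE": [0, 1, 0], "US": [1, 0, 1]}
names = split_by_delimiter(["Ada Lovelace", "Alan Turing"], " ")
```

Each function takes a column and returns a new column, so transformations can be chained in any order.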

Moreover, data scientists and developers might also want to combine several features into “composite features” for a specific ML or pattern recognition model. SageMaker Data Wrangler simplifies this labor-intensive process as well as the work generally associated with transforming raw data into features, a process known as “feature engineering.” For example, a spam detection model might be trained on the following features drawn from its datasets: the presence or absence of an email header, the email’s overall structure, language, frequency of specific terms, and the grammatical correctness of the text.
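The spam example can be sketched as a small feature-extraction function. The code below is only an illustration built on Python's standard `email` module, not something Data Wrangler generates. The term list, the feature names, and the sample message are assumptions made for this example. The ratio at the end is a simple composite feature that combines a term count with the message length.

```python
import email
import re

# Hypothetical terms whose frequency is treated as a spam signal.
SUSPICIOUS_TERMS = ("free", "winner", "urgent")


def extract_features(raw_message):
    """Turn a raw email string into a small dictionary of numeric features."""
    msg = email.message_from_string(raw_message)
    if msg.is_multipart():
        # Keep only the plain-text parts of a multipart message.
        parts = [
            part.get_payload()
            for part in msg.walk()
            if part.get_content_type() == "text/plain"
        ]
        body = "\n".join(parts)
    else:
        body = msg.get_payload()

    words = re.findall(r"[a-z']+", body.lower())
    term_hits = sum(words.count(term) for term in SUSPICIOUS_TERMS)
    return {
        "has_subject_header": int(msg["Subject"] is not None),  # header present?
        "is_multipart": int(msg.is_multipart()),                # overall structure
        "word_count": len(words),
        # Composite feature: suspicious terms as a share of all words.
        "suspicious_term_ratio": term_hits / len(words) if words else 0.0,
    }


RAW = (
    "Subject: You are a WINNER\n"
    "From: a@example.com\n"
    "\n"
    "Claim your free prize now. Free entry, urgent!\n"
)
features = extract_features(RAW)
print(features)
```

Features such as language or grammatical correctness would need additional tooling and are left out of this sketch.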

Understand and Interpret Data with Visualizations

Before, during, and after the transformation or conversion of data, SageMaker Data Wrangler allows users to understand their data and identify potential errors, inconsistencies, or extreme values with a set of pre-configured data visualizations. These visualizations include histograms, scatter plots, box and whisker plots, line plots, and bar charts.

Users can quickly preview and inspect data to ensure that the transformations align with what was actually intended. If not, corrections can be quickly made. They can also use the visual interface tool to build preprocessing and visualization pipelines or flows.
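The numbers behind two of these visualizations are simple to compute. The sketch below derives the quartiles and fences a box and whisker plot would draw, flags values outside those fences as extreme, and counts values per histogram bucket. It uses only Python's standard library and assumes the common 1.5 × IQR rule for fences. The function names and sample data are hypothetical, and Data Wrangler's own charts may be computed differently.

```python
from statistics import quantiles


def box_plot_summary(values):
    """Return the quartiles and the 1.5 * IQR fences used by a box plot."""
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "lower_fence": q1 - 1.5 * iqr,
        "upper_fence": q3 + 1.5 * iqr,
    }


def extreme_values(values):
    """List the values that fall outside the box plot fences."""
    s = box_plot_summary(values)
    return [v for v in values if v < s["lower_fence"] or v > s["upper_fence"]]


def histogram(values, bins):
    """Count values in `bins` equal-width buckets between min and max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [len(values)] + [0] * (bins - 1)
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        # The maximum value belongs to the last bucket.
        index = min(int((v - lo) / width), bins - 1)
        counts[index] += 1
    return counts


monthly_spend = [10, 12, 11, 13, 12, 95]
print(box_plot_summary(monthly_spend))
print(extreme_values(monthly_spend))   # [95]
print(histogram(monthly_spend, 3))     # [5, 0, 1]
```

A single value far from the rest, like the 95 here, is exactly the kind of entry a user would want to inspect before training.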

Fix Any Problems with the Data

Through the combined use of the transformation tools and the data visualization templates, data scientists and developers working with ML can quickly identify and fix problems or inconsistencies in the input data that might affect the accuracy of the model.

SageMaker Data Wrangler allows users to diagnose issues before models are deployed into production. A user can quickly estimate whether prepared data will result in an accurate ML model or whether additional feature engineering is required to improve performance.

Automate Data Preparation Workflows

Once preparation is complete, a user can automate ML data preparation workflows. With a single click, the data workflow can be exported to a notebook or code script to bring it into production.

To further help automate data preparation workflows, SageMaker Data Wrangler integrates with another AWS tool, Amazon SageMaker Pipelines, which is used to automate model deployment and management. AWS describes SageMaker Pipelines as the first purpose-built, easy-to-use continuous integration and continuous delivery (CI/CD) service for ML.

With SageMaker Pipelines, developers can define each step of an end-to-end machine learning workflow, including the data-load steps and transformations from Amazon SageMaker Data Wrangler. Features can also be stored in the Amazon SageMaker Feature Store. Developers can easily re-run workflows that incorporate the full data preparation and feature engineering of SageMaker Data Wrangler. A workflow can be re-run using the same settings to get the exact same model every time, or re-run on a regular schedule with new data to update the model.
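The idea of a workflow that can be saved and re-run with the same settings can be sketched in a few lines. In the example below, a flow is plain data: an ordered list of step names and arguments that can be serialized and replayed. This is a conceptual illustration only and is not the format or API used by Data Wrangler or SageMaker Pipelines. The step names, the `STEPS` registry, and the `run_flow` helper are hypothetical.

```python
import json


def drop_missing(rows, column):
    """Remove rows where `column` is missing or empty."""
    return [r for r in rows if r.get(column) not in (None, "")]


def cast_float(rows, column):
    """Convert `column` to float in every row."""
    return [{**r, column: float(r[column])} for r in rows]


# Hypothetical registry that maps step names to functions.
STEPS = {"drop_missing": drop_missing, "cast_float": cast_float}

# The flow is data, so it can be stored, versioned, and replayed.
FLOW = [
    {"step": "drop_missing", "args": {"column": "age"}},
    {"step": "cast_float", "args": {"column": "age"}},
]


def run_flow(flow, rows):
    """Apply each step of the flow in order and return the prepared rows."""
    for spec in flow:
        rows = STEPS[spec["step"]](rows, **spec["args"])
    return rows


raw_rows = [{"age": "34"}, {"age": ""}, {"age": "51"}]
prepared = run_flow(FLOW, raw_rows)

# Saving the flow and loading it again yields the same result on the same data.
saved = json.dumps(FLOW)
replayed = run_flow(json.loads(saved), raw_rows)
print(prepared == replayed)
```

Because the steps never modify their input, running the same flow on new data is just another call to `run_flow` with different rows.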

***

When it comes to preprocessing data for ML models, one of the fastest and easiest options is Amazon SageMaker Data Wrangler, which works seamlessly with SageMaker Studio. Data scientists and developers can use this data preparation and visualization tool to prepare data for ML tasks in an easy, fast, and repeatable manner.

According to Amazon, SageMaker Data Wrangler simplifies the process of data preparation and feature engineering. It lets users complete each step of the data preparation workflow, including data selection, cleansing, exploration, and visualization, from a single visual interface. Once data is prepared, users can build fully automated ML workflows with SageMaker Pipelines and save the resulting features for reuse in the SageMaker Feature Store.
