Capabilities of AWS Glue Data Quality
AWS Glue Data Quality accelerates your data quality journey with the following key capabilities:
- Serverless – AWS Glue Data Quality is a feature of AWS Glue, which eliminates the need for infrastructure management, patching, and maintenance.
- Reduced manual effort with recommended and out-of-the-box data quality rules – AWS Glue Data Quality computes data statistics such as minimums, maximums, histograms, and correlations for datasets. It then uses these statistics to automatically recommend data quality rules that check for data freshness, accuracy, and integrity. This reduces manual data analysis and rule identification efforts from days to hours. You can then augment the recommendations with out-of-the-box data quality rules. The following table lists the rules supported by AWS Glue Data Quality as of this writing. For an up-to-date list, refer to Data Quality Definition Language (DQDL).
| Rule Type | Description |
| --- | --- |
| AggregateMatch | Checks if two datasets match by comparing summary metrics, such as total sales amount. Useful for comparing whether all data was ingested from source systems. |
| ColumnCorrelation | Checks how well two columns are correlated. |
| ColumnCount | Checks if any columns were dropped. |
| ColumnDataType | Checks if a column is compliant with a data type. |
| ColumnExists | Checks if columns exist in a dataset. This allows customers building self-service data platforms to ensure certain columns are made available. |
| ColumnLength | Checks if the length of data is consistent. |
| ColumnNamesMatchPattern | Checks if column names match defined patterns. Useful for governance teams to enforce column name consistency. |
| ColumnValues | Checks if data is consistent with defined values. This rule supports regular expressions. |
| Completeness | Checks for any blank or NULL values in data. |
| CustomSql | Lets customers implement almost any type of data quality check in SQL. |
| DataFreshness | Checks if data is fresh. |
| DatasetMatch | Compares two datasets and identifies if they are in sync. |
| DistinctValuesCount | Checks for duplicate values. |
| Entropy | Checks the entropy of the data. |
| IsComplete | Checks if 100% of the data is complete. |
| IsPrimaryKey | Checks if a column is a primary key (not NULL and unique). |
| IsUnique | Checks if 100% of the data is unique. |
| Mean | Checks if the mean matches the set threshold. |
| ReferentialIntegrity | Checks if two datasets have referential integrity. |
| RowCount | Checks if record counts match a threshold. |
| RowCountMatch | Checks if record counts between two datasets match. |
| StandardDeviation | Checks if the standard deviation matches the threshold. |
| SchemaMatch | Checks if the schemas of two datasets match. |
| Sum | Checks if the sum matches a set threshold. |
| Uniqueness | Checks if the uniqueness of a dataset matches a threshold. |
| UniqueValueRatio | Checks if the unique value ratio matches a threshold. |
- Embedded in customer workflows – AWS Glue Data Quality has to blend into customer workflows to be useful; disjointed experiences create friction in getting started. You can access AWS Glue Data Quality from the AWS Glue Data Catalog, allowing data stewards to set up rules while they are using the Data Catalog. You can also access it from AWS Glue Studio (AWS Glue's visual authoring tool), AWS Glue Studio notebooks (a notebook-based interface for coders to create data integration pipelines), and interactive sessions, an API through which data engineers can submit jobs from their choice of code editor.
- Pay-as-you-go and cost-effective – AWS Glue Data Quality is charged based on the compute used. This simple pricing model doesn't lock you into annual licenses. AWS Glue ETL-based data quality checks can use Flex execution, which is 34% cheaper for data quality checks that aren't SLA-sensitive. Additionally, AWS Glue Data Quality rules on data pipelines can help you save costs: because bad-quality data is detected early, you don't waste compute resources processing it. Also, when data quality checks are configured as part of data pipelines, you incur only an incremental cost, because the data is already read and mostly in memory.
- Built on open source – AWS Glue Data Quality is built on open-source Deequ, a library used internally by Amazon to manage the quality of data lakes exceeding 60 PB. Deequ is optimized to run data quality rules in minimal passes over the data, which makes it efficient. Rules authored in AWS Glue Data Quality can run in any environment that can run Deequ, allowing you to stay with an open-source solution.
- Simplified rule authoring language – As part of AWS Glue Data Quality, we announced Data Quality Definition Language (DQDL). DQDL attempts to standardize data quality rules so that you can use the same rules across different databases and engines. DQDL is simple to author and read, and brings the benefits of code that developers like, such as version control and deployment. To demonstrate the simplicity of this language, the following example shows three rules that check that the record count is greater than 10, that `VendorID` doesn't have any empty values, and that `VendorID` falls within a certain range of values:
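A DQDL ruleset implementing these three checks might look like the following sketch (the specific value range for `VendorID` is illustrative):

```
Rules = [
    RowCount > 10,
    IsComplete "VendorID",
    ColumnValues "VendorID" in [1, 2, 3, 4]
]
```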