Classifying and Labeling Synapse Spark Data with Microsoft Purview

Azure Synapse Analytics administrators and data stewards often deal with sensitive data across Spark pools and Azure Data Lake Storage (ADLS) Gen2. Microsoft Purview provides tools to automatically classify this data and apply sensitivity labels for protection and governance. In this post, we’ll explain how Purview classifies and labels data (like Parquet files in ADLS Gen2) accessed by Synapse Spark and how to manually refine classifications and labels in the Purview catalog, then walk through the implementation step by step. We’ll also cover best practices for labeling consistency and data protection, and how these labels flow into downstream tools (e.g. Power BI) and security policies.

Automatic Classification and Sensitivity Labeling in Purview

Microsoft Purview’s data scanning capability can automatically detect sensitive information in your Synapse data (such as Parquet files in ADLS Gen2) and tag it with classifications. Purview includes a library of over 200 built-in sensitive information types (SITs) – for example, credit card numbers, Social Security numbers, bank account numbers, etc. – that it uses to scan data for matches (jamesserra.com, techcommunity.microsoft.com). When Purview scans a data source, it inspects file contents and schema to find patterns; if a Parquet file’s columns contain email addresses or SSNs, Purview will automatically assign the corresponding classifications (like “Email Address” or “US Social Security Number”) to those data assets. This scanning is content-based and works at the file and column level – Purview can tag an entire file asset with a classification or even individual columns within structured files/tables (jamesserra.com).
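
To make the idea of content-based classification concrete, here is a minimal sketch of pattern matching over sampled column values. It is not Purview's implementation: the regular expressions, the `min_match_ratio` threshold, and the function names are all assumptions made for illustration, and the built-in classifiers are considerably more elaborate.

```python
import re
from collections import Counter
from typing import Optional

# Hypothetical, simplified patterns (illustration only, not Purview's rules).
EMAIL_RE = re.compile(r"^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$")
SSN_RE = re.compile(r"^\d{3}-\d{2}-\d{4}$")
CARD_RE = re.compile(r"^\d{13,19}$")


def luhn_valid(digits: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def classify_value(value: str) -> Optional[str]:
    """Return a classification name for a single value, or None."""
    v = value.strip()
    if EMAIL_RE.match(v):
        return "Email Address"
    if SSN_RE.match(v):
        return "US Social Security Number"
    compact = v.replace(" ", "").replace("-", "")
    if CARD_RE.match(compact) and luhn_valid(compact):
        return "Credit Card Number"
    return None


def classify_columns(rows, min_match_ratio=0.6):
    """Tag a column when enough sampled values match the same pattern."""
    result = {}
    columns = list(rows[0].keys()) if rows else []
    for col in columns:
        hits = Counter(classify_value(str(r[col])) for r in rows)
        hits.pop(None, None)  # ignore values that matched nothing
        if not hits:
            continue
        name, count = hits.most_common(1)[0]
        if count / len(rows) >= min_match_ratio:
            result[col] = name
    return result


sample = [
    {"email": "a@example.com", "card": "4111 1111 1111 1111", "note": "hello"},
    {"email": "b@example.org", "card": "5555555555554444", "note": "world"},
]
print(classify_columns(sample))
```

The checksum step shows why a pattern match alone is not enough: a 16-digit order number would match the digit pattern but usually fails the Luhn check, which is one way false positives are reduced.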

In addition to classification tags, Purview can also apply sensitivity labels to your data assets as part of the same scanning process. Classifications and sensitivity labels serve distinct purposes: classifications identify and categorize the data (what type of sensitive data it is), whereas sensitivity labels indicate how sensitive the data is and how it should be handled or protected (jamesserra.com). Purview is integrated with Microsoft Purview Information Protection (formerly Microsoft Information Protection, MIP) so that the same sensitivity labels used for Office documents or Power BI can be extended to your Azure data assets (learn.microsoft.com). If you have configured sensitivity labels in your tenant and enabled them for “Files & other data assets,” Purview’s scan can automatically label data when certain sensitive patterns are found. For example, if a Parquet file contains credit card numbers (detected via classification), an auto-labeling policy could automatically assign a label like “Confidential” to that file asset (docs.azure.cn). This automatic labeling during scans ensures that sensitive Synapse data in ADLS Gen2 is not only identified, but also tagged with the appropriate business sensitivity category.

It’s important to note that automatic labeling in Purview’s Data Map relies on policies you define. By default, a scan will identify classifications, and you can then configure rules that say “if data classified as X is found, apply sensitivity label Y.” When such an auto-label rule exists and the label is enabled in Purview, scanning an asset in Purview will cause the label to be applied to that asset’s metadata in the Purview catalog (docs.azure.cn). In short, Purview scanning does the heavy lifting by discovering PII or other sensitive data and tagging it, and with integration to sensitivity labels it can also uniformly label that data according to your organization’s taxonomy.
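
The “if data classified as X is found, apply sensitivity label Y” logic can be sketched as a small rule table. This is an illustrative model only: the label names, the rule list, and the choice to pick the most sensitive label when several rules match are assumptions of the sketch, not a description of how Purview evaluates policies internally.

```python
from dataclasses import dataclass

# Hypothetical label taxonomy, ordered from least to most sensitive.
LABEL_ORDER = ["Public", "General", "Confidential", "Highly Confidential"]


@dataclass(frozen=True)
class AutoLabelRule:
    classification: str  # classification that triggers the rule
    label: str           # sensitivity label to apply when it is found


# Hypothetical rules for illustration.
RULES = [
    AutoLabelRule("Email Address", "Confidential"),
    AutoLabelRule("Credit Card Number", "Highly Confidential"),
]


def resolve_label(classifications, rules=RULES):
    """Return the most sensitive label triggered by the classifications, or None."""
    matched = [r.label for r in rules if r.classification in classifications]
    if not matched:
        return None
    # Assumption of this sketch: the most sensitive matching label wins.
    return max(matched, key=LABEL_ORDER.index)


print(resolve_label({"Email Address", "Credit Card Number"}))
```

Keeping the rules as data rather than code makes them easy to review with security and compliance teams, which is the same reason to keep the real policies documented.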

Manual Classification and Labeling in the Purview Catalog

Automatic classification is powerful, but data stewards may need to manually adjust or enhance the classifications and labels on assets in the Purview catalog. For instance, if Purview’s scanner misses a particular custom sensitive pattern, or if you want to tag a whole folder as containing a certain data category, you can add classifications manually. The Purview Data Catalog provides the ability to edit an asset’s classifications – you can add additional classification tags or remove ones that don’t apply (docs.azure.cn). As an example, after a scan you might find a Parquet file asset with some columns automatically classified as “Phone Number”. If you know that another column contains sensitive customer IDs that weren’t auto-detected, you can manually assign a custom classification (e.g. “Customer ID”) to that column in the catalog. Manual classification ensures that all important sensitive data is tagged, beyond what the automated rules catch.
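
Manual edits like this can also be scripted. The sketch below only builds a request in the shape of the Apache Atlas v2 “add classifications to an entity” call that the Purview Data Map exposes; it does not send anything. The host pattern, the `/catalog/api/atlas/v2` path prefix, the GUID, and the `CONTOSO.CUSTOMER_ID` type name are assumptions for illustration, and the exact endpoint should be checked against the current Purview REST reference. It also assumes the column is addressable by its own entity GUID and that the custom classification already exists.

```python
import json


def build_add_classification_request(account_name, entity_guid, classification_name):
    """Build (without sending) an Atlas-style request that tags one entity."""
    # Assumed endpoint shape; verify against the current Purview REST reference.
    url = (
        f"https://{account_name}.purview.azure.com"
        f"/catalog/api/atlas/v2/entity/guid/{entity_guid}/classifications"
    )
    # The body is a JSON list of classification objects.
    body = json.dumps([{"typeName": classification_name}])
    return {"method": "POST", "url": url, "body": body}


# Hypothetical account name, GUID, and custom classification name.
request = build_add_classification_request(
    "contoso-purview", "00000000-0000-0000-0000-000000000000", "CONTOSO.CUSTOMER_ID"
)
print(request["url"])
```

A real call would additionally need a bearer token for an identity with curation rights on the collection, which is omitted here.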

Sensitivity labels, on the other hand, are generally managed through the Purview Information Protection portal and auto-labeling policies rather than by one-off manual edits in the Purview catalog. In fact, as of this writing Microsoft Purview does not support manually applying a sensitivity label directly in the Data Map UI for an individual asset (docs.azure.cn). Instead, the typical way to label data assets is to rely on the auto-labeling mechanism or to pre-label the data via other tools. “Manual” labeling in practice might mean: (a) using the Microsoft Purview compliance portal to create or modify a label and then running a Purview scan to apply it via a policy, or (b) applying a default label at the source. For example, you could use a policy to label all files in a specific ADLS folder as “Highly Confidential” if you know everything in that folder is sensitive. In Purview, you would implement that by creating an auto-labeling rule scoped to that folder’s assets. Once the policy is in place, the next scan will tag those files with the chosen sensitivity label. While you cannot simply click “apply label” on an asset in Purview’s catalog today, these policy-driven approaches achieve the outcome of manual labeling. In summary, manual classification (tagging the type of data) can be done within Purview’s catalog, but manual sensitivity labeling is accomplished by configuring labels and policies in the Purview compliance portal and then scanning or otherwise applying them to the assets.

Step-by-Step Walkthrough

Next, let’s walk through how to set up Microsoft Purview to classify and label Synapse Spark data stored in ADLS Gen2. We’ll cover registering the data source, scanning with classifications, reviewing the results, and setting up sensitivity labels (both through auto-labeling policies and manually).

Register the ADLS Gen2 data source in Purview: Before Purview can scan your Synapse data lake, you need to register the storage account (or the specific data lake) as a data source in your Purview Data Map. In the Purview governance portal, navigate to Data Map > Sources and choose to register a new source. Select Azure Data Lake Storage Gen2 as the source type and provide the required details: a descriptive name, the Azure subscription and storage account, and the appropriate collection to organize it under (learn.microsoft.com). You will also need to authorize Purview to read the data – the simplest method is often to grant the Purview account’s managed identity the Storage Blob Data Reader role on the ADLS Gen2 account (the generic Reader role covers only the management plane and does not allow reading file contents). Once you click Apply, the ADLS Gen2 source will appear in Purview, and you’re ready to configure scanning.
Configure and run a scan with classification enabled: With the data source registered, the next step is to set up a scan for that ADLS Gen2 source. In Purview, find your registered ADLS Gen2 source (under Data Map > Sources) and create a new scan. You can choose to scan the entire storage account or specific containers/folders depending on where your Synapse Spark data resides. When configuring the scan, ensure that classification is turned on (this is usually on by default). Purview allows you to specify which file types and classification rules to use in the scan – for example, you might include Parquet, CSV, JSON, etc., and all relevant built-in classifications, or limit to certain rule sets as needed (learn.microsoft.com). Typically, using the default scan rule set is fine, as it will attempt to scan all supported file formats and apply all built-in classifiers. You can schedule the scan to run periodically (e.g. daily or weekly) or run it once on demand. It’s often a best practice to schedule recurring scans for continuously updated data, so new files from Synapse Spark jobs are regularly classified (techcommunity.microsoft.com). After configuring, click Save and run to start the scan. Purview will use an integration runtime to connect to ADLS Gen2 and begin scanning files – parsing Parquet schemas, reading a sample of data, and looking for matches to its sensitive info patterns. Once the scan completes, the discovered data assets (files, folders, etc.) along with any classifications will be added to the Purview catalog.
View and manage classification results (file and column level): After the scan, you can explore the Purview Data Catalog to review what was found. Navigate in Purview to the storage account collection, and drill down into the container or folder that was scanned – you should see assets representing your files (e.g. Parquet files, or resource sets if files are part of a partitioned dataset). Click on a particular file asset to see its overview page, which will show metadata and any classifications that were automatically applied. For example, a Parquet file CustomerData.snappy.parquet might show classifications like “Credit Card Number” or “Email Address” if those patterns were detected in its content. If the file is recognized as a structured data resource, Purview may show a schema with columns, and you might see classifications attached to specific columns (e.g. a column named Email tagged with the Email Address classification). You can also use the catalog’s filters to search across assets by classification – for instance, filter by the “Credit Card Number” classification to see all files or tables where credit card data was found. If you open an asset’s detail page, you’ll see a list of Classifications applied to it, and if sensitivity labeling is in place, you may see a Sensitivity label field as well.

Figure: Purview catalog view of a data asset (an Azure blob file) showing the applied sensitivity label “Secret” (top right) and multiple detected classifications (e.g. Credit Card Number, Aadhaar number) listed in the asset’s details. Administrators can review these auto-classifications and, if needed, manually add or remove classifications.

It’s important for administrators to validate the classification results. In some cases, Purview might flag something as a credit card number that is actually a false positive, or it might not catch a custom sensitive data type. You can use the Purview catalog to adjust classifications: click Edit on the asset’s classification section to manually add a missing classification or remove an incorrect one (docs.azure.cn). Remember that classifications are purely descriptive tags in the catalog – removing a classification won’t delete data, it just adjusts the metadata. By curating the classifications, you improve the accuracy of your data inventory. If many assets were scanned, Purview also provides insight reports summarizing the scan results (how many assets have each classification, etc.), which can help you spot anomalies or areas to focus on.
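
The kind of summary an insight report gives you can be approximated on any exported list of assets. The sketch below assumes a hypothetical export format (a list of dictionaries with `name` and `classifications` keys); it is not a Purview API and only illustrates the counting involved.

```python
from collections import Counter

# Hypothetical scan results; in practice these would come from a catalog export.
scanned_assets = [
    {"name": "customers.parquet", "classifications": ["Email Address", "Credit Card Number"]},
    {"name": "orders.parquet", "classifications": ["Credit Card Number"]},
    {"name": "products.parquet", "classifications": []},
]


def classification_summary(assets):
    """Count how many assets carry each classification."""
    counts = Counter()
    for asset in assets:
        # A set avoids double-counting a classification within one asset.
        counts.update(set(asset["classifications"]))
    return counts


def unclassified_assets(assets):
    """List assets with no classifications, which may deserve a manual look."""
    return [a["name"] for a in assets if not a["classifications"]]


print(classification_summary(scanned_assets).most_common())
print(unclassified_assets(scanned_assets))
```

Assets that come back with no classifications at all are often worth a spot check, since they are either genuinely non-sensitive or contain a custom data type the built-in rules did not recognize.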

Creating and assigning sensitivity labels (manually and via auto-labeling policies): Once your data is classified, you’ll likely want to ensure the proper sensitivity labels are applied to those assets, to align with your organization’s data handling policies. There are two ways to assign sensitivity labels in Purview: automatically via policies (preferred) or manually (with some caveats). We’ll outline both:
- Automatic labeling via policies: Microsoft Purview allows you to define auto-labeling policies that map certain classification findings to a specific sensitivity label. These policies are created in the Microsoft Purview compliance portal (under Information Protection). First, make sure you have defined the sensitivity labels you need – for example, labels like Public, General, Confidential, Highly Confidential – and that they are enabled for Files & other data assets (this scope makes them available to Purview’s Data Map) (learn.microsoft.com). Next, create an auto-labeling policy for non-Microsoft 365 data (since ADLS and Azure SQL are considered non-M365 workloads). In the policy, you can scope it to either all Azure data locations or specific ones (like specific storage accounts or even specific containers). Then define the conditions: for instance, if an asset contains the “Credit Card Number” sensitive info type (classification), then apply the Confidential label. You can combine conditions or target multiple classifications. Once you publish this policy, Purview will evaluate it on the next scan. Essentially, as Purview scans and finds those classifications, it will automatically assign the chosen label to the asset in the catalog (learn.microsoft.com). It’s recommended to wait ~15 minutes after creating a new policy for it to take effect, then rerun (or schedule) a scan of the data source (learn.microsoft.com). After the scan, you should see that assets meeting the criteria now show a sensitivity label in Purview. (For example, that CustomerData.snappy.parquet file might now be labeled “Confidential” in addition to having classifications.) Auto-labeling policies are powerful because they ensure consistent application of labels based on your rules – every time new data is scanned, the labels will be applied uniformly across similar data.
- Manual labeling: If you need to assign a sensitivity label without waiting for an auto-scan or for a one-off case, you have a couple of options. One approach is to manually apply the label at the source – for instance, using Azure Storage or AIP tools to label a file or container – and then when Purview scans it, it will pick up that label. (If a file in ADLS is already labeled by MIP, Purview’s scan will recognize and reflect that label in the catalog (learn.microsoft.com).) Another approach is to use a targeted auto-label policy as a “manual” method: for example, create a policy that specifically targets a single file or a small scope and triggers a label, then run the scan. What you currently cannot do is simply open the Purview catalog, click an asset and set a sensitivity label from a dropdown – that feature is not available in the data catalog interface (docs.azure.cn). So, “manual” labeling in Purview really means using the compliance portal to directly label items (in limited scenarios) or leveraging policies in a controlled way. In practice, most organizations will rely on the auto-label rules for consistency and only use manual methods in exceptional cases. After any manual label changes, you’ll want to scan or refresh Purview so that the catalog is up-to-date.
After labels are applied, you can view them in the Purview catalog just like classifications. You can even filter assets by sensitivity label. For example, if you want to list all data assets labeled “Highly Confidential”, you can use the Purview search filters to show only those assets (learn.microsoft.com). Then you might review what classifications led to that label (Purview will show both the label and the underlying classifications on the asset’s page). This helps in validating that the labeling is correct for the data’s content.
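
A scoped auto-labeling policy combines two tests: is the asset inside the policy’s scope, and does it carry one of the triggering classifications? The following sketch models that with a path-prefix scope. The storage URL, the class and field names, and the use of a simple prefix match are assumptions for illustration; real policies are configured in the compliance portal, not in code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScopedPolicy:
    label: str
    scope_prefix: str                   # hypothetical: only assets under this path are in scope
    trigger_classifications: frozenset  # any one of these triggers the label


def policy_matches(policy, asset_path, classifications):
    """Return True if the asset is in scope and has a triggering classification."""
    in_scope = asset_path.startswith(policy.scope_prefix)
    triggered = bool(policy.trigger_classifications & set(classifications))
    return in_scope and triggered


# Hypothetical storage account and container names.
policy = ScopedPolicy(
    label="Confidential",
    scope_prefix="https://contosolake.dfs.core.windows.net/curated/",
    trigger_classifications=frozenset({"Credit Card Number"}),
)

asset = "https://contosolake.dfs.core.windows.net/curated/sales/CustomerData.snappy.parquet"
print(policy_matches(policy, asset, ["Credit Card Number", "Email Address"]))
```

Narrowing `scope_prefix` to a single folder or file is what turns a general policy into the targeted, “manual-style” policy described in the second bullet.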

Best Practices for Consistent Labeling and Data Protection

Managing data classification and labeling is an ongoing process. Here are some best practices to ensure consistency and robust data protection:

Use a unified taxonomy: Align your Purview classifications and sensitivity labels with your organization’s overall data taxonomy. The sensitivity labels in Purview are the same labels used across Microsoft 365, SharePoint, Power BI, etc., which means a label like “Confidential” has the same meaning everywhere (docs.azure.cn). Leverage that by extending your existing labels to Purview so that all data – whether in a Spark table, a Parquet file, or an Excel report – uses consistent labels that everyone understands.
Enable broad classification coverage, then refine: Start with Purview’s rich set of built-in classifiers to automatically scan for sensitive data across your Synapse-linked lake. By casting a wide net (the 200+ default classifiers), you get a comprehensive view of what sensitive data you have (techcommunity.microsoft.com). Then, refine the results: disable any irrelevant classifiers that generate noise (false positives) and add custom classifications for any organization-specific data types not covered by the built-ins. This tuning ensures that auto-classification remains accurate and relevant to your business.
Schedule regular scans: Data in your Synapse environment is likely constantly evolving – new data is ingested, and existing data is transformed. Schedule Purview scans on a regular cadence (e.g. nightly or weekly) for critical storage accounts. Regular scanning means new or updated files (such as those produced by Synapse Spark jobs) get classified and labeled soon after they appear. This practice keeps the catalog up-to-date and ensures sensitive data doesn’t slip through untagged.
Verify and curate classification results: Don’t treat the automated classifications as set-and-forget. Use Purview’s reports and search filters to spot-check assets with certain classifications. In the catalog, review high-value assets (like a folder holding all customer data) to ensure the classifications make sense. Correct any mistakes by manually editing classifications in the catalog (docs.azure.cn). This curation step is important for building trust in the catalog’s accuracy, especially if the classifications will drive labeling and protection decisions.
Implement auto-labeling policies thoughtfully: When creating auto-label rules, involve your security/compliance teams to map the right label to each classification or combination of classifications. Not all sensitive data is equal – for example, you might decide that any dataset containing personal identifiers (Name, Email, Phone) gets labeled “Confidential”, but only label something “Highly Confidential” if it contains financial data or national IDs. Define these rules clearly in Purview’s auto-labeling policies, and document them so data users know what the labels signify. Start with a broad policy if unsure (e.g. anything with any sensitive info gets a generic confidential label) and iterate towards more specific rules as you analyze the results.
Leverage hierarchical labeling where appropriate: If you know an entire container or folder in ADLS Gen2 will only contain one type of data (say, a Spark output folder that always has encrypted credit card info), consider applying a default label to that container (via a policy or script) and enforce that all files under it inherit that label. This can complement Purview’s file-level scanning by providing a safety net (every file in this secure folder is labeled Highly Confidential by default). Purview can reflect that default label and still add specific classifications it finds. However, use this approach carefully – overly broad labeling can reduce the value of labels if everything ends up labeled highly sensitive regardless of content.
Integrate with Synapse and development processes: Take advantage of the integration between Azure Synapse and Purview. For instance, you can connect a Purview account to Synapse Studio so that data engineers can search the Purview catalog from within Synapse. This means they can see classifications and sensitivity labels on datasets before using them, encouraging informed handling of data. Additionally, consider automating a notification or review process: e.g., if Purview flags a newly created dataset as containing sensitive data, alert a data steward or update a data inventory. Such processes ensure that labeling and classification drive real actions in your data lifecycle.
Monitor label usage and enforce policies: Applying sensitivity labels in Purview is not just for show – tie them into security controls. For example, if a label “Highly Confidential” is applied to certain data, you can have corresponding MIP policies that enforce encryption or restrict access to that data. In the Microsoft Purview compliance portal, you can define that a “Highly Confidential” label should encrypt files so that only specific users or groups can open them. While Purview’s catalog label is metadata, that same label on the actual file (if applied through MIP) can trigger enforcement. Use tools like Microsoft Defender for Cloud Apps or the Azure Information Protection scanner (for on-premises files) in conjunction with Purview to create a full-circle protection scheme. Also, use Purview’s auditing and insight reports to track how many assets are labeled at each level, and verify that the numbers match your expectations and compliance requirements.
Educate data consumers: Finally, make sure that the people using the data (analysts, data scientists, BI developers) understand what the classifications and labels mean. For example, if a Spark notebook is pulling data from a folder labeled “Confidential”, the developer should know to treat that data carefully (perhaps apply extra access controls or not create wide-sharing datasets from it). Power BI users should be aware that if they create a report using a highly sensitive dataset, that report should also be labeled appropriately. Sensitivity is a shared responsibility – Purview provides the tags and labels, but everyone in the data chain needs to respect them.
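
The notification idea from the best practices above (alert a steward when a newly scanned dataset turns out to contain sensitive data) comes down to comparing two scan snapshots. The sketch below assumes each snapshot is a plain dictionary mapping an asset name to its classifications; that format and the function name are hypothetical, and how the alert is delivered is left out.

```python
def newly_sensitive(previous, current, watched):
    """Return assets that gained a watched classification since the previous snapshot."""
    alerts = {}
    for asset, classes in current.items():
        before = set(previous.get(asset, ()))
        gained = (set(classes) - before) & set(watched)
        if gained:
            alerts[asset] = sorted(gained)
    return alerts


# Hypothetical snapshots from two consecutive scans.
previous_scan = {"a.parquet": ["Email Address"]}
current_scan = {
    "a.parquet": ["Email Address"],
    "b.parquet": ["Credit Card Number", "Email Address"],
    "c.parquet": [],
}

alerts = newly_sensitive(
    previous_scan, current_scan, watched={"Credit Card Number", "Email Address"}
)
print(alerts)
```

Only `b.parquet` is reported here: `a.parquet` was already known to contain email addresses, and `c.parquet` has nothing the watch list cares about.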

Downstream Integration: Power BI and Security Policies

One big advantage of using sensitivity labels via Microsoft Purview is that those labels flow into downstream tools and processes. The sensitivity labels applied to ADLS Gen2 files or Synapse assets are part of the same Microsoft Purview Information Protection framework used by Power BI, Office, and other services. This means a few things for downstream integration:

Consistent labels in Power BI: When you connect Power BI to a labeled dataset (for example, loading a Parquet file from ADLS that Purview has labeled “Confidential”), that label context can inform Power BI’s own handling. In Power BI, you can also apply sensitivity labels to datasets, reports, and dashboards to match the source. Because the labels come from a unified taxonomy, your Power BI developers can choose the same “Confidential” or “Highly Confidential” label for their reports as was applied in Purview to the data source. This creates end-to-end consistency – the data at rest is labeled, and the BI artifacts derived from it carry forward that label. Power BI’s integration with Purview is evolving, but even today you can search the Purview catalog for Power BI assets and see their labels in Purview. In a Purview catalog search, for example, you can filter by sensitivity label (e.g. “Secret”, “Confidential”) and get results that include both Power BI datasets and files, each showing its assigned label. This illustrates the unified view Purview provides across different systems. By aligning Synapse and Power BI on the same labeling, organizations ensure that everyone sees the same sensitivity designations, reducing confusion.
Protection persists on export: Perhaps the most critical aspect of sensitivity labels is that they can enforce protection when data moves outside the controlled environment. For example, suppose you have a Parquet file labeled Highly Confidential in ADLS, and a Power BI report connects to it. If a user later exports data from that Power BI report to Excel or PDF, Power BI will automatically apply the Highly Confidential label to the exported file and protect it according to the label’s encryption settings (learn.microsoft.com). This means if the label is configured to encrypt files so that only your organization’s users can open them, the exported Excel will be encrypted and inaccessible to outsiders. The label (set originally because Purview classified the data) thus triggers downstream protection – your sensitive data remains protected even as it flows to end-user tools or leaves the analytics system. This is a huge benefit of integrating Purview labeling with tools like Power BI and Office.
Support for access control and auditing: Sensitivity labels can be tied into access control decisions. For instance, you might use Azure Purview Data Policy (in preview) or Azure AD to restrict access to certain data by label. While Azure Synapse itself doesn’t natively consume Purview labels for row-level security, you could imagine processes where highly sensitive data is isolated in certain storage with stricter ACLs. Additionally, because labels are recorded in the Purview catalog and the M365 Compliance Center, you gain auditing capabilities. Security admins can monitor label usage – e.g. get alerts if someone tries to remove a label or if labeled data is being accessed frequently. Purview’s “sensitivity label insights” report provides a quick overview of how data is labeled across your estate, and the compliance center can show you things like how many files are labeled with each category and if any risky behavior (like oversharing of a highly confidential file) is happening. By classifying and labeling Synapse data, you essentially integrate it into the broader security ecosystem of Microsoft Purview.
Microsoft Information Protection (MIP) synergy: Since Microsoft Purview’s labeling is part of MIP, you can create holistic policies that span Azure data and Office data. For example, a Data Loss Prevention (DLP) policy in M365 could be configured to prevent emails with attachments that contain data labeled as Highly Confidential. Or if a user attempts to use an Azure Synapse Spark notebook to output a highly sensitive dataset to an unsecured location, you could have monitoring in place (via Azure Monitor or Purview scanning alerts) to flag that. The key is that the label is the signal – once data is labeled, many security tools can respond to that signal. Thus, labeling Synapse’s data with Purview is not just for cataloging; it’s an enabling step for advanced security controls like encryption, DLP, conditional access, and auditing throughout the data lifecycle.
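
One way to reason about “the label is the signal” downstream is that a derived artifact should be at least as sensitive as its most sensitive source. The sketch below expresses that rule; the label order and function name are assumptions for illustration, and it models the principle rather than how Power BI or any other tool implements label inheritance.

```python
# Hypothetical label taxonomy, ordered from least to most sensitive.
SENSITIVITY_ORDER = ["Public", "General", "Confidential", "Highly Confidential"]


def inherited_label(source_labels):
    """Return the most sensitive label among the sources, ignoring unlabeled ones."""
    labeled = [label for label in source_labels if label is not None]
    if not labeled:
        return None
    return max(labeled, key=SENSITIVITY_ORDER.index)


# A report built on three sources, one of which is unlabeled.
print(inherited_label(["General", None, "Highly Confidential"]))
```

A check like this can be run as a review step: if a report’s actual label ranks below the value computed from its sources, that is a candidate for a steward to look at.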

Conclusion

For Azure Synapse administrators and data stewards, implementing data classification and labeling with Microsoft Purview should be a top priority. It provides visibility into what sensitive data you have in your Spark pools and data lake, and it applies a consistent protective labeling that travels with the data. Purview’s automatic classifiers identify the sensitive content (credit cards, personal IDs, etc.) and its integration with sensitivity labels ensures that data gets the appropriate label (Confidential, Highly Confidential, etc.) in the catalog (jamesserra.com). With the step-by-step process outlined above, you can set up Purview to scan your ADLS Gen2 linked to Synapse, review and refine the results, and put in place labeling policies that automatically tag new data as it arrives. The payoff is significant: not only do you have an organized catalog for data governance, but those labels flow into tools like Power BI – helping to protect data downstream and enforce compliance even as data is shared or exported (learn.microsoft.com). By following best practices (regular scanning, policy-driven labeling, and continuous oversight), you’ll maintain a high level of labeling consistency and data protection. In short, Microsoft Purview empowers you to know your Synapse data and safeguard it, bridging the gap between data discovery and enterprise-wide information protection.

Sources:

Microsoft Purview documentation on data classification and sensitivity labels (jamesserra.com, docs.azure.cn)
Azure Purview integration with Azure Synapse – automated data discovery and classification capabilities (techcommunity.microsoft.com)
Microsoft Purview Data Map labeling preview – applying sensitivity labels to files and databases (learn.microsoft.com, docs.azure.cn)
Power BI sensitivity label integration – how labels applied in Purview/MIP enforce protection on export (learn.microsoft.com)
James Serra, Classifications and sensitivity labels in Microsoft Purview – Q&A on how classifications vs labels work (jamesserra.com)
Microsoft Purview FAQ – manual classification edits and auto-labeling behavior (docs.azure.cn, learn.microsoft.com)
