Section 1 – Implement information protection – Create and manage SITs and trainable classifiers

The first section of this study guide covers sensitive information types and trainable classifiers. First, though, a quick look at the trials you can use to try them out if you don’t have real licenses available.

You can use the trial for 90 days.

Free trial of Microsoft Purview compliance solutions

Try all Microsoft Purview risk and compliance solutions at the E5 level for free for 90 days. Get details on trial eligibility and how to sign up.

Let’s first look at what this section covers; afterwards we should have a better understanding of the different functions.

What are Sensitive information types?

The first step in Information Protection is identifying and classifying sensitive items under your organization’s control. Microsoft Purview offers three methods for this:

Manual identification by users.
Automated pattern recognition (e.g., sensitive information types).
Machine learning.

Sensitive Information Types (SITs) are pattern-based classifiers for detecting sensitive data like social security and credit card numbers. Microsoft provides pre-configured SITs or allows you to create custom ones.

SITs can be used in the following:

What about Trainable classifiers?

Manual categorization relies on human judgment and actions. Users and administrators classify content when they come across it. You can use pre-existing labels and sensitive information types or create custom ones. Once content is categorized, you can protect it and manage how it is handled.

These are the automatic methods for categorizing content:

Locating content based on keywords or metadata values (utilizing keyword query language).
Identifying patterns of sensitive information (e.g., social security, credit card, or bank account numbers) through sensitive information type definitions.
Recognizing content as a variation of a template (document fingerprinting).
Detecting content based on the presence of exact strings (exact data match).

Following these categorization processes, sensitivity and retention labels can be automatically applied, enabling the content to be accessible within Purview for various purposes.

Classifiers are available to use as a condition in:

Office auto-labeling with sensitivity labels
Auto-apply retention label policy based on a condition
Communication compliance
Sensitivity labels can use classifiers as conditions, see Apply a sensitivity label to content automatically.
Data loss prevention

When we talk about defining the content, we are talking about the area known as “Know Your Data”.

Create and manage sensitive info types

Identify sensitive information requirements for an organization’s data

One quick win could be GDPR-related content, which can be defined by the following:

Personally Identifiable Information (PII): This includes information that can be used to identify an individual, such as names, addresses, social security numbers, phone numbers, and email addresses.

Other examples are:

Financial Data: Financial information like credit card numbers, bank account details, and financial transactions fall under this category.
Health Information (PHI): PHI includes medical records, diagnoses, treatment histories, and other health-related data protected by regulations like the Health Insurance Portability and Accountability Act (HIPAA).
Confidential Business Data: This category encompasses sensitive business information such as trade secrets, intellectual property, financial reports, and strategic plans.

Let’s say, for example, that you must adhere to a regulation that forces you to remove PII content by a certain date or when you no longer need it. You can define that content with the following methods.

Translate sensitive information requirements into built-in or custom sensitive info types

Create and manage custom sensitive info types

First you have to understand the following concepts:

regular expressions – Microsoft 365 sensitive information types use the Boost.RegEx 5.1.3 engine
keyword lists – you can create your own as you define your sensitive information type or choose from existing keyword lists
keyword dictionary
Sensitive information type functions
confidence levels
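To make the first and fourth concepts more concrete, here is a minimal Python sketch of how a regular expression and a validation function can work together. It is only an illustration: the pattern, the `luhn_valid` helper and `find_card_numbers` are hypothetical names of my own, Python's `re` module is not the Boost.RegEx engine, and this is not how Purview's built-in functions are implemented. The idea is simply that the regex finds candidates and a checksum function (here the well-known Luhn check used for payment card numbers) drops random digit runs.

```python
import re

# Hypothetical pattern: 16 digits, optionally grouped in fours by spaces or hyphens.
CARD_PATTERN = re.compile(r"\b(?:\d{4}[ -]?){3}\d{4}\b")


def luhn_valid(number: str) -> bool:
    """Return True if the digits in the string pass the Luhn checksum."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    total = 0
    # Walk from the rightmost digit and double every second digit.
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return len(digits) > 0 and total % 10 == 0


def find_card_numbers(text: str) -> list[str]:
    """The regex finds candidates; the checksum filters out random digit runs."""
    return [m.group(0) for m in CARD_PATTERN.finditer(text) if luhn_valid(m.group(0))]


if __name__ == "__main__":
    print(find_card_numbers("Card: 4111 1111 1111 1111, order 1234 5678 9012 3456"))
```

Only the first number in the sample string passes the checksum, which is why a function on top of a pattern reduces false positives.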

You can access SITs from https://compliance.microsoft.com/dataclassificationclassifiers?viewid=sensitiveinfotypes

Here you have a total of 314 built-in types.

And here you can find the full list of them.

Sensitive information type entity definitions

There are many sensitive information types that are ready for you to use in your DLP policies. This article is a list of all these sensitive information type entity definitions.

Author note! Did you notice that you can also identify Azure or SQL sensitive info types with them? If you didn’t, here is the list:

Entity	Publisher
Azure AD client access token	Microsoft
Azure AD client secret	Microsoft
Azure AD User Credentials	Microsoft
Azure App Service deployment password	Microsoft
Azure Batch shared access key	Microsoft
Azure Bot Framework secret key	Microsoft
Azure Bot service app secret	Microsoft
Azure Cognitive Search API key	Microsoft
Azure Cognitive Service key	Microsoft
Azure Container Registry access key	Microsoft
Azure Cosmos DB account access key	Microsoft
Azure Databricks personal access token	Microsoft
Azure DevOps app secret	Microsoft
Azure DevOps personal access token	Microsoft
Azure DocumentDB auth key	Microsoft
Azure EventGrid access key	Microsoft
Azure Function Master / API key	Microsoft
Azure IAAS database connection string and Azure SQL connection string	Microsoft
Azure IoT connection string	Microsoft
Azure IoT shared access key	Microsoft
Azure Logic app shared access signature	Microsoft
Azure Machine Learning web service API key	Microsoft
Azure Maps subscription key	Microsoft
Azure publish setting password	Microsoft
Azure Redis cache connection string	Microsoft
Azure Redis cache connection string password	Microsoft
Azure SAS	Microsoft
Azure service bus connection string	Microsoft
Azure service bus shared access signature	Microsoft
Azure Shared Access key / Web Hook token	Microsoft
Azure SignalR access key	Microsoft
Azure SQL connection string	Microsoft
Azure storage account access key	Microsoft
Azure storage account key	Microsoft
Azure Storage account key (generic)	Microsoft
Azure Storage account shared access signature	Microsoft
Azure Storage account shared access signature for high risk resources	Microsoft
Azure subscription management certificate	Microsoft

But there are also limitations to them; see more on Learn.

Sensitive information type limits

Learn about instance count and other sensitive information type limits

You can copy the built-in types and use the copy as the starting point for your own, well, almost all of them. I haven’t personally found a comprehensive list of what you can and can’t copy. Here’s a short list from Learn of built-in types that can’t be copied.

Canada driver’s license number
EU driver’s license number
EU national identification number
EU passport number
EU social security number or equivalent identification
EU tax identification number
International classification of diseases (ICD-10-CM)
International classification of diseases (ICD-9-CM)
U.S. driver’s license number

You can create your own from scratch too.

Match confidence increases with more supporting elements detected alongside the primary element. Opt for higher confidence levels when specifying more supporting elements to ensure that matched items contain the sought-after sensitive information. High-confidence matches feature numerous nearby supporting elements, while low-confidence matches have few or none nearby.

The primary element is the core information you aim to identify within content. You can define it in several ways, including regular expressions (RegEx), keyword lists, keyword dictionaries, or functions.

Once the primary element is identified, any supporting elements will only match when located in close proximity to the primary element. The closer the primary and supporting elements are, the higher the likelihood that the detected content aligns with your intended criteria.

Augmenting with supporting elements enhances the accuracy of detecting valid information. For instance, to identify nine-digit employee ID numbers, you can incorporate keywords like “employee,” “badge,” and “ID” in proximity to the numbers, ensuring supporting elements match only when found nearby after the primary element is matched.

For finer precision in item evaluation and detection, you can implement extra checks to either incorporate or exclude particular text or patterns. For instance, you can exclude specific 16-digit numbers that could be mistaken as credit card numbers.
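The employee ID example above can be sketched in a few lines of Python. Treat it as a rough model of the idea, not of the Purview engine: the 300-character window, the exclusion set and the mapping from keyword hits to "low", "medium" and "high" are my own assumptions for illustration.

```python
import re

PRIMARY = re.compile(r"\b\d{9}\b")        # primary element: a nine-digit number
SUPPORTING = ("employee", "badge", "id")  # supporting keywords from the example above
PROXIMITY = 300                           # assumed character window around the match
EXCLUDED = {"123456789"}                  # assumed additional check: known dummy values


def classify(text: str) -> list[tuple[str, str]]:
    """Return (match, confidence) pairs for every nine-digit number in the text."""
    results = []
    lowered = text.lower()
    for match in PRIMARY.finditer(text):
        if match.group(0) in EXCLUDED:
            continue  # additional check: skip values we never want to report
        # Supporting elements only count when they sit close to the primary element.
        start = max(0, match.start() - PROXIMITY)
        end = min(len(text), match.end() + PROXIMITY)
        window = lowered[start:end]
        hits = sum(1 for kw in SUPPORTING if re.search(rf"\b{kw}\b", window))
        # Assumed mapping: more supporting evidence means higher confidence.
        confidence = "high" if hits >= 2 else "medium" if hits == 1 else "low"
        results.append((match.group(0), confidence))
    return results


if __name__ == "__main__":
    print(classify("Employee badge number 987654321"))
    print(classify("Reference 987654321"))
```

The same nine digits come back as a high-confidence match next to "Employee badge" and as a low-confidence match when no keyword is nearby.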

Create and manage exact data match (EDM) classifiers

With Exact Data Match (EDM) based classification, you can create a custom sensitive information type that is designed to:

be dynamic and easily refreshed
result in fewer false-positives
work with structured sensitive data
handle sensitive information more securely, not sharing it with anyone, including Microsoft
be used with several Microsoft cloud services

The steps to create EDM SITs

Phase	What’s needed
Phase 1: Export source data for exact data match based sensitive information type	– Read access to the sensitive data
Phase 2: Create the sample file	– Know the column headers and the format of the data you will be looking for in each column.
Phase 3: Create the EDM SIT	– Access to Microsoft Purview Compliance portal > Data classification > Exact data match
Phase 4: Hash and upload the sensitive information source table for exact data match sensitive information types	– Custom security group and user account – Hash and upload from one computer: local admin access to a computer with direct internet access and to host the EDM Upload Agent – Hash and upload from separate computers: local admin access to a computer with direct internet access and host the EDM Upload Agent for the upload and local admin access to a secure computer to host the EDM Upload Agent to hash the sensitive information source table – Read access to the sensitive information source table file
Phase 5: Test an exact data match sensitive information type	– Access to the Microsoft Purview compliance portal
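Phase 4 is the part that usually raises questions, so here is a small Python sketch of why hashing lets a service match values without ever seeing them in clear text. This is explicitly not the algorithm of the EDM Upload Agent: the salted HMAC-SHA256, the normalization and the function names are stand-ins I chose for illustration.

```python
import hashlib
import hmac


def hash_value(value: str, salt: bytes) -> str:
    """Hash one normalized cell value with a secret salt (assumed scheme)."""
    normalized = value.strip().lower()
    return hmac.new(salt, normalized.encode("utf-8"), hashlib.sha256).hexdigest()


def build_hashed_table(rows: list[dict[str, str]], salt: bytes) -> list[dict[str, str]]:
    """Replace every cell of the source table with its hash before upload."""
    return [{col: hash_value(val, salt) for col, val in row.items()} for row in rows]


def is_exact_match(candidate: str, column: str,
                   hashed_table: list[dict[str, str]], salt: bytes) -> bool:
    """Hash the candidate the same way and compare hashes only."""
    target = hash_value(candidate, salt)
    return any(hmac.compare_digest(row[column], target) for row in hashed_table)


if __name__ == "__main__":
    salt = b"demo-salt"  # hypothetical value; a real salt must be kept secret
    table = build_hashed_table([{"PatientID": "P-1001", "SSN": "078-05-1120"}], salt)
    print(is_exact_match("078-05-1120", "SSN", table, salt))
    print(is_exact_match("078-05-1121", "SSN", table, salt))
```

The uploaded table contains only hashes, yet a value found in content can still be checked against it by hashing it in the same way.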

Where can EDM SITs be used?

Microsoft Purview Data Loss Prevention
Auto-labeling (service and client side)
Microsoft Purview Insider Risk Management policies
Microsoft Purview eDiscovery
Microsoft Purview Insider Risk Management
Microsoft Defender for Cloud Apps

Which services support EDM?

Service	Locations
Microsoft Purview Data Loss Prevention	– SharePoint online – OneDrive for Business – Teams Chat – Exchange Online – Devices
Microsoft Defender for Cloud Apps	– SharePoint Online – OneDrive for Business
Auto-labeling (service side)	– SharePoint online – OneDrive for Business – Exchange Online
Auto-labeling (client side)	– Word – Excel – PowerPoint – Exchange desktop clients
Customer Managed Key	– SharePoint online – OneDrive for Business – Teams Chat – Exchange Online – Word – Excel – PowerPoint – Exchange desktop clients – Devices
eDiscovery	– SharePoint online – OneDrive for Business – Teams Chat – Exchange Online – Word – Excel – PowerPoint – Exchange desktop clients
Insider Risk Management	– SharePoint online – OneDrive for Business – Teams Chat – Exchange Online – Word – Excel – PowerPoint – Exchange desktop clients

Implement document fingerprinting

For precise document fingerprint matching configuration, choose “Exact” as the setting for the high confidence level. When you opt for “Exact” as the high confidence level, only files that precisely match the fingerprint’s text will trigger detection. Any minor deviation from the fingerprint text will result in non-detection.

Example use cases

Official government documentation.
Forms collecting employee information for Human Resources purposes.
Tailored forms designed exclusively for your organization’s needs.

How does it work?

Documents are not associated with actual fingerprints; the term “document fingerprint” serves as a metaphor.
Similar to unique patterns in human fingerprints, documents possess unique word patterns.
When you upload a file, DLP identifies these distinctive word patterns, creates a document fingerprint based on them, and utilizes it to identify outbound documents with matching patterns.
Uploading forms or templates is highly effective in generating such document fingerprints because users start with the same initial set of words and then add their own content.
If an outbound document lacks password protection and contains all the text from the original form, DLP can determine if it matches the document fingerprint.
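The points above can be sketched as a toy model in Python. The real fingerprinting algorithm isn't described here, so everything below is an assumption: I treat the fingerprint as the ordered list of words in the template and call a document a match when it contains all of those words in the same order, with the user's own content allowed in between.

```python
import re


def words(text: str) -> list[str]:
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def make_fingerprint(template: str) -> list[str]:
    """Toy fingerprint: the ordered word pattern of the uploaded form."""
    return words(template)


def matches_fingerprint(fingerprint: list[str], document: str) -> bool:
    """True if every fingerprint word appears in the document, in the same order."""
    remaining = iter(words(document))
    # 'word in remaining' advances the iterator, so order is enforced.
    return bool(fingerprint) and all(word in remaining for word in fingerprint)


if __name__ == "__main__":
    form = make_fingerprint("Employee name: Employee number: Date of birth:")
    filled = "Employee name: Jane Doe Employee number: 42 Date of birth: 1990-01-01"
    partial = "Employee name: Jane Doe Employee number: 42"
    print(matches_fingerprint(form, filled))
    print(matches_fingerprint(form, partial))
```

The filled-in form matches because it still contains all the text of the original form, while the partial copy does not.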

Microsoft Word .dotx file type isn’t supported for Fingerprinting.

Limitations

Password protected files
Files that contain images only
Documents that don’t contain all the text from the original form used to create the document fingerprint
Files larger than 4 MB

Upload the file

Open the same page where you can create SITs and choose Fingerprint based

Then upload a template file that adheres to the prerequisites mentioned earlier.

Create and manage trainable classifiers

Identify when to use trainable classifiers

You can use them in the following:

Office auto-labeling with sensitivity labels
Auto-apply retention label policy based on a condition
Communication compliance
Sensitivity labels can use classifiers as conditions, see Apply a sensitivity label to content automatically.
Data loss prevention

You need the following permissions in the following scenarios:

Scenario	Required Role Permissions
Retention label policy	Record Management, Retention Management
Sensitivity label policy	Security Administrator, Compliance Administrator, Compliance Data Administrator
Communication compliance policy	Insider Risk Management Administrator, Supervisory Review Administrator

Classifiers don’t support encrypted files, because the content cannot be accessed without decrypting it. Also, only the user who creates a custom classifier can train it and review the predictions it makes.

Design and create a trainable classifier

Before you can create a custom trainable classifier, Microsoft Purview scans your content locations to gather insights about the content in your organization. This scan is expected to finish within 7 to 14 days.

If you prefer not to start this process immediately, you can still use the pre-trained classifiers right away.

Feeding of samples to the trainable classifier is known as seeding.

A minimum of 50 positive samples is required, and you can include up to a maximum of 500. The trainable classifier will consider the 500 most recently created samples, sorted by file creation date and time. Providing a greater number of samples enhances the classifier’s prediction accuracy.
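The seeding limits are easy to express in code. The following Python sketch only mirrors the numbers stated above (at least 50 samples, and the 500 most recently created files are considered); the `SeedFile` type and the function name are hypothetical and have nothing to do with the actual service.

```python
from dataclasses import dataclass
from datetime import datetime

MIN_SAMPLES = 50   # minimum number of positive samples
MAX_SAMPLES = 500  # only the newest 500 samples are considered


@dataclass
class SeedFile:
    name: str
    created: datetime


def select_seed_samples(files: list[SeedFile]) -> list[SeedFile]:
    """Return the samples that would be considered, newest first."""
    if len(files) < MIN_SAMPLES:
        raise ValueError(f"need at least {MIN_SAMPLES} samples, got {len(files)}")
    newest_first = sorted(files, key=lambda f: f.created, reverse=True)
    return newest_first[:MAX_SAMPLES]


if __name__ == "__main__":
    # 600 hypothetical files, each created one year after the previous one.
    files = [SeedFile(f"doc{i}.docx", datetime(2000 + i, 1, 1)) for i in range(600)]
    chosen = select_seed_samples(files)
    print(len(chosen), chosen[0].name, chosen[-1].name)
```

In other words, if the seed folder holds 600 files, the 100 oldest ones are ignored.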

Process of creating

Learn about trainable classifiers

Trainable classifiers can recognize various types of content for labeling or policy application by giving it positive and negative samples to look at.

Supported file types

Default crawled file name extensions and parsed file types in SharePoint Server – SharePoint Server

Learn which file name extensions SharePoint Server and SharePoint in Microsoft 365 crawl by default and which file types it parses by default.

How to create

You can create Classifiers from https://compliance.microsoft.com/dataclassificationclassifiers?viewid=classifiers

You have to run a scan first.

If you only want to use the built-in classifiers, you can do so without scanning your content. If you start the scan, you will see this notification.

Once the scan is done, you have the option to create classifiers.

Select a site to provide the seed content.

You can also select a main directory for this website. If you can’t find the folder you’re searching for, please note that newly created folders may take up to a day to be fully indexed.

Once you select Next and Finish, processing of your seed content starts. It will identify similarities and build a prediction model that you can test. Processing can take anywhere from 1 to 24 hours.

You will see your classifier listed under In progress on the next page.

Test a trainable classifier

To test the classifier, just open it and upload a file.

Retrain a trainable classifier

The Compliance Administrator or Compliance Data Administrator role is required to train a classifier.

Microsoft Entra built-in roles – Microsoft Entra

Describes the Microsoft Entra built-in roles and permissions.

While utilizing your classifiers, you might seek to enhance the accuracy of their categorizations. This is achieved by assessing the accuracy of the classifications assigned to items identified as matches or non-matches. After conducting 30 evaluations for a classifier, it incorporates this feedback and undergoes automatic retraining.

Pre-trained classifiers, however, cannot be retrained.
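As a way to remember the two rules above (retraining kicks in after 30 evaluations, and never for pre-trained classifiers), here is a small Python sketch of a feedback counter. The class and its fields are hypothetical, and resetting the counter after each retraining round is my own assumption.

```python
from dataclasses import dataclass, field

RETRAIN_THRESHOLD = 30  # evaluations needed before automatic retraining


@dataclass
class ClassifierFeedback:
    name: str
    pretrained: bool = False
    evaluations: list[bool] = field(default_factory=list)  # True = reviewer agreed
    retrain_count: int = 0

    def record(self, reviewer_agrees: bool) -> bool:
        """Store one review; return True when this review triggers a retrain."""
        if self.pretrained:
            raise ValueError("pre-trained classifiers cannot be retrained")
        self.evaluations.append(reviewer_agrees)
        if len(self.evaluations) >= RETRAIN_THRESHOLD:
            self.retrain_count += 1
            self.evaluations.clear()  # assumed: counting starts over after a retrain
            return True
        return False


if __name__ == "__main__":
    feedback = ClassifierFeedback("contracts")
    triggered = [feedback.record(i % 2 == 0) for i in range(30)]
    print(triggered.count(True), feedback.retrain_count)
```

The first 29 reviews only accumulate; the 30th one triggers the retraining round.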

Process of retraining

You have the option to check the count of matches detected by a trainable classifier in both Content Explorer and the Trainable Classifiers interface. Additionally, you can offer feedback on whether an item truly constitutes a match or not through the Match/Not a Match feedback mechanism and employ this feedback to fine-tune your classifiers.

Closure

Let’s do a recap on what we learned in this section.

Sensitive Information Types (SITs)

Microsoft Purview offers three methods for identifying and classifying sensitive items:

Manual identification by users.
Automated pattern recognition (e.g., sensitive information types).
Machine learning.

Typical uses include discovering Personally Identifiable Information (PII) for GDPR reasons, or finding Azure information that should not be shared.

Trainable classifiers

These are the automatic methods for categorizing content:

Locating content based on keywords or metadata values (utilizing keyword query language).
Identifying patterns of sensitive information (e.g., social security, credit card, or bank account numbers) through sensitive information type definitions.
Recognizing content as a variation of a template (document fingerprinting).
Detecting content based on the presence of exact strings (exact data match).

EDM

With Exact Data Match (EDM) based classification, you can create a custom sensitive information type that is designed to:

be dynamic and easily refreshed
result in fewer false-positives
work with structured sensitive data
handle sensitive information more securely, not sharing it with anyone, including Microsoft

EDM SITs can be used with several Microsoft cloud services, such as these:

Microsoft Purview Data Loss Prevention
Auto-labeling (service and client side)
Microsoft Purview Insider Risk Management policies
Microsoft Purview eDiscovery
Microsoft Purview Insider Risk Management
Microsoft Defender for Cloud Apps

Fingerprinting

Microsoft Word .dotx file type isn’t supported for Fingerprinting.

Fingerprinting cannot detect the following:

Password protected files
Files that contain images only
Documents that don’t contain all the text from the original form used to create the document fingerprint
Files larger than 4 MB

Classifiers

Feeding of samples to the trainable classifier is known as seeding.

A trainable classifier can work with all the file formats that SharePoint Server and SharePoint in Microsoft 365 have built-in format handlers for; see https://learn.microsoft.com/en-us/sharepoint/technical-reference/default-crawled-file-name-extensions-and-parsed-file-types#default-crawled-file-name-extensions-and-parsed-file-formats

Pre-trained classifiers cannot be retrained.

Link to main post

Exam cram for SC-400 – Administering Information Protection and Compliance in M365

Previously I did Study guides for SC-300, AZ-500, SC-100 and SC-200. So now it’s the turn for the Compliance part under the Security umbrella. See here for the previous Study guides. Exam cra…