The first section of this study guide covers sensitive information types and trainable classifiers. Before that, a quick word on the trials you can use to try them out if you don’t have real licenses available.
You can use the trial for 90 days.
Let’s first see what this section has to offer; by the end of it we should have a better understanding of the different functions.
What are Sensitive information types?
The first step in Information Protection is identifying and classifying sensitive items under your organization’s control. Microsoft Purview offers three methods for this:
- Manual identification by users.
- Automated pattern recognition (e.g., sensitive information types).
- Machine learning.
Sensitive Information Types (SITs) are pattern-based classifiers for detecting sensitive data like social security and credit card numbers. Microsoft provides pre-configured SITs or allows you to create custom ones.
SITs can be used in the following:
- Microsoft Purview Data Loss Prevention policies
- Sensitivity labels
- Retention labels
- Insider risk management
- Communication compliance
- Auto-labelling policies
- Microsoft Priva
What about Trainable classifiers?
Manual categorization relies on human judgment and actions. Users and administrators classify content when they come across it. You can use pre-established labels and sensitive information types or create custom ones. Following categorization, you can protect the content and manage how it is handled.
These are the automatic methods for categorizing content:
- Locating content based on keywords or metadata values (utilizing keyword query language).
- Identifying patterns of sensitive information (e.g., social security, credit card, or bank account numbers) through sensitive information type definitions.
- Recognizing content as a variation of a template (document fingerprinting).
- Detecting content based on the presence of exact strings (exact data match).
Following these categorization processes, sensitivity and retention labels can be automatically applied, enabling the content to be accessible within Purview for various purposes.
Classifiers are available to use as a condition in:
- Office auto-labeling with sensitivity labels
- Auto-apply retention label policy based on a condition
- Communication compliance
- Sensitivity labels can use classifiers as conditions, see Apply a sensitivity label to content automatically.
- Data loss prevention
When we talk about defining the content, we are talking about the phase known as “Know Your Data”.
Create and manage sensitive info types
Identify sensitive information requirements for an organization’s data
One quick win could be GDPR-based content, which can be defined by the following:
- Personally Identifiable Information (PII): This includes information that can be used to identify an individual, such as names, addresses, social security numbers, phone numbers, and email addresses.
Other examples are:
- Financial Data: Financial information like credit card numbers, bank account details, and financial transactions fall under this category.
- Health Information (PHI): PHI includes medical records, diagnoses, treatment histories, and other health-related data protected by regulations like the Health Insurance Portability and Accountability Act (HIPAA).
- Confidential Business Data: This category encompasses sensitive business information such as trade secrets, intellectual property, financial reports, and strategic plans.
For example, say you must adhere to a regulation that forces you to remove PII content by a certain date or when you no longer need it. You can define that content with the following methods.
Translate sensitive information requirements into built-in or custom sensitive info types
Create and manage custom sensitive info types
First you have to understand the following concepts:
- regular expressions – Microsoft 365 sensitive information types use the Boost.RegEx 5.1.3 engine
- keyword lists – you can create your own as you define your sensitive information type or choose from existing keyword lists
- keyword dictionary
- Sensitive information type functions
- confidence levels
You can access SITs from https://compliance.microsoft.com/dataclassificationclassifiers?viewid=sensitiveinfotypes
In total, there are 314 different built-in types here.
The full list of them is available from here.
Author note! Have you noticed that you can also identify Azure or SQL sensitive info types with them? If you didn’t, here is the list:
But there are also limitations for them; see more on Learn.
You can copy the built-in types and define your own that way. Well, almost all of them. I haven’t personally found a comprehensive list of which ones you can and can’t copy. Here’s a short list from Learn of types that can’t be copied:
- Canada driver’s license number
- EU driver’s license number
- EU national identification number
- EU passport number
- EU social security number or equivalent identification
- EU tax identification number
- International classification of diseases (ICD-10-CM)
- International classification of diseases (ICD-9-CM)
- U.S. driver’s license number
You can create your own from scratch too.
Match confidence increases with more supporting elements detected alongside the primary element. Opt for higher confidence levels when specifying more supporting elements to ensure that matched items contain the sought-after sensitive information. High-confidence matches feature numerous nearby supporting elements, while low-confidence matches have few or none nearby.
The primary element represents the core information you aim to identify within content. You can establish the primary element through various means, including regular expressions (RegEx), keyword lists, keyword dictionaries, or functions.
Once the primary element is identified, any supporting elements will only match when located in close proximity to the primary element. The closer the primary and supporting elements are, the higher the likelihood that the detected content aligns with your intended criteria.
Augmenting with supporting elements enhances the accuracy of detecting valid information. For instance, to identify nine-digit employee ID numbers, you can incorporate keywords like “employee,” “badge,” and “ID” in proximity to the numbers, ensuring supporting elements match only when found nearby after the primary element is matched.
For finer precision in item evaluation and detection, you can implement extra checks to either incorporate or exclude particular text or patterns. For instance, you can exclude specific 16-digit numbers that could be mistaken as credit card numbers.
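To make the interplay of primary element, supporting elements, proximity, and additional checks more concrete, here is a minimal Python sketch of the employee ID example above. It is not how Microsoft Purview evaluates SITs internally; the 300-character window, the keyword thresholds, the excluded values, and all names in the code are assumptions chosen only for illustration.

```python
import re
from dataclasses import dataclass

# All values below are hypothetical and only illustrate the concepts.
PRIMARY_PATTERN = re.compile(r"\b\d{9}\b")          # primary element: nine-digit number
SUPPORTING_KEYWORDS = ("employee", "badge", "id")   # supporting elements
PROXIMITY_CHARS = 300                               # assumed proximity window
EXCLUDED_VALUES = {"000000000", "123456789"}        # additional check: ignore test values


@dataclass
class Match:
    value: str
    confidence: str  # "high", "medium" or "low"


def find_employee_ids(text: str) -> list[Match]:
    """Find nine-digit numbers and rate them by nearby supporting keywords."""
    matches = []
    for hit in PRIMARY_PATTERN.finditer(text):
        if hit.group() in EXCLUDED_VALUES:
            continue  # additional check excludes known false positives
        # Only look for supporting elements close to the primary element.
        start = max(0, hit.start() - PROXIMITY_CHARS)
        window = text[start:hit.end() + PROXIMITY_CHARS].lower()
        words = set(re.findall(r"[a-z]+", window))
        found = sum(1 for keyword in SUPPORTING_KEYWORDS if keyword in words)
        confidence = "high" if found >= 2 else "medium" if found == 1 else "low"
        matches.append(Match(hit.group(), confidence))
    return matches
```

With this sketch, "Employee badge number: 987654321" is rated high because two supporting keywords sit next to the number, while the same number in "Order 987654321 shipped" is rated low.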
Create and manage exact data match (EDM) classifiers
With Exact Data Match (EDM) based classification, you can create a custom sensitive information type that is designed to:
- be dynamic and easily refreshed
- result in fewer false-positives
- work with structured sensitive data
- handle sensitive information more securely, not sharing it with anyone, including Microsoft
- be used with several Microsoft cloud services
The steps to create EDM SITs
Phase | What’s needed |
---|---|
Phase 1: Export source data for exact data match based sensitive information type | – Read access to the sensitive data |
Phase 2: Create the sample file | – Know the column headers and the format of the data you will be looking for in each column. |
Phase 3: Create the EDM SIT | – Access to Microsoft Purview Compliance portal > Data classification > Exact data match |
Phase 4: Hash and upload the sensitive information source table for exact data match sensitive information types | – Custom security group and user account – Hash and upload from one computer: local admin access to a computer with direct internet access and to host the EDM Upload Agent – Hash and upload from separate computers: local admin access to a computer with direct internet access and host the EDM Upload Agent for the upload and local admin access to a secure computer to host the EDM Upload Agent to hash the sensitive information source table – Read access to the sensitive information source table file |
Phase 5: Test an exact data match sensitive information type | – Access to the Microsoft Purview compliance portal |
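The idea behind phase 4, that only hashed values leave your environment, can be sketched in a few lines of Python. This is a conceptual illustration only: in practice the EDM Upload Agent does the hashing and uploading, and its actual algorithm is not shown here. The salted SHA-256 scheme, the column names, and the function names are all assumptions.

```python
import hashlib


def hash_value(value: str, salt: str) -> str:
    """Return a salted SHA-256 hash of a normalized value (assumed scheme)."""
    normalized = value.strip().lower()
    return hashlib.sha256((salt + normalized).encode("utf-8")).hexdigest()


def build_hashed_table(rows: list[dict], searchable_column: str, salt: str) -> dict:
    """Map each hashed primary value to the hashed values of the rest of its row."""
    table = {}
    for row in rows:
        key = hash_value(row[searchable_column], salt)
        others = {
            hash_value(value, salt)
            for column, value in row.items()
            if column != searchable_column
        }
        table[key] = others
    return table


def is_exact_match(candidate: str, nearby_values: list[str], table: dict, salt: str) -> bool:
    """A candidate matches only if it and a nearby value belong to the same source row."""
    others = table.get(hash_value(candidate, salt))
    if others is None:
        return False
    return any(hash_value(value, salt) in others for value in nearby_values)
```

Because the lookup compares hashes of exact values from the same row, a number that merely looks like a patient ID does not match, which is where the lower false-positive rate comes from.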
Where can EDM SITs be used?
- Microsoft Purview Data Loss Prevention
- Auto-labeling (service and client side)
- Microsoft Purview Insider Risk Management policies
- Microsoft Purview eDiscovery
- Microsoft Defender for Cloud Apps
Which services support EDM?
Service | Locations |
---|---|
Microsoft Purview Data Loss Prevention | – SharePoint Online – OneDrive for Business – Teams Chat – Exchange Online – Devices |
Microsoft Defender for Cloud Apps | – SharePoint Online – OneDrive for Business |
Auto-labeling (service side) | – SharePoint Online – OneDrive for Business – Exchange Online |
Auto-labeling (client side) | – Word – Excel – PowerPoint – Exchange desktop clients |
Customer Managed Key | – SharePoint Online – OneDrive for Business – Teams Chat – Exchange Online – Word – Excel – PowerPoint – Exchange desktop clients – Devices |
eDiscovery | – SharePoint Online – OneDrive for Business – Teams Chat – Exchange Online – Word – Excel – PowerPoint – Exchange desktop clients |
Insider Risk Management | – SharePoint Online – OneDrive for Business – Teams Chat – Exchange Online – Word – Excel – PowerPoint – Exchange desktop clients |
Implement document fingerprinting
For precise document fingerprint matching configuration, choose “Exact” as the setting for the high confidence level. When you opt for “Exact” as the high confidence level, only files that precisely match the fingerprint’s text will trigger detection. Any minor deviation from the fingerprint text will result in non-detection.
Example use cases
- Official government documentation.
- Forms collecting employee information for Human Resources purposes.
- Tailored forms designed exclusively for your organization’s needs.
How does it work?
- Documents are not associated with actual fingerprints; the term “document fingerprint” serves as a metaphor.
- Similar to unique patterns in human fingerprints, documents possess unique word patterns.
- When you upload a file, DLP identifies these distinctive word patterns, creates a document fingerprint based on them, and utilizes it to identify outbound documents with matching patterns.
- Uploading forms or templates is highly effective in generating such document fingerprints because users start with the same initial set of words and then add their own content.
- If an outbound document lacks password protection and contains all the text from the original form, DLP can determine if it matches the document fingerprint.
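The word-pattern idea in the list above can be approximated with a small Python sketch. It is not the algorithm DLP uses; it assumes a simple technique (overlapping three-word sequences, often called shingles) and measures how much of the template’s pattern shows up in another document. The function names and the shingle size are hypothetical.

```python
import re


def word_shingles(text: str, size: int = 3) -> set[tuple[str, ...]]:
    """Return the set of overlapping word sequences of the given size."""
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def fingerprint_coverage(template: str, document: str, size: int = 3) -> float:
    """Return the fraction of the template's word patterns found in the document."""
    template_shingles = word_shingles(template, size)
    if not template_shingles:
        return 0.0
    found = template_shingles & word_shingles(document, size)
    return len(found) / len(template_shingles)
```

A document that contains the whole template plus the user’s own added text scores 1.0, while a document that only keeps part of the template text scores lower, which mirrors why documents must contain all the text from the original form to be detected.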
Microsoft Word .dotx file type isn’t supported for Fingerprinting.
Limitations
- Password protected files
- Files that contain images only
- Documents that don’t contain all the text from the original form used to create the document fingerprint
- Files larger than 4 MB
Upload the file
Open the same page where you create SITs and choose Fingerprint based.
Then upload the template file that adheres to the prerequisites mentioned earlier.
Create and manage trainable classifiers
Identify when to use trainable classifiers
You can use them in the following:
- Office auto-labeling with sensitivity labels
- Auto-apply retention label policy based on a condition
- Communication compliance
- Sensitivity labels can use classifiers as conditions, see Apply a sensitivity label to content automatically.
- Data loss prevention
The following role permissions are required in each scenario:
Scenario | Required Role Permissions |
---|---|
Retention label policy | Record Management, Retention Management |
Sensitivity label policy | Security Administrator, Compliance Administrator, Compliance Data Administrator |
Communication compliance policy | Insider Risk Management Administrator, Supervisory Review Administrator |
Classifiers don’t support encrypted files, because the content can’t be accessed without decrypting it. Also, only the user who creates a custom classifier can train it and review the predictions it makes.
Design and create a trainable classifier
Before you can build a custom trainable classifier, Microsoft Purview scans your content locations to gather insights about the content within your organization. This scanning process is expected to conclude within 7 to 14 days.
If you prefer not to start this process immediately, you can still use the pre-trained classifiers right away.
Feeding samples to the trainable classifier is known as seeding.
A minimum of 50 positive samples is required, and you can include up to a maximum of 500. The trainable classifier will consider the 500 most recently created samples, sorted by file creation date and time. Providing a greater number of samples enhances the classifier’s prediction accuracy.
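The sample limits above can be expressed as a short Python sketch. It only illustrates the stated rule (at least 50 samples, at most the 500 most recently created); the `SeedFile` type and the function name are hypothetical and not part of any Microsoft API.

```python
from dataclasses import dataclass
from datetime import datetime

MIN_SEED_SAMPLES = 50    # minimum number of positive samples, from the text above
MAX_SEED_SAMPLES = 500   # only the most recent 500 are considered


@dataclass
class SeedFile:
    name: str
    created: datetime


def select_seed_samples(files: list[SeedFile]) -> list[SeedFile]:
    """Return the samples a classifier would consider, newest first."""
    if len(files) < MIN_SEED_SAMPLES:
        raise ValueError(
            f"need at least {MIN_SEED_SAMPLES} positive samples, got {len(files)}"
        )
    newest_first = sorted(files, key=lambda f: f.created, reverse=True)
    return newest_first[:MAX_SEED_SAMPLES]
```

One practical consequence: if the seed folder holds more than 500 files, the oldest ones are ignored, so it is worth checking that the newest 500 are representative.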
Process of creating
Supported file types
How to create
You can create Classifiers from https://compliance.microsoft.com/dataclassificationclassifiers?viewid=classifiers
You have to run a scan first.
If you wish to use the built-in ones, you can do so without scanning your content, but if you start the scan, it will show you this notification.
Once the scan is done, you have the option to create classifiers.
Select a site to provide the seed content.
You can also select a main folder on this site. If you can’t find the folder you’re searching for, please note that newly created folders may take up to a day to be fully indexed.
Once you select Next and Finish, the processing of your seed content starts. It will identify similarities and build a prediction model that you can test out. Processing can take anywhere from 1 to 24 hours.
You will see your classifier listed under In progress on the next page.
Test a trainable classifier
To test the classifier, open it and upload a file.
Retrain a trainable classifier
The Compliance Administrator or Compliance Data Administrator role is required to train a classifier.
While utilizing your classifiers, you might seek to enhance the accuracy of their categorizations. This is achieved by assessing the accuracy of the classifications assigned to items identified as matches or non-matches. After conducting 30 evaluations for a classifier, it incorporates this feedback and undergoes automatic retraining.
Pre-trained classifiers cannot be retrained.
Process of retraining
You have the option to check the count of matches detected by a trainable classifier in both Content Explorer and the Trainable Classifiers interface. Additionally, you can offer feedback on whether an item truly constitutes a match or not through the Match/Not a Match feedback mechanism and employ this feedback to fine-tune your classifiers.
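The feedback loop described in this section can be sketched in Python as follows. This is a simplified model, not the service’s implementation: it only captures the two rules stated above, that 30 evaluations trigger an automatic retrain and that pre-trained classifiers cannot be retrained. The class and field names are hypothetical.

```python
from dataclasses import dataclass, field

RETRAIN_THRESHOLD = 30  # evaluations needed before automatic retraining


@dataclass
class ClassifierFeedback:
    name: str
    pre_trained: bool = False
    pending: list[tuple[str, bool]] = field(default_factory=list)
    retrain_count: int = 0

    def record(self, item_id: str, is_match: bool) -> bool:
        """Store one Match / Not a Match evaluation.

        Returns True when this evaluation triggered a retrain.
        """
        if self.pre_trained:
            raise ValueError("pre-trained classifiers cannot be retrained")
        self.pending.append((item_id, is_match))
        if len(self.pending) >= RETRAIN_THRESHOLD:
            # The accumulated feedback is consumed by the retraining run.
            self.retrain_count += 1
            self.pending.clear()
            return True
        return False
```

In this model the first 29 evaluations are just collected, and the 30th one starts the retraining and resets the counter.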
Closure
Let’s do a recap on what we learned in this section.
Sensitive Information Types (SIT)
Microsoft Purview offers three methods for identifying and classifying sensitive items:
- Manual identification by users.
- Automated pattern recognition (e.g., sensitive information types).
- Machine learning.
Sensitive Information Types (SITs) are pattern-based classifiers for detecting sensitive data like social security and credit card numbers. Microsoft provides pre-configured SITs or allows you to create custom ones.
Typical uses include discovering Personally Identifiable Information (PII) for GDPR reasons, or Azure information that should not be shared.
Trainable classifiers
Manual categorization relies on human judgment and actions. Users and administrators classify content when they come across it. You can use pre-established labels and sensitive information types or create custom ones. Following categorization, you can protect the content and manage how it is handled.
These are the automatic methods for categorizing content:
- Locating content based on keywords or metadata values (utilizing keyword query language).
- Identifying patterns of sensitive information (e.g., social security, credit card, or bank account numbers) through sensitive information type definitions.
- Recognizing content as a variation of a template (document fingerprinting).
- Detecting content based on the presence of exact strings (exact data match).
EDM
With Exact Data Match (EDM) based classification, you can create a custom sensitive information type that is designed to:
- be dynamic and easily refreshed
- result in fewer false-positives
- work with structured sensitive data
- handle sensitive information more securely, not sharing it with anyone, including Microsoft
EDM SITs can be used with several Microsoft cloud services, such as:
- Microsoft Purview Data Loss Prevention
- Auto-labeling (service and client side)
- Microsoft Purview Insider Risk Management policies
- Microsoft Purview eDiscovery
- Microsoft Defender for Cloud Apps
Fingerprinting
Microsoft Word .dotx file type isn’t supported for Fingerprinting.
Cannot be used for the following:
- Password protected files
- Files that contain images only
- Documents that don’t contain all the text from the original form used to create the document fingerprint
- Files larger than 4 MB
Classifiers
Feeding samples to the trainable classifier is known as seeding.
A minimum of 50 positive samples is required, and you can include up to a maximum of 500. The trainable classifier will consider the 500 most recently created samples, sorted by file creation date and time. Providing a greater number of samples enhances the classifier’s prediction accuracy.
A trainable classifier can process all the file formats that SharePoint Server and SharePoint in Microsoft 365 have built-in format handlers for; see https://learn.microsoft.com/en-us/sharepoint/technical-reference/default-crawled-file-name-extensions-and-parsed-file-types#default-crawled-file-name-extensions-and-parsed-file-formats
Pre-trained classifiers cannot be retrained.
You have the option to check the count of matches detected by a trainable classifier in both Content Explorer and the Trainable Classifiers interface. Additionally, you can offer feedback on whether an item truly constitutes a match or not through the Match/Not a Match feedback mechanism and employ this feedback to fine-tune your classifiers.