Section 1 – Implement information protection – Create and manage SITs and trainable classifiers

The first section of this study guide covers sensitive information types and trainable classifiers. Before that, a quick word on the trials you can use to try them out if you don’t have real licenses available.

You can use the trial for 90 days.

Let’s first see what this section has to offer. After going through it, we should have a better understanding of the different functions.

What are Sensitive information types?

The first step in Information Protection is identifying and classifying sensitive items under your organization’s control. Microsoft Purview offers three methods for this:

  • Manual identification by users.
  • Automated pattern recognition (e.g., sensitive information types).
  • Machine learning.

Sensitive Information Types (SITs) are pattern-based classifiers for detecting sensitive data like social security and credit card numbers. Microsoft provides pre-configured SITs or allows you to create custom ones.
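To make "pattern-based" a bit more concrete, here is a minimal Python sketch that finds 16-digit candidates with a regular expression and then validates them with the Luhn checksum. This is only an illustration of the idea, not how Purview implements its built-in credit card SIT; the regex and the function names are assumptions made up for this example.

```python
import re

# Hypothetical pattern: 16 digits, optionally grouped in fours by spaces or dashes.
CARD_PATTERN = re.compile(r"\b(?:\d{4}[ -]?){3}\d{4}\b")


def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    total = 0
    # Walk from the rightmost digit and double every second digit.
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return len(digits) > 0 and total % 10 == 0


def find_card_numbers(text: str) -> list[str]:
    """Return regex matches in `text` that also pass the Luhn check."""
    return [m.group() for m in CARD_PATTERN.finditer(text) if luhn_valid(m.group())]
```

The pattern alone would flag any 16-digit number; the checksum is what filters out most random digit strings.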

SITs can be used in the following:

What about Trainable classifiers?

Manual categorization relies on human judgment and actions. Users and administrators classify content when they come across it. You can use predefined labels and sensitive information types or create custom ones. After categorization, you can protect the content and manage how it is handled.

These are the automatic methods for categorizing content:

  1. Locating content based on keywords or metadata values (utilizing keyword query language).
  2. Identifying patterns of sensitive information (e.g., social security, credit card, or bank account numbers) through sensitive information type definitions.
  3. Recognizing content as a variation of a template (document fingerprinting).
  4. Detecting content based on the presence of exact strings (exact data match).

Following these categorization processes, sensitivity and retention labels can be automatically applied, enabling the content to be accessible within Purview for various purposes.

And Classifiers are available to use as a condition in:

When we talk about defining the content, we are talking about the area known as “Know Your Data”.

Create and manage sensitive info types

Identify sensitive information requirements for an organization’s data

One quick win could be GDPR-based content, which can be defined by the following:

  • Personally Identifiable Information (PII): This includes information that can be used to identify an individual, such as names, addresses, social security numbers, phone numbers, and email addresses.

Other examples are:

  • Financial Data: Financial information like credit card numbers, bank account details, and financial transactions fall under this category.
  • Health Information (PHI): PHI includes medical records, diagnoses, treatment histories, and other health-related data protected by regulations like the Health Insurance Portability and Accountability Act (HIPAA).
  • Confidential Business Data: This category encompasses sensitive business information such as trade secrets, intellectual property, financial reports, and strategic plans.

For example, say you must adhere to a regulation that forces you to remove PII content by a certain date or when you no longer need it. You can define that content with the following methods.

Translate sensitive information requirements into built-in or custom sensitive info types

Create and manage custom sensitive info types

First you have to understand the following concepts:

You can access SITs from https://compliance.microsoft.com/dataclassificationclassifiers?viewid=sensitiveinfotypes

Here you have 314 built-in types in total.

And from here you find the full list of them.

Author note! Have you noticed that you can also identify Azure or SQL sensitive info types with them? If you didn’t, here is the list:

Entity | Publisher
Azure AD client access token | Microsoft
Azure AD client secret | Microsoft
Azure AD User Credentials | Microsoft
Azure App Service deployment password | Microsoft
Azure Batch shared access key | Microsoft
Azure Bot Framework secret key | Microsoft
Azure Bot service app secret | Microsoft
Azure Cognitive Search API key | Microsoft
Azure Cognitive Service key | Microsoft
Azure Container Registry access key | Microsoft
Azure Cosmos DB account access key | Microsoft
Azure Databricks personal access token | Microsoft
Azure DevOps app secret | Microsoft
Azure DevOps personal access token | Microsoft
Azure DocumentDB auth key | Microsoft
Azure EventGrid access key | Microsoft
Azure Function Master / API key | Microsoft
Azure IAAS database connection string and Azure SQL connection string | Microsoft
Azure IoT connection string | Microsoft
Azure IoT shared access key | Microsoft
Azure Logic app shared access signature | Microsoft
Azure Machine Learning web service API key | Microsoft
Azure Maps subscription key | Microsoft
Azure publish setting password | Microsoft
Azure Redis cache connection string | Microsoft
Azure Redis cache connection string password | Microsoft
Azure SAS | Microsoft
Azure service bus connection string | Microsoft
Azure service bus shared access signature | Microsoft
Azure Shared Access key / Web Hook token | Microsoft
Azure SignalR access key | Microsoft
Azure SQL connection string | Microsoft
Azure storage account access key | Microsoft
Azure storage account key | Microsoft
Azure Storage account key (generic) | Microsoft
Azure Storage account shared access signature | Microsoft
Azure Storage account shared access signature for high risk resources | Microsoft
Azure subscription management certificate | Microsoft

But there are also limitations for them, see more from Learn.

You can copy the built-in types and use the copy as a starting point for your own, well, almost all of them. I haven’t personally found a comprehensive list of which ones you can and can’t copy, but here’s a short list from Learn of types that can’t be copied:

  • Canada driver’s license number
  • EU driver’s license number
  • EU national identification number
  • EU passport number
  • EU social security number or equivalent identification
  • EU tax identification number
  • International classification of diseases (ICD-10-CM)
  • International classification of diseases (ICD-9-CM)
  • U.S. driver’s license number

You can create your own from scratch too.

Match confidence increases with more supporting elements detected alongside the primary element. Opt for higher confidence levels when specifying more supporting elements to ensure that matched items contain the sought-after sensitive information. High-confidence matches feature numerous nearby supporting elements, while low-confidence matches have few or none nearby.

The primary element is the core information you aim to identify within content. You can define the primary element in several ways, including regular expressions (RegEx), keyword lists, keyword dictionaries, or functions.

Once the primary element is identified, any supporting elements will only match when located in close proximity to the primary element. The closer the primary and supporting elements are, the higher the likelihood that the detected content aligns with your intended criteria.

Augmenting with supporting elements enhances the accuracy of detecting valid information. For instance, to identify nine-digit employee ID numbers, you can incorporate keywords like “employee,” “badge,” and “ID” in proximity to the numbers, ensuring supporting elements match only when found nearby after the primary element is matched.
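The employee ID example can be sketched in a few lines of Python: a regex acts as the primary element, the keywords act as supporting elements, and only keywords inside a character window around the match are counted. This is a simplified sketch, not Purview’s matching engine. The 300-character window and the rule that maps keyword hits to low/medium/high are assumptions chosen for the example.

```python
import re

PRIMARY = re.compile(r"\b\d{9}\b")         # primary element: nine-digit number
SUPPORTING = ("employee", "badge", "id")   # supporting keywords from the example
PROXIMITY = 300                            # assumed window size, in characters


def classify_matches(text: str) -> list[tuple[str, str]]:
    """Return (number, confidence) for each nine-digit number in `text`."""
    results = []
    lowered = text.lower()
    for match in PRIMARY.finditer(text):
        # Only text within PROXIMITY characters of the primary match counts.
        start = max(0, match.start() - PROXIMITY)
        end = min(len(text), match.end() + PROXIMITY)
        window = lowered[start:end]
        hits = sum(
            1 for word in SUPPORTING
            if re.search(rf"\b{re.escape(word)}\b", window)
        )
        # Assumed mapping: more supporting elements nearby, higher confidence.
        if hits >= 2:
            confidence = "high"
        elif hits == 1:
            confidence = "medium"
        else:
            confidence = "low"
        results.append((match.group(), confidence))
    return results
```

A nine-digit number with "employee" and "badge" right next to it comes back as high confidence, while the same number with no keywords in the window comes back as low.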

For finer precision in item evaluation and detection, you can implement extra checks to either incorporate or exclude particular text or patterns. For instance, you can exclude specific 16-digit numbers that could be mistaken as credit card numbers.
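As a rough sketch of what such an extra check does, the Python snippet below drops 16-digit matches that are on an exclusion list or start with an excluded prefix. The excluded values and prefix are hypothetical, and the snippet is not a reproduction of the additional checks offered in the portal.

```python
import re

SIXTEEN_DIGITS = re.compile(r"\b\d{16}\b")

# Hypothetical values that look like card numbers but are known to be harmless,
# for example internal order numbers.
EXCLUDED_VALUES = {"0000000000000000", "1234567890123456"}
EXCLUDED_PREFIXES = ("9999",)


def filter_candidates(text: str) -> list[str]:
    """Return 16-digit matches in `text` that survive the exclusion checks."""
    kept = []
    for candidate in SIXTEEN_DIGITS.findall(text):
        if candidate in EXCLUDED_VALUES:
            continue  # exact value is on the exclusion list
        if candidate.startswith(EXCLUDED_PREFIXES):
            continue  # starts with an excluded prefix
        kept.append(candidate)
    return kept
```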

Create and manage exact data match (EDM) classifiers

With Exact Data Match (EDM) based classification, you can create a custom sensitive information type that is designed to:

  • be dynamic and easily refreshed
  • result in fewer false-positives
  • work with structured sensitive data
  • handle sensitive information more securely, not sharing it with anyone, including Microsoft
  • be used with several Microsoft cloud services

The steps to create EDM SITs

Phase | What’s needed
Phase 1: Export source data for exact data match based sensitive information type | Read access to the sensitive data
Phase 2: Create the sample file | Know the column headers and the format of the data you will be looking for in each column
Phase 3: Create the EDM SIT | Access to Microsoft Purview compliance portal > Data classification > Exact data match
Phase 4: Hash and upload the sensitive information source table for exact data match sensitive information types | Custom security group and user account; Hash and upload from one computer: local admin access to a computer with direct internet access to host the EDM Upload Agent; Hash and upload from separate computers: local admin access to a computer with direct internet access to host the EDM Upload Agent for the upload, and local admin access to a secure computer to host the EDM Upload Agent to hash the sensitive information source table; Read access to the sensitive information source table file
Phase 5: Test an exact data match sensitive information type | Access to the Microsoft Purview compliance portal
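The reason Phase 4 hashes the table before upload is that the service then only ever compares hashes, never the clear-text values. The Python sketch below shows that general idea with a salted SHA-256 (HMAC). It is a conceptual illustration only: the EDM Upload Agent’s real hashing scheme, salt handling and normalization are not reproduced here, and all function and column names are made up for the example.

```python
import hashlib
import hmac


def hash_value(value: str, salt: bytes) -> str:
    """Return a salted SHA-256 digest of a normalized value."""
    normalized = value.strip().lower()  # assumed normalization for this sketch
    return hmac.new(salt, normalized.encode("utf-8"), hashlib.sha256).hexdigest()


def hash_table(rows: list[dict[str, str]], salt: bytes) -> list[dict[str, str]]:
    """Hash every cell so the table can leave the secure computer."""
    return [{col: hash_value(val, salt) for col, val in row.items()} for row in rows]


def is_exact_match(candidate: str, column: str,
                   hashed_rows: list[dict[str, str]], salt: bytes) -> bool:
    """Check a value found in content against the hashed column."""
    digest = hash_value(candidate, salt)
    return any(hmac.compare_digest(row[column], digest) for row in hashed_rows)
```

With this shape, refreshing the data is just a matter of hashing and uploading a new table, which is what makes EDM "dynamic and easily refreshed".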

Where can EDM SITs be used?

  • Microsoft Purview Data Loss Prevention
  • Auto-labeling (service and client side)
  • Microsoft Purview Insider Risk Management policies
  • Microsoft Purview eDiscovery
  • Microsoft Defender for Cloud Apps

Which services support EDM?

Service | Locations
Microsoft Purview Data Loss Prevention | SharePoint Online, OneDrive for Business, Teams Chat, Exchange Online, Devices
Microsoft Defender for Cloud Apps | SharePoint Online, OneDrive for Business
Auto-labeling (service side) | SharePoint Online, OneDrive for Business, Exchange Online
Auto-labeling (client side) | Word, Excel, PowerPoint, Exchange desktop clients
Customer Managed Key | SharePoint Online, OneDrive for Business, Teams Chat, Exchange Online, Word, Excel, PowerPoint, Exchange desktop clients, Devices
eDiscovery | SharePoint Online, OneDrive for Business, Teams Chat, Exchange Online, Word, Excel, PowerPoint, Exchange desktop clients
Insider Risk Management | SharePoint Online, OneDrive for Business, Teams Chat, Exchange Online, Word, Excel, PowerPoint, Exchange desktop clients

Implement document fingerprinting

For precise document fingerprint matching configuration, choose “Exact” as the setting for the high confidence level. When you opt for “Exact” as the high confidence level, only files that precisely match the fingerprint’s text will trigger detection. Any minor deviation from the fingerprint text will result in non-detection.

Example use cases

  • Official government documentation.
  • Forms collecting employee information for Human Resources purposes.
  • Tailored forms designed exclusively for your organization’s needs.

How does it work?

  • Documents are not associated with actual fingerprints; the term “document fingerprint” serves as a metaphor.
  • Similar to unique patterns in human fingerprints, documents possess unique word patterns.
  • When you upload a file, DLP identifies these distinctive word patterns, creates a document fingerprint based on them, and utilizes it to identify outbound documents with matching patterns.
  • Uploading forms or templates is highly effective in generating such document fingerprints because users start with the same initial set of words and then add their own content.
  • If an outbound document lacks password protection and contains all the text from the original form, DLP can determine if it matches the document fingerprint.
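The "word pattern" idea can be illustrated with a small Python sketch that hashes overlapping three-word sequences from the template and measures how many of them reappear in another document. To be clear, this is a generic technique (word shingling) swapped in for illustration; the algorithm DLP actually uses is not described in this guide, and the function names and shingle size are assumptions.

```python
import hashlib
import re


def word_shingles(text: str, size: int = 3) -> set[str]:
    """Return hashes of every run of `size` consecutive words in `text`."""
    words = re.findall(r"\w+", text.lower())
    return {
        hashlib.sha256(" ".join(words[i:i + size]).encode("utf-8")).hexdigest()
        for i in range(len(words) - size + 1)
    }


def template_coverage(template: str, document: str) -> float:
    """Return the share of the template's shingles that occur in `document`."""
    template_prints = word_shingles(template)
    if not template_prints:
        return 0.0
    found = template_prints & word_shingles(document)
    return len(found) / len(template_prints)
```

In this sketch, a coverage of 1.0 means every word pattern from the template is present, which mirrors the requirement that the document contains all the text from the original form.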

 

Microsoft Word .dotx file type isn’t supported for Fingerprinting.

Limitations

  • Password protected files
  • Files that contain images only
  • Documents that don’t contain all the text from the original form used to create the document fingerprint
  • Files larger than 4 MB
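Before uploading a template, you could run a quick pre-check against the limits listed above. The Python helper below is a hypothetical convenience function, not part of any Microsoft tooling. It assumes 4 MB means 4 × 1024 × 1024 bytes, and it expects the caller to already know whether the file is password protected or image-only.

```python
from pathlib import PurePath

MAX_SIZE_BYTES = 4 * 1024 * 1024    # assumed interpretation of "4 MB"
UNSUPPORTED_SUFFIXES = {".dotx"}    # Word templates are not supported


def fingerprint_blockers(filename: str, size_bytes: int,
                         password_protected: bool = False,
                         images_only: bool = False) -> list[str]:
    """Return the reasons (if any) a file is unsuitable for fingerprinting."""
    reasons = []
    if PurePath(filename).suffix.lower() in UNSUPPORTED_SUFFIXES:
        reasons.append("unsupported file type")
    if size_bytes > MAX_SIZE_BYTES:
        reasons.append("larger than 4 MB")
    if password_protected:
        reasons.append("password protected")
    if images_only:
        reasons.append("contains images only")
    return reasons
```

An empty list means none of the listed limitations apply to the file.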

Upload the file

Open the same page where you create SITs and choose Fingerprint based.

Then upload a template file that meets the prerequisites mentioned earlier.

Create and manage trainable classifiers

Identify when to use trainable classifiers

You can use them in the following:

The following role permissions are required in each scenario:

Scenario | Required role permissions
Retention label policy | Record Management, Retention Management
Sensitivity label policy | Security Administrator, Compliance Administrator, Compliance Data Administrator
Communication compliance policy | Insider Risk Management Administrator, Supervisory Review Administrator

Note that classifiers don’t support encrypted files, because the files cannot be read without decrypting them. Also, only the user who creates a custom classifier can train it and review the predictions it makes.

Design and create a trainable classifier

Before you can create a custom trainable classifier, Microsoft Purview scans your content locations to gather insights about the content in your organization. This scan is expected to finish within 7 to 14 days.

If you prefer not to start this process immediately, you can still use the pre-trained classifiers right away.

Feeding samples to a trainable classifier is known as seeding.

A minimum of 50 positive samples is required, and you can include up to a maximum of 500. The trainable classifier will consider the 500 most recently created samples, sorted by file creation date and time. Providing a greater number of samples enhances the classifier’s prediction accuracy.
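The sample limits can be expressed as a small Python sketch: refuse to seed with fewer than 50 files, and if there are more than 500, keep only the 500 most recently created. The `SeedFile` type and the function name are hypothetical; the service applies this selection itself, so the code only illustrates the rule.

```python
from dataclasses import dataclass
from datetime import datetime

MIN_SAMPLES = 50
MAX_SAMPLES = 500


@dataclass
class SeedFile:
    name: str
    created: datetime


def select_seed_samples(files: list[SeedFile]) -> list[SeedFile]:
    """Return the samples that would be considered, newest first."""
    if len(files) < MIN_SAMPLES:
        raise ValueError(
            f"Need at least {MIN_SAMPLES} positive samples, got {len(files)}"
        )
    # Sort by creation date and time, newest first, and keep at most 500.
    newest_first = sorted(files, key=lambda f: f.created, reverse=True)
    return newest_first[:MAX_SAMPLES]
```

One practical consequence: if the seed folder holds more than 500 files, older samples are ignored, so it pays to check that the newest files are the representative ones.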

Process of creating

Supported file types

How to create

You can create classifiers from https://compliance.microsoft.com/dataclassificationclassifiers?viewid=classifiers

You have to run a scan first.

If you only want to use the built-in classifiers, you can do so without scanning your content, but if you start the scan it will show you this notification.

Once the scan is done, you have the option to create your own classifiers.

Select a site to provide the seed content.

You can also select a main folder within this site. If you can’t find the folder you’re looking for, note that newly created folders may take up to a day to be fully indexed.

Once you select Next and Finish, processing of your seed content starts. It identifies similarities and builds a prediction model that you can test. Processing can take anywhere from 1 to 24 hours.

You will see your classifier listed under In progress on the next page.

Test a trainable classifier

To test the classifier, open it and upload a file.

Retrain a trainable classifier

The Compliance Administrator or Compliance Data Administrator role is required to train a classifier.

While utilizing your classifiers, you might seek to enhance the accuracy of their categorizations. This is achieved by assessing the accuracy of the classifications assigned to items identified as matches or non-matches. After conducting 30 evaluations for a classifier, it incorporates this feedback and undergoes automatic retraining.
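The feedback loop can be pictured as a simple counter: every Match/Not a Match evaluation is recorded, and once 30 have accumulated a retraining run is triggered. The Python class below is purely a mental model with hypothetical names. What the service does internally with the feedback, and whether the count resets as it does in this sketch, is not covered by this guide.

```python
RETRAIN_THRESHOLD = 30  # evaluations needed before automatic retraining


class FeedbackTracker:
    """Collects match feedback and signals when retraining would start."""

    def __init__(self) -> None:
        self.pending: list[tuple[str, bool]] = []
        self.retrain_count = 0

    def record(self, item_id: str, is_match: bool) -> bool:
        """Store one evaluation; return True if it triggers a retraining run."""
        self.pending.append((item_id, is_match))
        if len(self.pending) >= RETRAIN_THRESHOLD:
            self.retrain_count += 1
            self.pending.clear()  # assumed: the count starts over afterwards
            return True
        return False
```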

Pre-trained classifiers, however, cannot be retrained.

Process of retraining


You have the option to check the count of matches detected by a trainable classifier in both Content Explorer and the Trainable Classifiers interface. Additionally, you can offer feedback on whether an item truly constitutes a match or not through the Match/Not a Match feedback mechanism and employ this feedback to fine-tune your classifiers.

Closure

Let’s do a recap on what we learned in this section.

Sensitive Information Types (SITs)

Microsoft Purview offers three methods for identifying and classifying sensitive items:

  • Manual identification by users.
  • Automated pattern recognition (e.g., sensitive information types).
  • Machine learning.

Sensitive Information Types (SITs) are pattern-based classifiers for detecting sensitive data like social security and credit card numbers. Microsoft provides pre-configured SITs or allows you to create custom ones.

Typical uses are discovering Personally Identifiable Information (PII) for GDPR reasons, or finding Azure information that should not be shared.

Trainable classifiers

Manual categorization relies on human judgment and actions. Users and administrators classify content when they come across it. You can use predefined labels and sensitive information types or create custom ones. After categorization, you can protect the content and manage how it is handled.

These are the automatic methods for categorizing content:

  1. Locating content based on keywords or metadata values (utilizing keyword query language).
  2. Identifying patterns of sensitive information (e.g., social security, credit card, or bank account numbers) through sensitive information type definitions.
  3. Recognizing content as a variation of a template (document fingerprinting).
  4. Detecting content based on the presence of exact strings (exact data match).

EDM

With Exact Data Match (EDM) based classification, you can create a custom sensitive information type that is designed to:

  • be dynamic and easily refreshed
  • result in fewer false-positives
  • work with structured sensitive data
  • handle sensitive information more securely, not sharing it with anyone, including Microsoft

EDM SITs can be used with several Microsoft cloud services, such as:

  • Microsoft Purview Data Loss Prevention
  • Auto-labeling (service and client side)
  • Microsoft Purview Insider Risk Management policies
  • Microsoft Purview eDiscovery
  • Microsoft Defender for Cloud Apps

Fingerprinting

Microsoft Word .dotx file type isn’t supported for Fingerprinting.

Fingerprinting cannot be used for the following:

  • Password protected files
  • Files that contain images only
  • Documents that don’t contain all the text from the original form used to create the document fingerprint
  • Files larger than 4 MB

Classifiers

Feeding samples to a trainable classifier is known as seeding.

A minimum of 50 positive samples is required, and you can include up to a maximum of 500. The trainable classifier will consider the 500 most recently created samples, sorted by file creation date and time. Providing a greater number of samples enhances the classifier’s prediction accuracy.

Classifiers can find all the file formats that SharePoint Server and SharePoint in Microsoft 365 have built-in format handlers for, see https://learn.microsoft.com/en-us/sharepoint/technical-reference/default-crawled-file-name-extensions-and-parsed-file-types#default-crawled-file-name-extensions-and-parsed-file-formats

Pre-trained classifiers cannot be retrained.

You have the option to check the count of matches detected by a trainable classifier in both Content Explorer and the Trainable Classifiers interface. Additionally, you can offer feedback on whether an item truly constitutes a match or not through the Match/Not a Match feedback mechanism and employ this feedback to fine-tune your classifiers.


Author: Harri Jaakkonen
