Set up the Openflow Connector for SharePoint

Note

The connector is subject to the Connector Terms.

This topic describes the steps to set up the Openflow Connector for SharePoint.

Prerequisites

  1. Ensure that you have reviewed Openflow Connector for SharePoint.

  2. Ensure that you have set up Openflow.

Get the credentials

As a SharePoint administrator, perform the following actions:

  1. Ensure that you have a Microsoft Graph (https://learn.microsoft.com/en-us/graph/overview) application with the following Microsoft Graph permissions:

    1. Sites.Selected (https://learn.microsoft.com/en-us/graph/permissions-reference#sitesselected): limits access only to specified sites.

    2. Files.SelectedOperations.Selected (https://learn.microsoft.com/en-us/graph/permissions-reference#filesselectedoperationsselected): limits access only to files in specified sites.

    3. GroupMember.Read.All (https://learn.microsoft.com/en-us/graph/permissions-reference#groupmemberreadall): used for resolving SharePoint group permissions.

    4. User.ReadBasic.All (https://learn.microsoft.com/en-us/graph/permissions-reference#userreadbasicall): used for resolving Microsoft 365 user emails.

  2. Configure SharePoint to enable OAuth authentication as described in Get access without a user. The connector uses the following Microsoft Graph APIs to fetch data from SharePoint:

    1. Download driveItem content

    2. Get driveItem metadata

    3. driveItem: delta

    4. List driveItem permissions

    5. group: delta

    6. List group members

    7. Get user (https://learn.microsoft.com/en-us/graph/api/user-get?view=graph-rest-1.0&tabs=http)

  3. From your Azure or Office 365 account administrator, get the credentials and the URL of the Microsoft 365 SharePoint site that contains the files or folders you want to ingest into Snowflake.

Set up Snowflake account

As a Snowflake account administrator, perform the following tasks:

  1. Create a new role or use an existing role and grant the Database privileges.

  2. Create a new Snowflake service user with the type as SERVICE.

  3. Grant the Snowflake service user the role you created in the previous steps.

  4. Configure key-pair authentication for the Snowflake SERVICE user from step 2.

  5. Snowflake strongly recommends this step. Configure a secrets manager supported by Openflow, for example, AWS, Azure, or HashiCorp, and store the public and private keys in the secret store.

    Note

    If, for any reason, you do not wish to use a secrets manager, then you are responsible for safeguarding the public key and private key files used for key-pair authentication according to the security policies of your organization.

    1. Once the secrets manager is configured, determine how you will authenticate to it. On AWS, it’s recommended that you use the EC2 instance role associated with Openflow, because this way no other secrets have to be persisted.

    2. In Openflow, configure a Parameter Provider associated with this secrets manager: from the hamburger menu in the upper right, navigate to Controller Settings » Parameter Provider, and then fetch your parameter values.

    3. At this point all credentials can be referenced with the associated parameter paths and no sensitive values need to be persisted within Openflow.

  6. If any other Snowflake users require access to the raw ingested documents and tables ingested by the connector (for example, for custom processing in Snowflake), then grant those users the role created in step 1.

  7. Designate a warehouse for the connector to use. Start with the smallest warehouse size, then experiment with the size depending on the number of tables being replicated and the amount of data transferred. Large numbers of tables typically scale better with multi-cluster warehouses than with larger warehouse sizes.
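
The steps above could translate into SQL along the following lines. This is a minimal sketch under stated assumptions, not an official setup script: the object names (openflow_sharepoint_role, openflow_sharepoint_user, openflow_sharepoint_wh, analyst_user) and the warehouse settings are hypothetical, and <public_key_body> stands for the body of the public key that you generated for key-pair authentication.

```sql
-- Run as an account administrator (see the section introduction).
USE ROLE ACCOUNTADMIN;

-- Step 1: role used by the connector (hypothetical name).
CREATE ROLE IF NOT EXISTS openflow_sharepoint_role;

-- Step 2: service user of type SERVICE (hypothetical name).
CREATE USER IF NOT EXISTS openflow_sharepoint_user
  TYPE = SERVICE
  DEFAULT_ROLE = openflow_sharepoint_role;

-- Step 3: grant the role to the service user.
GRANT ROLE openflow_sharepoint_role TO USER openflow_sharepoint_user;

-- Step 4: register the public key for key-pair authentication.
-- Replace the placeholder with your own key body.
ALTER USER openflow_sharepoint_user
  SET RSA_PUBLIC_KEY = '<public_key_body>';

-- Step 6 (optional): let another user read the ingested data.
-- analyst_user is a hypothetical, already existing user.
-- GRANT ROLE openflow_sharepoint_role TO USER analyst_user;

-- Step 7: start with the smallest warehouse size.
CREATE WAREHOUSE IF NOT EXISTS openflow_sharepoint_wh
  WAREHOUSE_SIZE = XSMALL
  AUTO_SUSPEND = 60      -- seconds of inactivity; an assumed value
  AUTO_RESUME = TRUE;

GRANT USAGE ON WAREHOUSE openflow_sharepoint_wh
  TO ROLE openflow_sharepoint_role;
```

The secrets manager configuration in step 5 happens outside Snowflake SQL, so it is not part of this sketch.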

Use case 1: Use the connector definition to ingest files only

Use the connector definition to:

  • Perform custom processing on ingested files

  • Ingest SharePoint files and permissions and keep them up to date

Configure the connector

As a data engineer, perform the following tasks to configure the connector:

  1. Create a database and schema in Snowflake for the connector to store ingested data.

  2. Download the connector definition file for setting up and running the connector. This flow fetches files from SharePoint, uploads them to a stage, and processes additional metadata about permissions and file information, which is uploaded to Snowflake tables.

  3. Import the connector definition into Openflow.

    1. Enter the Openflow canvas.

    2. Add a process group to the canvas.

    3. On the Create Process Group pop-up, select the connector definition file to import.

  4. Populate the process group parameters.

    1. Right-click on the imported process group and select Parameters.

    2. Enter the required parameter values as described in Flow parameters: Ingest files only.

Flow parameters: Ingest files only

| Parameter | Description |
| --- | --- |
| CDC Refresh Frequency | Specifies the frequency at which the connector retrieves data from SharePoint. |
| SharePoint Site URL | URL of the SharePoint site from which the connector ingests content. |
| SharePoint Client ID | Microsoft Entra client ID. To learn about client ID and how to find it in Microsoft Entra, see Application ID (client ID) (https://learn.microsoft.com/en-us/azure/healthcare-apis/register-application#application-id-client-id). |
| SharePoint Client Secret | Microsoft Entra client secret. To learn about a client secret and how to find it in Microsoft Entra, see Certificates & secrets (https://learn.microsoft.com/en-us/azure/healthcare-apis/register-application#certificates--secrets). |
| SharePoint Tenant ID | Microsoft Entra tenant ID. To learn about tenant ID and how to find it in Microsoft Entra, see Find your Microsoft 365 tenant ID (https://learn.microsoft.com/en-us/sharepoint/find-your-office-365-tenant-id). |
| SharePoint Source Folder | Supported files from this folder and all its subfolders are ingested into Snowflake. The folder path is relative to a Shared Documents library. |
| Openflow Instance Database | A database is created in the user’s Snowflake account, if necessary. Files, metadata, and ACLs are ingested into tables in the specified schema. |
| Openflow Instance Schema | A schema is created in the target database in the user’s Snowflake account, if necessary. The stage and tables are created to ingest files, metadata, and ACLs. |
| Snowflake Account | Name of the account that the connector will be running for. |
| Snowflake Username | Name of the user that the connector will act on behalf of. |
| Snowflake Private Key | The RSA private key used for authentication. The RSA key must be formatted according to PKCS8 standards and have standard PEM headers and footers. Note that either Snowflake Private Key File or Snowflake Private Key must be defined. |
| Snowflake Private Key File | The file that contains the RSA private key used for authentication to Snowflake, formatted according to PKCS8 standards and having standard PEM headers and footers. The header line starts with -----BEGIN PRIVATE. |
| Snowflake Private Key Password | The password associated with the Snowflake Private Key File. |
| Snowflake Warehouse | The warehouse used for the connection and SQL that requires a warehouse to be specified. |

  5. Run the flow.

    1. Right-click on the imported process group and select Start.

    2. The process group starts, and the flow creates all required objects in Snowflake.
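
Step 1 of the procedure above (creating a database and schema) could look like the following sketch. The database, schema, and role names are hypothetical. The privileges listed are an assumption based on the flow creating a stage and tables in the schema, so adjust them to your organization's security policies.

```sql
-- Hypothetical names; use the same values for the Openflow Instance Database
-- and Openflow Instance Schema flow parameters.
CREATE DATABASE IF NOT EXISTS sharepoint_db;
CREATE SCHEMA IF NOT EXISTS sharepoint_db.sharepoint_schema;

-- Assumed privileges for the connector's role (hypothetical role name):
-- the flow creates a stage and tables in this schema.
GRANT USAGE ON DATABASE sharepoint_db
  TO ROLE openflow_sharepoint_role;

GRANT USAGE, CREATE TABLE, CREATE STAGE
  ON SCHEMA sharepoint_db.sharepoint_schema
  TO ROLE openflow_sharepoint_role;
```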

Use case 2: Use the connector definition to ingest files and perform processing with Cortex

Use the predefined flow definition to:

  • Create AI assistants for public documents within your organization’s SharePoint site.

  • Enable your AI assistants to adhere to access controls specified in your organization’s SharePoint site.

Configure the connector

As a data engineer, perform the following tasks to configure the connector:

  1. Create a database and schema in Snowflake for the connector to store ingested data.

  2. Download the connector definition files:

    1. Download this connector definition file for setting up and running the connector. This flow fetches data from SharePoint, runs parsing and chunking processes, and updates the Cortex Search service.

    2. Download this connector definition file for managing the connector. This can be used to toggle Cortex Search service indexing and to clean up the connector state.

  3. Import the connector definition into Openflow.

    1. Enter an Openflow canvas.

    2. Add a process group to the canvas.

    3. On the Create Process Group pop-up, select the connector definition file to import.

  4. Populate the process group parameters.

    1. Right-click on the imported process group and select Parameters.

    2. Enter the required parameter values as described in Flow parameters: Ingest files and perform processing with Cortex.

Flow parameters: Ingest files and perform processing with Cortex

| Parameter | Description |
| --- | --- |
| CDC Refresh Frequency | Specifies the frequency at which the connector retrieves data from SharePoint. |
| SharePoint Site URL | URL of the SharePoint site from which the connector ingests content. |
| SharePoint Client ID | Microsoft Entra client ID. To learn about client ID and how to find it in Microsoft Entra, see Application ID (client ID) (https://learn.microsoft.com/en-us/azure/healthcare-apis/register-application#application-id-client-id). |
| SharePoint Client Secret | Microsoft Entra client secret. To learn about a client secret and how to find it in Microsoft Entra, see Certificates & secrets (https://learn.microsoft.com/en-us/azure/healthcare-apis/register-application#certificates--secrets). |
| SharePoint Tenant ID | Microsoft Entra tenant ID. To learn about tenant ID and how to find it in Microsoft Entra, see Find your Microsoft 365 tenant ID (https://learn.microsoft.com/en-us/sharepoint/find-your-office-365-tenant-id). |
| SharePoint Source Folder | Supported files from this folder and all its subfolders are ingested into Snowflake. The folder path is relative to a Shared Documents library. |
| File Extensions To Ingest | A comma-separated list that specifies file extensions to ingest. The connector tries to convert the files to PDF format first, if possible. Nonetheless, the extension check is performed on the original file extension. To learn about the formats that can be converted, see Format options (https://learn.microsoft.com/en-us/graph/api/driveitem-get-content-format?view=graph-rest-1.0&tabs=http#format-options). If some of the specified file extensions are not supported by Cortex Parse Document, then the connector ignores those files, logs a warning message in an event log, and continues processing other files. |
| Openflow Instance Database | A database is created in the user’s Snowflake account, if necessary. Files, metadata, and ACLs are ingested into tables in the specified schema. |
| Openflow Instance Schema | A schema is created in the target database in the user’s Snowflake account, if necessary. The stage and tables are created to ingest files, metadata, and ACLs. |
| OCR Mode | The OCR mode to use when parsing files with the Cortex PARSE_DOCUMENT function. The value can be OCR or LAYOUT. In OCR mode, only raw text content is extracted, ignoring formatting and table structures. In LAYOUT mode, the output preserves table structures as Markdown. |
| Snowflake Account | Name of the account that the connector will be running for. |
| Snowflake Username | Name of the user that the connector will act on behalf of. |
| Snowflake Private Key | The RSA private key used for authentication. The RSA key must be formatted according to PKCS8 standards and have standard PEM headers and footers. Note that either Snowflake Private Key File or Snowflake Private Key must be defined. |
| Snowflake Private Key File | The file that contains the RSA private key used for authentication to Snowflake, formatted according to PKCS8 standards and having standard PEM headers and footers. The header line starts with -----BEGIN PRIVATE. |
| Snowflake Private Key Password | The password associated with the Snowflake Private Key File. |
| Snowflake Warehouse | The warehouse used for the connection and SQL that requires a warehouse to be specified. |
| Snowflake Cortex Search Service user role | An identifier of a role that is assigned usage permissions on the Cortex Search service. |

  5. Right-click on the canvas and select Enable all Controller Services.

  6. Right-click on the imported process group and select Start. The connector starts the data ingestion.

  7. Query the Cortex Search service.
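
Before querying, one way to confirm that the connector has created the Cortex Search service is to list the services in its schema. This is a sketch that assumes the service name and schema stated in Query the Cortex Search service (search_service in the schema Cortex). <application_instance_name> is a placeholder for your database name.

```sql
-- List the Cortex Search services in the connector's schema.
SHOW CORTEX SEARCH SERVICES IN SCHEMA <application_instance_name>.cortex;

-- Inspect the service that the connector is expected to create.
DESCRIBE CORTEX SEARCH SERVICE <application_instance_name>.cortex.search_service;
```

If the service is not listed yet, the flow might still be ingesting and parsing the first files.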

Use case 3: Customize the connector definition

Customize the connector definition to:

  • Process the ingested files with Document AI.

  • Perform custom processing on ingested files.

Procedure

  1. Download the connector definition files:

    1. Download this connector definition file for setting up and running the connector. This flow fetches data from SharePoint, runs parsing and chunking processes, and updates the Cortex Search service.

    2. Download this connector definition file for managing the connector. This can be used to toggle Cortex Search service indexing and to clean up the connector state.

  2. Import the connector definition into Openflow.

    1. Enter an Openflow canvas.

    2. Add a process group to the canvas.

    3. On the Create Process Group pop-up, select the connector definition file to import.

  3. Customize the connector definition.

    1. Remove the following process groups:

      1. Check If Duplicate Content

      2. Snowflake Stage and Parse PDF

      3. Update Snowflake Cortex

      4. (Optional) ProcessMicrosoft365Groups

    2. Attach any custom processing to the output of the Process SharePoint Metadata process group. Each flow file represents a single SharePoint file change.

  4. Populate the process group parameters. Follow the same process as for use case 1. Note that after you modify the connector definition, some parameters might no longer be required.

  5. Run the flow.

    1. Right-click on the imported process group and select Start.

    2. The process group starts, and the flow creates all required objects in Snowflake.

  6. Query the Cortex Search service.

Enabling SharePoint site groups

Microsoft Graph application for site groups

In addition to the steps specified in Get the credentials, do the following:

  1. Add the Sites.Selected (https://learn.microsoft.com/en-us/graph/permissions-reference#sitesselected) SharePoint permission.

    Note

    You should see Sites.Selected in both Microsoft Graph and SharePoint permissions.

  2. Generate a key pair (https://learn.microsoft.com/en-us/entra/identity-platform/howto-create-self-signed-certificate). Alternatively, you can create a self-signed certificate with openssl by running the following command:

    openssl req -x509 -nodes -newkey rsa:2048 -keyout key.pem -out cert.pem -days 365
    

    Note

    The command above doesn’t encrypt the generated private key. Remove the -nodes argument if you want to generate an encrypted key.

  3. Attach the certificate (https://learn.microsoft.com/en-us/graph/applications-how-to-add-certificate?tabs=http) to the Microsoft Graph application.

Additional flow parameters

After setting up the flow according to any of the use cases, fill out the following additional parameters.

| Parameter name | Description | Example value |
| --- | --- | --- |
| Sharepoint Site Groups Enabled | Specifies whether the Site Groups functionality is enabled. | true |
| Sharepoint Site Domain | The domain name of the synchronized SharePoint site. | exampletenant.sharepoint.com |
| Sharepoint Application Certificate | A generated application certificate in PEM format. | N/A |
| Sharepoint Application Private Key | A generated application private key in PEM format. The key must be unencrypted. | N/A |

Query the Cortex Search service

You can use the Cortex Search service to build chat and search applications to chat with or query your documents in SharePoint.

After you install and configure the connector and it begins ingesting content from SharePoint, you can query the Cortex Search service. For more information about using Cortex Search, see Query a Cortex Search service.

Filter responses

To restrict responses from the Cortex Search service to documents that a specific user has access to in SharePoint, you can specify a filter containing the user ID or email address of the user when you query Cortex Search. For example, filter.@contains.user_ids or filter.@contains.user_emails. The name of the Cortex Search service created by the connector is search_service in the schema Cortex.

Run the following SQL code in a SQL worksheet to query the Cortex Search service with files ingested from your SharePoint site.

Replace the following:

  • application_instance_name: Name of your database and connector application instance.

  • user_emailID: Email ID of the user who you want to filter the responses for.

  • your_question: The question that you want to get responses for.

  • number_of_results: Maximum number of results to return in the response. The maximum value is 1000 and the default value is 10.

SELECT PARSE_JSON(
  SNOWFLAKE.CORTEX.SEARCH_PREVIEW(
    '<application_instance_name>.cortex.search_service',
    '{
      "query": "<your_question>",
      "columns": ["chunk", "web_url"],
      "filter": {"@contains": {"user_emails": "<user_emailID>"}},
      "limit": <number_of_results>
    }'
  )
)['results'] AS results;
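
The same query can filter on the user ID instead of the email address, using the filter.@contains.user_ids form mentioned above. A sketch, where <user_id> is a placeholder introduced here for the Microsoft 365 user ID of the user whose access you want to apply:

```sql
-- Variant of the query above that restricts results by user ID.
SELECT PARSE_JSON(
  SNOWFLAKE.CORTEX.SEARCH_PREVIEW(
    '<application_instance_name>.cortex.search_service',
    '{
      "query": "<your_question>",
      "columns": ["chunk", "web_url"],
      "filter": {"@contains": {"user_ids": "<user_id>"}},
      "limit": <number_of_results>
    }'
  )
)['results'] AS results;
```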

Here’s a complete list of values that you can enter for columns:

| Column name | Type | Description |
| --- | --- | --- |
| full_name | String | A full path to the file from the SharePoint site documents root. Example: folder_1/folder_2/file_name.pdf. |
| web_url | String | A URL that displays the original SharePoint file in a browser. |
| last_modified_date_time | String | Date and time when the item was most recently modified. |
| chunk | String | A piece of text from the document that matched the Cortex Search query. |
| user_ids | Array | An array of Microsoft 365 user IDs that have access to the document. It also includes user IDs from all the Microsoft 365 groups that are assigned to the document. To find a specific user ID, see Get a user (https://learn.microsoft.com/en-us/graph/api/user-get?view=graph-rest-1.0&tabs=http). |
| user_emails | Array | An array of Microsoft 365 user email IDs that have access to the document. It also includes user email IDs from all the Microsoft 365 groups that are assigned to the document. |
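
To work with these columns as ordinary rows rather than as one JSON array, the results can be flattened. The following is one possible sketch, not the only way to consume the response: the CTE name response, the alias r, and the limit of 5 are illustrative choices, and the placeholders have the same meaning as in the query above.

```sql
-- Request several of the documented columns and return one row per result.
WITH response AS (
  SELECT PARSE_JSON(
    SNOWFLAKE.CORTEX.SEARCH_PREVIEW(
      '<application_instance_name>.cortex.search_service',
      '{
        "query": "<your_question>",
        "columns": ["full_name", "web_url", "last_modified_date_time", "chunk"],
        "filter": {"@contains": {"user_emails": "<user_emailID>"}},
        "limit": 5
      }'
    )
  )['results'] AS results
)
SELECT
  r.value:full_name::STRING               AS full_name,
  r.value:web_url::STRING                 AS web_url,
  r.value:last_modified_date_time::STRING AS last_modified_date_time,
  r.value:chunk::STRING                   AS chunk
FROM response,
  LATERAL FLATTEN(input => response.results) AS r;
```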

Example: Query an AI assistant for human resources (HR) information

You can use Cortex Search to query an AI assistant for employees to chat with the latest versions of HR information, such as onboarding, code of conduct, team processes, and organization policies. Using response filters, you can also allow HR team members to query employee contracts while adhering to access controls configured in SharePoint.

Run the following in a SQL worksheet to query the Cortex Search service with files ingested from SharePoint. Select the database as your application instance name and schema as Cortex.

Replace the following:

  • application_instance_name: Name of your database and connector application instance.

  • user_emailID: Email ID of the user who you want to filter the responses for.

SELECT PARSE_JSON(
  SNOWFLAKE.CORTEX.SEARCH_PREVIEW(
    '<application_instance_name>.cortex.search_service',
    '{
      "query": "What is my vacation carry over policy?",
      "columns": ["chunk", "web_url"],
      "filter": {"@contains": {"user_emails": "<user_emailID>"}},
      "limit": 1
    }'
  )
)['results'] AS results;