This document highlights how data flows through the platform from Data Providers to Data Consumers, and the product-related security and privacy mechanisms at each step.

Data Provider: Upload

Data Providers have two options for publishing consumer data for monetization: either through one of our Client Libraries, or through a backend server-to-server integration. For most use cases we recommend the Client Library integration, which integrates the user data licensing contract flow with the data upload function. When using our backend server-to-server integration, you are responsible for publishing only legally licensed consumer data, licensed either via your own agreement flow or ours.

Client Library

Our Client Libraries are convenient wrappers around the Data Provider APIs, simplifying secure authorization, data licensing, and data upload. Libraries often include optional helpers for common data capture patterns.

API Authorization

  1. Create/Load Device Key Pair: RSA-2048 asymmetric key pairs are used to authorize each end-user device without requiring users to log in to our platform. End users are registered with the platform using a Data Provider's internal User ID (we recommend opaque IDs). Each user may have one or more devices registered with their account. Private keys are generated on-device and remain in the device's local encrypted storage (i.e., Keychain on iOS and Keystore on Android).
  2. Authorize User Device: OAuth2 authorization for end-user devices uses RSA signatures to prove identity. Upon library initialization, if a local device key exists for the specific user, an authorization token is requested. If a key does not exist, a new key pair is created and the device's public key is registered to the User ID (see the sketch following this list).
  3. Issue Token: Returned authorization tokens are short-lived (exp claim) and follow the JWT specification with limited licensing and upload permissions controlled by audience and scp (scope) claims. API specifics on authorization tokens for end user devices can be found under Account Management > Authorization > Address Token.
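
A minimal sketch of this key and token flow is shown below in Python, assuming a hypothetical token endpoint and request shape for illustration. In practice this runs inside the Client Library (Swift/Kotlin), with the private key held in Keychain or Keystore rather than in application code.

```python
# Sketch of device key creation and a signed token request.
# The URL and request fields are assumptions, not the documented API.
import base64
import json
import time

import requests
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import padding, rsa

AUTH_URL = "https://api.example.com/auth/token"  # placeholder endpoint


def create_device_key() -> rsa.RSAPrivateKey:
    # RSA-2048 key pair generated on-device; the private key never leaves the device.
    return rsa.generate_private_key(public_exponent=65537, key_size=2048)


def request_device_token(private_key: rsa.RSAPrivateKey, user_id: str) -> str:
    # Sign a one-time proof so the platform can verify it against the
    # public key registered for this User ID.
    message = json.dumps({"sub": user_id, "iat": int(time.time())}).encode()
    signature = private_key.sign(message, padding.PKCS1v15(), hashes.SHA256())
    resp = requests.post(AUTH_URL, json={
        "user_id": user_id,
        "message": base64.b64encode(message).decode(),
        "signature": base64.b64encode(signature).decode(),
    })
    resp.raise_for_status()
    return resp.json()["access_token"]  # short-lived JWT scoped to licensing and upload
```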

Create Data License

  1. Get Data License Status: Using the authorization token, the Client Library retrieves the user's current data licensing status, regardless of the device the license may have been accepted on. A user MUST have an active data license before uploading data to our system.
  2. Show Data License: If the user has not yet accepted the Data License, the Data License agreement is automatically shown. If the application has changed its terms, or would like to give the user an opportunity to change their mind, the Data License flow can be presented again. Upon acceptance, the agreement is digitally signed with the user's private key (see the sketch following this list).
  3. Create Data License: The digitally signed license agreement is published to our API, where it is digitally countersigned with the Data Provider's signing key. Data Licenses are stored in both an immutable data structure and an immutable data repository, creating a permanent audit trail of both the raw agreement and its associated metadata. See What's a Data Provider > Data Licensing to learn more about how Data Licenses work under the covers.
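
Below is a minimal Python sketch of signing and publishing a Data License with the device's private key. The endpoint and payload fields are assumptions for illustration only; the Client Libraries handle this flow for you.

```python
# Sketch of signing a Data License agreement and publishing it for countersigning.
# The URL and payload fields are assumptions, not the documented API.
import base64
import json

import requests
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import padding, rsa


def create_data_license(private_key: rsa.RSAPrivateKey, token: str, terms: str) -> dict:
    agreement = json.dumps({"terms": terms, "accepted": True}).encode()
    # The user's device key signs the agreement; the API countersigns it
    # with the Data Provider's signing key and records it immutably.
    user_signature = private_key.sign(agreement, padding.PKCS1v15(), hashes.SHA256())
    resp = requests.post(
        "https://api.example.com/license",  # placeholder endpoint
        headers={"Authorization": f"Bearer {token}"},
        json={
            "agreement": base64.b64encode(agreement).decode(),
            "signature": base64.b64encode(user_signature).decode(),
        },
    )
    resp.raise_for_status()
    return resp.json()  # countersigned license record and audit metadata
```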

Upload Data

  1. Upload Data: Using the authorization token, the Client Library uploads data to our system with the user's corresponding TID (TIKI ID), as in the sketch following this list. Upon upload, the user's data license status is confirmed, and the request is rejected if a valid agreement does not exist for the type and provenance of the data uploaded.
  2. Data Preprocessing: Certain data may require preprocessing; for example, receipt data is often captured as images or email attachments. See Receipt Data Processing for an example of how our system transforms raw receipts into monetizable, machine-readable datasets. Our platform supports a wide range of data schemas, most of which do not require preprocessing.
  3. Add Data to Staging Bucket and Queue: Uploaded data is added to a KMS-managed, 256-bit AES-GCM encrypted S3 bucket in us-east-2. Upon addition to the staging repository, the event metadata is registered with the staging queue and made available to our Data Preparation system.
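
The sketch below illustrates the upload call itself, assuming a hypothetical upload endpoint and payload; the Client Libraries wrap this request, and the license check happens server-side on every upload.

```python
# Sketch of uploading records under a user's TID.
# The URL and payload fields are assumptions, not the documented API.
import requests


def upload_records(token: str, tid: str, records: list[dict]) -> None:
    resp = requests.post(
        "https://api.example.com/upload",  # placeholder endpoint
        headers={"Authorization": f"Bearer {token}"},
        json={"tid": tid, "records": records},
    )
    # The request is rejected if no valid Data License covers the type
    # and provenance of the submitted records.
    resp.raise_for_status()
```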

Server-to-Server

Our Server-to-Server integration for Data Providers enables the bulk addition of data to our system. Keep in mind, when using our backend server-to-server integration, you are responsible for publishing only legally licensed consumer data. Our system will automatically attach and filter licenses for the data added, but for the sake of privacy, do not add unlicensed data to our Staging Repository.

  1. Extract Data to Publish: On a regular schedule (daily is preferred), in your own backend system, extract the data you wish to publish to one or more data files. We support CSV, Parquet, ORC, and Avro file formats, either uncompressed or compressed as Zip, Gzip, or BZip2. Extract only the data added since the last scheduled extraction to minimize transfer and processing costs.
    Prior to copying data to our system, you will need to provide our team with the schema for your table(s).

  2. Copy Data to Staging: Once your schema is approved by our team, an AWS IAM user or role will be granted permission to write to a unique sub-path of our Staging Repository, a KMS-managed, 256-bit AES-GCM encrypted S3 bucket in us-east-2. You can use any of AWS's tools, such as the HTTP API or CLI, to upload your extracted data. However, the simplest pattern most Data Providers follow is to extract their data directly to an internal S3 bucket and then copy or move the files securely to our bucket, as in the sketch following this list.

  3. Attach Licenses: All monetizable data in our system requires a valid Data License. For data bulk-added to our Staging Repository via a Server-to-Server integration, licenses are attached using the corresponding opaque TIDs (TIKI IDs). Data Licenses are attached in one of two ways: Data Providers can either create Data Licenses directly within their application, or, for Data Providers with their own licensing contract flow, a delegated license is applied.

    To create a Data License within an application, providers can use either our Client Libraries or our REST API to capture, sign, and record immutable user data licenses. Delegated licenses are automatically created by wrapping a provider's existing contracts with our licensing metadata and immutability structure. The delegated license flow requires legal approval from our team to ensure compliance with laws, regulations, and our platform's requirements prior to deployment.
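
The sketch below shows one way to implement the daily extract-and-copy step with boto3. The bucket names and provider sub-path are placeholders; your actual Staging Repository path and credentials are provided during onboarding.

```python
# Sketch of copying a daily extract from an internal S3 bucket to the
# assigned sub-path of the Staging Repository. Bucket names and the
# prefix below are placeholders.
import datetime

import boto3

s3 = boto3.client("s3")

INTERNAL_BUCKET = "my-company-exports"        # placeholder: your own bucket
STAGING_BUCKET = "tiki-staging-example"       # placeholder: provided during onboarding
PROVIDER_PREFIX = "providers/my-company"      # placeholder: your assigned sub-path


def copy_daily_extract(day: datetime.date) -> None:
    # Copy each extract file for the given day into the Staging Repository;
    # encryption at rest is handled by the bucket's KMS configuration.
    prefix = f"exports/{day.isoformat()}/"
    listing = s3.list_objects_v2(Bucket=INTERNAL_BUCKET, Prefix=prefix)
    for obj in listing.get("Contents", []):
        key = obj["Key"]
        s3.copy_object(
            Bucket=STAGING_BUCKET,
            Key=f"{PROVIDER_PREFIX}/{key}",
            CopySource={"Bucket": INTERNAL_BUCKET, "Key": key},
        )
```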

Data Preparation

Our data preparation process brings raw uploaded data from all Data Providers into a unified, clean data lake, enabling Data Consumers to create subscriptions for unique datasets tailored to their use case(s). Our secure Central Data Lake follows the Iceberg open table specification, with data stored in KMS-managed, 256-bit AES-GCM encrypted S3 buckets and governed by LakeFormation.

  1. Load Staged Data: Staged data upload events added to the queue launch asynchronous, ephemeral (Lambda) Batch Writers. Batch Writers are entirely self-contained within our AWS environment and stateless. They do not call third-party APIs, have no public exposure, and contain no persistence outside of writing to the Central Data Lake.

  2. Batch Write Data to Central Data Lake: Batch Writers combine many uploaded data files into single Parquet files compatible with our Central Data Lake. Batch Writers decompress (if necessary) and parse raw uploaded data into strict schemas supported by our platform. Using LakeFormation for access control, Batch Writers commit these Parquet files to the corresponding encrypted repositories.

  3. Add to Metadata Queue: Following the commit of new data to the Central Data Lake, Batch Writers add corresponding metadata information to a FIFO queue. Queue events are grouped and launch asynchronous, ephemeral (Lambda) Metadata Writers. Metadata Writers are entirely self-contained within our AWS environment and stateless. They do not call third-party APIs, have no public exposure, and contain no persistence outside of writing to the Central Data Lake.

  4. Update Central Data Lake Metadata: Metadata Writers combine multiple data write events for the same table into a single Iceberg metadata update. Each table metadata update creates a new snapshot of the table, registered with our Glue Data Catalog and governed by LakeFormation. Metadata files reside alongside data files in the corresponding encrypted repositories.

  5. Clean the Data: Raw upload data is initially added to internal-only tables—tables accessible only to our internal data engineering team. Raw upload data then undergoes a series of cleaning transformations, most notably deduplication, sanitization, and deidentification.

    Deduplication removes any duplicate records that slipped through. Sanitization transforms fields for compatibility with our standard public-facing schemas; for example, timestamps often arrive as strings following a handful of different standards and are all converted to Iceberg timestamps. Deidentification removes columns containing Data Provider-specific IDs and fields used in identity matching, such as email address, MAID, and IP address, and adds a new opaque, pseudonymous TID (TIKI ID) column in their place. The TID field enables common signal processing tasks like grouping and cross-table joins without exposing sensitive consumer information (see the sketch following this list).

  6. Update Published Datasets: Post-cleaning data is merged into the published datasets available to Data Consumers who meet the license requirements attached to the underlying data. Published datasets reside in the same encrypted Central Data Lake governed by LakeFormation. See Data Consumer: Purchase for an overview of securely accessing and subscribing to published datasets.
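
For illustration, the sketch below approximates the cleaning transformations with pandas. The production pipeline operates on Iceberg tables governed by LakeFormation, and the real TID assignment is an internal mapping; the column names and the salted hash here are assumptions, not the actual implementation.

```python
# Simplified sketch of deduplication, sanitization, and deidentification.
import hashlib

import pandas as pd

TID_SALT = "example-salt"  # placeholder: stands in for the internal TID mapping


def clean_batch(raw: pd.DataFrame) -> pd.DataFrame:
    df = raw.copy()

    # Deduplication: drop duplicate records that slipped through earlier checks.
    df = df.drop_duplicates()

    # Sanitization: normalize string timestamps to a single timestamp type.
    df["event_time"] = pd.to_datetime(df["event_time"], utc=True)

    # Deidentification: replace identity-matching fields with a pseudonymous TID.
    df["tid"] = df["email"].map(
        lambda e: hashlib.sha256((TID_SALT + e.lower()).encode()).hexdigest()
    )
    return df.drop(columns=["email", "maid", "ip_address", "provider_user_id"])
```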

Data Consumer: Purchase

Data Consumers create unique subscriptions to subsets of our larger public datasets using filters to extract and pay for only the data required for their use case. Resulting data is added to a secure, hosted Data Cleanroom and updated daily with changes to originating datasets.

Create Data Cleanroom

  1. Create Cleanroom: Using our API, with your API Key, request the creation of a new hosted Data Cleanroom. Requesting a new cleanroom requires an AWS Account with which the created repository will be securely shared. For API specifics on creating a new Cleanroom see Data Consumer > Managing Cleanrooms > Create Cleanroom.
  2. Provision Cleanroom: The infrastructure for a secure repository is created and provisioned. Customer Cleanrooms are governed by LakeFormation and follow the Iceberg open table specification, with data stored in a KMS-managed, 256-bit AES-GCM encrypted S3 bucket.
  3. Configure Secure Access: Using AWS's Resource Access Manager, the Customer Cleanroom is securely shared with the Customer's AWS Account. Once shared, the customer can configure and manage access for their team directly within their own AWS Account, using standard AWS security and data governance tools such as IAM Policies and Roles or LakeFormation Tags (see the sketch following this list). Learn more about securely accessing a Cleanroom in Data Consumer > Getting Started > 4. Accept Resource Share and Data Consumer > Accessing Your Data.
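
The sketch below shows how a customer might accept the resource share and grant a team role read access using boto3 from their own AWS account. The database, table, and role names are placeholders; see Accept Resource Share and Accessing Your Data for the documented steps.

```python
# Sketch of accepting the Cleanroom resource share and granting SELECT
# access to an internal IAM role. Names are placeholders.
import boto3

ram = boto3.client("ram")
lakeformation = boto3.client("lakeformation")


def accept_cleanroom_share() -> None:
    # Accept the pending invitation created when the Cleanroom was shared.
    invitations = ram.get_resource_share_invitations()["resourceShareInvitations"]
    for invite in invitations:
        if invite["status"] == "PENDING":
            ram.accept_resource_share_invitation(
                resourceShareInvitationArn=invite["resourceShareInvitationArn"]
            )


def grant_team_access(role_arn: str) -> None:
    # Grant read-only access to a Cleanroom table for a team role.
    lakeformation.grant_permissions(
        Principal={"DataLakePrincipalIdentifier": role_arn},
        Resource={"Table": {"DatabaseName": "my_cleanroom", "Name": "purchases"}},
        Permissions=["SELECT"],
    )
```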

Estimate Data Subscription

  1. Create Estimate: Before purchasing a data subscription, the Data Consumer creates an estimate using their SQL Filter (see the sketch following this list). SQL Filters are SELECT-only queries enabling customers to select, transform, filter, limit, and join tables from the published datasets. For API specifics on creating a new Data Estimate see Data Consumer > Managing Data Subscriptions > Create Estimate.
  2. Apply License Criteria: A wrapped filter is created, applying license criteria to the originally requested SQL Filter. Utilizing the metadata from Data Licenses, only data with license rights corresponding to the Data Consumer's use case is returned, ensuring customers only purchase legally licensable data.
  3. Get Record Stats: Using the wrapped filter, statistics about the number of matching records are calculated from the public datasets. Statistics include metrics such as the current total record count and the average monthly update rate.
  4. Return Pricing: Using the retrieved record statistics for the SQL Filter, a pricing estimate is calculated. Estimates include both the initial data load cost and the expected ongoing monthly costs. Estimates, not quotes, are returned because the public datasets are continuously updating. Customers are charged once for each record added to their Cleanroom, meaning the initial load cost is typically much higher than the ongoing updates.
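
Below is a minimal sketch of requesting an estimate for a SQL Filter. The endpoint, request fields, and response shape are assumptions for illustration; see Create Estimate in the API reference for the documented request.

```python
# Sketch of creating an estimate from a SELECT-only SQL Filter.
# The URL, fields, and response shape are assumptions, not the documented API.
import requests

API_KEY = "YOUR_API_KEY"  # placeholder

SQL_FILTER = """
SELECT tid, purchase_date, total_amount
FROM purchases
WHERE category = 'grocery'
"""

resp = requests.post(
    "https://api.example.com/estimate",  # placeholder endpoint
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={"table": "grocery_purchases", "query": SQL_FILTER},
)
resp.raise_for_status()
estimate = resp.json()
print(estimate)  # e.g. matching record count, initial load cost, expected monthly cost
```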

Purchase Data Subscription

  1. Purchase Subscription: Purchasing a Data Subscription first requires an Estimate. The estimate is converted into an active data subscription, loading the initial data into the table named in the Create Estimate request and then routinely updating this same table as new data becomes available. For API specifics on purchasing a Data Subscription see Data Consumer > Managing Data Subscriptions > Purchase Subscription.
  2. Execute Filter: The Wrapped Filter from the Estimate is executed using Athena and LakeFormation to extract the matching records from the public datasets. The Wrapped Filter only extracts new data since the last execution of the filter. For a first-time execution, this includes all current matching data. For subsequent executions, only new, un-merged data is returned.
  3. Create/Update Cleanroom: The filtered data is either used to create a new Iceberg compatible, encrypted table within the Customer's Cleanroom, or, to update an existing table. Athena is utilized for query execution and LakeFormation for access and governance.
  4. Get Record Stats: Upon completion of either creating or updating a table within a Customer Cleanroom, statistics about the exact records added to the Customer Cleanroom are retrieved. Exact usage metrics are recorded to the customer's billing profile, managed by Stripe Billing.
  5. Charge Account: At the end of the month, customers are automatically billed based on their usage for the preceding month. Usage and charges to-date for the current month and prior months can be found at any time in the customer's Billing Dashboard, which can also be used to update payment details and methods.
  6. Daily Update: Once a day, the filter is re-executed using the same process to update the table in the Customer Cleanroom with any new matching records. Only new records added to the cleanroom are charged to the customer's account. Customers can stop a Data Subscription at any time to pause updates and ongoing charges without deleting previously purchased data from their cleanroom (see the sketch following this list for querying purchased data with Athena).
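
To close the loop, the sketch below shows one way a customer can query purchased data in their Cleanroom with Athena from their own AWS account. The database, table, and results location are placeholders.

```python
# Sketch of querying a Cleanroom table with Athena; LakeFormation enforces
# the access configured for the caller's role. Names are placeholders.
import time

import boto3

athena = boto3.client("athena")


def query_cleanroom(sql: str, output_s3: str) -> list:
    execution = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": "my_cleanroom"},
        ResultConfiguration={"OutputLocation": output_s3},
    )
    query_id = execution["QueryExecutionId"]

    # Poll until the query finishes.
    while True:
        state = athena.get_query_execution(QueryExecutionId=query_id)[
            "QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(2)

    if state != "SUCCEEDED":
        raise RuntimeError(f"Query {query_id} ended in state {state}")
    return athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]


rows = query_cleanroom("SELECT COUNT(*) FROM purchases", "s3://my-athena-results/cleanroom/")
print(rows)
```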