The flow of user receipt data through the system, from raw captured data to monetizable licensed data.

Receipt Data Flow

App

The application is responsible for authorizing with the Data Provider API and uploading raw receipt data for processing by Amazon Textract.

Image Capture

  • User Actions
    • The user takes one or more photos of a receipt using a device's local camera libraries in iOS or Android.
  • Behind the Scenes
    • The images are stitched together in memory and prepared for upload.

Authorization

  • What Happens
    • When authorizing (OAuth2) with our Data Provider API, the provider identity is validated.
    • When authorizing (OAuth2) with third-party APIs, the user identity is validated.
  • Why
    • To ensure that published data originates from a registered Data Provider.
    • To eliminate storage of sensitive credentials.
  • Outcome
    • If verification fails, subsequent requests are rejected.
    • If verification passes, authorization tokens are cached locally in the device's encrypted storage.

Extract Emails

  • User Actions
    • The user connects one or more third-party email accounts.
    • The user opens the application after connecting a third-party email account.
  • Behind the Scenes
    • Using the cached authorization token, receipt emails are downloaded to the local device.
    • Only emails from the known receipt sender list are downloaded; personal or sensitive emails are never downloaded.
    • On subsequent downloads, only new emails are downloaded.
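
The download rules above can be sketched as a simple filter. This is a minimal illustration, not the application's actual code: the sender addresses, the `EmailHeader` type, and the `select_receipt_emails` helper are hypothetical names, and the mail-provider integration itself is not shown.

```python
from dataclasses import dataclass

# Hypothetical allowlist; the real known receipt sender list is maintained elsewhere.
KNOWN_RECEIPT_SENDERS = {"receipts@example-store.com", "orders@example-market.com"}


@dataclass(frozen=True)
class EmailHeader:
    message_id: str
    sender: str


def select_receipt_emails(headers, already_downloaded):
    """Return headers from known receipt senders that have not been downloaded yet."""
    selected = []
    for header in headers:
        if header.sender.lower() not in KNOWN_RECEIPT_SENDERS:
            continue  # skip personal or otherwise unrelated mail
        if header.message_id in already_downloaded:
            continue  # skip mail fetched on an earlier run
        selected.append(header)
    return selected
```

Filtering on headers before fetching message bodies is one way to keep non-receipt mail off the device entirely.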

Verify Data License

  • User Actions
    • The user is presented with the terms of the program and elects to participate or decline.
  • Behind the Scenes
    • If a valid data license already exists, the user does not need to opt in again.
    • If the user declines participation in the program, all requests are denied.
  • API Request
    • Endpoint: POST /license
    • Request Body (JSON):
      • terms (String, required): The complete legal terms of the agreement, preferably in plain text. This content is permanently added to the License Record.
      • expiry (String, optional): The date on which the agreement automatically terminates. Null if the agreement never expires.
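
As a rough sketch, the request body for POST /license could be assembled as follows. The `build_license_body` helper is hypothetical, and the ISO 8601 date format for `expiry` is an assumption; the field description above only says the value is a String.

```python
import json
from datetime import date
from typing import Optional


def build_license_body(terms: str, expiry: Optional[date] = None) -> str:
    """Serialize the POST /license request body described above."""
    if not terms.strip():
        raise ValueError("terms is required")
    # Assumption: dates are sent as ISO 8601 strings; null means "never expires".
    body = {"terms": terms, "expiry": expiry.isoformat() if expiry else None}
    return json.dumps(body)
```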

Raw Receipt Data Upload

  • What Happens
    • Raw receipt data, either from extracted emails or captured images, is uploaded to our system.
    • Uploaded data is securely transmitted and encrypted using HTTPS with TLS v1.3.
  • Why
    • Uploaded data is processed and transformed from raw receipt data to monetizable machine-readable data.
    • Raw receipt data must pass through optical character recognition (OCR) software that cannot be executed on the local device (see Amazon Textract).
  • Outcome
    • Uploaded data is staged for processing.
    • A receipt identifier is returned to the client for traceability.
  • API Request
    • Endpoint: POST /receipt
    • Request Body (Multipart):
      • sender (String, optional): The sender's email address, used only for email receipts.
      • email (String, optional): The complete raw email body, used only for email receipts.
      • files (File, optional): Captured images to process or email attachments, if any.
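
One possible way to encode the POST /receipt fields as a multipart body, using only the Python standard library, is sketched below. The `build_receipt_multipart` helper is hypothetical, and repeating the `files` part once per file is an assumption about how the endpoint accepts multiple files.

```python
import uuid


def build_receipt_multipart(sender=None, email=None, files=()):
    """Encode the POST /receipt fields as multipart/form-data.

    `files` is an iterable of (filename, content_type, data_bytes) tuples.
    Returns (content_type_header, body_bytes).
    """
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in (("sender", sender), ("email", email)):
        if value is None:
            continue  # every field is optional
        head = f'--{boundary}\r\nContent-Disposition: form-data; name="{name}"\r\n\r\n'
        parts.append(head.encode() + value.encode("utf-8") + b"\r\n")
    for filename, content_type, data in files:
        head = (
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="files"; filename="{filename}"\r\n'
            f"Content-Type: {content_type}\r\n\r\n"
        )
        parts.append(head.encode() + data + b"\r\n")
    parts.append(f"--{boundary}--\r\n".encode())  # closing boundary
    return f"multipart/form-data; boundary={boundary}", b"".join(parts)
```

An image receipt would pass only `files`, while an email receipt would pass `sender`, `email`, and any attachments.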

Amazon Textract

Machine-readable data is extracted from raw receipt data using Amazon Textract's optical character recognition (OCR) machine learning (ML) models.

  • What Happens
    • The raw receipt data is temporarily stored in an at-rest encrypted S3 bucket while awaiting processing.
    • Amazon Textract extracts machine-readable data from receipt images and attachments. Textract's data privacy policies can be found under the corresponding section in Amazon's FAQ.
    • Raw receipt data is deleted within 30 days post-processing to allow for troubleshooting and performance optimization.
  • Why
    • To extract meaningful information from receipts in a format that can be programmatically consumed and utilized by licensors.
  • Outcome
    • Machine-readable data is extracted from captured images.
    • Machine-readable data is extracted from the email body and attachments.
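
To illustrate what the machine-readable output can look like, the sketch below collects text lines from a Textract response. It assumes a response shaped like the one returned by Textract's DetectDocumentText operation (a `Blocks` list whose entries carry `BlockType`, `Text`, and `Confidence`); this document does not say which Textract operation is actually used, and the `extract_lines` helper is hypothetical.

```python
def extract_lines(textract_response, min_confidence=0.0):
    """Collect the text of LINE blocks from a DetectDocumentText-style response."""
    lines = []
    for block in textract_response.get("Blocks", []):
        if block.get("BlockType") != "LINE":
            continue  # PAGE and WORD blocks are skipped
        if block.get("Confidence", 0.0) < min_confidence:
            continue  # drop low-confidence lines
        lines.append(block.get("Text", ""))
    return lines
```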

Data Prepare

Machine-readable data is converted into structured data with fields like merchant name, SKU, and total.

  • What Happens
    • Valuable fields are extracted from the machine-readable data.
    • Extracted fields are formatted for compatibility with our Data Repository.
    • Machine-readable data is deleted within 30 days post-processing to allow for troubleshooting and performance optimization.
  • Why
    • To organize the machine-readable data in a way that licensor systems can effectively consume.
    • To remove sensitive and non-valuable data.
  • Outcome
    • The valuable data is extracted from the machine-readable data and formatted for ingestion into our Data Repository.
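
A minimal sketch of this preparation step follows. It is a toy heuristic, not the production logic: it assumes the merchant name is the first non-empty line and that the total appears on a line beginning with "total". The `prepare_receipt` helper and the output field names are hypothetical.

```python
import re
from decimal import Decimal

# Matches lines such as "TOTAL $2.69" or "Total: 12", but not "Subtotal 2.49".
TOTAL_PATTERN = re.compile(r"^\s*total\b[^0-9]*([0-9]+(?:\.[0-9]{2})?)\s*$", re.IGNORECASE)


def prepare_receipt(lines):
    """Turn machine-readable text lines into a small structured record."""
    non_empty = [line.strip() for line in lines if line.strip()]
    record = {"merchant": non_empty[0] if non_empty else None, "total": None}
    for line in non_empty:
        match = TOTAL_PATTERN.match(line)
        if match:
            record["total"] = Decimal(match.group(1))  # last matching line wins
    return record
```

Only the fields named in the record are kept, which mirrors the goal of dropping sensitive and non-valuable data.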

Data Repository

Structured data is added to our at-rest encrypted Iceberg data repository, where access control is managed by Amazon's Lake Formation.

  • What Happens
    • Structured data is combined and staged for addition to the repository.
    • Combined structured data is bulk-added to the repository.
    • Structured data is deleted within 30 days post-processing to allow for troubleshooting and performance optimization.
  • Why
    • To store the structured data securely in a format safely accessible by licensors.
    • To combine the structured data with other Data Providers to create monetizable data.
  • Outcome
    • The structured data is safely stored and encrypted within the repository.
    • The structured data is available to licensors that meet the Data License requirements and obligations.
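
The 30-day retention window applied to staged data can be expressed as a simple check. This is only a sketch: the `is_due_for_deletion` helper and the `processed_at` timestamp are hypothetical, and the actual cleanup mechanism is not described here.

```python
from datetime import datetime, timedelta, timezone

RETENTION = timedelta(days=30)  # staged data is deleted within 30 days post-processing


def is_due_for_deletion(processed_at, now=None):
    """Return True once the retention window for a staged item has elapsed."""
    now = now or datetime.now(timezone.utc)
    return now - processed_at >= RETENTION
```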

Customer Cleanroom

All structured data is added to a secure Data Cleanroom for access by the Data Provider, with corresponding receipt IDs for traceability. See Accessing Your Data to learn how to securely access and utilize the data. Amazon's Resource Access Manager, IAM, and Lake Formation are used to control and manage secure access to the asset.

  • What Happens
    • Structured data is periodically (daily by default) added to the customer's cleanroom.
  • Why
    • To make all processed data from their end users accessible to the original Data Provider.
  • Outcome
    • The structured data is encrypted at rest and securely accessible as a cleanroom.

Licensor Cleanroom

Using filters, data licensors create subsets of the larger data repository, including the structured data, based on their use case(s). All licensed data is de-identified by default, unless otherwise agreed to by the licensor, data provider, and end user. Data license terms are automatically applied by the system, ensuring licensors remain in compliance when sourcing data.

  • What Happens
    • Structured data is combined across users and Data Providers to create monetizable panels.
    • Data that matches both the filter and the license term requirements is added to a secure cleanroom.
    • Matching data is periodically updated (daily by default) in the licensor cleanroom.
  • Why
    • To ensure licensors only source data required for their use case(s).
    • To ensure licensors only source data compliant with regulations and data license agreements.
    • To provide licensors with secure access to their data assets in a format compatible with modern data tooling.
  • Outcome
    • The licensed data is encrypted at rest and securely accessible as a cleanroom.
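
The selection rules for a licensor cleanroom can be sketched roughly as follows. The record fields (`user_id`, `email`, `license_expiry`), the `IDENTIFYING_FIELDS` set, and the `build_licensor_subset` helper are hypothetical, and real license terms will involve more than an expiry date.

```python
from datetime import date

# Hypothetical set of fields treated as identifying and stripped by default.
IDENTIFYING_FIELDS = {"user_id", "email"}


def build_licensor_subset(records, matches_filter, today, deidentify=True):
    """Select records that match the licensor's filter and carry an unexpired license."""
    subset = []
    for record in records:
        expiry = record.get("license_expiry")  # None means the license never expires
        if expiry is not None and expiry < today:
            continue  # license has lapsed
        if not matches_filter(record):
            continue  # outside the licensor's use case
        if deidentify:
            record = {k: v for k, v in record.items() if k not in IDENTIFYING_FIELDS}
        subset.append(record)
    return subset
```

Passing `deidentify=False` corresponds to the case where the licensor, data provider, and end user have all agreed otherwise.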