Building Sovereign Data Lineage: A Decentralized Storage Architecture with DID and Proxy Re-Encryption

Protocol · Kay L

Preface

There is a fundamental contradiction in centralized data governance: to use data, we must entrust it to a platform. Once entrusted, control no longer belongs to the Knowledge Contributor.

The architecture described here combines three technologies (decentralized storage, decentralized identifiers (DIDs), and proxy re-encryption) to answer one question: can data circulate without relying on platform trust while the owner keeps control?

Core Design Philosophy: From Platform Trust to Cryptographic Constraints

Four engineering principles guide the architecture:

  1. Minimal On-Chain Data - Only contribution fingerprints, version relationships, and identity declarations remain on-chain; raw data stays off-chain.
  2. Data Sovereignty - “Who holds the keys” determines control. The platform cannot access plaintext data or decryption keys without authorization.
  3. Embedded Permission - Permissions are cryptographic facts bound to data versions via Verifiable Credentials, not database records.
  4. Auditability First - Every version evolution and authorization creates verifiable traces for third-party verification.

System Architecture Overview

graph TB
    subgraph On-Chain
        L[Lineage Registry]
        I[Identity Registry]
    end
    subgraph Off-Chain Storage
        F[Encrypted Data - IPFS/Arweave/S3]
    end
    subgraph Key Layer
        PRE[Proxy Re-Encryption Node]
        VC[Verifiable Credentials]
    end
    Owner -->|publish fingerprint| L
    Owner -->|encrypt & upload| F
    Owner -->|issue VC| VC
    Owner -->|generate re-enc key| PRE
    User -->|present VC| PRE
    PRE -->|re-encrypt key| User
    User -->|decrypt locally| F

On-Chain Data Lineage Layer

Stores metadata only through a LineageRecord structure:

{
  "contributionFingerprint": "Hash",
  "version": "string",
  "previousHash": "Hash",
  "operatorDID": "string",
  "dataUri": "string"
}
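
As a minimal sketch of how such a record might be built, the Python below fills the five fields. It assumes SHA-256 as the hash function and an empty `previousHash` for the first version; neither is specified above, and it is also not specified whether plaintext or ciphertext bytes are hashed. The DID and URI values are placeholders.

```python
import hashlib
import json
from dataclasses import dataclass, asdict


def fingerprint(payload: bytes) -> str:
    """Hex SHA-256 digest; the choice of hash function is an assumption."""
    return hashlib.sha256(payload).hexdigest()


@dataclass(frozen=True)
class LineageRecord:
    contributionFingerprint: str
    version: str
    previousHash: str  # empty string for the first version (assumed convention)
    operatorDID: str
    dataUri: str

    def to_json(self) -> str:
        # Sorted keys and compact separators give a stable byte representation.
        return json.dumps(asdict(self), sort_keys=True, separators=(",", ":"))


payload = b"example contribution bytes"  # placeholder content
record = LineageRecord(
    contributionFingerprint=fingerprint(payload),
    version="v1",
    previousHash="",
    operatorDID="did:example:owner",     # hypothetical DID
    dataUri="ipfs://<cid-placeholder>",  # hypothetical URI
)
print(record.to_json())
```

Only this small record would go on-chain; the payload itself stays in off-chain storage and is referenced through `dataUri`.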

Off-Chain Storage Layer

Stores strongly encrypted data on Filecoin, Arweave, S3, or hybrid solutions. The storage layer is treated as publicly readable by default, so data security relies entirely on encryption and key distribution.

Key and Permission Layer

Verifiable Credentials answer authorization questions; Proxy Re-Encryption handles secure key delivery.

Access Layer

The platform verifies credentials and executes re-encryption but never gains decryption capability.

Data Evolution and Version Control

Two distinct concepts organize data management:

Food Science Dataset Example (v1-v3 Evolution)

Version 1 (Raw Collection): 10,000 food photos with basic metadata

Version 2 (Expert Annotation): Added structured annotations (calories, ingredients, allergens)

Version 3 (Correction and Compliance): Fixed nutrition labels and applied face blurring
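
The three versions above could be linked roughly as follows. This is a sketch under the assumption that `previousHash` holds the preceding version's `contributionFingerprint`; the payload bytes and helper names are illustrative only.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def append_version(chain: list, version: str, payload: bytes) -> dict:
    """Append a record whose previousHash points at the prior fingerprint."""
    previous = chain[-1]["contributionFingerprint"] if chain else ""
    record = {
        "contributionFingerprint": sha256_hex(payload),
        "version": version,
        "previousHash": previous,
    }
    chain.append(record)
    return record


def chain_is_consistent(chain: list) -> bool:
    """Check that every record links to the fingerprint of its predecessor."""
    expected = ""
    for record in chain:
        if record["previousHash"] != expected:
            return False
        expected = record["contributionFingerprint"]
    return True


chain = []
append_version(chain, "v1", b"raw photos + basic metadata")
append_version(chain, "v2", b"v1 + expert annotations")
append_version(chain, "v3", b"v2 + corrected labels, blurred faces")
print(chain_is_consistent(chain))  # True

# Rewriting v2 after the fact breaks the link stored in v3.
chain[1]["contributionFingerprint"] = sha256_hex(b"tampered")
print(chain_is_consistent(chain))  # False
```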

Core Mechanism: Secure Key Delivery via Proxy Re-Encryption

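The architecture employs hybrid encryption: a symmetric key encrypts the bulk data once, and public-key cryptography (here, proxy re-encryption) protects only that small symmetric key.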

Full Authorization and Access Flow

Phase 1: Data Publishing

  1. Owner generates version and symmetric key locally
  2. Encrypts raw data with the symmetric key; uploads ciphertext to off-chain storage (IPFS/OSS/S3)
  3. Encapsulates the symmetric key under the Owner's own public key; stores the result in metadata

Phase 2: Decentralized Authorization

  1. User requests access; Owner issues Verifiable Credential
  2. Owner generates proxy re-encryption key locally
  3. Owner distributes credentials and re-encryption key to platform

Phase 3: Proxy Access

  1. User presents VC to platform
  2. Platform verifies VC and executes re-encryption
  3. User decrypts locally using their private key

This achieves “encrypt once, authorize many” without re-encrypting large files for each user.
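
To make the three phases concrete, here is a toy sketch using the textbook ElGamal-style scheme of Blaze, Bleumer, and Strauss (BBS98). It is not the scheme the architecture necessarily uses, and several things are assumptions for illustration: the group parameters are tiny and insecure, the SHA-256 counter-mode cipher is a stand-in for a real AEAD, and all function names are hypothetical. Note also that BBS98 derives the re-encryption key from both private keys, which does not match the "Owner generates locally" step; unidirectional PRE schemes derive it from the Owner's private key and the User's public key instead. Requires Python 3.8+ for modular inverses via `pow`.

```python
import hashlib
import secrets

# Toy group: P = 2*Q + 1 with Q prime; G generates the order-Q subgroup.
# These parameters are far too small for real use.
P, Q, G = 1019, 509, 4


def keygen():
    sk = secrets.randbelow(Q - 1) + 1  # uniform in 1 .. Q-1
    return sk, pow(G, sk, P)


def encapsulate(pk: int):
    """Return (capsule, data_key); the capsule hides a random group element."""
    m = pow(G, secrets.randbelow(Q - 1) + 1, P)
    r = secrets.randbelow(Q - 1) + 1
    capsule = (m * pow(G, r, P) % P, pow(pk, r, P))  # (m*g^r, g^(sk*r))
    return capsule, hashlib.sha256(str(m).encode()).digest()


def decapsulate(sk: int, capsule):
    c1, c2 = capsule
    g_r = pow(c2, pow(sk, -1, Q), P)   # (g^(sk*r))^(1/sk) = g^r
    m = c1 * pow(g_r, -1, P) % P
    return hashlib.sha256(str(m).encode()).digest()


def rekey(sk_owner: int, sk_user: int) -> int:
    # BBS98 limitation: needs both private keys (see caveat above).
    return sk_user * pow(sk_owner, -1, Q) % Q


def reencrypt(rk: int, capsule):
    c1, c2 = capsule
    return c1, pow(c2, rk, P)          # g^(a*r) becomes g^(b*r)


def xor_stream(key: bytes, data: bytes) -> bytes:
    """Toy stream cipher (SHA-256 in counter mode); stand-in for an AEAD."""
    out = bytearray()
    for i in range(0, len(data), 32):
        block = hashlib.sha256(key + i.to_bytes(8, "big")).digest()
        out.extend(b ^ k for b, k in zip(data[i:i + 32], block))
    return bytes(out)


owner_sk, owner_pk = keygen()
user_sk, user_pk = keygen()

# Phase 1: encrypt the data once and self-encapsulate the data key.
capsule, data_key = encapsulate(owner_pk)
ciphertext = xor_stream(data_key, b"10,000 food photos ...")

# Phase 2: derive a re-encryption key for this user.
rk = rekey(owner_sk, user_sk)

# Phase 3: the proxy transforms only the capsule; the user decrypts locally.
user_capsule = reencrypt(rk, capsule)
user_key = decapsulate(user_sk, user_capsule)
print(xor_stream(user_key, ciphertext))
```

The large `ciphertext` is never touched during authorization; only the two-element capsule is transformed per user, which is what keeps the per-grant cost small.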

Permission Revocation and Forward Secrecy

Three-level revocation approach:

  1. Platform Layer: Owner instructs platform to delete re-encryption key
  2. Verification Layer: Owner uses VC Revocation List or expiration mechanisms
  3. Version Layer: Owner generates new version with fresh key; old authorizations naturally fail for new data
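
The three levels can be read as three independent checks the platform runs before re-encrypting. The sketch below assumes a hypothetical `Credential` shape, an in-memory re-encryption key store keyed by holder and version, and that the VC signature has already been verified elsewhere; none of these details come from the design above, and the dates are illustrative.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Credential:
    credential_id: str
    holder_did: str
    version_fingerprint: str  # the data version this VC is bound to
    expires_at: datetime


def may_reencrypt(vc, requested_fingerprint, rekey_store, revoked_ids, now):
    """Return (allowed, reason). VC signature verification is assumed to
    have happened before this call and is not shown."""
    # Version layer: a VC issued for an older version does not match new data.
    if vc.version_fingerprint != requested_fingerprint:
        return False, "credential bound to a different version"
    # Verification layer: revocation list or expiry.
    if vc.credential_id in revoked_ids:
        return False, "credential revoked"
    if now >= vc.expires_at:
        return False, "credential expired"
    # Platform layer: without a stored re-encryption key nothing can be executed.
    if (vc.holder_did, requested_fingerprint) not in rekey_store:
        return False, "re-encryption key deleted"
    return True, "ok"


now = datetime(2025, 1, 1, tzinfo=timezone.utc)
vc = Credential("vc-1", "did:example:user", "fp-v2",
                datetime(2025, 6, 1, tzinfo=timezone.utc))
rekeys = {("did:example:user", "fp-v2"): 123}  # placeholder key material

print(may_reencrypt(vc, "fp-v2", rekeys, set(), now))  # (True, 'ok')
rekeys.clear()  # Owner instructs the platform to delete the key
print(may_reencrypt(vc, "fp-v2", rekeys, set(), now))  # (False, 're-encryption key deleted')
```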

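Because each new version is encrypted under a fresh key, a compromised or revoked key for an earlier version does not expose later versions, loosely similar to key ratcheting in the Signal protocol. Data that was already decrypted under an old authorization cannot be recalled.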

System Availability When Platform Fails

| Scenario | Impact | Resolution |
| --- | --- | --- |
| Platform unavailable, storage exists | Users with obtained keys unaffected | Owner can redeploy PRE logic |
| Platform acts maliciously | Service denial only, not fact tampering | Platform cannot forge owner-signed VCs |
| Platform and partial storage fail | Old lineage remains verifiable | Owner re-uploads to new storage network |

Platform failure reduces automation, not data sovereignty.

Auditability

Because some data will inevitably leak, auditability aims to maximize traceability after the fact. Verification proceeds in four steps:

  1. Confirm Data Identity: Recalculate hash fingerprint; compare with on-chain record
  2. Trace Complete Access Path: Review lineage records, VC authorizations, and platform audit logs
  3. Narrow Responsibility: Compress uncertainty into investigable scope
  4. Provide Legal Evidence: Create reproducible, immutable evidence package
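
Steps 1 to 3 might look like the following in code. The sketch assumes the on-chain fingerprint is a SHA-256 digest over the same bytes being examined, and it uses a hypothetical audit-log format (`holderDID`, `fingerprint`, `event`); real logs and lineage lookups would differ.

```python
import hashlib


def audit_scope(leaked_bytes, lineage, access_log):
    """Match a leaked file to a version, then list who obtained a key for it."""
    # Step 1: recompute the fingerprint and look it up in the lineage records.
    fp = hashlib.sha256(leaked_bytes).hexdigest()
    record = next((r for r in lineage if r["contributionFingerprint"] == fp), None)
    if record is None:
        return None, []
    # Steps 2-3: keep only holders with a re-encryption event for that version.
    holders = sorted({
        entry["holderDID"]
        for entry in access_log
        if entry["fingerprint"] == fp and entry["event"] == "reencrypt"
    })
    return record["version"], holders


v2 = b"annotated dataset v2"  # placeholder content
v2_fp = hashlib.sha256(v2).hexdigest()
lineage = [{"version": "v2", "contributionFingerprint": v2_fp}]
log = [
    {"holderDID": "did:example:alice", "fingerprint": v2_fp, "event": "reencrypt"},
    {"holderDID": "did:example:bob", "fingerprint": "other", "event": "reencrypt"},
]
print(audit_scope(v2, lineage, log))  # ('v2', ['did:example:alice'])
```

The result is a candidate set of credential holders to investigate, not proof of who leaked the data.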

The system cannot prevent downloads but can narrow an infinite responsibility space into a manageable, investigable, litigable scope.

Summary

The architecture balances performance, security, and decentralization by returning permission control to data owners through cryptographic keys and credentials rather than platform-managed Access Control Lists. It is more practical to engineer than Fully Homomorphic Encryption, at the cost of introducing a semi-trusted proxy node.

Extremely high-confidentiality scenarios might benefit from Trusted Execution Environments as stronger trust anchors.