Thread-Safe Architecture¶

This document describes the internal design of Stichotrope's thread-safe profiling implementation in v0.2.0+. It is intended for developers who contribute to the project or want to understand the implementation details.

Overview¶

Stichotrope combines thread-local storage, a global registry, and a minimal locking strategy to make profiling thread-safe with negligible overhead.

Design Principles¶

Per-thread lock-free operations: Each thread's profiling data is independent, requiring no synchronization for track/block profiling
Minimal lock contention: Locks are only acquired for cross-thread aggregation operations
Automatic cleanup: Thread-local data is automatically cleaned up when threads terminate
Zero overhead when disabled: Global enable/disable avoids profiling completely when not needed

Data Structures¶

Thread-Local Storage¶

# Per-thread structure (threading.local())
_tracks: Dict[int, Track]

Each thread maintains its own _tracks dictionary mapping track indices to Track objects. Track objects contain:

class Track:
    track_idx: int                      # Track identifier
    track_name: Optional[str]           # User-friendly track name
    enabled: bool                       # Whether track is enabled
    blocks: Dict[int, ProfileBlock]     # Blocks in this track

ProfileBlock objects store timing data:

class ProfileBlock:
    name: str                           # Block name
    file: str                           # Source file
    line: int                           # Source line
    hit_count: int                      # Number of times executed
    total_time_ns: int                  # Total execution time (ns)
    min_time_ns: int                    # Minimum execution time (ns)
    max_time_ns: int                    # Maximum execution time (ns)
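
As a rough illustration of how these fields might be maintained, the sketch below models ProfileBlock as a dataclass and folds timing samples into it. The `record()` helper, the default values, and the `sys.maxsize` sentinel for the minimum are assumptions made for this example, not part of the documented class.

```python
import sys
from dataclasses import dataclass


@dataclass
class ProfileBlock:
    name: str
    file: str
    line: int
    hit_count: int = 0
    total_time_ns: int = 0
    min_time_ns: int = sys.maxsize  # assumption: sentinel until the first sample
    max_time_ns: int = 0

    def record(self, elapsed_ns: int) -> None:
        """Hypothetical helper: fold one timing sample into the statistics."""
        self.hit_count += 1
        self.total_time_ns += elapsed_ns
        self.min_time_ns = min(self.min_time_ns, elapsed_ns)
        self.max_time_ns = max(self.max_time_ns, elapsed_ns)


block = ProfileBlock(name="load", file="app.py", line=10)
for sample_ns in (300, 100, 200):
    block.record(sample_ns)
```

Each update touches only integer fields on one object, which is what allows the per-thread hot path to stay cheap.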

Global Registry¶

_all_thread_data: Dict[int, threading.local]

The global registry maps thread IDs to their threading.local() objects. This allows aggregating results across threads while maintaining per-thread independence.

Lock Strategy¶

Global Lock¶

_lock = threading.RLock()

A single RLock (reentrant lock) protects the global _all_thread_data registry and cross-thread aggregation operations.

Lock Acquisition Points¶

Locked operations (require global lock):
- __init__(): Register profiler instance in global registry
- get_all_thread_data(): Aggregate results from all threads
- Cross-thread cache validation

Lock-free operations (no lock needed):
- track() decorator application
- block() context manager entry/exit
- get_results() for current thread
- Per-thread profiling updates

Lock Contention Impact¶

Minimal contention: Locks are held for microseconds (only during aggregation)
No locks in hot path: Decorator and context manager execution does not acquire locks
Per-thread efficiency: 99%+ of profiling operations are lock-free

Call-Site Caching¶

Stichotrope uses call-site caching to minimize overhead:

# Global cache maps (file, line, function_name) to ProfileBlock
_call_site_cache: Dict[Tuple[str, int, str], ProfileBlock]

When a decorated function is called:

Inspect call site (file, line, function name)
Check if this call site is already cached
If cached: retrieve existing ProfileBlock
If not cached: create new ProfileBlock and cache it

Repeated calls from the same call site therefore reuse the cached ProfileBlock instead of creating a new one, which reduces per-call overhead.

Thread Lifecycle¶

Thread Creation¶

When a thread begins profiling:

Thread-local storage is automatically created by threading.local()
Global registry entry is created on first profiling operation
_tracks dictionary is initialized for this thread

Thread Termination¶

When a thread terminates:

Thread-local storage is automatically cleaned up (no manual cleanup needed)
Global registry entry becomes inactive
Data can still be accessed via get_all_thread_data() until the profiler is destroyed
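
The following sketch illustrates why data can outlive its thread: the thread-local attribute disappears from view once the worker exits, but a reference held by the registry keeps the underlying dictionary alive. The names `_registry` and `worker` are hypothetical, and the registry here stores the per-thread dictionary directly, which is an assumption about the mechanism.

```python
import threading

_local = threading.local()
_lock = threading.RLock()
_registry = {}  # thread ID -> that thread's tracks dict


def worker() -> None:
    _local.tracks = {0: "worker-track"}
    with _lock:
        # The registry keeps its own reference to the per-thread dict.
        _registry[threading.get_ident()] = _local.tracks


t = threading.Thread(target=worker)
t.start()
t.join()

# The main thread never set `tracks`, so the thread-local view is empty here.
visible_in_main = hasattr(_local, "tracks")
# The worker's data is still reachable through the registry.
survivors = list(_registry.values())
```

In this arrangement, dropping the registry (for example when the profiler object is destroyed) is what finally releases the terminated thread's data.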

Implementation Reference¶

The complete implementation is in stichotrope/profiler.py. Key sections:

Thread-local initialization: _setup_thread_local() method
Global registry management: _register_thread() and _unregister_thread() methods
Decorator/context manager: track() and block() methods
Lock acquisition: get_all_thread_data() method uses lock for aggregation
Call-site caching: _get_or_create_block() method implements cache logic

Testing¶

Thread-safety is verified through:

Unit tests: tests/unit/test_thread_safety.py - Basic thread-safety tests
Integration tests: tests/integration/test_threaded_profiling.py - Real-world scenarios
Performance tests: tests/performance/test_thread_safety_overhead.py - Overhead measurement
Stress tests: Multiple threads with high contention scenarios