SshCache internals

This page is a tour of fleche.remote for contributors hacking on the cross-machine cache. End-user documentation lives in Configuration; this page explains how the pieces fit together.

Security model

The wire format is cloudpickle in both directions. That has two unavoidable consequences:

  • A hostile server can run arbitrary code on the client by returning a crafted payload that the client’s cloudpickle.loads executes.

  • A hostile client can run arbitrary code on the server symmetrically.

The trust boundary is therefore exactly the same as ssh user@host itself: SSH provides confidentiality and authentication of the endpoints, and we trust both ends not to ship hostile pickled objects. Do not point SshCache at a host you would not ssh into, and do not expose the server entry point (python -m fleche remote --serve) on any transport other than the spawned SSH stdin/stdout — there is no authentication on the RPC stream itself.

Cloudpickle is the right call for this feature (we need to ship arbitrary Python objects across, including user value types) but a JSON / msgpack / protobuf wire format with a strict schema would have removed this trust requirement at the cost of constraining the kind of values fleche can cache. That trade-off was made deliberately; revisit it if SshCache ever needs to support untrusted multi-tenant use.

Credentials in info()

The info RPC ships cache_to_config() of the served cache back to the client. The raw config contains credentials — PickleFileBackend writes its HMAC signing keys (secret_key) as hex strings, and the SQL backend’s url may include a database password in the userinfo component. fleche.remote._server_info() therefore walks the config through fleche.remote._redact_config() before putting it on the wire: secret_key values are replaced with "<redacted>" and URL passwords are masked to ***. This matters because the client’s DEBUG-level RPC tracing logs the full response payload — without the redactor, signing keys would land in any DEBUG log file by default.

If you add a new storage type that round-trips credentials through cache_to_config, extend fleche.remote._SENSITIVE_CONFIG_KEYS to cover the new field name.

Version handshake

The first cache operation on a fresh SshCache implicitly fetches an info dict via fleche.remote.SshCache._ensure_handshake(), which calls fleche.remote._warn_on_version_skew() on the response. Mismatched fleche_version or cloudpickle_version between client and server logs a WARNING on the fleche.remote logger — schema or wire format drift is the most common root cause of silently-wrong records across a fleche upgrade, and this surfaces it without forcing a hard failure (forwards-compatible patch releases are common in practice and the user has no recourse mid-session from a raise).

The handshake fires on the first BaseCache method on the SshCache, even reads — every method calls _ensure_handshake() at its top. Cost is exactly one extra RPC per session.

Goals and constraints

The cross-machine cache exists to let two (or more) machines share results of expensive @fleche() calls without copying their cache files around by hand. The constraints that shaped the design:

  • No always-on daemon. The remote side is a vanilla Python process spawned on demand by the client. Nothing has to be running before the first cache lookup.

  • Single authentication. Most multi-user clusters require 2FA. A client that silently reconnects whenever the connection drops would re-prompt the user, possibly while they are away. Auto-reconnect is therefore explicitly out of scope (see Lifecycle below).

  • Backend agnosticism. The remote keeps full freedom over which BaseCache it serves — including stacks containing further SshCache instances, if a user really wants chained shares.

  • Cheap round-trips. Cache ops are typically tiny payloads (digest hits, contains probes). The wire protocol favours per-call latency over fan-out.

Process and stream layout

One SshCache owns exactly one fleche.remote._SshConnection, which in turn owns at most one Popen:

client                                       remote
──────                                       ──────

SshCache.<op>(...)            <op> ∈ save / load / load_value /
    │    ▲                            contains / expand / shrink /
call│    │return                      query / evict / info
    ▼    │
┌──────────────────────┐                ┌─────────────────────┐
│_SshConnection        │──(stdin) req──>│ python -m           │
│                      │                │   fleche remote     │
│ ssh host             │<─(stdout) resp─│   --serve           │
│ python -m fleche     │                │                     │
│ remote --serve       │<─(stderr) lines│   serve(...) loop   │
└──────────────────────┘                └─────────────────────┘
    │
    │ stderr drained by a daemon thread
    ▼
fleche.remote.host.<host>  +  50-line ring buffer

Three streams cross the SSH boundary:

  • stdin / stdout carry length-prefixed cloudpickle RPC frames. All cache operations multiplex over this single pair; there is no fan-out and no parallel requests. _Connection.call takes a threading.Lock for the entire write→read round-trip so threaded callers serialise naturally.

  • stderr is drained by a daemon thread running fleche.remote._forward_stderr(). Each line is logged at INFO on the per-host logger fleche.remote.host.<host> and appended to a 50-line ring buffer on the connection. When the subprocess dies, fleche.remote._SshConnection._diagnose() returns the buffered tail plus the exit code so RemoteConnectionError carries the actual cause instead of just EOFError.

    The logger name is constructed in fleche.remote._SshConnection._open() from the SSH host string; dots and @ are substituted with _ so the host name doesn’t fragment the logger hierarchy. Because the name uses . as the separator, each host’s logger is a real child of fleche.remote and standard logger configuration propagates: setting logging.getLogger("fleche.remote").setLevel(...) controls every per-host child, while logging.getLogger("fleche.remote.host.bigpc_example_com") lets you raise or silence one host independently.

Wire protocol

Every frame is a 4-byte big-endian length prefix (struct.Struct(">I")) followed by a cloudpickle payload. See fleche.remote._write_frame() and fleche.remote._read_frame().

Requests are 3-tuples (method: str, args: tuple, kwargs: dict); responses are 2-tuples either ("ok", value) or ("err", exception). Exceptions re-raise on the client side, so a server-side KeyError becomes a KeyError to the caller, a Rejected stays a Rejected, and so on.

If the exception itself can’t be cloudpickled (some C-extension types refuse), the server downgrades it to a RuntimeError carrying the formatted traceback so the client at least sees what went wrong.

Server-side dispatch

fleche.remote._dispatch() is an explicit if chain — one branch per BaseCache method, plus info handled at the serve() level. The explicit table is deliberate: it documents the full API surface, gives any future method addition a clear place to wire up server-side translation, and avoids giving remote callers reflective access to arbitrary attributes of the cache.

Two pieces need server-side translation rather than raw passthrough:

  • load() and _query() produce LazyCall instances whose _cache field points at the server’s cache. That pointer cannot travel across the wire (it would reach back into a different process on deserialisation). fleche.remote._strip_cache() strips it and returns a plain DigestedCall; the client then calls DigestedCall.fetch(self) to re-bind it to the local SshCache. Subsequent .result accesses on the lazy call therefore round-trip through load_value on the client’s SshCache and back to the remote — fetching the value only when actually used.

  • query() is a generator on the server side. _dispatch materialises the whole iterator into a tuple before sending it back, because the wire protocol is request / response, not streaming. Large query results pay the cost up front in one frame.

Client side

fleche.remote._Connection.call() is the one entry point for every RPC. It acquires the lock, lazily opens the connection if needed, writes the request frame, reads exactly one response frame, and dispatches on the response tag. On (BrokenPipeError, EOFError, OSError) it calls _diagnose() to collect transport-specific detail, closes the connection, and raises RemoteConnectionError.

A note on logging: every RPC is logged at DEBUG on the fleche.remote logger as rpc method args=... and rpc method ok. Enabling debug-level logging on that namespace gives a complete trace of the wire traffic without instrumenting any code.

Lifecycle

The subprocess is spawned lazily on the first cache operation. Once opened it lives for the lifetime of the Python process and is closed via atexit (the client’s hook is registered on first _open).

Note

Subprocesses are not closed on garbage collection.

atexit.register(self.close) keeps a reference to the connection alive for the whole process, so a SshCache that goes out of scope is not eligible for collection and its SSH subprocess stays up until interpreter exit. In a long-running interactive session that creates and discards many SshCache instances (e.g. repeatedly re-reading a config), the subprocesses accumulate. Call close() explicitly when done with a cache you won’t reuse. A future revision could use a weakref-based atexit hook so dropped caches clean themselves up, but the simple always-on-exit cleanup is correct for the common case of a handful of long-lived caches.

Note

Reading info() / read_only costs an RPC.

info() and the read_only property are not free: the first access fetches the server info dict over the wire (and drives the version handshake). The result is cached for the lifetime of the connection, so only the first access pays; but a user poking sc.read_only in a REPL should know it is a network round-trip, not a local attribute. reconnect() invalidates the cache, so the next access re-fetches.

There is no automatic reconnect. If the SSH subprocess dies — network hiccup, server reboot, idle timeout — every subsequent op raises RemoteConnectionError until the caller invokes reconnect() explicitly. The error message produced in fleche.remote._Connection.call() names SshCache.reconnect() directly so the user doesn’t have to know about the lifecycle helper ahead of time. This is the single most opinionated decision in the module: an auto-reconnect would silently re-trigger interactive 2FA prompts (often while the user is away from the terminal). Forcing the call to be explicit means the user can decide when to pay that cost.

For sessions that span multiple Python runs, the recommended pattern is OpenSSH ControlMaster + ControlPersist: the first connection authenticates and the multiplexer keeps the underlying TCP session alive across child processes, so subsequent ssh invocations within the persist window skip the handshake entirely. Wire it up either in ~/.ssh/config or via the ssh_options field on SshCache (see the docstring for an example).

Setup-commands chain

When the user passes workdir and/or setup_commands, the constructed remote command looks like:

cd <workdir> && <snippet1> && <snippet2> && ... && exec <python> -m fleche remote --serve

The trailing exec replaces the wrapping shell so stdin/stdout pipe straight through to the server process — no extra layer to corrupt the binary RPC stream. Any setup failure short-circuits via && and the ssh process exits non-zero; the client’s next read raises EOFError and the diagnostic captured from stderr surfaces the actual reason.

workdir (when set) contributes the leading cd so it runs ahead of the user’s setup_commands. Because the server boots the cache via python -m, that working directory lands on sys.path, letting the remote import the project-local modules referenced by unpickled calls.

Without workdir or setup_commands, the server argv is passed directly to ssh as separate arguments, so SSH exec\ s it as-is — no remote shell is involved at all.

Diagnostics: the info RPC

fleche.remote._server_info() is the introspection back-channel. It is the only method dispatched at the serve() level (alongside the data plane in _dispatch()), because it needs access to two things the dispatcher doesn’t have: the cache_name the server was launched with, and the active cache from the fleche.state._CACHE ContextVar.

It returns:

Key

Meaning

cache

cache_to_config() of the served cache — a structured dict (or list, for stacks) that round-trips back through cache_from_config().

cache_name

The --cache argument the server was launched with, or None for the default cache.

read_only

True when the served cache (or, for a CacheStack, its stack[0]) is a ReadOnlyMixin.

cwd

os.getcwd() on the remote.

hostname

socket.gethostname() on the remote.

python

sys.executable on the remote.

pid

The server process’s PID.

info() is the debugging back-channel for any “the remote isn’t doing what I expected” question — wrong cache loaded, wrong working directory, unexpected read_only flag, surprising Python interpreter, surprising host. Calling it from the client surfaces the remote’s view of itself in one round-trip, with no need to ssh host separately and poke around.

Active cache on the remote

The server’s served cache is its active cache. python -m fleche remote --serve installs the named cache on the fleche.state._CACHE ContextVar before entering the request loop, so any code path on the remote that consults fleche.cache() independently — metadata hooks, LazyCall._cache references, nested @fleche() calls — sees the same instance the RPC layer is dispatching against. The server process is single-threaded and serves exactly one cache, so there’s no other sensible default.

Read-only short-circuit

save() and evict() check the cached read_only flag from the server info dict before issuing the RPC. If the flag is True they raise Rejected locally with no round-trip.

The flag is populated lazily — the first time read_only (or info()) is consulted, an info RPC fires and the result is cached on the SshCache instance. Subsequent saves consult the cache directly. Cost is therefore at most one extra round-trip per session, and zero in the common case where the first cache op is itself a write.

reconnect() invalidates the cached info so the next read re-fetches against the new subprocess.

Testing strategy

The module ships two layers of tests, both in tests/{unit,integration}/test_remote.py:

  • Unit tests drive serve() in a background daemon thread with two os.pipe() pairs in place of an SSH subprocess. The same wire-protocol machinery is exercised — frame layout, dispatch, exception propagation, the _Connection lifecycle — without SSH or process spawning. Tests that need the transport’s behaviour (subprocess exit codes, stderr capture) drive a fleche.remote._Connection subclass directly with no server on the other side; see test_connection_drop_includes_diagnose_output.

  • Integration tests launch python -m fleche remote --serve as a local subprocess (still no SSH involved) so the full Popen-with-three-pipes handshake, module loading, and config-file parsing on the server side are all exercised. Each test uses a tmp_path-scoped fleche.toml so server state lives in an isolated directory.

Adding a new RPC

The path is short and entirely mechanical:

  1. Add a server-side branch in fleche.remote._dispatch() (or, if the RPC needs server-only state like cache_name, branch in serve() itself the way info does). Be explicit about translating any cache-bound return value via _strip_cache().

  2. Add a client method on SshCache that calls self._conn.call("method_name", *args). If the response carries DigestedCall instances, .fetch(self) them back into client-bound LazyCall\ s.

  3. Add tests at both layers: a unit test that drives the new method through the in-process pipe, and (when it’s worth it) an integration test that exercises it across the subprocess boundary.

Avoid adding more methods than necessary — every new method is wire surface that has to round-trip from any user. Prefer threading the new operation through one of the existing methods where possible.