SshCache internals
This page is a tour of fleche.remote for contributors hacking on the
cross-machine cache. End-user documentation lives in
Configuration; this page explains how the pieces fit
together.
Security model
The wire format is cloudpickle in both directions. That has
two unavoidable consequences:
A hostile server can run arbitrary code on the client by returning a crafted payload that the client’s
cloudpickle.loadsexecutes.A hostile client can run arbitrary code on the server symmetrically.
The trust boundary is therefore exactly the same as ssh user@host
itself: SSH provides confidentiality and authentication of the
endpoints, and we trust both ends not to ship hostile pickled objects.
Do not point SshCache at a host you would
not ssh into, and do not expose the server entry point
(python -m fleche remote --serve) on any transport other than the
spawned SSH stdin/stdout — there is no authentication on the RPC
stream itself.
Cloudpickle is the right call for this feature (we need to ship arbitrary Python objects across, including user value types) but a JSON / msgpack / protobuf wire format with a strict schema would have removed this trust requirement at the cost of constraining the kind of values fleche can cache. That trade-off was made deliberately; revisit it if SshCache ever needs to support untrusted multi-tenant use.
Credentials in info()
The info RPC ships
cache_to_config() of the served cache back to the
client. The raw config contains credentials —
PickleFileBackend writes its HMAC
signing keys (secret_key) as hex strings, and the SQL backend’s
url may include a database password in the userinfo component.
fleche.remote._server_info() therefore walks the config through
fleche.remote._redact_config() before putting it on the wire:
secret_key values are replaced with "<redacted>" and URL
passwords are masked to ***. This matters because the client’s
DEBUG-level RPC tracing logs the full response payload — without the
redactor, signing keys would land in any DEBUG log file by default.
If you add a new storage type that round-trips credentials through
cache_to_config, extend
fleche.remote._SENSITIVE_CONFIG_KEYS to cover the new field
name.
Version handshake
The first cache operation on a fresh SshCache
implicitly fetches an info dict via
fleche.remote.SshCache._ensure_handshake(), which calls
fleche.remote._warn_on_version_skew() on the response. Mismatched
fleche_version or cloudpickle_version between client and server
logs a WARNING on the fleche.remote logger — schema or wire
format drift is the most common root cause of silently-wrong records
across a fleche upgrade, and this surfaces it without forcing a hard
failure (forwards-compatible patch releases are common in practice and
the user has no recourse mid-session from a raise).
The handshake fires on the first BaseCache method on the SshCache,
even reads — every method calls _ensure_handshake() at its top.
Cost is exactly one extra RPC per session.
Goals and constraints
The cross-machine cache exists to let two (or more) machines share results of
expensive @fleche() calls without copying their cache files around by
hand. The constraints that shaped the design:
No always-on daemon. The remote side is a vanilla Python process spawned on demand by the client. Nothing has to be running before the first cache lookup.
Single authentication. Most multi-user clusters require 2FA. A client that silently reconnects whenever the connection drops would re-prompt the user, possibly while they are away. Auto-reconnect is therefore explicitly out of scope (see Lifecycle below).
Backend agnosticism. The remote keeps full freedom over which
BaseCacheit serves — including stacks containing furtherSshCacheinstances, if a user really wants chained shares.Cheap round-trips. Cache ops are typically tiny payloads (digest hits,
containsprobes). The wire protocol favours per-call latency over fan-out.
Process and stream layout
One SshCache owns exactly one
fleche.remote._SshConnection, which in turn owns at most one
Popen:
client remote
────── ──────
SshCache.<op>(...) <op> ∈ save / load / load_value /
│ ▲ contains / expand / shrink /
call│ │return query / evict / info
▼ │
┌──────────────────────┐ ┌─────────────────────┐
│_SshConnection │──(stdin) req──>│ python -m │
│ │ │ fleche remote │
│ ssh host │<─(stdout) resp─│ --serve │
│ python -m fleche │ │ │
│ remote --serve │<─(stderr) lines│ serve(...) loop │
└──────────────────────┘ └─────────────────────┘
│
│ stderr drained by a daemon thread
▼
fleche.remote.host.<host> + 50-line ring buffer
Three streams cross the SSH boundary:
stdin / stdout carry length-prefixed cloudpickle RPC frames. All cache operations multiplex over this single pair; there is no fan-out and no parallel requests.
_Connection.calltakes athreading.Lockfor the entire write→read round-trip so threaded callers serialise naturally.stderr is drained by a daemon thread running
fleche.remote._forward_stderr(). Each line is logged atINFOon the per-host loggerfleche.remote.host.<host>and appended to a 50-line ring buffer on the connection. When the subprocess dies,fleche.remote._SshConnection._diagnose()returns the buffered tail plus the exit code soRemoteConnectionErrorcarries the actual cause instead of justEOFError.The logger name is constructed in
fleche.remote._SshConnection._open()from the SSH host string; dots and@are substituted with_so the host name doesn’t fragment the logger hierarchy. Because the name uses.as the separator, each host’s logger is a real child offleche.remoteand standard logger configuration propagates: settinglogging.getLogger("fleche.remote").setLevel(...)controls every per-host child, whilelogging.getLogger("fleche.remote.host.bigpc_example_com")lets you raise or silence one host independently.
Wire protocol
Every frame is a 4-byte big-endian length prefix (struct.Struct(">I"))
followed by a cloudpickle payload. See fleche.remote._write_frame()
and fleche.remote._read_frame().
Requests are 3-tuples (method: str, args: tuple, kwargs: dict);
responses are 2-tuples either ("ok", value) or ("err", exception).
Exceptions re-raise on the client side, so a server-side
KeyError becomes a KeyError to the caller, a
Rejected stays a Rejected,
and so on.
If the exception itself can’t be cloudpickled (some C-extension types
refuse), the server downgrades it to a RuntimeError carrying the
formatted traceback so the client at least sees what went wrong.
Server-side dispatch
fleche.remote._dispatch() is an explicit if chain — one branch
per BaseCache method, plus info handled at
the serve() level. The explicit table is
deliberate: it documents the full API surface, gives any future method
addition a clear place to wire up server-side translation, and avoids
giving remote callers reflective access to arbitrary attributes of the
cache.
Two pieces need server-side translation rather than raw passthrough:
load()and_query()produceLazyCallinstances whose_cachefield points at the server’s cache. That pointer cannot travel across the wire (it would reach back into a different process on deserialisation).fleche.remote._strip_cache()strips it and returns a plainDigestedCall; the client then callsDigestedCall.fetch(self)to re-bind it to the localSshCache. Subsequent.resultaccesses on the lazy call therefore round-trip throughload_valueon the client’s SshCache and back to the remote — fetching the value only when actually used.query()is a generator on the server side._dispatchmaterialises the whole iterator into a tuple before sending it back, because the wire protocol is request / response, not streaming. Large query results pay the cost up front in one frame.
Client side
fleche.remote._Connection.call() is the one entry point for every
RPC. It acquires the lock, lazily opens the connection if needed,
writes the request frame, reads exactly one response frame, and
dispatches on the response tag. On
(BrokenPipeError, EOFError, OSError) it calls _diagnose() to
collect transport-specific detail, closes the connection, and raises
RemoteConnectionError.
A note on logging: every RPC is logged at DEBUG on the
fleche.remote logger as rpc → method args=... and rpc ←
method ok. Enabling debug-level logging on that namespace gives a
complete trace of the wire traffic without instrumenting any code.
Lifecycle
The subprocess is spawned lazily on the first cache operation. Once
opened it lives for the lifetime of the Python process and is closed
via atexit (the client’s hook is registered on first
_open).
Note
Subprocesses are not closed on garbage collection.
atexit.register(self.close) keeps a reference to the connection
alive for the whole process, so a SshCache
that goes out of scope is not eligible for collection and its SSH
subprocess stays up until interpreter exit. In a long-running
interactive session that creates and discards many SshCache
instances (e.g. repeatedly re-reading a config), the subprocesses
accumulate. Call close() explicitly
when done with a cache you won’t reuse. A future revision could use
a weakref-based atexit hook so dropped caches clean themselves
up, but the simple always-on-exit cleanup is correct for the common
case of a handful of long-lived caches.
Note
Reading info() / read_only costs an RPC.
info() and the
read_only property are not free: the
first access fetches the server info dict over the wire (and drives
the version handshake). The result is cached for the lifetime of
the connection, so only the first access pays; but a user poking
sc.read_only in a REPL should know it is a network round-trip,
not a local attribute. reconnect()
invalidates the cache, so the next access re-fetches.
There is no automatic reconnect. If the SSH subprocess dies — network
hiccup, server reboot, idle timeout — every subsequent op raises
RemoteConnectionError until the caller invokes
reconnect() explicitly. The error
message produced in
fleche.remote._Connection.call() names SshCache.reconnect()
directly so the user doesn’t have to know about the lifecycle helper
ahead of time. This is the single most opinionated decision in the
module: an auto-reconnect would silently re-trigger interactive 2FA
prompts (often while the user is away from the terminal). Forcing the
call to be explicit means the user can decide when to pay that cost.
For sessions that span multiple Python runs, the recommended pattern is
OpenSSH ControlMaster + ControlPersist: the first connection
authenticates and the multiplexer keeps the underlying TCP session
alive across child processes, so subsequent ssh invocations within
the persist window skip the handshake entirely. Wire it up either in
~/.ssh/config or via the ssh_options field on
SshCache (see the docstring for an example).
Setup-commands chain
When the user passes workdir and/or setup_commands, the
constructed remote command looks like:
cd <workdir> && <snippet1> && <snippet2> && ... && exec <python> -m fleche remote --serve
The trailing exec replaces the wrapping shell so stdin/stdout pipe
straight through to the server process — no extra layer to corrupt the
binary RPC stream. Any setup failure short-circuits via && and the
ssh process exits non-zero; the client’s next read raises EOFError
and the diagnostic captured from stderr surfaces the actual reason.
workdir (when set) contributes the leading cd so it runs ahead
of the user’s setup_commands. Because the server boots the cache
via python -m, that working directory lands on sys.path, letting
the remote import the project-local modules referenced by unpickled
calls.
Without workdir or setup_commands, the server argv is passed
directly to ssh as separate arguments, so SSH exec\ s it as-is
— no remote shell is involved at all.
Diagnostics: the info RPC
fleche.remote._server_info() is the introspection back-channel.
It is the only method dispatched at the serve()
level (alongside the data plane in _dispatch()),
because it needs access to two things the dispatcher doesn’t have: the
cache_name the server was launched with, and the active cache from
the fleche.state._CACHE ContextVar.
It returns:
Key |
Meaning |
|---|---|
|
|
|
The |
|
|
|
|
|
|
|
|
|
The server process’s PID. |
info() is the debugging back-channel for any “the remote isn’t
doing what I expected” question — wrong cache loaded, wrong working
directory, unexpected read_only flag, surprising Python
interpreter, surprising host. Calling it from the client surfaces the
remote’s view of itself in one round-trip, with no need to
ssh host separately and poke around.
Active cache on the remote
The server’s served cache is its active cache. python -m
fleche remote --serve installs the named cache on the
fleche.state._CACHE ContextVar before entering the request loop,
so any code path on the remote that consults fleche.cache()
independently — metadata hooks, LazyCall._cache references, nested
@fleche() calls — sees the same instance the RPC layer is
dispatching against. The server process is single-threaded and
serves exactly one cache, so there’s no other sensible default.
Read-only short-circuit
save() and
evict() check the cached read_only
flag from the server info dict before issuing the RPC. If the flag is
True they raise Rejected locally with no
round-trip.
The flag is populated lazily — the first time read_only (or
info()) is consulted, an info RPC fires and the result is
cached on the SshCache instance. Subsequent
saves consult the cache directly. Cost is therefore at most one
extra round-trip per session, and zero in the common case where the
first cache op is itself a write.
reconnect() invalidates the cached info
so the next read re-fetches against the new subprocess.
Testing strategy
The module ships two layers of tests, both in tests/{unit,integration}/test_remote.py:
Unit tests drive
serve()in a background daemon thread with twoos.pipe()pairs in place of an SSH subprocess. The same wire-protocol machinery is exercised — frame layout, dispatch, exception propagation, the_Connectionlifecycle — without SSH or process spawning. Tests that need the transport’s behaviour (subprocess exit codes, stderr capture) drive afleche.remote._Connectionsubclass directly with no server on the other side; seetest_connection_drop_includes_diagnose_output.Integration tests launch
python -m fleche remote --serveas a local subprocess (still no SSH involved) so the fullPopen-with-three-pipes handshake, module loading, and config-file parsing on the server side are all exercised. Each test uses atmp_path-scopedfleche.tomlso server state lives in an isolated directory.
Adding a new RPC
The path is short and entirely mechanical:
Add a server-side branch in
fleche.remote._dispatch()(or, if the RPC needs server-only state likecache_name, branch inserve()itself the wayinfodoes). Be explicit about translating any cache-bound return value via_strip_cache().Add a client method on
SshCachethat callsself._conn.call("method_name", *args). If the response carriesDigestedCallinstances,.fetch(self)them back into client-boundLazyCall\ s.Add tests at both layers: a unit test that drives the new method through the in-process pipe, and (when it’s worth it) an integration test that exercises it across the subprocess boundary.
Avoid adding more methods than necessary — every new method is wire surface that has to round-trip from any user. Prefer threading the new operation through one of the existing methods where possible.