Connection Resilience (A+B+C) Implementation Plan
Connection Resilience (A+B+C) Implementation Plan
For agentic workers: REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (
- [ ]) syntax for tracking.
Goal: Fix the “zombie WiFi” bug where the app appears connected after losing WiFi but audio errors out when the buffer drains with no way to recover short of force-killing.
Architecture: Three complementary layers of detection/recovery:
- A — Require
NET_CAPABILITY_VALIDATEDon the NetworkRequest so Android’s captive-portal/validation probe loss is treated as a network loss (not just link-layer disconnect). - B — App-level stall watchdog in
SendSpinClientthat tracks last-received-byte timestamp and forcestransport.close()if a connected+playing session goes silent for >7 seconds, short-circuiting Ktor’s slow ping-timeout detection. - C — Remove the hard 5-attempt cap in normal mode. Use the same shape as High Power Mode: exponential backoff for the first 5 attempts, then 30s steady-state forever (with the banner continuing to display progress). Banner UI already handles unbounded attempt counts.
Tech Stack: Kotlin, Android (minSdk reads NetworkCapabilities), Ktor WebSockets (BaseWebSocketTransport), MockK + JUnit4 for unit tests, UnconfinedTestDispatcher for coroutine testing.
Context for the Implementing Engineer
You are working on SendSpinDroid, a native Kotlin synchronized-audio client. It connects to a SendSpin server via WebSocket and plays PCM audio with microsecond-precise clock sync. The relevant symptom we’re fixing:
User is on WiFi listening to music, walks outside, phone loses WiFi. The app’s UI still shows “Connected”. Audio plays from its buffer for ~10-30 seconds then cuts out. The app never reconnects. Force-killing is the only recovery.
Three root causes combined:
NetworkRequestonly requiresNET_CAPABILITY_INTERNET— Android can still report the WiFi network as “available” withINTERNETeven after it’s declared it unvalidated, soonLost()never fires promptly.- TCP half-open: the socket looks alive until Ktor’s 15-30s ping times out.
- Normal mode gives up after 5 reconnect attempts, and there’s no manual-reconnect UI.
Project coding conventions (read before writing code)
- No emojis in code, logs, UI, or commit messages. Use ASCII:
us,->,+/-. - No self-citation in commits (no
Co-Authored-By: Claude). - Tests live under
android/app/src/test/java/com/sendspindroid/...mirroring the source package. SeeSendSpinClientReconnectBackoffTest.ktfor the reflection-based pattern used to poke at private fields. - Tests mock
android.util.Log,UserSettings, andAudioDecoderFactoryvia MockK (mockkStatic/mockkObject). The boilerplate is cookie-cutter — copy from an existing test. - Tests run on JVM with
UnconfinedTestDispatcherset as the Main dispatcher in@Before(Dispatchers.setMain) and reset in@After. - Commit frequently — one commit per task, always with a green test suite.
How to run tests
cd android
./gradlew :app:testDebugUnitTest --tests "com.sendspindroid.sendspin.SendSpinClientStallWatchdogTest"
To run a full unit test suite:
cd android
./gradlew :app:testDebugUnitTest
To build and verify nothing is broken after all tasks:
cd android
./gradlew assembleDebug
File Structure
Files to modify
android/app/src/main/java/com/sendspindroid/sendspin/SendSpinClient.kt— add stall watchdog (Task B) + remove cap (Task C).android/app/src/main/java/com/sendspindroid/playback/PlaybackService.kt— addVALIDATEDcapability +onCapabilitiesChangedlogic (Task A).android/app/src/main/java/com/sendspindroid/MainActivity.kt— mirrorVALIDATEDcapability on the activity-level NetworkRequest (Task A).
Files to create
android/app/src/test/java/com/sendspindroid/sendspin/SendSpinClientStallWatchdogTest.kt— new unit tests for Task B.android/app/src/test/java/com/sendspindroid/playback/PlaybackServiceValidatedCapabilityTest.kt— new unit test asserting the NetworkRequest includes VALIDATED (Task A). If this turns out to be infeasible due toServicecontext requirements, skip and rely on the existingPlaybackServiceConnectionLifecycleTest.ktharness — see Task 1 Step 3 notes.
Files to update (existing tests)
android/app/src/test/java/com/sendspindroid/sendspin/SendSpinClientReconnectBackoffTest.kt— themax 5 reconnect attempts in normal modetest must be updated to reflect new no-cap behavior.
Task 1: A — Require NET_CAPABILITY_VALIDATED and react to validation loss
Files:
- Modify:
android/app/src/main/java/com/sendspindroid/playback/PlaybackService.kt:355-390, 668-679 - Modify:
android/app/src/main/java/com/sendspindroid/MainActivity.kt:1230-1265 - Create (optional):
android/app/src/test/java/com/sendspindroid/playback/PlaybackServiceValidatedCapabilityTest.kt
Design notes
ConnectivityManager.NetworkCallback.onCapabilitiesChanged(network, capabilities) fires whenever capabilities are re-evaluated. Android’s validation probe runs shortly after association and re-runs on failure. When validation fails (captive portal redirect, zero-return HTTP probe, DNS hijack, no upstream), the network keeps NET_CAPABILITY_INTERNET but drops NET_CAPABILITY_VALIDATED. That’s the exact signal we need for “wifi attached, but there’s no actual internet”.
We’ll keep the existing NetworkRequest matching INTERNET (so we still know about the network at all) but additionally track VALIDATED state in onCapabilitiesChanged. When VALIDATED goes true -> false for the currently-active network, we debounce for 3 seconds (some networks briefly flicker during roaming) and then call sendSpinClient.setNetworkAvailable(false). When VALIDATED returns to true, we call setNetworkAvailable(true) (which already triggers immediate reconnect via onNetworkAvailable() per SendSpinClient.kt:409-415).
Why not addCapability(NET_CAPABILITY_VALIDATED) on the NetworkRequest? If we did, the callback would simply stop receiving events when validation drops — we’d get onLost() instead. That would work, but it’s a blunter instrument: we’d lose the ability to distinguish “VALIDATED flickered” from “the WiFi disassociated”. Tracking it in onCapabilitiesChanged lets us debounce the flickers.
Steps
- Step 1: Read the current
networkCallbackimplementation
Read android/app/src/main/java/com/sendspindroid/playback/PlaybackService.kt lines 340-390. Note that:
networkCallbackis anobject : ConnectivityManager.NetworkCallback()at line 355.onAvailable(line 356),onLost(line 376),onCapabilitiesChanged(line 385) are all present.onCapabilitiesChangedcurrently just callsnetworkEvaluator?.evaluateCurrentNetwork(network)— it does not read validation state or notify the client.-
lastNetworkId(line 342) tracks the currently-attached network’s hash. - Step 2: Write the failing test (best-effort)
Create android/app/src/test/java/com/sendspindroid/playback/PlaybackServiceValidatedCapabilityTest.kt.
package com.sendspindroid.playback
import android.net.NetworkCapabilities
import org.junit.Assert.assertTrue
import org.junit.Test
/**
* Validates that the PlaybackService NetworkCapabilities handling code knows how to
* detect validation loss. Because NetworkCallback is an anonymous object inside a
* Service, we validate via reflection on the declared class structure and method
* presence rather than spinning up the full Service in a unit test.
*
* The contract we assert: PlaybackService.class (or one of its nested classes) must
* reference NetworkCapabilities.NET_CAPABILITY_VALIDATED somewhere in its bytecode.
*/
class PlaybackServiceValidatedCapabilityTest {
@Test
fun `PlaybackService references NET_CAPABILITY_VALIDATED`() {
// NET_CAPABILITY_VALIDATED = 16 per Android docs
assertTrue(
"Expected NET_CAPABILITY_VALIDATED to be referenced in PlaybackService bytecode",
classHasConstantReference(
"com.sendspindroid.playback.PlaybackService",
NetworkCapabilities.NET_CAPABILITY_VALIDATED
)
)
}
private fun classHasConstantReference(className: String, constant: Int): Boolean {
// Read the class bytes and scan the constant pool for the int value.
// If you prefer, just assert `NetworkCapabilities.NET_CAPABILITY_VALIDATED == 16`
// as a smoke test and rely on the compile step + end-to-end test for real coverage.
val resourceName = className.replace('.', '/') + ".class"
val stream = Thread.currentThread().contextClassLoader
?.getResourceAsStream(resourceName) ?: return false
val bytes = stream.readBytes()
// Simple heuristic: look for the integer literal 16 encoded as BIPUSH 0x10 or
// in the constant pool (0x03 tag + 4 bytes). This is fragile but sufficient for
// a smoke check. If this is too brittle for you, delete this test and rely on
// the integration test in E2E.
return bytes.asList().windowed(2).any { (a, b) ->
// BIPUSH 16 — `bipush` opcode is 0x10, followed by the literal byte
a.toInt() == 0x10 && b.toInt() == 0x10
}
}
}
Run:
cd android
./gradlew :app:testDebugUnitTest --tests "com.sendspindroid.playback.PlaybackServiceValidatedCapabilityTest"
Expected: FAIL with assertion failure (PlaybackService doesn’t reference NET_CAPABILITY_VALIDATED yet).
NOTE: If this bytecode-scan approach proves too brittle (e.g., compiler inlines the literal as sipush 16 or folds it), delete this test entirely. Skip Step 2 and Step 7 — A is sufficiently covered by the existing e2e/NetworkLossDrainingReconnectTest.kt once you manually extend it. Don’t burn more than 10 minutes on test plumbing here; the functional change below is the important part.
- Step 3: Add a debounce helper to the PlaybackService companion
Open android/app/src/main/java/com/sendspindroid/playback/PlaybackService.kt. Add the following constant inside the companion object (near line 399, next to the other timing constants):
// Debounce validation-loss events - some networks briefly lose VALIDATED during
// roaming or probe retries. Only treat VALIDATED=false as "offline" if it stays
// false for this long.
private const val VALIDATION_LOSS_DEBOUNCE_MS = 3_000L
- Step 4: Add fields to track validation state and pending debounce job
Near lastNetworkId (line 342), add:
// Tracks the VALIDATED capability of the currently-attached network so we can
// detect the "WiFi attached but no real internet" state (walked out of range but
// phone hasn't disassociated yet).
private var lastValidatedState: Boolean = true
private var validationLossJob: Job? = null
Make sure Job is imported (import kotlinx.coroutines.Job) — it’s likely already imported via other coroutine usage; if not add it.
- Step 5: Extend
onCapabilitiesChangedto detect validation transitions
Replace the body of onCapabilitiesChanged (line 385-389) with:
override fun onCapabilitiesChanged(network: Network, capabilities: NetworkCapabilities) {
Log.d(TAG, "Network capabilities changed: id=${network.hashCode()}")
networkEvaluator?.evaluateCurrentNetwork(network)
val isValidated = capabilities.hasCapability(NetworkCapabilities.NET_CAPABILITY_VALIDATED)
val wasValidated = lastValidatedState
lastValidatedState = isValidated
if (wasValidated && !isValidated) {
// VALIDATED just went true -> false. Debounce in case the probe re-succeeds.
Log.w(TAG, "Network lost VALIDATED capability - debouncing ${VALIDATION_LOSS_DEBOUNCE_MS}ms before treating as offline")
validationLossJob?.cancel()
validationLossJob = serviceScope.launch {
delay(VALIDATION_LOSS_DEBOUNCE_MS)
if (!lastValidatedState) {
Log.w(TAG, "Validation loss confirmed after debounce - notifying client of network unavailability")
sendSpinClient?.setNetworkAvailable(false)
}
}
} else if (!wasValidated && isValidated) {
// VALIDATED came back - cancel any pending debounce and restore availability.
Log.i(TAG, "Network regained VALIDATED capability")
validationLossJob?.cancel()
validationLossJob = null
sendSpinClient?.setNetworkAvailable(true)
}
}
Before writing, grep the file for an existing serviceScope or equivalent CoroutineScope. If none exists in PlaybackService, use the Android Handler(Looper.getMainLooper()) + postDelayed/removeCallbacks pattern instead (which avoids pulling in a CoroutineScope):
private val validationLossRunnable = Runnable {
if (!lastValidatedState) {
Log.w(TAG, "Validation loss confirmed after debounce - notifying client")
sendSpinClient?.setNetworkAvailable(false)
}
}
// In onCapabilitiesChanged:
if (wasValidated && !isValidated) {
Log.w(TAG, "Network lost VALIDATED - debouncing ${VALIDATION_LOSS_DEBOUNCE_MS}ms")
mainHandler.removeCallbacks(validationLossRunnable)
mainHandler.postDelayed(validationLossRunnable, VALIDATION_LOSS_DEBOUNCE_MS)
} else if (!wasValidated && isValidated) {
Log.i(TAG, "Network regained VALIDATED")
mainHandler.removeCallbacks(validationLossRunnable)
sendSpinClient?.setNetworkAvailable(true)
}
Grep for mainHandler — it exists per SyncAudioPlayerStateCallback usage elsewhere in the file (line 721+). Use that handler.
- Step 6: Clean up pending debounce in
unregisterNetworkCallback
In unregisterNetworkCallback (starts at PlaybackService.kt:684), before the try { ... } block add:
mainHandler.removeCallbacks(validationLossRunnable)
This prevents a stale debounce from firing after the service has detached its network callback.
- Step 7: Mirror the change in
MainActivity.kt
MainActivity.kt:1230-1265 registers its own NetworkCallback for UI-level reconnect banner logic. Add the same validation tracking. Open the file and find override fun onCapabilitiesChanged at line 1232. Today it just calls networkEvaluator?.evaluateCurrentNetwork(network) and defaultServerPinger?.onNetworkChanged().
Add a sibling field near the other NetworkCallback state (find where networkCallback is declared — likely a class field):
private var lastActivityValidatedState: Boolean = true
And replace onCapabilitiesChanged with (keeping existing calls):
override fun onCapabilitiesChanged(network: Network, capabilities: NetworkCapabilities) {
networkEvaluator?.evaluateCurrentNetwork(network)
defaultServerPinger?.onNetworkChanged()
val isValidated = capabilities.hasCapability(NetworkCapabilities.NET_CAPABILITY_VALIDATED)
if (lastActivityValidatedState && !isValidated) {
Log.w(TAG, "Activity: network lost VALIDATED")
// Show the same error snackbar we show on onLost - signals user that
// the WiFi they're on has no real internet
runOnUiThread {
if (connectionState is AppConnectionState.Connected ||
connectionState is AppConnectionState.Connecting) {
showErrorSnackbar(
message = "Network has no internet access",
errorType = ErrorType.NETWORK
)
}
}
}
lastActivityValidatedState = isValidated
}
Note: The Activity path does not need to forward to sendSpinClient.setNetworkAvailable(false) — that’s PlaybackService’s responsibility. The Activity only needs to show the UI feedback.
- Step 8: Verify test passes and full suite stays green
cd android
./gradlew :app:testDebugUnitTest --tests "com.sendspindroid.playback.PlaybackServiceValidatedCapabilityTest"
./gradlew :app:testDebugUnitTest
Expected: both commands pass (or, if you deleted the bytecode-scan test in Step 2, the second command passes).
- Step 9: Build to verify no compile errors
cd android
./gradlew assembleDebug
Expected: BUILD SUCCESSFUL.
- Step 10: Commit
git add android/app/src/main/java/com/sendspindroid/playback/PlaybackService.kt \
android/app/src/main/java/com/sendspindroid/MainActivity.kt \
android/app/src/test/java/com/sendspindroid/playback/PlaybackServiceValidatedCapabilityTest.kt
git commit -m "fix: detect loss of NET_CAPABILITY_VALIDATED on active network
When WiFi drops silently (walked out of range, captive portal, DNS hijack)
Android often keeps NET_CAPABILITY_INTERNET set while removing VALIDATED.
Previously onLost() would not fire, so the WebSocket sat open until ping
timed out -- long after the audio buffer drained.
PlaybackService now watches for a VALIDATED=true->false transition on the
active network, debounces 3s (to avoid false positives during WiFi roaming
or probe retries), then tells SendSpinClient the network is unavailable --
which pauses reconnect attempts the same way onLost() already does.
MainActivity mirrors this for its UI snackbar path."
Task 2: B — Application-level stall watchdog in SendSpinClient
Files:
- Modify:
android/app/src/main/java/com/sendspindroid/sendspin/SendSpinClient.kt:74-85, 168-176, 876-911 - Create:
android/app/src/test/java/com/sendspindroid/sendspin/SendSpinClientStallWatchdogTest.kt
Design notes
The watchdog is simple:
- Track
lastByteReceivedAtMs: AtomicLong— updated on everyonMessage(text)oronMessage(bytes)in theTransportEventListener. - A coroutine polls every 3s. If
handshakeComplete && isConnected && !reconnecting && now - lastByteReceivedAtMs > STALL_TIMEOUT_MS, log a warning and force the transport closed with a syntheticSocketException("stall watchdog: no data for Xms"). The existingTransportEventListener.onFailurepath will classify this as recoverable and callattemptReconnect(). - Watchdog starts when a connection reaches handshake-complete and stops on disconnect.
Timeout choice: 7s. Audio chunks arrive very frequently (every 10-20ms), and even if audio is paused server-side the server still sends periodic keepalives (group/update, server/state). 7s of complete silence while the client thinks it’s playing is unambiguous. Longer (10-15s) would be safer against transient hiccups but loses the benefit of fast detection — the whole point of the watchdog is to beat Ktor’s 15-30s ping to the punch.
Why not rely on Ktor’s ping/pong? The Ktor pingIntervalMillis covers part of the gap, but:
- In normal mode the ping is every 30s — by the time pong-timeout fires, buffer is long gone.
- Ktor’s pong-timeout behavior across versions is inconsistent. Explicit application-level tracking is easier to reason about.
Why close the transport rather than directly calling attemptReconnect()? transport.close() flows through onClosed/onFailure which already handle state cleanup correctly. Bypassing that would risk double-freeing transport, leaking listeners, or racing with destroy().
Steps
- Step 1: Write the failing test
Create android/app/src/test/java/com/sendspindroid/sendspin/SendSpinClientStallWatchdogTest.kt:
package com.sendspindroid.sendspin
import android.content.Context
import android.content.SharedPreferences
import android.util.Log
import androidx.preference.PreferenceManager
import com.sendspindroid.UserSettings
import com.sendspindroid.sendspin.decoder.AudioDecoderFactory
import com.sendspindroid.sendspin.transport.SendSpinTransport
import com.sendspindroid.sendspin.transport.TransportState
import io.mockk.every
import io.mockk.mockk
import io.mockk.mockkObject
import io.mockk.mockkStatic
import io.mockk.unmockkAll
import io.mockk.verify
import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.ExperimentalCoroutinesApi
import kotlinx.coroutines.delay
import kotlinx.coroutines.runBlocking
import kotlinx.coroutines.test.UnconfinedTestDispatcher
import kotlinx.coroutines.test.resetMain
import kotlinx.coroutines.test.setMain
import org.junit.After
import org.junit.Assert.*
import org.junit.Before
import org.junit.Test
/**
* Tests the application-level stall watchdog that detects when a connected,
* handshake-completed client stops receiving bytes for longer than the
* configured timeout. Expected behavior: watchdog forces transport.close()
* which triggers the existing reconnect path.
*/
@OptIn(ExperimentalCoroutinesApi::class)
class SendSpinClientStallWatchdogTest {
private lateinit var mockContext: Context
private lateinit var mockCallback: SendSpinClient.Callback
private lateinit var client: SendSpinClient
private lateinit var fakeTransport: FakeTransport
private class FakeTransport : SendSpinTransport {
var closeCalled = false
var closeCode: Int = -1
override val state = TransportState.Connected
override val isConnected = true
override fun connect() {}
override fun send(text: String) = true
override fun send(bytes: ByteArray) = true
override fun setListener(listener: SendSpinTransport.Listener?) {}
override fun close(code: Int, reason: String) {
closeCalled = true
closeCode = code
}
override fun destroy() {}
}
@Before
fun setUp() {
Dispatchers.setMain(UnconfinedTestDispatcher())
mockkStatic(Log::class)
every { Log.v(any(), any()) } returns 0
every { Log.d(any(), any()) } returns 0
every { Log.i(any(), any()) } returns 0
every { Log.w(any(), any<String>()) } returns 0
every { Log.e(any(), any<String>()) } returns 0
every { Log.e(any(), any(), any()) } returns 0
mockkObject(UserSettings)
every { UserSettings.getPlayerId() } returns "test-player-id"
every { UserSettings.getPreferredCodec() } returns "opus"
every { UserSettings.lowMemoryMode } returns false
every { UserSettings.highPowerMode } returns false
mockkObject(AudioDecoderFactory)
every { AudioDecoderFactory.isCodecSupported(any()) } returns true
mockkStatic(PreferenceManager::class)
val mockPrefs = mockk<SharedPreferences>(relaxed = true)
every { PreferenceManager.getDefaultSharedPreferences(any()) } returns mockPrefs
mockContext = mockk(relaxed = true)
mockCallback = mockk(relaxed = true)
client = SendSpinClient(mockContext, "TestDevice", mockCallback)
fakeTransport = FakeTransport()
// Put client in a "connected + handshake complete" state
val addrField = SendSpinClient::class.java.getDeclaredField("serverAddress")
addrField.isAccessible = true
addrField.set(client, "127.0.0.1:8080")
val transportField = SendSpinClient::class.java.getDeclaredField("transport")
transportField.isAccessible = true
transportField.set(client, fakeTransport)
val handshakeField = SendSpinClient::class.java.superclass.getDeclaredField("handshakeComplete")
handshakeField.isAccessible = true
handshakeField.set(client, true)
}
@After
fun tearDown() {
client.destroy()
Dispatchers.resetMain()
unmockkAll()
}
@Test
fun `lastByteReceivedAtMs is updated on text message`() {
val lastByteField = SendSpinClient::class.java.getDeclaredField("lastByteReceivedAtMs")
lastByteField.isAccessible = true
val atomicLong = lastByteField.get(client) as java.util.concurrent.atomic.AtomicLong
val before = atomicLong.get()
Thread.sleep(10)
// Invoke TransportEventListener.onMessage(String)
val innerClasses = SendSpinClient::class.java.declaredClasses
val listenerClass = innerClasses.find { it.simpleName == "TransportEventListener" }!!
val constructor = listenerClass.getDeclaredConstructor(SendSpinClient::class.java)
constructor.isAccessible = true
val listener = constructor.newInstance(client) as SendSpinTransport.Listener
listener.onMessage("{\"type\":\"ping\"}")
val after = atomicLong.get()
assertTrue("lastByteReceivedAtMs should advance on text message (before=$before after=$after)",
after > before)
}
@Test
fun `lastByteReceivedAtMs is updated on binary message`() {
val lastByteField = SendSpinClient::class.java.getDeclaredField("lastByteReceivedAtMs")
lastByteField.isAccessible = true
val atomicLong = lastByteField.get(client) as java.util.concurrent.atomic.AtomicLong
val before = atomicLong.get()
Thread.sleep(10)
val innerClasses = SendSpinClient::class.java.declaredClasses
val listenerClass = innerClasses.find { it.simpleName == "TransportEventListener" }!!
val constructor = listenerClass.getDeclaredConstructor(SendSpinClient::class.java)
constructor.isAccessible = true
val listener = constructor.newInstance(client) as SendSpinTransport.Listener
listener.onMessage(byteArrayOf(0, 1, 2, 3))
val after = atomicLong.get()
assertTrue("lastByteReceivedAtMs should advance on binary message (before=$before after=$after)",
after > before)
}
@Test
fun `checkStall forces transport close when stalled past timeout`() {
// Seed lastByteReceivedAtMs to far in the past so the watchdog trips immediately
val lastByteField = SendSpinClient::class.java.getDeclaredField("lastByteReceivedAtMs")
lastByteField.isAccessible = true
val atomicLong = lastByteField.get(client) as java.util.concurrent.atomic.AtomicLong
atomicLong.set(System.currentTimeMillis() - 60_000L) // 60s in the past
// Invoke checkStall directly
val checkStall = SendSpinClient::class.java.getDeclaredMethod("checkStall")
checkStall.isAccessible = true
checkStall.invoke(client)
assertTrue("Watchdog should have called transport.close()", fakeTransport.closeCalled)
// Use an abnormal code (not 1000) so onClosed triggers reconnection, not graceful disconnect
assertNotEquals(1000, fakeTransport.closeCode)
}
@Test
fun `checkStall does not close when recently active`() {
val lastByteField = SendSpinClient::class.java.getDeclaredField("lastByteReceivedAtMs")
lastByteField.isAccessible = true
val atomicLong = lastByteField.get(client) as java.util.concurrent.atomic.AtomicLong
atomicLong.set(System.currentTimeMillis()) // just now
val checkStall = SendSpinClient::class.java.getDeclaredMethod("checkStall")
checkStall.isAccessible = true
checkStall.invoke(client)
assertFalse("Watchdog should NOT close when data was recently received", fakeTransport.closeCalled)
}
@Test
fun `checkStall does not close during active reconnection`() {
val lastByteField = SendSpinClient::class.java.getDeclaredField("lastByteReceivedAtMs")
lastByteField.isAccessible = true
val atomicLong = lastByteField.get(client) as java.util.concurrent.atomic.AtomicLong
atomicLong.set(System.currentTimeMillis() - 60_000L)
// Simulate in-progress reconnection
val reconnectingField = SendSpinClient::class.java.getDeclaredField("reconnecting")
reconnectingField.isAccessible = true
val reconnecting = reconnectingField.get(client) as java.util.concurrent.atomic.AtomicBoolean
reconnecting.set(true)
val checkStall = SendSpinClient::class.java.getDeclaredMethod("checkStall")
checkStall.isAccessible = true
checkStall.invoke(client)
assertFalse("Watchdog should NOT close during reconnection", fakeTransport.closeCalled)
}
}
Run:
cd android
./gradlew :app:testDebugUnitTest --tests "com.sendspindroid.sendspin.SendSpinClientStallWatchdogTest"
Expected: FAIL with NoSuchFieldException: lastByteReceivedAtMs or NoSuchMethodException: checkStall — neither exists yet.
- Step 2: Add stall watchdog constants
In SendSpinClient.kt’s companion object (lines 74-85), add after HIGH_POWER_RECONNECT_DELAY_MS:
// Stall watchdog: while connected+handshake-complete, if no bytes arrive for
// this long, force-close the transport so the existing reconnect path kicks in.
// Shorter than Ktor's 30s ping-timeout to beat buffer drain.
private const val STALL_TIMEOUT_MS = 7_000L
private const val STALL_CHECK_INTERVAL_MS = 3_000L
- Step 3: Add fields to track last-received-byte timestamp and watchdog job
Near the other reconnection state (lines 167-176), add:
// Stall watchdog state. lastByteReceivedAtMs is updated on EVERY text/binary
// message from the transport. stallWatchdogJob is the polling coroutine.
private val lastByteReceivedAtMs = AtomicLong(System.currentTimeMillis())
private var stallWatchdogJob: Job? = null
Make sure java.util.concurrent.atomic.AtomicLong is imported — add import java.util.concurrent.atomic.AtomicLong at the top of the file alongside the existing AtomicInteger/AtomicBoolean imports.
- Step 4: Update timestamp in
TransportEventListener.onMessage(String)andonMessage(ByteArray)
Locate TransportEventListener.onMessage(text: String) around SendSpinClient.kt:876 and onMessage(bytes: ByteArray) around line 909. At the very first line of each, add:
lastByteReceivedAtMs.set(System.currentTimeMillis())
So the updated methods start:
override fun onMessage(text: String) {
lastByteReceivedAtMs.set(System.currentTimeMillis())
// Check for auth failure (server may send error if token is invalid)
...
}
override fun onMessage(bytes: ByteArray) {
lastByteReceivedAtMs.set(System.currentTimeMillis())
handleBinaryMessage(bytes)
}
- Step 5: Add
checkStallandstartStallWatchdog/stopStallWatchdogmethods
Add these three methods inside SendSpinClient (place them right before attemptReconnect at line 686):
/**
* Start the stall watchdog. Called once handshake completes.
* Cancels any previous instance.
*/
private fun startStallWatchdog() {
stallWatchdogJob?.cancel()
// Reset so we don't false-trip using a stale pre-handshake timestamp
lastByteReceivedAtMs.set(System.currentTimeMillis())
stallWatchdogJob = scope.launch {
while (true) {
delay(STALL_CHECK_INTERVAL_MS)
checkStall()
}
}
}
/**
* Stop the stall watchdog. Called on disconnect or during reconnect attempts.
*/
private fun stopStallWatchdog() {
stallWatchdogJob?.cancel()
stallWatchdogJob = null
}
/**
* Check whether the transport has gone silent for too long and force-close it
* if so. Only acts when the client is connected, handshake is complete, and we
* are not already in a reconnect cycle.
*
* Package-private for testing via reflection.
*/
private fun checkStall() {
if (userInitiatedDisconnect.get()) return
if (reconnecting.get()) return
if (!handshakeComplete) return
val t = transport ?: return
if (!t.isConnected) return
val sinceLastByte = System.currentTimeMillis() - lastByteReceivedAtMs.get()
if (sinceLastByte > STALL_TIMEOUT_MS) {
Log.w(TAG, "Stall watchdog: no data received in ${sinceLastByte}ms (threshold ${STALL_TIMEOUT_MS}ms) - forcing transport close")
// Use a non-1000 close code so the onClosed handler triggers reconnection.
// 1001 = "Going Away" - appropriate for our intent.
t.close(1001, "stall watchdog")
}
}
- Step 6: Start the watchdog on handshake complete
Grep SendSpinClient.kt and its parent SendSpinProtocolHandler for where handshakeComplete gets set to true. The most likely location is right after receiving the first successful server/state or server/hello message. Add startStallWatchdog() at that point.
If handshakeComplete is set in the superclass (SendSpinProtocolHandler), instead override the onHandshakeComplete hook if one exists, OR add a check in the onConnected callback dispatch. Concrete instruction: find handshakeComplete = true anywhere in the file (there’s an assignment at line 762 during reconnect cleanup — that’s setting it to false, not what you want). If you can’t find a clean “handshake just completed” hook, add one:
In SendSpinProtocolHandler (find via grep), wherever it completes the handshake, call an onHandshakeCompleteHook() virtual method. Override it in SendSpinClient:
override fun onHandshakeCompleteHook() {
startStallWatchdog()
}
If that’s too invasive, simpler alternative: start the watchdog unconditionally in prepareForConnection() or right after createLocalTransport/createRemoteTransport/createProxyTransport — the checkStall() guard if (!handshakeComplete) return makes early starts harmless.
Preferred concrete approach: Start the watchdog at the end of prepareForConnection() (search for private fun prepareForConnection in the file). That runs once per user-initiated connect and is reset on every reconnect attempt via connectLocal/connectRemote/connectProxy. Since checkStall() is guarded by !handshakeComplete, pre-handshake ticks are no-ops.
- Step 7: Stop the watchdog on disconnect / destroy
Grep for fun disconnect( and fun destroy( in SendSpinClient.kt. Add stopStallWatchdog() at the start of each method’s body.
Also add stopStallWatchdog() inside attemptReconnect right after the attempts == 1 block (line 712), so we don’t race the watchdog against in-flight reconnects:
// On first reconnection attempt, freeze the time filter
if (attempts == 1) {
timeFilter.freeze()
Log.i(TAG, "Time filter frozen for reconnection (had ${timeFilter.measurementCountValue} measurements)")
}
stopStallWatchdog() // watchdog restarts on next successful handshake
- Step 8: Run the stall watchdog tests
cd android
./gradlew :app:testDebugUnitTest --tests "com.sendspindroid.sendspin.SendSpinClientStallWatchdogTest"
Expected: all 5 tests pass.
- Step 9: Run the full SendSpinClient test suite to catch regressions
cd android
./gradlew :app:testDebugUnitTest --tests "com.sendspindroid.sendspin.*"
Expected: all pass. If SendSpinClientReconnectBackoffTest or others fail because the watchdog is running during their reflection-based invocations, add setUp() calls to force-stop the watchdog or mock scope — but this is unlikely, since those tests call attemptReconnect directly without starting watchdogs.
- Step 10: Build
cd android
./gradlew assembleDebug
Expected: BUILD SUCCESSFUL.
- Step 11: Commit
git add android/app/src/main/java/com/sendspindroid/sendspin/SendSpinClient.kt \
android/app/src/test/java/com/sendspindroid/sendspin/SendSpinClientStallWatchdogTest.kt
git commit -m "feat: add application-level stall watchdog to SendSpinClient
When a TCP connection goes half-open (WiFi drops silently, NAT rebinds,
upstream hangs) the socket can look alive until Ktor's 30s ping times out.
By then the audio buffer has already drained and playback has errored.
The watchdog tracks the timestamp of the last received text or binary
frame and polls every 3s. If a connected, handshake-complete session has
been silent for 7s, it calls transport.close(1001) which routes through
the existing onClosed() reconnect path. This front-runs Ktor's ping
detection by 20+ seconds -- enough to start a reconnect before the buffer
empties in most cases.
Unit tests cover timestamp update on text/binary frames, trip on stall,
no-trip when recently active, and no-trip during active reconnection."
Task 3: C — Remove 5-attempt cap in normal mode (retry forever with 30s steady-state)
Files:
- Modify:
android/app/src/main/java/com/sendspindroid/sendspin/SendSpinClient.kt:714-724, 738-744, 746 - Modify:
android/app/src/test/java/com/sendspindroid/sendspin/SendSpinClientReconnectBackoffTest.kt
Design notes
Today, attemptReconnect caps at MAX_RECONNECT_ATTEMPTS = 5 unless highPowerMode is on, giving up with wasReconnectExhausted = true on the 6th attempt. After 5 attempts (total wait time 500ms+1s+2s+4s+8s = ~15.5s), the user is stuck.
With this change, normal mode behaves like high-power mode: infinite attempts with exponential backoff for the first 5, then 30s steady-state. The banner already handles arbitrary attempt counts, so no UI change needed. The only side effect users will see is that the “Connection lost, please reconnect manually” toast/error will no longer appear — the banner stays up instead.
Why not a different cap (e.g., 20 attempts)? Because the whole point is “don’t force-kill”. If we cap at any number, we’re setting up the same bug at a different threshold. 30s steady-state retry is cheap (one request every 30s) and the user can always manually disconnect if they’re done listening.
Battery concern? At 30s cadence, worst case is 2 req/min. A silent radio wake costs ~0.5mWh. Over an hour of disconnected state, that’s ~60mWh — negligible vs. a typical 15000mWh battery.
Implementation note (2026-05): The shipped behavior keeps a hard
MAX_TOTAL_RECONNECT_ATTEMPTS = 20ceiling (~7m45s try-window) instead of removing the cap entirely. The above paragraph argued against any cap; the in-code rationale atSendSpinClient.kt:92-97accepts the reasoning while choosing a high enough ceiling to cover all realistic transient outages. If field data shows a 20-attempt cap is hitting real users in benign network glitches, follow this plan’s original guidance and remove it.
Steps
- Step 1: Update the existing
max 5 reconnect attempts in normal modetest
Open android/app/src/test/java/com/sendspindroid/sendspin/SendSpinClientReconnectBackoffTest.kt. The test at line 142 (fun \max 5 reconnect attempts in normal mode``) currently asserts that the 6th attempt triggers error state. We need to rename it and invert its assertion:
Replace that entire test method (lines 141-171) with:
@Test
fun `normal mode retries forever without triggering exhausted error`() {
setupForReconnection()
every { UserSettings.highPowerMode } returns false
val attemptReconnect = SendSpinClient::class.java.getDeclaredMethod("attemptReconnect")
attemptReconnect.isAccessible = true
// Perform 10 attempts - all should succeed in normal mode now
for (i in 1..10) {
attemptReconnect.invoke(client)
}
// Should NOT have called onDisconnected with wasReconnectExhausted=true
verify(exactly = 0) {
mockCallback.onDisconnected(wasUserInitiated = false, wasReconnectExhausted = true)
}
// All 10 should have been onReconnecting calls
verify(exactly = 10) {
mockCallback.onReconnecting(any(), any())
}
// State should remain Connecting (not Error)
assertTrue(
"State should remain Connecting in normal mode with no cap, was: ${client.connectionState.value}",
client.connectionState.value is SendSpinClient.ConnectionState.Connecting
)
}
- Step 2: Add a test verifying steady-state delay kicks in after attempt 5 in normal mode
Add this test to SendSpinClientReconnectBackoffTest.kt immediately after the test from Step 1:
@Test
fun `normal mode uses 30s steady-state delay after attempt 5`() {
// Verify the delay formula selects the steady-state path for attempts > 5
// regardless of highPowerMode setting. The formula we expect in SendSpinClient:
// val delayMs = if (attempts > MAX_RECONNECT_ATTEMPTS) HIGH_POWER_RECONNECT_DELAY_MS
// else (INITIAL_RECONNECT_DELAY_MS * (1 shl (attempts - 1)))
// .coerceAtMost(MAX_RECONNECT_DELAY_MS)
val initialDelay = 500L
val maxDelay = 10_000L
val steadyStateDelay = 30_000L
for (attempt in 1..5) {
val computed = (initialDelay * (1 shl (attempt - 1))).coerceAtMost(maxDelay)
val expected = when (attempt) {
1 -> 500L; 2 -> 1000L; 3 -> 2000L; 4 -> 4000L; 5 -> 8000L
else -> fail("unreachable") as Long
}
assertEquals("Attempt $attempt should use exponential backoff", expected, computed)
}
// Attempt 6+ should use steady-state 30s
for (attempt in 6..10) {
val computed = if (attempt > 5) steadyStateDelay
else (initialDelay * (1 shl (attempt - 1))).coerceAtMost(maxDelay)
assertEquals("Attempt $attempt should use 30s steady-state", steadyStateDelay, computed)
}
}
- Step 3: Run the tests to confirm they fail
cd android
./gradlew :app:testDebugUnitTest --tests "com.sendspindroid.sendspin.SendSpinClientReconnectBackoffTest"
Expected: normal mode retries forever without triggering exhausted error FAILS because the 6th attempt currently triggers onDisconnected(wasReconnectExhausted=true). The steady-state formula test passes (it only tests arithmetic, not the source).
- Step 4: Modify
attemptReconnectto remove the hard cap
Open SendSpinClient.kt. Find the block at lines 714-724:
// Check attempt limits - high power mode allows infinite retries
val maxAttempts = if (UserSettings.highPowerMode) Int.MAX_VALUE else MAX_RECONNECT_ATTEMPTS
if (attempts > maxAttempts) {
Log.w(TAG, "Max reconnection attempts ($MAX_RECONNECT_ATTEMPTS) reached, giving up")
reconnecting.set(false)
timeFilter.resetAndDiscard()
_connectionState.value = ConnectionState.Error("Connection lost. Please reconnect manually.")
callback.onError("Connection lost after $MAX_RECONNECT_ATTEMPTS reconnection attempts")
callback.onDisconnected(wasUserInitiated = false, wasReconnectExhausted = true)
return
}
Delete it entirely. Both modes now retry forever.
- Step 5: Update the backoff formula to apply steady-state in both modes
Find the block at lines 738-744:
// Exponential backoff for first 5 attempts, then steady 30s in high power mode
val delayMs = if (UserSettings.highPowerMode && attempts > MAX_RECONNECT_ATTEMPTS) {
HIGH_POWER_RECONNECT_DELAY_MS
} else {
(INITIAL_RECONNECT_DELAY_MS * (1 shl (attempts - 1)))
.coerceAtMost(MAX_RECONNECT_DELAY_MS)
}
Replace with:
// Exponential backoff for first 5 attempts, then 30s steady-state forever.
// Applies in both normal and high power mode - the user can always disconnect
// manually if they're done listening.
val delayMs = if (attempts > MAX_RECONNECT_ATTEMPTS) {
HIGH_POWER_RECONNECT_DELAY_MS
} else {
(INITIAL_RECONNECT_DELAY_MS * (1 shl (attempts - 1)))
.coerceAtMost(MAX_RECONNECT_DELAY_MS)
}
- Step 6: Simplify the log display since both modes behave identically now
Line 746:
val attemptsDisplay = if (UserSettings.highPowerMode) "$attempts" else "$attempts/$MAX_RECONNECT_ATTEMPTS"
Replace with:
val attemptsDisplay = "$attempts"
- Step 7: Run the tests to confirm all pass
cd android
./gradlew :app:testDebugUnitTest --tests "com.sendspindroid.sendspin.SendSpinClientReconnectBackoffTest"
./gradlew :app:testDebugUnitTest --tests "com.sendspindroid.sendspin.*"
Expected: all pass. Check especially that high power mode uses 30s delay after attempt 5 (existing test, line 201) still passes — the formula still selects the steady-state path when attempts > MAX_RECONNECT_ATTEMPTS, just without the highPowerMode precondition.
- Step 8: Run the full unit test suite
cd android
./gradlew :app:testDebugUnitTest
Expected: all pass. If e2e/NetworkLossDrainingReconnectTest.kt asserts that wasReconnectExhausted=true fires, update it to assert the opposite (that reconnection continues indefinitely).
- Step 9: Build
cd android
./gradlew assembleDebug
Expected: BUILD SUCCESSFUL.
- Step 10: Commit
git add android/app/src/main/java/com/sendspindroid/sendspin/SendSpinClient.kt \
android/app/src/test/java/com/sendspindroid/sendspin/SendSpinClientReconnectBackoffTest.kt
git commit -m "fix: retry reconnect forever in normal mode with 30s steady-state
Previously normal mode gave up after 5 attempts (~15s) with an error
state that required the user to manually reconnect -- but there is no
manual-reconnect button anywhere in the UI, so in practice users had to
force-kill the app after any transient network loss.
Both normal and high power mode now behave identically: exponential
backoff for the first 5 attempts (500ms -> 8s), then 30s steady-state
retries forever. The ReconnectingBanner already handles arbitrary
attempt counts, so no UI change needed.
Battery impact is negligible (2 req/min steady-state). Users can
always disconnect manually if they're done."
Manual Verification (after all three tasks are in)
Before marking this plan done, the implementing engineer should run through this scenario on a physical Android device, not just unit tests. Unit tests verify plumbing; this verifies the user-observable behavior.
- Connect the app to a SendSpin server over WiFi.
- Start playback — verify audio plays and “Now Playing” screen is visible.
- Walk to the edge of WiFi range OR toggle WiFi off on the phone OR disable the router. The goal is to simulate “network still associated but no internet” — so toggling airplane mode is too blunt; prefer walking out of range or stopping the router.
- Within ~10 seconds (watchdog timeout + reconnect start): the “Reconnecting…” banner should appear.
- Audio should continue from buffer until drained, then stop.
- Walk back into WiFi range OR re-enable the router.
- Within ~30s (steady-state retry interval): the app should reconnect automatically. Audio should resume (after a re-sync).
- At no point should the app require a force-kill to recover.
Log tags to grep for:
SendSpinClient— look forStall watchdog: no data received in XXXmsandAttempting reconnection X in XXXmsPlaybackService— look forNetwork lost VALIDATED capabilityandValidation loss confirmed after debounce
If any of those tags do NOT appear during the simulated outage, one of the detection paths is broken. Return to the corresponding task.
Self-Review Summary
- Spec coverage: A, B, C each mapped 1:1 to Tasks 1, 2, 3.
- Placeholders: None — every step has concrete code, file paths, and commands.
- Type consistency:
lastByteReceivedAtMsused consistently across Task 2 implementation and tests.STALL_TIMEOUT_MSandSTALL_CHECK_INTERVAL_MSreferenced consistently.VALIDATION_LOSS_DEBOUNCE_MSreferenced consistently in Task 1. - Edge cases covered: watchdog disabled during reconnection, watchdog disabled pre-handshake, validation-loss debounce to avoid flaps, steady-state formula still triggers correctly.
- Non-goals (deferred to D-G): manual “Reconnect now” button (D), cross-mode failover (E), recoverable-error loosening during network transitions (F), audio-pipeline underrun signal (G).