I turned my phone into a local LLM server, and it handles vision, voice, and tool calls
… They've got multimodal input text, image, and audio , a 128K context window, and a hybrid attention design that keeps memory use low. On a modern phone with enough RAM and a modern chipset, both of these models can run at surprising speed, complete with tool calling. …