Part 1: When Strings Disappear: Rethinking YARA at the Opcode Level

Author: Valli-Nayagam Chokkalingam

Most YARA rules start with strings. This post looks at what’s left when strings disappear—and how detection shifts closer to execution itself.

The focus here is on how YARA rules are reasoned about, not on providing production-ready signatures or drop-in detection rules.

What YARA Is (and What It Isn’t)

YARA is a pattern-matching tool. It scans data — files, memory regions, or raw byte streams — and checks whether specific patterns are present. Those patterns are defined explicitly by the analyst. There’s no inference, no behavior tracking, and no execution context unless you deliberately build it into the rule.

That simplicity is both the strength and the limitation. YARA works well when something stable exists to anchor on: reused code, recognizable strings, or consistent structures. Where it struggles is when those artifacts are deliberately minimized, transformed, or never exist in a static form at all. Understanding YARA starts with accepting that boundary. It’s not a behavioral detector — it’s a lens, and everything that follows depends on what remains visible through it.


rule Demo_RansomNote_Strings
{
  meta:
    description   = "Demo-only: simple string-based detection using ransom note text"
    author        = "AdversaryCraft"
    sample_sha256 = "074b50b35783a0607862126903837b60c8a7f875f91a1f414bb7d11713c7aa1d"
    note          = "Illustrates classic ransom-note string matching (not production-ready)"

  strings:
    $r1 = "your files have been encrypted" nocase ascii wide
    $r2 = "decrypt all your files" nocase ascii wide
    $r3 = "how to recover" nocase ascii wide
    $r4 = "contact us" nocase ascii wide
    $r5 = "bitcoin" nocase ascii wide

  condition:
    uint16(0) == 0x5A4D and
    3 of ($r*)
}

Figure 1. A simple string-based YARA rule relying on recognizable ransom note text, illustrating the traditional starting point for static detection.
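
To make the mechanics of Figure 1 concrete, here is a rough Python sketch of what its condition evaluates. It is an illustration only, not how YARA's engine works internally, and the helper names (contains_nocase, matches_demo_rule) are made up for this example.

```python
# Rough emulation of the Figure 1 condition; not YARA's actual engine.
NOTE_STRINGS = [
    "your files have been encrypted",
    "decrypt all your files",
    "how to recover",
    "contact us",
    "bitcoin",
]


def contains_nocase(data: bytes, text: str) -> bool:
    """Look for `text` as ASCII or UTF-16LE ("wide"), ignoring case."""
    lowered = data.lower()  # bytes.lower() only changes ASCII letters
    ascii_form = text.lower().encode("ascii")
    wide_form = text.lower().encode("utf-16-le")
    return ascii_form in lowered or wide_form in lowered


def matches_demo_rule(data: bytes) -> bool:
    # uint16(0) == 0x5A4D: the file starts with the bytes "MZ"
    if int.from_bytes(data[:2], "little") != 0x5A4D:
        return False
    # 3 of ($r*): at least three of the five strings are present
    hits = sum(contains_nocase(data, s) for s in NOTE_STRINGS)
    return hits >= 3
```

Everything the rule knows is in that list of literals. If the bytes are not in the buffer being scanned, there is nothing for the condition to count.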

Where String-Based Detection Starts to Fail

Figure 1 shows a familiar starting point: a YARA rule built around plaintext ransom note strings. This approach works when those strings exist in a form that can be scanned on disk. Many samples still fit that model, which is why string-based rules remain common and often effective.

The problem is that those strings are also the easiest thing to remove. Packing alone is enough to make them disappear from static analysis. Even without a full packer, simple string encryption, runtime decoding, or stack-based construction is sufficient to break disk-level matching. Once the binary no longer carries readable text, the rule has nothing to anchor on.
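
A minimal sketch of that effect, assuming a single-byte XOR with an arbitrary demo key (0x5A is chosen here for illustration and is not taken from any sample):

```python
def xor_bytes(data: bytes, key: int) -> bytes:
    """XOR every byte with a single-byte key (the operation is its own inverse)."""
    return bytes(b ^ key for b in data)


note = b"your files have been encrypted"
key = 0x5A  # arbitrary key for this demo

on_disk = xor_bytes(note, key)         # what a static scanner gets to see
at_runtime = xor_bytes(on_disk, key)   # what the program rebuilds in memory

print(note in on_disk)      # False: nothing readable left to match
print(at_runtime == note)   # True: the program still ends up with the same text
```

The weakest possible encoding is already enough to defeat a literal match, which is why the string is the first thing to go.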

The same applies to imports. When APIs are resolved dynamically—through hashing or manual lookup—the import address table no longer reflects what the program will actually call. From a static scanner’s perspective, both the strings and the intent are gone, even though execution behavior remains unchanged.
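
As an illustration of lookup by hash, the sketch below uses an additive rotate-right-13 hash, a scheme widely seen in shellcode. That choice is an assumption made for the example; nothing here claims it is what the samples in this post use, and resolve_by_hash is a hypothetical helper standing in for a walk over a module's export names.

```python
def ror32(value: int, bits: int) -> int:
    """Rotate a 32-bit value right by `bits`."""
    return ((value >> bits) | (value << (32 - bits))) & 0xFFFFFFFF


def ror13_hash(name: str) -> int:
    """Additive rotate-right-13 hash over the ASCII bytes of `name`."""
    h = 0
    for byte in name.encode("ascii"):
        h = (ror32(h, 13) + byte) & 0xFFFFFFFF
    return h


def resolve_by_hash(export_names, wanted_hash):
    """Return the export whose hash matches, or None."""
    for name in export_names:
        if ror13_hash(name) == wanted_hash:
            return name
    return None


exports = ["CreateFileW", "VirtualProtect", "WriteProcessMemory"]
wanted = ror13_hash("VirtualProtect")  # only this integer needs to ship in the binary
print(hex(wanted), resolve_by_hash(exports, wanted))
```

The binary carries a 32-bit constant instead of the API name, so neither the import table nor a strings dump shows what will be called.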

In-memory scanning can recover some of that visibility, but it comes with tradeoffs. Memory-wide YARA scans are expensive, noisy, and difficult to run continuously at scale. They also depend on timing: the artifact has to exist in memory long enough to be seen. Short-lived decoding routines or transient strings can still slip through.

At that point, the limitation isn’t YARA itself. It’s the assumption that detection starts with readable artifacts. As soon as those artifacts stop existing on disk, string-based detection stops being a reliable first step.

Figure 2. Comparison of an unpacked and packed Hello World executable, showing how simple packing is enough to remove readable string artifacts from disk.

From Source Code to Opcodes

Source code is written for people. It explains intent, names things clearly, and makes sense at a glance. None of that is what actually runs. Once the compiler is done, the program exists only as instructions the CPU can execute.

Figure 3 shows that shift in a very simple way. A small Hello World program in C is easy to follow at the source level. After compilation, that view disappears. What remains is assembly: individual instructions that tell the processor exactly what to do—move data, prepare arguments, call a function, return control.

At that level, everything is reduced to opcodes and operands. The opcode is the instruction itself—move this, call that, jump here. The operands are the values or locations the instruction works on: registers, memory addresses, constants. Together, they form the instruction stream the CPU steps through one operation at a time.

Strings tend to stand out during analysis because they’re readable, but they’re not essential. They’re just data sitting alongside the code. The instruction stream is different. Whether a binary is packed, encrypted, or stripped down, those opcodes still have to execute to make anything happen.

That’s the layer that survives. When readable artifacts fade away, execution doesn’t. What’s left is the flow of instructions—opcodes acting on operands—that carries the program forward, regardless of how much effort went into hiding everything else.

Figure 3. A simple C Hello World program alongside its compiled assembly, illustrating how readable source code is reduced to executable instructions.

Looking Closer at the Instruction Bytes

We’ll stick to the first three instructions that run inside main from the Hello World example in Figure 3. Not the whole function. Just where execution actually starts. Each one is pulled out on its own in Figures 4, 5, and 6.

By this point, there’s no source code left. No structure to lean on. The CPU is just walking bytes. Each instruction is a short sequence, read as a unit, then executed. That’s it.

48 83 EC 28          sub rsp, 28h

Figure 4. The first instruction executed inside main, adjusting the stack pointer to establish a usable stack frame.

This instruction exists to move the stack pointer.

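48 is the REX.W prefix that makes this a 64-bit operation. Without it, the same bytes would operate on esp instead of rsp.

83 selects a group of arithmetic instructions that take a small, sign-extended 8-bit constant. On its own, it doesn’t say which operation.

EC is the ModRM byte, and it does two jobs. Its middle three bits (101) pick subtraction out of that group. Its low three bits (100) name the target register. In this case, that resolves to rsp.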

28 is the value. 0x28 gets subtracted from the stack pointer.

No locals. No memory access. Just the stack pointer being nudged into place so the function can run.
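
The same breakdown can be done mechanically. The sketch below handles only this one encoding shape (REX.W + 83 + ModRM + imm8, register operand) and is nowhere near a general disassembler. The table and function names are illustrative.

```python
REG64 = ["rax", "rcx", "rdx", "rbx", "rsp", "rbp", "rsi", "rdi"]
# Opcode 0x83 is a group: the ModRM reg field picks the operation.
GROUP1 = ["add", "or", "adc", "sbb", "and", "sub", "xor", "cmp"]


def decode_group1_imm8(code: bytes) -> str:
    """Decode REX.W + 83 /r ib with a register operand (mod == 3) only."""
    rex, opcode, modrm, imm8 = code  # exactly four bytes expected
    assert (rex & 0xF0) == 0x40, "expected a REX prefix"
    assert rex & 0x08, "expected REX.W (64-bit operand size)"
    assert (rex & 0x01) == 0, "REX.B (r8-r15) is not handled in this sketch"
    assert opcode == 0x83, "expected the group-1 imm8 opcode"

    mod = modrm >> 6             # 0b11 means the operand is a register
    reg = (modrm >> 3) & 0b111   # operation selector for opcode 0x83
    rm = modrm & 0b111           # register number
    assert mod == 0b11, "only the register-direct form is handled here"
    # The CPU sign-extends imm8; values below 0x80 stay positive.
    value = imm8 - 0x100 if imm8 & 0x80 else imm8
    return f"{GROUP1[reg]} {REG64[rm]}, {value:#x}"


print(decode_group1_imm8(bytes.fromhex("4883EC28")))  # sub rsp, 0x28
```

Changing only the ModRM byte from EC to C4 turns the same opcode into add rsp, 0x28, which is why the operation cannot be read from 83 alone.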

48 8D 0D D5 11 00 00    lea rcx, Format

Figure 5. An instruction computing the address of the string and placing it into a register for use as a function argument.

This one prepares the argument for the call that follows.

Same 48 prefix. Same 64-bit context.

8D marks this instruction as lea. That matters. Nothing is being read here. Only the address of the string “Hello World!\n” is loaded into rcx.

0D encodes the destination register and the addressing mode: rcx, RIP-relative. The remaining bytes form the displacement. Added to the address of the next instruction, they land on the string “Hello World!\n”.

The string itself isn’t touched. Only the address ends up in rcx. That’s enough.
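
The displacement arithmetic is easy to check by hand. In the sketch below, the address of the lea is an assumption: the post does not state it, and 0x140001074 is picked so that the lea would sit directly before the call shown in Figure 6. Only the instruction bytes come from Figure 5.

```python
import struct


def rip_relative_target(insn_addr: int, insn: bytes) -> int:
    """Resolve the address referenced by a 7-byte `lea r64, [rip+disp32]`."""
    assert len(insn) == 7 and insn[1] == 0x8D, "expected REX + 8D + ModRM + disp32"
    modrm = insn[2]
    # mod == 00 and rm == 101 select RIP-relative addressing in 64-bit mode
    assert (modrm >> 6) == 0 and (modrm & 0b111) == 0b101
    (disp32,) = struct.unpack("<i", insn[3:7])  # signed, little-endian
    next_insn = insn_addr + len(insn)           # RIP already points past the lea
    return next_insn + disp32


LEA = bytes.fromhex("488D0DD5110000")
ASSUMED_ADDR = 0x140001074  # hypothetical address of the lea, not given in the post
print(hex(rip_relative_target(ASSUMED_ADDR, LEA)))
```

The bytes D5 11 00 00 read as the little-endian value 0x11D5. Whatever the real address is, the string sits 0x11D5 bytes past the end of this instruction.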

E8 90 FF FF FF       call sub_140001010

Figure 6. A relative call instruction transferring execution to another routine without relying on symbols or absolute addresses.

This is the handoff.

E8 identifies the instruction as a call.

The bytes after it are just an offset. Signed. Relative. No absolute address anywhere.

At runtime, the CPU adds that offset to the address of the next instruction and jumps. A return address gets pushed. Execution continues within the function sub_140001010, which is responsible for printing the string to the console.
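
A short sketch of that arithmetic follows. The only address the post provides is the one embedded in the name sub_140001010, so the call's own address is worked backwards from it, assuming IDA's default naming where the suffix is the function's address.

```python
import struct


def call_target(call_addr: int, insn: bytes) -> int:
    """Resolve the destination of a 5-byte `call rel32` (opcode E8)."""
    assert len(insn) == 5 and insn[0] == 0xE8
    (rel32,) = struct.unpack("<i", insn[1:5])  # signed, little-endian
    return call_addr + len(insn) + rel32       # relative to the next instruction


CALL = bytes.fromhex("E890FFFFFF")
(REL32,) = struct.unpack("<i", CALL[1:5])      # 0xFFFFFF90 read as signed: -0x70

TARGET = 0x140001010                           # taken from the name sub_140001010
CALL_ADDR = TARGET - REL32 - len(CALL)         # inferred, not stated in the post

print(REL32, hex(CALL_ADDR), hex(call_target(CALL_ADDR, CALL)))
```

The offset is negative, so the callee sits at a lower address than the caller. Because the encoding is relative, these five bytes change whenever the distance between the two functions changes, which matters later when deciding what to wildcard.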

What Detection Anchors On When Artifacts Are Gone

When strings stop being reliable, the remaining signals come from execution itself. Code still has to unpack, decrypt, or resolve what it needs before it can do anything useful. That work leaves structure behind, even when everything else is stripped away.

Figure 7 groups those structures into a few broad buckets—custom packers, known algorithms, and hashing logic—before we step through each one individually.

Figure 7. Detection patterns that persist after strings disappear.

Custom Packer Logic: String Decryption Rule

For this post, I’m using a RedLine Stealer sample from VirusShare.
SHA-256: 00da14d8bbe2c85a04314b0ac40c13ebb67fe6693af8e786e63a2c6f6a428b00.

Opening the sample in Detect It Easy, the overall picture becomes clear almost immediately. The binary identifies as a standard PE32, built with Visual C++, but that familiarity stops there. The heuristic flags tell the real story: compressed or packed data, elevated entropy, and a resource section doing more work than it should. There’s even a loose heuristic hint toward .NET Reactor–like behavior, but with no managed metadata to back it up—just import patterns that resemble what Reactor-protected samples often expose, making it a cue to dig deeper rather than a conclusion to trust. At best, this suggests some custom, Reactor-inspired techniques in play rather than a clean, off-the-shelf protector.

Figure 8. The file presents as a packed PE—high entropy, compressed resources, and little else to work with.

The entropy view reinforces that suspicion. The PE header and a couple of standard sections sit where you’d expect them, with relatively low entropy. But both the .text section and, more noticeably, the .rsrc section spike sharply. The resource section in particular stays near the upper end of the scale across its entire range—consistent with compressed or encrypted content rather than icons, dialogs, or version metadata. Whatever this binary is carrying, it isn’t meant to be readable on disk.
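
For reference, the measure behind views like this is typically Shannon entropy over byte frequencies. The sketch below computes it for a whole buffer. How Detect It Easy windows or scales its graph is not covered in this post, so treat this as the textbook formula rather than a reimplementation of the tool. The random buffer is a stand-in for compressed or encrypted data.

```python
import math
import os
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Entropy in bits per byte: 0.0 for constant data, up to 8.0."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


plain = b"This program cannot be run in DOS mode. " * 64
packed_like = os.urandom(4096)  # stand-in for compressed/encrypted bytes

print(round(shannon_entropy(plain), 2))        # low: repetitive ASCII text
print(round(shannon_entropy(packed_like), 2))  # close to 8: looks random
```

Ordinary code and text use a small, skewed set of byte values, so they score well below the ceiling. A section that stays near 8 bits per byte across its whole range is behaving like ciphertext or a compressed stream.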

Figure 9. .text carries more entropy than expected, alongside a dense .rsrc, pointing to logic and data deliberately blurred at rest.

That expectation carries over into the strings view. Scanning the binary surfaces almost nothing of value. There are no configuration strings, no URLs, no user-facing messages, no obvious markers that could anchor a meaningful signature. What does appear is a small set of import-related API names, exactly the strings the Windows loader requires to resolve imports at runtime. Everything else is either short, high-entropy fragments or completely nonsensical output from the packed data. From a static perspective, the binary offers no stable plaintext indicators beyond what’s structurally unavoidable.

Figure 10. The strings view offers little beyond imported API names; everything else is noise or encrypted.

With static inspection tapped out in Detect It Easy, the next step is obvious: load the binary into IDA and follow execution instead of artifacts. Right at the top of main, before anything meaningful happens, execution funnels into sub_401650. That function runs immediately, reconstructing data byte-by-byte and handing the result back to the caller. In the debugger, the payoff is clear—the decrypted output resolves to Cor_Enable_Profiling, a string that never appears in plaintext on disk.

Figure 12. Execution drops straight into sub_401650 at the very start of main.

Figure 13. Stepping through the code shows the same routine decrypting data in memory at runtime, confirming the strings never exist in plaintext on disk.

That placement matters. A decryption routine sitting at the very start of main isn’t incidental—it’s foundational. At this point, the question stops being what strings exist and shifts to how they’re being rebuilt, and what that reconstruction logic looks like under the hood.

Looking deeper into sub_401650, it’s immediately clear what this isn’t. There’s no key schedule, no state array, no rounds, no diffusion step that even vaguely resembles RC4, AES, or any standard algorithm. Nothing is iterated. Nothing evolves. Each byte is touched once, transformed, and discarded.

The logic is blunt and handcrafted. A fixed 32-byte buffer goes in. A fixed sequence of XORs and a single NOT are applied. The constants are embedded directly in the instruction stream: no derivation, no reuse, no abstraction.

That custom shape is exactly what gives the routine its detection value. Even when strings disappear, this logic remains stable and specific to the sample, making it a strong candidate for a YARA rule anchored in opcodes.

Figure 14. sub_401650 performing fixed, byte-by-byte decryption using hard-coded constants—custom logic, not a standard cipher.

Figure 15. Cross-references show sub_401650 called repeatedly, decrypting multiple embedded strings across main.


#!/usr/bin/env python3
import sys


def sub_401650(a1: bytes) -> bytes:
    # fixed-size decoder, mirrors the decompile byte-for-byte
    if len(a1) < 32:
        raise ValueError("expected at least 32 bytes")

    out = bytearray(32)

    out[0]  = (a1[0]  ^ 0xA3) & 0xFF
    out[1]  = (a1[1]  ^ 0x54) & 0xFF
    out[2]  = (~a1[2]) & 0xFF
    out[3]  = (a1[3]  ^ 0x75) & 0xFF
    out[4]  = (a1[4]  ^ 0xE7) & 0xFF
    out[5]  = (a1[5]  ^ 0x44) & 0xFF
    out[6]  = (a1[6]  ^ 0x4B) & 0xFF
    out[7]  = (a1[7]  ^ 0x23) & 0xFF
    out[8]  = (a1[8]  ^ 0xBF) & 0xFF
    out[9]  = (a1[9]  ^ 0x45) & 0xFF
    out[10] = (a1[10] ^ 0x3B) & 0xFF
    out[11] = (a1[11] ^ 0x56) & 0xFF
    out[12] = (a1[12] ^ 0xF8) & 0xFF
    out[13] = (a1[13] ^ 0x98) & 0xFF
    out[14] = (a1[14] ^ 0x5B) & 0xFF
    out[15] = (a1[15] ^ 0xF4) & 0xFF
    out[16] = (a1[16] ^ 0xB5) & 0xFF
    out[17] = (a1[17] ^ 0x87) & 0xFF
    out[18] = (a1[18] ^ 0x7B) & 0xFF
    out[19] = (a1[19] ^ 0x0F) & 0xFF
    out[20] = (a1[20] ^ 0xF4) & 0xFF
    out[21] = (a1[21] ^ 0x76) & 0xFF
    out[22] = (a1[22] ^ 0xB9) & 0xFF
    out[23] = (a1[23] ^ 0x34) & 0xFF
    out[24] = (a1[24] ^ 0xBF) & 0xFF
    out[25] = (a1[25] ^ 0x1E) & 0xFF
    out[26] = (a1[26] ^ 0xE7) & 0xFF
    out[27] = (a1[27] ^ 0x78) & 0xFF
    out[28] = (a1[28] ^ 0x98) & 0xFF
    out[29] = (a1[29] ^ 0xE9) & 0xFF
    out[30] = (a1[30] ^ 0x6F) & 0xFF
    out[31] = (a1[31] ^ 0xB4) & 0xFF

    return bytes(out)


def main():
    if len(sys.argv) != 2:
        print(f"usage: {sys.argv[0]} <hex-bytes-file>")
        sys.exit(1)

    with open(sys.argv[1], "r") as f:
        hex_text = f.read()

    a1 = bytes.fromhex(hex_text)
    decoded = sub_401650(a1)

    # trim null padding; print whatever survives
    try:
        print(decoded.rstrip(b"\x00").decode("ascii"))
    except UnicodeDecodeError:
        print(decoded.hex())


if __name__ == "__main__":
    main()

Figure 16. Direct Python clone of sub_401650 for string decryption.
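
Because XOR and NOT are each their own inverse, the same table encodes and decodes. The sketch below folds the constants from Figure 16 into a single key table (the NOT at index 2 is equivalent to XOR 0xFF) and runs a round trip. The encoded bytes it produces are generated here by re-encoding the known plaintext with null padding. They are not extracted from the sample.

```python
# XOR keys copied from Figure 16; index 2 is the NOT, which equals XOR 0xFF.
KEYS = bytes([
    0xA3, 0x54, 0xFF, 0x75, 0xE7, 0x44, 0x4B, 0x23,
    0xBF, 0x45, 0x3B, 0x56, 0xF8, 0x98, 0x5B, 0xF4,
    0xB5, 0x87, 0x7B, 0x0F, 0xF4, 0x76, 0xB9, 0x34,
    0xBF, 0x1E, 0xE7, 0x78, 0x98, 0xE9, 0x6F, 0xB4,
])


def transform(block: bytes) -> bytes:
    """Apply the per-byte table; calling it twice returns the input."""
    if len(block) != len(KEYS):
        raise ValueError("expected exactly 32 bytes")
    return bytes(b ^ k for b, k in zip(block, KEYS))


plaintext = b"Cor_Enable_Profiling".ljust(32, b"\x00")  # null padding is assumed
encoded = transform(plaintext)

print(b"Cor_Enable" in encoded)          # False: nothing readable in the encoded form
print(transform(encoded) == plaintext)   # True: one more pass restores the string
```

The table form also makes the detection surface obvious: 32 constants in a fixed order, with one position that is a NOT rather than an arbitrary key.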

Finding a distinctive routine is only the first step. Once sub_401650 stands out as something worth anchoring on, the next question is restraint. A good rule doesn’t just match—it knows when not to. You don’t want this logic firing on clean binaries that happen to use a few XORs, and you don’t want it so narrow that it misses sibling samples built by the same actor. The goal is balance: tight enough to avoid noise, loose enough to catch the family and its close variants that reuse the same string-hiding approach.

That’s also where performance starts to matter. YARA doesn’t run in a vacuum. In production, every rule competes for CPU time, memory, and scan budget. The more work a rule does, the more selective it needs to be about when that work runs. This is why a raw code pattern is rarely left alone. You layer it with cheap filters first: file size bounds, PE characteristics, section counts, presence or absence of a security directory, even coarse-grained signals like import hash or compiler fingerprints. You can narrow further by checking how execution begins: whether main follows a familiar setup before decryption kicks in, or whether certain code bytes consistently appear just ahead of the routine.

All of that isn’t about weakening the detection. It’s about shaping it. The decryption logic remains the core signal, but everything around it helps decide when that signal is worth evaluating. That’s how a rule moves from “interesting” to usable: specific enough to matter, efficient enough to survive real-world scanning.
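
The layering idea can be sketched outside YARA as a two-stage check: a few header reads first, and a byte hunt only if they pass. This mirrors the concept, not YARA's internal evaluation order, and the function names and the 2000 KB ceiling are illustrative choices.

```python
import struct

IMAGE_FILE_MACHINE_I386 = 0x014C
MAX_SIZE = 2000 * 1024  # illustrative ceiling, in the spirit of "filesize < 2000KB"


def cheap_gates(data: bytes) -> bool:
    """Header-only checks that cost almost nothing compared to a byte hunt."""
    if len(data) >= MAX_SIZE or len(data) < 0x40:
        return False
    if data[:2] != b"MZ":
        return False
    (e_lfanew,) = struct.unpack_from("<I", data, 0x3C)
    if e_lfanew + 6 > len(data) or data[e_lfanew:e_lfanew + 4] != b"PE\x00\x00":
        return False
    (machine,) = struct.unpack_from("<H", data, e_lfanew + 4)
    return machine == IMAGE_FILE_MACHINE_I386


def scan(data: bytes, needle: bytes) -> bool:
    # `and` short-circuits: the search only runs when the gates pass.
    return cheap_gates(data) and needle in data
```

A 64-bit binary, a non-PE blob, or an oversized installer is rejected after reading a handful of bytes, before the expensive part is ever considered.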

For this post, though, that full tuning exercise stays out of scope. The focus here isn’t on squeezing every last microsecond out of a production rule or debating scan-time tradeoffs. It’s about understanding what makes a piece of code worth anchoring on in the first place, before performance and deployment concerns enter the picture.

The next step, then, is to get closer to the bytes themselves. To do that, you need to look past pseudocode and into the actual opcode stream. In IDA, that means switching on opcode bytes in the disassembly view—so each instruction shows not just what it does, but how it’s encoded. That’s the level YARA ultimately reasons about. Once those bytes are visible, the decryption routine stops being an abstract idea and becomes a concrete sequence you can measure, compare, and eventually express as a rule.

Figure 17. Opcode bytes exposed beside each instruction — the raw material for YARA beyond strings.


import "pe"

rule Custom_Sub401650_XorNot32_OpcodeOnly_v2
{
  meta:
    description   = "Demo-only: opcode-level YARA rule for detecting a RedLine Stealer string decryptor routine"
    author        = "AdversaryCraft"
    sample_sha256 = "00da14d8bbe2c85a04314b0ac40c13ebb67fe6693af8e786e63a2c6f6a428b00"
    note          = "Illustrates decryptor-centric detection when plaintext strings are absent (not production-ready)"

  strings:

    /*
      ===== HEAD ANCHOR =====
      Bytes                            Disassembly
      ------------------------------------------------------------
      83 EC 24                         sub     esp, 24h
      56                               push    esi
      8B 44 24 2C                      mov     eax, [esp+2Ch]
      0F B6 08                         movzx   ecx, byte ptr [eax]
      0F B6 50 01                      movzx   edx, byte ptr [eax+1]
      80 F1 A3                         xor     cl, 0A3h
      80 F2 54                         xor     dl, 54h
      88 4C 24 ??                      mov     [esp+..], cl
      0F B6 48 02                      movzx   ecx, byte ptr [eax+2]
      88 54 24 ??                      mov     [esp+..], dl
      0F B6 50 03                      movzx   edx, byte ptr [eax+3]
      F6 D1                            not     cl
      80 F2 75                         xor     dl, 75h
    */
    $head = {
      83 EC 24
      56
      8B 44 24 2C
      0F B6 08
      0F B6 50 01
      80 F1 A3
      80 F2 54
      88 4C 24 ??
      0F B6 48 02
      88 54 24 ??
      0F B6 50 03
      F6 D1
      80 F2 75
    }

    /*
      ===== MID CORROBORATORS =====
      Bytes                            Disassembly
      ------------------------------------------------------------
      80 F1 B5                         xor     cl, 0B5h
      80 F2 87                         xor     dl, 87h

      80 F1 7B                         xor     cl, 7Bh
      80 F2 0F                         xor     dl, 0Fh
    */
    $m1 = { 80 F1 B5 80 F2 87 }
    $m2 = { 80 F1 7B 80 F2 0F }

    /*
      ===== TAIL ANCHOR =====
      Bytes                            Disassembly
      ------------------------------------------------------------
      6A 20                            push    20h
      8D 44 24 ??                      lea     eax, [esp+..]
      50                               push    eax
      80 F1 6F                         xor     cl, 6Fh
      80 F2 B4                         xor     dl, 0B4h
      56                               push    esi
      88 4C 24 ??                      mov     [esp+..], cl
      88 54 24 ??                      mov     [esp+..], dl
      C6 44 24 ?? 00                   mov     byte ptr [esp+..], 0
      E8 ?? ?? ?? ??                   call    memcpy
      83 C4 0C                         add     esp, 0Ch
      8B C6                            mov     eax, esi
      5E                               pop     esi
      83 C4 24                         add     esp, 24h
      C3                               retn
    */
    $tail = {
      6A 20
      8D 44 24 ??
      50
      80 F1 6F
      80 F2 B4
      56
      88 4C 24 ??
      88 54 24 ??
      C6 44 24 ?? 00
      E8 ?? ?? ?? ??
      83 C4 0C
      8B C6
      5E
      83 C4 24
      C3
    }

  condition:

    pe.is_pe and
    pe.machine == pe.MACHINE_I386 and
    filesize < 2000KB and

    $head and $tail and $m1 and $m2
}

Figure 18. Sample YARA rule illustrating opcode-level detection of a custom string decryptor routine

Let’s look at how the rule is structured. The logic isn’t spread evenly across the function—it’s anchored around a few deliberate checkpoints. We’ll walk through the $head, $m1, $m2 and $tail sequences in turn, and why each one was chosen to represent intent rather than incidental compiler noise. We’ll also unpack the use of ?? wildcards—where flexibility is intentional, and where the bytes matter enough that they’re locked down.

$head

The opening bytes are not interesting because they set up a stack frame—they’re interesting because of what follows immediately after. The routine pulls a pointer from the stack and starts reading one byte at a time using movzx. That’s the first signal: byte-wise handling, not block crypto.

The paired XORs with hardcoded constants (A3, 54) matter because they’re embedded directly into the instruction stream. There’s no key material, no loop-driven derivation, no state carried forward. Each byte is treated in isolation. The single not cl stands out even more. Mixing a NOT into an otherwise XOR-only flow is uncommon and gives this routine a shape that’s easy to recognize and hard to accidentally reproduce.

$m1 & $m2

Instead of matching every transformation, the rule samples a few XOR pairs from the middle of the routine. Constants like B5/87 and 7B/0F aren’t special in a cryptographic sense—they’re special because they’re arbitrary. They exist only because the author chose them.

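Requiring multiple such pairs makes the rule resilient. One XOR constant could collide with benign code. Several distinct pairs in the same file almost never do. As written, the condition checks that every pattern is present; it does not compare their offsets, so the ordering is implied by the routine rather than enforced by the rule. This keeps the rule wide enough to catch variants using the same routine, but narrow enough to avoid random matches.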
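
The condition in Figure 18 requires the four patterns to be present but does not compare where they land. If ordering were to be enforced, the check could look roughly like the sketch below. It uses only the constant-bearing XOR pairs from each anchor and assumes the compiled routine keeps them in the same order as the byte indices in Figure 16.

```python
HEAD_PAIR = bytes.fromhex("80F1A380F254")  # xor cl, 0A3h / xor dl, 54h
M1 = bytes.fromhex("80F1B580F287")         # xor cl, 0B5h / xor dl, 87h
M2 = bytes.fromhex("80F17B80F20F")         # xor cl, 7Bh  / xor dl, 0Fh
TAIL_PAIR = bytes.fromhex("80F16F80F2B4")  # xor cl, 6Fh  / xor dl, 0B4h


def in_fixed_order(data: bytes, patterns) -> bool:
    """True if every pattern occurs, each one after the end of the previous."""
    pos = 0
    for pat in patterns:
        idx = data.find(pat, pos)
        if idx < 0:
            return False
        pos = idx + len(pat)
    return True


filler = b"\x90" * 8
ordered = filler.join([HEAD_PAIR, M1, M2, TAIL_PAIR])
shuffled = filler.join([HEAD_PAIR, M2, M1, TAIL_PAIR])

print(in_fixed_order(ordered, [HEAD_PAIR, M1, M2, TAIL_PAIR]))   # True
print(in_fixed_order(shuffled, [HEAD_PAIR, M1, M2, TAIL_PAIR]))  # False
```

Both buffers contain all four pairs, so a presence-only check accepts both. Only the ordered one survives the stricter test, at the cost of missing builds where the compiler reschedules the instructions.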

$tail

The tail tells you what kind of function this is. push 20h fixes the output length at 32 bytes. The stack-based buffer, explicit null termination, and the call to memcpy leave little ambiguity about the goal: something opaque goes in and a usable string comes out. The cleanup and return simply end the function.

?? wildcards

Offsets, stack layout, and call targets shift between builds. You wildcard those and keep what reflects intent: constants, instruction order, and data flow. That’s how you avoid brittle, one-sample rules.
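
One way to see what a ?? buys you is to translate a hex pattern into a byte-level regular expression, where each wildcard becomes "any one byte". This is a sketch of the concept. YARA's own matcher works differently, and hex_pattern_to_regex is a name invented for the example. The two "builds" are made-up byte runs that differ only in a stack offset.

```python
import re


def hex_pattern_to_regex(pattern: str) -> re.Pattern:
    """Compile a YARA-style hex string such as "88 4C 24 ??" into a bytes regex."""
    parts = []
    for token in pattern.split():
        if token == "??":
            parts.append(b".")  # any single byte
        else:
            parts.append(re.escape(bytes([int(token, 16)])))
    # DOTALL so that "." also matches the byte 0x0A
    return re.compile(b"".join(parts), re.DOTALL)


STORE_CL = hex_pattern_to_regex("88 4C 24 ??")  # mov [esp+..], cl

build_a = bytes.fromhex("80F1A3 884C2410 0FB64802")  # hypothetical stack slot 0x10
build_b = bytes.fromhex("80F1A3 884C2418 0FB64802")  # hypothetical stack slot 0x18

print(bool(STORE_CL.search(build_a)), bool(STORE_CL.search(build_b)))  # True True
print(bool(hex_pattern_to_regex("88 4C 24 10").search(build_b)))       # False
```

The fixed bytes carry the instruction and its register; the wildcard absorbs the one value a recompile is free to change.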

Testing the Rule: From Grep to Retro Hunt

Before I let this anywhere near a real scan, I want one boring answer: does it light up on clean software? Opcode-level rules can be sharp, but they can also turn generic fast if you accidentally anchor on common compiler output.

So the first pass is intentionally crude. A grep-style content search on VirusTotal over the byte windows I actually care about:


type:peexe
content:{83 ec 24 56 8b 44 24 2c 0f b6 08 0f b6 50 01 80 f1 a3 80 f2 54 88 4c 24 ?? 0f b6 48 02 88 54 24 ?? 0f b6 50 03 f6 d1 80 f2 75}
content:{80 f1 b5 80 f2 87}
content:{80 f1 7b 80 f2 0f}
content:{6a 20 8d 44 24 ?? 50 80 f1 6f 80 f2 b4 56 88 4c 24}
positives:0

Figure 19. A quick VT grep-style sweep over the opcode anchors to sanity-check noise—zero/few hits on clean PE files is exactly the signal you want before moving forward.

An empty result set is exactly what you want at this stage. The positives:0 term restricts the search to files with no AV detections, so anything that comes back is a presumed-clean PE carrying all four byte windows. Zero hits doesn’t prove the rule is “correct,” but it does tell you something important: these anchors aren’t just matching random compiler soup across benign PE files. If this search came back with dozens or hundreds of hits, that’s an immediate red flag: the pattern is too loose, or you latched onto something common.

Only if the search hits a very small number of clean files does hardening even enter the picture. At that point, the goal isn’t to pile on more opcode bytes. Bytes are expensive—every extra pattern makes the rule more brittle and more sample-specific. Good hardening reduces false positives without collapsing the rule into a single hash.

A few practical knobs that usually help when refinement is actually needed:

File size gates. Packers and small loaders tend to live in narrow size bands. A simple filesize < X or bounded range can drop noise fast.

Structurally unavoidable strings. If the only plaintext left is import-related API names, use that. Even lightweight checks for things like FindResource, LoadResource, SizeofResource, VirtualProtect, or WriteProcessMemory can separate loaders from normal applications without relying on missing config strings.

Section-scoped scanning. Don’t hunt these bytes across the entire file. Restricting matches to the .text section avoids coincidences in overlays or high-entropy resource blobs.

Location constraints. If the routine consistently appears near the start of .text or within a tight window relative to the entry point, encode that habit. You’re not looking for “anywhere in the binary.”

PE shape hints. Section count, section sizes, presence or absence of a security directory—none of these are signatures on their own, but they make excellent tie-breakers.
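
As one example of these knobs, section scoping comes down to reading the PE section table and limiting the search to one raw-data range. The sketch below does that with nothing but struct. It assumes a well-formed header, skips the validation a real parser needs, and uses illustrative function names.

```python
import struct


def text_section_range(data: bytes):
    """Return (start, end) file offsets of the .text section, or None."""
    if len(data) < 0x40 or data[:2] != b"MZ":
        return None
    (e_lfanew,) = struct.unpack_from("<I", data, 0x3C)
    if data[e_lfanew:e_lfanew + 4] != b"PE\x00\x00":
        return None
    (num_sections,) = struct.unpack_from("<H", data, e_lfanew + 6)
    (opt_size,) = struct.unpack_from("<H", data, e_lfanew + 20)
    # signature (4) + COFF header (20) + optional header
    table = e_lfanew + 24 + opt_size
    for i in range(num_sections):
        off = table + 40 * i  # each section header is 40 bytes
        name = data[off:off + 8].rstrip(b"\x00")
        raw_size, raw_ptr = struct.unpack_from("<II", data, off + 16)
        if name == b".text":
            return raw_ptr, raw_ptr + raw_size
    return None


def found_in_text(data: bytes, needle: bytes) -> bool:
    """Search for `needle` only inside the raw bytes of .text."""
    rng = text_section_range(data)
    if rng is None:
        return False
    start, end = rng
    return data.find(needle, start, end) >= 0
```

The same six bytes sitting in an overlay or inside a resource blob no longer count, which removes one common source of accidental hits.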

Figure 20. VirusTotal Livehunt Retrohunt editor for authoring YARA rules and running historical hunts across selected corpora and time ranges. Source: docs.virustotal.com

When the grep finally stays quiet, the rule graduates to its real exam: Retrohunt. But it doesn’t run just once. The exact same YARA is executed twice, against two very different populations. The first run goes against a goodware-biased corpus, where the only thing you’re testing is restraint—does the rule remain silent in a world full of installers, signed binaries, and boring software that just does its job? The second run goes against VirusTotal’s default corpus, where the noise returns and the question flips. Now you’re looking to see what else lights up. Not clones of your sample, but binaries that carry the same decryptor logic buried under different skins. At this stage, you’re no longer asking whether the rule works. You’re asking whether it understands the behavior it’s trying to describe.

A good rule begins to surface siblings that reuse the same routine, even if everything else around it has shifted. A weak rule just describes one binary very precisely and nothing more. Retro hunts make that difference obvious very quickly. If you want to dig deeper into how VirusTotal’s Retrohunt works and how to run these searches effectively, the official documentation covers it in detail: https://docs.virustotal.com/docs/retrohunt.

Alongside this external testing, it’s worth remembering that most security teams aren’t relying on VirusTotal alone. AV vendors, EDR teams, and internal detection groups usually run their own quality gates before anything ships. Rules get exercised against large cleanware corpora, regression sets, and performance testbeds to make sure they don’t light up on legitimate software or introduce scan-time overhead. False positives and slow rules are caught long before production.

Retro hunts are a way to sanity-check intent and coverage from the outside. Internal QA systems exist to do the unglamorous work at scale—proving that a rule is quiet, fast, and safe once it leaves the lab.

Custom Packer Logic: Payload Decryption Rule

Up to this point, everything we’ve seen has lived in the world of string decryption: small, repeatable routines cleaning up literals just in time for use. This block is where the scope changes. During initial static analysis in Detect It Easy, the resource section already stood out as compressed, so when main drops into a run of FindResource → LoadResource → LockResource calls, it’s a natural place to stop and look closer. What’s being pulled here isn’t just data. It’s a packed payload lifted straight out of .rsrc, staged in memory, and processed inside a do { … } while (…) loop via repeated calls to sub_401560, chewing through the buffer chunk by chunk. The final transformation happens in sub_40AC60, where the last pass turns the extracted resource into its usable form. This is the point where the packer moves beyond string cleanup and reconstructs the real body of the sample.

Figure 21. Native loader lifting an encrypted payload from resources and rebuilding it in memory.

Figure 22. Final unpacking stage: sub_40AC60 reconstructs a .NET PE payload in memory, with EDI pointing at the newly materialized output buffer.

Even before we follow execution into sub_40AC60 (where the final payload transform lands), it’s worth pausing on sub_401560—because this is the “workhorse” that keeps getting hammered inside that do/while pipeline.

At a high level, sub_401560 is a table-driven byte mixer. It copies the input buffer to an output buffer, then rewrites the bytes using a 256×256 lookup table (sitting at this + 0x10000). But it’s not a simple byte substitution: each byte’s replacement is keyed off a neighbor byte (next/previous), plus a small seed value stored at this[131104] (offset 0x20020).

If the chunk is 1 byte, it does a single lookup keyed by that seed.
If it’s larger, it runs a forward pass (byte + next-byte), does a special keyed transform on the last byte (seed XOR 0x55), then runs a backward pass (byte + prev-byte), and finally re-writes the first byte again using the seed.

Net effect: it turns the buffer into a chained stream transform—each byte is influenced by its neighbors—so by the time we reach sub_40AC60, we’re not looking at “raw extracted resource data” anymore, we’re looking at something that’s already been aggressively stirred.

Figure 23. The IDA pseudocode for sub_401560 showing the chained mixing behavior.


#!/usr/bin/env python3
from __future__ import annotations

from pathlib import Path
import argparse
import sys

KEY_OFF = 0x20020
TABLE_BASE = 0x10000


def sub_401560_debug(state: bytes, src: bytes, debug: bool = False) -> bytes:
    size = len(src)
    if size < 1:
        return b""

    # sanity checks based on the exact indices used
    if len(state) <= KEY_OFF:
        raise ValueError(f"substitution_table.bin too small: need key byte at 0x{KEY_OFF:X}, got len=0x{len(state):X}")
    if len(state) < 0x20000:
        raise ValueError(f"substitution_table.bin too small: need at least 0x20000 bytes for table region, got len=0x{len(state):X}")

    out = bytearray(src)  # memcpy(a4, Src, Size)
    key = state[KEY_OFF] & 0xFF

    def dump(tag: str, buf: bytes, n: int = 64):
        if not debug:
            return
        head = buf[:n]
        print(f"\n[{tag}] len={len(buf)} key=0x{key:02X}")
        print(" head:", head.hex(" "))
        if len(buf) > n:
            tail = buf[-n:]
            print(" tail:", tail.hex(" "))

    dump("after memcpy", out)

    if size == 1:
        b0 = out[0]
        out[0] = state[TABLE_BASE + (256 * b0) + key]
        dump("final (size==1)", out)
        return bytes(out)

    # ---- forward pass ----
    # for (i = 0; i < Size-1; ++i)
    for i in range(size - 1):
        cur = out[i]
        nxt = out[i + 1]  # important: still "original" at that moment
        out[i] = state[TABLE_BASE + (256 * cur) + nxt]

    dump("after forward pass", out)

    # ---- last byte special ----
    last = out[size - 1]
    out[size - 1] = state[TABLE_BASE + (256 * last) + (key ^ 0x55)]

    dump("after last-byte special", out)

    # ---- backward pass ----
    for v7 in range(size - 1, 0, -1):
        cur = out[v7]
        prev = out[v7 - 1]
        out[v7] = state[TABLE_BASE + (256 * cur) + prev]

    dump("after backward pass", out)

    # ---- first byte final ----
    b0 = out[0]
    out[0] = state[TABLE_BASE + (256 * b0) + key]

    dump("final output", out)
    return bytes(out)


def main() -> int:
    ap = argparse.ArgumentParser()
    ap.add_argument("--state", default="substitution_table.bin")
    ap.add_argument("--input", default="input.bin")
    ap.add_argument("--output", default="output.bin")
    ap.add_argument("--size", type=int, default=1024)
    ap.add_argument("--debug", action="store_true")
    args = ap.parse_args()

    state = Path(args.state).read_bytes()
    data = Path(args.input).read_bytes()

    if len(state) < 0x20021:
        raise ValueError(f"substitution_table.bin length is 0x{len(state):X}; need at least 0x20021 (because key at 0x20020)")

    if len(data) < args.size:
        raise ValueError(f"input.bin too small: need {args.size} bytes, got {len(data)}")

    data = data[:args.size]
    out = sub_401560_debug(state, data, debug=args.debug)
    Path(args.output).write_bytes(out)

    print(f"\n[+] wrote {len(out)} bytes to {args.output}")
    return 0


if __name__ == "__main__":
    try:
        raise SystemExit(main())
    except Exception as e:
        print(f"[!] {e}", file=sys.stderr)
        raise SystemExit(1)

Figure 24. Python reimplementation of sub_401560, applying a chained 256×256 table-driven byte transform to an input buffer.

What stands out about sub_401560 is that it doesn’t look like any standard algorithm. The table lookups and neighbor-based chaining give it a very specific shape, which means the function itself is distinctive. That makes it a solid candidate for YARA-based detection: not because it’s sophisticated crypto, but because it’s custom, repeatable, and easy to recognize once you know what to look for.


import "pe"

rule Custom_Sub401560_TableChainedMixer_OpcodeOnly_v1
{
  meta:
    description   = "Demo-only: opcode-level YARA rule for detecting the sub_401560 table-chained byte mixer"
    author        = "AdversaryCraft"
    sample_sha256 = "00da14d8bbe2c85a04314b0ac40c13ebb67fe6693af8e786e63a2c6f6a428b00"
    note          = "Illustrates function-centric detection when plaintext strings are absent (not production-ready)"

  strings:

    /*
      ===== HEAD ANCHOR =====
      Bytes                            Disassembly
      ------------------------------------------------------------
      55                               push    ebp
      8B 6C 24 0C                      mov     ebp, [esp+0Ch]
      83 FD 01                         cmp     ebp, 1
      57                               push    edi
      8B F9                            mov     edi, ecx
      0F 8C ?? ?? ?? ??                jl      loc_40164B
      8B 44 24 0C                      mov     eax, [esp+0Ch]
      56                               push    esi
      8B 74 24 18                      mov     esi, [esp+18h]
      55 50 56                         push    ebp; push eax; push esi
      E8 ?? ?? ?? ??                   call    _memcpy
      83 C4 0C                         add     esp, 0Ch
      83 FD 01                         cmp     ebp, 1
      75 20                            jnz     loc_4015AA
    */
    $head = {
      55
      8B 6C 24 0C
      83 FD 01
      57
      8B F9
      0F 8C ?? ?? ?? ??
      8B 44 24 0C
      56
      8B 74 24 18
      55 50 56
      E8 ?? ?? ?? ??
      83 C4 0C
      83 FD 01
      75 20
    }

    /*
      ===== FORWARD PASS CORE =====
      Bytes                            Disassembly
      ------------------------------------------------------------
      0F B6 14 30                      movzx   edx, byte ptr [eax+esi]
      0F B6 5C 30 01                   movzx   ebx, byte ptr [eax+esi+1]
      81 C2 00 01 00 00                add     edx, 100h
      C1 E2 08                         shl     edx, 8
      03 D7                            add     edx, edi
      8A 14 13                         mov     dl, [ebx+edx]
      88 14 30                         mov     [eax+esi], dl
      40                               inc     eax
      3B C1                            cmp     eax, ecx
      7C ??                            jl      loc_4015C0
    */
    $fwd = {
      0F B6 14 30
      0F B6 5C 30 01
      81 C2 00 01 00 00
      C1 E2 08
      03 D7
      8A 14 13
      88 14 30
      40
      3B C1
      7C ??
    }

    /*
      ===== LAST-BYTE SALT (key ^ 0x55) =====
      Bytes                            Disassembly
      ------------------------------------------------------------
      0F B6 87 20 00 02 00             movzx   eax, byte ptr [edi+20020h]
      0F B6 54 2E FF                   movzx   edx, byte ptr [esi+ebp-1]
      83 F0 55                         xor     eax, 55h
      81 C2 00 01 00 00                add     edx, 100h
      03 C7                            add     eax, edi
      C1 E2 08                         shl     edx, 8
      8A 04 02                         mov     al, [edx+eax]
      88 44 2E FF                      mov     [esi+ebp-1], al
    */
    $last = {
      0F B6 87 20 00 02 00
      0F B6 54 2E FF
      83 F0 55
      81 C2 00 01 00 00
      03 C7
      C1 E2 08
      8A 04 02
      88 44 2E FF
    }

    /*
      ===== BACKWARD PASS CORE =====
      Bytes                            Disassembly
      ------------------------------------------------------------
      0F B6 14 30                      movzx   edx, byte ptr [eax+esi]
      0F B6 4C 30 FF                   movzx   ecx, byte ptr [eax+esi-1]
      81 C2 00 01 00 00                add     edx, 100h
      C1 E2 08                         shl     edx, 8
      03 CF                            add     ecx, edi
      8A 0C 0A                         mov     cl, [edx+ecx]
      88 0C 30                         mov     [eax+esi], cl
      48                               dec     eax
      83 F8 01                         cmp     eax, 1
      7D ??                            jge     loc_401610
    */
    $bwd = {
      0F B6 14 30
      0F B6 4C 30 FF
      81 C2 00 01 00 00
      C1 E2 08
      03 CF
      8A 0C 0A
      88 0C 30
      48
      83 F8 01
      7D ??
    }

  condition:

    pe.is_pe and
    pe.machine == pe.MACHINE_I386 and
    filesize < 2000KB and

    $head and $fwd and $last and $bwd
}

Figure 25. Sample YARA rule illustrating opcode-level detection of the sub_401560 table-chained byte mixer.

Let’s look at how this rule is put together. Rather than trying to describe the entire function byte-for-byte, the rule anchors itself on a few deliberate checkpoints that reflect intent. These anchors line up with the main stages of the transform: setup, forward mixing, a special last-byte step, and the backward mix. We’ll walk through the $head, $fwd, $last, and $bwd sequences in turn, and why each one was chosen.

$head

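The opening bytes aren’t interesting because they save registers or set up a stack frame. They matter because of what happens immediately after. The function bails out if the size is below one, copies the input buffer with memcpy, and then checks the size again: anything other than a one-byte input jumps ahead to the main loops, while a one-byte input falls through to a single table lookup.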

That combination—bulk copy followed by byte-wise handling—is the first signal that this isn’t a standard crypto primitive or library routine. The size check and conditional jump establish the structure of the function, while the register usage (edi as the table/state pointer, esi as the output buffer) stays consistent across builds.

This anchor tells us what kind of routine we’re in before any mixing logic even begins.

$fwd

The forward pass is where the behavior becomes distinctive. Each byte is rewritten using a lookup that depends on the next byte, not just its own value. The sequence of movzx, add 0x100, shl 8, and indexed table access isn’t incidental math—it’s how the code walks a 256×256 lookup table.

This pattern is unlikely to appear in benign code by accident, and it doesn’t resemble common encoders or stream ciphers. Anchoring here captures the neighbor-dependent mixing that defines the routine.
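To make the index arithmetic concrete, here is a minimal Python sketch that mirrors the add 0x100 / shl 8 sequence and compares it with the TABLE_BASE + (256 * cur) + nxt form used in Figure 24. The value TABLE_BASE = 0x10000 is an assumption inferred from the disassembly in Figure 25, and the helper names are made up for illustration.

```python
from typing import Tuple

# Assumption: inferred from "add edx, 100h" followed by "shl edx, 8",
# i.e. (cur + 0x100) << 8 == 0x10000 + 256 * cur.
TABLE_BASE = 0x10000


def asm_style_index(cur: int, nxt: int) -> int:
    """Follow the instruction sequence step by step (offset relative to edi)."""
    edx = cur + 0x100   # add edx, 100h
    edx <<= 8           # shl edx, 8
    return edx + nxt    # mov dl, [ebx+edx] with ebx holding the next byte


def python_style_index(cur: int, nxt: int) -> int:
    """The same lookup written as row-major indexing into a 256x256 table."""
    return TABLE_BASE + (256 * cur) + nxt


def check_all_pairs() -> Tuple[int, int]:
    """Verify both forms agree for every byte pair; return the lowest and highest offset."""
    offsets = []
    for cur in range(256):
        for nxt in range(256):
            assert asm_style_index(cur, nxt) == python_style_index(cur, nxt)
            offsets.append(asm_style_index(cur, nxt))
    return min(offsets), max(offsets)


if __name__ == "__main__":
    lo, hi = check_all_pairs()
    print(f"table offsets span 0x{lo:X}..0x{hi:X}")
```

Under that assumption, the current byte picks a 256-byte row and the neighbor picks the column, so the lookups cover exactly 0x10000 bytes of the state blob, which lines up with the 0x20000 length check in the Python reimplementation.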

$last

The last byte is handled differently, and that difference is deliberate. Instead of using a neighboring byte, the code reads the key byte from [edi+0x20020] (the same byte the Python reimplementation loads as key), XORs it with 0x55, and uses the result as the second index into the table lookup.

This isn’t cleanup logic or bounds handling—it’s a special case baked into the transform. That makes it a strong discriminator: seeing this exact sequence strongly suggests you’re looking at the same routine.
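The constants that make this anchor distinctive are sitting in the bytes themselves. The sketch below pulls them out of the $last sequence by hand, using fixed positions rather than a real disassembler, and rebuilds the last-byte table offset from them. The names LAST_ANCHOR, ROW_BIAS, ROW_SHIFT, and last_byte_index are hypothetical and exist only for this illustration.

```python
import struct

# The $last anchor from Figure 25, instruction by instruction.
LAST_ANCHOR = bytes.fromhex(
    "0FB68720000200"  # movzx eax, byte ptr [edi+20020h]
    "0FB6542EFF"      # movzx edx, byte ptr [esi+ebp-1]
    "83F055"          # xor   eax, 55h
    "81C200010000"    # add   edx, 100h
    "03C7"            # add   eax, edi
    "C1E208"          # shl   edx, 8
    "8A0402"          # mov   al, [edx+eax]
    "88442EFF"        # mov   [esi+ebp-1], al
)

# Fixed positions, valid only for this exact byte sequence.
KEY_OFF = struct.unpack_from("<I", LAST_ANCHOR, 3)[0]    # disp32 of the first movzx
XOR_IMM = LAST_ANCHOR[14]                                # imm8 of the xor
ROW_BIAS = struct.unpack_from("<I", LAST_ANCHOR, 17)[0]  # imm32 of the add
ROW_SHIFT = LAST_ANCHOR[25]                              # imm8 of the shl


def last_byte_index(last: int, key: int) -> int:
    """Table offset (relative to edi) used for the final byte of the buffer."""
    return ((last + ROW_BIAS) << ROW_SHIFT) + (key ^ XOR_IMM)


if __name__ == "__main__":
    print(f"key offset 0x{KEY_OFF:X}, xor 0x{XOR_IMM:02X}, "
          f"row bias 0x{ROW_BIAS:X}, shift {ROW_SHIFT}")
```

The little-endian displacement 20 00 02 00 decodes to 0x20020, and the immediate 0x55 is the salt. Both are values a rebuilt sample would have to keep for the transform to stay compatible with the same table, which is part of why this anchor carries so much weight.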

$bwd

The backward pass runs the same table logic again, but this time it walks the buffer in reverse, pulling in the previous byte instead of the next one. That’s what gives the routine its full shape: a forward sweep, a one-off tweak at the end, and then a second pass back through the data.

Anchoring on this loop helps keep the rule honest. Plenty of code uses a single table-based pass; very little code does it twice, in opposite directions, with the same lookup mechanics. Requiring both $fwd and $bwd makes sure we’re matching the whole transform, not just a convenient slice of it.

As mentioned earlier, the next stage is to start testing the rule. Run a quick grep-style search, follow it up with a retrohunt to see what else the rule pulls in, and validate it against cleanware. From there, adjust the anchors, wildcards, and conditions as needed to balance performance and false positives before using it in any real pipeline.
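For the quick grep-style pass, one option is to turn a hex pattern into a bytes regex and run it over a local buffer before involving YARA at all. This is a rough sketch, not a replacement for the YARA engine: the helper names are invented here, it only understands whole-byte tokens and the ?? wildcard (no nibble wildcards or jumps), and the demo buffer is synthetic rather than taken from the sample.

```python
import re
from typing import List, Pattern, Tuple

# Forward and backward anchors copied from Figure 25.
FWD = ("0F B6 14 30 0F B6 5C 30 01 81 C2 00 01 00 00 "
       "C1 E2 08 03 D7 8A 14 13 88 14 30 40 3B C1 7C ??")
BWD = ("0F B6 14 30 0F B6 4C 30 FF 81 C2 00 01 00 00 "
       "C1 E2 08 03 CF 8A 0C 0A 88 0C 30 48 83 F8 01 7D ??")


def hex_pattern_to_regex(pattern: str) -> Pattern[bytes]:
    """Convert 'AA ?? BB' style tokens into a compiled bytes regex."""
    parts = []
    for token in pattern.split():
        if token == "??":
            parts.append(b".")                          # any single byte
        else:
            parts.append(b"\\x%02x" % int(token, 16))   # literal byte as \xhh
    # DOTALL so the wildcard also matches the newline byte 0x0A.
    return re.compile(b"".join(parts), re.DOTALL)


def find_anchor(data: bytes, pattern: str) -> List[int]:
    """Return every offset in data where the pattern matches."""
    return [m.start() for m in hex_pattern_to_regex(pattern).finditer(data)]


def demo() -> Tuple[List[int], List[int]]:
    """Plant both anchors in a synthetic buffer and locate them again."""
    # The jump displacements are placeholders standing in for the wildcard bytes.
    fwd_bytes = bytes.fromhex(FWD.replace("??", "E1"))
    bwd_bytes = bytes.fromhex(BWD.replace("??", "E0"))
    blob = b"\x90" * 16 + fwd_bytes + b"\xCC" * 8 + bwd_bytes
    return find_anchor(blob, FWD), find_anchor(blob, BWD)


if __name__ == "__main__":
    fwd_hits, bwd_hits = demo()
    print("fwd offsets:", fwd_hits)
    print("bwd offsets:", bwd_hits)
```

Swapping the synthetic blob for the bytes of a real file gives a fast sanity check that each anchor lands where the disassembly says it should, and that the forward and backward loops are matched separately rather than by one overly loose pattern.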

Detection Considerations for Non-Custom Packers

This kind of opcode-level logic does not translate directly to common, non-custom packers like UPX, ASPack, or similar tools that are routinely used by legitimate software. Writing a static YARA rule against the unpacking stub of these packers will almost always produce false positives, because the stub is shared across thousands of clean binaries.

In those cases, the packer itself is not the signal. It only becomes relevant when it’s paired with malicious behavior downstream.

To handle this, most AV and EDR engines don’t scan the packed bytes in isolation. Instead, they unpack the file first—either through emulation or during execution—and then apply static and behavioral detection to the unpacked code. That’s where rules become meaningful: they match on the post-unpack logic, not the generic wrapper.

The trade-off is performance. Unpacking, emulating, and rescanning code is significantly heavier than a straight static scan. Engines have to decide when that cost is justified, which is why generic packers are usually tolerated unless other signals push the file down a deeper inspection path.

Custom packers don’t get that treatment. Their unpacking logic is unique, reusable across samples, and tightly coupled to the malware itself—making function-level static detection both safer and cheaper in comparison.

We’ll stop here for now. Part 2 will look at detection once known encryption algorithms replace custom routines.

References

VirusTotal Documentation – https://docs.virustotal.com/

YARA Documentation – https://yara.readthedocs.io/en/latest/

VirusShare – https://virusshare.com/
