Author: Valli-Nayagam Chokkalingam

The Algorithm Is Not the Detection
In Part 1, everything lived inside custom packer logic: not “malware strings,” not obvious config blobs—just the boring reality of a loader trying to stay unreadable. High entropy, dead strings, and a small decryptor sitting in the middle like a locked door. Part 2 stays in that same lane, but shifts the focus from handcrafted XOR/NOT tricks to known decryption patterns that packers love to reuse—RC4-style state shuffles, TEA/XTEA-looking round loops, tiny block mixers that scream “decrypt me” the moment you see them in IDA. The trap is that recognizing an algorithm isn’t detection by itself. RC4 or TEA showing up in a binary doesn’t make it malicious—it just means someone used crypto, and plenty of clean software does too. What actually matters is how the packer uses it: the fixed constants, the hardcoded key material, the staging around the routine, and the repeatable byte-level fingerprints that survive even when the encrypted payload changes. This part is about writing YARA for that reality—where the algorithm is only the starting point, and the real signal is the custom glue around it.
Detect It Easy: Static Analysis First
Before IDA, before control-flow graphs, before chasing logic that may not even be real yet, the first stop is Detect It Easy. Not because it “detects malware.” Because it detects deception. In this case, the target is a Loki Stealer sample pulled from VirusShare:
SHA256: 0b416446c098203de4b550714e69a2715ed1c2127a4db54f3d46b47cd2d9a2be
Packed samples don’t win by being clever—they win by being unreadable. And the fastest way to lose time is to start reversing a wrapper while believing it’s the payload.
DIE makes that boundary obvious. The file shows up as a PE32 (32-bit GUI) compiled with Microsoft tooling—normal enough on the surface. But the real message is the heuristic flag:
(Heur)Packer: Compressed or packed data — .text section compressed
That line is a warning label. .text isn’t supposed to be compressed. .text is supposed to be instructions. When it looks packed, it’s rarely “optimization.” It’s almost always staging: a small loader sitting in front, hiding the real code behind an unpack step.

Figure 1. Detect It Easy confirms this Loki Stealer sample is packed—.text is compressed, meaning the “real code” is still hiding behind a loader stage.
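One way to sanity-check a "compressed section" verdict by hand is to measure byte entropy. The sketch below is an assumption about how such a heuristic could work, not a description of what Detect It Easy does internally, and the two buffers are synthetic stand-ins rather than bytes from the sample.

```python
import math
import os
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Return Shannon entropy in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


# Synthetic stand-ins (hypothetical data, not taken from the Loki sample).
# A short, repeated instruction-like pattern has low entropy.
code_like = bytes([0x55, 0x8B, 0xEC, 0x83, 0xEC, 0x10, 0x5D, 0xC3]) * 512
# Random bytes approximate what encrypted or compressed data looks like.
packed_like = os.urandom(4096)

print(f"code-like bytes:   {shannon_entropy(code_like):.2f} bits/byte")
print(f"packed-like bytes: {shannon_entropy(packed_like):.2f} bits/byte")
```

Real compiled code sits well above the toy pattern used here, so any cutoff chosen for "looks packed" is a judgment call rather than a fixed rule.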
The Strings view backs it up in a quiet way. No domains. No config. No obvious paths. No flashy indicators. Just Windows APIs—the boring ones—the kind that show up when a binary doesn’t want to carry personality on disk:
GlobalAlloc, VirtualAlloc, VirtualProtect
None of these are malicious on their own. Clean software uses them constantly. But together they sketch the familiar outline of a loader: allocate memory, rebuild something into that buffer, flip protections so it can execute, and then hand off control. The absence of “malware strings” isn’t a gap here—it’s the design.
This is why static triage matters before deeper reversing. DIE isn’t the answer. It’s the direction. It doesn’t say what the payload does. It says the payload isn’t visible yet—and the only stable surface is the glue that survives every rebuild.

Figure 2. The Strings tab stays intentionally empty of personality—just loader-grade APIs like GlobalAlloc, VirtualAlloc, and VirtualProtect, hinting at memory staging rather than readable intent.
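The "allocate, protect, hand off" outline can be checked mechanically with a plain substring scan. This is a minimal sketch: the helper names are hypothetical, the input is a synthetic byte string, and a hit means "worth a closer look", never "malicious".

```python
# API names called out in the Strings view above.
LOADER_APIS = (b"GlobalAlloc", b"VirtualAlloc", b"VirtualProtect")


def loader_api_hits(data: bytes) -> dict:
    """Map each API name to whether it appears anywhere in the raw bytes.

    Note: this is a substring test, so b"VirtualAlloc" also matches
    names such as VirtualAllocEx.
    """
    return {name.decode(): name in data for name in LOADER_APIS}


def looks_like_staging(data: bytes) -> bool:
    """True only when all three staging APIs are present together."""
    return all(loader_api_hits(data).values())


# Synthetic import-table-like blob (hypothetical, not from the sample).
blob = b"\x00KERNEL32.dll\x00GlobalAlloc\x00VirtualAlloc\x00VirtualProtect\x00"
print(loader_api_hits(blob))
print("staging outline present:", looks_like_staging(blob))
```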
Follow the Memory: GlobalAlloc as the First Breadcrumb
Once the file screams “packed,” the next step isn’t to read code like a story. It’s to follow infrastructure—the boring APIs the stub can’t live without. And memory allocation is always one of the first tells, because unpacking needs somewhere to build the real payload. From there, the natural next move is simple: open it in IDA and start tracing those allocation calls into the unpacking flow.
In the Imports view, GlobalAlloc stands out immediately. Not because it’s rare, but because it’s useful. A loader doesn’t allocate memory for fun—it allocates memory because something is about to be staged, decrypted, decompressed, or reshaped into execution-ready bytes.

Figure 3. The import table exposes GlobalAlloc—a clean pivot into the unpacking flow.
So instead of guessing where the unpacker lives, the workflow becomes mechanical:
pick the allocation API → jump to XREFs → land inside the staging logic.
Following the cross-references to GlobalAlloc drops straight into a tiny helper routine: sub_403220. One clean call, one returned buffer, one pointer getting saved. No payload logic, no drama—just setup. That’s exactly what it is, so sub_403220 gets renamed to GlobalAllocWrapper.

Figure 4. The GlobalAlloc XREF is the first crack in the wrapper—pointing straight into sub_403220.

Figure 5. A one-liner allocator stub: GlobalAlloc in, pointer out—nothing but memory prep for the next stage.
Cross-referencing GlobalAllocWrapper leads to 0x403240, the routine that actually uses the allocated memory. This is where “packed” stops being a label and starts being behavior: staging bytes into memory in a way that only makes sense if the next step is execution. Right after allocation, the code calls VirtualProtect(lpAddress, dwSize, 0x40, …), and 0x40 is PAGE_EXECUTE_READWRITE, so the region becomes readable, writable, and executable at once. Then a tight for loop walks byte-by-byte, copying data from an embedded blob (dword_2B92E24 + 72475) straight into that buffer. No readable intent, no strings coming back to life, just the next stage being rebuilt in memory, one byte at a time. That’s why 0x403240 gets renamed to UnpackPayloadToExecutableMemoryAndRun. The random-looking API calls inside the loop (GetModuleHandleW, SetColorAdjustment, CreateMemoryResourceNotification, etc.) aren’t “features”; they’re junk API padding: cheap clutter thrown in to break patterns, confuse analysts, and make the unpacker look busier than it really is.

Figure 6. Cross-referencing GlobalAllocWrapper leads straight into 0x403240—the real staging routine where unpacking starts to take shape.

Figure 7. Inside 0x403240, the loader flips the allocated region to RWX (VirtualProtect 0x40) and reconstructs the next stage byte-by-byte in a tight copy loop.
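The 0x40 passed to VirtualProtect is easier to reason about once it is decoded. The table below uses the standard Windows memory-protection constants; the helper function and its name are illustrative only.

```python
# Base Windows memory-protection constants (low byte of flNewProtect).
PAGE_PROTECTIONS = {
    0x01: "PAGE_NOACCESS",
    0x02: "PAGE_READONLY",
    0x04: "PAGE_READWRITE",
    0x08: "PAGE_WRITECOPY",
    0x10: "PAGE_EXECUTE",
    0x20: "PAGE_EXECUTE_READ",
    0x40: "PAGE_EXECUTE_READWRITE",
    0x80: "PAGE_EXECUTE_WRITECOPY",
}

# Protections that allow writing and executing the same pages.
WRITE_AND_EXECUTE = (0x40, 0x80)


def describe_protection(value: int):
    """Return (constant name, writable-and-executable flag) for a protect value."""
    base = value & 0xFF  # ignore modifier bits such as PAGE_GUARD (0x100)
    name = PAGE_PROTECTIONS.get(base, f"unknown(0x{base:02X})")
    return name, base in WRITE_AND_EXECUTE


name, rwx = describe_protection(0x40)
print(name, "| writable and executable:", rwx)
```

Writable-and-executable memory is what lets the stub keep rewriting a buffer it later jumps into, which is why this value is worth noticing during triage.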
And the placement seals it: UnpackPayloadToExecutableMemoryAndRun is the last call made inside WinMain. The program doesn’t end with application logic. It ends with a hand-off. The stub runs just long enough to allocate memory, rewrite it, mark it executable, and pass execution forward into whatever it just assembled.

Figure 8. XREFs pin UnpackPayloadToExecutableMemoryAndRun directly back to WinMain.

Figure 9. WinMain doesn’t do much else—its final move is calling UnpackPayloadToExecutableMemoryAndRun, the hand-off into the real stage.
Identifying the Decryption Logic
The staging routine (UnpackPayloadToExecutableMemoryAndRun) already did the loud part: allocate memory, flip it to RWX, and rebuild a blob into lpAddress. That answers the where. The next question is the one that actually matters:
what happens to that buffer after it’s built?
That’s where lpAddress becomes more than a variable—it becomes a trail.
Cross-referencing lpAddress shows it being pulled into another routine (sub_4030E0), where the pointer is treated like a working buffer: v1 = (char *)lpAddress; and then processed in a loop. This is the moment the sample stops “copying bytes” and starts doing something to them.
And right in the middle of that loop, the real pivot appears: a call to sub_402F00(v1), repeated as the pointer moves forward (v1 += 8). That stride isn’t accidental. Eight bytes at a time is block territory—exactly the size you’d expect when something is being transformed in 64-bit chunks instead of raw stream decoding.

Figure 10. XREFs to lpAddress reveal where the unpacked buffer gets consumed next—leading straight into the routine that starts transforming it, not just storing it.

Figure 11. sub_4030E0 grabs lpAddress, walks it in 8-byte steps, and funnels each block into sub_402F00—the first real “decrypt this” pivot.
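The 8-byte stride is easy to model: read two little-endian DWORDs, transform them, write them back, advance by eight. The sketch below is an assumption about the general shape, not a decompilation of sub_4030E0. The placeholder transform only swaps the two halves, and leaving a trailing partial block untouched is a choice made here for illustration.

```python
import struct

BLOCK_SIZE = 8  # two 32-bit words per block


def process_in_blocks(buf: bytearray, block_fn) -> None:
    """Apply block_fn(v0, v1) to every full 8-byte block of buf, in place."""
    usable = len(buf) - (len(buf) % BLOCK_SIZE)
    for offset in range(0, usable, BLOCK_SIZE):
        v0, v1 = struct.unpack_from("<2I", buf, offset)
        v0, v1 = block_fn(v0, v1)
        struct.pack_into("<2I", buf, offset, v0 & 0xFFFFFFFF, v1 & 0xFFFFFFFF)


def swap_halves(v0: int, v1: int):
    """Placeholder transform standing in for the real per-block routine."""
    return v1, v0


buffer = bytearray(range(20))  # 2 full blocks + 4 trailing bytes
process_in_blocks(buffer, swap_halves)
print(buffer.hex(" "))
```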
Once inside sub_402F00, the shape is unmistakable. Shifts. XORs. Adds. The same variables being mixed again and again. A constant-looking shift count (v10 = 5) driving repetitive work. It reads like a block-mixing routine because that’s what it is: the kind of tight arithmetic loop packers love because it’s small, fast, and doesn’t need any strings to function.
That’s where “TEA-like” stops being a vibe and becomes structure. The routine doesn’t need to be a perfect textbook implementation to give itself away—the round posture is there: repeated mixing, XOR chaining, shift-heavy math, and a consistent 8-byte block stride. TEA-family logic can exist in clean software too, but in a packed loader pipeline like this, it’s not decoration. It’s the engine.

Figure 12. Inside sub_402F00, the decryption shows its shape—shift/XOR/add mixing repeated in tight rounds, the kind of math loop packers can’t hide behind.
Then the constant shows up.
0x9E3779B9.
That number isn’t random filler. It’s one of those crypto fingerprints that refuses to stay quiet: the TEA delta, derived from the golden ratio (2^32 divided by φ, truncated to 32 bits) and used across TEA-style designs. The moment it appears alongside the shift/XOR mixing pattern, the routine stops being “maybe decryption” and becomes extremely specific. Strings can disappear. Imports can be reshuffled. Junk APIs can be sprayed everywhere. But the math still has to work, and constants like this tend to survive rebuilds untouched. So the decryption logic isn’t hiding in some giant function with a friendly name. It sits in a small arithmetic routine, and a single constant gives it away.

Figure 13. sub_402F00 hardcodes 0x9E3779B9—the TEA/XTEA “golden ratio” constant that quietly fingerprints the decryptor.

Figure 14. A quick search maps 0x9E3779B9 back to TEA—the golden ratio constant that gives the loop away.
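The constant can be reproduced rather than memorized. The sketch below derives it from the golden ratio and prints the little-endian bytes a hex search would need. The last two values are related forms that standard TEA code and compilers can produce; whether this particular sample contains them is not something the analysis above establishes.

```python
import struct

GOLDEN_RATIO = (1 + 5 ** 0.5) / 2

# floor(2^32 / phi) gives the TEA delta.
delta = int(2 ** 32 / GOLDEN_RATIO)

# Byte order as it appears in a little-endian x86 binary.
delta_le = struct.pack("<I", delta)

# Related forms worth knowing when hunting (assumptions, not seen in the sample):
# the two's-complement negation, which appears when a subtraction is emitted
# as an addition, and delta * 32, the starting sum for 32-round decryption.
neg_delta = (-delta) & 0xFFFFFFFF
start_sum_32_rounds = (delta * 32) & 0xFFFFFFFF

print(f"delta          = 0x{delta:08X}")
print(f"little-endian  = {delta_le.hex(' ').upper()}")
print(f"negated        = 0x{neg_delta:08X}")
print(f"delta * 32     = 0x{start_sum_32_rounds:08X}")
```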
At that point, verification becomes boring, in a good way. The same constant and mixing shape show up cleanly in public TEA reference code (for example, the implementation in tea.c here – https://github.com/coderarjob/tea-c/blob/master/tea.c). And if the goal is to sanity-check fast instead of debating patterns by eye, tools like FindCrypt can do the constant-hunting automatically: findcrypt.py will label common crypto constants and point straight at the routine addresses it matches.

Figure 15. A public TEA reference (tea.c) shows the same 0x9E3779B9 delta and shift/XOR mixing pattern.
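For side-by-side comparison with a decompiled routine, here is textbook TEA in Python. It is a sketch of the published algorithm, not a reconstruction of sub_402F00, which may deviate from it. The key and plaintext are arbitrary demo values, and the only thing checked is that decryption undoes encryption.

```python
MASK32 = 0xFFFFFFFF
DELTA = 0x9E3779B9
ROUNDS = 32


def tea_encrypt_block(v0: int, v1: int, key):
    """Textbook TEA encryption of one 64-bit block (two 32-bit words)."""
    k0, k1, k2, k3 = key
    total = 0
    for _ in range(ROUNDS):
        total = (total + DELTA) & MASK32
        v0 = (v0 + (((v1 << 4) + k0) ^ (v1 + total) ^ ((v1 >> 5) + k1))) & MASK32
        v1 = (v1 + (((v0 << 4) + k2) ^ (v0 + total) ^ ((v0 >> 5) + k3))) & MASK32
    return v0, v1


def tea_decrypt_block(v0: int, v1: int, key):
    """Textbook TEA decryption: same mixing, run backwards from DELTA * ROUNDS."""
    k0, k1, k2, k3 = key
    total = (DELTA * ROUNDS) & MASK32
    for _ in range(ROUNDS):
        v1 = (v1 - (((v0 << 4) + k2) ^ (v0 + total) ^ ((v0 >> 5) + k3))) & MASK32
        v0 = (v0 - (((v1 << 4) + k0) ^ (v1 + total) ^ ((v1 >> 5) + k1))) & MASK32
        total = (total - DELTA) & MASK32
    return v0, v1


# Arbitrary demo values (hypothetical, unrelated to the sample).
demo_key = (0x01234567, 0x89ABCDEF, 0xFEDCBA98, 0x76543210)
plain = (0xDEADBEEF, 0x0BADF00D)

cipher = tea_encrypt_block(*plain, demo_key)
restored = tea_decrypt_block(*cipher, demo_key)
print("round trip ok:", restored == plain)
```

The (v << 4), (v >> 5), and delta terms in each round are the same three ingredients the hunt above keys on.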
But that familiarity is exactly the trap. TEA-style loops aren’t rare, and writing a detection rule around “TEA exists” is the fastest way to gift yourself false positives. So the focus shifts to what isn’t generic: the custom glue around the algorithm—especially the parts the author had to choose. In this case, that means the key material and the way it’s staged and referenced. The algorithm can be common. The key almost never is.
sub_4030E0 is renamed to TEA_ProcessBuffer_8ByteBlocks, and sub_402F00 is renamed to TEA_LikeDecryptBlock to reflect the TEA-style decryption flow.
Identifying the Key
Once you’ve seen 0x9E3779B9 sitting inside a 32-round shift/XOR/add loop, the algorithm stops being a debate. It’s TEA-family math. Close enough that the function could change clothes, switch variable names, sprinkle junk calls, and still keep the same posture.
But that’s also where the real problem starts.
Because detecting “TEA exists” is worthless on its own. TEA-like loops show up in clean software, in academic code, in legitimate packers, and in every copy-paste loader written by someone who spent ten minutes on GitHub. The algorithm is reusable. The implementation pattern is reusable.
The key isn’t.
And that’s the shift: once the decryptor is identified, the next thing you hunt is the part the author actually had to choose.
The key isn’t passed in. It’s baked into globals.
Textbook TEA hands you a clean uint32_t k[4]. This sample doesn’t. It hides the key material where analysts hate looking: globals that feel like boring state.

Figure 16. The decryptor loads its 4-DWORD key straight from global memory (dword_425CA8/425CAC/425CB0/425CB4)—the moment the routine stops looking generic and starts revealing the fingerprint behind the TEA-style loop.
Right at the top of the routine you can see the four values getting pulled in:
dword_425CA8
dword_425CAC
dword_425CB0
dword_425CB4
They get loaded into working variables once, then used repeatedly inside the round logic.

Figure 17. Inside the round loop, the decryptor starts mixing the renamed key variables (v2, v13, v16, v15 → Key_Dword0–Key_Dword3) into the shift/XOR/add math that drives each 8-byte block transformation.
Once those four values are found, everything gets simpler.
You don’t need to obsess over whether the implementation is TEA, XTEA, “TEA-ish,” or a custom remix. That argument dies the moment you realize you’re not trying to detect crypto.
You’re trying to detect ownership.
And keys are ownership.
Most samples can borrow an algorithm. Most samples don’t share the same 128-bit key sitting in .data.

Figure 18. Four hardcoded DWORDs in .data (dword_425CA8–425CB4)—the key material the decryptor is built around.
Writing the YARA Rule
import "pe"
rule PackedStub_TEA_KeyMaterial_4DWORD_ShiftMixer_v3_proximity_fixed3
{
meta:
description = "Demo-only: opcode-level YARA rule for detecting a TEA-like decryptor stub using embedded 4-DWORD key material + TEA delta + TEA-style shift mixer (with proximity window)"
author = "AdversaryCraft"
sample_sha256 = "0b416446c098203de4b550714e69a2715ed1c2127a4db54f3d46b47cd2d9a2be"
note = "Illustrates decryptor's key-centric detection when plaintext strings are absent (not production-ready)"
strings:
/*
===== KEY MATERIAL (4 DWORDs, little-endian) =====
Bytes Meaning
------------------------------------------------------------
16 85 9E 04 k0 = 0x049E8516
DF 08 EB C1 k1 = 0xC1EB08DF
38 39 CE 43 k2 = 0x43CE3938
86 D2 0F 8D k3 = 0x8D0FD286
Full 16-byte blob (contiguous):
16 85 9E 04 DF 08 EB C1 38 39 CE 43 86 D2 0F 8D
*/
$keyblob = { 16 85 9E 04 DF 08 EB C1 38 39 CE 43 86 D2 0F 8D }
/*
===== TEA DELTA =====
Bytes Disassembly / Meaning
------------------------------------------------------------
B9 79 37 9E 0x9E3779B9 (little-endian)
*/
$delta = { B9 79 37 9E }
/*
===== TEA-LIKE SHIFT MIXER HINTS (opcode-only) =====
TEA/XTEA round mixing commonly includes:
(v << 4) and (v >> 5)
We enumerate the concrete encodings for:
shl r32, 4 => C1 E0..E7 04 (reg varies)
shr r32, 5 => C1 E8..EF 05 (reg varies)
Bytes Disassembly
------------------------------------------------------------
C1 E0 04 shl eax, 4
C1 E1 04 shl ecx, 4
C1 E2 04 shl edx, 4
C1 E3 04 shl ebx, 4
C1 E4 04 shl esp, 4
C1 E5 04 shl ebp, 4
C1 E6 04 shl esi, 4
C1 E7 04 shl edi, 4
C1 E8 05 shr eax, 5
C1 E9 05 shr ecx, 5
C1 EA 05 shr edx, 5
C1 EB 05 shr ebx, 5
C1 EC 05 shr esp, 5
C1 ED 05 shr ebp, 5
C1 EE 05 shr esi, 5
C1 EF 05 shr edi, 5
*/
$shl4_0 = { C1 E0 04 } $shl4_1 = { C1 E1 04 }
$shl4_2 = { C1 E2 04 } $shl4_3 = { C1 E3 04 }
$shl4_4 = { C1 E4 04 } $shl4_5 = { C1 E5 04 }
$shl4_6 = { C1 E6 04 } $shl4_7 = { C1 E7 04 }
$shr5_0 = { C1 E8 05 } $shr5_1 = { C1 E9 05 }
$shr5_2 = { C1 EA 05 } $shr5_3 = { C1 EB 05 }
$shr5_4 = { C1 EC 05 } $shr5_5 = { C1 ED 05 }
$shr5_6 = { C1 EE 05 } $shr5_7 = { C1 EF 05 }
condition:
pe.is_pe and
pe.machine == pe.MACHINE_I386 and
filesize < 2000KB and
// Strong anchors: unique key blob + TEA delta
$keyblob and $delta and
// Must see at least one (v<<4) and one (v>>5) style shift
(1 of ($shl4_*)) and
(1 of ($shr5_*)) and
// Proximity: require shifts to occur near the delta (delta ± 0x400 bytes)
// Note: @delta and the bare @ each resolve to the first match of that string
(
for any of ($shl4_*) : (
@ >= (@delta - 0x400) and @ <= (@delta + 0x400)
)
) and
(
for any of ($shr5_*) : (
@ >= (@delta - 0x400) and @ <= (@delta + 0x400)
)
)
}
Figure 19. YARA rule detecting the TEA-like decryptor stub via embedded key material, delta constant, and shift/mix opcode patterns (left/right shifts).
Let’s look at how this TEA-stub rule is structured. It’s not trying to fingerprint a whole function end-to-end. It’s doing something more practical: pinning down the few things a TEA-like decryptor can’t hide without rewriting itself. This rule is built around three deliberate checkpoints:
1) The embedded 16-byte key blob
2) The TEA delta constant (0x9E3779B9)
3) The shift-mixer fingerprints (<<4 and >>5) — with a proximity window so the shifts aren’t “anywhere in the binary”, they’re near the crypto logic
That’s the difference between a rule that “matches something with bitshifts” and a rule that matches a decryptor.
$keyblob (the real identity)
The strongest anchor in this rule is the key material.
16 85 9E 04 DF 08 EB C1 38 39 CE 43 86 D2 0F 8D
Those are not “random bytes” in the file. In the context of TEA-style routines, this is the part that matters most: the key is the author’s fingerprint. TEA/XTEA as algorithms are boring; they exist everywhere. But this specific key does not. That’s why the rule doesn’t match the four key DWORDs as separate strings. It matches them as one contiguous $keyblob. Contiguity reduces accidental collisions, because one exact 16-byte run is far less likely to occur by chance than four unrelated 4-byte hits scattered across a file. The trade-off is that the match depends on the key being stored in this order with nothing in between, so an author who reorders or splits the DWORDs would slip past it. If the decryptor is recompiled, the stack frame changes. If the decryptor is optimized, registers change. But that key blob? It usually stays exactly the same unless the operator rotates it. So in practice: $keyblob is your “who”.
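Going from the four DWORD values in the rule's comment block to the byte string YARA needs is a little-endian packing step. A minimal sketch, with hypothetical helper names:

```python
import struct

# The four key DWORDs listed in the rule's comment block (k0..k3).
KEY_DWORDS = (0x049E8516, 0xC1EB08DF, 0x43CE3938, 0x8D0FD286)


def dwords_to_blob(dwords) -> bytes:
    """Pack 32-bit values little-endian, in the order given."""
    return struct.pack(f"<{len(dwords)}I", *dwords)


def to_yara_hex(blob: bytes) -> str:
    """Format raw bytes as a YARA hex string body."""
    return "{ " + " ".join(f"{b:02X}" for b in blob) + " }"


keyblob = dwords_to_blob(KEY_DWORDS)
print("$keyblob =", to_yara_hex(keyblob))
```

Packing in a different DWORD order yields a different 16-byte pattern, which is exactly the ordering dependence a contiguous match carries.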
$delta (the TEA tell)
TEA-like mixing almost always drags the delta constant into the routine:
B9 79 37 9E => 0x9E3779B9 (little endian)
This delta is a recognizable artifact of the TEA family. It’s not “proof” by itself — constants can appear anywhere — but when it’s present alongside a hardcoded 16-byte key it stops being random and starts being intent. That’s why this rule treats $delta as a required anchor rather than a bonus string.
In practice: $delta is your “what family”.
$shl4_* and $shr5_* (the mixer shape)
TEA mixing patterns commonly use:
(v << 4) (v >> 5)
That exact “4 and 5” pairing shows up so often that it becomes a shape, not just an instruction.
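But here’s the nuance: if we only matched C1 E0 04 (shl eax,4) and C1 EE 05 (shr esi,5), it would be brittle. Different builds choose different registers.
So the rule expands that into register-agnostic families, one concrete encoding per register:
C1 E0..E7 04 for shl r32,4
C1 E8..EF 05 for shr r32,5
A plain nibble wildcard (C1 E? 04) would be looser than intended, because E8..EF with the same immediate encodes shr r32,4 rather than shl.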
That’s why you see $shl4_0 .. $shl4_7 and $shr5_0 .. $shr5_7. Same intent. Different register choices. Still caught. In practice: the shifts are your “how it mixes”.
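The sixteen encodings do not have to be typed by hand. They follow from the x86 ModRM layout for the C1 shift group: mod = 11 for a register operand, an opcode extension of /4 for shl or /5 for shr, and the low three bits selecting the register. A small generator, with illustrative names:

```python
# x86 32-bit register numbers in encoding order.
REGS = ("eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi")

SHL_EXT = 4  # C1 /4 ib : shl r/m32, imm8
SHR_EXT = 5  # C1 /5 ib : shr r/m32, imm8


def shift_imm8(opcode_ext: int, reg_index: int, imm: int) -> bytes:
    """Encode 'C1 /ext ib' with a register operand (ModRM mod = 11)."""
    modrm = 0xC0 | (opcode_ext << 3) | reg_index
    return bytes([0xC1, modrm, imm])


shl4 = {reg: shift_imm8(SHL_EXT, i, 4) for i, reg in enumerate(REGS)}
shr5 = {reg: shift_imm8(SHR_EXT, i, 5) for i, reg in enumerate(REGS)}

for i, reg in enumerate(REGS):
    print(f"$shl4_{i} = {{ {shl4[reg].hex(' ').upper()} }}  // shl {reg}, 4")
for i, reg in enumerate(REGS):
    print(f"$shr5_{i} = {{ {shr5[reg].hex(' ').upper()} }}  // shr {reg}, 5")
```

The esp variants are valid encodings but unlikely in compiler output, so keeping them in the rule costs little and changes little.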
Proximity window (where the rule becomes “crypto-aware”)
This is the part that upgrades it from “pattern matching” to “context matching”. The rule doesn’t just require:
one shl ?,4
one shr ?,5
It requires that these shifts occur near the delta constant:
@delta ± 0x400 bytes
That constraint is doing a lot of heavy lifting, because without it those bitshifts could be coming from hash routines, bitfield parsing, bitmap operations, compression, UI rendering code, basically anything. But when the shifts show up within 0x400 (1,024) bytes of the TEA delta, it strongly suggests you’re looking at the TEA-like round logic area, not random code elsewhere. This is what keeps the rule contextual, not accidental.
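The window check can be modeled outside YARA to see how it behaves. The sketch below assumes that `@delta` and the bare `@` inside the loop each resolve to a string's first match, which is how YARA documents the un-indexed offset form, and compares that with a variant that considers every occurrence. The buffers are synthetic and the function names are hypothetical.

```python
DELTA_LE = bytes.fromhex("B979379E")   # 0x9E3779B9, little-endian
SHL_EAX_4 = bytes.fromhex("C1E004")    # shl eax, 4
WINDOW = 0x400


def all_offsets(data: bytes, pattern: bytes):
    """Return every offset where pattern occurs (overlaps allowed)."""
    offsets, start = [], 0
    while (idx := data.find(pattern, start)) >= 0:
        offsets.append(idx)
        start = idx + 1
    return offsets


def first_match_near(data, anchor, patterns, window=WINDOW) -> bool:
    """First occurrence of any pattern within +/- window of the first anchor."""
    anchor_at = data.find(anchor)
    if anchor_at < 0:
        return False
    firsts = (data.find(p) for p in patterns)
    return any(at >= 0 and abs(at - anchor_at) <= window for at in firsts)


def any_match_near(data, anchor, patterns, window=WINDOW) -> bool:
    """Any occurrence of any pattern within +/- window of any anchor."""
    anchors = all_offsets(data, anchor)
    return any(
        abs(at - a) <= window
        for p in patterns for at in all_offsets(data, p) for a in anchors
    )


def build(size, placements) -> bytes:
    """Create a NOP-filled buffer with patterns written at given offsets."""
    buf = bytearray(b"\x90" * size)
    for offset, pattern in placements:
        buf[offset:offset + len(pattern)] = pattern
    return bytes(buf)


close = build(0x800, [(0x100, SHL_EAX_4), (0x200, DELTA_LE)])
far = build(0x1100, [(0x000, SHL_EAX_4), (0x1000, DELTA_LE)])
early_decoy = build(0x1100, [(0x000, SHL_EAX_4), (0xFF0, SHL_EAX_4), (0x1000, DELTA_LE)])

for label, data in (("close", close), ("far", far), ("early decoy", early_decoy)):
    print(label, first_match_near(data, DELTA_LE, [SHL_EAX_4]),
          any_match_near(data, DELTA_LE, [SHL_EAX_4]))
```

Under the first-match reading, an unrelated shift early in the file can mask a genuine one next to the delta, which is worth keeping in mind if the rule misses a sample it should catch.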
Testing the Rule
Before this rule goes anywhere near a real scan, the first check is always noise: does it light up on clean software? The quickest sanity pass is a VirusTotal grep-style sweep using the hard anchors from the rule — the 16-byte key blob, the TEA delta (0x9E3779B9), and the TEA-like shift mixer hints (shl 4 / shr 5). If the search stays quiet (positives: 0 or close), it’s a strong sign the pattern isn’t just matching generic compiler output.
After that, testing moves to Retrohunt (same workflow as Part 1): run the rule against a goodware-biased corpus for restraint, then against the default corpus to see whether it surfaces related binaries.
We’ll stop here for now. Part 3 will look at what happens when imports disappear too — API names replaced with hashing routines, and everything resolved at runtime just to stay unreadable.
References
TEA C Implementation – https://github.com/coderarjob/tea-c
YARA Documentation – https://yara.readthedocs.io/en/latest/
VirusShare – https://virusshare.com/
