Table of Contents
List of Tables
Table of Contents
Thank you for purchasing BinDiff, the leading executable-comparison tool for reverse engineers that need to analyze patches, malware variants, or are generally interested in the differences between two executables. This manual is intended to help you to get up to speed quickly.
In order to make best use of BinDiff, it is very helpful to spend a bit of time familiarizing yourself with the concepts and algorithms behind BinDiff. For this, we recommend reading Chapter 3, Understanding BinDiff, which explains the behind-the-scenes, Chapter 4, Core Functionality, which explains the basic elements of the user interface. Ideally, we would like you to also read Chapter 6, A basic walk-through Analyzing a Microsoft Patch (a walkthrough through analyzing a simple security update) and Chapter 7, Advanced Usage (a walk-through through porting your symbols and comments from one disassembly to the next). Don't worry - after Chapter 3, Understanding BinDiff, there's not a lot of text to parse and mostly screenshots to look at.
We hope that you have a great experience using our product!
The following typographical conventions are used in this document:
Constant
width
Constant width
italic
We have added two significant new features to zynamics BinDiff:
Many changes have occured between versions 1.5 and 1.6, some very visible, others revolving around the core functionality.
New methods to generate initial fixedpoints:
New fully CPU-independent mode for not explicitly supported processors (used by default if the CPU is not supported). This means that all CPUs that IDA supports are supported using this default mode. However, CPUs with conditional execution (IA64, ARM) may yield suboptimal results.
Please contact <zynamics-support@google.com>
to let us know of any processors that you would
like to have explicit support for, as using
explicitly supported CPUs result in higher-quality
matches.
Table of Contents
Since BinDiff is an add-on product to IDA, it requires a working installation of Hex-Rays IDA Pro version 7.4 of above Please note that support for legacy versions of IDA Pro has been discontinued.
Alternatively, BinDiff ships with beta quality plugins/extensions for both Binary Ninja and Ghidra.
To install BinDiff to your workstation, perform the following steps:
bindiff8.msi
.On the initial Welcome page, click
. You are asked to read and confirm the Apache 2.0 License.To accept the License Agreement, check I accept the terms in the License Agreement and click .
On the Custom Setup page, you can select the features you want to install. By default, all product features will be installed. You can remove a feature by clicking the icon next to it and select .
To change the location where BinDiff gets installed, select BinDiff and click the button. The Change destination folder dialog appears, allowing you to enter a custom installation path. Click to confirm your choice and close the dialog.
Once you are done customizing Setup, click
.By default, customers of BinDiff receive Debian GNU/Linux installation packages. The following Linux distributions are supported:
To request an installation package/installation instructions for a distribution other than Debian or Ubuntu, please file a bug in the BinDiff Issue Tracker. The remainder of this section assumes that the distribution you are installing BinDiff on Debian GNU/Linux or Ubuntu.
To install BinDiff to your workstation, perform the following steps:
bindiff_8_amd64.deb
.
The package is GPG signed with Google's "Linux Packages Signing Authority" key.
Verify as follows:
gpg --recv-key 7721F63BD38B4796 gpg --verify bindiff_8_amd64.deb.asc
To install BinDiff to your Mac, perform the following steps:
Download the macOS disk image from the
releases page:
Download BinDiff8.dmg
.
The disk image and all binaries are code-signed and notarized, so macOS should be able to install without further messages/warnings.
On the initial Welcome page, click
. You are asked to read and confirm the Apache 2.0 License.To accept the License Agreement, check Agree and click .
Table of Contents
This section provides background information about the inner workings of BinDiff. It is recommended reading for anyone who wants to get the most out of the product, and for anyone who wants to understand the details of the different configuration options.
BinDiff works on the abstract structure of an executable, ignoring the concrete assembly-level instructions in the disassembly. Every function gets a signature, based on the structure of the (normalized) flow graph of the function. The signature consists of:
Once the two sets of signatures (for the two executables) have been generated, initial matches are created. This works by selecting a subset of all functions in each executable which share a common characteristic. A match is created if a signature occurs once (and only once) in both examined subsets of signatures.
After this step, callgraphs (graphs which contain information about the calls-to relations between functions) are used to generate more matches: If a match is known, the subsets of all functions called from a matched function are examined. These subsets are significantly smaller than the set of all functions, thus the propability for finding new unique matches is a lot higher. This process is repeated until no new matches can be found.
This means that after you have successfully run BinDiff, you have a list of functions that were successfully associated with each other, as well as two lists of functions that could not be associated.
BinDiff has a list of function attributes (see below) suitable for generating matches. It starts on a global level, considering all functions of the binary and calculates the first attribute for every function. There are several possible outcomes:
After the initial global matching step the parents (callers) and children (callees) of each new match are considered. BinDiff tries to match functions in the set of parents and children by performing a "drill down" step as described above on each. Finally BinDiff performs basic block matching for all newly matched functions and matches functions called from matched basic blocks (function: call reference matching). This concludes global matching for a single attribute. The whole process is restarted on the remaining unmatched functions using the next best attribute.
Function attributes are used in one of two ways. Either canonically per function or per edge. Edge matching tries to match edges (which represent calls in the call graph or jumps in the flow graphs) if source and target function attributes match. Thus edge matching is the stronger criterium, yielding better matches in general. However, since it is possible to have a lot of edges per vertex in a graph (this is especially true for callgraphs, where the number of edges often grows super linearily in the number of vertices) edge matching may potentially be very slow. If you do encounter performance problems for certain binaries you should first try to disable edge matching based algorithms. Also, do not hesitate to send the offending binaries to the zynamics team at Google - we are always trying to improve performance bottlenecks and are grateful for samples.
Function matching algorithms ordered roughly by resulting match quality:
Matches functions based on a hash of the original raw function bytes. Thus two functions matched by this algorithm should be identical on the byte level.
match quality: very good
algorithm performance: very good
Matches functions based on a hash of their names. Only real names are considered, names which have been auto-generated by the disassembler are not used. This is one of the few algorithms that can match imported functions, i.e. functions that do not have an actual body in the binary. False matches are highly unlikely.
match quality: very good
algorithm performance: very good
Matches callgraph edges based on their source and target function's MD indices. Thus calls between two structurally identical functions are matched.
match quality: very good
algorithm performance: medium
Matches callgraph edges based on their callgraph MD index. This means the callgraph leading to that particular call is structurally identical in both binaries. Match quality depends on how deep the callstack leading up to this edge is: the deeper the less likely is a false match.
match quality: good
algorithm performance: medium
Matches functions based on their structure using the MD index. Since the MD index takes a topological graph ordering as one of it's inputs we can parametrize it by whether we sort the graph vertices into levels following calls from the entrypoint (top down) or callers from the exit points (bottom up).
match quality: good
algorithm performance: very good
Matches functions based on their instruction prime products. Each mnemonic gets assigned a unique small prime number. These primes are multiplied for all instructions of the function. This yields a structurally invariant, instruction order independent, product to be subsequently used as a matching attribute.
match quality: good
algorithm performance: very good
Matches functions based on their position in the callgraph. The call graph leading up to that function must be structurally identical as viewed from the program entrypoint (top down) or its exits (bottom up).
match quality: good
algorithm performance: very good
Matches functions based on their local call graph neighborhoods. Calls and callees are only followed two levels deep as seen from the function in question.
match quality: medium
algorithm performance: poor
Matches functions based on their relaxed MD indices. The MD index is calculated without taking topological order into account. This means only the in-edges and out-edges in the function's local neighborhood are considered.
match quality: medium
algorithm performance: medium
Matches functions in order based on their entry point addresses. This is a special matching step that is especially useful during drill downs. Since it would indiscriminately match all functions if not further constrained there are two additional requirements: first the functions in question must already be equivalent according to the relaxed MD index and the flow graph MD index. Second, the two sets of equivalent functions in both binaries must be of equal size.
match quality: poor
algorithm performance: very good
Matches functions based on their referenced string data. All strings referenced from the functions in question are put into a combined hash which is subsequently used as the matching attribute if at least one string is referenced. This is a good algorithm for matching error handling code which often has very little structure (and thus won't get matched by the stronger algorithms) but lots of references to error message strings.
match quality: medium
algorithm performance: very good
Matches functions based on the number of loops. Only applied if at least one loop is present.
match quality: poor
algorithm performance: very good
Special algorithm that is only used for functions with matched parents (callers). The point of the call at the call site is determined as a tuple: topological basic block level, instruction number in the basic block, address. The child function (callee) is matched if basic block level and instruction number match (exact), if only the basic block level match (topology) or simply ordered by call site address (sequence). This produces very weak matches in general, but may be a good strategy if the parent function was matched correctly. In that case it is not unlikely that it will call functions in the same order in both binaries.
match quality: very poor
algorithm performance: good
Basic block matching on flow graph level is algorithmically very similar to function matching. Global attribute matching is followed by drill downs and by trying to match in the reduced set of parents/children of matched basic blocks.
Basic block matching algorithms ordered roughly by resulting match quality:
Flowgraph edges are matched if source and target basic block instruction prime products match. Thus both basic blocks contain identical instructions, potentially ordered differently.
match quality: very good
Basic blocks are matched based on a binary hash of their raw bytes. Only used on basic blocks with at least 4 instructions.
match quality: very good
Basic blocks are matched based on their instruction prime product. Only used on basic blocks with at least 4 instructions.
match quality: very good
Basic blocks are matched if they call at least one function and all called functions have been matched.
match quality: very good
Basic blocks are matched if they reference at least one string and that string is the same in both binaries.
match quality: very good
Basic blocks are matched based on their position in the flowgraph.
match quality: good
Works like the previous prime matching step but with the minimum number of instructions constraint lifted.
match quality: medium
Matches the back edges of loops which have been determined using the Lengauer-Tarjan algorithm.
match quality: medium
Matches basic blocks that are loop anchors, i.e. targets of a back edge.
match quality: low
Matches basic blocks that have self loops.
match quality: low
Matches the entry/exit point basic blocks. The entry point is uniquely identified by the function and usually has an indegree of 0. Exit points are vertices with outdegree 0.
match quality: low
Special matching steps that are only applied to basic blocks that are structurally equivalent based on their MD indices. Basic blocks are matched according to their number of instructions or simply ordered by address.
match quality: very low
Special matching step of last resort. Single unmatched parents/children of matched basic blocks are matched - with no regard of their content.
match quality: very low
The confidence value displayed by BinDiff is the average algorithm confidence (match quality) used to find a particular match weighted by a sigmoid squashing function. The values aren't simply averaged because few single weak matches in an otherwise perfectly matched function/binary shouldn't drag the confidence down too much. Analogously, even a few strong matches will not "rescue" a binary pair matched primarily by address sequence and similarily weak algorithms.
The similarity value for a function is a weighted sum taking the following factors into account:
Weight ~25%: quota of matched flow graph edges to total edges
Weight ~15%: quota of matched basic blocks to total basic blocks
Weight ~10%: quota of matched instructions to total instructions
Weight ~50%: difference in flow graph MD index
The similarity value for the whole binary is a weighted sum taking the following factors into account:
Weight ~35%: quota of matched flow graph edges to total edges
Weight ~25%: quota of matched basic blocks to total basic blocks
Weight ~10%: quota of matched functions to total functions
Weight ~10%: quota of matched instructions to total instructions
Weight ~20%: difference in call graph MD index
For the more technically inclined and for users that are interested in in-depth information on BinDiff and IDA, the following documents are worth reading:
http://www.zynamics.com/downloads/bindiffsstic05-1.pdf . “Graph-Based Comparison of Executable Objects”. SSTIC ’05, Symposium sur la Sécurité des Technologies de l’Information et des Communications. . 2005.
http://www.zynamics.com/downloads/dimva_paper2.pdf . 161-173. “Structural Comparison of Executable Objects”. Detection of Intrusions and Malware & Vulnerability Assessment. . 2004. 3-88579-375-X.
Table of Contents
To begin using BinDiff, start IDA, load a database and press Ctrl+6 to display the plugin main window.
Browse for an IDA database and start the diffing process. Can also be accessed by pressing Shift+D or using the , menu option.
Like
, but allows to select an address range to diff.Browse for a diffing result to load. This functionality is also available via the
, , menu option.Saves the result of the current diffing session to be loaded later using
. This can also be accessed using the , , menu option.Allows to transfer function names and/or individual comments from the other database.
To view functions, that were not matched during the diff, use the Primary Unmatched and the Secondary Unmatched subviews. The first one displays functions that are contained in the currently open database and were not associated to any function of the diffed database, while the Secondary Unmatched subview contains functions that are in the diffed database but were not associated to any functions in the first.
The meaning of the columns is as follows:
The effective address of the function that is not associated with any other function.
The name of the function that is not associated with any other function.
The number of code blocks, instructions and edges in the flow graph in this function. These can be used to identify similar functions manually.
The Matched Functions subview displays the results of the diffing e.g. pairs of functions that are associated with each other.
The columns have the following meaning:
A value between zero and one indicating how similar two matched functions are. A value of exactly one means the two functions are identical (in regard to their instructions, not their memory addresses). Values less than one mean the function has changed parts. Please note that BinDiff only considers basic blocks, edges and mnemonics for calculating similarity values. In particular, instructions may differ in their operands, immediate values and addresses and will still be considered equal if the mnemonics match.
A value between zero and one indicating the confidence of the similarity score. Note that this value represents the calculated confidence score for the matching algorithms that are enabled in the configuration file. It is possible to get a high similarity match with a small confidence score or vice versa. This simply means that a weak algorithm has found a good match or a good algorithm has found a weak match.
This column characterizes the difference between the two matched functions. There can be any combination of the following change types, indicated by single letters:
The effective address (EA) and the name of the function in the currently open IDA database.
The effective address (EA) and the name of the function in the foreign (secondary) database.
The algorithm that led to the match. For a list of algorithms in this version of BinDiff, refer to Chapter 3, Understanding BinDiff and Chapter 5, Configuration .
Number of total matched basic blocks per function, total number of basic blocks in the first (primary) function and total number of basic blocks in the second function.
Like the above, but totalling instructions instead of basic blocks.
These three value describe the total of the matched control flow graph edges, as well as the number of the edges in the control flow graph of the primary and secondary functions.
Right-clicking on one of the matched functions brings up a context menu:
Transfer individual comments from the other database for the selected function.
Displays a graphical representation of the differences between the selected function in the two databases using the BinDiff Graphing GUI.
Copies the contents of the Matched Functions subview to the system clipboard. The copied text is tabular data separated by spaces.
BinDiff displays functions from IDA as highlighted flowgraphs with colors indicating special properties of edges and code blocks.
Green arrows represent conditional branches that need to satisfy a certain condition to be taken, whereas red arrows represent conditional branches if the branch condition is not met as well as unconditional branches.
BinDiff relies on several heuristics to do its work. The behavior of the program is configurable to suit your needs. This section discusses the proper configuration of BinDiff.
If you are in a hurry to get started, just leave the configuration as is, the default settings of BinDiff are pre-configured for the most common use cases.
If you encounter performance problems with BinDiff, would like to adapt the confidence values based on your experience or want to test algorithms individually you may modify the config file directly. Note that BinDiff needs to be restarted for changes to take effect (restart IDA).
Table 5.1. Location of BinDiff configuration files
Platform | Location |
---|---|
Windows | %APPDATA%\BinDiff\bindiff.xml |
Linux | ~/.bindiff/bindiff.xml |
macOS | ~/Library/Application Support/BinDiff/bindiff.xml |
The following configuration options are available:
ui
java-binary
Path to the Java application launcher binary. If empty or unset, BinDiff will try to auto-detect a suitable Java VM.
Default value (Windows): (BinDiff installation
directory)\jre\bin\javaw.exe
Default value (macOS, Linux): (empty)
java-vm-options
Additional arguments to pass to the Java application launcher. There should usually be no need to set this.
Default value: (unset)
max-heap-size-mb
Maximum amount of Java heap space for the UI. If unset, defaults to 75% of physical system memory (minimum 512MiB). There should usually be no need to set this.
Default value: (unset)
directory
Path to the BinDiff installation directory.
Default value: (BinDiff installation directory)
server
IP address to use for internal IPC.
Default value: 127.0.0.1
port
IP port to use for internal IPC.
Default value: 2000
retries
Reserved, do not use.
Default value: 20
ida
directory
Path to the IDA Pro installation directory.
Default value: (IDA Pro installation directory)
executable
Name of the IDA Pro executable for analyzing 32-bit
binaries. If unset, BinDiff uses ida.exe
(Windows) or ida
(macOS, Linux).
Default value: (unset)
executable64
Name of the IDA Pro executable for analyzing 64-bit
binaries. If unset, BinDiff uses ida64.exe
(Windows) or ida64
(macOS, Linux).
Default value: (unset)
threads
How many threads to use when batch-diffing. If set to
max-hw
, BinDiff will use as many threads as there
are logical processors in the system.
Default value: 2
function-matching
,
basic-block-matching
Specifies the order of the internal matching algorithms and the associated confidence values. For a list of supported algorithms, please refer to Chapter 3, Understanding BinDiff .
For each step
element, the following attributes are supported:
confidence
algorithm
In the following section, we will perform a short walk-through explaining the use of BinDiff to reverse-engineer a security patch. Our example will be MS08-063.
From MS08-063:
A remote code execution vulnerability exists in the way that Microsoft Server Message Block (SMB) Protocol handles specially crafted file names. An attempt to exploit the vulnerability would require authentication because the vulnerable function is only reachable when the share type is a disk, and by default, all disk shares require authentication. An attacker who successfully exploited this vulnerability could install programs; view, change, or delete data; or create new accounts with full user rights.
Microsoft also states that the exploitability is “inconsistent exploit likely” , indicating that some complications are to be expected for someone writing an exploit.
Begin by disassembling srv.sys in the unpatched version. Then follow up by disassembling the patched variant of the file. Make sure the IDB of the unpatched version is not opened in another instance of IDA. Then press Ctrl+6 and select the unpatched version as the target for the comparison using the button.
After a few moments, BinDiff should display the Matched Functions subview. To display the functions that changed, click the similarity column to sort by function similarity. Functions that were changed in the patch will have a similarity score of less than one.
In the image above, a total of three functions were changed:
SrvRestartRawReceive(x)
SrvIssueQueryDirectoryRequest(x, x, x,
x, x, x, x, x)
SrvFsdRestartPrepareRawMdlWrite(x)
Let's examine the change in the middle: SrvIssueQueryDirectoryRequest
. If
you press Ctrl+E with the cursor positioned
on this function, the BinDiff graph GUI is launched.
The nodes in green indicate basic blocks that have identical instruction mnemonics in both executables (although operands to individual assembly instructions might have changed), whereas the red nodes indicate basic blocks to which our comparison algorithms were unable to find equivalents. A third category, yellow nodes, indicates nodes for which the algorithms could find equivalents, but which had some instructions changed between versions.
Let's zoom in on the two smaller blocks on the left hand side that have been inserted right before the large red block.
We can see that a 16-bit value is loaded into EDI, and
some calls to RtlUlongSub
have been introduced which were not present in the old
version. If you click the Lock
View icon on the toolbar, and thus allow the two
halves of the graph to be desynchronized, you can start
right-click-panning on the right hand side to see the big
yellow block as it was in the unpatched version.
Now, what did
change? The old version takes a 16-bit value, subtracts it
from arg_8
, subtracts 8
from arg_8
, and then
rounds the result down to the next value divisible by 4.
The result of this calculation is then added to the local
variable VirtualAddress
,
and the number of bytes specified by this 16-bit value are
copied to the buffer pointed to by the result.
The new version replaces the regular integer arithmetic
with calls to RtlUlongSub
, which enforces that no
integer underflow occurs e.g. the result of a subtraction is
always smaller than the value that was subtracted from. In
pseudo-code:
index = (arg_8 - shortval - 8) & 0xFFFFFFFC; memcpy(&destination[index + 8], eax->member_4, shortval);
So the risk here was for an attacker to be able to choose
arg_8
and shortval
in such a manner that
index + 8
remains
negative allowing for memory corruption right before
destination
. The patched
version eliminates this problem by ensuring that the
arithmetic never drops below zero.
Table of Contents
From time to time, we will publish documents on advanced usage of our products on our website http://www.zynamics.com/. Please check the publications section frequently to see any additional information that becomes available.
Aside from analyzing security problems, symbol porting is one of the most useful features of BinDiff. Given an IDB in which comments and names have been added, you can use BinDiff to pull these names and comments into a new IDB for a related executable.
In order to do this, please perform the following steps:
All comments, local variable names and global variable names that BinDiff can associate between the two executables will be ported.
The names of local variables etc. in the current IDB will be overwritten!
A.1. General Questions |
|
A.1.1. |
How come the Matched functions subview is empty? |
Please refer to Chapter 5, Configuration on how to properly configure BinDiff and follow the instructions contained within. If the problem persists although you have enabled several algorithms for creating initial fixedpoints, please contact us. |
|
A.1.2. |
Where did the Changed functions subview go? |
The functionality of both the Changed functions and the Matched functions subviews was merged. Changed functions are functions with a similarity of less than one, whereas identical functions have a similarity score of exactly one. For a more detailed description, refer to the section called “The Matched Functions Subview”. |
|
A.2. Installation |
|
A.2.1. |
Administrative installation of BinDiff |
For simple administrative installs just prepend msiexec /quiet (no output at all) or msiexec /passive (progress bar only) to the installation package: msiexec /passive bindiff6.msi. The installation package can also be deployed via Group Policies just like any other Windows Installer package. |
Report BinDiff bugs using the issue tracker at
https://bugs.zynamics.com/bindiff
(requires a free Google
account).