Thanks for the really cool work and the open dataset!
I noticed the deduplication (when splitting into test/train) is missing some near/exact duplicates. I understand that near-duplicate detection is tricky, but I think some more aggressive (text-based) normalization could help.
For Ghidra, FUN_*, LAB_*, DAT_*, and PTR_* (among others) all leak offsets into the decompilation text. Similar labels exist in IDA.
For example, take these two cases from the Ghidra_O3 dataset (per-binary split):
8b24d40a9bcf3a361a784f0d32abb3f0_(0011c500) (test)
2334db8f8877d54ca418f044f1954d86_(0011c4e0) (train)
They are considered distinct by the current deduplication algorithm. However, diffing them shows that the only difference is the offsets of callees within their respective binaries:
{
if ((@@var_0@@wine_dbch_xmllite@@ & 8) != 0) {
if (3 < @@var_1@@property@@) {
- FUN_0011c1b0();
+ FUN_0011c190();
}
- FUN_0011bd30();
+ FUN_0011bd10();
}
if (@@var_1@@property@@ == 2) {
*(@@var_2@@iface@@ + 0x28) = @@var_3@@value@@ != 0;
@@ -47,7 +47,7 @@
}
if (@@var_1@@property@@ != 1) {
if ((@@var_0@@wine_dbch_xmllite@@ & 1) != 0) {
- FUN_0011bd30();
+ FUN_0011bd10();
}
return 0x80004001;
}
These functions are derived from identical source code.
I understand that we can't say for certain the calls are the same from the text above (we would need to check the callee code, I guess), but I would argue that even if you normalize the callsites (i.e. replacing FUN_* with FUN), the false-positive rate will be very low. Most pairs of nontrivial functions would be expected to differ by far more than just the names of the called functions.
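A minimal sketch of the callsite normalization described above, in Python. The function name `normalize_labels` and the exact regular expression are assumptions made for illustration; they are not taken from the dataset's existing deduplication code, and the prefix list covers only the four label kinds named in this issue.

```python
import re

# Ghidra auto-generated labels named in this issue. The optional middle part
# is an assumption meant to cover compound labels such as PTR_DAT_00104010.
LABEL_RE = re.compile(r"\b(FUN|LAB|DAT|PTR)_(?:\w+_)?[0-9a-fA-F]{4,}\b")


def normalize_labels(code: str) -> str:
    """Replace offset-bearing labels with their bare prefix (FUN_0011c1b0 -> FUN)."""
    return LABEL_RE.sub(r"\1", code)


# The two call lines from the diff above collapse to the same text.
test_line = "FUN_0011c1b0();"
train_line = "FUN_0011c190();"
print(normalize_labels(test_line) == normalize_labels(train_line))
```

Keeping the prefix (rather than deleting the label entirely) preserves the distinction between a call, a data reference, and a jump target, so only the offset is discarded.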
~20% of the test set is duplicated in train, after normalizing for FUN_*, LAB_*, DAT_*, and PTR_*.
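One way such an overlap could be measured is sketched below. This is an illustrative approach under stated assumptions, not necessarily the procedure behind the figure above: `fingerprint` and `duplicated_fraction` are hypothetical helpers, the regular expression is an assumed approximation of Ghidra's label format, and the inputs are assumed to be plain lists of decompiled function texts.

```python
import hashlib
import re

# Assumed pattern for the offset-bearing Ghidra labels discussed in this issue.
LABEL_RE = re.compile(r"\b(FUN|LAB|DAT|PTR)_(?:\w+_)?[0-9a-fA-F]{4,}\b")


def fingerprint(code: str) -> str:
    """Hash the function text after stripping label offsets and extra whitespace."""
    normalized = LABEL_RE.sub(r"\1", code)
    normalized = " ".join(normalized.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def duplicated_fraction(train: list[str], test: list[str]) -> float:
    """Fraction of test functions whose fingerprint also occurs in train."""
    if not test:
        return 0.0
    train_hashes = {fingerprint(code) for code in train}
    hits = sum(1 for code in test if fingerprint(code) in train_hashes)
    return hits / len(test)


# Toy data: one of the two test functions matches train after normalization.
train = ["if (x) { FUN_0011c190(); }"]
test = ["if (x) { FUN_0011c1b0(); }", "return DAT_00104010 + 1;"]
print(duplicated_fraction(train, test))
```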
I would argue that deduplication should ignore all function names, not just the FUN_* names left by Ghidra. There are cases where identical library functions may use alternative function names (e.g. see --zprefix for zlib), and there are probably cases in the dataset where identical functions have names in some binaries and not others.
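A rough sketch of what ignoring all callee names could look like, assuming a purely text-based heuristic: any identifier directly followed by `(` is treated as a callee unless it is a C keyword. The helper `normalize_callees`, the `CALL` placeholder, and the keyword list are all hypothetical, and the prefixed name `z_inflate` is used only as an illustration of a renamed library function.

```python
import re

# Keywords that can be followed by "(" in decompiled C but are not calls.
# This list is an assumption and may need extending.
C_KEYWORDS = {"if", "while", "for", "switch", "return", "sizeof"}

# An identifier immediately followed by an opening parenthesis.
CALL_RE = re.compile(r"\b([A-Za-z_]\w*)\s*\(")


def normalize_callees(code: str) -> str:
    """Replace every called function name with the placeholder CALL."""
    def replace(match: re.Match) -> str:
        if match.group(1) in C_KEYWORDS:
            return match.group(0)  # leave control-flow keywords untouched
        return "CALL("
    return CALL_RE.sub(replace, code)


# A named callee, a prefixed variant, and an unnamed FUN_* callee all
# normalize to the same text.
print(normalize_callees("inflate(s, 0);"))
print(normalize_callees("z_inflate(s, 0);"))
print(normalize_callees("FUN_0011bd30();"))
```

This heuristic does not handle calls through function pointers, and it would be more aggressive than FUN_*-only normalization, so its false-positive rate would need to be checked against the dataset.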