Improve deduplication for VarCorpus dataset splitting #1

@ndrewh

Thanks for the really cool work and the open dataset!

I noticed the deduplication (when splitting into test/train) is missing some near/exact duplicates. I understand that near-duplicate detection is tricky, but I think some more aggressive (text-based) normalization could help.

For Ghidra, FUN_*, LAB_*, DAT_*, and PTR_* (among others) all leak offsets into the decompilation text. Similar labels exist in IDA.

For example, take these two cases from the Ghidra_O3 dataset (per-binary split):

  • 8b24d40a9bcf3a361a784f0d32abb3f0_(0011c500) (test)
  • 2334db8f8877d54ca418f044f1954d86_(0011c4e0) (train)

They are considered distinct by the current deduplication algorithm. However, diffing them shows that the only difference is the offsets of callees within their respective binaries:

 {
   if ((@@var_0@@wine_dbch_xmllite@@ & 8) != 0) {
     if (3 < @@var_1@@property@@) {
-      FUN_0011c1b0();
+      FUN_0011c190();
     }
-    FUN_0011bd30();
+    FUN_0011bd10();
   }
   if (@@var_1@@property@@ == 2) {
     *(@@var_2@@iface@@ + 0x28) = @@var_3@@value@@ != 0;
@@ -47,7 +47,7 @@
   }
   if (@@var_1@@property@@ != 1) {
     if ((@@var_0@@wine_dbch_xmllite@@ & 1) != 0) {
-      FUN_0011bd30();
+      FUN_0011bd10();
     }
     return 0x80004001;
   }

These functions are derived from identical source code.

I understand that we can't say for certain the calls are the same from the text above (we would need to check the callee code, I guess), but I would argue that even if you normalize the callsites (i.e. replacing FUN_* with FUN), the false-positive rate will be very low. Most pairs of nontrivial functions would be expected to differ by far more than just the names of the called functions.
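To make the suggestion concrete, here is a minimal sketch of the text-based normalization I have in mind. It is not the project's existing deduplication code. The names `normalize` and `dedup_key` are hypothetical, and the regex only covers the four prefixes listed above, assuming each is followed directly by a hex address. Compound labels (for example a pointer label that embeds another label) would need extra handling.

```python
import hashlib
import re

# Ghidra auto-generated labels: a known prefix followed by a hex address.
# Assumption: only the four prefixes named in this issue are covered.
GHIDRA_LABEL = re.compile(r"\b(FUN|LAB|DAT|PTR)_[0-9a-fA-F]+\b")


def normalize(text: str) -> str:
    """Replace address-bearing labels with their bare prefix."""
    return GHIDRA_LABEL.sub(r"\1", text)


def dedup_key(text: str) -> str:
    """Hash of the normalized text, usable as an exact-match dedup key."""
    return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()


# Two fragments modeled on the diff above: only the callee offsets differ.
test_fn = "if (3 < @@var_1@@property@@) {\n  FUN_0011c1b0();\n}\nFUN_0011bd30();"
train_fn = "if (3 < @@var_1@@property@@) {\n  FUN_0011c190();\n}\nFUN_0011bd10();"

assert test_fn != train_fn                        # distinct as raw text
assert dedup_key(test_fn) == dedup_key(train_fn)  # identical after normalization
```

Keeping the prefix (`FUN`, `DAT`, ...) rather than deleting the token means a call site and a data reference still normalize to different text.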

~20% of the test set is duplicated in train, after normalizing for FUN_*, LAB_*, DAT_*, and PTR_*.
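For anyone who wants to check a figure like this on their own split, one way to measure it is sketched below. This is an illustration, not the exact script behind the number above. `leaked_fraction` is a hypothetical helper, and it assumes each split is available as an iterable of decompiled function strings.

```python
import re
from typing import Iterable

# Same assumption as before: prefix followed directly by a hex address.
LABEL = re.compile(r"\b(FUN|LAB|DAT|PTR)_[0-9a-fA-F]+\b")


def normalize(text: str) -> str:
    """Replace address-bearing Ghidra labels with their bare prefix."""
    return LABEL.sub(r"\1", text)


def leaked_fraction(train: Iterable[str], test: Iterable[str]) -> float:
    """Fraction of test functions whose normalized text also occurs in train."""
    train_keys = {normalize(t) for t in train}
    test_list = list(test)
    if not test_list:
        return 0.0
    hits = sum(1 for t in test_list if normalize(t) in train_keys)
    return hits / len(test_list)


# Toy data: the first test function matches a train function up to offsets.
train = ["FUN_0011c190(); return 1;", "return DAT_00104010;"]
test = ["FUN_0011c1b0(); return 1;", "return 2;"]
print(leaked_fraction(train, test))
```

For a full-size split, storing hashes of the normalized text instead of the text itself keeps the set small.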

I would argue that deduplication should ignore all function names, not just the FUN_* names left by Ghidra. There are cases where identical library functions use alternative function names (e.g. see --zprefix for zlib), and there are probably cases in the dataset where identical functions have names in some binaries and not others.
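A rough sketch of what ignoring all function names could look like is below. It is a heuristic under stated assumptions: any identifier immediately followed by `(` is treated as a function name unless it is one of a few C keywords, and the placeholder `CALLEE` and the helper `strip_callee_names` are hypothetical. It can misfire on unusual declarators (for example a type name followed by a parenthesized function-pointer declarator), so treat it as a starting point only.

```python
import re

# Keywords that can legitimately precede "(" without being a call.
C_KEYWORDS = {"if", "while", "for", "switch", "return", "sizeof"}

# An identifier, optional whitespace, then a lookahead for "(".
CALL = re.compile(r"\b([A-Za-z_]\w*)(\s*)(?=\()")


def strip_callee_names(text: str) -> str:
    """Replace every apparent function name with a fixed placeholder."""
    def repl(match: re.Match) -> str:
        name = match.group(1)
        if name in C_KEYWORDS:
            return match.group(0)           # leave control-flow keywords alone
        return "CALLEE" + match.group(2)    # keep original spacing before "("
    return CALL.sub(repl, text)


# Illustrative pair: a named callee vs. a prefixed name, and an unnamed
# FUN_* callee vs. a named one.
a = "if (x != 0) {\n  inflateEnd(x);\n}\nreturn FUN_0011bd30();"
b = "if (x != 0) {\n  z_inflateEnd(x);\n}\nreturn cleanup();"

assert a != b
assert strip_callee_names(a) == strip_callee_names(b)
```

Because the function's own name in its signature is also followed by `(`, this replaces it as well, which is the desired behavior when the same function is named in one binary and unnamed in another.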
