Improve deduplication for VarCorpus dataset splitting

Thanks for the really cool work and the open dataset!

I noticed the deduplication (when splitting into test/train) is missing some near/exact duplicates. I understand that near-duplicate detection is tricky, but I think some more aggressive (text-based) normalization could help.

For Ghidra, `FUN_*`, `LAB_*`, `DAT_*`, and `PTR_*` (among others) all leak offsets into the decompilation text. Similar labels exist in IDA.

For example, take these two cases from the Ghidra_O3 dataset (per-binary split):
- `8b24d40a9bcf3a361a784f0d32abb3f0_(0011c500)` (test)
- `2334db8f8877d54ca418f044f1954d86_(0011c4e0)` (train)

They are considered _distinct_ by the current deduplication algorithm. However, diffing them shows that the only difference is the offsets of callees within their respective binaries:

```diff
 {
   if ((@@var_0@@wine_dbch_xmllite@@ & 8) != 0) {
     if (3 < @@var_1@@property@@) {
-      FUN_0011c1b0();
+      FUN_0011c190();
     }
-    FUN_0011bd30();
+    FUN_0011bd10();
   }
   if (@@var_1@@property@@ == 2) {
     *(@@var_2@@iface@@ + 0x28) = @@var_3@@value@@ != 0;
@@ -47,7 +47,7 @@
   }
   if (@@var_1@@property@@ != 1) {
     if ((@@var_0@@wine_dbch_xmllite@@ & 1) != 0) {
-      FUN_0011bd30();
+      FUN_0011bd10();
     }
     return 0x80004001;
   }
```

These functions are derived from identical source code.

I understand that we can't say for certain the calls are the same from the text above (we would need to check the callee code, I guess), but I would argue that even if you normalize the callsites (i.e. replacing `FUN_*` with `FUN`), the false-positive rate will be very low. Most pairs of nontrivial functions would be expected to differ by far more than just the names of the called functions.

~20% of the test set is duplicated in train, after normalizing for `FUN_*`, `LAB_*`, `DAT_*`, and `PTR_*`.
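For reference, here is a minimal sketch of the kind of normalization I mean. It is not the dataset's actual pipeline. The function names (`normalize`, `fingerprint`, `leaked_fraction`) are hypothetical, the whitespace collapsing is an extra assumption on my part, and the pattern only covers the simple `PREFIX_<hex address>` label form:

```python
import hashlib
import re

# Ghidra auto-generated labels that embed an address, e.g. FUN_0011c1b0.
# The prefix list comes from this issue and is not exhaustive; compound
# labels are NOT handled by this simple pattern.
_LABEL_RE = re.compile(r"\b(FUN|LAB|DAT|PTR)_[0-9a-fA-F]+\b")


def normalize(text: str) -> str:
    """Replace address-bearing labels with their bare prefix.

    Whitespace is also collapsed (an assumption, not something the
    current deduplication is known to do).
    """
    text = _LABEL_RE.sub(r"\1", text)
    return " ".join(text.split())


def fingerprint(text: str) -> str:
    """Hash of the normalized decompilation text, used as a dedup key."""
    return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()


def leaked_fraction(train, test) -> float:
    """Fraction of test functions whose normalized text also occurs in train."""
    train_hashes = {fingerprint(t) for t in train}
    test = list(test)
    if not test:
        return 0.0
    hits = sum(fingerprint(t) in train_hashes for t in test)
    return hits / len(test)
```

With this, the two functions from the diff above get the same fingerprint, because every `FUN_0011....` callee collapses to `FUN`.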

I would argue that deduplication should ignore _all_ function names, not just the `FUN_*` names left by Ghidra. There are cases where identical library functions use alternative function names (e.g. see `--zprefix` for zlib), and there are probably cases in the dataset where identical functions have names in some binaries and not in others.
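A rough sketch of how callee names could be ignored entirely, assuming a purely text-based approach. The helper `strip_callee_names` and the `CALLEE` placeholder are hypothetical, and the keyword list is deliberately small. A regex cannot tell a call from every other use of `name(`, so this is an approximation:

```python
import re

# C keywords that can be followed by "(" but are not calls.
_C_KEYWORDS = frozenset({"if", "while", "for", "switch", "return", "sizeof"})

# An identifier immediately followed by an opening parenthesis.
_CALL_RE = re.compile(r"\b([A-Za-z_]\w*)\s*\(")


def strip_callee_names(text: str, placeholder: str = "CALLEE") -> str:
    """Replace every called function name with a fixed placeholder."""

    def repl(match: re.Match) -> str:
        name = match.group(1)
        if name in _C_KEYWORDS:
            return match.group(0)  # leave control-flow keywords untouched
        return placeholder + "("

    return _CALL_RE.sub(repl, text)
```

Applied before hashing, this makes e.g. `deflate(strm)` and `z_deflate(strm)` compare equal, and it also covers the `FUN_*` case.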


