Malware Detection based on API Sequence Intrinsic Features

This repository contains a novel deep-learning framework for malware detection. It includes API sequence data for a set of test samples, generated by both benign and malicious software. To reimplement the model on your own dataset, extract the API sequences from your sandbox reports and convert them into the same format as the test samples.
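As a rough illustration of the extraction step, the sketch below pulls API names out of a Cuckoo-style JSON report. It assumes the usual Cuckoo 2.x layout (`behavior` → `processes` → `calls` → `api`). The function name `extract_api_sequence` and the toy report are hypothetical, and the repository's own extract_API_Sequence_from_json.py may order or filter calls differently.

```python
import json


def extract_api_sequence(report):
    """Return API call names in recorded order, process by process.

    Assumes the Cuckoo 2.x report layout: behavior -> processes -> calls -> api.
    """
    sequence = []
    for process in report.get("behavior", {}).get("processes", []):
        for call in process.get("calls", []):
            name = call.get("api")
            if name:  # skip entries without an API name
                sequence.append(name)
    return sequence


# Tiny hand-written stand-in for a real report (hypothetical content)
toy_report = json.loads("""
{"behavior": {"processes": [
  {"pid": 100, "calls": [{"api": "NtCreateFile"}, {"api": "NtWriteFile"}]},
  {"pid": 200, "calls": [{"api": "RegCreateKeyExW"}]}
]}}
""")

api_sequence = extract_api_sequence(toy_report)
print(api_sequence)
```

For a real report, replace `toy_report` with the result of `json.load` on the report file.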

Experimental environment

Operating system: Ubuntu 20.04 LTS
Programming language: Python 3.7.10
GPU: NVIDIA RTX 3090

Required packages

numpy
scikit-learn
pytorch 1.8.1+cu111

Files

data_demo.npz (in data_demo.zip): The dataset after data preprocessing
```
import numpy as np

## load dataset
data = np.load('./data_demo.npz', allow_pickle=True)
x_name = data['x_name']          # encoded API sequences
x_semantic = data['x_semantic']  # encoded semantic chains
y = data['y']                    # labels
```
- x_name: API sequences (processed by word2id in folder DictionaryForRawData)
- x_semantic: Semantic chains (processed by word2behavior and behavior2id in folder DictionaryForRawData)
- y: data labels
proposed_model.pkl (in folder model): A trained GPU model for testing
model.py: Model architecture and model training
test.py: Model evaluation
ablation_study.py: Model architecture with different structures
CuckooReport2APISequence: An example Cuckoo report and API sequence extraction
- extract_API_Sequence_from_json.py: API sequence extraction code
- report_example.json: An example Cuckoo report
DictionaryForRawData: Dictionaries used in data preprocessing. They let you map the data in "data_demo.npz" back to the original API sequences and semantic chains.
- word2id.npz: key: API name string, value: id index (e.g., 'FileOpen' : 1)
```
word2id = np.load('./DictionaryForRawData/word2id.npz', allow_pickle=True)
word2id = word2id['word2id'][()]
```
- word2behavior.npz: key: API name string, value: the 4-tuple of this API (e.g., 'RegCreateKeyExW' : ['Create', 'RegKeyEx', 'Update', 'registry'])
```
word2behavior = np.load('./DictionaryForRawData/word2behavior.npz', allow_pickle=True)
word2behavior = word2behavior['word2behavior'][()]
```
- behavior2id.npz: dictionary name: behavior2id, key: string in the 4-tuple (action, operation object, class, category), value: id index
```
behavior2id = np.load('./DictionaryForRawData/behavior2id.npz', allow_pickle=True)
behavior2id = behavior2id['behavior2id'][()]
```
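To show how the three dictionaries fit together, here is a minimal sketch that encodes a raw API sequence into an id sequence and a matching array of 4-tuple ids. The toy dictionaries, the 4-tuple given for 'FileOpen', all id values, the `encode` helper, the use of 0 for padding and unknown entries, and the `(length, 4)` output shape are assumptions made for illustration. The layout actually stored in `x_name` and `x_semantic` may differ.

```python
import numpy as np

# Hypothetical toy dictionaries with the same key/value shape as the real ones
word2id = {"FileOpen": 1, "RegCreateKeyExW": 2}
word2behavior = {
    "FileOpen": ["Open", "File", "Access", "file"],  # invented tuple
    "RegCreateKeyExW": ["Create", "RegKeyEx", "Update", "registry"],
}
behavior2id = {
    "Open": 1, "File": 2, "Access": 3, "file": 4,
    "Create": 5, "RegKeyEx": 6, "Update": 7, "registry": 8,
}

UNKNOWN_ID = 0  # assumption: 0 marks padding and unknown entries


def encode(api_sequence, max_len):
    """Map API names to ids and to ids of their 4-tuple components."""
    names = np.full(max_len, UNKNOWN_ID, dtype=np.int64)
    semantic = np.full((max_len, 4), UNKNOWN_ID, dtype=np.int64)
    for i, api in enumerate(api_sequence[:max_len]):
        names[i] = word2id.get(api, UNKNOWN_ID)
        for j, part in enumerate(word2behavior.get(api, [])[:4]):
            semantic[i, j] = behavior2id.get(part, UNKNOWN_ID)
    return names, semantic


x_name_row, x_semantic_row = encode(["FileOpen", "RegCreateKeyExW"], max_len=4)
print(x_name_row)
print(x_semantic_row)
```

With the real files, load `word2id`, `word2behavior` and `behavior2id` as shown above instead of defining the toy dictionaries.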

Dataset Overview

We obtained API sequences from a large number of Windows software samples in 2019.

The malware dataset, collected from VirusShare, consists of Spyware, Backdoor, Virus, Downloader, Ransom, Adware, Worm, Trojan and Disputed samples.
The goodware dataset, collected from popular free software sources (including Softonic, SourceForge and PortableApps), consists of System, Internet, Games, Business, Media, Software Development Kit, Education, Social, Travel and Tools software.

Because the original data is too large (the original software samples are about 3TB, the Cuckoo reports about 4TB, and the feature vectors about 2GB), we provide only a test demo set (2000 malware and 2000 goodware feature vectors) here for the time being.

We use Cuckoo Sandbox to record API sequences while running the software samples. The preprocessed demo dataset is stored in 'data_demo.npz'. Users can recover the original API sequences and semantic chains through the dictionaries in DictionaryForRawData.
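Recovering API names from the encoded data amounts to inverting the name-to-id dictionary. The sketch below uses a hypothetical two-entry `word2id` and assumes that id 0 is padding, which the repository does not state. The `decode` helper and the `<unk>` placeholder are likewise illustrative.

```python
import numpy as np

# Hypothetical toy dictionary; the real one is loaded from word2id.npz
word2id = {"FileOpen": 1, "RegCreateKeyExW": 2}

# Invert the mapping: id index -> API name
id2word = {idx: name for name, idx in word2id.items()}


def decode(id_row, pad_id=0):
    """Turn one encoded row back into API names, skipping assumed padding."""
    return [id2word.get(int(i), "<unk>") for i in id_row if int(i) != pad_id]


row = np.array([2, 1, 1, 0, 0])  # stand-in for one row of x_name
decoded = decode(row)
print(decoded)
```

The same inversion applied to `behavior2id` would recover the strings of a semantic chain.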

Cite

@article{li2022novel,
  title={A Novel Deep Framework for Dynamic Malware Detection Based on API Sequence Intrinsic Features},
  author={Li, Ce and Lv, Qiujian and Li, Ning and Wang, Yan and Sun, Degang and Qiao, Yuanyuan},
  journal={Computers \& Security},
  pages={102686},
  year={2022},
  publisher={Elsevier}
}


friendllcc/Malware-Detection-API-Sequence-Intrinsic-Features
