This repository contains a novel malware detection framework based on deep learning models. We provide API sequence data for test samples (generated from both benign and malicious software). To reimplement this model on your own dataset, extract the API sequences from your software sandbox reports and process them into the same form as the test samples.
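As a rough illustration of the extraction step, the sketch below pulls API names out of a sandbox report. It assumes a Cuckoo-style JSON layout in which calls are stored under `behavior` → `processes` → `calls`, each with an `api` field. The function name `extract_api_sequences` and the tiny in-memory report are hypothetical; the repository's own `extract_API_Sequence_from_json.py` may handle processes and ordering differently.

```python
import json
from typing import Dict, List


def load_report(path: str) -> dict:
    """Read a sandbox report stored as JSON."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def extract_api_sequences(report: dict) -> Dict[int, List[str]]:
    """Return {pid: [API names in recorded order]}.

    Assumes the Cuckoo-style layout behavior -> processes -> calls -> api.
    """
    sequences: Dict[int, List[str]] = {}
    for process in report.get("behavior", {}).get("processes", []):
        calls = process.get("calls", [])
        sequences[process.get("pid")] = [c["api"] for c in calls if "api" in c]
    return sequences


# Hypothetical miniature report, used only to show the expected shape.
demo_report = {
    "behavior": {
        "processes": [
            {
                "pid": 1234,
                "calls": [
                    {"api": "NtCreateFile", "category": "file"},
                    {"api": "RegCreateKeyExW", "category": "registry"},
                ],
            }
        ]
    }
}

sequences = extract_api_sequences(demo_report)
print(sequences)
```

For a report on disk, `extract_api_sequences(load_report(path))` would follow the same pattern, with `path` pointing at your own report file.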
- Operating system: Ubuntu 20.04 LTS
- Programming language: Python 3.7.10
- GPU: NVIDIA RTX 3090
- numpy
- scikit-learn
- PyTorch 1.8.1+cu111
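To compare a local setup against the environment listed above, a small check like the following may help. It is only a sketch: the helper name `installed_versions` is hypothetical, and it reports whatever versions are importable rather than enforcing the ones listed.

```python
import importlib
import platform

# Display name -> importable module name.
PACKAGES = {"numpy": "numpy", "scikit-learn": "sklearn", "pytorch": "torch"}


def installed_versions() -> dict:
    """Collect the Python version and the versions of the listed packages."""
    versions = {"python": platform.python_version()}
    for display_name, module_name in PACKAGES.items():
        try:
            module = importlib.import_module(module_name)
            versions[display_name] = getattr(module, "__version__", "unknown")
        except ImportError:
            versions[display_name] = None  # package is not installed
    return versions


versions = installed_versions()
for name, version in versions.items():
    print(f"{name}: {version if version is not None else 'not installed'}")

# Report GPU visibility only when PyTorch is importable.
try:
    import torch

    print("CUDA available:", torch.cuda.is_available())
except ImportError:
    print("CUDA available: unknown (PyTorch not installed)")
```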
- `data_demo.npz` (in `data_demo.zip`): The dataset after data preprocessing

  ```python
  import numpy as np

  ## load dataset
  data = np.load('./data_demo.npz', allow_pickle=True)
  x_name = data['x_name']
  x_semantic = data['x_semantic']
  y = data['y']
  ```

  - `x_name`: API sequences (processed by `word2id` in folder `DictionaryForRawData`)
  - `x_semantic`: Semantic chains (processed by `word2behavior` and `behavior2id` in folder `DictionaryForRawData`)
  - `y`: data labels
- `proposed_model.pkl` (in folder `model`): A trained GPU model for testing
- `model.py`: Model architecture and model training
- `test.py`: Model evaluation
- `ablation_study.py`: Model architecture with different structures
- `CuckooReport2APISequence`: An example Cuckoo report and API sequence extraction
  - `extract_API_Sequence_from_json.py`: API sequence extraction code
  - `report_example.json`: An example Cuckoo report
- `DictionaryForRawData`: Some dictionaries used in data preprocessing. These dictionaries can help you turn "data.npz" back into the original API sequences and semantic chains.
  - `word2id.npz`: key: API name string, value: id index (e.g., `'FileOpen' : 1`)

    ```python
    import numpy as np

    word2id = np.load('./DictionaryForRawData/word2id.npz', allow_pickle=True)
    word2id = word2id['word2id'][()]
    ```

  - `word2behavior.npz`: key: API name string, value: the 4-tuple of this API (e.g., `'RegCreateKeyExW' : ['Create', 'RegKeyEx', 'Update', 'registry']`)

    ```python
    word2behavior = np.load('./DictionaryForRawData/word2behavior.npz', allow_pickle=True)
    word2behavior = word2behavior['word2behavior'][()]
    ```

  - `behavior2id.npz`: dictionary name: `behavior2id`, key: string in the 4-tuple (action, operation object, class, category), value: id index

    ```python
    behavior2id = np.load('./DictionaryForRawData/behavior2id.npz', allow_pickle=True)
    behavior2id = behavior2id['behavior2id'][()]
    ```
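To make the role of the three dictionaries concrete, here is a minimal sketch of how a raw API sequence might be encoded into name ids and a semantic chain. The toy dictionaries stand in for the real `.npz` contents: only `'FileOpen' : 1` and the `RegCreateKeyExW` 4-tuple are taken from the descriptions above, and every other entry, the function names, and the use of `0` for unknown names are assumptions. The exact array layout of `x_semantic` in the released data may differ.

```python
from typing import Dict, List, Sequence

# Toy stand-ins for the dictionaries in DictionaryForRawData.
# Hypothetical except for 'FileOpen': 1 and the RegCreateKeyExW 4-tuple.
word2id: Dict[str, int] = {"FileOpen": 1, "RegCreateKeyExW": 2}
word2behavior: Dict[str, List[str]] = {
    "FileOpen": ["Open", "File", "Read", "file"],  # hypothetical 4-tuple
    "RegCreateKeyExW": ["Create", "RegKeyEx", "Update", "registry"],
}
behavior2id: Dict[str, int] = {
    "Open": 1, "File": 2, "Read": 3, "file": 4,
    "Create": 5, "RegKeyEx": 6, "Update": 7, "registry": 8,
}


def encode_names(api_sequence: Sequence[str],
                 word2id: Dict[str, int],
                 unknown_id: int = 0) -> List[int]:
    """Map each API name to its id; unseen names get unknown_id (assumption)."""
    return [word2id.get(name, unknown_id) for name in api_sequence]


def encode_semantic_chain(api_sequence: Sequence[str],
                          word2behavior: Dict[str, List[str]],
                          behavior2id: Dict[str, int],
                          unknown_id: int = 0) -> List[List[int]]:
    """Map each API name to the ids of its 4-tuple
    (action, operation object, class, category)."""
    chain = []
    for name in api_sequence:
        behavior = word2behavior.get(name)
        if behavior is None:
            chain.append([unknown_id] * 4)
        else:
            chain.append([behavior2id.get(part, unknown_id) for part in behavior])
    return chain


api_sequence = ["FileOpen", "RegCreateKeyExW", "SomeUnseenApi"]
name_ids = encode_names(api_sequence, word2id)
semantic_ids = encode_semantic_chain(api_sequence, word2behavior, behavior2id)
print(name_ids)      # [1, 2, 0]
print(semantic_ids)  # [[1, 2, 3, 4], [5, 6, 7, 8], [0, 0, 0, 0]]
```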
We obtained API sequences from a large number of Windows software samples in 2019.
- The malware dataset, which consists of Spyware, Backdoor, Virus, Downloader, Ransom, Adware, Worm, Trojan and Disputed, was collected from VirusShare.
- The goodware dataset, which consists of System, Internet, Games, Business, Media, Software Development Kit, Education, Social, Travel and Tools, was collected from popular free software sources (including Softonic, SourceForge and Portableapps).
Since the original data is too large (the original software samples are about 3 TB, the Cuckoo reports about 4 TB, and the feature vectors about 2 GB), we provide only a test demo set (2000 malware and 2000 goodware feature vectors) here for the time being.
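A quick way to confirm the class balance of a preprocessed `.npz` file is to count the distinct values in `y`. The sketch below assumes `y` is a one-dimensional array of integer class labels; which value denotes malware is not stated here, so the counts are reported per label value. The helper name `label_counts` and the synthetic file are hypothetical.

```python
import os
import tempfile

import numpy as np


def label_counts(npz_path: str) -> dict:
    """Return {label value: number of samples} for the 'y' array in an .npz file."""
    with np.load(npz_path, allow_pickle=True) as data:
        labels, counts = np.unique(data["y"], return_counts=True)
    return {int(label): int(count) for label, count in zip(labels, counts)}


# Synthetic stand-in for the demo file, so the sketch runs on its own.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "synthetic_demo.npz")
    np.savez(demo_path, y=np.array([0, 0, 0, 1, 1, 1]))
    counts = label_counts(demo_path)

print(counts)  # {0: 3, 1: 3}
```

Run against the real demo file, the same call would be expected to show two label values with 2000 samples each.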
We use Cuckoo Sandbox to record API sequences while running the software samples. The preprocessed demo dataset is stored in `data_demo.npz`. Users can recover the original API sequences and semantic chains through the dictionaries in `DictionaryForRawData`.
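Recovering the original names amounts to inverting the dictionaries and looking up each id. The following is a minimal sketch with toy dictionaries. The helper names `invert` and `decode`, the ids other than `'FileOpen' : 1`, and the treatment of `0` as padding are assumptions, so check them against the real dictionaries before relying on the result.

```python
from typing import Dict, Iterable, List, Optional


def invert(mapping: Dict[str, int]) -> Dict[int, str]:
    """Build an id -> string lookup, refusing ambiguous (duplicate) ids."""
    inverse: Dict[int, str] = {}
    for key, idx in mapping.items():
        if idx in inverse:
            raise ValueError(f"id {idx} maps to both {inverse[idx]!r} and {key!r}")
        inverse[idx] = key
    return inverse


def decode(ids: Iterable[int],
           id2token: Dict[int, str],
           pad_id: Optional[int] = None) -> List[str]:
    """Translate ids back to strings, skipping pad_id if one is given."""
    tokens = []
    for raw_id in ids:
        idx = int(raw_id)  # also accepts NumPy integer scalars
        if pad_id is not None and idx == pad_id:
            continue
        tokens.append(id2token.get(idx, f"<unknown:{idx}>"))
    return tokens


# Toy dictionaries; ids are hypothetical apart from 'FileOpen': 1.
word2id = {"FileOpen": 1, "RegCreateKeyExW": 2}
behavior2id = {"Create": 5, "RegKeyEx": 6, "Update": 7, "registry": 8}

id2word = invert(word2id)
id2behavior = invert(behavior2id)

api_names = decode([1, 2, 0, 0], id2word, pad_id=0)  # 0 as padding is an assumption
behavior = decode([5, 6, 7, 8], id2behavior)
print(api_names)  # ['FileOpen', 'RegCreateKeyExW']
print(behavior)   # ['Create', 'RegKeyEx', 'Update', 'registry']
```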
@article{li2022novel,
title={A Novel Deep Framework for Dynamic Malware Detection Based on API Sequence Intrinsic Features},
author={Li, Ce and Lv, Qiujian and Li, Ning and Wang, Yan and Sun, Degang and Qiao, Yuanyuan},
journal={Computers \& Security},
pages={102686},
year={2022},
publisher={Elsevier}
}