friendllcc/Malware-Detection-API-Sequence-Intrinsic-Features

Malware Detection based on API Sequence Intrinsic Features

This is a novel malware detection framework using deep learning models. We provide API sequence data for the test samples (generated from both benign and malicious software). To run this model on your own dataset, extract the API sequences from the software sandbox reports and process them into the same form as the test samples.

Experimental environment

  • Operating system: Ubuntu 20.04 LTS
  • Programming language: Python 3.7.10
  • GPU: NVIDIA RTX 3090

Required packages

  • numpy
  • scikit-learn
  • pytorch 1.8.1+cu111

Files

  • data_demo.npz (in data_demo.zip): The dataset after data preprocessing
    # load dataset
    import numpy as np

    data = np.load('./data_demo.npz', allow_pickle=True)
    x_name = data['x_name']          # API sequences encoded as ids
    x_semantic = data['x_semantic']  # semantic chains encoded as ids
    y = data['y']                    # labels
    
    • x_name: API sequences (processed by word2id in folder DictionaryForRawData)
    • x_semantic: Semantic chains (processed by word2behavior and behavior2id in folder DictionaryForRawData)
    • y: data labels
  • proposed_model.pkl (in folder model): A model trained on GPU, provided for testing
  • model.py: Model architecture and model training
  • test.py: Model evaluation
  • ablation_study.py: Model architecture with different structures
  • CuckooReport2APISequence: An example Cuckoo report and API sequence extraction
    • extract_API_Sequence_from_json.py: API sequence extraction code
    • report_example.json: An example Cuckoo report
  • DictionaryForRawData: Dictionaries used in data preprocessing. They let you map the preprocessed data ("data.npz") back to the original API sequences and semantic chains.
    • word2id.npz: key: API name string, value: id index (e.g., 'FileOpen': 1)
      word2id = np.load('./DictionaryForRawData/word2id.npz', allow_pickle=True)
      word2id = word2id['word2id'][()]
      
    • word2behavior.npz: key: API name string, value: the 4-tuple of this API (e.g., 'RegCreateKeyExW': ['Create', 'RegKeyEx', 'Update', 'registry'])
      word2behavior = np.load('./DictionaryForRawData/word2behavior.npz', allow_pickle=True)
      word2behavior = word2behavior['word2behavior'][()]
      
    • behavior2id.npz: dictionary name: behavior2id, key: string in the 4-tuple (action, operation object, class, category), value: id index
      behavior2id = np.load('./DictionaryForRawData/behavior2id.npz', allow_pickle=True)
      behavior2id = behavior2id['behavior2id'][()]
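
As a rough sketch of how the three dictionaries fit together, the snippet below inverts word2id to turn an id sequence back into API names, then looks up each API's 4-tuple in word2behavior and maps it to ids through behavior2id. The toy dictionaries, the helper names (`invert_dictionary`, `decode_api_sequence`, `encode_semantic_chain`), the 4-tuple shown for 'FileOpen', and the idea that id 0 marks padding are all assumptions made for illustration; they are not taken from the repository. The real x_semantic layout may also differ (for example, it may be flattened).

```python
def invert_dictionary(mapping):
    """Build an id -> key lookup from a key -> id dictionary."""
    return {idx: key for key, idx in mapping.items()}


def decode_api_sequence(id_sequence, word2id, pad_id=0):
    """Map ids back to API names, skipping the (assumed) padding id."""
    id2word = invert_dictionary(word2id)
    return [id2word[int(i)] for i in id_sequence if int(i) != pad_id]


def encode_semantic_chain(api_names, word2behavior, behavior2id):
    """Return one list of four ids (the API's 4-tuple) per API name."""
    return [
        [behavior2id[item] for item in word2behavior[name]]
        for name in api_names
    ]


# Toy stand-ins for the real dictionaries; every id below is made up.
word2id = {'FileOpen': 1, 'RegCreateKeyExW': 2}
word2behavior = {
    'FileOpen': ['Open', 'File', 'Access', 'file'],  # hypothetical 4-tuple
    'RegCreateKeyExW': ['Create', 'RegKeyEx', 'Update', 'registry'],
}
behavior2id = {
    'Open': 1, 'File': 2, 'Access': 3, 'file': 4,
    'Create': 5, 'RegKeyEx': 6, 'Update': 7, 'registry': 8,
}

api_names = decode_api_sequence([2, 1, 0, 0], word2id)
semantic_chain = encode_semantic_chain(api_names, word2behavior, behavior2id)
print(api_names)
print(semantic_chain)
```

With the real data, you would pass one row of x_name together with the dictionaries loaded above instead of the toy values.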
      

Dataset Overview

We obtained API sequences from a large number of Windows software samples in 2019.

  • The malware dataset, which consists of Spyware, Backdoor, Virus, Downloader, Ransom, Adware, Worm, Trojan and Disputed samples, was collected from VirusShare.
  • The goodware dataset, which consists of System, Internet, Games, Business, Media, Software Development Kit, Education, Social, Travel and Tools software, was collected from popular free software sources (including Softonic, SourceForge and Portableapps).

Since the original data is too large to share (the original software samples are about 3TB, the Cuckoo reports about 4TB, and the feature vectors about 2GB), we provide only a test demo set (2000 malware and 2000 goodware feature vectors) here for the time being.
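
A quick way to confirm that a loaded copy of the demo set matches the stated 2000/2000 split is to count the labels. The sketch below is only illustrative: the helper name `label_counts` is hypothetical, and it assumes y holds one integer label per sample (the README does not say which value denotes malware, so none is assumed here). A synthetic array stands in for data['y'].

```python
import numpy as np


def label_counts(y):
    """Count how many samples carry each integer label."""
    labels, counts = np.unique(np.asarray(y), return_counts=True)
    return {int(label): int(count) for label, count in zip(labels, counts)}


# Synthetic stand-in for data['y']; replace with the array loaded from data_demo.npz.
y_demo = np.array([0] * 2000 + [1] * 2000)
counts = label_counts(y_demo)
print(counts)
```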

We apply Cuckoo Sandbox to record API sequences while running the software samples. The preprocessed demo dataset is stored in 'data_demo.npz'. Users can recover the original API sequences and semantic chains through the dictionaries in DictionaryForRawData.
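
For readers preparing their own data, the following is a minimal sketch of pulling API names out of a Cuckoo JSON report. It assumes the common Cuckoo 2.x layout, in which calls are listed under behavior → processes → calls and each call has an "api" field. The function names and the hand-written report fragment are hypothetical, and the repository's extract_API_Sequence_from_json.py may order or filter calls differently.

```python
import json


def load_report(path):
    """Parse a Cuckoo JSON report from disk."""
    with open(path, 'r', encoding='utf-8') as handle:
        return json.load(handle)


def extract_api_sequence(report):
    """Collect API call names process by process, in the order they are listed."""
    sequence = []
    for process in report.get('behavior', {}).get('processes', []):
        for call in process.get('calls', []):
            api_name = call.get('api')
            if api_name:
                sequence.append(api_name)
    return sequence


# Hand-written fragment that mimics the assumed report layout.
report = {
    'behavior': {
        'processes': [
            {'pid': 100, 'calls': [
                {'api': 'NtCreateFile', 'category': 'file'},
                {'api': 'NtWriteFile', 'category': 'file'},
            ]},
            {'pid': 200, 'calls': [
                {'api': 'RegCreateKeyExW', 'category': 'registry'},
            ]},
        ]
    }
}

api_sequence = extract_api_sequence(report)
print(api_sequence)
```

For a real report, `extract_api_sequence(load_report(path))` would take the place of the in-memory fragment, where `path` points to a file such as report_example.json.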

Cite

@article{li2022novel,
  title={A Novel Deep Framework for Dynamic Malware Detection Based on API Sequence Intrinsic Features},
  author={Li, Ce and Lv, Qiujian and Li, Ning and Wang, Yan and Sun, Degang and Qiao, Yuanyuan},
  journal={Computers \& Security},
  pages={102686},
  year={2022},
  publisher={Elsevier}
}
