How to use CyNER: A Python Library for Cybersecurity Named Entity Recognition
Earlier this year, CyNER, an open-source Python library for Cybersecurity Named Entity Recognition, was released. Here are the respective links for the paper and the GitHub repository. In this post, I will give you a short tutorial on how to get started with it.
Introduction
Before jumping into the code, let’s quickly go over the logic behind CyNER. This model combines the following 3 strategies:
- Transformer-based models
  To extract cybersecurity-related entities using their context.
- Heuristics
  To extract IOCs (Indicators of Compromise) that follow a specific pattern and can be matched with RegEx (for example, IP addresses, CVEs, etc.).
- spaCy and Flair
  To extract general entities that do not fall under cybersecurity but might be of interest (for example, company names, countries, etc.).
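To make the heuristics strategy more concrete, here is a minimal sketch of regex-based IOC extraction. The patterns and the find_iocs helper are my own illustrative assumptions, not the rules that CyNER ships with.

```python
import re

# Illustrative patterns only (assumptions, not CyNER's own heuristics).
IOC_PATTERNS = {
    "IPv4": re.compile(
        r"\b(?:(?:25[0-5]|2[0-4]\d|1?\d?\d)\.){3}(?:25[0-5]|2[0-4]\d|1?\d?\d)\b"
    ),
    "CVE": re.compile(r"\bCVE-\d{4}-\d{4,}\b", re.IGNORECASE),
    "SHA256": re.compile(r"\b[a-fA-F0-9]{64}\b"),
}


def find_iocs(text):
    """Return (start, end, label, matched_text) tuples sorted by position."""
    hits = []
    for label, pattern in IOC_PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((match.start(), match.end(), label, match.group()))
    return sorted(hits)


for hit in find_iocs("Host 10.0.0.5 exploited CVE-2021-44228."):
    print(hit)
```

Each hit carries character offsets, which is what makes it possible to combine these matches with the spans produced by the other strategies.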
CyNER is a flexible model that lets the user choose which strategies to use and which priority order to apply when merging the outputs of the different models.
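One way to picture priority-based merging is the sketch below. It is a hypothetical illustration and not CyNER's actual merging code: merge_by_priority and the (start, end, label) span format are names I made up for this example. Spans from higher-priority sources are accepted first, and a later span is dropped if it overlaps one that was already accepted.

```python
def merge_by_priority(span_lists):
    """Merge span lists given in priority order (highest priority first).

    Each span is a (start, end, label) tuple with an exclusive end offset.
    A span is kept only if it does not overlap a span accepted earlier.
    """
    accepted = []
    for spans in span_lists:
        for start, end, label in spans:
            overlaps = any(
                start < a_end and a_start < end for a_start, a_end, _ in accepted
            )
            if not overlaps:
                accepted.append((start, end, label))
    return sorted(accepted)


# Hypothetical outputs from two strategies for the same text.
transformer_spans = [(0, 6, "Malware")]
heuristic_spans = [(0, 6, "Indicator"), (10, 20, "Indicator")]

print(merge_by_priority([transformer_spans, heuristic_spans]))
```

Swapping the order of the two lists changes which label wins for the overlapping span, which is the kind of choice the priority setting controls.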
👩‍💻 The code
The first step is to install the library with:
pip install git+https://github.com/aiforsec/CyNER.git
The authors of CyNER provide a Jupyter Notebook with a demo. However, if you run it as is, you won’t get the same results as shown in the notebook, which is what prompted me to write this post.
First, note that the model loaded in the second cell of the demo is a locally fine-tuned model:
In [2]: model1 = cyner.CyNER(transformer_model='xlm-roberta-large', use_heuristic=False, flair_model=None)
Therefore, you need to train it first with the code below:
import cyner

cfg = {'checkpoint_dir': 'MyFolder',
       'dataset': 'dataset/mitre',
       'transformers_model': 'xlm-roberta-large',
       'lr': 5e-6,
       'epochs': 100,
       'max_seq_length': 280}

model = cyner.TransformersNER(cfg)
model.train()
- checkpoint_dir is the directory that will contain the model’s relevant files, such as the weight file. You can select any folder you wish for this parameter.
- dataset is the path to the custom dataset used to fine-tune your model. The authors of CyNER shared on GitHub a manually labeled dataset, annotated on different cybersecurity incidents from the MITRE database. To use this dataset, you only need to clone their repo and point to the correct folder. Otherwise, you can create your own custom dataset following the BIO format.
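If you have not worked with the BIO format before, the sketch below shows the idea: the first token of an entity gets a B- tag, the following tokens of the same entity get I- tags, and everything else is tagged O. The to_bio helper and the label names are illustrative assumptions. The "one token and tag per line" layout printed at the end is also an assumption about how such files commonly look, so check the files in the dataset folder of the CyNER repo for the exact layout it expects.

```python
def to_bio(tokens, entities):
    """Tag tokens with BIO labels.

    entities is a list of (first_token_index, end_token_index, label)
    tuples, where end_token_index is exclusive.
    """
    tags = ["O"] * len(tokens)
    for start, end, label in entities:
        tags[start] = f"B-{label}"          # beginning of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"          # inside the same entity
    return tags


tokens = ["FluBot", "spreads", "on", "Android", "phones"]
tags = to_bio(tokens, [(0, 1, "Malware"), (3, 5, "System")])

# Assumed file layout: one "token tag" pair per line.
for token, tag in zip(tokens, tags):
    print(f"{token} {tag}")
```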
Once your model is trained, you can load it by passing the checkpoint_dir you selected as the transformer_model argument:
text = 'Proofpoint report mentions that the German-language messages were turned off once the UK messages were established, indicating a conscious effort to spread FluBot 446833e3f8b04d4c3c2d2288e456328266524e396adbfeba3769d00727481e80 in Android phones.'

model_run = cyner.CyNER(transformer_model='MyFolder', use_heuristic=False, flair_model=None)
entities = model_run.get_entities(text)

for i, e in enumerate(entities):
    print(i)
    print(e)
    print()
Side note
When running this model, I came across an error that said:
AttributeError: 'DataParallel' object has no attribute 'save_pretrained'
I haven’t fully investigated why this was happening, but most likely some of the dependencies have changed since CyNER was released. I solved this by going directly into the library and modifying it.
To find out where a Python library is installed, you can run:
>>> import cyner
>>> cyner.__file__
'/home/vmarquez/anaconda3/envs/myenv/lib/python3.9/site-packages/cyner/__init__.py'
>>>
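If you prefer not to import the package just to find it, the standard library can locate it for you. This is a small sketch using importlib; package_dir is a helper name I made up, and I use json in the example only because it is always available. Replace it with cyner on your machine.

```python
import importlib.util
from pathlib import Path


def package_dir(name):
    """Return the directory of an installed top-level package, or None."""
    spec = importlib.util.find_spec(name)
    if spec is None or spec.origin is None:
        return None
    return Path(spec.origin).parent


# 'json' is a stand-in here; use package_dir("cyner") in your environment.
print(package_dir("json"))
```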
Then I updated the following lines in the file ~/anaconda3/envs/myenv/lib/python3.9/site-packages/cyner/tner/model.py (replace with your path):
- Line 319:
self.model.module.save_pretrained(self.args.checkpoint_dir)
- Line 339:
self.model.module.from_pretrained(self.args.checkpoint_dir)
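The error message suggests that the model had been wrapped in PyTorch's DataParallel, which keeps the original model in its .module attribute. That is why adding .module makes the call work. A more defensive variant, sketched below, unwraps the model only when the attribute is present, so the same code runs whether or not the wrapper is used. The unwrap helper and the two stand-in classes are hypothetical, and they exist only so the example runs without PyTorch installed. I have not tested this variant inside CyNER itself.

```python
def unwrap(model):
    """Return model.module if the model is wrapped, else the model itself."""
    return getattr(model, "module", model)


# Stand-ins so the sketch runs without PyTorch (both are hypothetical).
class TinyModel:
    def save_pretrained(self, path):
        return f"saved to {path}"


class FakeDataParallel:
    """Mimics only the one DataParallel trait that matters here: .module."""

    def __init__(self, module):
        self.module = module


wrapped = FakeDataParallel(TinyModel())
print(unwrap(wrapped).save_pretrained("MyFolder"))      # wrapped model
print(unwrap(TinyModel()).save_pretrained("MyFolder"))  # bare model
```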
I hope you’ve enjoyed reading this post! I want to end by thanking tilusnet on GitHub for the insight on how to get started.