A lot of people ask me how to get data for Cybersecurity Data Science (CSDS). The challenge is that distributing malware is dangerous and can be illegal, so you can’t just go online and download a large collection of it (which is ironic, since most people are trying to avoid malware). But if your goal is to learn how to build a next-generation antivirus system, you need a dataset. So here are 5 sources of cybersecurity datasets you can practice on and learn from!
1. The Zoo
“The Zoo” is a collection of malware binaries, ranging from the OSX backdoor JacksBot to WannaCry.
2. CERT Insider Threat Dataset
The Insider Threat Test Dataset is a collection of synthetic datasets for testing insider threat detection. Each one contains both background (benign) activity and malicious-actor activity.
3. KDD Cup
The KDD Cup 1999 dataset was curated for a competition in which the goal was to build a network intrusion detector: a predictive model capable of distinguishing between “bad connections” (“intrusions”) and “good connections” (“normal connections”).
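To make the “bad” versus “good” framing concrete, here is a minimal sketch of turning attack-type labels into a binary target. It assumes the label is the last comma-separated field of each row and ends with a period (for example `normal.`). The rows below are made-up, shortened stand-ins rather than real records, and `to_binary_label` is a hypothetical helper name.

```python
from collections import Counter

# Toy rows imitating the dataset's CSV layout. Assumption: the label is the
# last field and carries a trailing period. Real rows have many more columns.
SAMPLE_ROWS = [
    "0,tcp,http,SF,181,5450,normal.",
    "0,icmp,ecr_i,SF,1032,0,smurf.",
    "0,tcp,private,S0,0,0,neptune.",
    "0,tcp,http,SF,239,486,normal.",
]


def to_binary_label(row):
    """Return 0 for a normal connection and 1 for any kind of intrusion."""
    label = row.strip().rsplit(",", 1)[-1].rstrip(".")
    return 0 if label == "normal" else 1


labels = [to_binary_label(row) for row in SAMPLE_ROWS]
counts = Counter(labels)  # how many good (0) and bad (1) connections we have
print(labels, dict(counts))
```

Collapsing every attack type into a single “intrusion” class is the simplest way to start; you can keep the original labels if you later want a multi-class model.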
4. North Korean Missile Test Database
The North Korean Missile Test Database is a record of all North Korean tests of missiles capable of delivering a payload of at least 500 kilograms (1102.31 pounds) over a distance of at least 300 kilometers (186.4 miles). Features include missile name, missile type, launch facility, date of launch, and the longitude and latitude of the launch site.
5. Microsoft Malware Classification Challenge (BIG 2015) Dataset
This dataset is great because it is both large (almost half a terabyte uncompressed!) and de-fanged (the PE header has been removed). For each file, the raw data contains the hexadecimal representation of the file’s binary content, minus the PE header, as well as a metadata manifest: a log of information extracted from the binary, such as function calls and strings. The manifest was generated using IDA, the disassembler tool.
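As a first feature to try on the hexadecimal dumps, here is a minimal sketch that counts how often each byte value appears. It assumes each line is an address followed by space-separated two-character hex bytes, with `??` marking bytes that are not available. The sample text is made up for illustration, and `byte_histogram` is a hypothetical helper name.

```python
from collections import Counter
from typing import List

# Made-up sample in the assumed layout: address, then hex bytes per line.
SAMPLE_BYTES = """\
00401000 56 8D 44 24 08 50 8B F1 E8 1C 1B 00 00 C7 06 08
00401010 BB 42 00 8B C6 5E C2 04 00 CC CC CC CC CC CC CC
00401020 ?? ?? ?? ?? 00 00 00 00 00 00 00 00 00 00 00 00
"""


def byte_histogram(text: str) -> List[int]:
    """Count occurrences of each byte value (0-255) in a hex dump."""
    counts = Counter()
    for line in text.splitlines():
        tokens = line.split()
        for token in tokens[1:]:  # tokens[0] is the address column
            if token == "??":  # assumed placeholder for unavailable bytes
                continue
            counts[int(token, 16)] += 1
    return [counts[value] for value in range(256)]


hist = byte_histogram(SAMPLE_BYTES)
print(sum(hist), hist[0x00], hist[0xCC])
```

The resulting 256-element vector can be fed to any classifier as a simple baseline before you move on to richer features from the manifest.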
I hope this gives you a good start. Until next time, cyberwarriors!