5 Awesome Datasets to Learn Cybersecurity Data Science

A lot of people ask me how to get data for Cybersecurity Data Science (CSDS). The challenge is that distributing live malware is dangerous and can be illegal, so you can’t just go online and download a large collection of it (which is ironic, since most people spend their time trying to avoid malware). But when your goal is to learn how to build a next-generation antivirus system, you need a dataset. So here are 5 datasets you can use to practice on and learn from!

1. The Zoo

“The Zoo” is a collection of live malware binaries, ranging from the OSX backdoor JacksBot to the WannaCry ransomware.

2. CERT Insider Threat Dataset

The Insider Threat Test Dataset is a collection of synthetic datasets for testing insider threat detection. Each one contains both benign background activity and the activity of malicious insiders.

3. KDD Cup

The KDD Cup dataset was curated for a competition whose goal was to build a network intrusion detector: a predictive model capable of distinguishing “bad” connections (intrusions) from “good” (normal) connections.
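To make the “good vs. bad connection” framing concrete, here is a minimal sketch of collapsing raw labels into two classes. It assumes each record is a CSV row whose last column is the label, with normal traffic labeled `normal.`. The sample rows are made up and shortened for illustration, and `to_binary` and `load_connections` are hypothetical helper names, not part of the dataset.

```python
import csv
import io

# Made-up, shortened rows for illustration only. The only layout assumed is
# "feature columns first, label in the last column".
SAMPLE = """0,tcp,http,SF,181,5450,normal.
0,icmp,ecr_i,SF,1032,0,smurf.
0,tcp,private,S0,0,0,neptune.
"""


def to_binary(label):
    """Collapse a raw label into 'good' (normal) or 'bad' (anything else)."""
    return "good" if label.strip().rstrip(".") == "normal" else "bad"


def load_connections(text):
    """Return a list of (features, binary_label) pairs from CSV text."""
    rows = []
    for record in csv.reader(io.StringIO(text)):
        if not record:
            continue  # skip blank lines
        *features, label = record
        rows.append((features, to_binary(label)))
    return rows


connections = load_connections(SAMPLE)
labels = [label for _, label in connections]
```

From here, the feature columns would still need encoding (for example, turning the protocol and service strings into numbers) before they can be fed to a classifier.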

4. North Korean Missile Test Database

The North Korean Missile Test Database is a record of all North Korean tests of missiles capable of delivering a payload of at least 500 kilograms (1102.31 pounds) over a distance of at least 300 kilometers (186.4 miles). Features include missile name, missile type, launch facility, date of launch, and the longitude and latitude of the launch.

5. Microsoft Malware Classification Challenge (BIG 2015) Dataset

This dataset is great because it is both large (almost half a terabyte uncompressed!) and de-fanged (the PE header has been removed, so the samples cannot run). For each file, the raw data contains the hexadecimal representation of the file’s binary content, minus the PE header, along with a metadata manifest: a log of information extracted from the binary, such as function calls and strings. The manifest was generated with the IDA disassembler.
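As a starting point for working with the hexadecimal representation, here is a hedged sketch of a byte histogram, a simple fixed-length feature vector for a classifier. It assumes each line of a hex dump starts with an address followed by two-character hex byte tokens, and that unreadable bytes appear as a placeholder such as `??`. Check that layout against the actual files before relying on it. The sample text and the name `byte_histogram` are made up for illustration.

```python
from collections import Counter

# Made-up sample in the assumed layout: an address column, then hex bytes.
SAMPLE = """00401000 56 8D 44 24 08 50 8B F1
00401008 E8 1C 1B 00 00 ?? ?? 00
"""


def byte_histogram(text):
    """Count how often each byte value (0-255) occurs in a hex dump."""
    counts = Counter()
    for line in text.splitlines():
        tokens = line.split()
        for token in tokens[1:]:  # skip the leading address column
            if len(token) != 2:
                continue
            try:
                counts[int(token, 16)] += 1
            except ValueError:
                pass  # placeholder such as "??", not a real byte
    # Always return 256 entries so every file maps to the same-length vector.
    return [counts.get(value, 0) for value in range(256)]


hist = byte_histogram(SAMPLE)
```

Dividing each count by `sum(hist)` turns the histogram into frequencies, which keeps large and small files comparable.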

Hope this gives you a nice start. ’Til next time, cyberwarriors!

Dr. Emmanuel Tsukerman

Dr. Tsukerman is an award-winning cybersecurity data scientist who graduated from Stanford University and UC Berkeley. In 2017, his machine-learning-based anti-ransomware product was named to PC Magazine’s Top 10 Ransomware Products list. In 2018, he designed a machine-learning-based malware detection system for Palo Alto Networks’ WildFire service (over 30k customers). In 2019, Dr. Tsukerman authored the Machine Learning for Cybersecurity Cookbook and launched the Cybersecurity Data Science Course and the Machine Learning for Red Team Hackers Course.
