
5 Awesome Datasets to Learn Cybersecurity Data Science

A lot of people ask me how to get data for Cybersecurity Data Science (CSDS). The challenge is that distributing malware is illegal and dangerous, so you can’t just go online and download a large collection of it (which is ironic, since most people are trying to avoid malware). But if your goal is to learn how to build a next-generation antivirus system, you need a dataset. So here are 5 sources of cybersecurity datasets you can practice on and learn from!

1. The Zoo

“The Zoo” (published as theZoo) is a repository of live malware binaries, ranging from the OSX backdoor JacksBot to WannaCry.

2. CERT Insider Threat Dataset

The Insider Threat Test Dataset is a collection of synthetic datasets that provide both background (benign) activity and malicious-actor activity.

3. KDD Cup

The KDD Cup dataset was curated for the KDD Cup 1999 competition, in which the goal was to build a network intrusion detector: a predictive model capable of distinguishing between “bad” connections (intrusions or attacks) and “good” (normal) connections.
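To make the “good vs. bad connection” framing concrete, here’s a minimal sketch of turning KDD-style records into binary labels. It assumes each record is a comma-separated line whose last column is the label, with normal traffic labeled "normal" (possibly with a trailing period) and anything else counted as an intrusion. The function name load_kdd_rows and the three-feature sample rows are made up for illustration; the real records carry many more features.

```python
import csv
import io
from collections import Counter


def load_kdd_rows(text):
    """Parse KDD-style CSV text into (features, is_intrusion) pairs."""
    rows = []
    for record in csv.reader(io.StringIO(text)):
        if not record:
            continue  # skip blank lines
        *features, label = record
        # Assumption: the label is the last column, and normal traffic is
        # labeled "normal", optionally followed by a period.
        is_intrusion = label.strip().rstrip(".") != "normal"
        rows.append((features, is_intrusion))
    return rows


# Hypothetical sample rows, shortened to three features each.
SAMPLE = "0,tcp,http,normal.\n0,icmp,ecr_i,smurf.\n0,tcp,private,neptune.\n"

rows = load_kdd_rows(SAMPLE)
counts = Counter(is_intrusion for _, is_intrusion in rows)
print(counts)  # class balance: True = intrusion, False = normal
```

From here, the (features, is_intrusion) pairs can be encoded and fed to whatever classifier you want to practice with.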

4. North Korean Missile Test Database

The North Korean Missile Test Database is a record of all North Korean tests of missiles capable of delivering a payload of at least 500 kilograms (1102.31 pounds) at a distance of at least 300 kilometers (186.4 miles). Some features include missile name, missile type, launch facility, date of launch, and the latitude and longitude of the launch.

5. Microsoft Malware Classification Challenge (BIG 2015) Dataset

This dataset is great because it is both large (almost half a terabyte uncompressed!) and de-fanged (the PE header has been removed). For each file, the raw data contains the hexadecimal representation of the file’s binary content, minus the PE header, along with a metadata manifest: a log of information extracted from the binary, such as function calls and strings. The manifest was generated using IDA, the disassembler tool.
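A simple first feature to extract from a hexadecimal dump like this is a byte histogram. Here’s a minimal sketch, assuming each line starts with an address token followed by two-character hex bytes, and that bytes the dump could not provide appear as a placeholder such as "??". That layout, the function names, and the two sample lines are assumptions for illustration, so check them against the actual files before relying on this.

```python
from collections import Counter


def byte_histogram(hex_dump):
    """Count byte values in a hex dump, skipping the address column."""
    counts = Counter()
    for line in hex_dump.splitlines():
        tokens = line.split()
        # Assumption: the first token on each line is an address.
        for token in tokens[1:]:
            if len(token) != 2:
                continue  # not a single byte
            try:
                counts[int(token, 16)] += 1
            except ValueError:
                continue  # placeholder such as "??"
    return counts


def to_feature_vector(counts):
    """Turn raw counts into 256 relative frequencies."""
    total = sum(counts.values())
    return [counts.get(b, 0) / total if total else 0.0 for b in range(256)]


# Hypothetical two-line sample in the assumed layout.
SAMPLE = (
    "00401000 56 8D 44 24 08 50 8B F1\n"
    "00401008 E8 ?? ?? ?? ?? 90 90 90\n"
)

hist = byte_histogram(SAMPLE)
vector = to_feature_vector(hist)
print(hist[0x90], sum(hist.values()))
```

The resulting 256-element vector has a fixed length regardless of file size, which makes it a convenient baseline input for a classifier.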

Hope this gives you a nice start. ’Til next time, cyberwarriors!

Dr. Emmanuel Tsukerman

Dr. Tsukerman is an award-winning cybersecurity data scientist who graduated from Stanford University and UC Berkeley. In 2017, his machine-learning-based anti-ransomware product made PC Magazine’s Top 10 Ransomware Products. In 2018, he designed a machine-learning-based malware detection system for Palo Alto Networks’ WildFire service (over 30k customers). In 2019, Dr. Tsukerman authored the Machine Learning for Cybersecurity Cookbook and launched the Cybersecurity Data Science Course and the Machine Learning for Red Team Hackers Course.
