Machine Learning

Learning Goals of this Project:

  • Learning basic Pandas DataFrame manipulations.
  • Learning more about Machine Learning (ML) Classification models and how they are used in a Cybersecurity context.
  • Learning about basic data pipelines and transformations.
  • Learning how to write and use unit tests when developing Python code.

Important Highlights

  • You can do this project on your host machine; you do not need to use the VM.
  • Please see the Setup page for videos and instructions about project setup.
  • Keep the VM around for the final project (Summer 24), Web Security.
  • Please watch the provided videos below to see how to set up your environment. We can’t provide broad support for environment setup.
  • Only 25 submissions are allowed, because Gradescope is a limited resource. Do not use Gradescope to test your code.
  • We have provided a local testing suite. Be sure your code passes it completely before you submit to Gradescope.

Important Reference Materials:

Project Overview Video

This 16-minute video by the project creator covers the project concepts.

There are other videos on the Setup page that cover installation and other subjects.

BACKGROUND

Many of the projects in CS6035 focus on offensive security tasks, the Red Team activities that many of us associate with cybersecurity. This project focuses on defensive security tasks, usually considered Blue Team activities, which many corporate teams perform.

Historically, many defensive security professionals have investigated malicious activity, files, and code to create patterns (often called signatures) that can detect and prevent the same activity, files, or code when the pattern appears again. As a result, these simple methods were effective only against known threats.

This approach was relatively effective at preventing known malware from infecting systems, but it did nothing to protect against novel attacks. As attackers became more sophisticated, they learned to tweak or simply encode their malicious activity, files, or code to evade these simple pattern-matching detections.
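A minimal sketch of why signature matching misses tweaked or encoded content. The signature, the sample strings, and the helper name `matches_signature` are all hypothetical and chosen only for illustration; real detection rules are far more elaborate.

```python
import base64
import re

# Hypothetical signature for a known-bad string; not a real detection rule.
SIGNATURE = re.compile(r"evil_payload\(\)")


def matches_signature(data: bytes) -> bool:
    """Return True if the raw bytes contain the known pattern."""
    # latin-1 maps every byte to a character, so decoding never fails.
    return SIGNATURE.search(data.decode("latin-1")) is not None


known_sample = b"run evil_payload() now"
encoded_sample = base64.b64encode(known_sample)  # same content, Base64-encoded
tweaked_sample = b"run evil_payl0ad() now"       # one character changed

for name, sample in [
    ("known", known_sample),
    ("encoded", encoded_sample),
    ("tweaked", tweaked_sample),
]:
    print(name, matches_signature(sample))
```

Only the unmodified sample matches. The encoded and tweaked variants carry essentially the same content but no longer contain the exact pattern the signature looks for.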

Given this background, it would be useful to have a more general solution that assigns a score to the activity, files, and code that pass through corporate systems every day. Such a solution would tell the security team that, although a certain pattern does not exactly fit a signature of known malicious activity, files, or code, it closely resembles malicious examples seen in the past.

Luckily, machine learning models can do exactly that if provided with proper training data! It is therefore no surprise that machine learning is one of the most powerful tools available to defensive cybersecurity professionals. Modern detection systems usually combine machine learning models with pattern matching (regular expressions) to detect and prevent malicious activity on networks and devices.

This project teaches the fundamentals of data analysis and of building and testing your own machine learning models in Python. You’ll be using the open-source libraries Pandas and Scikit-Learn.
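To make the scoring idea concrete, here is a minimal sketch using Pandas and Scikit-Learn. The data is made up, the column names (`entropy`, `num_imports`, `is_malicious`) are hypothetical, and the choice of a decision tree is an assumption for illustration; none of this reflects the project's actual datasets or required models.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Toy, made-up training data: each row describes one previously seen file.
train = pd.DataFrame({
    "entropy":      [7.8, 7.5, 7.9, 3.1, 4.0, 3.6],
    "num_imports":  [2, 3, 1, 40, 35, 50],
    "is_malicious": [1, 1, 1, 0, 0, 0],
})

features = ["entropy", "num_imports"]

# Fit a classifier on the labeled examples (the model choice is an assumption).
model = DecisionTreeClassifier(random_state=0)
model.fit(train[features], train["is_malicious"])

# Score new, unlabeled files by their similarity to past examples.
new_files = pd.DataFrame({"entropy": [7.7, 3.3], "num_imports": [2, 45]})
scores = model.predict_proba(new_files[features])[:, 1]  # column 1 = class 1
new_files["malicious_score"] = scores
print(new_files)
```

Unlike a signature, the model does not need an exact match: a new file gets a high score when its features resemble those of malicious files seen during training.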

  • Machine learning in cybersecurity is a growing field. McKinsey listed the area among its top trends in 2022.
  • The CompTIA State of Cybersecurity 2024 says that last year there were 660,000 unfilled cybersecurity positions. Its section titled “Product: AI Drives the Cybersecurity Product Set to New Heights” also notes that 56% of respondents use AI and machine learning for cybersecurity.

Additional Information

Table of contents