Using machine learning to classify devices on your network

Cory Stephenson
Published in Gab41 · Apr 24, 2018


In this article, we walk through using our machine learning code to classify devices on a network. We've touched on this in previous blog posts about the Poseidon Software Defined Networking (SDN) project and how it relates to detecting lateral movement, as well as about using machine learning (ML) to analyze network data. With that in mind, we've experimented with classifying devices using packet-capture data, and we've made a few tools available to make it easier to try on your own network. The models we'll be using run in combination with the Poseidon SDN project; if you'd like to try that yourself, you can read about how to build your own Software-Defined Network with Raspberry Pis and a Zodiac FX switch, or watch the video.

The Poseidon project: combining SDNs and machine learning

For running the machine learning code, you'll need to have Docker installed and some way of capturing traffic from the devices that you want to classify. For network data capture, consider using our version of tcpdump, which we've modified to include flags that strip layer-4 payload information as well as information to external hosts. Once you've completed the prerequisites, you can run

docker pull lab41/poseidonml

to get the machine learning container, and we're ready to start. This post will cover four things:

· Running a pre-trained model on your data

· Assembling a dataset

· Training a model on a new dataset

· Testing the performance on a test dataset

Running a pre-trained model on your data

The easiest thing to do is to run one of our pre-trained models on your own data. To do that, first create a packet capture (our machine learning software, PoseidonML, expects a pcap) from the device of interest if you don't have one already. The capture should be at least 15 minutes long and contain a large amount of internal-to-internal traffic; longer captures may produce more reliable results.
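If you don't already have a capture, the sketch below shows one way to grab 15 minutes of traffic from a single device. It is an illustrative assumption on our part, not part of PoseidonML: it shells out to plain tcpdump from Python, so it does not strip layer-4 payloads the way our modified tcpdump does, and the interface name and MAC address are placeholders. It also needs capture privileges.

import subprocess

INTERFACE = "eth0"                  # placeholder: your capture interface
DEVICE_MAC = "aa:bb:cc:dd:ee:ff"    # placeholder: MAC of the device of interest
DURATION = 15 * 60                  # at least 15 minutes, per the guidance above

subprocess.run(
    ["tcpdump", "-i", INTERFACE,
     "-w", "eval.pcap",                # output pcap that PoseidonML will read
     "-G", str(DURATION), "-W", "1",   # rotate once after DURATION seconds, then exit
     "ether", "host", DEVICE_MAC],     # keep only traffic to/from this device
    check=True,
)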

Now you can choose between two models to use for classification. The default choice in PoseidonML is a Random Forest model. This model’s performance is encouraging out of the box when trained on new data (more on that in the next section!), so it is a good place to start. To run this model on your pcap:

docker run -v <path_to_pcap>:/pcaps/eval.pcap lab41/poseidonml

The other model is a one-layer neural network, which you can try by running:

docker run -v <path_to_pcap>:/pcaps/eval.pcap lab41/poseidonml:onelayer

[Figure: Neural network with one hidden layer. Not to be confused with our other networks.]

In either case, you should get an output with a message that looks something like this:

Message: {"98:01:xx:xx:xx:xx": {"classification": {"confidences": [0.6856065894902689, 0.2727088338248893, 0.022470232107183397], "labels": ["Developer workstation", "Unknown", "Active Directory controller"]}, "timestamp": 1498669414.355485, "valid": false, "decisions": {"behavior": "normal", "investigate": false}}}

This is the message that the ML tools send back to Poseidon. The "classification" field contains the labels that the classification model has assigned to the device the input pcap was sourced from, and the "confidences" field gives the associated confidence for each of those labels. In addition, there is a "behavior" field, which should typically be "normal", indicating the model did not detect abnormal behavior, and an "investigate" field, which is true if the ML tools are requesting that Poseidon gather more data from this device. The "valid" field simply indicates whether the request to analyze this device was made by Poseidon with the associated metadata, so it should be false in this case. The top three device types are returned, and they may include an "Unknown" label.
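If you want to work with this message programmatically, here is a minimal sketch in Python that pulls out the top label and confidence. The message string below is an abbreviated stand-in for real output, so the exact values are placeholders.

import json

# Abbreviated stand-in for the message PoseidonML sends back to Poseidon.
message = ('{"98:01:xx:xx:xx:xx": {"classification": '
           '{"confidences": [0.686, 0.273, 0.022], '
           '"labels": ["Developer workstation", "Unknown", "Active Directory controller"]}, '
           '"timestamp": 1498669414.355485, "valid": false, '
           '"decisions": {"behavior": "normal", "investigate": false}}}')

for mac, result in json.loads(message).items():
    labels = result["classification"]["labels"]
    confidences = result["classification"]["confidences"]
    print(f"{mac}: {labels[0]} ({confidences[0]:.1%})")   # top label and its confidence
    if result["decisions"]["behavior"] != "normal":
        print(f"  abnormal behavior flagged for {mac}")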

Assembling a dataset

One of our central concerns in this project was whether data from your network looks anything like the data we trained our models on. If it doesn't, the machine learning tools won't work very well. In light of that, we decided to make it easy to retrain the machine learning models on new data. The hardest part is collecting a suitable dataset, but here are a few rules of thumb that we found work well for our specific problem:

1. Have multiple devices for each label you want to assign.

Multiple devices help the model to learn more general information. If you have only one device, models may learn the specifics of that one device, which means they might not work on new data.

2. Capture traffic in uninterrupted segments at least 15 minutes long.

Our models work by looking at session-level information, so this time window is intended to contain many packets from several sessions.

3. Total length of captures from each device should be at least 5 hours.

This is intended to 'average out' specific uses of a device and give a better idea of how it behaves as a whole, rather than at one instant in time.

4. Reserve some devices for testing purposes.

This is done to evaluate how well the model you train is working. You need to test on data separate from what you trained on, or you won't know whether your model has learned anything general.

To keep things simple, we adopted a naming convention for the capture files of the form

DeviceName-deviceID-time-duration-flags.pcap

For example, a one-hour capture from my workstation on Tuesday at 1520 UTC might have a name like this:

DevWorkstation-User1-Tues1520-60mins-n00.pcap

After doing the collection, you should wind up with a directory of captures. Here are a few examples of what we used in training:

Fileserver-a-unk-unk-n00.pcap
GPULaptop-user1-Fri0409-60mins-n00.pcap
Iphone-b-Wed0945-5days-n00.pcap
BusinessWorkstation-user2-Mon-3days-00.pcap

Now we'll need to assign the labels that we want associated with these captures. To do this, create a JSON file called 'label_assignments.json' in the directory with your data. This will associate a device name with a label. For the examples above, our label_assignments.json might look like this:

{
    "DevWorkstation": "Developer workstation",
    "Iphone": "Smartphone",
    "GPULaptop": "GPU laptop",
    "Fileserver": "File server",
    "BusinessWorkstation": "Business workstation"
}

In many cases, the label is similar to the device name, but this doesn’t have to be the case. Any captures that aren’t assigned a label will be automatically given the ‘Unknown’ label. This is the basic format that you can use for making both a set of training data and a set of testing data.
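As a quick sanity check on a dataset, a short sketch like the following parses the DeviceName prefix from each capture filename and shows which label, if any, it will receive. The directory path is a placeholder; everything else follows the naming convention and fallback behavior described above.

import json
from pathlib import Path

data_dir = Path("/path/to/pcaps")   # placeholder: directory containing your captures
assignments = json.loads((data_dir / "label_assignments.json").read_text())

for pcap in sorted(data_dir.glob("*.pcap")):
    # DeviceName comes from DeviceName-deviceID-time-duration-flags.pcap
    device_name = pcap.name.split("-")[0]
    label = assignments.get(device_name, "Unknown")
    print(f"{pcap.name} -> {label}")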

Training a new model

Now that you have the training dataset configured, training a model should be pretty straightforward. We’ve automated most of the process, so you should be able to run

docker run -v <path-to-dataset>:/pcaps -v <path-to-save-model>:/models lab41/poseidonml train

This will train a random forest model, but you could also use lab41/poseidonml:onelayer for the neural network model. We found the random forest model works well on most datasets, but the neural network model can sometimes work better on larger datasets. This step handles the preprocessing of the dataset, including feature selection, model training, and cross-validation. After it runs, you should see an overall F1-score (closer to 1 is better) on a validation set (the training script automatically creates one from your training data), and you should have a trained model in the directory you specified. That's all you have to do to train a model!
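To give a feel for what happens under the hood, here is a rough, illustrative sketch of that kind of pipeline. It is not PoseidonML's actual code: the feature matrix below is a random stand-in for the session-level features extracted from pcaps, and the scoring mirrors the held-out validation step described above.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Random stand-ins for session-level feature vectors and device-type labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = rng.integers(0, 3, size=200)

# Hold out a validation split, fit a random forest, and report a mean F1-score.
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
print("Validation F1:", f1_score(y_val, model.predict(X_val), average="macro"))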

Testing model performance

After training a model, you'll probably want to test how it performs on new data. Fortunately, that's not too different from what we've done so far. You'll need to create a test dataset the same way as described above; it should not include any of the data that was in the training dataset, though. After that, you simply run

docker run -v <path-to-dataset>:/pcaps -v <path-to-save-model>:/models lab41/poseidonml testing

Again, you could also use lab41/poseidonml:onelayer for the other model. After processing the dataset, this should give a result that looks something like the following:

----------------------------------------------------------------------
Results with unknowns
----------------------------------------------------------------------
F1 of 0.XX for Developer workstation
F1 of 0.XX for Smartphone
F1 of 0.XX for Unknown
Mean F1: 0.XX
----------------------------------------------------------------------
Results forcing decisions
----------------------------------------------------------------------
F1 of 0.XX for Developer workstation
F1 of 0.XX for Smartphone
Mean F1: 0.XX
----------------------------------------------------------------------
Analysis statistics
----------------------------------------------------------------------
Evaluated X pcaps in X seconds
Total data: X Mb
Total capture time: X hours
Data processing rate: X Mb per second
Time per 15 minute capture: X seconds
----------------------------------------------------------------------

The table above shows the performance of the model (measured by F1-score) on each label that we defined, and the performance averaged over all labels (mean F1-score). Additionally, you get some statistics on the run time and amount of data processed at the end. This should be all you have to do to train a network device classification model on your own data.
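For reference, per-label and mean F1-scores like the ones in the table can be computed from predictions with scikit-learn; the true and predicted label lists below are made-up placeholders.

from sklearn.metrics import f1_score

labels = ["Developer workstation", "Smartphone", "Unknown"]
# Placeholder ground truth and predictions for a handful of test pcaps.
y_true = ["Developer workstation", "Smartphone", "Unknown", "Smartphone"]
y_pred = ["Developer workstation", "Unknown", "Unknown", "Smartphone"]

for label, score in zip(labels, f1_score(y_true, y_pred, labels=labels, average=None)):
    print(f"F1 of {score:.2f} for {label}")
print(f"Mean F1: {f1_score(y_true, y_pred, labels=labels, average='macro'):.2f}")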

There’s still a lot left to do, in particular assembling a large shareable network dataset for collaborative research. Hopefully, our tools make it easier to get started running your own experiments. We are really interested in finding out more about how models trained on data from one network work on another, so if you conduct any experiments on that, we would love to hear about them!

Lab41 is a Silicon Valley challenge lab where experts from the U.S. Intelligence Community (IC), academia, industry, and In-Q-Tel come together to gain a better understanding of how to work with — and ultimately use — big data.

Learn more at lab41.org and follow us on Twitter: @_lab41
