Shabak Challenge — Data science (an attempt)

Jun 25, 2021

We are given a 1.8 GB CSV file containing the ‘network logs of a medium sized company’. We can load it into a pandas DataFrame and analyze it in Python.

>>> import numpy as np
>>> import pandas as pd
>>> # on_bad_lines='skip' needs pandas 1.3+; older versions use error_bad_lines=False
>>> df1 = pd.read_csv('challenge.csv', parse_dates=True, index_col=0, on_bad_lines='skip')
>>> df1
                                     src_ip           dst_ip  ...  protocol  payload
timestamp                                                     ...                   
2020-06-21 00:00:02.892702   120.18.164.170     120.18.53.84  ...       UDP      218
2020-06-21 00:00:03.771702     120.18.53.84   120.18.164.170  ...       UDP      285
2020-06-21 00:00:03.989702   120.18.164.170    148.26.83.117  ...       TCP      142
2020-06-21 00:00:04.547702    148.26.83.117   120.18.164.170  ...       TCP      130
2020-06-21 00:00:05.170953   120.18.187.161     120.18.63.42  ...       TCP       69
...                                     ...              ...  ...       ...      ...
2020-06-25 23:01:16.406536  251.139.203.226    120.18.142.45  ...       UDP      385
2020-06-25 23:01:17.554536    120.18.142.45  251.139.203.226  ...       UDP      220
2020-06-25 23:01:18.500536  251.139.203.226    120.18.142.45  ...       UDP      350
2020-06-25 23:01:18.579536    120.18.142.45  251.139.203.226  ...       UDP      317
2020-06-25 23:01:19.563536  251.139.203.226    120.18.142.45  ...       UDP      383

[24456693 rows x 6 columns]
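
If memory is tight with a file this size, one option is to load only the columns the analysis needs and give them compact dtypes. The following is a minimal sketch on a tiny in-memory stand-in for the CSV. It assumes the header names match the column names in the output above; the `extra` column is hypothetical and only stands for columns we do not need.

```python
import io
import pandas as pd

# Tiny stand-in for challenge.csv. Header names are assumed to match the
# printed DataFrame; "extra" is a hypothetical column we want to skip.
sample = io.StringIO(
    "timestamp,src_ip,dst_ip,extra,protocol,payload\n"
    "2020-06-21 00:00:02.892702,120.18.164.170,120.18.53.84,x,UDP,218\n"
    "2020-06-21 00:00:03.771702,120.18.53.84,120.18.164.170,x,UDP,285\n"
    "2020-06-21 00:00:03.989702,120.18.164.170,148.26.83.117,x,TCP,142\n"
    "2020-06-21 00:00:04.547702,148.26.83.117,120.18.164.170,x,TCP,130\n"
)

df = pd.read_csv(
    sample,
    usecols=["timestamp", "src_ip", "dst_ip", "protocol", "payload"],
    parse_dates=["timestamp"],   # parse the timestamp column as datetimes
    index_col="timestamp",       # and use it as the index
    dtype={"payload": "int32", "protocol": "category"},  # smaller than the defaults
)

# Per-column memory footprint in bytes, including string contents
print(df.memory_usage(deep=True))
```

For the real file, the same keyword arguments could be passed to `pd.read_csv('challenge.csv', ...)`.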

Next, we extract the payload column and plot it.

>>> df1['payload']
timestamp
2020-06-21 00:00:02.892702    218
2020-06-21 00:00:03.771702    285
2020-06-21 00:00:03.989702    142
2020-06-21 00:00:04.547702    130
2020-06-21 00:00:05.170953     69
                             ... 
2020-06-25 23:01:16.406536    385
2020-06-25 23:01:17.554536    220
2020-06-25 23:01:18.500536    350
2020-06-25 23:01:18.579536    317
2020-06-25 23:01:19.563536    383
Name: payload, Length: 24456693, dtype: int64
>>> df1['payload'].plot(kind='line')
<matplotlib.axes._subplots.AxesSubplot object at 0x7f5904ff7ee0>
>>> import matplotlib.pyplot as plt
>>> plt.show()

For clarity, we can resample it by minute, taking the mean payload in each interval:

>>> df3 = df1['payload'].resample('min').mean()  # 'min' is the non-deprecated spelling of 'T'
>>> df3.plot(kind='line')
<matplotlib.axes._subplots.AxesSubplot object at 0x7f58ec5f1130>
>>> plt.show()

Plot of time vs payload, resampled by minute
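
Rather than reading the spikes off the plot by eye, the resampled series can be thresholded directly. This is only a sketch on synthetic data: the burst, its size, and the "mean plus two standard deviations" rule are all assumptions made for illustration, not something taken from the challenge.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for df1['payload']: one packet per second for 10 minutes
idx = pd.to_datetime("2020-06-21") + pd.to_timedelta(np.arange(600), unit="s")
payload = pd.Series(100, index=idx, dtype="int64")
payload.iloc[300:360] = 450  # hypothetical one-minute burst starting at 00:05

# Mean payload per minute, as in the resampling step above
per_minute = payload.resample("min").mean()

# Assumed rule: flag minutes more than 2 standard deviations above the mean
threshold = per_minute.mean() + 2 * per_minute.std()
spikes = per_minute[per_minute > threshold]
print(spikes)
```

The flagged timestamps can then be used to slice the original DataFrame and inspect the traffic in those minutes.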

Clearly, there are spikes in payload transmission, and the infected IP addresses are likely to show up in them. (Unfortunately, the challenge is long over, so we cannot check our answer.)

>>> datalog = np.array(df1['payload'])
>>> for i in range(len(datalog)):
...     if datalog[i] > 400:
...         print(df1[i:i+1])
...
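
Looping over roughly 24 million rows in Python is slow, so the same filter can also be written as a boolean mask, with the destination addresses counted afterwards. The sketch below uses a small made-up DataFrame; the IP addresses and payload values are hypothetical and only mirror the column names seen earlier.

```python
import pandas as pd

# Hypothetical miniature of df1 (addresses and sizes are invented)
df = pd.DataFrame(
    {
        "src_ip": ["10.0.0.5", "10.0.0.7", "10.0.0.5", "10.0.0.9"],
        "dst_ip": ["203.0.113.1", "10.0.0.5", "203.0.113.1", "10.0.0.7"],
        "payload": [450, 120, 480, 90],
    },
    index=pd.to_datetime(
        [
            "2020-06-21 00:00:01",
            "2020-06-21 00:00:02",
            "2020-06-21 00:00:03",
            "2020-06-21 00:00:04",
        ]
    ),
)

# Keep only rows whose payload exceeds the same threshold as the loop
large = df[df["payload"] > 400]

# Count how often each destination receives a large payload
dst_counts = large["dst_ip"].value_counts()
print(dst_counts)
```

On the full data this would read `df1[df1['payload'] > 400]`, and the most frequent destination in the counts is a natural first candidate to look at.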

It seems that 8.8.8.8 is the attacker’s C&C (command-and-control) server.


Written by Anarta Poashan