Shabak Challenge — Data science (an attempt)

Anarta Poashan
2 min read · Jun 25, 2021

We are given a 1.8 GB CSV file containing the ‘network logs of a medium sized company’. We can load it into a pandas DataFrame and analyze it in Python.

>>> import numpy as np
>>> import csv
>>> import pandas as pd
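>>> # on_bad_lines='skip' needs pandas >= 1.3; older versions used error_bad_lines=False (removed in pandas 2.0)
>>> df1 = pd.read_csv('challenge.csv', parse_dates=True, index_col=0, on_bad_lines='skip')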
>>> df1
src_ip dst_ip ... protocol payload
timestamp ...
2020-06-21 00:00:02.892702 120.18.164.170 120.18.53.84 ... UDP 218
2020-06-21 00:00:03.771702 120.18.53.84 120.18.164.170 ... UDP 285
2020-06-21 00:00:03.989702 120.18.164.170 148.26.83.117 ... TCP 142
2020-06-21 00:00:04.547702 148.26.83.117 120.18.164.170 ... TCP 130
2020-06-21 00:00:05.170953 120.18.187.161 120.18.63.42 ... TCP 69
... ... ... ... ... ...
2020-06-25 23:01:16.406536 251.139.203.226 120.18.142.45 ... UDP 385
2020-06-25 23:01:17.554536 120.18.142.45 251.139.203.226 ... UDP 220
2020-06-25 23:01:18.500536 251.139.203.226 120.18.142.45 ... UDP 350
2020-06-25 23:01:18.579536 120.18.142.45 251.139.203.226 ... UDP 317
2020-06-25 23:01:19.563536 251.139.203.226 120.18.142.45 ... UDP 383
[24456693 rows x 6 columns]
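
A 1.8 GB file with about 24 million rows may not fit comfortably in memory on every machine. As a minimal sketch, assuming we only need running aggregates rather than the whole table, we could stream the file with the `chunksize` parameter of `pd.read_csv`. The inline sample rows and the `summarize_in_chunks` helper are made up for illustration and are not part of the challenge data; only the column names visible in the output above are used.

```python
import io
import pandas as pd

# Tiny stand-in for challenge.csv: made-up rows, illustrative only.
sample_csv = io.StringIO(
    "timestamp,src_ip,dst_ip,protocol,payload\n"
    "2020-06-21 00:00:02,10.0.0.1,10.0.0.2,UDP,218\n"
    "2020-06-21 00:00:03,10.0.0.2,10.0.0.1,UDP,285\n"
    "2020-06-21 00:00:04,10.0.0.1,10.0.0.3,TCP,142\n"
    "2020-06-21 00:00:05,10.0.0.3,10.0.0.1,TCP,130\n"
    "2020-06-21 00:00:06,10.0.0.4,10.0.0.5,TCP,69\n"
)


def summarize_in_chunks(source, chunksize=2):
    """Read the CSV chunk by chunk, tracking row count and largest payload."""
    total_rows = 0
    max_payload = None
    reader = pd.read_csv(source, parse_dates=True, index_col=0, chunksize=chunksize)
    for chunk in reader:
        total_rows += len(chunk)
        chunk_max = int(chunk["payload"].max())
        if max_payload is None or chunk_max > max_payload:
            max_payload = chunk_max
    return total_rows, max_payload


total_rows, max_payload = summarize_in_chunks(sample_csv)
print(total_rows, max_payload)
```

For the real file we would pass `'challenge.csv'` instead of the in-memory sample and pick a much larger `chunksize`.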

Now we extract the payload column on its own and plot it:

>>> df1['payload']
timestamp
2020-06-21 00:00:02.892702 218
2020-06-21 00:00:03.771702 285
2020-06-21 00:00:03.989702 142
2020-06-21 00:00:04.547702 130
2020-06-21 00:00:05.170953 69
...
2020-06-25 23:01:16.406536 385
2020-06-25 23:01:17.554536 220
2020-06-25 23:01:18.500536 350
2020-06-25 23:01:18.579536 317
2020-06-25 23:01:19.563536 383
Name: payload, Length: 24456693, dtype: int64
>>> df1['payload'].plot(kind='line')
<matplotlib.axes._subplots.AxesSubplot object at 0x7f5904ff7ee0>
>>> import matplotlib.pyplot as plt
>>> plt.show()
Plot of time vs payload

We can resample it by minute, taking the mean payload in each minute, for clarity:

>>> df3 = df1['payload'].resample('T').mean()
>>> df3.plot(kind='line')
<matplotlib.axes._subplots.AxesSubplot object at 0x7f58ec5f1130>
>>> plt.show()
Plot of time vs payload, resampled by minute
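
A per-minute mean can hide a single large packet among many small ones, so it may help to look at the per-minute maximum and packet count as well. The sketch below assumes a small made-up payload series (not the challenge data) and uses the `'min'` alias, since the `'T'` alias used above is deprecated in recent pandas releases.

```python
import pandas as pd

# Made-up payload values spread over two minutes, for illustration only.
idx = pd.to_datetime([
    "2020-06-21 00:00:10",
    "2020-06-21 00:00:40",
    "2020-06-21 00:01:05",
    "2020-06-21 00:01:30",
    "2020-06-21 00:01:55",
])
payload = pd.Series([100, 200, 50, 450, 100], index=idx, name="payload")

# One row per minute with the mean, the largest payload, and the packet count.
per_minute = payload.resample("min").agg(["mean", "max", "count"])
print(per_minute)
```

In this toy series the second minute has a modest mean, but its `max` column still exposes the 450 packet.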

Clearly, there are several spikes in payload transmission. We can look for our tentative infected IP addresses in these spikes. (Unfortunately, the challenge is long over, so we will not know the correct answer.)

>>> datalog = np.array(df1['payload'])
>>> for i in range(len(datalog)):
...     if datalog[i] > 400:
...         print(df1[i:i+1])
...

It seems that 8.8.8.8 is the attacker’s C&C (command-and-control) server.
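
Looping over 24 million rows in Python is slow, and printing every match makes it hard to see which addresses dominate. A possible alternative, sketched here on a made-up DataFrame (the addresses and payloads are invented, and the 400 threshold is carried over from the loop above), is to filter with a boolean mask and count how often each address appears among the spike rows.

```python
import pandas as pd

# Made-up rows using the column names from the challenge file; illustrative only.
df = pd.DataFrame(
    {
        "src_ip": ["10.0.0.1", "10.0.0.2", "10.0.0.1", "10.0.0.3", "10.0.0.1"],
        "dst_ip": ["10.0.0.9", "10.0.0.1", "10.0.0.9", "10.0.0.1", "10.0.0.9"],
        "payload": [450, 120, 500, 410, 90],
    },
    index=pd.to_datetime([
        "2020-06-21 00:00:01",
        "2020-06-21 00:00:02",
        "2020-06-21 00:00:03",
        "2020-06-21 00:00:04",
        "2020-06-21 00:00:05",
    ]),
)

THRESHOLD = 400  # same cut-off as the loop above

# Boolean mask: keep only rows whose payload exceeds the threshold.
spikes = df[df["payload"] > THRESHOLD]

# Count how often each address shows up as source or destination in the spikes.
src_counts = spikes["src_ip"].value_counts()
dst_counts = spikes["dst_ip"].value_counts()

print(src_counts)
print(dst_counts)
```

On the real data, the same two `value_counts()` calls on `df1` would show which addresses appear most often in large-payload rows.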
