Skip to article frontmatterSkip to article content
Site not loading correctly?

This may be due to an incorrect BASE_URL configuration. See the MyST Documentation for reference.

Lecture 6 - (26/02/2026)

Today’s Topics:

  • Serialization

  • Visualization

Serializing Objects

Serialization is the act of translating data structures or objects into a format that can be stored and used later.

A standard approach is Pickling:

- Save to a file: `pickle.dump(object, file, ...)`
- Returns a string: `pickle.dumps(object, ...)`

Unpickling is the deserialization of an object:

- Read from a file: `Unpickler(file).load()` or `pickle.load(file, ...)`
- Read from a string: `pickle.loads(bytes_object, ...)`
import pickle

data = {
    'a': ['alex', 'alia'],
    'b': ['brandon', 'bri'],
    'c': ['calvin', 'carmen']
}
with open('data.pickle', 'wb') as f:
    pickle.dump(data,f,pickle.HIGHEST_PROTOCOL)
    # stores data in a file called data.pickle
with open('data.pickle', 'rb') as f:
    new_data = pickle.load(f)
    # reads the data in the file called data.pickle

print(new_data)
{'a': ['alex', 'alia'], 'b': ['brandon', 'bri'], 'c': ['calvin', 'carmen']}

What can be pickled and unpickled:

  • None, True, False

  • integers, floating point numbers, complex numbers

  • strings, bytes, bytearrays

  • tuples, lists, sets, dictionaries containing picklable objects

  • functions defined at the top level of a module (def not lambda)

  • built in functions defined at the top level of a module

  • classes that are defined at the top level of a module

Visualization

Visualizing Quantitative Data: Histograms

We will use plotly and seaborn as our default for creating plots.

import seaborn as sns
import numpy as np
import statsmodels
import pandas as pd
import plotly.express as px
import plotly.figure_factory as ff
sns.set()
ti = sns.load_dataset('titanic').dropna().reset_index(drop=True)

ff.create_distplot([ti['age']], ['ages'])
#                    column      labels
Loading...

We can also control how the data is divided

ff.create_distplot([ti['age']], ['ages'], bin_size=10)
Loading...

Visualizing Quantitative Data: Box Plots

  • Box plots show where most of the data is.

  • Usually use 25th and 75th percentiles of the box with “whiskers” capturing rest of data (modulo outliers).

px.box(x=ti['fare'])
Loading...

Inter-Quartile Range (IQR) used to decide outliers.

IQR=Q3Q1IQR = Q_3 - Q_1
image
lower, upper = np.percentile(ti['fare'], [25,75])
IQR = upper - lower
print(f'Q1: {lower}\nQ3: {upper}\nIQR: {IQR}')
Q1: 29.7
Q3: 90.0
IQR: 60.3

Can display box plots for different categories:

px.box(x=ti['fare'], y=ti['who'])
Loading...

Visualizing Quantitative Data: Scatter Plots

Scatter plots are used to compare two quantitative variables

px.scatter(ti,x='age',y='fare',color='who',trendline='ols')
Loading...

For qualitative or categorical data, we most often use bar charts and scatter charts.

px.histogram(ti,x='alive')
Loading...

Counting the number of survivors, refining by class:

px.histogram(ti,x='alive',color='class')
Loading...

Visualizing Quantative Data: Time Series

Often data is collected over a period of time.

  • Pandas recognizes most time/date formats.

  • Can display and summarize information across time periods.

# load in the dataset
mta = pd.read_csv('https://stjohn.github.io/datasci/spr23/si2020.csv')
mta.head()
Loading...
# compute the rolling average
mta['7DayRolling'] = mta['entries'].rolling(7).mean()

Create a line plot of entries and rolling averages:

fig = px.line(mta,x='date', y='entries')
fig.add_scatter(
    x=mta['date'],
    y=mta['7DayRolling'],
    mode='lines',
    name='7-Day Rolling Avg'
)

fig.show()
Loading...
image

rolling() is a built-in function that computes a rolling average

Other ways of summarizing information:

  • cumulativeAverage(column): Returns a Series with the running average of the values.

  • cyclicAverage(column): Assumes data is cyclic, and computes average of points in the cycle. Since ridership is highly dependent on the day of the week, this averages the values of the same day in past weeks.

  • exponentialSmoothing(column): Returns a Series with a weighted average of the previous values with the most recent values have higher weight and the older ones have lower weights. The value for the first entry, newCol[0] is column[0]. The value for subsequent entries is newCol[t+1] = 0.5*column[t+1] + 0.5*newCol[t].

Time for work! If you need help with School Counts let me know!