Lecture 3 - (05/02/2026)
Today’s Topics:
Information Visualization
Why should we visualize
Plotly Visualizations
Altair Visualizations
Information Visualization¶
Computer-based visualization systems provide visual representations of datasets designed to help people carry out tasks more effectively.
Three requirements
Users
Data
Tasks
A good visualization enables users to complete tasks effectively on the data.
When not to use vis
Don’t need vis when fully automatic solution exists and is trusted
But many analysis problems are ill-specified.
What vis allows for
Long-term use for end users (e.g., exploratory analysis of scientific data)
Presentation of known results
Stepping stone to better understanding of requirements before developing models
Helps developers of automatic solution refine/debug, determine parameters
Helps end users of automatic solutions verify, build trust
Why depend on vis?
Computer-based visualization systems provide visual representations of datasets designed to help people carry out tasks more effectively.
Human visual system is high-bandwidth channel to brain
Overview possible due to background processing
Subjective experience of seeing everything simultaneously
Significant processing occurs in parallel and pre-attentively
Sound: lower bandwidth and different semantics
Overview not supported
Subjective experience of sequential stream
Touch/haptics: impoverished record/replay capacity
Only very low-bandwidth communication thus far
Taste, smell: no viable record/replay devices
Why show data in detail?
Summaries lose information.
Confirm expected and find unexpected patterns.
Assess validity of statistical model.
Why should we visualize?¶
The purpose of visualization is insight, not pictures
import pandas as pd
import numpy as np
import plotly.graph_objects as go
from plotly.subplots import make_subplots
# Anscombe's quartet data
data = {
"x1": [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5],
"y1": [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68],
"x2": [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5],
"y2": [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74],
"x3": [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5],
"y3": [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73],
"x4": [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
"y4": [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89],
}
def summarize(x, y):
x = np.array(x)
y = np.array(y)
mean_x = np.mean(x)
var_x = np.var(x, ddof=1)
mean_y = np.mean(y)
var_y = np.var(y, ddof=1)
corr = np.corrcoef(x, y)[0, 1]
# Linear regression y = a + b x
b, a = np.polyfit(x, y, 1)
y_hat = a + b * x
r2 = 1 - np.sum((y - y_hat)**2) / np.sum((y - np.mean(y))**2)
return mean_x, var_x, mean_y, var_y, corr, a, b, r2
# Compute stats for each dataset
for i in range(1, 5):
x = data[f"x{i}"]
y = data[f"y{i}"]
mean_x, var_x, mean_y, var_y, corr, a, b, r2 = summarize(x, y)
print(f"Dataset {i}")
print(f"Mean of x:\t\t{mean_x:.2f}")
print(f"Variance of x:\t\t{var_x:.2f}")
print(f"Mean of y:\t\t{mean_y:.2f}")
print(f"Variance of y:\t\t{var_y:.3f}")
print(f"Correlation x,y:\t{corr:.3f}")
print(f"Linear regression:\ty = {a:.2f} + {b:.3f}x")
print(f"R^2:\t\t\t{r2:.2f}")
print("-" * 50)Dataset 1
Mean of x: 9.00
Variance of x: 11.00
Mean of y: 7.50
Variance of y: 4.127
Correlation x,y: 0.816
Linear regression: y = 3.00 + 0.500x
R^2: 0.67
--------------------------------------------------
Dataset 2
Mean of x: 9.00
Variance of x: 11.00
Mean of y: 7.50
Variance of y: 4.128
Correlation x,y: 0.816
Linear regression: y = 3.00 + 0.500x
R^2: 0.67
--------------------------------------------------
Dataset 3
Mean of x: 9.00
Variance of x: 11.00
Mean of y: 7.50
Variance of y: 4.123
Correlation x,y: 0.816
Linear regression: y = 3.00 + 0.500x
R^2: 0.67
--------------------------------------------------
Dataset 4
Mean of x: 9.00
Variance of x: 11.00
Mean of y: 7.50
Variance of y: 4.123
Correlation x,y: 0.817
Linear regression: y = 3.00 + 0.500x
R^2: 0.67
--------------------------------------------------
# Create subplot layout
fig = make_subplots(
rows=2,
cols=2,
subplot_titles=[f"Dataset {i}" for i in range(1, 5)]
)
for i in range(1, 5):
x = data[f"x{i}"]
y = data[f"y{i}"]
# Regression line y = a + b x
b, a = np.polyfit(x, y, 1)
x_line = np.linspace(min(x), max(x), 100)
y_line = a + b * x_line
row = (i - 1) // 2 + 1
col = (i - 1) % 2 + 1
# Scatter points
fig.add_trace(
go.Scatter(
x=x,
y=y,
mode="markers",
name=f"Data {i}",
showlegend=False
),
row=row,
col=col
)
# Regression line
fig.add_trace(
go.Scatter(
x=x_line,
y=y_line,
mode="lines",
name="Regression",
showlegend=False
),
row=row,
col=col
)
fig.update_layout(
title="Anscombe’s Quartet: Same Statistics, Different Distributions",
height=700,
width=900
)
fig.show()
Another instance where this occured: The Datasaurus

Approaches to data analytics:¶
Traditional
Query for known patterns
Display results using traditional techniques
Pros:
Many solutions
Easier to implement
Cons:
Can’t search for the unexpected
Data Mining/ML
Based on statistics
Black box approach
Output outliers and correlations
Human out of the loop
Pros:
Scalable
Cons:
Analysts have to make sense of the results
Makes assumptions on the data
InfoVis
Visual interactive interfaces
Human in the loop
Pros:
Visual bandwidth is enormous
Experts decided what to search for
Identify unknown patterns and errors in the data
Cons:
Scalability can be an issue
In Infovis, we look for insights
Deep understanding
Meaningful
Non-obvious
Actionable
Based on data
An insight is:
Something that the user can learn from the data using the infovis
Which she didn’t know/expect
Also, is useful/needed for her
Moreover, she didn’t know of it
And that she can leverage
Some of the major tools used for visualization:
D3
Vega-lite
Altair
Tableau
Plotly Visualization¶
Plotly is a powerful, open-source data visualization library used to create interactive, publication-quality graphs and dashboards, supporting languages like Python, R, and JavaScript.
import plotly.express as px
import numpy as np
x = np.random.randn(1000)
y = np.random.randn(1000)
color = np.random.permutation(1000)
fig = px.scatter(x,y, color=color)
fig.show()Plotting with plotly (and matplotlib):
Strengths
Designed like MatLab: switching was/is easy
Many rendering backends
Can reproduce just about any plot (with a bit of effort)
Well-tested, standard tool for the last 10 years
Weaknesses
Designed like MatLab
API is imperative and often overly verbose
Slow with large datasets
Can have a steep learning curve with lots of memorization
Statistical Visualization¶
Data in column-oriented format; i.e. rows are samples, columns are features
iris = px.data.iris()
iris.head()Statistical Visualization: Grouping¶
import plotly.graph_objects as go
color_map = {
'setosa': 'blue',
'versicolor': 'green',
'virginica': 'red'
}
fig = go.Figure()
for species, group in iris.groupby('species'):
fig.add_trace(
go.Scatter(
x=group['petal_length'],
y=group['sepal_width'],
mode='markers',
name=species,
marker=dict(
color=color_map[species],
opacity=0.3
)
)
)
fig.update_layout(xaxis_title='Petal Length', yaxis_title='Sepal Width')
fig.show()
Statistical Visualization: Faceting¶
import plotly.graph_objects as go
from plotly.subplots import make_subplots
color_map = dict(zip(
iris.species.unique(),
['blue', 'green', 'red']
))
n_panels = len(color_map)
fig = make_subplots(
rows=1,
cols=n_panels,
shared_xaxes=True,
shared_yaxes=True,
subplot_titles=list(color_map.keys())
)
for i, (species, group) in enumerate(iris.groupby('species'), start=1):
fig.add_trace(
go.Scatter(
x=group['petal_length'],
y=group['sepal_width'],
mode='markers',
name=species,
marker=dict(
color=color_map[species],
opacity=0.3
),
showlegend=False # legend per-panel is redundant
),
row=1,
col=i
)
fig.update_layout(
xaxis_title='petal_length',
yaxis_title='sepal_width',
height=350,
width=n_panels * 350
)
fig.show()
Problem: We’re mixing the what with the how
Declarative Visualization¶
Imperative
Specify How something should be done.
Specification and Execution intertwined.
“Put a red circle here, and a blue circle here”.
Declarative
Specify What should be done.
Separates Specification from Execution.
“Map to a position, and to a color".
Declarative visualization lets you think about data and relationships, rather than incidental details.
Altair Visualization¶

Declarative visualization in Python using the Vega grammar (Visualization Grammar)
Vega is a visualization grammar (a language) and a library for server- and client-side visualizations. A live benchmark showing client-side performance of Vega on differently sized datasets is presented. A Python program that generate experiments used in the article is presented as well.
The “grammar” is far from trivial, but once you get a hang of it, creating graphs can be relatively painless and even enjoyable process. The docs are extremely helpful and definitely worth checking out. There is also a simplified version of the framework called “Vega Lite”
import altair as alt
iris = px.data.iris()
alt.Chart(iris).mark_point().encode(
x='petal_length',
y='sepal_width',
color='species'
)Encodings are flexible:
import altair as alt
iris = px.data.iris()
alt.Chart(iris).mark_point().encode(
x='petal_length',
y='sepal_width',
color='species',
column='species'
)Altair is interactive
import altair as alt
iris = px.data.iris()
alt.Chart(iris).mark_point().encode(
x='petal_length',
y='sepal_width',
color='species'
).interactive()Basics of an Altair Chart
import altair as alt
iris = px.data.iris()
alt.Chart(iris).mark_point().encode(
x='petal_length:Q',
y='sepal_width:Q',
color='species:N'
)Anatomy of an Altair Chart¶
Chart assumes tabular, column-oriented data. It supports pandas dataframes, CSVs, TSVs, JSONs.
alt.Chart(iris)
Chart uses one of the several pre-defined marks:
point
line
bar
area
rect
geoshape
text
circle
square
rule
tick
alt.Chart(iris).mark_xxxxx()
Encoding map visual channels to data columns
Channels are automatically adjusted based on data type (N, O, Q, T)
| Letter | Type | Meaning | Examples |
|---|---|---|---|
| N | Nominal | Categories, no order | "species", "country" |
| O | Ordinal | Ordered categories | "low" < "medium" < "high" |
| Q | Quantitative | Numeric, measurable | height, price, count |
| T | Temporal | Date / time | "2024-01-01", timestamps |
Available channels:
Position (x,y)
Facet (row, column)
Color
Shape
Size
Text
Opacity
Stroke
Fill
Latitude/Longitude
import altair as alt
iris = px.data.iris()
json = alt.Chart(iris).mark_point().encode(
x='petal_length:Q',
y='sepal_width:Q',
color='species:N'
).to_json()Using this JSON, you can view it in: Vega-Studio
import altair as alt
url = "https://gist.githubusercontent.com/armgilles/194bcff35001e7eb53a2a8b441e8b2c6/raw/92200bc0a673d5ce2110aaad4544ed6c4010f687/pokemon.csv"
chart = alt.Chart(url).mark_circle().encode(
x='Attack:Q',
y='Defense:Q',
row='Generation:N',
column='Legendary:N'
)
chart
chart = alt.Chart(url).mark_bar().encode(
y=alt.Y('Generation:N'),
x=alt.X('*:Q', aggregate='count', stack='normalize'),
color=alt.Color('Legendary:N')
)
chart
url = "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-06-25/ufo_sightings.csv"
chart = (
alt.Chart(url)
.mark_circle(size=4, opacity=0.5)
.encode(
x='longitude:Q',
y='latitude:Q',
color='shape:N',
tooltip=[
'date_time:T',
'ufo_shape:N',
'state:N',
'encounter_length:Q'
]
)
)
chart