1. Preliminary Data Analysis
Contents
1. Preliminary Data Analysis#
Performing a preliminary data analysis is crucial to gaining a better understanding of the data and making necessary cleaning adjustments. During this step, we will explore the following questions:
Does the packname follow a consistent pattern?
What is the distribution of hold times?
Is there a clear relationship between difficulty rating and piece counts?
We’ll be taking a look at the data sets member_hold_times_and_packs.tsv and packs.tsv.
# Importing Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")
1.1 Members Data#
memberID
- Unique identifiers for each personholdtime
- number of days the person had the packpuzzlepack
- name of the pack (may have more than 1 puzzle per pack)
member_holdtime_df = pd.read_csv('data/member_hold_times_and_packs_Feb16_2023.tsv', sep='\t', header=None)
member_holdtime_df.columns = ['memberID', 'holdtime', 'puzzlepack']
member_holdtime_df.shape
(19733, 3)
## Does packname have a pattern?
member_holdtime_df.head()
memberID | holdtime | puzzlepack | |
---|---|---|---|
0 | member1 | 2.939411 | Artifact Puzzles Justin Hillgrove Word Travels South Artifact Puzzles Justin Hillgrove Word Travels West |
1 | member1 | 0.998885 | DaVici Puzzles Full Moon Feast DaVici Puzzles World's Greatest Miracle |
2 | member1 | 10.865032 | DaVici Puzzles Flying Frigate DaVici Puzzles Hobby Horse |
3 | member1 | 22.083971 | Liberty Puzzles Haeckel Hummingbirds Nautilus Puzzles Mother Adams |
4 | member1 | 5.077603 | DaVici Puzzles Diana Zimens City Of Cats |
Comments:
Observe the extra space at the end of “cats “. This shows it has this pack has only one puzzle.
other strings have the word “Puzzles” twice while ‘Diana Zimens City Of Cats ‘ has it only once.
Looks like the first word is a brand or something, then the word “puzzle” and the name of puzzle. For example -
DaVici, Puzzles Full Moon Feast DaVici Puzzles World’s Greatest Miracle
DaVici, Puzzles Flying Frigate DaVici Puzzles Hobby Horse
Liberty, Puzzles Haeckel Hummingbirds Nautilus Puzzles Mother Adams
DaVici, Puzzles Diana Zimens City Of Cats
## What is the distribution for holdtimes?
plt.figure(figsize = (10,5))
plt.title("Figure 1. Distribution of Hold Times", weight='heavy',y =1.1, x=0.16).set_fontsize('18')
z = sns.histplot(data=member_holdtime_df, x="holdtime", color="#F9C762", alpha = 1.0, bins = 75)
sns.despine(left=True, bottom=False)
z.spines['bottom'].set_color('#DDDDDD')
z.yaxis.grid(True, color='#EEEEEE')
z.xaxis.grid(False)
plt.xlabel("Hold Time", size=17, weight=500)
plt.ylabel("Frequency", size=17, weight=500)
plt.tight_layout()
# plt.savefig("_static/images/hold_times_dist.png", format="png", dpi=1200)
plt.show()
fig, axs = plt.subplots(1, 2, figsize=(15, 5))
sns.histplot(data=member_holdtime_df, x=member_holdtime_df.loc[member_holdtime_df['holdtime'] <= 150]['holdtime'], bins = 20,color="#ffa07a", alpha = 1.0, ax=axs[0])
axs[0].set_title("Holdtime up to 5 months")
sns.histplot(data=member_holdtime_df, x=member_holdtime_df.loc[member_holdtime_df['holdtime'] > 151]['holdtime'], bins = 20, color="#FB607F", alpha = 0.9, ax=axs[1])
axs[1].set_title("Holdtime over 5 months")
fig.tight_layout()
# plt.savefig("_static/images/hold_times_dist2.png", format="png", dpi=1200)
plt.show()
# Distribution of member hold times value counts
plt.figure(figsize = (10,5))
plt.title("Figure 2. Puzzle Packs per Member", weight='heavy',y =1.1, x=0.16).set_fontsize('18')
z=sns.histplot(data=member_holdtime_df['memberID'].value_counts().reset_index(name='freq'), x='freq', bins = 20, color="#FF3D4D", alpha = 0.9)
sns.despine(left=True, bottom=False)
z.spines['bottom'].set_color('#DDDDDD')
z.yaxis.grid(True, color='#EEEEEE')
z.xaxis.grid(False)
plt.xlabel("Count of Members", size=17, weight=500)
plt.ylabel("Count of Puzzle Packs", size=17, weight=500)
plt.tight_layout()
# plt.savefig("_static/images/members_dist2.png", format="png", dpi=1200)
plt.show()
1.2 Packs Data#
pack_name
- name of the pack (may have more than 1 puzzle per pack)brand
- brand name of puzzle packpiece_count
- number of pieces in puzzledifficulty_rating
- difficulty rating for puzzle
packs_df = pd.read_csv('data/packs_Feb16_2023.tsv', sep='\t', header=None)
packs_df.columns = ['pack_name', 'brand', 'piece_count', 'difficulty_rating']
packs_df.shape
(920, 4)
pack_name | brand | piece_count | difficulty_rating | |
---|---|---|---|---|
0 | Anne Belle Thompson The Mikado Anne Belle Thompson Two Kimonos | Other-Hand-cut | 387,242 | A-Easy,A-Easy |
1 | Anthology Puzzles Alphonse Mucha La Plume Anthology Puzzles Fun at the Fair | NaN | 320,160 | Average,Average |
2 | Anthology Puzzles Colorful Cat Anthology Puzzles Life On The Reef | NaN | 150,170 | Really-Hard,A-Easy |
3 | Anthology Puzzles Framed American Gothic Anthology Puzzles Haeckel Floral | NaN | 320,300 | Average,Average |
4 | Anthology Puzzles Over The Moon Anthology Puzzles Happy House Plants | NaN | 278,177 | Average,A-Easy |
Separate columns into piece_count_1
, piece_count_2
, difficulty_rating_1
, difficulty_rating_2
# making two coulmns for piece count
packs_df['piece_count_1'] = packs_df['piece_count'].str.split(',', expand=True)[0]
packs_df['piece_count_2'] = packs_df['piece_count'].str.split(',', expand=True)[1]
# making two columns for difficulty
packs_df['difficulty_rating_1'] = packs_df['difficulty_rating'].str.split(',', expand=True)[0]
packs_df['difficulty_rating_2'] = packs_df['difficulty_rating'].str.split(',', expand=True)[1]
# creating temporary dataframe with no missing values for analysis
packs_df_temp_1 = packs_df.dropna(axis=0, subset = ['piece_count_1', 'difficulty_rating_1'])
packs_df_temp_2 = packs_df.dropna(axis=0, subset = ['piece_count_2', 'difficulty_rating_2'])
packs_df_temp_1.piece_count_1 = packs_df_temp_1.piece_count_1.astype('int')
packs_df_temp_2.piece_count_2 = packs_df_temp_2.piece_count_2.astype('int')
Count | |
---|---|
Really-Hard | 79 |
Hard | 329 |
A-Easy | 493 |
Average | 931 |
# checking category-wise distribution of piece_count
plt.figure(figsize = [12,10], dpi=100)
color_dict = {'A-Easy':"#ff7f80", 'Average':"#e83e8c", 'Hard':"#ff9867", 'Really-Hard': "#ffb14e"}
plt.subplot(2,1,1)
sns.boxplot(x = "piece_count_1", y= 'difficulty_rating_1',
data=packs_df_temp_1, palette = color_dict, order = ['A-Easy', 'Average', 'Hard', 'Really-Hard'])
plt.title("Puzzle one distribution")
plt.subplot(2,1,2)
sns.boxplot(x = "piece_count_2", y= 'difficulty_rating_2',
data=packs_df_temp_2, palette = color_dict, order = ['A-Easy', 'Average', 'Hard', 'Really-Hard'])
plt.title("Puzzle two distribution")
plt.legend(loc='lower right')
plt.tight_layout()
plt.show()
packs_df.to_csv('data/packs_df.csv', index=False)
member_holdtime_df.to_csv('data/member_holdtime_df.csv', index=False)