Data Science Workflow for Kaggle
Adapted from: Titanic Data Science Solutions
Note 1: The stages are by no means strictly linear; they are intended to describe the general workflow. Note 2: This workflow makes no distinction between testing and validation.
Workflow Stages
- Question or problem definition
- Acquire training and testing data
- Wrangle, prepare, cleanse the data
- Analyze, identify patterns, and explore the data
- Model, predict and solve the problem
- Visualize, report, and present the problem solving steps and final solution
- Supply or submit the results
General Tools
For Dataframe Handling
- Pandas
- NumPy
- re (regex)
- sklearn (scikit-learn)
For Visualization
- Matplotlib
- Seaborn (imported as sns), built on top of Matplotlib
Preliminaries: Understanding the Purpose of Data Science — the 7Cs
- Classification: categorize samples
- Correlate: find significance between variables
- Converting and Creating: derive (often new) variables in support of the correlation, conversion, or completeness goals
- Correcting and Completing: correcting typos (words/numbers), completing through imputation
- Charting: use the right visualizations
Most of the time we observe first, then work towards a decision, achieving these purposes in a non-linear fashion.
Step 1: Question/Problem Definition
Understand the problem at hand. In general, it requires answering:
Given a training set of samples containing [some known information], can our model determine the [information to be predicted] for a test dataset that does NOT contain that information?
Step 2: Acquiring the Data
import pandas as pd

train_df = pd.read_csv('../path-to-train-csv')
test_df = pd.read_csv('../path-to-test-csv')
Step 3: Preliminary Analysis of Data
Ask yourself:
- What types of data exist?
- How many categorical vs numerical variables?
- Categorical variables could be: nominal, ordinal, interval, or ratio-based
- Numerical variables can be continuous or discrete
- Mixed data? e.g. ticket codes like C03: what does the prefix imply? Possibly seating distance from exit, which might correlate with survival (a Correlate goal)
- What features may contain typos or errors?
Code Implementation
# Check column names
train_df.columns.values

# General features: null counts, dtypes
train_df.info(verbose=True)

# Distributions and general stats
train_df.describe()

# Categorical data statistics
train_df.describe(include=['O'])

# Preview data
train_df.head()  # or .tail(n)
The Pearson correlation is also very important — Matplotlib + Seaborn make this easy:
import matplotlib.pyplot as plt
import seaborn as sns

colormap = plt.cm.RdBu
plt.figure(figsize=(14, 12))
plt.title('Pearson Correlation of Features', y=1.05, size=15)
# Assumes all features have already been converted to numeric values
sns.heatmap(train_df.astype(float).corr(), linewidths=0.1, vmax=1.0,
            square=True, cmap=colormap, linecolor='white', annot=True)
Reference: Introduction to Ensembling/Stacking in Python
Step 4: Write Down Assumptions Based on Preliminary Analysis
Note: We will always return to Step 4 during Step 5.
- Correlating: ask yourself which dependent variable you want to compare features against, and check quick correlations with it early.
- Completing: what features do you want to complete? What do you want to drop?
- Correcting: what features do you want to drop and why? Potential reasons:
- The feature cannot contribute to what we're predicting
- Too incomplete (too many typos or incorrect values)
- Creating: why convert continuous values into ordinal integers (e.g. age, fare)? (A binning sketch follows this list.)
- Easier visualization in groups/bins — no meaningful reason to distinguish a 15-year-old from a 17-year-old in survival analysis
- Optimizes gradient descent
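For example, converting a continuous Age into ordinal integers might look like the following minimal sketch; the five band boundaries are illustrative and it assumes 'Age' has already been completed:

import pandas as pd

# Cut Age into 5 equal-width bands (for inspection), then map to ordinal integers
train_df['AgeBand'] = pd.cut(train_df['Age'], 5)
train_df.loc[train_df['Age'] <= 16, 'Age'] = 0
train_df.loc[(train_df['Age'] > 16) & (train_df['Age'] <= 32), 'Age'] = 1
train_df.loc[(train_df['Age'] > 32) & (train_df['Age'] <= 48), 'Age'] = 2
train_df.loc[(train_df['Age'] > 48) & (train_df['Age'] <= 64), 'Age'] = 3
train_df.loc[train_df['Age'] > 64, 'Age'] = 4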
Step 5: Intermediate Analysis
Strategy 5.1 — Pivot features against each other via tables or graphs:
train_df[['Pclass', 'Survived']].groupby(['Pclass'], as_index=False).mean().sort_values(by='Survived', ascending=False)
Strategy 5.2 — Visualize data, then make decisions:
e.g. (observation) infants (Age <= 4) appear to have a higher survival rate in the plots → (decision) we should complete "Age" in our model
g = sns.FacetGrid(train_df, col='Survived')
g.map(plt.hist, 'Age', bins=20)
Step 6: Data Wrangling
- Extract useful information from alphanumerical values
e.g. "Mr. Jonathan CHAN" → the title "Mr." indicates gender, marital status, and an age range; the name itself is not useful.
- Modify text/numbers
for dataset in combine:
    dataset['Title'] = dataset['Title'].replace(
        ['Lady', 'Countess', 'Capt', 'Col', 'Don', 'Dr', 'Major',
         'Rev', 'Sir', 'Jonkheer', 'Dona'], 'Rare')
- Conduct imputation (for the Completing step)
import numpy as np

# Quick method: fillna with the median
test_df['Fare'] = test_df['Fare'].fillna(test_df['Fare'].dropna().median())

# More accurate: use the median per combination of correlated features,
# e.g. speculate that median age depends on Pclass and Sex
guess_ages = np.zeros((2, 3))  # 2 Sex values x 3 Pclass values
for dataset in combine:        # combine = [train_df, test_df]
    for i in range(0, 2):
        for j in range(0, 3):
            guess_df = dataset[(dataset['Sex'] == i) &
                               (dataset['Pclass'] == j + 1)]['Age'].dropna()
            # Round the median to the nearest 0.5
            guess_ages[i, j] = int(guess_df.median() / 0.5 + 0.5) * 0.5
            # Fill missing ages for this (Sex, Pclass) combination
            dataset.loc[(dataset['Age'].isnull()) &
                        (dataset['Sex'] == i) &
                        (dataset['Pclass'] == j + 1), 'Age'] = guess_ages[i, j]
- Drop redundant features
Rule of thumb: if the Pearson correlation between two features is high, there is likely redundancy; consider combining them or removing one.
e.g. FamilySize = SibSp + Parch + 1
- Create artificial features
Useful for capturing interactions that individual raw features don't express on their own (see the sketch below).
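As an example, a minimal sketch of creating the FamilySize feature mentioned above plus a derived IsAlone flag; IsAlone is an illustrative addition and the code assumes SibSp and Parch columns exist:

for dataset in combine:  # combine = [train_df, test_df]
    # Combine the two family-related columns into a single feature
    dataset['FamilySize'] = dataset['SibSp'] + dataset['Parch'] + 1
    # Derived flag: travelling alone or not
    dataset['IsAlone'] = (dataset['FamilySize'] == 1).astype(int)

train_df[['IsAlone', 'Survived']].groupby(['IsAlone'], as_index=False).mean()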
Step 7: Model, Predict, and Solve
Narrow down estimators based on:
- The nature of the problem (classification vs regression)
- The type of machine learning (supervised vs unsupervised)
Scikit-learn provides a comprehensive algorithm cheatsheet.
Estimators in Scikit-learn
- Linear Models — Linear/Logistic Regression, Ridge, Lasso, ElasticNet, SGDClassifier
- Support Vector Machines — SVC, SVR, LinearSVC, LinearSVR
- Tree-Based Methods — DecisionTree, RandomForest, ExtraTrees, GradientBoosting, XGBoost, HistGradientBoosting
- Nearest Neighbors — KNeighbors, RadiusNeighbors
- Naive Bayes — GaussianNB, MultinomialNB, BernoulliNB, CategoricalNB
- Discriminant Analysis — LinearDA, QuadraticDA
- Ensemble Methods — Bagging, AdaBoost, Stacking, Voting
- Neural Networks (Shallow) — MLPClassifier, MLPRegressor
- Probabilistic Models — GaussianProcessClassifier/Regressor
Code Implementation
Split into train and test:
X_train = train_df.drop("Survived", axis=1)
Y_train = train_df["Survived"]
X_test = test_df.drop("PassengerId", axis=1).copy()
X_train.shape, Y_train.shape, X_test.shape
Model fitting workflow:
- Choose a model
- Choose (hyper)parameters
- Set up cross-validation using KFold and cross_val_score
- Fit X_train and Y_train
- Obtain predictions
- Obtain the prediction score
- Adjust the model via hyperparameter tuning: grid search, Bayesian optimization, etc. (see the sketch after the code below)
from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression()
logreg.fit(X_train, Y_train)
Y_pred = logreg.predict(X_test)
acc_log = round(logreg.score(X_train, Y_train) * 100, 2)
acc_log
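To cover the cross-validation and tuning steps from the list above, a minimal sketch using KFold, cross_val_score, and GridSearchCV; the parameter grid is illustrative:

from sklearn.model_selection import KFold, cross_val_score, GridSearchCV
from sklearn.linear_model import LogisticRegression

# Cross-validated accuracy instead of a single train-set score
kf = KFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = cross_val_score(LogisticRegression(max_iter=1000), X_train, Y_train,
                            cv=kf, scoring='accuracy')
print(cv_scores.mean(), cv_scores.std())

# Hyperparameter tuning via grid search (the grid below is illustrative)
param_grid = {'C': [0.01, 0.1, 1, 10]}
grid = GridSearchCV(LogisticRegression(max_iter=1000), param_grid,
                    cv=kf, scoring='accuracy')
grid.fit(X_train, Y_train)
print(grid.best_params_, grid.best_score_)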
Advanced: Ensemble Methods
There are 3 types of ensemble methods: Bagging, Boosting, and Stacking.
1. Bagging (Bootstrap Aggregating)
- A group of weak learners is combined into a strong learner that is less prone to overfitting and has lower variance.
- Models are trained in parallel.
- Regression: predictions are averaged. Classification: predictions use majority vote.
- Types: instance-based bagging, attribute-based bagging.
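A minimal sketch using scikit-learn's BaggingClassifier, reusing X_train and Y_train from above; the parameters are illustrative:

from sklearn.ensemble import BaggingClassifier

# Instance-based bagging: each estimator (a decision tree by default) is trained
# in parallel on a bootstrap sample of the rows; setting max_features < 1.0 would
# additionally subsample columns (attribute-based bagging).
bagging = BaggingClassifier(n_estimators=100, max_samples=0.8, random_state=42)
bagging.fit(X_train, Y_train)
Y_pred = bagging.predict(X_test)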
2. Boosting
Models are trained sequentially rather than in parallel. Each successive model corrects the errors of the previous one.
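A minimal sketch of sequential boosting using scikit-learn's GradientBoostingClassifier; the parameters are illustrative:

from sklearn.ensemble import GradientBoostingClassifier

# Each successive tree is fit to the errors of the current ensemble
gbc = GradientBoostingClassifier(n_estimators=200, learning_rate=0.05,
                                 max_depth=3, random_state=42)
gbc.fit(X_train, Y_train)
Y_pred = gbc.predict(X_test)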
3. Stacking
- Combines different types of models, each trained separately on the same data.
- Individual predictions from each base model serve as inputs to a meta-model, which learns to weigh and combine base model outputs as if they were data instances.
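A minimal sketch using scikit-learn's StackingClassifier, with a logistic regression meta-model on top of two illustrative base models:

from sklearn.ensemble import StackingClassifier, RandomForestClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

base_models = [('rf', RandomForestClassifier(n_estimators=100, random_state=42)),
               ('svc', SVC(probability=True, random_state=42))]
# The meta-model learns how to weigh and combine the base models' predictions
stack = StackingClassifier(estimators=base_models,
                           final_estimator=LogisticRegression(max_iter=1000),
                           cv=5)
stack.fit(X_train, Y_train)
Y_pred = stack.predict(X_test)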
Reference: Introduction to Ensembling/Stacking in Python