Developing features for the Titanic Problem

Hi. I'm a beginner in Machine Learning and I am working through Kaggle's Titanic Competition. My first submission only had a 0.68 accuracy score. I suspect that I didn't give the model the features it needed, even though the data show strong correlations with survival.

I found that passengers who were younger, male, richer (first class), and without large families were more likely to survive, while passengers who were poorer, older, female, or travelling with large families were least likely to survive.
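One way to double-check observations like these is to compare the mean of Survived within each group. The snippet below is only a minimal sketch on a few made-up rows (not the real Titanic data); it assumes the standard Kaggle column names Sex, Pclass and Survived, and the variable names are illustrative.

```python
import pandas as pd

# Made-up rows for illustration only; NOT the real Titanic data.
toy = pd.DataFrame({
    "Sex":      ["male", "female", "female", "male", "male", "female"],
    "Pclass":   [3, 1, 3, 1, 3, 2],
    "Survived": [0, 1, 1, 1, 0, 0],
})

# The mean of a 0/1 column is the fraction of survivors in each group.
rate_by_sex = toy.groupby("Sex")["Survived"].mean()
rate_by_class = toy.groupby("Pclass")["Survived"].mean()

print(rate_by_sex)
print(rate_by_class)
```

Running the same two groupby lines on the real train.csv should show which groups actually had the higher survival rates.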

How do I develop a feature based on this data? I have never developed features before. I'm more interested in the process behind developing features than in ready-made chunks of code that I can always look up.

Here is my code:

import pandas as pd

from sklearn.tree import DecisionTreeClassifier

from sklearn.model_selection import train_test_split

"""Assigning the train & test datasets' paths to variables
(raw strings, so the backslashes are not read as escape sequences)"""

train_path = r"C:\Users\Omar\Downloads\Titanic Data\train.csv"

test_path = r"C:\Users\Omar\Downloads\Titanic Data\test.csv"

"""Using pandas' read_csv() function to read the datasets

and then assigning them to their own variables"""

train_data = pd.read_csv(train_path)

test_data = pd.read_csv(test_path)

"""Representing genders with an explicit mapping (male = 0, female = 1)
so that train and test are guaranteed to use the same codes"""

train_data['Sex'] = train_data['Sex'].map({'male': 0, 'female': 1})

test_data['Sex'] = test_data['Sex'].map({'male': 0, 'female': 1})

"""Replacing missing values in the training and test dataset with 0"""

train_data.fillna(0.0, inplace = True)

test_data.fillna(0.0, inplace = True)

"""Selecting features for training"""

columns_of_interest = ['Pclass', 'Sex', 'Age']

"""Dropping rows with missing/NaN values (fillna() above has already filled
them all, so no rows are actually removed here)"""

filtered_titanic_data = train_data.dropna(axis=0)

"""Using the predictor features as the model input (x)"""

x = filtered_titanic_data[columns_of_interest]

"""Survival (what we're trying to predict) is the target (y)"""

y = filtered_titanic_data.Survived

"""Splitting the training data into training and validation sets"""

train_x, val_x, train_y, val_y = train_test_split(x, y, random_state=0)

"""Assigning the DecisionTreeClassifier model to a variable"""

titanic_model = DecisionTreeClassifier()

"""Fitting the model to the training x and y values"""

titanic_model.fit(train_x, train_y)

"""Predicting survival for the validation set"""

val_predictions = titanic_model.predict(val_x)

"""Assigning the feature columns from the test to a variable"""

test_x = test_data[columns_of_interest]

"""Predicting survival for the test set by feeding its features into the model"""

test_predictions = titanic_model.predict(test_x)

submission = pd.DataFrame({ 'PassengerId': test_data.PassengerId.values, 'Survived': test_predictions })

submission.to_csv("my_submission.csv", index=False)
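To make the question more concrete, here is a minimal sketch of what "developing features" from the observations above might look like. It assumes the standard Kaggle column names (Name, Age, SibSp, Parch) and runs on a few made-up rows rather than the real data. The helper add_features, the derived column names, and the age cut-off of 16 are all hypothetical choices for illustration, not an established recipe.

```python
import pandas as pd

# Made-up rows for illustration only; column names follow the Kaggle Titanic CSV.
toy = pd.DataFrame({
    "Name":  ["Doe, Mr. John", "Doe, Mrs. Jane", "Roe, Miss. Ann", "Poe, Master. Tim"],
    "Age":   [40.0, 38.0, None, 4.0],
    "SibSp": [1, 1, 0, 3],
    "Parch": [0, 0, 0, 2],
})

def add_features(df):
    """Return a copy of df with a few derived columns (hypothetical helper)."""
    out = df.copy()
    # Family size: siblings/spouses + parents/children + the passenger.
    out["FamilySize"] = out["SibSp"] + out["Parch"] + 1
    out["IsAlone"] = (out["FamilySize"] == 1).astype(int)
    # Title: the text between the comma and the first period in Name.
    out["Title"] = out["Name"].str.extract(r",\s*([^.]+)\.", expand=False).str.strip()
    # Fill missing ages with the median instead of 0, then flag children
    # (the cut-off of 16 is an arbitrary assumption).
    out["Age"] = out["Age"].fillna(out["Age"].median())
    out["IsChild"] = (out["Age"] < 16).astype(int)
    return out

features = add_features(toy)
print(features[["FamilySize", "IsAlone", "Title", "Age", "IsChild"]])
```

Each derived column encodes one of the observations (family size, age group) in a form a tree can split on directly. On the real data, Title would still need to be converted to numbers before fitting, and the median age should be computed on the training set and reused for the test set.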

onur-oz

posted on 14 Oct 18
