Titanic Dataset

ML Titanic dataset

This was a beginner project based on a dataset I found on Kaggle, which I used to practise my data science and machine learning skills in Python. The libraries used were SciPy and scikit-learn.

Process

This section summarises my thought process while attempting this project. As I was still relatively new to the libraries, I had to check the documentation for specific methods multiple times. I may not always have used the most efficient way to perform some operations, and I apologise for that.

Figure 2: Screenshot of data summarised in pandas

Data Science/Analysis

I encountered a few problems while trying to analyse the dataset, because I did not yet know how to properly visualise the correlation between particular features. I also did not yet know the 'right' way to handle missing data in a dataset. In the end, I decided to fill the gaps in my data with the median value. In addition, I picked out some features that I felt had a strong correlation with passenger survival.
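The two steps described above (filling gaps with the median and checking how features correlate with survival) can be sketched roughly as follows. This is only a minimal illustration, not the code from the project: the rows are made up, the column names are assumed to match the Kaggle `train.csv` file, and Pearson correlation via `DataFrame.corr` is just one possible way to measure the relationship.

```python
import pandas as pd

# Made-up sample rows; the column names are assumed to match Kaggle's train.csv.
nan = float("nan")
df = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1, 0],
    "Pclass":   [3, 1, 3, 3, 1, 2],
    "Age":      [22.0, 38.0, nan, 35.0, nan, 54.0],
    "Fare":     [7.25, 71.28, 7.92, 8.05, 53.10, 13.00],
})

# Fill the missing ages with the median of the ages that are present.
age_median = df["Age"].median()
df["Age"] = df["Age"].fillna(age_median)

# Pearson correlation of every numeric feature with the target column.
correlations = df.corr()["Survived"].drop("Survived").sort_values()
print(correlations)
```

A heatmap of `df.corr()` (for example with seaborn) would be one way to view the same numbers visually.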

Figure 3: Screenshot of the class code for the ML models

Machine Learning

I decided to use a Python class to partly automate the process of trying out different models from scikit-learn. These models include random forest classifiers, logistic regression and k-nearest neighbours. The class splits the training dataset into train, test and validation datasets. Using these datasets, I tried out different combinations of models and hyperparameters to get the best results.
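A class along these lines might look like the sketch below. It is not the class shown in the screenshot: the name `ModelTester`, the 60/20/20 split, the synthetic data and the hyperparameter values are all assumptions made for illustration, and `RandomForestClassifier`, `LogisticRegression` and `KNeighborsClassifier` are assumed to be the scikit-learn estimators for the three model families mentioned.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier


class ModelTester:
    """Hypothetical helper: split the data once, then score any model on it."""

    def __init__(self, X, y, seed=0):
        # Hold out 20% as the test set, then take 25% of the remainder for
        # validation, which gives a 60/20/20 train/validation/test split.
        X_rest, self.X_test, y_rest, self.y_test = train_test_split(
            X, y, test_size=0.2, random_state=seed, stratify=y)
        self.X_train, self.X_val, self.y_train, self.y_val = train_test_split(
            X_rest, y_rest, test_size=0.25, random_state=seed, stratify=y_rest)

    def evaluate(self, model):
        """Fit the model on the training split and return validation accuracy."""
        model.fit(self.X_train, self.y_train)
        return model.score(self.X_val, self.y_val)


# Synthetic stand-in data; the real project used the Titanic features instead.
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
tester = ModelTester(X, y)

# Candidate models with illustrative (not tuned) hyperparameters.
candidates = {
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "logistic_regression": LogisticRegression(max_iter=1000),
    "knn_5": KNeighborsClassifier(n_neighbors=5),
}
scores = {name: tester.evaluate(model) for name, model in candidates.items()}

# Pick the best model on validation data, then check it once on the test split.
best_name = max(scores, key=scores.get)
test_accuracy = candidates[best_name].score(tester.X_test, tester.y_test)
print(scores, best_name, test_accuracy)
```

Comparing models on the validation split and touching the test split only once at the end keeps the final accuracy estimate from being biased by the model selection itself.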

Conclusion

All in all, I learnt a lot about machine learning and data science from this project. With the knowledge I gained, I am now attempting another dataset on Kaggle, this time with much more data and many more features, so you can expect that project to be up here soon.

Figure 4: Screenshot of my position on the Kaggle leaderboard

What I learnt

Through this project, I learnt how to visualise data using the libraries seaborn and matplotlib to get a better understanding of trends in the data. I also learnt how to manipulate and work with data using the libraries NumPy and pandas. Lastly, I learnt how to build, train and validate models with scikit-learn.

Figure 5: Screenshot of the accuracy of my submissions to Kaggle

Finished product

In the end, I achieved 79% accuracy on my predictions, which was much better than I expected. This project has given me a much better appreciation of the intricacies of dealing with data and of trying to predict the future using that data. I am now even more interested in data science and artificial intelligence, and will probably continue to try out more things in this field.

Chuan Hao