Grim Stats Through The Prism Of Google Facets
This is just the tip of the iceberg
Recently, a young girl killed herself in the small hill station that I come from. As usual, there was a lot of gossip about how and why she did it, but I doubt that such gossip has ever helped anyone. Suicide is a serious issue, and I believe there should be more conversation about the causes and impulses that push people to take such a drastic step.
The Dataset
Fortunately, Kaggle has a dataset on suicides in India that can be used to look into the suicides and related death rates in India. The dataset is of size 1 MB. After downloading the CSV file, we can load the dataset using the pandas library.
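A minimal sketch of the loading step is given here. To keep it runnable without the download, it reads a tiny in-memory sample instead of the real file. The sample rows are placeholders that only mimic the column layout described in this post, and load_suicide_data is a hypothetical helper name.

```python
import io

import pandas as pd

# Placeholder rows that only mimic the column layout described in this post;
# the values are made up and are not taken from the Kaggle file.
SAMPLE_CSV = """State,Year,Type_Code,Type,Gender,Age_group,Total
State A,2001,Causes,Example cause,Male,15-29,3
State A,2001,Causes,Example cause,Female,15-29,2
State B,2001,Causes,Example cause,Male,30-44,4
"""


def load_suicide_data(source):
    """Read the CSV from a file path or a file-like object into a DataFrame."""
    return pd.read_csv(source)


# With the real download, pass the path of the CSV file instead of the sample.
df = load_suicide_data(io.StringIO(SAMPLE_CSV))
print(df.columns.tolist())
print(df.shape)
```

With the actual dataset, the only change should be to pass the path of the downloaded CSV file to the helper.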
So you can see the headers are State, Year, Type_Code, Type, Gender, Age_group, and Total.
For the sake of simplicity, we will only be looking at the states and the total number of suicides that are happening in those states. So we group by the “State” key and aggregate the “Total” column.
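The grouping step can be sketched roughly as follows. The small dataframe is a made-up stand-in for the real data, and state_totals is just an illustrative variable name.

```python
import pandas as pd

# Made-up stand-in for the loaded dataset; only the columns needed here.
df = pd.DataFrame(
    {
        "State": ["State A", "State A", "State B"],
        "Year": [2001, 2002, 2001],
        "Total": [3, 2, 4],
    }
)

# Keep one row per state, with the Total column summed over all other fields.
state_totals = df.groupby("State", as_index=False)["Total"].sum()
print(state_totals)
```

Passing as_index=False keeps State as a regular column, which is convenient later when the dataframe is turned into records.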
The grouped result also contains some additional entries that we are not interested in at the moment, so we will just drop those rows.
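One possible way to do this is sketched below. It assumes that the unwanted entries are aggregate rows whose State label starts with "Total". The labels in the toy dataframe are an assumption made for illustration and should be checked against the real data.

```python
import pandas as pd

# Toy grouped result. The three "Total ..." labels are assumed, not confirmed.
state_totals = pd.DataFrame(
    {
        "State": ["State A", "State B", "Total (All India)", "Total (States)", "Total (Uts)"],
        "Total": [5, 4, 9, 9, 0],
    }
)

# Assumption: the unwanted rows are the ones whose label starts with "Total".
is_aggregate = state_totals["State"].str.startswith("Total")
states_only = state_totals[~is_aggregate].reset_index(drop=True)
print(len(state_totals), "->", len(states_only))
```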
On the full dataset, the number of rows decreases from 38 to 35.
Google Facets
Google Facets is a visualization tool designed to give better insight into the data in the preliminary phase itself, so that machine learning engineers can build better models from it. Facets offers two ways to visualize the data, Overview and Dive. We will focus on how to use the Dive feature.
To use it in a Jupyter notebook, we first need to install it as a plugin. One caveat is that Facets can only be used in Google Chrome.
So, first we need to clone the repo, and then install the dist package from inside the cloned directory.
git clone https://github.com/PAIR-code/facets.git
cd facets
jupyter nbextension install facets-dist/ --user
The Dive example that is given in the repo shows how to load the data and build the HTML file together. In this post, we will do them separately so that we can see how to handle larger data sets as well.
Now, since we have boiled down the original matrix to two columns, half of our work is already done. What is left is to convert the dataframe to record-based JSON and create the HTML.
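A rough sketch of this step follows. The dataframe is a made-up stand-in for the two-column result. The template is modeled on the Dive example in the Facets repo, and the /nbextensions/facets-dist/ path is an assumption that follows from the nbextension install above.

```python
import pandas as pd

# Made-up stand-in for the two-column dataframe built earlier.
states_only = pd.DataFrame({"State": ["State A", "State B"], "Total": [5, 4]})

# One JSON object per row, which is the record layout that Dive consumes.
jsonstr = states_only.to_json(orient="records")

# Assumed template, modeled on the Dive example in the Facets repo.
HTML_TEMPLATE = """<link rel="import" href="/nbextensions/facets-dist/facets-jupyter.html">
<facets-dive id="elem" height="600"></facets-dive>
<script>
  var data = {jsonstr};
  document.querySelector("#elem").data = data;
</script>"""

html = HTML_TEMPLATE.format(jsonstr=jsonstr)

# Write the page to disk so that it can be loaded separately later.
with open("html_file.html", "w", encoding="utf-8") as f:
    f.write(html)
```

Keeping the JSON conversion and the HTML generation apart from the display step is what lets the same file be reused for larger data sets.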
This should create a file called html_file.html. Now, you cannot view this file by opening it directly in Firefox or Chrome. For that, you need to fire up a Jupyter notebook in Chrome and run the following code in one of the cells.
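The cell could look roughly like this. It is a sketch that assumes html_file.html sits in the notebook's working directory, and show_dive is a hypothetical helper name.

```python
from pathlib import Path

from IPython.display import HTML, display


def show_dive(path="html_file.html"):
    """Read the generated HTML file and wrap it so that Jupyter renders it inline."""
    return HTML(Path(path).read_text(encoding="utf-8"))


# Only try to render when the file from the previous step is present.
if Path("html_file.html").exists():
    display(show_dive())
```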
This should show you the output. On the left-hand side, under Faceting, click on Column-Based Faceting, select Total, and keep bucketing at 10.
You should be able to get the following histogram.
Looking at this you can see that the interesting buckets are West Bengal, Tamil Nadu, Maharashtra and Andhra Pradesh. This can lead us to further questions like why that is the case!
In case you found this interesting and would like to talk more on this, just drop me a message @alt227Joydeep. I would be glad to discuss this further. You can also hit the like button and follow me here, on Medium. You can find the code used in this blog here.