The following post might only be of interest to you if you want to follow my progress in learning to code, or if you are an avid user of the tech news community Hacker News. Please also note that I cannot guarantee the accuracy of the data shown here, even though after thorough double-checking I believe it is quite accurate. But don’t bet all your money on it.
As mentioned two months ago in this post, in my quest to teach myself programming with Python, I discovered the Hacker News API as an ideal way to learn about accessing APIs and to take first steps in data analysis and visualization. The API has a simple structure and doesn’t require authentication (although I subsequently managed to conquer the Reddit API as well, which is more complex and requires authorization via OAuth).
Something I have been curious about for a while is the dynamics with which articles submitted by Hacker News users hit the front page of the site. So I went ahead and indulged in a little project to find out.
First, I wrote a Python script (available here) which checks the Hacker News front page, i.e. the top 30 items, via the API once a minute and captures every new item appearing there. The script then stores the metadata of that item (title, username, score, time, etc.) in a .csv file (available here), with each new item being appended to the end of the file.
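The basic idea can be sketched roughly like this, using the public Hacker News Firebase endpoints. This is a minimal illustration, not my actual script: the helper names, the exact set of metadata fields and the CSV layout here are my own assumptions for the sake of the example.

```python
import csv
import json
import time
from urllib.request import urlopen

# Public Hacker News API endpoints (Firebase-hosted).
TOP_URL = "https://hacker-news.firebaseio.com/v0/topstories.json"
ITEM_URL = "https://hacker-news.firebaseio.com/v0/item/{}.json"


def fetch_json(url):
    """Fetch and decode a JSON document from the given URL."""
    with urlopen(url) as resp:
        return json.load(resp)


def new_front_page_ids(current_top, seen):
    """Return IDs on the current front page that haven't been captured yet."""
    return [item_id for item_id in current_top if item_id not in seen]


def snapshot_row(item):
    """Pick the metadata fields worth keeping for one captured item."""
    return [item.get("id"), item.get("title"), item.get("by"),
            item.get("score"), item.get("time")]


def monitor(csv_path):
    """Poll the top 30 stories once a minute; append new ones to a CSV."""
    seen = set()
    while True:
        top30 = fetch_json(TOP_URL)[:30]
        for item_id in new_front_page_ids(top30, seen):
            item = fetch_json(ITEM_URL.format(item_id))
            with open(csv_path, "a", newline="") as f:
                csv.writer(f).writerow(snapshot_row(item))
            seen.add(item_id)
        time.sleep(60)
```

Keeping a set of already-seen IDs is what turns the once-a-minute poll into a stream of only the genuinely new front-page arrivals.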
I ran the script using the fantastic service Python Anywhere, which offers free accounts for Python programs that don’t require too many resources. I would have paid if necessary, but my script didn’t exceed the limits. I could have run the script locally on my computer as well, but by outsourcing it to the cloud, I can put my notebook into standby mode when not using it. The script was occasionally interrupted for no apparent reason, possibly due to some access limitations of the Hacker News API or due to a weakness in my code. But a simple restart continued the monitoring and capturing of new entries to the Hacker News front page. Through this process, I gathered “snapshots” of 772 items that made it to the Hacker News front page over the past 2 weeks, excluding a few captured items manually posted to the front page by the Hacker News/Y Combinator staff and some items which appeared to be erroneous. Regarding the few items that appeared on the front page while they already had a score of more than 20 (see first chart), it is likely that they had already been at least briefly present on the front page before.
To verify that the script is running properly, I have it print a status message every minute and highlight whenever a new item on the Hacker News front page has been saved to the .csv file.
In parallel, I created a second Python script for analyzing and visualizing the data. I worked my way through various tutorials for Matplotlib, the Python library that creates the plots (or “charts”, or whatever you want to call them), read a lot of posts on Stack Overflow, did plenty of web searches and simply followed a trial-and-error approach.
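A typical step in such an analysis script might look like the following sketch: one helper computes how long an item took to reach the front page, and another plots a simple Matplotlib histogram of the scores at the moment of capture. The function names and the output file name are illustrative assumptions, not taken from my actual script.

```python
def hours_to_front_page(submitted_ts, captured_ts):
    """Hours between submission and first appearance on the front page,
    given two Unix timestamps (as stored in the .csv file)."""
    return (captured_ts - submitted_ts) / 3600.0


def plot_score_histogram(scores, out_path="scores.png"):
    """Plot a histogram of item scores at the moment they hit the front page."""
    import matplotlib                    # imported lazily so the pure data
    matplotlib.use("Agg")                # helper above works without Matplotlib
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.hist(scores, bins=range(0, max(scores) + 2))
    ax.set_xlabel("Score on reaching the front page")
    ax.set_ylabel("Number of items")
    fig.savefig(out_path)
```

Working from the two timestamps in the CSV is what makes observations like “some items reach the front page only many hours after submission” directly measurable.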
The results are not very revolutionary, of course. While I am eager to acquire more skills in data analysis and visualization, the rather trivial plots below are still only first steps. I am not even sure whether the insights are in any way informative for Hacker News users. However, they revealed some things to me that I wasn’t aware of, such as that many items appear on the front page after only 2 upvotes, and that the front page is NOT dominated by a few power contributors who account for the lion’s share of the items showing up there.
The Hacker News FAQ mentions that the criteria for ending up on the front page include not only upvotes relative to the time since the story was submitted, but also other factors, including moderator intervention. This becomes evident in the plots below, as some items appear on the front page only many hours after they’ve been posted, while the majority gets there quite swiftly.
If you have ideas for further analysis of the Hacker News front page, please let me know in the comments. If you want to have a look at the script for analyzing and plotting the data, you can find it here. Please note that due to my beginner status, it is highly likely that I use rather inefficient and convoluted code, ignore recommended libraries (maybe I ought to have used Pandas or NumPy, for example) or rely on unnecessary hard-coding. I am happy about any suggestions for improved, more efficient or shorter code.
And here is a tag cloud based on the words from all 772 titles. Google was mentioned in 28 titles, Facebook in only 5.
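The word frequencies behind a tag cloud like this can be computed with a few lines of standard-library Python. This is a generic sketch under my own assumptions (case-insensitive matching, words defined as runs of letters and apostrophes), not the code used for the cloud above:

```python
import re
from collections import Counter


def title_word_counts(titles):
    """Count case-insensitive word occurrences across a list of titles."""
    words = []
    for title in titles:
        # Treat runs of letters/apostrophes as words; ignore digits and punctuation.
        words.extend(re.findall(r"[A-Za-z']+", title.lower()))
    return Counter(words)


titles = ["Google announces X", "Why Google beats Facebook", "Facebook redesign"]
counts = title_word_counts(titles)
# counts["google"] == 2, counts["facebook"] == 2
```

A `Counter` also makes it trivial to pull out the most frequent words with `counts.most_common(n)`, which is exactly what a tag cloud needs as input.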
And what’s next on my plate? Aside from trying to refine the code for this analysis, I am looking into Google’s QPX Express API for flight pricing and routing data… As someone who can passionately procrastinate with flight and travel hacking, this seems like the perfect endeavor for me.