What patterns can we find by looking at the frequency of words in Reddit comments
The above cryptocurrency price chart shows the result of an analysis which plots the frequency of Bitcoin Reddit mentions agains the USD price of Bitcoin over time.
- Is there a correlation between mentions of Bitcoin on Reddit and the price of Bitcoin?
- How strong is the correlation?
- Can we apply this to other coins? For example, is there a correlation between the price of Ethereum and the volume of mentions?
- And perhaps most importantly of all, can we use Bitcoin mentions to predict the future price of Bitcoin and what does the data tell us?
A quick side-note before we begin, ChartStar offers a free range of advanced cryptocurrency trading tools, orderbook heatmaps and custom built price charts, so head on over to the main sitefree to sign up!
Counting Bitcoin mentions
The Reddit API provides free, open access to all kinds of user submitted data, including user comments on topics. We’ll start by querying all comments ever made on Reddit that contain the word “bitcoin”. My initial thought was that this could require some serious number crunching, as in 2019 alone over 1.9 billion comments were made in total across all of Reddit. Wowser.
My trusty 7 year old iMac anxiously broke out into a heavy sweat as I feverishly cooked up a script to pull down the data I needed from the Reddit API. Given my measly bandwidth and sheer amount of data to process this didn’t seem right. There just had to be a smarter way.
Google BigQuery, the most badass database ever
Google BigQuery is a cloud-hosted database which doesn’t have the limitations of an ordinary database, is as scalable as your budget allows (in my case not much but still) and still allows powerful SQl-like queries to be performed. Just as crucially, in the Google Cloud platform they have a free, publicly available dataset of all the Reddit comments ever made pre-loaded into into a Big Query table. Yay!
It wasn’t complete plain sailing from here. The main problem I encountered was how to query the comment data down to granular chunks by month, as there were only raw unix timestamps per comment. Aggregating the data across all time into month chunks posed a problem and just thinking about it was giving me flashbacks to the night before my college math exams
The magic query and a warning
So, just in case anyone else at home wants to play along, here is the magic Google BigQuery query that saves the day:
Plug this into the BigQuery console, hit run and boom you’re done.
One warning though, every Google Cloud account has a free quota of BigQuery allocated to them but once you exceed that you will get billed, and just running the single query above once will push you into the red as it analyzes over a terabyte of data. Note – to any particularly enthusiastic number crunchers who are about to race off with this query, towards the end I find a cheaper (much slower) way to get similar data. Hindsight is 20/20!
Now the fun begins. Bitcoin price charted against Reddit Mentions
Now lets plot the frequency of each mention per month against the dollar price of Bitcoin. I pull the cryptocurrency price data for this analysis from the rather excellent, free CryptoCompare API:
Bitcoin price vs reddit mentions
I just want to point out:
This was the very first attempt at plugging this into the charting software, no adjustments were made. That’s quite an impressive visual correlation.
The monthly price data being plotted is from the mid point of each month
The BigQuery dataset for Reddit comments only goes up to 2019-09 hence why the analysis stops there
I want to see if we go down to a more granular timeframe for Bitcoin prices what happens to the graph and if we can see a predictive effect.
Bitcoin Reddit comment frequency is somewhat predictive
Bitcoin price vs reddit mentions
As we can see with the above weekly Bitcoin price chart, several significant price movements are precipated by a noticeable change in Reddit comment frequency:
Bitcoin price vs reddit mentions
Interestingly, the second circled area shows a spike in comment frequency occured shortly before a decrease in the price. Note the x-axis is of the form year-week.
Mention analysis of other coins
The natural next step is to apply the methodology to other cryptocurrencies. It makes sense to focus on cryptos which have a lot of mentions, so taking a look at the top ranked coins by market capitalization from coinmarketcap we get a few choices:
So all we need to do is plug these coins into the magic query and out comes the good stuff.
Except wrong, it’s not that simple.
The obvious problem soon becomes, well… obvious. The uniqueness of the word used as the name of the coin is crucial for mention tracking to be effective. We can’t track Ripple, as there could be people using the word in a context other than cryptocurrency – “the ripple of the ocean”. Similarly for Tether. There are also some other issues, which I won’t go into as this article is already becoming super long. I settled on analysing the following:
Ethereum price analysis
ether price vs reddit mentions
Litecoin price analysis
litecoin price vs reddit mentions
Bonus content – sentiment analysis
This graph shows the frequency of the word “cryptocurrency” vs Bitcoin’s price:
bitcoin price vs reddit mentions
And as a rough form of sentiment analysis, this one focuses solely on mentions of the word bullish within r/bitcoin
Pushshift.io as a cheaper, slower alternative
So it turned out there’s a way to do this for free?
So I found out later on that pushshift.io have an amazing source of Reddit data which can be searched for free via their API, including all comments. This could be used to get more up-to-date comment data up until Feb 2020, as the BigQuery data ends around 2019-09. It will be slower though, as you would have to loop through the comments 1,000 at a time and there’s billions of them! If there is demand I can put the data I have collated up onto some public Gogle Sheets.
So there we have it, a first look at how comment frequency on Reddit matches up surprisingly well with cryptocurrency prices. Let me know on Twitter if there are other currencies, phrases, subreddits or communities you want me to analyze. As for future directions, more in-depth sentiment analysis could be performed, correlations with on-chain data could be examined, machine learning could be applied to the dataset as well as more advanced SQL queries. I hope you enjoyed the content!
I don’t know if you can tell but this one took a looong time for me to put together and research. If you appreciated it subscribing to the blog or retweeting on Twitter would be a massive help. Thanks!
p.s. I have zero Twitter followers at the time of writing so please show me some love!
Analyzing the correlation between Bitcoin Reddit mentions and the price of Bitcoinhttps://t.co/nGtqggmygN
— ChartStarHQ (@ChartStarHQ) February 9, 2020
Author: Sam Smith
Source: Chart Stara: Analyzing The Correlation Between Bitcoin Reddit Mentions And Price