Our mission at Redsift is to make it easy for everyone to extract meaning from their data. Big data and machine learning technologies have been around for quite some time, but very few companies, and even fewer individuals, directly benefit from them in their daily lives. Getting access to one’s data lake and setting up the infrastructure necessary to use these technologies is not a trivial engineering task, especially if it all has to run on live data. As a result, most of the work available today consists of scientific and experimental publications on static datasets.
We want that to change. We are creating a platform where professional and aspiring data scientists can experiment on their own dynamic datasets and create services that can simplify people’s lives and humanise our interaction with data.
A couple of weeks ago, we came across an interesting post from Andrey Kurenkov about his experiments using a neural network to organise his emails with 94% accuracy.
Wouldn’t it be great to have an agent that could learn from the manual classifications that we apply to our emails and automatically suggest a classification for new emails? — let’s call it the Classifier Sift.
Andrey’s post inspired us to explore machine learning classification techniques to build a Sift that attempts to improve on the only automatic classification approach currently available in email clients: keyword-based rules. We gave ourselves 4 days for the initial experiments, and we discuss the results below.
Getting access to the data
Redsift makes it very easy to get access to live email data. Our platform includes an IMAP client that connects to your email account using secure OAuth authentication and exposes the email contents in the developer-friendly JMAP format. Getting access to an entire mailbox, and to new emails in real time, is just a sign-in away. Voilà: we now had 7,397 emails (plus any new arrivals) to experiment with, which we split into a training set of 5,552 and a verification set of 1,845. We also chose a varied set of labels with distinct characteristics and volumes, to try to have a representative dataset:
- Property (1.3%) and Tax & Accounting (3.1%): newsletters and automated alerts from various providers.
- Mistakes (1.8%): random emails addressed to the user by mistake.
- INTJ (4.5%) and ICE (89.3%): mailing list discussions.
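For illustration, the train/verification partition can be sketched in a few lines of Node.js. This is a generic sketch, not our pipeline’s actual split logic; the email record shape is hypothetical:

```javascript
// Split a mailbox into training and verification sets.
// A Fisher–Yates shuffle first, so the split is not biased by mailbox order.
function trainTestSplit(emails, trainFraction = 0.75) {
  const shuffled = [...emails];
  for (let i = shuffled.length - 1; i > 0; i--) {
    const j = Math.floor(Math.random() * (i + 1));
    [shuffled[i], shuffled[j]] = [shuffled[j], shuffled[i]];
  }
  const cut = Math.round(shuffled.length * trainFraction);
  return { train: shuffled.slice(0, cut), verify: shuffled.slice(cut) };
}
```

A roughly 75/25 split like this yields set sizes close to the 5,552/1,845 partition we used.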
Choosing the right tool for the job
We built Redsift with exactly that in mind. The core processing engine of the Redsift platform is a JSON-first Directed Acyclic Graph (DAG): data flows through the graph, and computation and aggregation are performed at each node. Our container-based architecture, combined with our DAG implementation, makes it possible for each node of the graph to be written in a different language if you wish. We decided to start the experiment with Natural, a Node.js NLP library, to create a baseline that we could then compare with Andrey’s Python neural network implementation.
Separating the wheat from the chaff
With clean text in hand, the next step was to extract tokens and make them ready for machine learning. Node’s Natural library has a couple of tokeniser and stemmer implementations, and we used a Lancaster stemmer with an aggressive regex-based tokeniser. The final preparation step was a very simplistic attempt to keep only nouns and verbs: removing any token with fewer than 3 characters. Twenty lines of code later, our data was ready for some serious processing.
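The preparation step can be sketched roughly as follows. Note this is a minimal stand-in: the real pipeline uses Natural’s Lancaster stemmer, which we represent here with a placeholder `stem` parameter:

```javascript
// Tokenise text aggressively with a regex, drop short tokens as a crude
// noun/verb filter, then stem. `stem` defaults to the identity function;
// in our pipeline it would be Natural's Lancaster stemmer.
function prepareTokens(text, stem = (w) => w) {
  return text
    .toLowerCase()
    .split(/[^a-z0-9]+/)          // aggressive regex-based tokenisation
    .filter((t) => t.length >= 3) // simplistic filter: drop tokens under 3 chars
    .map(stem);
}
```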
A naive start
As with any experiment we needed a baseline, so we picked a Naive Bayes classifier as a starting point. Naive Bayes is probably the most popular classifier in machine learning: it is simple and quick because it assumes conditional independence between features (here, tokens) given the class. Although it may not be the most accurate choice, it performs quite well in practice.
Node’s Natural library provides a simple implementation, and all we needed to do was train the classifier with the tokenised emails from the previous step and then compare the suggested classifications against the verification set. Internally, the classifier maps out the frequency counts of the supplied tokens and calculates, using Bayes’ theorem, the probability that a given email belongs to each of our trained labels.
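To make those internals concrete, here is a minimal multinomial Naive Bayes over token counts, with Laplace smoothing. This is a sketch of the idea, not Natural’s actual implementation:

```javascript
// Tiny multinomial Naive Bayes: frequency counts per label, then
// argmax over log P(label) + sum of log P(token | label).
class TinyBayes {
  constructor() {
    this.counts = {}; // counts[label][token] = frequency
    this.totals = {}; // total token count per label
    this.docs = {};   // document count per label
    this.nDocs = 0;
    this.vocab = new Set();
  }
  train(tokens, label) {
    if (!this.counts[label]) {
      this.counts[label] = {};
      this.totals[label] = 0;
      this.docs[label] = 0;
    }
    this.docs[label]++;
    this.nDocs++;
    for (const t of tokens) {
      this.counts[label][t] = (this.counts[label][t] || 0) + 1;
      this.totals[label]++;
      this.vocab.add(t);
    }
  }
  classify(tokens) {
    let best = null;
    let bestScore = -Infinity;
    const V = this.vocab.size;
    for (const label of Object.keys(this.counts)) {
      // log prior + Laplace-smoothed log likelihood of each token
      let score = Math.log(this.docs[label] / this.nDocs);
      for (const t of tokens) {
        score += Math.log(((this.counts[label][t] || 0) + 1) / (this.totals[label] + V));
      }
      if (score > bestScore) { bestScore = score; best = label; }
    }
    return best;
  }
}
```

Working in log space avoids the numeric underflow you would otherwise hit when multiplying many small token probabilities.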
Training the classifier took 34 seconds and verifying the results took 15 seconds, an average of 8.45 milliseconds per email, with a total accuracy of 80.43%. Not bad for a baseline.
Let’s get statistical
The next approach was a Logistic Regression classifier, which is based on a more elaborate probability calculation using the logistic function. Its training step uses stochastic gradient descent, a de facto standard optimisation algorithm in machine learning. The mathematics of this method are beyond the scope of this blog post but make an interesting read for enthusiasts.
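The core idea can be shown in a toy one-vs-rest binary model. This is a sketch of logistic regression trained with stochastic gradient descent, not Natural’s implementation; feature vectors are sparse token-count maps:

```javascript
// The logistic function squashes a linear score into a probability.
function sigmoid(z) { return 1 / (1 + Math.exp(-z)); }

// examples: [{ features: { token: count }, y: 0 or 1 }]
// Stochastic gradient descent on the log-loss: for each example,
// nudge the weights against the gradient (prediction - label) * x.
function trainLogistic(examples, epochs = 200, lr = 0.1) {
  const w = {};
  let b = 0;
  for (let e = 0; e < epochs; e++) {
    for (const { features, y } of examples) {
      let z = b;
      for (const [t, x] of Object.entries(features)) z += (w[t] || 0) * x;
      const err = sigmoid(z) - y; // gradient of the log-loss w.r.t. z
      b -= lr * err;
      for (const [t, x] of Object.entries(features)) {
        w[t] = (w[t] || 0) - lr * err * x;
      }
    }
  }
  return { w, b };
}

function predict({ w, b }, features) {
  let z = b;
  for (const [t, x] of Object.entries(features)) z += (w[t] || 0) * x;
  return sigmoid(z);
}
```

The per-example updates are why it is called *stochastic* gradient descent: each step uses the gradient from a single example rather than the whole dataset.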
Training the model took significantly longer at 4 hours, but that was expected given the iterative optimisation involved. Verification, on the other hand, was 18% faster at an average of 6.86 milliseconds per email. The most impressive result, however, was the 99.4% accuracy. This is definitely a model we will be exploring further, with datasets of different types and sizes, to verify whether the results are repeatable.
Thinking outside of the box?
Experimenting on our platform can be as simple as creating a new node, so we decided to also test a wild hypothesis. TF-IDF is commonly used to score the relevance of a word (or words) to a document in a corpus. Could we treat a tokenised email as a query, and use the average TF-IDF score of its tokens against all the documents associated with each label as a classification mechanism? Essentially, this puts to the test the theory that the set of words that comprise a document is unique enough to classify it among a number of categories.
Training this method was quite fast at 4 seconds, as all it does is build a corpus of documents for each label and calculate the frequency of each token. Verification, on the other hand, was significantly slower at 200 milliseconds per email, as it has to traverse the entire training corpus to calculate the scores against each document, then normalise the sum against the number of documents in each corpus and the number of tokens in each email.
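The hypothesis can be sketched directly. This is an unoptimised illustration of the idea, not our Sift’s code; it recomputes IDF on every query, which is exactly why verification is the slow part:

```javascript
// Score a tokenised email (treated as a query) against each label's
// corpus using average TF-IDF, and return the best-scoring label.
// corpora: { label: [[token, ...], ...] } — one list of documents per label.
function classifyByTfIdf(corpora, queryTokens) {
  const allDocs = Object.values(corpora).flat();
  const N = allDocs.length;
  const idf = (t) => {
    const df = allDocs.filter((d) => d.includes(t)).length;
    return df === 0 ? 0 : Math.log(N / df);
  };
  let best = null;
  let bestScore = -Infinity;
  for (const [label, docs] of Object.entries(corpora)) {
    let sum = 0;
    for (const doc of docs) {
      for (const t of queryTokens) {
        const tf = doc.filter((x) => x === t).length / doc.length;
        sum += tf * idf(t);
      }
    }
    // Normalise against corpus size and query length, so large labels
    // and long emails do not dominate the score.
    const score = sum / (docs.length * queryTokens.length);
    if (score > bestScore) { bestScore = score; best = label; }
  }
  return best;
}
```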
While the total accuracy was lower than the other classifiers at 60.49%, the pattern of the confusion matrix was similar to the Naive Bayes classifier, with a higher score on template-based emails (Property and Tax & Accounting) and a lower score on the mailing list and random conversations.
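For readers who want to reproduce this kind of comparison, the confusion matrix we refer to can be built with a small generic helper (a sketch, not our production code):

```javascript
// Build a confusion matrix: rows are actual labels, columns are
// the classifier's predictions.
function confusionMatrix(actual, predicted, labels) {
  const idx = new Map(labels.map((l, i) => [l, i]));
  const m = labels.map(() => labels.map(() => 0));
  actual.forEach((a, i) => { m[idx.get(a)][idx.get(predicted[i])]++; });
  return m;
}

// Total accuracy is the diagonal sum over all entries.
function accuracy(matrix) {
  const correct = matrix.reduce((s, row, i) => s + row[i], 0);
  const total = matrix.flat().reduce((s, v) => s + v, 0);
  return correct / total;
}
```

Reading down a row shows where a label’s emails actually end up, which is what makes per-label weaknesses like ours visible at a glance.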
The other interesting observation from our experiments was that the overall score dropped considerably as we increased the size of the dataset. This contrasts with the behaviour we observed with the other classifiers and is something we will be exploring further.
Adding some HTML back in
The various approaches we tested produced some interesting results, but it would be quite a let-down to have to export them into another tool for a static presentation. We are processing live emails from Gmail, so we want to see live results, ideally integrated with Gmail.
You guessed it, the previous paragraph was just setting the scene so we can say that Redsift has that covered too. We want users of our platform to have information available on the channels that they already use. For Gmail, we have built a Chrome extension that allows users to install and run Sifts. The concept is similar to an app store for your data, where you can browse and install the Sifts that are relevant to you.
As a Sift developer, you don’t need to worry about data synchronisation, just get your Sift to emit the values that you want your client to have access to and they appear on the other side.
We are also big fans of d3.js and, unsurprisingly, Mike Bostock has a nice co-occurrence matrix example that is just what we needed to represent a confusion matrix. Add a little typography and CSS and we end up with quite a nice interactive presentation integrated with our inbox.
Given the limited time window, we can consider the initial results quite successful. The facilities provided by the platform allowed us to spend most of our time on the machine learning techniques instead of on parsing, storing and connecting data. And there was enough time to compare 3 different approaches to email classification.
In future blog posts we will integrate Andrey’s Python classifier into this Sift, and then discuss how we converted the final results of these experiments into the live Classifier Sift we alluded to at the beginning of this post. We will open-source its code when it is ready, to provide a starting point and inspiration for the developer community.
We will also talk more in depth about our platform and our roadmap but we hope that, for now, we have managed to give you an idea of what can be achieved with Redsift today.
We are handing out a new batch of invites for our early access program in the coming months and would love to see the interesting Sifts you can come up with. If hacking on your own data and potentially creating solutions that can help thousands (or millions) of other people sounds interesting to you, why not register here for early access or email us?