Thursday 11 October 2012

Using Google Prediction

So this post is all about using Google Prediction and some of the experience I've gained so far using it on my web application.  This is some pretty awesome technology, and the way Google have exposed it and made it so easy to use should open up all sorts of opportunities for developers to get value out of it.

The Basics

So the basics of Google's prediction service are pretty easy to follow and get up and running.  Just go to https://developers.google.com/prediction/ and follow the instructions.  Remember, you'll need to sign up for billing for this service, although you do get a courtesy 10,000 calls free a month, which is plenty to play around with, and nothing will be charged to your credit card until you go over that.

All you need as an initial sample is some dummy data saved into a CSV file, as per the screenshot below.
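Just to give a feel for the format (these rows are made up for illustration, not the actual data from the screenshot): the first column is the label you want predicted and the second is the text to learn from, with no header row.

    "positive","I love this product, it's fantastic"
    "negative","worst customer service I have ever had"
    "neutral","the package arrived on tuesday"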



Tip: The prediction engine IS case sensitive.  In other words, you either have to give it two sets of data to train with (lower case and upper case), or give it one set and cast any text to the correct case when you want to run an actual prediction (which is what I've done).

Load the CSV file into your Google Cloud Storage area and you're ready to train your model.  This is as easy as going to the prediction interface, giving your model an ID and putting in the Cloud Storage path.
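If you'd rather kick the training off through the API itself (for example from the APIs Explorer), it's a prediction.trainedmodels.insert call with a request body along these lines. The model ID and bucket path here are placeholders, and the field names are from the v1.5 API as I remember it, so double-check them against the current docs:

    {
      "id": "sentiment-model",
      "storageDataLocation": "mybucket/training-data.csv"
    }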


Then you can easily try out your new model by choosing prediction.trainedmodels.predict and passing the ID (as above) and any text string you'd like to test.
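The request body just wraps the text you want scored in a csvInstance array, and the response gives you the winning label plus a score per label. Roughly like this (the values are made up for illustration, and the field names are from the v1.5 API from memory):

    Request:
    {
      "input": { "csvInstance": ["this product is fantastic"] }
    }

    Response (abridged):
    {
      "outputLabel": "positive",
      "outputMulti": [
        { "label": "positive", "score": "0.72" },
        { "label": "negative", "score": "0.09" },
        { "label": "neutral",  "score": "0.19" }
      ]
    }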

So that's it, pretty simple so far.

Now It Gets Interesting....

So in its simple form like the above you can see how the prediction engine works and get some simple samples working.  Now, if you want to get serious about using this tool, the key thing above all is that you're going to need data, lots and lots of data.

What I'm building here is a sentiment analysis tool, and trust me, the amount of time I've put into building the model so far has been significant to say the least.  I'm now in excess of 10,000 training examples and I'm fairly happy with how it's been going.  So far I've been training the model purely by hand, as I've wanted to ensure the quality of the model is not compromised.

What's really cool with the prediction tool is that each time you refresh your model you can run an "analyse" command to see how well your model is measuring up in the different categories you're training it on.  First, it gives you a breakdown with some warnings of "stray" values in your source data, which makes it easy to cleanse the data when it starts getting very large.  This is seen in the screenshot below.



Then it also gives you an excellent breakdown of how many samples you're providing for each label you're scoring on.  This is key to ensure you don't "over-train" your model on some categories and not others.  Finally it gives a "confusion matrix" - see below.  This shows how close the different categories you have are to each other: the closer the numbers between the categories, the more chance they have of getting confused.
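All of this comes back from a single prediction.trainedmodels.analyze call.  The shape below is roughly what the v1.5 response looks like from memory, with invented counts, so treat it as indicative only - the off-diagonal numbers in the confusion matrix are where categories are getting mixed up with each other:

    {
      "dataDescription": {
        "outputFeature": {
          "text": [
            { "value": "positive", "count": "4100" },
            { "value": "negative", "count": "3800" },
            { "value": "neutral",  "count": "2600" }
          ]
        }
      },
      "modelDescription": {
        "confusionMatrix": {
          "positive": { "positive": "3700", "negative": "150",  "neutral": "250" },
          "negative": { "positive": "180",  "negative": "3400", "neutral": "220" },
          "neutral":  { "positive": "300",  "negative": "280",  "neutral": "2020" }
        }
      }
    }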




You can see how the general sentiment analysis from my model works so far at http://www.socialsamplr.com/prediction.   Type a phrase (or copy and paste from anywhere) into the top text box and click the "Get Response" button to see what score it gives.

In addition, I've been training my model across a range of categories, which is more of a work-in-progress - this can be checked out at http://www.socialsamplr.com/lab.

Sourcing Data

There is obviously a rich array of data from social media which you can source to train your model with.  Just make sure you stick on the right side of any rules when you do it.  For example, Twitter allow you to manually copy and paste data for scoring but any automated process where you're saving actual tweets to a cloud data source is against the developer rules of the road.

In short, there's not really an easy way to build up a really good model for sentiment analysis.  It takes a lot of legwork, and ongoing work is needed to ensure it remains accurate and relevant.  Another factor to be aware of: if the data you first train with is biased towards positive or negative examples (for example, if a topic is having a bad week in the news, that can skew any data you collect that week), the model will need to be continually updated over time to account for it.  However, when you do eventually get your model working well, it's pretty cool to start applying it and seeing the results.

Using Prediction in Google Apps Script

This couldn't be easier.   Set a reference under Resources->Use Google APIs


Then write your code...
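Something along these lines - a minimal sketch assuming the Prediction advanced service is enabled, with a placeholder model ID (the call signature is from the v1.5-era service, so check it against whichever version you've ticked):

    // Minimal sketch: score a piece of text against a trained model.
    function getSentiment(text) {
      var modelId = 'sentiment-model'; // placeholder - use your own model ID
      // Cast to lower case first, since the prediction engine is case sensitive (see the tip above)
      var request = { input: { csvInstance: [text.toLowerCase()] } };
      var result = Prediction.Trainedmodels.predict(request, modelId);
      Logger.log(result.outputLabel);  // e.g. "positive"
      Logger.log(result.outputMulti);  // per-label scores
      return result.outputLabel;
    }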




Some Data Samples

So the way SocialSamplr tracks social media sentiment at the moment is through ongoing monitoring of topics by groups of hashtags.  Although it's early days, I've already seen some interesting data coming through on some of the topics being tracked, as shown below.  Over time I'll hopefully be able to overlay these charts with the current events occurring at the time and derive some really interesting data mining samples.
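To give a feel for the idea (a hypothetical sketch only, not the actual SocialSamplr code - getSentiment is the function from the previous section and the message/group structure is made up), aggregating scores per hashtag group might look something like this:

    // Hypothetical sketch: average sentiment per hashtag group for a batch of messages.
    function scoreByGroup(messages) {
      // messages: array of { text: '...', group: 'hashtag group name' }
      var totals = {};
      for (var i = 0; i < messages.length; i++) {
        var msg = messages[i];
        var label = getSentiment(msg.text); // "positive", "negative" or "neutral"
        var score = label === 'positive' ? 1 : (label === 'negative' ? -1 : 0);
        if (!totals[msg.group]) totals[msg.group] = { sum: 0, count: 0 };
        totals[msg.group].sum += score;
        totals[msg.group].count++;
      }
      // Average per group, ranging from -1 (all negative) to +1 (all positive)
      var averages = {};
      for (var group in totals) {
        averages[group] = totals[group].sum / totals[group].count;
      }
      return averages;
    }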


Next Post...

So that's about it for this week, hope it's been interesting.  The next post will cover real-time monitoring of Facebook data (which I'll hopefully complete testing on soon) and streaming of Twitter data using Python on App Engine.  Maybe a chat about BigQuery as well.  Any questions or feedback in the meantime, just let me know.


Thanks

Daniel