Humans do it better – but do they scale?

It's not right, is it? Click image for source.

Today, two seemingly unrelated but actually very similar discoveries: socialmention is offering sentiment analysis among other metrics; and SpinVox uses people to transcribe messages.

Humans as machines

First, the second: SpinVox. They offer voice-to-text conversion, which is something of a holy grail for computing, and given my past interest in AI, I found the proposition fascinating. I haven’t used the service myself but I’ve followed their progress keenly over the past couple of years, having actually done some work for them. At the time, I met Daniel Doulton and Christina Domecq, and they were a powerhouse. You got the feeling that everything, and anything, was possible.

And it turns out that yes, everything was possible, both good and bad, because news is out that their systems aren’t purely tech. They use people to transcribe, in call centres dotted around the world. This is a revelation to me and rather damages their core proposition. People on Twitter seem to think so too, as does Rory Cellan-Jones of the BBC, who sees SpinVox not so much spinning as unravelling.

Quite apart from potentially being in trouble by having a call centre in Egypt, contrary to their claims of working within the European Economic Area, it implies to me that, far from having systems that scale, they have human beings that do not.

If their solution truly worked entirely with speech recognition then it would be gloriously easy – and a compelling business model – just to plug in server farms and data centres when load grew. But the ultimate corollary of human transcription is that you have half the world calling, and the other half transcribing. It doesn’t compute.

This would account for their other large headache: money. They’ve been asking staff to take share options instead of money, which was probably OK for Apple in the 1970s, but times have changed since then. A while ago I heard Christina Domecq on Radio 4’s Bottom Line programme, in which she implied the recession was a huge opportunity. I wonder whether she still thinks this?

She also said her systems ‘learned’. From what we now know, I guess this was the truth but maybe not the whole truth.

Machines as humans

Secondly, the first: the search engine socialmention, which scans the social media space – blogs, forums, microblogs, etc. – for your search terms.

After reading about SpinVox I decided to use socialmention to see what people were saying about it. I noticed with interest that socialmention has some metrics I haven’t seen before (admittedly because I haven’t used it in a while). One of them I ‘get’: reach is calculated as the number of unique authors divided by the number of mentions. But the other three – strength, passion and particularly sentiment – I do not.
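
To make that concrete, here’s a minimal Python sketch of the reach formula as socialmention describes it – the sample data and names are mine, purely for illustration:

    # Reach, as socialmention defines it: unique authors / total mentions.
    # The sample data here is made up for illustration.
    mentions = [
        {"author": "alice", "text": "SpinVox uses humans?!"},
        {"author": "bob",   "text": "So SpinVox doesn't scale after all"},
        {"author": "alice", "text": "More on the SpinVox story"},
        {"author": "carol", "text": "SpinVox in the news again"},
    ]

    unique_authors = len({m["author"] for m in mentions})
    reach = unique_authors / len(mentions)
    print(f"reach = {reach:.2f}")  # 3 authors / 4 mentions = 0.75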

Strength is ‘phrase mentions within the last 24 hours divided by total possible mentions.’ Total possible mentions? What does this mean? Surely the number of possible mentions is virtually infinite?

Passion is ‘the likelihood that people talking about your brand will do so repeatedly.’ This is maybe a bit clearer in that it probably uses frequency of mentions by unique authors. Or something. Again, it’s not particularly clear.
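
If my guess is anywhere near right, a passion-like score might look something like this – and I stress this is pure speculation on my part, not socialmention’s actual method:

    # Speculative sketch only: if 'passion' reflects repeat mentions by
    # the same authors, one crude version is mentions per unique author.
    from collections import Counter

    authors_of_mentions = ["alice", "bob", "alice", "carol", "alice"]
    per_author = Counter(authors_of_mentions)
    passion = sum(per_author.values()) / len(per_author)
    print(f"mentions per unique author = {passion:.2f}")  # 5 / 3 = 1.67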

But sentiment is what truly gets me. It talks about ‘generally positive’ and ‘generally negative’ and, being free and openly available, it’s probably doing something similar to Waggener Edstrom’s twendz Twitter sentimenting tool, which, it seems to me, just uses fairly crude keyword proximity algorithms rather than anything rigorous.

That is, it figures out sentiment, but fairly badly. I used the tool as a test when Jade Goody died. I noticed it would class as ‘negative’ tweets that said “sad that Jade Goody died” – clearly figuring that the proximity of ‘sad’ to ‘Jade Goody’ implied negativity. Wrong.
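
To show what I mean, here’s a toy version of that kind of keyword-proximity matching – my own sketch of the crude approach I suspect is at work, not twendz’s or socialmention’s actual code. It duly flags the sympathetic tweet as negative:

    # Toy keyword-proximity sentiment: my own guess at the crude approach,
    # not any tool's real implementation.
    NEGATIVE_WORDS = {"sad", "bad", "awful", "hate"}
    POSITIVE_WORDS = {"happy", "great", "love", "good"}

    def crude_sentiment(text, subject, window=5):
        words = text.lower().split()
        subject_words = set(subject.lower().split())
        subject_positions = [i for i, w in enumerate(words) if w in subject_words]
        score = 0
        for i, w in enumerate(words):
            # Only count opinion words that fall near the subject.
            if any(abs(i - p) <= window for p in subject_positions):
                if w in NEGATIVE_WORDS:
                    score -= 1
                elif w in POSITIVE_WORDS:
                    score += 1
        return "negative" if score < 0 else "positive" if score > 0 else "neutral"

    # 'sad' sits right next to the subject, so the tweet is classed as
    # negative -- even though the author is sympathetic, not hostile.
    print(crude_sentiment("sad that Jade Goody died", "Jade Goody"))  # negative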

I’ve done sentimenting myself in the past. I’ve been through search results for clients and figured out whether they’re positive or negative by actually reading them. But I can only do so much, often restricting myself to only a few pages of search results. Machines can do much more – they scale – but can they do it better?

I’ve recently been working a lot with PR measurement, and have had my eyes opened to the crudity of some measures out there. AVE, for example, is only good for impressing people. That’s why some PR companies use it to impress their clients, and their clients, in turn, use it to impress their bosses. But it’s total bollocks.

So given the importance of accurate measurement, I would argue that tools like socialmention are actually dangerous. Some people out there might actually be using it to gain insight, and they will be doing so in a wholly unaccountable way. The conversation goes thus: “We’ve found that people are overwhelmingly positive about your brand.” “How do you know that?” “Socialmention says so.” “How does it know that?” “We don’t know.”

They’re not the same (not yet anyway)

On the one hand, perhaps it’s better that SpinVox is using humans, because humans understand language better than computers, at least for the time being (quite apart from some of those humans also being naughty by posting their SpinVox grievances on Facebook). On the other, the company has some explaining to do, because they’ve kind of sort of perhaps maybe possibly led people into believing they were a tech solution – which would imply a much more effective business model, if less effective transcription.

Meanwhile, socialmention is an unashamedly tech solution. But it’s claiming to do what humans do, and I just don’t believe that is the case. If machines could do it, SpinVox would be using them, right?


11 thoughts on “Humans do it better – but do they scale?”

  1. Having been “working a lot with PR measurement” for 23 years, I LOVE this post. You are SO right. I used Socialmention on my own name, and found that the negatives came up when I said “AVEs are a terrible metric” – a belief I’ve maintained for about 20 years, and the more the phrase is repeated, the happier I am. So much for a computer’s ability to define “negative”, never mind positive.
    I will, however, suggest an alternative in the human vs computer battle. We use computer-aided technology to help us define the right articles to read. We then use humans to read them. If there are too many to analyze in the time frame or budget that our clients require, we use random sampling, a technique that has been around for years in research. It’s a nice compromise.

  2. AVEs are a terrible metric. AVEs are a terrible metric. AVEs are a terrible metric.

    Does that make you happy? 🙂

    Random sampling? As you say, it’s been around for years. From what I’ve learned recently, it seems to me that, to accurately and reliably measure, you either use humans, or you use techniques that have ‘been around for years’ because they’re proven, statistically robust measures.

    It doesn’t really matter that we’re dealing with ‘online’ here. There’s nothing that different. It’s just more volume. The same maths models and statistical techniques apply.

    And I suppose you could say humans have been around for quite some time too.

  3. Hi Brendan,
    Very interesting post. I did, however, want to clarify one aspect with regard to SocialMention’s sentiment scores. You misinterpret what sentiment means in this context.

    Let’s take your example:
    “I used the tool as a test when Jade Goody died. I noticed it would class as ‘negative’ tweets that said ‘sad that Jade Goody died’.”

    This is actually correct behavior.

    Sentiment, as it applies in this context, is scored based on the tonality of the post, which in this case is negative. The author is sad, as the post clearly shows; hence, the mention is scored as such.

    Perhaps it’s the use of the labels “positive” and “negative” that is misleading.

    SocialMention’s sentiment scoring mechanism gauges the tonality of the post as a means to flag content – it does not, however, utilize a complicated language analysis … such a system wouldn’t be very accurate anyway, given the extremely complex nature of internet speak.

    Jon

    • Hmmm. OK, so the tone is negative, but it’s definitely not negative about Jade Goody or her death, is it? If you were to show this statement to someone and ask whether it’s negative or positive in relation to Jade Goody dying, they wouldn’t say negative, I’m sure.

      Perhaps ‘sympathetic’ would be a better word here.

      • Only in a society fueled by Zoloft and antidepressants would “Sad” be considered a negative. This is the quintessential example of why automated sentiment analysis is almost always misleading.

  4. One thing I discovered in my research is that human analysis is also far from perfect. Jon’s dispute of the “sad that Jade Goody died” example illustrates this. What I found was that disagreement arose most often over content that is not strongly polarized in either the positive or negative direction.

    The other aspect is how you interpret auto-sentiment. With twendz, we’re trying not to present it as the gospel truth, but rather to help you to identify trends as an early-warning indicator, or a snapshot of a point-in-time for your brand. This is why we don’t display the sentiment for individual tweets, but instead as an aggregate since this encompasses a greater accuracy.

    You’re right, though. It’s not a measurement metric, it should instead be treated as an indicator or a gauge.

  5. Anyone doing human content analysis properly should be using formal coding instructions that are tailored to the market’s definition of desirable or undesirable content. We call it an “Optimal Content Score”, which changes for each client. Our coders routinely achieve between 85% and 90% intercoder reliability. Most auto-sentiment doesn’t come close.

  6. “This is why we don’t display the sentiment for individual tweets, but instead as an aggregate since this encompasses a greater accuracy.”

    I’m not sure I understand this. Surely if your individual sentiment calculations are correct, then in aggregation they would show correct sentiment too?

    So, if you have ten positive tweets and five negatives, then you would show a bar chart with positives at ten, and negatives at five, or give a ‘score’ of 10:5 or 200% or 2:1.

    If not then you’re aggregating incorrect sentiment to give… incorrect sentiment.

  7. My company faces these training challenges all the time. It is my experience that in order to build a reliable sentiment analysis tool you need a reliable source of training documents and/or human input at the outset.

    An algorithm may “learn over time” but in order to do so it needs consistent re-training by a user base or trained group of taggers. The more complex the sentiment categories get, the more human input is required to build accuracy over time. This training factor does not take away from the fact that a working machine solution to certain problems is far more scalable than a human one.

    I think the key is transparency – in terms of how you do your training and to what degree human input is required. If interested, I wrote a little about our own training process here: http://adaptivesemantics.com/blogs/Building_consistency

  8. Accuracy in sentiment analysis is as difficult to do as it is desirable to have. I think when a company leads with the idea that they have the best sentiment analysis available you should proceed with caution because whether or not you use computers to do it, everyone runs into similar barriers.

    With computers you get scale but probably more mistakes; with humans you can’t scale but you probably get more accuracy. Regardless, if you are analyzing the English language, no one can claim the holy grail, because both humans and computers make mistakes. You have to use the right tools for the job.

    Not strictly about sentiment but it’s something we try to deal with at Networked Insights too: http://bit.ly/gRb8T

  9. @Elena and @Alex – One thing’s certain – whatever systems you’re using to monitor debates about sentiment analysis seem to work! 😉 I notice, Alex, that you even have an image on your blog reminiscent of the one I chose for my post!

    Great comments though. It’s always good to get the insights from people working in the field.
