Microsoft offers a lot of fascinating APIs for building intelligent applications through its Cognitive Services. Among those services is the Text Analytics API, which provides a wide range of valuable text-based functionality such as sentiment analysis and key phrase extraction.

With these useful APIs available, what better way to put them to use than to incorporate them into a small app? The goal of this app will be to take data from Twitter and run it through some of the Text Analytics APIs to see if we can get insights into what people are saying about Wintellect. In this post, I’ll be using Python to get the Twitter data as well as to call the Text Analytics API to extract our insights.

As usual, the notebooks for this post are available on GitHub: retrieving Twitter data and calling the API to detect languages. The API keys are stored in a separate JSON file local to the application, which isn’t checked in. That means you’ll have to create your own JSON file and fill it in with your keys to enable the API functionality.
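If you want to follow the same pattern, here’s a minimal sketch of loading such a file. The file name config.json is my own choice for this example; the key names match what the code later in the post expects.

import json

# Hypothetical config.json sitting next to the notebook:
# {
#     "twitterConsumerKey": "...",
#     "twitterConsumerSecretKey": "...",
#     "subscriptionKey": "..."
# }
with open("config.json") as config_file:
    config = json.load(config_file)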

Getting Twitter Data

Instead of calling the Twitter API directly, we will make it easier on ourselves and get the data with the tweepy package. We’ll do two things with the package: authorize ourselves to use the API, and then use the search API.

Let’s go ahead and get our imports loaded.

import tweepy
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

sns.set()
%matplotlib inline

Twitter Authorization
To use the Twitter API, you must first register your app to get API keys. To get tweepy, just install it via pip install tweepy. The tweepy documentation does the best job of explaining how to authenticate, but I’ll go over the basic steps here.

Once you register your app, you will receive your API keys. Next, use tweepy to create an OAuthHandler. As mentioned earlier, I have the keys stored in a separate config file.

auth = tweepy.OAuthHandler(config["twitterConsumerKey"], config["twitterConsumerSecretKey"])

Now that we’ve given tweepy our keys to create an OAuthHandler, we can use the handler to get a redirect URL. Go to the URL from the output in a browser, where you can authorize the app against your account so you can get access to the API.

redirect_url = auth.get_authorization_url()
redirect_url

Once you’ve authorized the app with your account, you’ll be given a PIN. Use that number in tweepy to let it know that the authorization went through.

auth.get_access_token('pin number')

Now that your app is authorized with your account, you can use tweepy to get an instance of the API.

api = tweepy.API(auth)

With an instance of the API made available by tweepy, we can start using the API to perform a search for data.

Search API
Calling the search API is quite easy with tweepy.

wintellect_search = api.search("Wintellect")
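
If you want to confirm what the call returned, checking the type is a one-liner:

type(wintellect_search)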

If we look at the wintellect_search variable, we can see that it has a data type of tweepy.models.SearchResults. The documentation tells us that the search method returns a list, so we can iterate over the search results. Before we do that, though, let’s see what attributes we can use on one of the result objects. We can do that with the built-in Python function dir.

dir(wintellect_search[0])

Portion of available attributes

In reviewing the results of the dir function, a few useful attributes can be spotted including text, author, and created_at. To get those for each tweet, we’ll loop through the search results and extract those attributes into a dictionary. We’ll also check if the tweet is a retweet. Since retweets can add noise to our data, we won’t include them here.

search_dict = {"text": [], "author": [], "created_date": []}

for item in wintellect_search:
    # Skip retweets since they just add noise to our data
    if not hasattr(item, "retweeted_status") and "RT" not in item.text:
        search_dict["text"].append(item.text)
        search_dict["author"].append(item.author.name)
        search_dict["created_date"].append(item.created_at)

Now that we have a dictionary of our results, we can use pandas to convert the dictionary into a data frame.

df = pd.DataFrame.from_dict(search_dict)
df.head()

Next, we can determine the size of our data set by calling the shape attribute on the data frame.

df.shape

Not a lot of data, but it’s enough for us to work with.

Language Detection API

One piece of information that can be extracted with the Text Analytics API is the language of the text. Before the API can be used, though, an API key must be created. To get one, go to the Text Analytics API page, click “Try Text Analytics API”, and then, in the Text Analytics API row, click “Get API Key”. Now you’re ready to get some insights from text!

The Text Analytics API has great documentation, and it helps us out by giving us a URL to use for calling the API. This URL is only a quick start URL that we can use to play around with the API. If we were to use this URL in a production application that would involve a lot more requests to the API, we would need to go to Azure and set this up on our own.

But for our purposes, the quick start URL is a useful starting place. To call the API, we’ll be using the requests package, which makes it very easy to make web requests and does a better job than what’s built into Python.

import requests

text_analytics_language_url = "https://westcentralus.api.cognitive.microsoft.com/text/analytics/v2.0/languages"

The documentation also gives us a structure for the data that the API expects. Since our data is in a pandas data frame, we can iterate over the rows to build out that structure. The iterrows method yields a tuple containing the index of the row and the row itself, so we can unpack it directly in the for loop statement.

documents = {"documents": []}

for idx, row in df.iterrows():    
    documents["documents"].append({
        "id": str(idx + 1),
        "text": row["text"]
    })

Now let’s take a look at the documents variable to see what we have so far.
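Abbreviated, the structure should look roughly like this (the tweet text values here are just placeholders):

{'documents': [{'id': '1', 'text': '<first tweet text>'},
               {'id': '2', 'text': '<second tweet text>'},
               ...]}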

Our data structure for the API looks good, so now we can start building up the API call. The main piece we need is to supply our subscription key in a header to authorize the call. If we don’t supply it, we’ll get an unauthorized response.

headers = {"Ocp-Apim-Subscription-Key": config["subscriptionKey"]}

With the subscription key set in the headers, we can now call the API.

response = requests.post(text_analytics_language_url, headers=headers, json=documents)
languages = response.json()
languages

It looks like most of the text is in English, which is expected. However, if we scroll down toward the end of our results, we’ll see something different.

Interesting, there was a tweet made in French.
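
For reference, each item in the response’s documents list looks roughly like this (the id and score here are placeholders):

{'id': '12',
 'detectedLanguages': [{'name': 'French',
                        'iso6391Name': 'fr',
                        'score': 1.0}]}

This is why the extraction loop below digs into detectedLanguages and pulls out iso6391Name.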

Now that we have our results, let’s extract what we need from them into another pandas data frame. Looking at the API documentation for sentiment analysis, we can see that it takes in a language. From our results, it appears that we can pass in the iso6391Name, so we’ll extract that out.

detected_languages = []
for language in languages["documents"]:
    for detected in language["detectedLanguages"]:
        detected_languages.append(detected["iso6391Name"])

We can now create a data frame from the list of detected languages.

languages_df = pd.DataFrame(detected_languages, columns=["language"])
languages_df.head()
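
If you’re curious about the overall breakdown, a quick value_counts on the new column shows how many tweets were detected for each language:

languages_df["language"].value_counts()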

To persist the tweet and language data, let’s save them to CSV files so we can reuse this same data in the future.

df.to_csv("./tweets.csv", index=False, encoding="UTF-8")
languages_df.to_csv("./detected_languages.csv", index=False, encoding="UTF-8")

In this post, we got data from Twitter with the tweepy package. Once we had our tweets, we used the Text Analytics API from Microsoft Cognitive Services to detect the language of each tweet. In our next post, we’ll use the detected languages to determine the sentiment of each tweet. Are there more positive than negative tweets? We can find out by continuing to use the Text Analytics API to perform sentiment analysis on each tweet.