Detecting random text in web registration forms

Published on 22 July 2014, in #data-engineering

I came across an interesting problem a couple of days ago. A friend of mine runs quite a large website with a lot of users registering on their web. The problem is that there are users which are lazy to enter proper information so they just enter random text, as this for example:

img is missing, sorry

Since the correct registration information is crucial to their business, he would like to have a solution which would detect such random text automatically.

It would be very useful for him to have a tool which would check all data gathered from registration forms and mark rows which look like random text. He calls this false user registrations. Also, it would be interesting to have an online detector which would notify the customer-care team if such a false registration is submitted.

Although there are many approaches to this problem, I was curious if such problem can be solved algorithmically. In particular, I was curious what would be the most simple solution that would be useful in 80% of the cases. The reason is that when I am presented with a (data) problem, I always find it very effective to start with the most simple solution first. And often these solutions turn out to be satisfactory enough.

Anyway, I was first tempted to go with supervised algorithms, like Naive Bayes classifier for example. However, I realized that it would be much more interesting to come up with an unsupervised algorithm, which would identify random words automatically without any prior knowledge. Just give the algorithm a list of words and it gives you back words from the list which look random. Such solution would also be practical from the business perspective since you can immediately identify suspicious registrations in a sea of production data.

So here is a description of a very simple - yet surprisingly effective - unsupervised solution to this challenge. Based on frequency-analysis of character bigrams.

Before we dive in, let's the see results #

For the sake of simplicity, I focused on the field of the registration form that contains a customer name. To mock up production data, I took the 2000 most common names in US and added 20 random texts on top of the list. I got this list of words, which I gave as input to the algorithm. (The order of words doesn't matter.)

What the algorithm does is to compute a suspicious score for each word. The higher the score, the more suspicious (i.e. random) the word looks when compared to the rest of the words.

If you run the algorithm, you get this list of words sorted by a suspicious score. I marked the 20 random words with an asterix, so I can quickly evaluate quality of the result.

We see that the algorithm is pretty effective: the top 10 words with the highest suspicious score (above 25.0) consist of random words only. Actually, the top 16 words consist of 15 random words and only one regular name - guadalupe.

The five random words which didn't make it to the top 20 include tyrwereh (#23), qwerpo (#43) and sfert (#422). However, I'm ok with the low scores here, because these words don't look very random to me anyway :)

I have made also an interactive version of the algorithm, it is available here. Go ahead and give it a try. (Note that it might take up to 10 seconds to load the page because the Heroku node might need to wake up.)

How it works #

If you look at a random word, e.g. sdfjtwd, you see a lot of rare combinations of characters like sd or ft or jt. If you look at a word dustin, there is no combination of characters which strikes you are rare. Maybe st, but that's about it.

A combination of two adjacent characters in a word is called a character bigram. So sd, ft and jt are all bigrams of the word sdfjtwd. (By the way, there are also word-bigrams which are adjacent words in a sentence.)

What you want to do to detect random text is the following:

Compute a score for each bigram. This score should be high if the bigram is rare among all given words. The score should be low if the bigram is very common among given words.
Compute suspicious score of a given word as a sum of all bigram scores that are contained in the word.

Step 1 - Compute bigram scores #

Consider the words aabaa, abbb and ababa for example. Let's compute for each bigram how many times it occurs in each word and also how many words in total contain this bigram:

Word / Bigram	aa	ab	ba	bb
aabaa	2	1	1	-
abb	-	1	-	1
ababa	-	2	2	-
Total words containing the bigram	1	3	2	1

We see that the bigram aa occurs twice in the word aabaa, while bigram bb doesn't occur at all in aabaa. Also, we see that the bigram ab is the most common bigram (used in all 3 words) and the bigrams aa and bb are the most rare (used only in one word).

Now we know which bigrams are rare and which are common in the dataset. We just need to translate it into a score.

The most popular metric for this purpose is the inverse document frequency (IDF). For a given bigram $b$, the IDF is computed as

$$ {IDF}(b) = \log\frac{N}{n_b}, $$

where $N$ is the total number of words (in our case $N=3$), and $n_b$ is the number of words which contain the bigram $b$.

In our example of words aabaa, abbb and ababa, the IDFs are as follows:

	aa	ab	ba	bb
$n_b$	1	3	2	1
${IDF}(b)$	1.1	0.0	0.4	1.1

This is exactly what we need: The most common bigram (ab) has zero IDF, while the most rare bigrams (aa and bb) have highest IDF.

By the way, to see the IDFs of all bigrams from our data of 2020 words, click here.

Step 2 - Sum up bigram scores #

To get the suspicious score for a given word, we just need to sum up IDFs of all bigrams occurring in a word:

Word / Bigram	aa	ab	ba	bb	Suspicious score (sum of IDFs)
aabaa	2	1	1	-	2.6 = 2 * 1.1 + 1 * 0.0 + 1 * 0.4
abb	-	1	-	1	1.1 = 1 * 0.0 + 1 * 1.1
ababa	-	2	2	-	0.8 = 2 * 0.0 + 2 * 0.4
IDF	1.1	0.0	0.4	1.1

We see that the word aabaa with a score of 2.6 is the most suspicious word out of our fabricated example.

If you use this algorithm to compute suspicious score on our 2020 words, you get this result. Looks ok, but take a look at positions #9, #11 and #13. Christopher and Jefferson are regular names but they have very high suspicious scores (e.g. higher than lkjwqer, erwtrepo and others).

Can we improve the algorithm?

Improve the algorithm with adjusted-IDF #

The problem with Christpoher or Jefferson is that they are long and contain a lot of common bigrams. The sum of IDFs of these common bigrams is higher that the sum of IDFs of fewer but more rare bigrams.

To overcome this, we can adjust the IDF formula to return lower score for common bigrams. One of many possible options is to use the following formula:

$$ \text{IDF}_{adj}(b) = \log\frac{\max_c(n_c)}{n_b}, $$

where the $\max_c(n_c)$ term is just the number of occurrences of the most common bigram. Using this formula, the bigram score should be zero for the most common bigram and almost zero for very common bigrams.

To see the adjusted-IDFs for the bigram in our sample of 2020 words, click here. You can see that scores of very common bigrams decreased from 2.5 to 0.5, while the score of most rare bigram remain above 5. (You can experiment and tweak this formula more in order to fit your data. Apart from logarithm, you might also consider different functions)

The adjusted-IDF is the score that I used to arrive at the results presented in the beginning of the article.

Grab the source code here #

I put the source code (Scala) on github. It tried to make it simple to get the tool up and running on your local machine (either in batch-mode or in server-mode with API). Also, it should be possible to deploy code to Heroku without any code changes.

→ Next post: Analyzing Elastic MapReduce data with Python, Pandas and scikit-learn

This blog is written by Marcel Krcah, an independent consultant for product-oriented software engineering. If you like what you read, sign up for my newsletter