Natural Language Processing With Built-in PHP Functions

What is Natural Language Processing?

Natural language processing (also known as “NLP”) is a programmatic (and sometimes algorithmic) way of processing digitized text and treating it as human language. IBM has a better and more thorough explanation:

Natural language processing (NLP) refers to the branch of computer science—and more specifically, the branch of artificial intelligence or AI—concerned with giving computers the ability to understand text and spoken words in much the same way human beings can.

You can think of NLP as consisting of levels. Machine learning exists at the highest level; it can analyze and interpret text almost like humans, and in some cases, it’s capable of predicting words in sentences. The lowest level of NLP processes individual words and performs comparative functions on them. Maybe it turns letters into bytes. Maybe it compares similarities between words. This lower level is what I’ll mostly review in this blog post, and then I’ll briefly mention a library that performs intermediate NLP functions.

Why PHP?

It’s the programming language I’m most familiar with, plus it powers a large share of the web: WordPress, which is written in PHP, runs at least 40% of websites. Most of the NLP functions in PHP are based on well-known algorithms, so you can find equivalents in other programming languages’ libraries.

Built-in NLP PHP Functions

Four built-in PHP functions allow a crafty programmer to perform natural language processing on select words:

similar_text: calculates the similarity between two strings, returning the number of matching characters and, optionally, a percentage of similarity.
levenshtein: calculates the Levenshtein distance, which the PHP documentation defines as the minimal number of characters you have to replace, insert, or delete to transform string1 into string2.
soundex: calculates the Soundex key of a string, turning letters into a special Soundex key to mimic the pronunciation of a word.
metaphone: calculates the Metaphone key of a string, turning letters into a special Metaphone key to mimic the pronunciation of a word; it’s considered more accurate than soundex().

Let’s explore each function further and determine when it’s appropriate to use each one.

similar_text()

Unlike soundex() and metaphone(), which work from pronunciation, similar_text() is based on the spelling of words. This function compares two strings and returns the number of matching characters. You can also get a percentage of how much the two strings match by passing a variable as the third argument; it’s taken by reference and filled in with the percentage. For spelling comparisons, I prefer this function because it’s easy to use and its usefulness is obvious at a glance.
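Here is a minimal sketch of both return values, using a word pair that also appears in the PHP manual. The variable names are only for illustration.

```php
<?php
// similar_text() returns the number of matching characters.
// The optional third argument is passed by reference and receives the percentage.
$matching = similar_text('World', 'Word', $percent);

echo $matching, PHP_EOL;          // 4
echo round($percent, 2), PHP_EOL; // 88.89
```

One thing to keep in mind: the PHP manual notes that swapping the two arguments can yield a different result, so it’s worth keeping the argument order consistent across comparisons.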

levenshtein()

levenshtein() is PHP’s way of calculating the Levenshtein distance, an algorithm used to determine how many edits are required to turn one word into another one.

…minimum number of single-character edits (insertions, deletions or substitutions) required to change one word into the other.
Wikipedia

If the “distance” is small or zero, then you can infer that two words are related to each other. levenshtein() is case-sensitive, so convert both strings to the same case first (for example with strtolower()) if capitalization shouldn’t count as an edit. The Levenshtein distance algorithm is useful for spell-checking, correction systems for optical character recognition, and assisting natural-language translation (according to Wikipedia).
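A short sketch of how the distance behaves, including the effect of letter case. levenshtein() compares the strings byte by byte, so an uppercase and a lowercase letter count as different characters. The example words and variable names are my own.

```php
<?php
// Classic example: kitten -> sitten -> sittin -> sitting takes three edits.
$distance = levenshtein('kitten', 'sitting');
echo $distance, PHP_EOL; // 3

// The comparison is case-sensitive, so a capital letter counts as one substitution.
$case_sensitive = levenshtein('Keto', 'keto');

// Lowercasing both strings first gives a case-insensitive distance.
$case_insensitive = levenshtein(strtolower('Keto'), strtolower('keto'));

echo $case_sensitive, ' ', $case_insensitive, PHP_EOL; // 1 0
```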

soundex()

soundex() approximates word pronunciation by reducing each word to a four-character key: the first letter of the word followed by three digits that stand for groups of similar-sounding consonants. It’s one of the oldest phonetic algorithms in existence, patented in 1918 and 1922, well before modern computers. It’s useful in genealogy to get around surname misspellings. The PHP version of Soundex implements the one described in Donald Knuth’s book, The Art of Computer Programming.
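To make the key format concrete, here is a small sketch using two surnames that are commonly used to illustrate Soundex. The variable names are only for illustration.

```php
<?php
// Each word becomes one four-character key: its first letter plus three digits.
$robert = soundex('Robert');
$rupert = soundex('Rupert');

echo $robert, ' ', $rupert, PHP_EOL; // R163 R163

// Similar-sounding surnames share a key, which is what makes Soundex handy
// for matching misspelled names.
var_dump($robert === $rupert); // bool(true)
```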

metaphone()

metaphone() also approximates word pronunciation, but instead of a fixed four-character code it builds a variable-length Metaphone key for each word from English pronunciation rules. According to PHP documentation and Wikipedia, this algorithm is more accurate than Soundex. I prefer this function to soundex() because it matches similar-sounding characters better and allows you to compare partial strings instead of full strings. To give an example: “cat” and “kat” return matching keys with metaphone(), whereas with soundex() they don’t, because the first letters are different characters even though they sound the same in this situation.
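The “cat” versus “kat” comparison can be checked with a few lines. This is just a sketch, and the variable names are my own.

```php
<?php
// metaphone() maps both spellings to the same key because "c" and "k" sound alike here.
$cat_key = metaphone('cat');
$kat_key = metaphone('kat');
echo $cat_key, ' ', $kat_key, PHP_EOL; // KT KT

// soundex() keeps the first letter as written, so the keys differ.
echo soundex('cat'), ' ', soundex('kat'), PHP_EOL; // C300 K300
```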

Combining NLP PHP Functions

Let’s look at practical usage of two of these PHP NLP functions: metaphone() and similar_text(). Combining these functions on my website, Ketofoodist.com, allows me to provide some robust search capabilities and help people refine their searches for ketogenic foods. The code below creates a list of closest search term matches.

First, I want to match pronunciations with known search terms. These search terms would be stored in the array variable $lists_for_comparison. The variable $input stores the input search term entered on a form. I choose to loop through the array variable and find pronunciation matches.

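// $lists_for_comparison holds the known search terms and is defined elsewhere.
$input   = $_GET['search-term'] ?? ''; // fall back to an empty string if no term was submitted
$compare = $lists_for_comparison;

$return_array = array();
foreach ( $compare as $item ) {
    if ( ! is_array($item) ) {
        // strict comparison avoids PHP's loose numeric-string matching
        if ( metaphone($input) === metaphone($item) ) {
            if ( $input !== $item ) {
                // avoid adding $input to list of closest matches
                $return_array[] = $item;
            }
        }
        else {
            // $percent is filled in by reference
            similar_text( $input, $item, $percent );

            if ( ($percent >= 70) && ($input !== $item) ) {
                // avoid adding $input to list of closest matches
                $return_array[] = $item;
            }
        }
    }
}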

Know what’s happening? Are you sure? Let’s explain the steps of what’s occurring in the code above:

I create the variable $return_array that will store matching words related to the search term.
I check that each $item isn’t an array variable type.
If there’s a metaphone() match between $input and $item, and the two aren’t the same word, then $item is added to $return_array. The second check keeps the search term itself out of the list of closest matches.
If there’s no pronunciation match, then I check for a similar text match using similar_text(). If the match is 70% or greater, and the two variables don’t hold the same value, then $item is added to $return_array.

There are other ways to combine metaphone() and similar_text(), but this is a good example of combining both functions to achieve analysis of search terms.
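One of those other ways, as a rough sketch: wrap the same logic in a reusable function. The name closest_matches(), its parameters, and the sample terms are hypothetical and are not taken from my site’s code. The 70% threshold is the same one used above.

```php
<?php
/**
 * Hypothetical helper (not from the original site code): returns the known terms
 * that sound like $input or are at least $threshold percent similar in spelling.
 */
function closest_matches(string $input, array $known_terms, float $threshold = 70.0): array
{
    $input_key = metaphone($input); // compute the search term's key once
    $matches   = [];

    foreach ($known_terms as $term) {
        // Skip anything that isn't a string (such as nested arrays)
        // and the search term itself.
        if (!is_string($term) || $term === $input) {
            continue;
        }

        if (metaphone($term) === $input_key) {
            $matches[] = $term; // same pronunciation key
            continue;
        }

        similar_text($input, $term, $percent);
        if ($percent >= $threshold) {
            $matches[] = $term; // close enough in spelling
        }
    }

    return $matches;
}

echo implode(', ', closest_matches('almond', ['almonds', 'bacon', 'almond'])), PHP_EOL; // almonds
```

Computing the Metaphone key of the search term once, outside the loop, avoids recalculating it for every known term. The threshold is a parameter so it can be tuned per search box.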

Intermediate Natural Language Processing: TextRank Algorithm

I went looking for a PHP library that could help me interpret text and summarize it. The library I found, available at PHP.science, is an implementation of the TextRank algorithm. AnalyticsVidhya has a thorough explanation of how TextRank works (for the machine learning nerds out there).

How do I use the TextRank algorithm? I have produced sentence list summaries of blog posts, extracted keywords from text to use for tagging (although I haven’t figured out how to fine-tune this functionality), and created text highlights from multiple paragraphs.

In a future blog post, I’ll cover how I installed the library without using Composer and its integration within my code.

PHP has nifty natural language processing functions. You can compare words by pronunciation, by spelling similarity, and by the number of edits needed to turn one into another. Plus, you can take advantage of intermediate algorithms such as TextRank to produce unique summarizations of texts.