Python and Data Science Resources for SEO

Python is one of the most popular programming languages today. Beginners describe how quickly and easily they picked up its syntax, and experts praise its versatility: it handles machine learning, data analytics, web development, and even hardware applications.

AI is an emerging field that analyzes data and patterns faster than humans can. Google’s organic search uses machine learning to refine its results. Its RankBrain algorithm analyzes web pages and visitor clicks to determine the intent behind a keyword search and match it more accurately to relevant websites. This affects the traditional SEO tools that marketing professionals rely upon. Those who work in search engine optimization might find it helpful to use machine learning techniques to decipher machine learning algorithms. Python is a good choice for complementing the analyses from these tools while also automating other tasks.

Here is a collection of Python scripts that provide deeper and more meaningful SEO analytics that traditional SEO tools may not uncover.

Generate Text From Images Using Python

Are there missing ALT attributes on your pages? Instead of manually populating the attribute for each image, this handy Python script might be more useful. It uses a Python library called Pythia (created by Facebook). The library uses a “neural network” approach to produce sensible text based on the provided image URL. Read the Search Engine Journal article here to learn more about installing and implementing the script.
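
Before generating captions, it helps to know which images actually lack them. The following is a minimal sketch, not part of the Pythia script, that uses only Python's standard library to list images whose ALT attribute is missing or blank. The function name find_images_missing_alt and the sample HTML are made up for illustration.

```python
from html.parser import HTMLParser


class MissingAltFinder(HTMLParser):
    """Collect the src of every <img> whose alt attribute is absent or blank."""

    def __init__(self):
        super().__init__()
        self.missing = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attributes = dict(attrs)
        alt = attributes.get("alt")
        if alt is None or not alt.strip():
            self.missing.append(attributes.get("src", ""))


def find_images_missing_alt(html):
    """Hypothetical helper: return the src values of images lacking ALT text."""
    finder = MissingAltFinder()
    finder.feed(html)
    return finder.missing


# Made-up sample markup for illustration.
sample_html = """
<img src="/img/logo.png" alt="Company logo">
<img src="/img/hero.jpg">
<img src="/img/team.jpg" alt="">
"""
missing = find_images_missing_alt(sample_html)
print(missing)  # ['/img/hero.jpg', '/img/team.jpg']
```

Keep in mind that an empty ALT attribute is a legitimate choice for purely decorative images, so the list is a starting point for review rather than a list of errors.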

Optimize The Image File Size

Image file size (not to be confused with “image dimensions”) affects a web page’s speed. Ideally, an image file should be no larger than 200 KB. Many tools exist to shrink image files, but they require manual uploading. Python can provide an automated, programmatic solution. This script by Victor Domingos offers several ways to shrink an image’s file size, such as reducing colors, reducing image dimensions, changing file formats, deleting EXIF data, and setting JPG quality levels.
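
A useful first step is finding which images exceed the size guideline at all. This sketch is separate from Victor Domingos's script and only reports oversized files; it does not compress anything. The extension list and the function name find_oversized_images are assumptions chosen for illustration.

```python
from pathlib import Path

# Assumed set of image extensions; extend it to match your site.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif", ".webp"}
SIZE_LIMIT_BYTES = 200 * 1024  # the 200 KB guideline mentioned above


def find_oversized_images(folder, limit=SIZE_LIMIT_BYTES):
    """Return (path, size_in_bytes) for image files larger than limit, biggest first."""
    oversized = []
    for path in Path(folder).rglob("*"):
        if path.is_file() and path.suffix.lower() in IMAGE_EXTENSIONS:
            size = path.stat().st_size
            if size > limit:
                oversized.append((path, size))
    return sorted(oversized, key=lambda item: item[1], reverse=True)
```

Calling find_oversized_images on a local copy of your site's image folder returns the files most worth compressing first, since the largest ones appear at the top.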

Log File Analysis For SEO

Log file analysis is a useful skill for SEO professionals. It can provide deeper insights that a typical crawl tool might not provide. For example, if you’re detecting a lot of 404 errors, then you can use log file analysis to determine the issue’s source URL(s). Ian Lurie of Portent.com gives a small list of SEO insights that can be gleaned from log files:

Spider traps. Log files give you a great look at how search bots are crawling your site.
Spam content. If some hacker dumped a bunch of pages listing porn links on your site, any clicks to those pages appear in the log files.
Broken external links. Google eventually gives up crawling broken links from other sites. But people still click them. Track down those busted external links and reclaim some authority.
Incorrect server responses. Google Search Console can show some soft 404 errors, but the log file shows you all of them.

Log files record server response codes and every request made to the server, but reading them and retrieving useful data by hand is difficult and time-consuming.

One developer has discovered a Python programming method to create useful log file data at scale.

Moshe Ma-yafit wrote a Medium article describing log file analysis with Python (actual script link here). The script retrieves data from log files and lets you produce visuals of:

Server response codes
Googlebot hits
404 errors per referrer URL

The above list represents only a few examples. You can read the article to learn more about the process of using Python and the log file analyzer script.
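
To give a feel for what such an analysis involves, here is a rough sketch that tallies the three items listed above. It is not Moshe Ma-yafit's script. It assumes the server writes the common Apache/Nginx "combined" log format, and the function name summarize_log and the sample lines are invented for illustration.

```python
import re
from collections import Counter

# Assumes the Apache/Nginx "combined" log format; adjust for your server.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)


def summarize_log(lines):
    """Count response codes, Googlebot hits per path, and 404s per referrer."""
    status_counts = Counter()
    googlebot_hits = Counter()
    not_found_by_referrer = Counter()
    for line in lines:
        match = LOG_PATTERN.match(line)
        if not match:
            continue  # skip lines that do not fit the assumed format
        hit = match.groupdict()
        status_counts[hit["status"]] += 1
        if "Googlebot" in hit["agent"]:
            googlebot_hits[hit["path"]] += 1
        if hit["status"] == "404":
            not_found_by_referrer[hit["referrer"]] += 1
    return status_counts, googlebot_hits, not_found_by_referrer


# Made-up sample lines for illustration.
sample_lines = [
    '66.249.66.1 - - [10/Oct/2020:13:55:36 +0000] "GET /blog/ HTTP/1.1" 200 5120 '
    '"-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '203.0.113.7 - - [10/Oct/2020:13:56:01 +0000] "GET /old-page HTTP/1.1" 404 312 '
    '"https://example.com/links" "Mozilla/5.0"',
    '203.0.113.9 - - [10/Oct/2020:13:57:12 +0000] "GET /old-post HTTP/1.1" 404 312 '
    '"https://example.com/links" "Mozilla/5.0"',
]
statuses, bot_hits, broken_referrers = summarize_log(sample_lines)
print(statuses.most_common())
print(bot_hits.most_common())
print(broken_referrers.most_common())
```

Matching on the user-agent string is a shortcut: anyone can claim to be Googlebot, so a production analysis would also verify the requesting IP address.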

Search Engine Land provides a different Python method for performing log file analysis. Author Charly Wargnier provides a detailed example of how to utilize the power of Google Cloud to run a Python analytical script that gleans insights from log files.

HREFLANG Tag Validation

HREFLANG is an HTML attribute, usually placed on a link tag in the page’s head, that indicates the target language and, optionally, the country for a web page. It’s intended for websites that have multiple versions of their pages translated into different languages for multiple countries.

Problems arise when the HREFLANG attribute targets the wrong country and/or language for a web page. Validating these attributes manually across many pages is a nightmare. A scalable solution would not only be quicker but would also eliminate human error. Luckily, Python comes to the rescue.

SEO company Distilled released an open-source Python library called polly that checks HREFLANG attributes. You can install it on your local server and run the tool there, or you can use the online version at http://hreflang.ninja, which is built with the same library. You can read this article on Moz.com for instructions about using the library.
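
To illustrate the kind of check involved, here is a small standard-library sketch. It does not use polly and makes no claim about how polly works. It only tests the shape of each HREFLANG value (a two-letter language, an optional four-letter script, an optional two-letter region, or x-default) and whether the href is an absolute URL. The names HreflangCollector and check_hreflang are invented for this example.

```python
import re
from html.parser import HTMLParser

# Simplified shape check only. It does not confirm that the codes are
# registered ISO values, so "en-uk" passes even though the region should be "gb".
HREFLANG_PATTERN = re.compile(
    r"^(x-default|[a-z]{2}(-[a-z]{4})?(-[a-z]{2})?)$", re.IGNORECASE
)


class HreflangCollector(HTMLParser):
    """Collect (hreflang, href) pairs from <link rel="alternate"> tags."""

    def __init__(self):
        super().__init__()
        self.entries = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        rel = (attributes.get("rel") or "").lower().split()
        if tag == "link" and "alternate" in rel and "hreflang" in attributes:
            self.entries.append(
                (attributes["hreflang"] or "", attributes.get("href") or "")
            )


def check_hreflang(html):
    """Hypothetical helper: return (code, href, problem) for suspicious entries."""
    collector = HreflangCollector()
    collector.feed(html)
    problems = []
    for code, href in collector.entries:
        if not HREFLANG_PATTERN.match(code):
            problems.append((code, href, "malformed code"))
        elif not href.startswith(("http://", "https://")):
            problems.append((code, href, "href is not an absolute URL"))
    return problems


# Made-up sample markup for illustration.
sample_head = """
<link rel="alternate" hreflang="en-us" href="https://example.com/us/">
<link rel="alternate" hreflang="english" href="https://example.com/en/">
<link rel="alternate" hreflang="de" href="/de/">
"""
for problem in check_hreflang(sample_head):
    print(problem)
```

A fuller validator would also confirm that every alternate page links back to the original, which is one of the most common HREFLANG mistakes and requires crawling each referenced URL.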

Calculating Keyword Growth

Keyword research is still an important task. You can use it to discover topics and how people search for things on Google. One important aspect of a keyword is whether there’s been growth or decline in its average monthly search volume. This indicates growing interest or growing disinterest in a keyword. It’s simple to analyze this per keyword, but doing this for multiple keywords simultaneously would take a long time. This Python script from SearchWilderness.com provides a programmatic solution involving the Python library pandas and Google Trends.
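
The underlying calculation is simple enough to sketch in plain Python. This is not the SearchWilderness.com script, and it uses neither pandas nor Google Trends. It assumes you already have twelve monthly search volumes per keyword, and it measures growth as the percent change between the average of the first three months and the average of the last three. The function names and sample numbers are made up.

```python
from statistics import mean


def keyword_growth(monthly_volumes, window=3):
    """Percent change between the average of the first and last `window` months."""
    if len(monthly_volumes) < 2 * window:
        raise ValueError("need at least two full windows of data")
    start = mean(monthly_volumes[:window])
    end = mean(monthly_volumes[-window:])
    if start == 0:
        return None  # growth is undefined when the baseline is zero
    return (end - start) / start * 100


def rank_by_growth(volumes_by_keyword, window=3):
    """Return (keyword, growth) pairs sorted from fastest growing to fastest declining."""
    scored = []
    for keyword, volumes in volumes_by_keyword.items():
        growth = keyword_growth(volumes, window)
        if growth is not None:
            scored.append((keyword, growth))
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Made-up monthly search volumes for illustration.
volumes = {
    "seo plugin": [200, 200, 200, 180, 160, 140, 120, 110, 100, 100, 100, 100],
    "python seo": [100, 100, 100, 110, 120, 130, 140, 150, 150, 150, 150, 150],
}
print(rank_by_growth(volumes))  # [('python seo', 50.0), ('seo plugin', -50.0)]
```

Averaging over a window rather than comparing single months smooths out one-off spikes, though seasonal keywords may still need a year-over-year comparison.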

Google Search Console Data Collections And Analyses

This article from HackerNoon.com walks through SEO analysis scenarios using Python. It’s possible to analyze keyword SERP positions, impressions, and clicks to determine whether a keyword shows abnormal performance variations, which keywords are top performers, which keywords are new or lost, and which keywords have under-performing click-through rates. The script relies on the Python library pandas and performance data from Google Search Console.
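
As one example of this kind of analysis, the sketch below flags keywords with an under-performing click-through rate. It is not the HackerNoon script and uses plain Python instead of pandas. It assumes each row holds a query with its clicks, impressions, and average position, and it treats a keyword as under-performing when its CTR is less than half the average CTR of keywords at the same rounded position. That threshold, the function name, and the sample rows are all assumptions.

```python
from collections import defaultdict


def underperforming_queries(rows, ratio=0.5):
    """Flag queries whose CTR is below `ratio` times the average CTR
    of all queries at the same rounded position."""
    clicks_by_position = defaultdict(int)
    impressions_by_position = defaultdict(int)
    for row in rows:
        bucket = round(row["position"])
        clicks_by_position[bucket] += row["clicks"]
        impressions_by_position[bucket] += row["impressions"]

    flagged = []
    for row in rows:
        if row["impressions"] == 0:
            continue  # no CTR can be computed without impressions
        bucket = round(row["position"])
        expected = clicks_by_position[bucket] / impressions_by_position[bucket]
        actual = row["clicks"] / row["impressions"]
        if actual < ratio * expected:
            flagged.append(row["query"])
    return flagged


# Made-up rows for illustration.
rows = [
    {"query": "python seo", "clicks": 300, "impressions": 1000, "position": 1.2},
    {"query": "seo scripts", "clicks": 20, "impressions": 1000, "position": 1.4},
    {"query": "log analysis", "clicks": 50, "impressions": 1000, "position": 3.1},
]
print(underperforming_queries(rows))  # ['seo scripts']
```

Comparing each keyword against others at the same position matters because CTR drops sharply as position falls, so a single site-wide average would unfairly flag every low-ranking keyword.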

It’s also useful to categorize the keyword data that Google Search Console is able to capture. Search Engine Land published an article by JR Oakes that shows how an analyst can categorize keywords by applying the Apriori algorithm in Python. If you’ve never heard of the algorithm, here’s an explanation from the author:

The Apriori algorithm was proposed by Rakesh Agrawal and Ramakrishnan Srikant in 2004. It was essentially designed as a fast algorithm used on large databases, to find association/commonalities between component parts of rows of data, called transactions.
–JR Oakes
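
To make the idea concrete, here is a deliberately simplified sketch in which each search query is a transaction and each word is an item. It is not JR Oakes's code, and it stops at pairs of terms, whereas the full algorithm keeps growing the sets. The function name frequent_term_sets and the sample queries are made up.

```python
from collections import Counter
from itertools import combinations


def frequent_term_sets(queries, min_support=2):
    """Return frequent single terms and frequent term pairs across search queries."""
    transactions = [set(query.lower().split()) for query in queries]

    # Pass 1: count single terms and keep those meeting the support threshold.
    term_counts = Counter(term for terms in transactions for term in terms)
    frequent_terms = {t: c for t, c in term_counts.items() if c >= min_support}

    # Pass 2: a pair can only be frequent if both of its terms are frequent
    # (the Apriori pruning idea), so count only those candidate pairs.
    pair_counts = Counter()
    for terms in transactions:
        kept = sorted(term for term in terms if term in frequent_terms)
        pair_counts.update(combinations(kept, 2))
    frequent_pairs = {p: c for p, c in pair_counts.items() if c >= min_support}
    return frequent_terms, frequent_pairs


# Made-up queries for illustration.
queries = [
    "python seo script",
    "python seo tools",
    "seo audit checklist",
    "python tutorial",
]
terms, pairs = frequent_term_sets(queries)
print(terms)
print(pairs)
```

The frequent terms and pairs that come out can serve as candidate category labels for grouping the remaining queries.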

After categorizing the data, Oakes continues to demonstrate how to use BERT embeddings to find semantically related categories and then shows an example of how to plot position growth and decline per semantic category.

Competitive Analysis

According to the author of this Python script:

Most keyword ranking tools will tell you which pages your domain is ranking for, but not your competitor’s highest-ranking pages per keyword…we aren’t tracking page performance for the long-term. We’re simply trying to get quick data.
—John McAlpin

Suppose you’re trying to figure out how many times a competitor’s list of pages ranks for a list of keywords. How many times does a page appear and for which rank positions? Instead of aggregating multiple data sources, this Python technique is able to automatically gather the data for you. Combine it with this Data Studio template (using Google Sheets) to generate a visually pleasing chart. You’ll need the following libraries: urllib, lxml, and requests.
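
Once the ranking data has been gathered, the counting step itself is short. The sketch below is not John McAlpin's script and does no scraping. It assumes you already have rows of (keyword, position, URL) and a competitor domain to filter on. The function name pages_by_rank and the sample data are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlparse


def pages_by_rank(serp_rows, competitor_domain):
    """Map each competitor URL to the (keyword, position) pairs where it ranks."""
    rankings = defaultdict(list)
    for keyword, position, url in serp_rows:
        host = urlparse(url).netloc.lower()
        # Match the bare domain and any subdomain such as "www".
        if host == competitor_domain or host.endswith("." + competitor_domain):
            rankings[url].append((keyword, position))
    return dict(rankings)


# Made-up ranking rows for illustration.
serp_rows = [
    ("python seo", 1, "https://www.competitor.com/python-guide"),
    ("python seo", 2, "https://example.org/seo"),
    ("seo scripts", 3, "https://www.competitor.com/python-guide"),
    ("seo scripts", 5, "https://notcompetitor.com/scripts"),
]
for url, hits in pages_by_rank(serp_rows, "competitor.com").items():
    print(url, len(hits), hits)
```

The resulting mapping answers both questions at once: the length of each list is how many times a page appears, and the list itself shows the keywords and positions.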

Automating Schema.Org Generation

The Python script exhibited in the article (link) can be adapted to automatically generate any type of Schema. In this example, it’s used to generate VideoObject using the following libraries: Genson, Selenium, and BeautifulSoup. There’s also some manual formatting of your spreadsheet file required, but it’s not too hard.
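
The final step of such a script, turning one spreadsheet row into markup, can be sketched with the standard library alone. This is not the script from the article and uses none of Genson, Selenium, or BeautifulSoup. The column names in the sample row are assumptions, while the property names come from the Schema.org VideoObject type.

```python
import json


def video_object_jsonld(row):
    """Build a JSON-LD script tag for one video from a dict of spreadsheet values."""
    data = {
        "@context": "https://schema.org",
        "@type": "VideoObject",
        "name": row["name"],
        "description": row["description"],
        "thumbnailUrl": row["thumbnail_url"],
        "uploadDate": row["upload_date"],
        "contentUrl": row["content_url"],
    }
    # Escape "</" so a value containing "</script>" cannot close the tag early.
    body = json.dumps(data, indent=2).replace("</", "<\\/")
    return '<script type="application/ld+json">\n' + body + "\n</script>"


# Made-up row with hypothetical column names.
row = {
    "name": "Intro to Python for SEO",
    "description": "A short walkthrough of automating SEO tasks.",
    "thumbnail_url": "https://example.com/thumbs/intro.jpg",
    "upload_date": "2020-01-15",
    "content_url": "https://example.com/videos/intro.mp4",
}
print(video_object_jsonld(row))
```

Looping this function over every row of the spreadsheet yields one snippet per page, ready to paste into a template or push through a tag manager.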

Automate Keyword Research

This detailed tutorial utilizes pre-defined coding rules to automatically pull keywords from a keyword research report. Also, the script can detect Google Ads for each keyword and calculate keyword ranking difficulty and search volume trends. According to the author of the article, the agency that developed the script decreased its keyword research time from 2 hours to 10 minutes.

Keyword intent is an important determination when you’re doing keyword research. One way to conduct successful keyword research is to also group keywords by intent. This task can take hours for hundreds and thousands of keywords if performed manually. It’s possible to automate the process using Python and machine learning techniques. Hamlet Batista wrote an article in the Search Engine Journal describing how to combine Python, TensorFlow, BigQuery, and Data Studio to implement the BERT technique. This allows for a more accurate analysis since it takes into account the context of a word phrase and does not just rely on the presence of certain modifier words (for example: “where is the nearest safeway” has the same intent as “safeway within 1 mile”).

Another useful aspect of Python for SEO is uncovering clues about how Google classifies your website. This can help you decide what type of content to create and thus which keywords to focus on when conducting keyword research. JR Oakes once again has created a handy method for uncovering that insight. His method uses the Google Natural Language API to apply sentiment and categorization scoring. You can use this to analyze your own pages or your competitors’.

Content Scraping For Site Analysis

Content scraping is a popular use of Python. Normally the activity is reserved for programmers trying to retrieve data values from a web page, but according to famed marketer Neil Patel, SEO professionals can use content scraping for the following purposes:

Scrape organic keyword results
Search Google Ads copy for inspiration
Find potential influencers with blog comments
Use data feeds for guest blogging research
Determine your best-performing categories
Find content on forums to create backlinks
Get more data for your blog posts

The author also provides content scraping tips and advice plus warns marketers not to use it for copying content from other pages.

However, Neil Patel’s article doesn’t provide Python solutions. It recommends other software programs, but a Python solution gives you far more flexibility. Two popular libraries for content scraping are requests and BeautifulSoup. Both appear in scripts covered above: the Schema.org generation script uses BeautifulSoup, and the competitive analysis script uses requests.

Extra: Calculating Internal PageRank (R Language)

Internal PageRank builds on an SEO concept called “internal linking.” How web pages are linked together on a website determines its structure, and it’s an important optimization factor of SEO. Calculating your internal PageRank scores can provide a quantitative way of judging the effectiveness of your internal link structure.

Although this method doesn’t utilize Python, it’s nonetheless useful to gauge the effectiveness of your internal link structure. A Search Engine Land article written by Paul Shapiro details the steps of using the R programming language to calculate the PageRank score for a website. The article also provides real-life applications of using PageRank to plan internal links.

Python libraries that might provide functionality similar to the R approach are NetworkX, Fast PageRank, and igraph for Python. NetworkX is a popular library with a built-in function specifically for PageRank link analysis. Fast PageRank is a library created by Armin Sajadi for use on his WikiSim project. Both, at least by appearance, depend on referencing URLs by numerical indices (for example, https://yahoo.com = [0], https://google.com = [1]). The igraph library is also available for Python, so you might be able to reproduce the same output display. No practical SEO example could be found, but if you tinker with these libraries for a while, maybe you could produce a Python version of the methods in Paul Shapiro’s demo. Just a thought.
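
For anyone who wants to experiment before picking a library, the core calculation fits in a few lines of plain Python. This sketch is not Paul Shapiro's method and uses none of the libraries above. It runs a fixed number of power-iteration steps with the customary 0.85 damping factor and maps each URL to a numerical index, much like the index-based approach just described. The function name internal_pagerank and the sample link graph are made up.

```python
def internal_pagerank(links, damping=0.85, iterations=50):
    """Compute PageRank scores. `links` maps each URL to the URLs it links to."""
    urls = sorted(set(links) | {t for targets in links.values() for t in targets})
    index = {url: i for i, url in enumerate(urls)}  # URL -> numerical index
    n = len(urls)
    ranks = [1.0 / n] * n
    for _ in range(iterations):
        new_ranks = [(1.0 - damping) / n] * n
        for url in urls:
            targets = links.get(url, [])
            share = damping * ranks[index[url]]
            if targets:
                # Split this page's score evenly among the pages it links to.
                for target in targets:
                    new_ranks[index[target]] += share / len(targets)
            else:
                # A page with no outgoing links spreads its score over all pages.
                for i in range(n):
                    new_ranks[i] += share / n
        ranks = new_ranks
    return {url: ranks[index[url]] for url in urls}


# Made-up internal link graph for illustration.
site_links = {
    "/": ["/about", "/blog"],
    "/about": ["/"],
    "/blog": ["/", "/about"],
}
scores = internal_pagerank(site_links)
for url, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{score:.3f}  {url}")
```

In practice the link graph would come from a site crawl export, and pages that score low despite being commercially important are the ones that deserve more internal links.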

Python is a popular programming language for data analysis. It can uncover deeper insights that traditional SEO tools aren’t able to provide. If you work in search engine optimization and data needs are becoming complex, Python is a programming language you might want to try.

One of the quickest ways to learn how to code with Python is to take an online course. Compare available courses from learning platforms such as Udemy, Coursera, edX, and others.

Correction: the link that references Paul Shapiro’s demo of analyzing internal links is based on the R programming language, not Python. This has been corrected in the blog post.