I use Jekyll, a static website generator, for this blog. Jekyll is static because it takes files written in a simplified language, such as Markdown files for blog entries, and translates them to standard web languages such as HTML, CSS and JavaScript.

There are many useful sources for making your own blog with Jekyll; the official documentation at https://jekyllrb.com/docs/ is a good starting point.

Jekyll allows a ton of cool stuff by default, one of which is hosting your blog on GitHub, but my aim is mostly to keep it simple and use Python to demonstrate simple, custom extensions.

This post focuses on automatically creating a list of keywords for each blog post. I use the keywords metadata to show the general content of a post on the homepage, and have so far added the words manually. In Jekyll, metadata is implemented in blog posts as front matter:

---
layout: post
date: 2016-02-27 21:03:33 +0100
keywords: ['icon', 'tjkreutz', 'the', 'to', 'twitter']
title: "What's this?"
thumbnail: '/img/thumbnails/thumbnail-1.png'
---

You could argue that writing software to automatically select keywords is as much trouble as adding them manually, if not more, but it demonstrates some simple natural language processing techniques and adheres to a programmer’s mindset: if something can be done automatically, it should be done automatically. My go-to language for such a problem is Python, with the scikit-learn machine learning library.

from sklearn.feature_extraction.text import TfidfVectorizer

We specifically use the feature extraction module, which learns tf-idf scores from a set of documents (my blog posts here). Tf-idf stands for term frequency × inverse document frequency, which can best be explained as the relative distinctiveness of a word for a document. Say I write my first blog post about the Java programming language and ‘Java’ does not appear in other blog posts (it does now..), then the word ‘Java’ is extremely distinctive for that document. Its unique occurrence results in a higher tf-idf score, and a position amongst the top keywords for the blog post.
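To make this concrete, here is a toy example (not part of my actual scripts) that checks the ‘Java’ intuition with scikit-learn’s default settings:

from sklearn.feature_extraction.text import TfidfVectorizer

# two toy 'posts': 'java' only occurs in the first one
toyposts = [
    "java is a programming language and java is verbose",
    "python is a programming language",
]
vec = TfidfVectorizer()
tfidf = vec.fit_transform(toyposts)
features = vec.get_feature_names()  # use get_feature_names_out() on newer scikit-learn
# 'java' should outscore the shared word 'language' in the first post
print(tfidf[0, features.index("java")] > tfidf[0, features.index("language")])  # True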

We get tf-idf scores simply by training the vectorizer. Note that I have written a post module that walks through the _posts directory and stores all text and metadata in post objects:

import posts  # my own module, sketched below

allposts = posts.read_posts('_posts')
alltexts = [post.get_text() for post in allposts]
vec = TfidfVectorizer()
tfidf = vec.fit_transform(alltexts)
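
For reference, here is a minimal sketch of what such a post module could look like. The read_posts and get_text names come from the snippet above; the Post class and its internals are just an illustration, the real module is in my repository:

import os

class Post:
    """A minimal stand-in for the real post object."""
    def __init__(self, meta, text):
        self.meta = meta  # the raw front matter
        self.text = text  # the Markdown body

    def get_text(self):
        return self.text

def read_posts(directory):
    allposts = []
    for filename in sorted(os.listdir(directory)):
        with open(os.path.join(directory, filename)) as f:
            content = f.read()
        # the front matter sits between the first two '---' markers
        _, meta, text = content.split('---', 2)
        allposts.append(Post(meta, text))
    return allposts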

Then, using the learnt weights, the vectorizer can rank the words in each blog post and arrive at the top five keywords.

features = vec.get_feature_names()
for doc in tfidf:
    # pair each word's score with its column index and sort by score, descending
    zipped = zip(doc.data, doc.indices)
    sortedwords = [features[ind] for data, ind in reversed(sorted(zipped))]
    topfive = sortedwords[:5]
# after the loop, topfive holds the keywords of the last post processed

We can display the words for this blog post:

print(topfive)

The first results are strange at best, but that is because the weights are not yet reliable with only two blog posts.
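
Eventually, the computed keywords should end up in the front matter itself. Here is a purely hypothetical sketch of that last step, assuming the front matter already contains a keywords line to replace; the write_keywords helper and the filename are illustrative, not my actual scripts:

import os
import re

def write_keywords(filepath, keywords):
    with open(filepath) as f:
        content = f.read()
    # replace the first 'keywords:' line in the front matter
    content = re.sub(r"^keywords:.*$", "keywords: {}".format(keywords),
                     content, count=1, flags=re.MULTILINE)
    with open(filepath, "w") as f:
        f.write(content)

# hypothetical filename following Jekyll's date-title naming convention
write_keywords(os.path.join("_posts", "2016-02-27-whats-this.md"), topfive)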

The next step would be to automatically cluster posts with similar content and find tags to signify them. At any time, feel free to refer to my blog’s GitHub repository to have a look at the scripts I am working on.