A powerful, open-source content optimizer

Overview

This post walks through my Article Optimizer app, looking at some of its key architectural decisions and features. The app is running at www.article-optimize.com if you’d like to try it out, and I’ve open-sourced it, so you can follow along in the GitHub repo here.

What is the Article Optimizer?

Background

This is a web application built in PHP 7 on the Symfony web framework. Its intended user base is content writers, especially those whose content will ultimately live online. Ultimately, a complex interplay between human signals, such as comments and social shares, and search engine ranking systems determines how “successful” a given piece of content is, at least from a rankings perspective. This tool was developed to help authors understand how their content is likely to be parsed and ranked by search engines.

How it works

You can submit an article on any topic through the tool and in around 1 second you’ll be presented with a detailed analysis of your work, breaking down its:

  • rough word count
  • content category (Food & Drink, Politics, Hobbies, etc.)
  • sentiment analysis of major concepts, keywords, and entities (positive, negative, neutral)
  • keyword / phrase density

In addition, the tool will do its best to find copyright-free images that are on-topic. If you write an article about growing tomatoes at home, the tool will try to find you a good handful of high-quality tomato images that you are free to include and publish in your article. Including on-topic, high-quality media in your content ultimately leads to a better reading experience on the web and can improve your article’s search rankings.

Finally, the report generated by the tool contains a reference to your full article and is written to a unique URL that can be easily shared via the programmatically generated Bitly link, the social share buttons, or a built-in feature that lets you send the report by email to a friend, colleague, or client. This aspect is intended to empower authors to show off the quality of their work as validated by a third-party service.

Here's an example report:

Why use it

This tool is intended to be used as a “spot check” for authors when they are getting close to wrapping up their article or are getting ready to publish it. They may not be aware, for example, that they are unintentionally “keyword stuffing” their article by repeating a given phrase with unnatural frequency, or that their portrayal of a given person is coming across as negative.

Main Technical Considerations

How much work has to be done and how long can I put it off for?

In creating this tool, one of my primary concerns was speed. Specifically, the time between the moment an author submits their article for processing and the moment at which they can view the completed analysis of their work should be as short as possible.

A general rule of thumb is that users of web applications begin to perceive the app they’re interacting with as sluggish or non-responsive if there’s anything more than a second of latency between the time they initiate an action through the user interface and the time they receive the next visual feedback from the app. For a more in-depth look at this sense of responsiveness from a user experience perspective, check out this article.

The Article Optimizer renders reports for most articles in 1 second or less. This performance is the result of careful planning and thinking through what processing has to happen before the report can be rendered, and what functionality can be deferred until afterwards.

Identifying the bare minimum you can do before rendering

In the case of this application, I’m using IBM Watson’s AlchemyLanguage API to do most of the heavy lifting in terms of linguistic processing. Since this means I have to make at least one network request to an external service between the user’s submission and the rendering of the report, I wasn’t willing to call any other external services during that critical window. Everything else (Bitly, Flickr) would have to be done via ajax after the initial report was rendered.

That’s why the bulk of the processing is done so quickly and why the tool feels snappy to use. When a user submits an article, I validate the content both on the client and on the server. If there are any show-stopping issues, such as bogus content that’s too short, the user gets a helpful, descriptive error message up front and everything grinds to a halt until they fix their submission.
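On the client side, that check is as simple as intercepting the form’s submit event. The sketch below is illustrative only: the form ids, word threshold, and error copy are hypothetical stand-ins, and the server repeats the same check because we never trust the client.

```javascript
// Illustrative only: the real form ids, thresholds, and error copy differ.
$('#article-form').on('submit', function (event) {
    var text = $.trim($('#article-text').val());

    // Rough word count: split on runs of whitespace
    if (text.split(/\s+/).length < 100) {
        event.preventDefault();  // stop the submission before it ever hits the server
        $('#form-error').text('That article looks too short to analyze. Please paste in the full text.');
    }
});
```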

Assuming the user submits a valid-looking article of sufficient length, I sanitize and filter it to remove any junk: remnant HTML tags from a hurried copy-and-paste, or malicious embedded scripts, because we never trust user input. Only then do I hand the article off to AlchemyLanguage for a detailed breakdown. Once I receive the response, I do a minimum of report preparation work: examining the AlchemyLanguage response and bundling the key information into the sane format expected by the report templates. Once this is done, I can render the permanent report, write its contents to a static HTML file that will live in perpetuity for the user to share, and redirect the user to this completed report.
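In simplified form, the happy path looks roughly like the sketch below; the helper, template, and variable names are illustrative rather than lifted from the repo:

```php
<?php
// Simplified sketch of the happy path; names are illustrative, not the repo's actual identifiers.
$cleanText = trim(strip_tags($articleText));            // drop leftover HTML tags and embedded scripts

// The one blocking external call before the report can render
$analysis = $analyzer->analyze($cleanText);              // wraps the AlchemyLanguage request

// Render the permanent report and persist it as a static HTML file at a unique URL
$reportId   = uniqid('report_', true);
$reportHtml = $this->renderView('report/report.html.twig', ['analysis' => $analysis]);
file_put_contents(sprintf('%s/reports/%s.html', $webRoot, $reportId), $reportHtml);

// Send the user straight to their finished report; everything else happens after this
return $this->redirect('/reports/' . $reportId . '.html');
```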

It’s important to step back at this point and take stock: the user’s article has been fully processed, their report has been written to the server, and they are already looking at the report and beginning to read it, yet none of the following has happened:

  1. The Bitly short link for this specific report has not been generated or displayed
  2. The Twitter share button has not had its contents altered by jQuery to include this Bitly link
  3. None of the copyright-free images have even been searched for yet

The user doesn’t notice, because it will be several seconds, at minimum, before they need any of this information. As soon as the report page has finished loading, jQuery handlers go to work fetching and updating all of this data with the help of server-side controllers designated for each task and its associated service. Most of the time the user never notices the brief delay, because they have to scroll through a whole lot of report details before reaching the images section at the bottom. All the average user knows is that the tool they’re using processes articles very quickly.
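In rough terms, the post-load work looks like the sketch below; the endpoint paths, element ids, and the reportId variable are hypothetical stand-ins for the real ones:

```javascript
// Hypothetical endpoints and selectors, shown only to illustrate the deferred ajax pattern.
$(function () {
    // 1. Fetch the Bitly short link for this report, then wire it into the share buttons
    $.getJSON('/api/shortlink', { report: reportId }, function (data) {
        $('#short-link').text(data.url);
        $('#twitter-share').attr('href',
            'https://twitter.com/intent/tweet?url=' + encodeURIComponent(data.url));
    });

    // 2. Search for copyright-free, on-topic images and append them to the report
    $.getJSON('/api/images', { report: reportId }, function (data) {
        $.each(data.images, function (i, img) {
            $('#image-results').append($('<img>', { src: img.url, alt: img.title }));
        });
    });
});
```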

Keeping it clean and easily extendable

Part of the reason I chose the Symfony framework for this rewrite is that Symfony does a good job of enforcing some sanity and organization around web projects without being so dogmatic that you can’t customize things to your liking.

If I return to this application two years from now to add new features, I know my future self will want to open a project that is clean, well-organized, thoroughly documented, and demonstrates good separation of concerns. That means lots of DocBlocks throughout the application, giving other developers everything they need to know about a given class or method up front.

Starting with the PHP side of things, the project begins with the Default controller, which defines handlers for the main routes of the application and the behavior of the index page and its three forms. One of Symfony’s main strengths is the significant work that has gone into forms and form processing, which allowed me to define rich form functionality in less code and keep the Default controller reasonably slim.
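As a rough sketch of what that buys you, an index action can stay this small. The form type, field names, and forward target below are examples, not the project’s actual identifiers:

```php
<?php
// Inside the Default controller; ArticleSubmissionType and the forward target are illustrative.
public function indexAction(Request $request)
{
    $form = $this->createForm(ArticleSubmissionType::class);
    $form->handleRequest($request);

    if ($form->isSubmitted() && $form->isValid()) {
        // Validation rules live on the form type, so the controller stays slim
        $articleText = $form->getData()['article'];

        return $this->forward('AppBundle:Default:analyze', ['articleText' => $articleText]);
    }

    return $this->render('default/index.html.twig', ['form' => $form->createView()]);
}
```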

Once we have a valid submission, work flows to the Analyzer class, which is concerned with breaking down and processing the article text, interacting with the AlchemyLanguage API, and bundling the final analysis into a format expected by the report templates.

Keeping it easy to reason about

Loosely following the idea that a literate program should read like a short story, I think of the steps described in the Analyzer class as a recipe for processing a user’s article in a generic way that will result in useful insight. My thinking here is that if one or more other programmers were to begin working on this project, they should be able to easily read the Analyzer class to quickly gain an understanding of the general processing steps that occur for each article.
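The top-level method reads roughly like the sketch below; the helper method names are illustrative, but the shape is the point:

```php
<?php
// The general shape of the recipe; helper method names are illustrative.
public function analyze(string $articleText): array
{
    $wordCount = $this->countWords($articleText);
    $response  = $this->callAlchemyLanguage($articleText);   // the single blocking network call

    return [
        'wordCount' => $wordCount,
        'category'  => $this->extractCategory($response),
        'sentiment' => $this->extractSentiment($response),
        'keywords'  => $this->extractKeywordDensity($response, $articleText),
    ];
}
```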

Separation of concerns

At the same time, I want to maintain a balance between legibility and concision. That’s why the nitty-gritty details of lower-level actions are abstracted away by the higher-level Analyzer class but spelled out in the AnalysisHelper class. Likewise, curl calls are abstracted into a curl helper class, just as Flickr and Bitly calls are abstracted into their own classes that use that helper. The main idea is to build reusable components and load them wherever it makes sense to leverage them.

We don’t want to configure and execute raw curl calls in every method that makes a network request, because that’s harder to maintain and results in a lot of duplicated code. If we ever need to change a header across all of our curl calls, we’d have to find every raw curl call and change it by hand - or miss some and be left with inconsistent behavior.
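A minimal sketch of such a helper might look like the following; the real class’s interface almost certainly differs in its details:

```php
<?php
// Minimal sketch of a reusable curl wrapper; the real helper's interface may differ.
class CurlHelper
{
    /**
     * Perform a GET request and decode the JSON response body.
     *
     * @param string $url     Fully qualified request URL
     * @param array  $headers Extra request headers, e.g. ['Accept: application/json']
     */
    public function getJson(string $url, array $headers = []): array
    {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_HTTPHEADER, $headers);
        curl_setopt($ch, CURLOPT_TIMEOUT, 10);

        $body = curl_exec($ch);
        curl_close($ch);

        return json_decode((string) $body, true) ?: [];
    }
}
```

The Flickr and Bitly wrappers can then take this helper as a constructor dependency, so a change to a shared header or timeout happens in exactly one place.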

Leverage the framework

Symfony also features the Twig templating language, and it’s excellent. Though in general we want to keep as much logic out of our templates as possible, many of the built-in functions and filters (such as the length filter) are useful for determining whether we have enough data to render a full section or should display an error instead.
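For example, a report section can guard itself with the length filter; the variable names and markup below are illustrative, not taken from the project’s templates:

```twig
{# Illustrative: only render the keywords section if the analysis returned enough data #}
{% if analysis.keywords|length > 0 %}
    <ul>
        {% for keyword in analysis.keywords %}
            <li>{{ keyword.text }} ({{ keyword.relevance }})</li>
        {% endfor %}
    </ul>
{% else %}
    <p class="report-error">We couldn't extract any keywords from this article.</p>
{% endif %}
```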

Even in templates, we want to build reusable components

After getting everything working and displaying the way I wanted it, I started looking for duplicated code in my templates. The ribbons that introduce each new report section, for example, are all roughly the same - though their overlaid text changes. This makes them a good candidate for refactoring: moving the code that generates a ribbon into a custom Twig function, ribbon, that we can call with a single string argument: the text that should appear on the ribbon.

Twig lets you create an extension class (here, AppExtension) that defines your custom Twig functions. In addition to cutting down on duplicate code and helping you keep your templates clean, custom Twig functions are a great way to ensure your team members call the same functions when building out features, helping you maintain style uniformity throughout your application.
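A stripped-down sketch of that extension, using the Twig 1.x class names from the PHP 7 era, might look like this; the real ribbon markup is richer than the one-liner shown here:

```php
<?php
// Sketch of a custom Twig extension; the real ribbon markup is more involved than this.

namespace AppBundle\Twig;

class AppExtension extends \Twig_Extension
{
    public function getFunctions()
    {
        return [
            new \Twig_SimpleFunction('ribbon', [$this, 'renderRibbon'], ['is_safe' => ['html']]),
        ];
    }

    /**
     * Render a section ribbon carrying the given overlay text.
     */
    public function renderRibbon(string $text): string
    {
        return sprintf('<div class="section-ribbon"><span>%s</span></div>', htmlspecialchars($text));
    }
}
```

In a template, introducing a new section then becomes a one-liner: {{ ribbon('Keyword Density') }}.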

Client-side Javascript

I chose tried-and-true jQuery for the client-side JavaScript for a few reasons:

  • In general, this app is not complex or JavaScript-heavy enough to warrant the overhead of a full frontend framework
  • There are no server-side models that need to be updated in real time or synced to the client, so data binding is not a concern
  • The bulk of this app lives on the server. The client-side JavaScript is only tasked with handling a few modals, making some ajax calls, and doing some light validation on user input

However, just because we’re not using the latest framework hotness doesn’t mean we should be writing disorganized spaghetti scripts. I wanted a good way to organize the jQuery selectors my JavaScript would be using, without falling victim to the all-too-common gotcha of DOM selectors changing over the course of development and quietly breaking things.

I found a good approach in a Stack Overflow answer that boils down to defining your jQuery selectors in one file or object, and then passing that “controls” object into the actual client module. This accomplishes a few things:

  • It keeps our jQuery selectors defined in a single place. Though the client itself might use and manipulate a given selector a number of times across various complex functions, there’s only one place to update it if your DOM changes.
  • When your module is agnostic about the selectors of the elements it’s operating on, it’s easier to write portable code. Keeping things abstract makes it easier to publish the module as a library, jQuery plugin, etc.
  • Our final code is cleaner and simpler. We don’t have a confusing mixture of selectors and objects polluting our module.
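The controls object itself is nothing fancy. The selectors below are hypothetical, but the idea is that this is the only place that knows about the DOM:

```javascript
// Illustrative selectors; the real ids and classes differ.
var controls = {
    articleForm:  $('#article-form'),
    articleText:  $('#article-text'),
    shareModal:   $('#share-modal'),
    emailButton:  $('.email-report-button'),
    imageResults: $('#image-results')
};

// The client module (sketched further below) receives this object,
// so it never hard-codes a selector itself:
client.init(controls);
```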

In addition to this handy trick, I employed the Revealing Module pattern via a closure to define a “client” object that exposes only a single method, init, to the on-page JavaScript that instantiates it.

I do this because I want the module’s state to remain private, so that users can’t tamper with it and there is no chance of collisions between module variables and variables in the global scope. This is a handy pattern for developing JavaScript plugins that might need to run in the same memory space as a dozen other libraries, while avoiding the pitfalls that come from polluting the global scope.
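A minimal sketch of that pattern, with hypothetical internals, looks like this:

```javascript
// Everything inside the closure stays private; only init is revealed to the page.
var client = (function () {
    var controls = null;   // private state, invisible to the global scope

    function bindShareModal() {
        controls.shareModal.on('click', '.close', function () {
            controls.shareModal.hide();
        });
    }

    function init(controlsObject) {
        controls = controlsObject;
        bindShareModal();
    }

    // Reveal only the public API
    return { init: init };
})();
```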

Building configurable knobs for ease of use

Things you expect to happen often should be easy. For example, this application has two separate advertisement modules, each containing two ad blocks, that can be enabled or disabled. If this application were running on behalf of a company, you could imagine the marketing department taking a keen interest in swapping out the ads every time a new promotion or campaign was running.

We don't want to deal with a complex, tedious, or error-prone manual process each time we update this section. Instead, we should make advertisements configurable and implement the ad-display logic in the controllers up front, before we launch. To demonstrate this approach I defined an advertisements block in the example-parameters.yml file here.
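The shape of that block is roughly as follows; the keys and values here are illustrative rather than copied from example-parameters.yml:

```yaml
# Illustrative ad configuration; the real example-parameters.yml differs in its details
parameters:
    advertisements:
        sidebar_module:
            enabled: true
            blocks:
                - { image: 'ads/spring-promo.png', url: 'https://example.com/spring', alt: 'Spring promo' }
                - { image: 'ads/newsletter.png', url: 'https://example.com/signup', alt: 'Newsletter signup' }
        footer_module:
            enabled: false
            blocks: []
```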

Now when marketing opens a ticket for 4 new campaigns, you're just modifying the values in a config file, instead of wrangling a bunch of custom assets and HTML by hand.

Building configurable knobs for your friends in DevOps

This same principle applies to any aspect of your app that you expect might need to change in the future, for foreseeable reasons or otherwise. Maybe operations needs to consolidate vendor accounts under a single new corporate credit card, which means the API keys your app depends on need to change. Would you rather have to tell Ops that you need a day or so to grep through every instance of the key and change it by hand? Or that they can simply change an environment variable, clear the production cache and restart Apache?

Anything that could change or will need to change should probably be a configurable knob: an environment variable or parameter that someone who isn't necessarily a developer with deep expertise in the application can look up in the project documentation and modify with a high degree of confidence that they won't break anything.
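Concretely, those keys can live in parameters.yml, which is already environment-specific and kept out of version control in a standard Symfony setup; the key names below are examples, not the project's actual parameters:

```yaml
# Illustrative entries; key names are examples, not the project's actual parameters
parameters:
    alchemy_api_key: 'replace-me-in-each-environment'
    bitly_access_token: 'replace-me-in-each-environment'
```

The code then reads these with $this->getParameter('alchemy_api_key'), so rotating a key means editing one value, clearing the cache, and restarting, rather than grepping through the codebase.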

Thanks for reading

If you have any questions, feel free to email me. If something's unclear or you'd like to see more detail on a particular topic, I'd like to know that, too.