
Classifying Planning Applications with AI

March 31, 2025 · 5 min read

The Problem

At Martello, we process a large volume of planning applications from across the UK to produce detailed planning reports for new home buyers. These reports offer valuable insights into local development activity, but understanding the scale of each application is essential for presenting the applications in a meaningful way.

Each planning application varies significantly in scope: some cover small home improvements, while others involve large-scale residential or commercial developments. To enhance our reporting, we needed a reliable method to classify these applications into three categories: small, medium, and large.

The tricky part? There’s no universal rulebook for classifying these applications. Our definitions for small, medium, and large are based on Martello’s own criteria, built to fit our needs. So, we couldn’t just grab something off the shelf; it had to be our own solution.

Furthermore, the classification process needed to be accurate, automated, and scalable. Our data pipeline ingests thousands of new planning applications daily. Manual categorisation was not an option: it was inefficient, error-prone, and far too slow.

[Figure: Data flow diagram]

Planning a Solution

We needed a system that could classify planning applications accurately and consistently-without any manual effort. So, we broke it down into a few clear steps:

  • Create a reliable dataset by manually labelling thousands of planning applications based on our internal definitions.
  • Train a machine learning model to classify applications using the proposal text and extracted dwelling counts.
  • Deploy the model to production, integrated seamlessly into our existing Airflow pipeline on AWS.

This project had everything a great AI build should have: a clear problem, reliable data labels, model experimentation, ML engineering, and production deployment.

Data Labelling

For the model to learn how to classify applications accurately, we needed a properly labelled training set. So, we manually labelled 5,000 planning applications, ensuring a balanced distribution across our three classes: small, medium, and large.

The decision for each class was based solely on the description of the planning application. While I won’t go into the specifics of our internal logic, one of the key factors we use is the number of dwellings mentioned in the description.

To make this work, we wrote a Python function that automatically extracts the number of dwellings from each proposal. This allowed us to include that information as a feature for training the model.

For example, the following partial application descriptions would suggest the given number of dwellings:

  • "External stair to rear garden from first floor flat existing stair": 1
  • "Proposed development for the erection of 5 dwellings": 5
  • "Demolition of buildings and construction of eleven new units": 11

Model Training and Testing

With our labelled dataset ready, the next step was to train and evaluate a model that could accurately classify new planning applications.

Data Preparation

We split the 5,000 labelled applications into a training set and a testing set, using a stratified split so that all three classes (small, medium, and large) were represented proportionally in both. The training set was used to fit the model, while the testing set was reserved for evaluating the final model’s performance.
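
As a sketch, a stratified split like this might look as follows in scikit-learn; the dataframe and column names are assumptions for illustration, as is the 80/20 ratio:

```python
from sklearn.model_selection import train_test_split

# df is assumed to hold one row per labelled application, with a "proposal"
# text column, a numeric "dwelling_count" column, and a "size_class" label
# (small / medium / large). Column names are illustrative.
train_df, test_df = train_test_split(
    df,
    test_size=0.2,              # hold out part of the 5,000 labelled applications
    stratify=df["size_class"],  # keep the same class mix in both splits
    random_state=42,            # reproducible split
)
```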

Before training, we pre-processed the data by:

  • Tokenizing the proposal text to convert it into a usable format for machine learning.
  • Extracting the number of dwellings mentioned in each description, which we found to be a key factor in classification.
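
Putting those two steps together, the feature preparation could be sketched like this, with a TF-IDF vectoriser standing in for the tokenisation we use internally and the column names carried over from the split above (both are illustrative):

```python
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer

# Turn the raw proposal text into token-based features and pass the extracted
# dwelling count through as a numeric feature alongside them.
preprocessor = ColumnTransformer(
    transformers=[
        ("proposal_tfidf", TfidfVectorizer(ngram_range=(1, 2), min_df=2), "proposal"),
        ("dwellings", "passthrough", ["dwelling_count"]),
    ]
)

X_train = preprocessor.fit_transform(train_df)
X_test = preprocessor.transform(test_df)
y_train = train_df["size_class"]
y_test = test_df["size_class"]
```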

Model Selection and Training

We experimented with several machine learning models and evaluated their performance using cross-validation on the training data. This approach allowed us to test multiple models and hyperparameter combinations to find the most effective solution.

Ultimately, we chose LightGBM, a high-performance gradient boosting framework, due to its speed, accuracy, and ability to handle our mixed data types (text + numeric features). We tuned the model using GridSearchCV to optimise hyperparameters and achieve the best performance.
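
A minimal sketch of that search, using the features from the previous step; the parameter grid and settings shown here are illustrative rather than the configuration we actually tuned:

```python
from lightgbm import LGBMClassifier
from sklearn.model_selection import GridSearchCV

# A small example grid; the real search covered more parameters and values.
param_grid = {
    "num_leaves": [31, 63],
    "learning_rate": [0.05, 0.1],
    "n_estimators": [200, 500],
}

search = GridSearchCV(
    estimator=LGBMClassifier(objective="multiclass"),
    param_grid=param_grid,
    scoring="f1_weighted",   # matches the metric reported below
    cv=5,                    # cross-validation folds on the training data
    n_jobs=-1,
)
search.fit(X_train, y_train)
model = search.best_estimator_
```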

Performance Evaluation

Once trained, we evaluated the model’s performance on the testing set using F1 Score as the primary metric. This metric was chosen to provide a balanced measure of precision and recall, particularly useful given the imbalanced nature of our classes.

  • Small: 0.85
  • Medium: 0.69
  • Large: 0.95

Overall Weighted F1 Score: 0.89

The model performed exceptionally well on the small and large classes, achieving high F1 scores. As expected, the medium class proved to be the most challenging, likely due to its overlapping characteristics with the other two classes.
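
For reference, per-class and weighted F1 scores like the ones above can be produced with scikit-learn’s metrics, assuming the model and hold-out split from the earlier steps:

```python
from sklearn.metrics import classification_report, f1_score

y_pred = model.predict(X_test)

# Per-class precision, recall and F1 for small / medium / large.
print(classification_report(y_test, y_pred, digits=2))

# The single headline number: weighted F1 across all three classes.
print("Weighted F1:", f1_score(y_test, y_pred, average="weighted"))
```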

Model Deployment to Production

Once the model was trained and tested, the final challenge was to productionise it: making sure it could run reliably and automatically as part of our data pipeline.

Productionising machine learning models is often the hardest and most overlooked step in ML projects. Many never make it past experimentation. For us, getting the model into production was just as important as building it: turning a good idea into something that runs, scales, and delivers value every day.

We packaged the trained model inside a Docker container, giving us full control over its dependencies and runtime environment. This container receives planning applications as input, makes predictions, and writes the results back to the database.
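
Conceptually, the container’s entry point is simple. The sketch below assumes the preprocessing and model are bundled into one serialised pipeline; the file path, column names, and database helpers (fetch_unclassified_applications, write_classifications) are hypothetical stand-ins for our internal data-access code:

```python
import joblib
import pandas as pd

# Hypothetical helpers standing in for our internal data-access layer.
from db import fetch_unclassified_applications, write_classifications


def main() -> None:
    # The serialised preprocessing + model pipeline is baked into the image at build time.
    pipeline = joblib.load("/app/model/classifier.joblib")

    # Pull the batch of applications that still need a size class; the dataframe
    # is assumed to contain the columns the pipeline expects (proposal text and
    # dwelling count).
    applications: pd.DataFrame = fetch_unclassified_applications()
    if applications.empty:
        return

    # Predict small / medium / large and persist the results.
    applications["size_class"] = pipeline.predict(applications)
    write_classifications(applications[["application_id", "size_class"]])


if __name__ == "__main__":
    main()
```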

The container image was pushed to AWS ECR and runs on AWS ECS (Fargate), so there’s no need to manage servers.

The final step was integrating the model into our Airflow pipeline. Now, every time new planning applications flow through our system, the model container is triggered to make predictions. It processes thousands of applications in just 1–2 minutes, making the classification a part of our real-time data flow.
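
In Airflow, triggering the container can be expressed with the Amazon provider’s EcsRunTaskOperator. The sketch below assumes Airflow 2.x with the Amazon provider installed, and uses placeholder DAG, cluster, task-definition, and network values rather than our real resources:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.ecs import EcsRunTaskOperator

# Placeholder DAG; in our real pipeline this task sits alongside ingestion
# (upstream) and report generation (downstream).
with DAG(
    dag_id="planning_application_classification",
    start_date=datetime(2025, 1, 1),
    schedule=None,   # in practice, triggered when new applications arrive
    catchup=False,
) as dag:
    classify_applications = EcsRunTaskOperator(
        task_id="classify_planning_applications",
        cluster="martello-data-pipeline",        # illustrative cluster name
        task_definition="planning-classifier",   # illustrative task definition
        launch_type="FARGATE",
        overrides={"containerOverrides": []},    # run the image's default command
        network_configuration={
            "awsvpcConfiguration": {"subnets": ["subnet-xxxxxxxx"]}  # placeholder subnet
        },
    )
```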