# Introduction¶

We have used Python, R, GGPlot and Plotl.ly to analyse over 20 Million houses to compare the number of houses sold in each year and the price of houses. Instead of comparing averagge houses by location or any other aggregated analysis, we will directly compare the price of the same house being sold across years.

Because there are so many rows, to run this notebook at least 16GB of RAM is required and we have taken visual steps to reduce the computation and visual strain of so much data.

In summary there are three sections:

• Part 1: Data and Exploratory Analysis
• Part 2: Connecting House Sales Between Years
• Part 3: Connecting House Sales In the Same Year
In [1]:
import pandas as pd
import numpy as np

import plotly.plotly as py
import plotly.graph_objs as go

from plotly import __version__

init_notebook_mode(connected=True)

from numpy import arange,array,ones
from scipy import stats

import matplotlib.pyplot as plt
import seaborn as sns; sns.set()

from datetime import datetime

import math


# Part 1: Data and Exploratory Analysis¶

### Import Data¶

Unfortunately, the data on kaggle doesn’t contain the address location that we use later for comparing the houses.

In [47]:
#data = pd.read_csv('E:\\Documents\\UKHousePrices\\price_paid_records.csv')
dataimport = pd.read_csv('E:\\Documents\\UKHousePrices\\pp-complete.csv', names = ['ID','Price', 'Date of Transfer', 'Post Code',
'Borough', 'Region', 'PPDCategoryType', 'RecordStatus'])
len(dataimport)

Out[47]:
23075790

### Remove Duplicates¶

In [48]:
len(dataimport) - len(dataimport.drop_duplicates(subset=['Price','Date of Transfer','Post Code','Property Type',
'Borough','Region','PPDCategoryType','RecordStatus']))

Out[48]:
10625

Less than 0.05% are duplicates which we removed.

In [4]:
(10625/len(dataimport))*100

Out[4]:
0.04604392742350316
In [5]:
dataimport2 = dataimport.drop_duplicates(subset=['Price','Date of Transfer','Post Code','Property Type',
'Borough','Region','PPDCategoryType','RecordStatus'])


Then reduce the data by removing extreme house prices. Those are the ones that are:

1. less than £10 Million and,
2. greater than £10,000
In [6]:
dataimport3 = dataimport2.loc[(dataimport2['Price'] < (10000000)) & (dataimport2['Price'] > (10000)),]
len(dataimport3)

Out[6]:
23016232

Let us take a quick exploratory look at the distribution of house prices. We see that the majority of house prices across all years is less than £500,000.

In [135]:
#Plot.ly Histogram
hist_trace = go.Histogram( x = dataimport3['Price'],
name = 'All House Prices',
xbins = dict(
start = 0,
end = 1000000,
size = 10000
),
marker = dict(
color = '#EB89B5')
)
histlayout = go.Layout(
title='Distribution of All House Prices between £10,000 and £1 Mil',
xaxis=dict(
title='House Price (£)'
),
yaxis=dict(
title='Count'
),
bargap=0.05,
bargroupgap=0.1
)
histdata = [hist_trace]
histfig = go.Figure(data=histdata,layout=histlayout)

iplot(histfig)

Out[135]:
'file://E:\\Documents\\UKHousePrices\\temp-plot.html'

## Add and Manipulate Features for Analysis¶

### Convert ‘Date of Transfer’ to a datetime and then create ‘Year’ column¶

In [8]:
dataimport3['Date of Transfer'] = pd.to_datetime(dataimport3['Date of Transfer'])

dataimport3['Year'] = dataimport3['Date of Transfer'].dt.year


### Calculate the Inflation Adjusted Price of each Sale¶

Year Inflation Multiplier
2017 1.00
2016 1.8% 1.02
2015 1.0% 1.03
2014 2.4% 1.05
2013 3.0% 1.08
2012 3.2% 1.12
2011 5.2% 1.18
2010 4.6% 1.23
2009 -0.5% 1.23
2008 4.0% 1.27
2007 4.3% 1.33
2006 3.2% 1.37
2005 2.8% 1.41
2004 3.0% 1.45
2003 2.9% 1.49
2002 1.7% 1.52
2001 1.8% 1.55
2000 3.0% 1.59
1999 1.5% 1.62
1998 3.4% 1.67
1997 3.1% 1.72
1996 2.4% 1.77
1995 3.5% 1.83
In [9]:
dataimport3['1995AdjustedPrice'] = (dataimport3['Price']*1.83).where(dataimport3['Year']==1995)

In [10]:
dataimport3['AdjustedPrice'] = dataimport3['1995AdjustedPrice']

In [11]:
dataimport3.head()

Out[11]:
0 {5BBE9CB3-6332-4EB0-9CD3-8737CEA4A65A} 42000 1995-12-21 NE4 9DN S N F 8 NaN MATFEN PLACE NaN NaN NaN NaN NaN NaN NaN NaN NaN 76860.0
1 {20E2441A-0F16-49AB-97D4-8737E62A5D93} 95000 1995-03-03 RM16 4UR S N F 30 NaN HEATH ROAD NaN NaN NaN NaN NaN NaN NaN NaN NaN 173850.0
2 {D893EE64-4464-44B5-B01B-8E62403ED83C} 74950 1995-10-03 CW10 9ES D Y F 15 NaN SHROPSHIRE CLOSE NaN NaN NaN NaN NaN NaN NaN NaN NaN 137158.5
3 {F9F753A8-E56A-4ECC-9927-8E626A471A92} 43500 1995-11-14 TS23 3LA S N F 19 NaN SLEDMERE CLOSE NaN NaN NaN NaN NaN NaN NaN NaN NaN 79605.0
4 {E166398A-A19E-470E-BB5A-83B4C254CF6D} 63000 1995-09-08 CA25 5QH S N F 8 NaN CROSSINGS CLOSE NaN NaN NaN NaN NaN NaN NaN NaN NaN 115290.0

5 rows × 41 columns

In [12]:
list(dataimport3)

Out[12]:
['ID',
'Price',
'Date of Transfer',
'Post Code',
'Property Type',
'Old/New',
'Duration',
'Town/City',
'District',
'Borough',
'Region',
'PPDCategoryType',
'RecordStatus',
'Year',
'AdjustedPrice']
In [49]:
data = dataimport3.iloc[:,[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,40]]
list(data)

Out[49]:
['ID',
'Price',
'Date of Transfer',
'Post Code',
'Property Type',
'Old/New',
'Duration',
'Town/City',
'District',
'Borough',
'Region',
'PPDCategoryType',
'RecordStatus',
'Year',
'AdjustedPrice']

### Add a ‘LondonCheck’ column for whether the locaiton is in London or the ‘Rest of the UK’¶

In [50]:
data['LondonCheck'] = (data['Region']).where(data['Region']=="GREATER LONDON")
data['LondonCheck'] = data['LondonCheck'].fillna("Rest of UK")
data['LondonCheck'].unique()

Out[50]:
array(['Rest of UK', 'GREATER LONDON'], dtype=object)
In [51]:
data.head()

Out[51]:
ID Price Date of Transfer Post Code Property Type Old/New Duration Address 1 Address 2 Address 3 Town/City District Borough Region PPDCategoryType RecordStatus Year AdjustedPrice LondonCheck
0 {5BBE9CB3-6332-4EB0-9CD3-8737CEA4A65A} 42000 1995-12-21 NE4 9DN S N F 8 NaN MATFEN PLACE FENHAM NEWCASTLE UPON TYNE NEWCASTLE UPON TYNE TYNE AND WEAR A A 1995 76860.0 Rest of UK
1 {20E2441A-0F16-49AB-97D4-8737E62A5D93} 95000 1995-03-03 RM16 4UR S N F 30 NaN HEATH ROAD GRAYS GRAYS THURROCK THURROCK A A 1995 173850.0 Rest of UK
2 {D893EE64-4464-44B5-B01B-8E62403ED83C} 74950 1995-10-03 CW10 9ES D Y F 15 NaN SHROPSHIRE CLOSE MIDDLEWICH MIDDLEWICH CONGLETON CHESHIRE A A 1995 137158.5 Rest of UK
3 {F9F753A8-E56A-4ECC-9927-8E626A471A92} 43500 1995-11-14 TS23 3LA S N F 19 NaN SLEDMERE CLOSE BILLINGHAM BILLINGHAM STOCKTON-ON-TEES STOCKTON-ON-TEES A A 1995 79605.0 Rest of UK
4 {E166398A-A19E-470E-BB5A-83B4C254CF6D} 63000 1995-09-08 CA25 5QH S N F 8 NaN CROSSINGS CLOSE CLEATOR MOOR CLEATOR MOOR COPELAND CUMBRIA A A 1995 115290.0 Rest of UK

## General Analysis¶

First, for interest, I plotted the number of houses sold each month. We can clearly see the effect of the 2008 regression on the housing market that has slowly increased until a huge spike in 2016 just prior to the UK EU referendum.

### Timeline plot for the number of house sales between 1995 and 2017¶

In [16]:
# Create a list of a day from each month from January 1995 to December 2017
daterange = pd.date_range('1995-01-01','2017-12-31' , freq='1M')
daterange = daterange.union([daterange[-1] + 1])
daterange = [d.strftime('%d-%m-%Y') for d in daterange]
daterange

# Use group by to calculate the number of sales of each month
fulldatafortimeplot = data.groupby([(data["Date of Transfer"].dt.year),(data["Date of Transfer"].dt.month)]).count()
fulldatafortimeplot2 = pd.DataFrame(fulldatafortimeplot['ID'])
fulldatafortimeplot2['Dates'] = daterange
fulldatafortimeplot2['Dates'] = pd.to_datetime(fulldatafortimeplot2['Dates'], format = '%d-%m-%Y')
fulldatafortimeplot2.columns = ['Count', 'Dates']
fulldatafortimeplot2 = fulldatafortimeplot2.iloc[0:276,:]

In [17]:
# Plot.ly Timeline plot with range slider
trace_time = go.Scatter(
x=fulldatafortimeplot2['Dates'],
y=fulldatafortimeplot2['Count'],
name = "Number of House Sales",
line = dict(color = '#7F7F7F'),
opacity = 0.8)

data_timline = [trace_time]

layout_timeline = go.Layout(
dict(
title='Timeline of the Number of House Sales in the UK between 1995 and 2017',
xaxis=dict(

rangeslider=dict(),
type='date'
),
annotations = [
dict(
x = datetime.strptime('23-06-2016', '%d-%m-%Y'),
y = 84927,
xref = 'x',
yref = 'y',
text = 'UK Referendum',
showarrow = True,
ax = 0,
ay = -40
),
dict(
x = datetime.strptime('01-12-2007', '%d-%m-%Y'),
y = 104283,
xref = 'x',
yref = 'y',
text = 'Financial Crash',
showarrow = True,