Introduction¶
Analysis for data set:
https://www.kaggle.com/hm-land-registry/uk-housing-prices-paid
https://www.gov.uk/government/statistical-data-sets/price-paid-data-downloads
We have used Python, R, GGPlot and Plotl.ly to analyse over 20 Million houses to compare the number of houses sold in each year and the price of houses. Instead of comparing averagge houses by location or any other aggregated analysis, we will directly compare the price of the same house being sold across years.
Because there are so many rows, to run this notebook at least 16GB of RAM is required and we have taken visual steps to reduce the computation and visual strain of so much data.
In summary there are three sections:
- Part 1: Data and Exploratory Analysis
- Part 2: Connecting House Sales Between Years
- Part 3: Connecting House Sales In the Same Year
import pandas as pd
import numpy as np
import plotly.plotly as py
import plotly.graph_objs as go
from plotly import __version__
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
init_notebook_mode(connected=True)
from numpy import arange,array,ones
from scipy import stats
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
from datetime import datetime
import math
Part 1: Data and Exploratory Analysis¶
Import Data¶
Import data set from: https://www.gov.uk/government/statistical-data-sets/price-paid-data-downloads
Unfortunately, the data on kaggle doesn’t contain the address location that we use later for comparing the houses.
#data = pd.read_csv('E:\\Documents\\UKHousePrices\\price_paid_records.csv')
dataimport = pd.read_csv('E:\\Documents\\UKHousePrices\\pp-complete.csv', names = ['ID','Price', 'Date of Transfer', 'Post Code',
'Property Type', 'Old/New','Duration','Address 1',
'Address 2', 'Address 3', 'Town/City', 'District',
'Borough', 'Region', 'PPDCategoryType', 'RecordStatus'])
len(dataimport)
Remove Duplicates¶
len(dataimport) - len(dataimport.drop_duplicates(subset=['Price','Date of Transfer','Post Code','Property Type',
'Old/New','Duration','Address 1','Address 2',
'Address 3','Town/City','District',
'Borough','Region','PPDCategoryType','RecordStatus']))
Less than 0.05% are duplicates which we removed.
(10625/len(dataimport))*100
dataimport2 = dataimport.drop_duplicates(subset=['Price','Date of Transfer','Post Code','Property Type',
'Old/New','Duration','Address 1','Address 2',
'Address 3','Town/City','District',
'Borough','Region','PPDCategoryType','RecordStatus'])
Then reduce the data by removing extreme house prices. Those are the ones that are:
- less than £10 Million and,
- greater than £10,000
dataimport3 = dataimport2.loc[(dataimport2['Price'] < (10000000)) & (dataimport2['Price'] > (10000)),]
len(dataimport3)
Let us take a quick exploratory look at the distribution of house prices. We see that the majority of house prices across all years is less than £500,000.
#Plot.ly Histogram
hist_trace = go.Histogram( x = dataimport3['Price'],
name = 'All House Prices',
xbins = dict(
start = 0,
end = 1000000,
size = 10000
),
marker = dict(
color = '#EB89B5')
)
histlayout = go.Layout(
title='Distribution of All House Prices between £10,000 and £1 Mil',
xaxis=dict(
title='House Price (£)'
),
yaxis=dict(
title='Count'
),
bargap=0.05,
bargroupgap=0.1
)
histdata = [hist_trace]
histfig = go.Figure(data=histdata,layout=histlayout)
iplot(histfig)
Add and Manipulate Features for Analysis¶
Convert ‘Date of Transfer’ to a datetime and then create ‘Year’ column¶
dataimport3['Date of Transfer'] = pd.to_datetime(dataimport3['Date of Transfer'])
dataimport3['Year'] = dataimport3['Date of Transfer'].dt.year
Calculate the Inflation Adjusted Price of each Sale¶
Year | Inflation | Multiplier |
---|---|---|
2017 | 1.00 | |
2016 | 1.8% | 1.02 |
2015 | 1.0% | 1.03 |
2014 | 2.4% | 1.05 |
2013 | 3.0% | 1.08 |
2012 | 3.2% | 1.12 |
2011 | 5.2% | 1.18 |
2010 | 4.6% | 1.23 |
2009 | -0.5% | 1.23 |
2008 | 4.0% | 1.27 |
2007 | 4.3% | 1.33 |
2006 | 3.2% | 1.37 |
2005 | 2.8% | 1.41 |
2004 | 3.0% | 1.45 |
2003 | 2.9% | 1.49 |
2002 | 1.7% | 1.52 |
2001 | 1.8% | 1.55 |
2000 | 3.0% | 1.59 |
1999 | 1.5% | 1.62 |
1998 | 3.4% | 1.67 |
1997 | 3.1% | 1.72 |
1996 | 2.4% | 1.77 |
1995 | 3.5% | 1.83 |
dataimport3['1995AdjustedPrice'] = (dataimport3['Price']*1.83).where(dataimport3['Year']==1995)
dataimport3['1996AdjustedPrice'] = (dataimport3['Price']*1.77).where(dataimport3['Year']==1996)
dataimport3['1997AdjustedPrice'] = (dataimport3['Price']*1.72).where(dataimport3['Year']==1997)
dataimport3['1998AdjustedPrice'] = (dataimport3['Price']*1.67).where(dataimport3['Year']==1998)
dataimport3['1999AdjustedPrice'] = (dataimport3['Price']*1.62).where(dataimport3['Year']==1999)
dataimport3['2000AdjustedPrice'] = (dataimport3['Price']*1.59).where(dataimport3['Year']==2000)
dataimport3['2001AdjustedPrice'] = (dataimport3['Price']*1.55).where(dataimport3['Year']==2001)
dataimport3['2002AdjustedPrice'] = (dataimport3['Price']*1.52).where(dataimport3['Year']==2002)
dataimport3['2003AdjustedPrice'] = (dataimport3['Price']*1.49).where(dataimport3['Year']==2003)
dataimport3['2004AdjustedPrice'] = (dataimport3['Price']*1.45).where(dataimport3['Year']==2004)
dataimport3['2005AdjustedPrice'] = (dataimport3['Price']*1.41).where(dataimport3['Year']==2005)
dataimport3['2006AdjustedPrice'] = (dataimport3['Price']*1.37).where(dataimport3['Year']==2006)
dataimport3['2007AdjustedPrice'] = (dataimport3['Price']*1.33).where(dataimport3['Year']==2007)
dataimport3['2008AdjustedPrice'] = (dataimport3['Price']*1.27).where(dataimport3['Year']==2008)
dataimport3['2009AdjustedPrice'] = (dataimport3['Price']*1.23).where(dataimport3['Year']==2009)
dataimport3['2010AdjustedPrice'] = (dataimport3['Price']*1.23).where(dataimport3['Year']==2010)
dataimport3['2011AdjustedPrice'] = (dataimport3['Price']*1.18).where(dataimport3['Year']==2011)
dataimport3['2012AdjustedPrice'] = (dataimport3['Price']*1.12).where(dataimport3['Year']==2012)
dataimport3['2013AdjustedPrice'] = (dataimport3['Price']*1.08).where(dataimport3['Year']==2013)
dataimport3['2014AdjustedPrice'] = (dataimport3['Price']*1.05).where(dataimport3['Year']==2014)
dataimport3['2015AdjustedPrice'] = (dataimport3['Price']*1.03).where(dataimport3['Year']==2015)
dataimport3['2016AdjustedPrice'] = (dataimport3['Price']*1.02).where(dataimport3['Year']==2016)
dataimport3['2017AdjustedPrice'] = (dataimport3['Price']*1.00).where(dataimport3['Year']==2017)
dataimport3['AdjustedPrice'] = dataimport3['1995AdjustedPrice']
dataimport3['AdjustedPrice'] = dataimport3['AdjustedPrice'].fillna(dataimport3['1996AdjustedPrice'])
dataimport3['AdjustedPrice'] = dataimport3['AdjustedPrice'].fillna(dataimport3['1997AdjustedPrice'])
dataimport3['AdjustedPrice'] = dataimport3['AdjustedPrice'].fillna(dataimport3['1998AdjustedPrice'])
dataimport3['AdjustedPrice'] = dataimport3['AdjustedPrice'].fillna(dataimport3['1999AdjustedPrice'])
dataimport3['AdjustedPrice'] = dataimport3['AdjustedPrice'].fillna(dataimport3['2000AdjustedPrice'])
dataimport3['AdjustedPrice'] = dataimport3['AdjustedPrice'].fillna(dataimport3['2001AdjustedPrice'])
dataimport3['AdjustedPrice'] = dataimport3['AdjustedPrice'].fillna(dataimport3['2002AdjustedPrice'])
dataimport3['AdjustedPrice'] = dataimport3['AdjustedPrice'].fillna(dataimport3['2003AdjustedPrice'])
dataimport3['AdjustedPrice'] = dataimport3['AdjustedPrice'].fillna(dataimport3['2004AdjustedPrice'])
dataimport3['AdjustedPrice'] = dataimport3['AdjustedPrice'].fillna(dataimport3['2005AdjustedPrice'])
dataimport3['AdjustedPrice'] = dataimport3['AdjustedPrice'].fillna(dataimport3['2006AdjustedPrice'])
dataimport3['AdjustedPrice'] = dataimport3['AdjustedPrice'].fillna(dataimport3['2007AdjustedPrice'])
dataimport3['AdjustedPrice'] = dataimport3['AdjustedPrice'].fillna(dataimport3['2008AdjustedPrice'])
dataimport3['AdjustedPrice'] = dataimport3['AdjustedPrice'].fillna(dataimport3['2009AdjustedPrice'])
dataimport3['AdjustedPrice'] = dataimport3['AdjustedPrice'].fillna(dataimport3['2010AdjustedPrice'])
dataimport3['AdjustedPrice'] = dataimport3['AdjustedPrice'].fillna(dataimport3['2011AdjustedPrice'])
dataimport3['AdjustedPrice'] = dataimport3['AdjustedPrice'].fillna(dataimport3['2012AdjustedPrice'])
dataimport3['AdjustedPrice'] = dataimport3['AdjustedPrice'].fillna(dataimport3['2013AdjustedPrice'])
dataimport3['AdjustedPrice'] = dataimport3['AdjustedPrice'].fillna(dataimport3['2014AdjustedPrice'])
dataimport3['AdjustedPrice'] = dataimport3['AdjustedPrice'].fillna(dataimport3['2015AdjustedPrice'])
dataimport3['AdjustedPrice'] = dataimport3['AdjustedPrice'].fillna(dataimport3['2016AdjustedPrice'])
dataimport3['AdjustedPrice'] = dataimport3['AdjustedPrice'].fillna(dataimport3['2017AdjustedPrice'])
dataimport3.head()
list(dataimport3)
data = dataimport3.iloc[:,[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,40]]
list(data)
Add a ‘LondonCheck’ column for whether the locaiton is in London or the ‘Rest of the UK’¶
data['LondonCheck'] = (data['Region']).where(data['Region']=="GREATER LONDON")
data['LondonCheck'] = data['LondonCheck'].fillna("Rest of UK")
data['LondonCheck'].unique()
data.head()
General Analysis¶
First, for interest, I plotted the number of houses sold each month. We can clearly see the effect of the 2008 regression on the housing market that has slowly increased until a huge spike in 2016 just prior to the UK EU referendum.
Timeline plot for the number of house sales between 1995 and 2017¶
# Create a list of a day from each month from January 1995 to December 2017
daterange = pd.date_range('1995-01-01','2017-12-31' , freq='1M')
daterange = daterange.union([daterange[-1] + 1])
daterange = [d.strftime('%d-%m-%Y') for d in daterange]
daterange
# Use group by to calculate the number of sales of each month
fulldatafortimeplot = data.groupby([(data["Date of Transfer"].dt.year),(data["Date of Transfer"].dt.month)]).count()
fulldatafortimeplot2 = pd.DataFrame(fulldatafortimeplot['ID'])
fulldatafortimeplot2['Dates'] = daterange
fulldatafortimeplot2['Dates'] = pd.to_datetime(fulldatafortimeplot2['Dates'], format = '%d-%m-%Y')
fulldatafortimeplot2.columns = ['Count', 'Dates']
fulldatafortimeplot2 = fulldatafortimeplot2.iloc[0:276,:]
# Plot.ly Timeline plot with range slider
trace_time = go.Scatter(
x=fulldatafortimeplot2['Dates'],
y=fulldatafortimeplot2['Count'],
name = "Number of House Sales",
line = dict(color = '#7F7F7F'),
opacity = 0.8)
data_timline = [trace_time]
layout_timeline = go.Layout(
dict(
title='Timeline of the Number of House Sales in the UK between 1995 and 2017',
xaxis=dict(
rangeslider=dict(),
type='date'
),
annotations = [
dict(
x = datetime.strptime('23-06-2016', '%d-%m-%Y'),
y = 84927,
xref = 'x',
yref = 'y',
text = 'UK Referendum',
showarrow = True,
arrowhead = 7,
ax = 0,
ay = -40
),
dict(
x = datetime.strptime('01-12-2007', '%d-%m-%Y'),
y = 104283,
xref = 'x',
yref = 'y',
text = 'Financial Crash',
showarrow = True,
arrowhead = 7,
ax = 0,
ay = -40
)
]
)
)
fig = dict(data=data_timline, layout=layout_timeline)
iplot(fig)