Consensus.GeocodeMerger module
Using SmartLinker()
This module provides a SmartLinker() class that finds the shortest path between two columns in different tables in Open Geography Portal. The idea for this class was born out of the need to constantly access Open Geography Portal for data science projects, creating complex lookup merges for comparing 2011 and 2021 Census data. You can think of this class as a convenience wrapper for downloading data from many datasets in Open Geography Portal. The SmartLinker() class takes a starting and an ending column, a list of geographic areas and the columns in which those area values should be found, and finds the shortest path between the start and end points.
We do this by using graph theory, specifically a breadth-first search over the columns of all tables on Open Geography Portal.
The end result is not by any means perfect and you are advised to try different paths and to check that the output makes sense.
You can further use the output of this class to download data from Nomis using the Consensus.Nomis.DownloadFromNomis() class. More specifically, if you know the geographic areas (e.g. wards) you are interested in are available for a specific Census table (say, TS054 - Tenure from Census 2021), you can find the ward geocodes using SmartLinker() and then input those to DownloadFromNomis().download() to access Nomis data.
Usage:
This class works as follows.
Internally, on initialising the class with await SmartLinker().initialise(), a json lookup file of the available tables in Open Geography Portal is read if the json file exists or created if it is not available.
Then, using the information contained in the json file, a graph of connections between table columns is created using the run_graph() method. At this point the user provides the names of the starting and ending columns, an optional list of geographic_areas and an optional list of geographic_area_columns, the columns in which the geographic_areas values are looked up to create a subset of data.
Following the creation of the graph, all possible starting points are searched for (i.e., which tables contain the user-provided starting_column). After this, we look for the shortest paths to the ending column.
To do this, we look for all possible paths from all tables containing the starting column to all tables containing the ending column and count how many steps there are in each path.
The run_graph() method prints out a numbered list of possible paths.
The user can get their chosen data using the geodata() method by providing an integer matching their chosen path to the selected_path argument.
Intended workflow
First explore the possible geographies.
from Consensus.GeocodeMerger import SmartLinker, GeoHelper
import asyncio
gh = GeoHelper()
print(gh.geography_keys()) # outputs a dictionary of explanations for nearly all UK geographic units.
print(gh.available_geographies()) # outputs all geographies currently available in the lookup file.
print(gh.geographies_filter('WD')) # outputs all columns referring to wards.
Once you’ve decided you want to look at 2022 wards, you can do the following:
async def get_data():
    gss = SmartLinker()
    await gss.initialise()
    gss.allow_geometry('geometry_only') # use this method to restrict the graph search space to tables with geometry
    gss.allow_geometry('connected_tables') # set this to ``True`` if you must have geometries in the *connected* table
    gss.run_graph(starting_column='WD22CD', ending_column='LAD22CD', geographic_areas=['Lewisham', 'Southwark'], geographic_area_columns=['LAD22NM']) # you can choose the starting and ending columns using ``GeoHelper().geographies_filter()`` method.
    codes = await gss.geodata(selected_path=9, chunk_size=50) # the selected path is the ninth in the list of potential paths output by ``run_graph()`` method. Increase chunk_size if your download is slow and try decreasing it if you are being throttled (or encounter weird errors).
    print(codes['table_data'][0]) # the output is a dictionary of ``{'path': [[table1_of_path_1, table2_of_path1], [table1_of_path2, table2_of_path2]], 'table_data':[data_for_path1, data_for_path2]}``
    return codes['table_data'][0]

output = asyncio.run(get_data())
From here, you can take the WD22CD column from output and use it as input to the Consensus.Nomis.DownloadFromNomis() class if you wanted to.
- Consensus.GeocodeMerger.BFS_SP(graph, start, goal)
Breadth-first search.
- Parameters:
graph (Dict[str, List[Tuple[str, str]]]) – Dictionary of connected tables based on shared columns.
start (str) – Starting table and column.
goal (str) – Final table and column.
- Returns:
A path as a list
- Return type:
List[Any]
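To illustrate the idea behind a breadth-first shortest-path search, here is a minimal, self-contained sketch. It is not the library's own implementation of BFS_SP(): the function name bfs_shortest_path, the toy table names, and the simplification of using plain table names as nodes are all assumptions made for the example. The graph has the documented shape, a dictionary mapping each table to a list of (connected table, shared column) tuples.

```python
from collections import deque
from typing import Dict, List, Optional, Tuple

# Each table maps to a list of (connected_table, shared_column) tuples.
Graph = Dict[str, List[Tuple[str, str]]]


def bfs_shortest_path(graph: Graph, start: str, goal: str) -> Optional[List[str]]:
    """Return the shortest list of tables from start to goal, or None."""
    if start == goal:
        return [start]
    visited = {start}
    queue = deque([[start]])  # queue of partial paths, shortest first
    while queue:
        path = queue.popleft()
        for neighbour, _shared_column in graph.get(path[-1], []):
            if neighbour in visited:
                continue
            new_path = path + [neighbour]
            if neighbour == goal:
                return new_path
            visited.add(neighbour)
            queue.append(new_path)
    return None  # start and goal are not connected


# Hypothetical table names, used only for illustration.
toy_graph: Graph = {
    "wards_2022": [("ward_to_lad_2022", "WD22CD")],
    "ward_to_lad_2022": [("wards_2022", "WD22CD"), ("lad_2022", "LAD22CD")],
    "lad_2022": [("ward_to_lad_2022", "LAD22CD")],
}

print(bfs_shortest_path(toy_graph, "wards_2022", "lad_2022"))
```

Because breadth-first search explores all paths of length n before any path of length n + 1, the first path that reaches the goal is guaranteed to be one of the shortest.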
- class Consensus.GeocodeMerger.GeoHelper(server='OGP')
Bases:
object
The GeoHelper() class helps with exploring the different possibilities for start and end columns.
This class provides 3 methods:
- geography_keys(), which outputs a dictionary of short-hand descriptions of geographic areas. You can typically append a two-digit year and either CD or NM to the abbreviations. For instance, BUA, which stands for Built-up areas, could be extended to “BUA11CD”, which refers to the geocodes of 2011 BUAs.
- available_geographies(), which outputs all available geographies. Combine with the above method to get an explanation for a given geography.
- geographies_filter(), which combines the above two methods in a single convenience method so you don’t have to create your own filter. Just grab a key from the geography_keys() method and use it as input.
- None
Usage:
gh = GeoHelper()
print(gh.geography_keys())
print(gh.available_geographies())
print(gh.geographies_filter('WD')) # outputs all columns referring to wards.
- __init__(server='OGP')
Initialise GeoHelper() by getting the lookup table from SmartLinker() for the chosen server.
- available_geographies()
Prints the geocode columns available in the current lookup file, which is built from the Open Geography Portal data.
- Returns:
A list of available geographies.
- Return type:
List[str]
- geographies_filter(geo_key=None)
Helper method to filter the available geographies based on a given key.
- Parameters:
geo_key (str) – The key to filter the available geographies.
- Returns:
A list of filtered geographies.
- Return type:
List[str]
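The naming convention described for GeoHelper() (an abbreviation, a two-digit year, then CD or NM) suggests how such a filter could work. The sketch below is an assumption about the matching rule, not the library's code; the function filter_geographies and the sample column list are made up for the example.

```python
import re
from typing import List


def filter_geographies(available: List[str], geo_key: str) -> List[str]:
    """Keep columns of the form <geo_key><two-digit year><CD or NM>."""
    pattern = re.compile(rf"^{re.escape(geo_key.upper())}\d{{2}}(CD|NM)$")
    return [column for column in available if pattern.match(column)]


# Hypothetical sample of column names from a lookup file.
columns = ["WD22CD", "WD22NM", "LAD22CD", "WDSTL21CD", "BUA11CD"]

print(filter_geographies(columns, "WD"))
```

Anchoring the pattern at both ends keeps a key such as WD from also matching longer abbreviations that merely start with the same letters.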
- static geography_keys()
Get the short-hand descriptions of most common geographic areas.
- Returns:
A dictionary of short-hand descriptions of geographic areas.
- Return type:
Dict[str, str]
- exception Consensus.GeocodeMerger.InvalidColumnError
Bases:
Exception
Raise if invalid column
- __init__(*args, **kwargs)
- add_note()
Exception.add_note(note) – add a note to the exception
- args
- with_traceback()
Exception.with_traceback(tb) – set self.__traceback__ to tb and return self.
- exception Consensus.GeocodeMerger.InvalidPathError
Bases:
Exception
Raise if graph path length less than one
- __init__(*args, **kwargs)
- add_note()
Exception.add_note(note) – add a note to the exception
- args
- with_traceback()
Exception.with_traceback(tb) – set self.__traceback__ to tb and return self.
- exception Consensus.GeocodeMerger.MissingDataError
Bases:
Exception
Raise if no data found
- __init__(*args, **kwargs)
- add_note()
Exception.add_note(note) – add a note to the exception
- args
- with_traceback()
Exception.with_traceback(tb) – set self.__traceback__ to tb and return self.
- class Consensus.GeocodeMerger.SmartLinker(server='OGP', lookup_folder=None, **kwargs)
Bases:
object
Uses graph theory (breadth-first search) to find shortest path between table columns.
- server
Name of the server to use (‘OGP’ or ‘TFL’). Defaults to ‘OGP’.
- Type:
str
- lookup_location
Path to the lookup.json file. Defaults to None.
- Type:
Path
- graph
A dictionary of connected tables based on shared columns.
- lookup
A dictionary of table names and their corresponding columns.
- local_authorities
A list of local authorities to filter the data by.
- initialise()
This method reads the lookup.json file and creates the graph.
- run_graph()
This method creates the graph by searching through the lookup.json file for data with shared column names, given the names of the starting and ending columns.
- geodata()
This method outputs the geodata given the start and end columns.
- allow_geometry()
This method restricts the graph search space to tables with geometry. Counter-intuitively, you reset it by running it without any arguments.
- Usage:
This class works as follows.
Internally, on initialising the class with SmartLinker().initialise(), a json lookup file of the available tables in Open Geography Portal is read if the json file exists. If it is not available, you must build the lookup file first using the appropriate EsriConnector() sub-class. Then, using the information contained in the json file, a graph of connections between table columns is created using the run_graph() method. At this point the user provides the names of the starting and ending columns, an optional list of geographic_areas and an optional list of geographic_area_columns, the columns in which the geographic_areas values are looked up to create a subset of data.
Following the creation of the graph, all possible starting points are searched for (i.e., which tables contain the user-provided starting_column). After this, we look for the shortest paths to the ending column. To do this, we look for all possible paths from all tables containing the starting column to all tables containing the ending column and count how many steps there are in each path. The run_graph() method prints out a numbered list of possible paths.
The user can get their chosen data using the geodata() method by providing an integer matching their chosen path to the selected_path argument.
The intended workflow is:
from Consensus.GeocodeMerger import SmartLinker
import asyncio

async def example():
    gss = SmartLinker(server='OGP') # change ``server`` argument to 'TFL' if you so wish. This class may not perform well for all Esri servers as the stored data tables may not have many or any common column names and therefore this class may build a disconnected graph.
    await gss.initialise()
    gss.allow_geometry('geometry_only') # use this method to restrict the graph search space to tables with geometry
    gss.allow_geometry('connected_tables') # set this to ``True`` if you must have geometries in the *connected* table
    gss.run_graph(starting_column='WD22CD', ending_column='LAD22CD', geographic_areas=['Lewisham', 'Southwark'], geographic_area_columns=['LAD22NM']) # the starting and ending columns should end in CD
    codes = await gss.geodata(selected_path=9, chunk_size=50) # the selected path is the ninth in the list of potential paths output by `run_graph()` method. Increase chunk_size if your download is slow and try decreasing it if you are being throttled (or encounter weird errors).
    print(codes['table_data'][0]) # the output is a dictionary of ``{'path': [[table1_of_path_1, table2_of_path1], [table1_of_path2, table2_of_path2]], 'table_data':[data_for_path1, data_for_path2]}``.

asyncio.run(example())
- __init__(server='OGP', lookup_folder=None, **kwargs)
Initialise SmartLinker.
- Parameters:
server (str) – Name of the server to use (‘OGP’ or ‘TFL’). Defaults to ‘OGP’.
lookup_location (Path) – Path to the lookup.json file. Defaults to None.
**kwargs (Dict[str, Any]) – Passes keyword arguments to EsriConnector class.
- Returns:
None
- async _get_ogp_table(pathway, where_clause='1=1', **kwargs)
Uses FeatureServer() to download data from Open Geography Portal. Keyword arguments are passed to FeatureServer().
- Parameters:
pathway (str) – The name of the service to download data for.
where_clause (str) – The where clause to filter the data.
**kwargs – Keyword arguments to pass to FeatureServer().setup(). Main keywords to use are max_retries, timeout, chunk_size, and layer_number. Change these if you’re experiencing connectivity issues or know that you want to download a specific layer.
- Returns:
A tuple containing the downloaded data and the pathway used.
- Return type:
Tuple[pd.DataFrame, str]
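As an illustration of how a list of geographic areas might be turned into a where_clause string, here is a hedged sketch. The helper build_where_clause is hypothetical and is not part of the library; it assumes the server accepts SQL-style IN expressions and that every listed column exists in the table being queried. When no filter is given it falls back to the documented default of '1=1', which matches every row.

```python
from typing import List, Optional


def build_where_clause(
    geographic_areas: Optional[List[str]],
    geographic_area_columns: Optional[List[str]],
) -> str:
    """Build a SQL-style filter, or '1=1' (match everything) if no filter is given."""
    if not geographic_areas or not geographic_area_columns:
        return "1=1"
    # Double any single quotes so names such as "King's Lynn" stay valid SQL strings.
    quoted = ", ".join("'" + area.replace("'", "''") + "'" for area in geographic_areas)
    return " OR ".join(f"{column} IN ({quoted})" for column in geographic_area_columns)


print(build_where_clause(["Lewisham", "Southwark"], ["LAD22NM"]))
print(build_where_clause(None, None))
```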
- _path_to_tables(paths=[[]])
Make a list of tables in the path.
- Returns:
A list of tables in the path.
- Return type:
List[str]
- allow_geometry(setting=None)
Use this method to limit the graph search space, which slightly speeds up the process, but also limits the possible connections that can be made. If you’re only interested in geographic areas with geometries (say, you need ward boundaries), then set the setting argument to geometry_only.
If you choose to set ‘connected_tables’, this will set self.force_geometry to True, so that only tables with geometry will be used in the connecting tables. If False, all tables will be used. Defaults to False. Note that this does not affect the starting table, as doing so would be equivalent to setting ‘geometry_only’.
If a setting has been chosen, you may find that you need to reset it so that you can search a wider space. To do so, simply run the method without any arguments and it will reset the lookup space to default.
- Parameters:
setting (str) – One of: ‘geometry_only’, ‘connected_tables’, or ‘non_geometry’. Anything else will use the default, which is that both geometry and non-geometry tables are used.
- Return type:
None
- Returns:
None
- create_graph()
Create a graph of connections between tables using common column names.
- Returns:
A tuple containing a dictionary representing the graph and a list of table-column pairs.
- Return type:
Tuple[Dict[str, List[Tuple[str, str]]], List[str]]
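The idea of connecting tables through common column names can be sketched as follows. This is a simplified illustration, not the library's create_graph(): it assumes the lookup is a plain dictionary of table name to column names, returns only the graph (not the list of table-column pairs), and the function name build_graph and the sample lookup are made up for the example.

```python
from itertools import combinations
from typing import Dict, List, Tuple


def build_graph(lookup: Dict[str, List[str]]) -> Dict[str, List[Tuple[str, str]]]:
    """Connect every pair of tables that share at least one column name."""
    graph: Dict[str, List[Tuple[str, str]]] = {table: [] for table in lookup}
    for left, right in combinations(lookup, 2):
        # One edge per shared column, recorded in both directions.
        for column in sorted(set(lookup[left]) & set(lookup[right])):
            graph[left].append((right, column))
            graph[right].append((left, column))
    return graph


# Hypothetical lookup contents, used only for illustration.
sample_lookup = {
    "wards": ["WD22CD", "WD22NM"],
    "ward_to_lad": ["WD22CD", "LAD22CD"],
    "lads": ["LAD22CD", "LAD22NM"],
}

for table, edges in build_graph(sample_lookup).items():
    print(table, edges)
```

Tables with no shared columns end up with an empty edge list, which is how a disconnected graph can arise on servers whose tables have few common column names.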
- find_paths()
Find all paths given all start and end options using the BFS_SP() function.
- Returns:
A dictionary containing the possible paths. Paths are sorted alphabetically.
- Return type:
Dict[str, List]
- find_shortest_paths()
From all path options, choose shortest.
- Returns:
A list of the shortest paths.
- Return type:
List[str]
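Choosing the shortest of several candidate paths can be sketched as below. This is an assumption about the selection rule (fewest tables wins, ties are all kept) rather than the library's code; the function name shortest_paths and the sample paths are hypothetical.

```python
from typing import List


def shortest_paths(paths: List[List[str]]) -> List[List[str]]:
    """Return every non-empty path whose length equals the minimum length."""
    candidates = [path for path in paths if path]
    if not candidates:
        return []
    best = min(len(path) for path in candidates)
    return [path for path in candidates if len(path) == best]


# Hypothetical candidate paths between a ward table and a local authority table.
candidate_paths = [
    ["wards", "ward_to_lad", "lads"],
    ["wards", "lsoa_to_ward", "lsoa_to_lad", "lads"],
    ["wards", "ward_to_lad_alt", "lads"],
]

print(shortest_paths(candidate_paths))
```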
- async geodata(selected_path=None, retun_all=False, **kwargs)
Get a dictionary of pandas dataframes, either merged and filtered by geographic_areas or returned as individual tables.
- Parameters:
selected_path (int) – Choose the path from the output of the run_graph() method.
retun_all (bool) – Set this to True if you want to get individual tables that would otherwise get merged.
**kwargs – These keyword arguments get passed to EsriConnector.FeatureServer().setup(). Main keywords to use are max_retries, timeout, chunk_size, and layer_number. Change these if you’re experiencing connectivity issues. For instance, add more retries and increase time between tries, and reduce chunk_size for each call so you’re not overwhelming the server. If you’re not getting the layer you expected, you can try changing the layer_number - most should work with the default 0, but there is a possibility of multiple layers being available for a given dataset.
- Return type:
Dict[str, List[Any]]
- Returns:
A dictionary of merged tables, where the first key (‘paths’) refers to a list of lists of the merged tables and the second key (‘table_data’) contains a list of pandas DataFrame objects that are the left-joined data tables.
- get_starting_point()
Starting point is hard coded as being from any table with ‘LAD’, ‘UTLA’, or ‘LTLA’ columns.
- Returns:
A dictionary containing the starting tables and their columns.
- Return type:
Dict[str, List[str]]
- get_starting_point_without_local_authority_constraint()
Starting point is any table with a suitable column.
- Returns:
A dictionary containing the starting tables and their columns.
- Return type:
Dict[str, List[str]]
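The difference between the two starting-point methods can be sketched as a single filter with an optional local-authority constraint. This is an illustrative assumption, not the library's logic: the function starting_tables, its require_local_authority flag, and the sample lookup are invented for the example, and "has a local authority column" is approximated as "has a column whose name starts with LAD, UTLA, or LTLA".

```python
from typing import Dict, List

LOCAL_AUTHORITY_PREFIXES = ("LAD", "UTLA", "LTLA")


def starting_tables(
    lookup: Dict[str, List[str]],
    starting_column: str,
    require_local_authority: bool = False,
) -> Dict[str, List[str]]:
    """Return the tables (and their columns) that contain starting_column."""
    matches: Dict[str, List[str]] = {}
    for table, columns in lookup.items():
        if starting_column not in columns:
            continue
        has_la_column = any(c.startswith(LOCAL_AUTHORITY_PREFIXES) for c in columns)
        if require_local_authority and not has_la_column:
            continue
        matches[table] = list(columns)
    return matches


# Hypothetical lookup contents, used only for illustration.
sample_tables = {
    "wards": ["WD22CD", "WD22NM"],
    "ward_to_lad": ["WD22CD", "LAD22CD", "LAD22NM"],
    "lads": ["LAD22CD", "LAD22NM"],
}

print(starting_tables(sample_tables, "WD22CD"))
print(starting_tables(sample_tables, "WD22CD", require_local_authority=True))
```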
- async initialise()
Initialise the connections to the selected Esri server and prepare the async Feature Server for downloading.
- Raises:
ValueError – If the server does not have a service table.
- Return type:
None
- Returns:
None
- paths_to_explore()
Returns all possible paths (only table names) as a dictionary. The keys can be used to select your desired path by inputting it like: geodata(selected_path=key).
- Returns:
A dictionary of possible paths.
- Return type:
Dict[int, str]
- run_graph(starting_column=None, ending_column=None, geographic_areas=None, geographic_area_columns=['LAD22NM', 'UTLA22NM', 'LTLA22NM'])
Use this method to create the graph given start and end points, as well as the local authority. The starting_column and ending_column parameters should end in “CD”. For example LAD21CD or WD23CD.
- Parameters:
starting_column (str) – The starting column for the graph search.
ending_column (str) – The ending column for the graph search.
geographic_areas (List[str]) – A list of geographic areas to filter the data by.
geographic_area_columns (List[str]) – A list of columns to use when filtering the data using the geographic_areas list. Defaults to [‘LAD22NM’, ‘UTLA22NM’, ‘LTLA22NM’].
- Raises:
Exception – If the starting_column or ending_column is not provided.
- Return type:
None
- Returns:
None