Consensus.LocalMerger module

This module is a work in progress and is not yet fully implemented. Please let me know if you would like to contribute.

class Consensus.LocalMerger.DatabaseManager(db_path)

Bases: object

A class to manage a DuckDB database, including creating tables from file data and querying tables using various join types.

db_path

Path to the DuckDB database file.

Type:

str

conn

Connection to the DuckDB database.

Type:

duckdb.DuckDBPyConnection

__init__(db_path: str): Initializes the DatabaseManager with the provided database path.

create_database(table_paths: Dict[str, Path]): Creates tables in the DuckDB database from CSV or Excel files.

query_tables_from_path(path: List[str], table_paths: Dict[str, Path], join_type: str = 'left'): Queries multiple tables specified in the path and joins them using the specified join type.

query_tables_from_graph(graph: nx.DiGraph, join_type: str = 'left'): Queries tables based on a directed graph and joins them using the specified join type.

query_tables_from_dict(graph: Dict[str, List[str]], join_type: str = 'left'): Queries tables based on a dictionary representation of a graph and joins them using the specified join type.

Usage:

from pathlib import Path

from Consensus.LocalMerger import DatabaseManager

table_paths = {'table1': Path('path/to/table1.csv'), 'table2': Path('path/to/table2.csv')}
db_manager = DatabaseManager('path/to/database.db')
db_manager.create_database(table_paths)
result = db_manager.query_tables_from_path(['table1', 'table2'], table_paths, 'left')
db_manager.close()

__init__(db_path)

Initializes the DatabaseManager with the provided database path.

Parameters:

db_path (str) – The path to the DuckDB database.

_join_tables(dfs, join_type='left')

Joins multiple DataFrames using the specified join type.

Parameters:
  • dfs (Dict[str, pd.DataFrame]) – A dictionary of table names to DataFrames to be joined.

  • join_type (str) – The type of join to perform (e.g., ‘inner’, ‘outer’). Defaults to ‘left’.

Returns:

The resulting DataFrame after performing all joins.

Return type:

pd.DataFrame

Raises:

ValueError – If no common columns are found between DataFrames for joining.
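
The joining behaviour described above can be illustrated with a minimal sketch. It is not the module's actual implementation: the standalone function join_tables and the rule "join on every column name the two frames share" are assumptions made for illustration, and only pandas' documented DataFrame.merge is used.

```python
from typing import Dict

import pandas as pd


def join_tables(dfs: Dict[str, pd.DataFrame], join_type: str = "left") -> pd.DataFrame:
    """Join DataFrames in dictionary order on the columns they share."""
    if not dfs:
        raise ValueError("No DataFrames to join.")
    frames = list(dfs.values())
    result = frames[0]
    for df in frames[1:]:
        # Assumption: the join keys are the column names present in both frames.
        common = [col for col in result.columns if col in df.columns]
        if not common:
            raise ValueError("No common columns found for joining.")
        result = result.merge(df, on=common, how=join_type)
    return result


# Hypothetical example tables sharing the "id" column.
people = pd.DataFrame({"id": [1, 2, 3], "name": ["Ann", "Bob", "Cy"]})
orders = pd.DataFrame({"id": [1, 1, 3], "item": ["pen", "ink", "cap"]})
joined = join_tables({"people": people, "orders": orders}, join_type="left")
```

With a left join, every row of the first table is kept, so the person without an order still appears with a missing item value.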

close()

Closes the connection to the DuckDB database.

Return type:

None

Returns:

None

create_database(table_paths)

Creates tables in the DuckDB database from CSV or Excel files.

Parameters:

table_paths (Dict[str, Path]) – A dictionary mapping table names to file paths.

This method loads data from the specified file paths and creates tables in the database. The file paths must point to CSV or Excel files.

list_all_tables()

Lists all tables in the database.

Returns:

A list of table names in the database.

Return type:

List[str]

query_tables_from_path(path, table_paths, join_type='left')

Queries multiple tables specified in the path and joins them using the specified join type.

Parameters:
  • path (List[str]) – A list of table names to include in the query.

  • table_paths (Dict[str, Path]) – A dictionary mapping table names to file paths.

  • join_type (str) – The type of join to perform. Defaults to ‘left’.

Returns:

A DataFrame containing the result of the join operation.

Return type:

pd.DataFrame

Raises:

ValueError – If no valid tables are found in the provided path.
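
One plausible reading of the ValueError condition is that entries of path which do not name a known table are skipped, and the call fails only when nothing remains. The sketch below illustrates that reading; the helper select_valid_tables and the column node "customer_id" are hypothetical and not part of the module.

```python
from pathlib import Path
from typing import Dict, List


def select_valid_tables(path: List[str], table_paths: Dict[str, Path]) -> List[str]:
    """Keep the entries of `path` that name a known table, preserving order."""
    valid = [name for name in path if name in table_paths]
    if not valid:
        raise ValueError("No valid tables found in the provided path.")
    return valid


table_paths = {
    "table1": Path("path/to/table1.csv"),
    "table2": Path("path/to/table2.csv"),
}
# Assumption: a path taken from the graph may also contain column nodes,
# such as the hypothetical "customer_id", which are not tables.
tables = select_valid_tables(["table1", "customer_id", "table2"], table_paths)
```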

class Consensus.LocalMerger.GraphBuilder(directory_path)

Bases: object

A class to build and manage a graph from CSV and Excel files in a directory.

This class constructs a graph where nodes are tables and columns, and edges represent relationships between them. It provides methods to find paths between tables or columns.

directory_path

The path to the directory containing the data files.

Type:

Path

graph

The graph representing the relationships between tables and columns.

Type:

nx.DiGraph

__init__(directory_path: str): Initializes the GraphBuilder with a directory containing CSV and Excel files.

_build_graph(): Scans the directory for CSV and Excel files and builds the graph.

_process_csv(file_path: Path): Processes a CSV file and updates the graph with table and column information.

_process_excel(file_path: Path): Processes an Excel file and updates the graph with table and column information.

_process_dataframe(df: pd.DataFrame, table_name: str, file_path: Path): Processes a DataFrame and updates the graph with table and column relationships.

get_table_paths(): Returns a dictionary of table names and their corresponding file paths.

bfs_paths(start: str, end: str): Finds all paths between the start and end nodes using breadth-first search (BFS).

find_paths(start: str, end: str, by: str = 'table'): Finds all paths between the start and end nodes, either by table name or column name.

get_full_graph(): Returns the full graph with all nodes and edges.

get_all_possible_paths(start: str, end: str, by: str = 'table'): Outputs all possible paths based on start and end, by table or column.

choose_path(paths: List[List[str]], index: int): Allows the user to choose a path from a list of paths by specifying the index.

Usage:

from Consensus.LocalMerger import GraphBuilder

graph_builder = GraphBuilder('path/to/directory')
paths = graph_builder.find_paths('table1', 'table2')
all_paths = graph_builder.get_all_possible_paths('table1', 'table2')

__init__(directory_path)

Initializes the GraphBuilder with a directory containing CSV and Excel files.

Parameters:

directory_path (str) – The path to the directory containing the data files.

_build_graph()

Scans the directory for CSV and Excel files and builds the graph.

This method iterates through all CSV and Excel files in the specified directory, processes them, and adds the tables and columns to the graph.

Return type:

None

Returns:

None
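
To make the scanning step concrete, here is a minimal sketch that builds a table-and-column graph from CSV headers alone. It is an illustration, not the module's code: a plain dictionary of sets stands in for nx.DiGraph, Excel files are ignored, and the function name build_graph, the use of the file stem as the table name, and the sample files are all assumptions.

```python
import csv
import tempfile
from pathlib import Path
from typing import Dict, Set


def build_graph(directory: Path) -> Dict[str, Set[str]]:
    """Link each CSV file (a table node) to its header columns, in both directions."""
    graph: Dict[str, Set[str]] = {}
    for file_path in sorted(directory.glob("*.csv")):
        table = file_path.stem  # assumption: table name is the file name without extension
        with file_path.open(newline="") as handle:
            header = next(csv.reader(handle), [])  # first row holds the column names
        graph.setdefault(table, set())
        for column in header:
            graph[table].add(column)
            graph.setdefault(column, set()).add(table)
    return graph


# Hypothetical data: two tables that share the "customer_id" column.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "customers.csv").write_text("customer_id,name\n1,Ann\n")
    (root / "orders.csv").write_text("order_id,customer_id\n10,1\n")
    graph = build_graph(root)
```

A column that appears in more than one file ends up connected to several tables, which is what later makes a path between those tables possible.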

_process_csv(file_path)

Processes a CSV file and updates the graph with table and column information.

Parameters:

file_path (Path) – The path to the CSV file to process.

Return type:

None

Returns:

None

_process_dataframe(df, table_name, file_path)

Processes a DataFrame and updates the graph with table and column relationships.

Parameters:
  • df (pd.DataFrame) – The DataFrame containing the table’s data.

  • table_name (str) – The name of the table.

  • file_path (Path) – The path to the data file.

Return type:

None

Returns:

None

_process_excel(file_path)

Processes an Excel file and updates the graph with table and column information.

Parameters:

file_path (Path) – The path to the Excel file to process.

Return type:

None

Returns:

None

bfs_paths(start, end)

Finds all paths between the start and end nodes using breadth-first search (BFS).

Parameters:
  • start (str) – The starting node.

  • end (str) – The ending node.

Returns:

A list of paths from start to end nodes.

Return type:

List[List[str]]

Notes

This method finds all simple paths between nodes, not necessarily the shortest.
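
The idea of enumerating every simple path breadth-first can be sketched without any dependencies. This is an assumption-laden illustration rather than the method's source: the graph is a plain adjacency dictionary instead of nx.DiGraph, and the function name bfs_simple_paths and the sample nodes are hypothetical.

```python
from collections import deque
from typing import Dict, List


def bfs_simple_paths(graph: Dict[str, List[str]], start: str, end: str) -> List[List[str]]:
    """Return every simple path from start to end, shortest paths first."""
    if start not in graph:
        return []
    paths: List[List[str]] = []
    queue = deque([[start]])
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == end:
            paths.append(path)
            continue
        for neighbour in graph.get(node, []):
            if neighbour not in path:  # never revisit a node, so paths stay simple
                queue.append(path + [neighbour])
    return paths


# Hypothetical graph: two tables linked through two shared columns.
graph = {
    "table1": ["id", "region"],
    "table2": ["id", "region"],
    "id": ["table1", "table2"],
    "region": ["table1", "table2"],
}
paths = bfs_simple_paths(graph, "table1", "table2")
```

Because partial paths are dequeued in order of length, shorter paths are reported before longer ones, but every simple path is returned.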

choose_path(paths, index)

Allows the user to choose a path from a list of paths by specifying the index.

Parameters:
  • paths (List[List[str]]) – A list of possible paths.

  • index (int) – The index of the chosen path.

Returns:

The chosen path.

Return type:

List[str]

Raises:

IndexError – If the provided index is out of range for the list of paths.
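
A minimal sketch of the selection step is shown below as a standalone function. How the real method treats negative indices is not documented, so rejecting them here is an assumption.

```python
from typing import List


def choose_path(paths: List[List[str]], index: int) -> List[str]:
    """Return paths[index], rejecting indices outside 0..len(paths)-1."""
    # Assumption: negative indices count as out of range instead of
    # selecting from the end, which plain list indexing would allow.
    if not 0 <= index < len(paths):
        raise IndexError(f"Index {index} is out of range for {len(paths)} path(s).")
    return paths[index]


# Hypothetical candidate paths between two tables.
candidates = [["table1", "id", "table2"], ["table1", "region", "table2"]]
chosen = choose_path(candidates, 1)
```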

find_paths(start, end, by='table')

Finds all paths between the start and end nodes, either by table name or column name.

Parameters:
  • start (str) – The starting node.

  • end (str) – The ending node.

  • by (str) – Specifies whether to search by ‘table’ or ‘column’. Defaults to ‘table’.

Returns:

A list of paths between start and end nodes.

Return type:

List[List[str]]

Raises:

ValueError – If the search type is not supported (e.g., ‘column’ is used but no columns exist).

get_all_possible_paths(start, end, by='table')

Outputs all possible paths based on start and end, by table or column.

Parameters:
  • start (str) – The starting node.

  • end (str) – The ending node.

  • by (str) – Specifies whether to search by ‘table’ or ‘column’. Defaults to ‘table’.

Returns:

A list of all possible paths from start to end.

Return type:

List[List[str]]

get_full_graph()

Returns the full graph with all nodes and edges.

Returns:

The full graph object containing all nodes and edges.

Return type:

nx.Graph

get_table_paths()

Returns a dictionary of table names and their corresponding file paths.

Returns:

A dictionary where the keys are table names and the values are file paths.

Return type:

Dict[str, Path]