Consensus.LocalMerger module
This module is not yet fully implemented and is a work in progress. Please let me know if you would like to contribute.
- class Consensus.LocalMerger.DatabaseManager(db_path)
Bases:
object
A class to manage a DuckDB database, including creating tables from file data and querying tables using various join types.
- db_path
Path to the DuckDB database file.
- Type:
str
- conn
Connection to the DuckDB database.
- Type:
duckdb.DuckDBPyConnection
- __init__(db_path: str): Initializes the DatabaseManager with the provided database path.
- create_database(table_paths: Dict[str, Path]): Creates tables in the DuckDB database from CSV or Excel files.
- query_tables_from_path(path: List[str], table_paths: Dict[str, Path], join_type: str = 'left'): Queries multiple tables specified in the path and joins them using the specified join type.
- query_tables_from_graph(graph: nx.DiGraph, join_type: str = 'left'): Queries tables based on a directed graph and joins them using the specified join type.
- query_tables_from_dict(graph: Dict[str, List[str]], join_type: str = 'left'): Queries tables based on a dictionary representation of a graph and joins them using the specified join type.
Usage:
from pathlib import Path
from Consensus.LocalMerger import DatabaseManager
table_paths = {'table1': Path('path/to/table1.csv'), 'table2': Path('path/to/table2.csv')}
db_manager = DatabaseManager('path/to/database.db')
db_manager.create_database(table_paths)
result = db_manager.query_tables_from_path(['table1', 'table2'], table_paths, 'left')
- __init__(db_path)
Initializes the DatabaseManager with the provided database path.
- Parameters:
db_path (str) – The path to the DuckDB database.
- _join_tables(dfs, join_type='left')
Joins multiple DataFrames using the specified join type.
- Parameters:
dfs (Dict[str, pd.DataFrame]) – A dictionary of table names to DataFrames to be joined.
join_type (str) – The type of join to perform (e.g., ‘inner’, ‘outer’). Defaults to ‘left’.
- Returns:
The resulting DataFrame after performing all joins.
- Return type:
pd.DataFrame
- Raises:
ValueError – If no common columns are found between DataFrames for joining.
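The join described above can be illustrated with a minimal sketch that folds a dictionary of DataFrames into one, merging each new frame on whatever columns it shares with the result so far. This is not the module's actual implementation: the helper name `join_on_common_columns` and the sample tables are hypothetical, and the real `_join_tables` may select its join keys differently.

```python
from typing import Dict

import pandas as pd


def join_on_common_columns(dfs: Dict[str, pd.DataFrame], join_type: str = "left") -> pd.DataFrame:
    """Hypothetical helper: merge DataFrames in dict order on their shared columns."""
    frames = list(dfs.values())
    if not frames:
        raise ValueError("No DataFrames to join.")
    result = frames[0]
    for df in frames[1:]:
        # Columns present in both the accumulated result and the next frame.
        common = [col for col in result.columns if col in df.columns]
        if not common:
            raise ValueError("No common columns found between DataFrames for joining.")
        result = result.merge(df, on=common, how=join_type)
    return result


# Made-up sample data, for illustration only.
areas = pd.DataFrame({"area_code": ["A1", "A2", "A3"], "area_name": ["North", "South", "East"]})
population = pd.DataFrame({"area_code": ["A1", "A2"], "population": [1200, 3400]})

joined = join_on_common_columns({"areas": areas, "population": population}, join_type="left")
print(joined)
```

With a left join, every row of the first table is kept and unmatched rows receive missing values in the columns contributed by later tables; an inner join would keep only the rows that match in every table.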
- close()
Closes the connection to the DuckDB database.
- Return type:
None
- Returns:
None
- create_database(table_paths)
Creates tables in the DuckDB database from CSV or Excel files.
- Parameters:
table_paths (Dict[str, Path]) – A dictionary mapping table names to file paths.
This method loads data from the specified file paths and creates tables in the database. The file paths must point to CSV or Excel files.
- list_all_tables()
Lists all tables in the database.
- Returns:
A list of table names in the database.
- Return type:
List[str]
- query_tables_from_path(path, table_paths, join_type='left')
Queries multiple tables specified in the path and joins them using the specified join type.
- Parameters:
path (List[str]) – A list of table names to include in the query.
table_paths (Dict[str, Path]) – A dictionary mapping table names to file paths.
join_type (str) – The type of join to perform. Default is ‘left’.
- Returns:
A DataFrame containing the result of the join operation.
- Return type:
pd.DataFrame
- Raises:
ValueError – If no valid tables are found in the provided path.
- class Consensus.LocalMerger.GraphBuilder(directory_path)
Bases:
object
A class to build and manage a graph from CSV and Excel files in a directory.
This class constructs a graph where nodes are tables and columns, and edges represent relationships between them. It provides methods to find paths between tables or columns.
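To make the table and column graph more concrete, the sketch below builds a small adjacency mapping from made-up table headers, using plain dictionaries rather than networkx. The helper name `build_table_column_graph`, the `table:` and `column:` node prefixes, and the choice to store each edge in both directions are assumptions for illustration; the class itself may name and connect nodes differently.

```python
from collections import defaultdict
from typing import Dict, List, Set


def build_table_column_graph(headers: Dict[str, List[str]]) -> Dict[str, Set[str]]:
    """Hypothetical helper: link every table node to the column nodes it contains."""
    graph: Dict[str, Set[str]] = defaultdict(set)
    for table, columns in headers.items():
        for column in columns:
            # Store the edge in both directions so paths can pass through columns.
            graph[f"table:{table}"].add(f"column:{column}")
            graph[f"column:{column}"].add(f"table:{table}")
    return dict(graph)


# Made-up headers, for illustration only.
headers = {
    "table1": ["area_code", "area_name"],
    "table2": ["area_code", "population"],
}
graph = build_table_column_graph(headers)

# A column connected to more than one table is a candidate join key.
shared = sorted(
    node for node, neighbours in graph.items()
    if node.startswith("column:") and len(neighbours) > 1
)
print(shared)  # ['column:area_code']
```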
- directory_path
The path to the directory containing the data files.
- Type:
Path
- graph
The graph representing the relationships between tables and columns.
- Type:
nx.DiGraph
- __init__(directory_path: str): Initializes the GraphBuilder with a directory containing CSV and Excel files.
- _build_graph(): Scans the directory for CSV and Excel files and builds the graph.
- _process_csv(file_path: Path): Processes a CSV file and updates the graph with table and column information.
- _process_excel(file_path: Path): Processes an Excel file and updates the graph with table and column information.
- _process_dataframe(df: pd.DataFrame, table_name: str, file_path: Path): Processes a DataFrame and updates the graph with table and column relationships.
- get_table_paths(): Returns a dictionary of table names and their corresponding file paths.
- bfs_paths(start: str, end: str): Finds all paths between the start and end nodes using breadth-first search (BFS).
- find_paths(start: str, end: str, by: str = 'table'): Finds all paths between the start and end nodes, either by table name or column name.
- get_full_graph(): Returns the full graph with all nodes and edges.
- get_all_possible_paths(start: str, end: str, by: str = 'table'): Outputs all possible paths based on start and end, by table or column.
- choose_path(paths: List[List[str]], index: int): Allows the user to choose a path from a list of paths by specifying the index.
Usage:
from Consensus.LocalMerger import GraphBuilder
graph_builder = GraphBuilder('path/to/directory')
graph_builder.find_paths('table1', 'table2')
graph_builder.get_all_possible_paths('table1', 'table2')
- __init__(directory_path)
Initializes the GraphBuilder with a directory containing CSV and Excel files.
- Parameters:
directory_path (str) – The path to the directory containing the data files.
- _build_graph()
Scans the directory for CSV and Excel files and builds the graph.
This method iterates through all CSV and Excel files in the specified directory, processes them, and adds the tables and columns to the graph.
- Return type:
None
- Returns:
None
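The directory scan described above might look roughly like the following sketch, which collects CSV and Excel files from a folder and maps a table name to each file path. The helper name `find_data_files`, the set of accepted suffixes, and the use of the file stem as the table name are all assumptions; the actual method may name tables differently, for example per Excel sheet.

```python
import tempfile
from pathlib import Path
from typing import Dict

# Assumed set of supported extensions; the module may accept others.
SUPPORTED_SUFFIXES = {".csv", ".xlsx", ".xls"}


def find_data_files(directory: Path) -> Dict[str, Path]:
    """Hypothetical helper: map table names (file stems) to data file paths."""
    files: Dict[str, Path] = {}
    for path in sorted(directory.iterdir()):
        if path.is_file() and path.suffix.lower() in SUPPORTED_SUFFIXES:
            files[path.stem] = path
    return files


# Demonstrate on a throwaway directory with one data file and one unrelated file.
with tempfile.TemporaryDirectory() as tmp:
    tmp_path = Path(tmp)
    (tmp_path / "table1.csv").write_text("area_code,area_name\n")
    (tmp_path / "notes.txt").write_text("not a data file\n")
    found = find_data_files(tmp_path)
    print(sorted(found))
```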
- _process_csv(file_path)
Processes a CSV file and updates the graph with table and column information.
- Parameters:
file_path (Path) – The path to the CSV file to process.
- Return type:
None
- Returns:
None
- _process_dataframe(df, table_name, file_path)
Processes a DataFrame and updates the graph with table and column relationships.
- Parameters:
df (pd.DataFrame) – The DataFrame containing the table’s data.
table_name (str) – The name of the table.
file_path (Path) – The path to the data file.
- Return type:
None
- Returns:
None
- _process_excel(file_path)
Processes an Excel file and updates the graph with table and column information.
- Parameters:
file_path (Path) – The path to the Excel file to process.
- Return type:
None
- Returns:
None
- bfs_paths(start, end)
Finds all paths between the start and end nodes using breadth-first search (BFS).
- Parameters:
start (str) – The starting node.
end (str) – The ending node.
- Returns:
A list of paths from start to end nodes.
- Return type:
List[List[str]]
Notes
This method returns every simple path (a path that visits no node more than once) between the two nodes, not only the shortest one.
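As a rough illustration of a breadth-first search that enumerates all simple paths, the sketch below works on a plain adjacency dictionary. The function name `bfs_simple_paths` and the sample graph are hypothetical and are not taken from the module, which operates on its own graph object.

```python
from collections import deque
from typing import Dict, List


def bfs_simple_paths(graph: Dict[str, List[str]], start: str, end: str) -> List[List[str]]:
    """Hypothetical helper: list every simple path from start to end, shortest first."""
    paths: List[List[str]] = []
    queue = deque([[start]])
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == end:
            paths.append(path)
            continue
        for neighbour in graph.get(node, []):
            # Skip nodes already on this path so that no node is visited twice.
            if neighbour not in path:
                queue.append(path + [neighbour])
    return paths


# Made-up graph in which tables are linked through the columns they share.
graph = {
    "table1": ["area_code"],
    "area_code": ["table1", "table2", "table3"],
    "table2": ["area_code", "year"],
    "table3": ["area_code", "year"],
    "year": ["table2", "table3"],
}
paths = bfs_simple_paths(graph, "table1", "table2")
for path in paths:
    print(" -> ".join(path))
```

Because the queue is processed in first-in, first-out order, shorter paths are found before longer ones, but the search keeps going until every simple path has been collected.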
- choose_path(paths, index)
Allows the user to choose a path from a list of paths by specifying the index.
- Parameters:
paths (List[List[str]]) – A list of possible paths.
index (int) – The index of the chosen path.
- Returns:
The chosen path.
- Return type:
List[str]
- Raises:
IndexError – If the provided index is out of range for the list of paths.
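The selection step could be sketched as below. The bounds check shown here, which also rejects negative indices, is an assumption made for illustration; the actual method may simply defer to Python's list indexing.

```python
from typing import List


def choose_path(paths: List[List[str]], index: int) -> List[str]:
    """Illustrative stand-in: return the path at the given index or raise IndexError."""
    if not 0 <= index < len(paths):
        raise IndexError(f"Index {index} is out of range for {len(paths)} path(s).")
    return paths[index]


# Made-up candidate paths, for illustration only.
candidate_paths = [
    ["table1", "area_code", "table2"],
    ["table1", "area_code", "table3", "year", "table2"],
]
chosen = choose_path(candidate_paths, 0)
print(chosen)
```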
- find_paths(start, end, by='table')
Finds all paths between the start and end nodes, either by table name or column name.
- Parameters:
start (str) – The starting node.
end (str) – The ending node.
by (str) – Specifies whether to search by ‘table’ or ‘column’. Defaults to ‘table’.
- Returns:
A list of paths between start and end nodes.
- Return type:
List[List[str]]
- Raises:
ValueError – If the search type is not supported (e.g., ‘column’ is used but no columns exist).
- get_all_possible_paths(start, end, by='table')
Outputs all possible paths based on start and end, by table or column.
- Parameters:
start (str) – The starting node.
end (str) – The ending node.
by (str) – Specifies whether to search by ‘table’ or ‘column’. Defaults to ‘table’.
- Returns:
A list of all possible paths from start to end.
- Return type:
List[List[str]]
- get_full_graph()
Returns the full graph with all nodes and edges.
- Returns:
The full graph object containing all nodes and edges.
- Return type:
nx.Graph
- get_table_paths()
Returns a dictionary of table names and their corresponding file paths.
- Returns:
A dictionary where the keys are table names and the values are file paths.
- Return type:
Dict[str, Path]