Acquisition
RIDB data consists of two data sets: facilities and historic reservations.
Facility data is provided through a RESTful API for real-time use, and is also available as a complete set of csv files in a zip archive. Since real-time access is not needed for this application, the zip archive is used; a brief sketch of the API path is shown below for reference.
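The following is a minimal sketch of the real-time alternative, assuming the documented RIDB v1 facilities endpoint and an API key obtained from recreation.gov (the key value here is a placeholder):

import requests

## hypothetical key; RIDB expects it in the apikey request header
resp = requests.get(
    "https://ridb.recreation.gov/api/v1/facilities",
    headers={"apikey": "your-ridb-api-key"},
    params={"limit": 50, "offset": 0},  ## results are paginated
)
facilities = resp.json()["RECDATA"]  ## RIDB wraps record lists in RECDATA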
A Python program retrieves the zip file via the requests library. Once retrieved, the file is extracted and read into memory.
A connection to the MySQL database is established.
Each of the 21 files is read into a dataframe in turn; each dataframe is then written to the database's matching table via pandas' to_sql call.
import pandas as pd
import sqlalchemy as sq
import requests
from io import BytesIO
from zipfile import ZipFile

## get zip file of daily facility updates from RIDB
url = "https://ridb.recreation.gov/downloads/RIDBFullExport_V1_CSV.zip"
content = requests.get(url)
ridbzip = ZipFile(BytesIO(content.content))
# print(ridbzip.namelist())

## database engine connection string:
##> {dialect}+{driver}://{user}:{password}@{host}:{port}/{database}
database = "brianwf2_recreation_data"
user = "xxxxxxxxxxxxxxxxxx"
password = "pppppppppppp"
## local dev
# host = "127.0.0.1"
# port = 8889
## prod
host = "xxx.xxx.xxx.xxx"
port = "3306"
engine = sq.create_engine(f"mysql+pymysql://{user}:{password}@{host}:{port}/{database}")

for i, name in enumerate(ridbzip.namelist()):
    with ridbzip.open(name) as csv:
        print(f"{i}-------{name}")
        table_name = name.replace(".csv", "")
        ## if table exists, clear it so the dataframe can append to an empty table
        if sq.inspect(engine).has_table(table_name):
            with engine.begin() as conn:  ## begin() commits the DELETE on exit
                conn.execute(sq.text(f"DELETE FROM {table_name}"))
        ## first csv column becomes the dataframe index and is written back as a column
        df = pd.read_csv(csv, index_col=0)
        df_index_count = len(df.index)
        ## to_sql returns the number of rows affected (pandas >= 1.4)
        row_count = df.to_sql(table_name, engine, if_exists='append')
        print(f"{row_count} rows inserted in {table_name} for {df_index_count} in dataframe")

print("run ended")
Historic Reservations Data
Reservation data is provided by RIDB as annual csv files. Due to the size of these files, loading via dataframe proved very slow and inefficient, so a different tool was chosen for both efficiency and ease. KNIME is an open-source data science tool that uses graphical workflows to move and analyze data.
The files for 2019 through 2022 were downloaded and unzipped. A KNIME workflow opened each file, matched data types, and filled in missing values for constants. It then used a MySQL LOAD DATA bulk load to insert the data. Due to network and database stability issues, this was done in batches of 10,000 rows via a looped output; a rough Python equivalent of the batched load is sketched below.
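For illustration only, the batched bulk load can be approximated in Python. This is a minimal sketch, not the KNIME workflow itself: the reservations2019.csv file name, the reservations table, and the connection details are assumptions, and LOCAL INFILE must be enabled on both the client and the MySQL server.

import csv
import os
import tempfile
import pymysql

BATCH = 10000  ## rows per bulk load, matching the KNIME workflow

## hypothetical connection details; local_infile enables LOAD DATA LOCAL
conn = pymysql.connect(host="xxx.xxx.xxx.xxx", user="user", password="password",
                       database="brianwf2_recreation_data", local_infile=True)

def load_batch(rows):
    ## write the batch to a temp csv and bulk load it with LOAD DATA
    with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False, newline="") as tmp:
        csv.writer(tmp).writerows(rows)
        path = tmp.name
    with conn.cursor() as cur:
        cur.execute(
            f"LOAD DATA LOCAL INFILE '{path}' INTO TABLE reservations "
            "FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"'"
        )
    conn.commit()
    os.remove(path)

with open("reservations2019.csv", newline="") as f:
    reader = csv.reader(f)
    next(reader)  ## skip the header row
    batch = []
    for row in reader:
        batch.append(row)
        if len(batch) == BATCH:
            load_batch(batch)
            batch = []
    if batch:
        load_batch(batch)  ## final partial batch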