LAANE Project
Constructing a data pipeline to facilitate the exploration of housing instability and short-term rentals in the city of LA.
Author: Karina López
Collaborator: Albert Ulysses Chavez
Objective
The purpose of this project is to establish a centralized database of open and closed-source datasets to support analyses of housing instability and short-term rentals in the city of LA. This page gives an overview of the process of building this database for the Los Angeles Alliance for a New Economy (LAANE), in collaboration with Albert U. Chavez from Hack for LA.
Problem
LAANE would like access to a database built from over 20 datasets. A few of these datasets will be static, while others will be updated at varying frequencies (e.g., monthly, yearly). We are also tasked with automating as much of the data cleaning process as possible. Finally, the completed product should allow analysts to query the datasets to answer questions such as the following:
When did an Airbnb host register with the city of LA?
What complaints (if any) are associated with a specific Airbnb listing?
Which Airbnb listings are not registered with the city of LA?
and many more.
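To make the target concrete, here is a minimal sketch of how the third question could be answered once the data sits in SQLite. The table names (listings, registrations), the column names, and the shared listing_id key are all hypothetical assumptions for illustration; the actual schema is not yet established, and the real datasets may not share such a key without extra matching work.

```python
import sqlite3

# Hypothetical schema: table and column names are illustrative assumptions,
# not the project's actual design.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE listings (listing_id INTEGER PRIMARY KEY, host_name TEXT);
    CREATE TABLE registrations (listing_id INTEGER, registered_on TEXT);
""")
conn.executemany(
    "INSERT INTO listings VALUES (?, ?)",
    [(1, "host_a"), (2, "host_b"), (3, "host_c")],
)
conn.executemany(
    "INSERT INTO registrations VALUES (?, ?)",
    [(1, "2019-11-01"), (3, "2020-02-15")],
)


def unregistered_listings(conn):
    """Return IDs of listings that have no matching registration record."""
    rows = conn.execute("""
        SELECT l.listing_id
        FROM listings AS l
        LEFT JOIN registrations AS r ON r.listing_id = l.listing_id
        WHERE r.listing_id IS NULL
        ORDER BY l.listing_id
    """).fetchall()
    return [row[0] for row in rows]


print(unregistered_listings(conn))
```

A LEFT JOIN filtered on a NULL right-hand key is one common way to express "rows with no match"; a NOT EXISTS subquery would work equally well.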
Datasets
This database is constructed from over 20 datasets obtained from 3 separate sources: the city of LA, short-term housing platforms, and Inside Airbnb. Depicted below are the 3 data sources and their respective datasets, organized under 10 broad categories.
Complexities
Listed below are some of the complexities we encounter in constructing this database:
Noisy and messy datasets
Multiple sources lead to inconsistent data-collection methods
No immediate shared columns
Dataset volumes vary, and many datasets are very large
Datasets are updated at varying frequencies
Individual data points may enter and leave the database at multiple time points and will need to be tracked
Column names may differ between successive pulls of the same dataset
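As an illustration of the tracking problem listed above, the sketch below derives, for each record, the continuous runs of snapshots in which it was present. It assumes each pull can be reduced to a dated set of stable identifiers; the function name presence_intervals and the sample IDs are hypothetical and not part of the project's code.

```python
def presence_intervals(snapshots):
    """Compute the continuous presence runs of each ID across snapshots.

    snapshots: list of (date_string, set_of_ids), sorted by date.
    Returns {id: [[first_date, last_date], ...]} with one interval per run.
    """
    intervals = {}
    previous_ids = set()
    for date, ids in snapshots:
        for item_id in ids:
            if item_id in previous_ids:
                # Still present since the previous snapshot: extend the run.
                intervals[item_id][-1][1] = date
            else:
                # New or returning ID: open a fresh interval.
                intervals.setdefault(item_id, []).append([date, date])
        previous_ids = set(ids)
    return intervals


# Hypothetical monthly pulls: listing "b" disappears in February and returns in March.
pulls = [
    ("2020-01", {"a", "b"}),
    ("2020-02", {"a"}),
    ("2020-03", {"a", "b"}),
]
history = presence_intervals(pulls)
```

Storing intervals rather than a single first-seen and last-seen pair keeps the information that a record left and later came back.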
Process
Our process for creating the database architecture is provided below:
Identify questions stakeholders would like to answer with data
Identify data to keep or exclude from the SQLite database
Propose a database structure that optimizes storage and ease of access for addressing stakeholder needs
Prepare automated and single-use data cleaning scripts
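As a rough idea of what an automated cleaning step might look like, the sketch below maps inconsistent column headers onto canonical names, trims whitespace, and drops empty rows. The alias table, the canonical names, and the function names are hypothetical assumptions; the project's actual cleaning scripts are not shown on this page.

```python
import csv
import io

# Hypothetical alias map from header variants to canonical column names.
COLUMN_ALIASES = {
    "listing id": "listing_id",
    "id": "listing_id",
    "host": "host_name",
}


def normalize_header(name):
    """Lowercase, collapse whitespace and underscores, then apply known aliases."""
    key = " ".join(name.strip().lower().replace("_", " ").split())
    return COLUMN_ALIASES.get(key, key.replace(" ", "_"))


def clean_rows(csv_text):
    """Parse CSV text into dicts with canonical headers and trimmed values."""
    reader = csv.reader(io.StringIO(csv_text))
    header = [normalize_header(h) for h in next(reader)]
    cleaned = []
    for row in reader:
        values = [value.strip() for value in row]
        if not any(values):
            continue  # skip rows that are entirely empty
        cleaned.append(dict(zip(header, values)))
    return cleaned


sample = "Listing ID, Host \n 1 , Ana \n,\n"
rows = clean_rows(sample)
```

Keeping the alias map as data rather than code means a renamed column in a new pull only requires adding one entry.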
Proposed database architecture
In progress…this page will be updated as soon as a structure is established! Cleaning scripts will also be provided at the repo linked below.