ETL Transformation to Third Normal Form for Philadelphia Building Demolition Permits
This ETL transformation was performed as an exercise in demonstrating proficiency using SQL in the PostgreSQL relational database with the PostGIS spatial extensions. I selected a spatial dataset of building demolition permits to load into the database and normalize to Third Normal Form. Once in place, this data can be combined with a secondary dataset on children with elevated blood lead levels to investigate correlations between demolition activity and elevated lead levels.
Datasets Used
The dataset being used is a set of point data on building demolition permits from the Philadelphia Department of Licenses and Inspections. This dataset was downloaded from OpenDataPhilly in the form of a shape file. This is a rich set of data related to demolition permits such as information on the property owner, specifics on the parcel in question, demolition contractor, and a variety of permit and status information. Because this is a spatial dataset, this table contains a geom field, which contains specially formatted coordinate information for the location of each demolition permit. It is not stored in a human-readable form, but there are tools to decode it into text.
Because so much data is being stored in one attribute table, it was in need of normalization to the third normal form. There are a number of fields storing repeated strings of data, such as those tracking permit status and type of demolition. There are also a number of fields such as that storing information on demolition contractors that clearly have multiple rows with information on the same contractor entered differently. Not all of this will be fixed by normalization, but it is a crucial first step towards enforcing constraints and organizing the existing data.
By itself, this dataset can provide interesting insights into construction activity throughout the city. Analysis of private contractor demolitions could provide valuable information into the extent to which different neighborhoods are experiencing turnover and new construction. Looking at demolitions being performed by the city as a result of a code violation could provide very useful information on which sections of the city are more at risk by properties in dangerous states of disrepair. Lastly, analyzing permits for removing an underground oil storage tank would be invaluable for followup on environmental issues. Ensuring that these tanks have been removed cleanly and safely is an important step in ensuring the safety and health of neighborhood residents.
A secondary dataset on blood lead levels in Philadelphia’s children was obtained from the Philadelphia Department of Public Health for analysis purposes. This dataset is much simpler, and did not require normalization. This dataset contains the raw number of children under 6 with blood lead levels of 5 or above, the total screened, and the percentage of children under 6 with blood lead levels of 5 or above. Since a common source of lead comes from older homes, these two datasets together could be very important for directing monitoring and outreach resources for a very important public health issue.
An entity-relationship diagram for these two datasets can be found in Figures 1 and 2.
ETL process
To transform the building demolition data into the third normal form, six additional tables had to be created. Four of them stored categorical data assigning one of a small number of text-based options to the permit. Because of this, we have the same strings of text repeated thousands of times spread throughout the data, for instance, each row in our table contains either COMPLETED or ACTIVE in the status field. Any time you are repeatedly entering text, this creates an opportunity for error, which can negatively impact data quality and consistency. Repeated strings of text can also waste storage space, which may be less of a concern than it used to, but for a very large database, it could still add up to significant additional cost. The fields record_typ, typeofwork, status, and applicantc in the original dataset have been moved to lookup tables called record_type, type_of_work, status, and applicant_capacity. These lookup tables contain one each of the original categories, and an assigned integer to represent these text categories. The lookup table for the status field is demonstrated in Table 1.
Two new tables were also necessary for the property ownership information from the Office of Property Assessment (OPA), and information on the licensed contractor. In the case of both of these, information such as name, address, and other contact and identifying information was stored for every row. This was especially problematic in the case of the contractor information, as a large licensed contractor could have dozens to hundreds of demolition jobs in their history. There were a number of contractors entered into the table many times with slightly different spellings of name or address information, and a few even had information in the incorrect fields. To correct this, two tables titled opa_account and contractor were created to consolidate this information. Unique information was only entered once into each table, so in theory if a contractor’s information has been entered exactly the same every time, they will only have one row instead of possibly hundreds. In practice, this did not happen, so there will still be data inconsistencies to sort out manually. However, as a first pass this should significantly reduce the size of the tables. All six of these tables were created with automatically incrementing primary key fields.
The remaining fields from the original li_demolitions table were all moved to a new li_demolitions_norm table, and six foreign key columns were created with the primary keys from these six new tables. This will enable users to perform a join between the primary table and lookup table to discover that a status of 1 equates to ACTIVE.
Table 1: Status Table
status_id | status |
---|---|
1 | ACTIVE |
2 | COMPLETED |
Analytical Queries
Query 1: Percentage of elevated blood levels and demolition cases by census tract
This query returns three columns of data:
- A three-digit census tract number
- The percentage of screened children under the age of six with blood lead levels greater than five for the tract in question
- A count of total demolition permits for the tract in question
In the blood lead level table, the census tract number is stored as the full eleven character identifier, including state and city information. Since we are only dealing with Philadelphia, all but the last three characters of the identifier were discarded using the substring function. The data on elevated lead levels was taken directly from the table unmodified. The count of demolition permits was obtained by counting the total number of records in the li_demolitions_norm table. To make this analysis possible, the li_demolitions_norm and child_blood_lead_levels_by_ct were joined using the ST_CONTAINS spatial function. By looking at each table’s geom field, this function is joining the two tables based on which building demolitions occurred in each census tract.
To help ensure an accurate resulting dataset, using a WHERE clause we exclude all null values in the field containing the percentage of children with elevated blood lead levels, and the demolition permits are restricted to the years of 2013 - 2015 to match the years contained in the child_blood_lead_levels_by_ct data. To help highlight the most active census tracts, the resulting dataset was sorted in descending order by count of demolition permits.
The resulting dataset could be used to investigate potential correlations between the total count of building demolitions and child blood lead levels, which could be very important for people working in public health and those crafting construction regulations.
SELECT substring(census_tra, 7, 3) AS tract, cb.perc_5plus, count(ldn.*)AS demo_cases
FROM child_blood_lead_levels_by_ct cb
JOIN li_demolitions_norm ldn
ON ST_CONTAINS(cb.geom, ldn.geom)
WHERE cb.perc_5plus IS NOT NULL
AND ldn.completed_date BETWEEN '2013-01-01' AND '2015-12-31'
GROUP BY tract, cb.perc_5plus
ORDER BY demo_cases DESC
Query 2: Elevated blood levels, demolitions performed by the city, and owner-requested demolitions by census tract
This query returns four columns of data, ordered by ascending census tract identifier:
- A three-digit census tract number
- The percentage of screened children under the age of six with blood lead levels greater than five for the tract in question
- The total count of demolitions performed by the city through the standard bid process in reaction to a code violation notice
- The total count of demolitions performed by choice by the property owner
In the blood lead level table, the census tract number is stored as the full eleven character identifier, including state and city information. Since we are only dealing with Philadelphia, all but the last three characters of the identifier were discarded using the substring function. The data on elevated lead levels was taken directly from the table unmodified.
The other two values are obtained from subqueries in the SELECT statement. In the first subquery, the city performed demolitions data is selected by counting all rows in the li_demolitions_norm table and limiting it in a WHERE clause. First, it requires a record_type_id of 1 to limit the type of demolition. The query then requires permits in the years of 2013 - 2015 to match the years contained in the child_blood_lead_levels_by_ct data. Lastly, it limits the permits to only the census tract selected in the outer query. This is all returned to the outer query in the form of a column named violation_cases.
In the second subquery, the owner-requested demolitions data is selected by counting all rows in the li_demolitions_norm table and limiting it in a WHERE clause. First, it requires a record_type_id of 2 to limit the type of demolition. The query then requires permits in the years of 2013 - 2015 to match the years contained in the child_blood_lead_levels_by_ct data. Lastly, it limits the permits to only the census tract selected in the outer query. This is all returned to the outer query in the form of a column named permit_cases.
SELECT substring(census_tra, 7, 3) AS tract, cb.perc_5plus,
(SELECT count(*)
FROM li_demolitions_norm
WHERE record_type_id = 1
AND completed_date BETWEEN '2013-01-01' AND '2015-12-31'
AND censustract = substring(cb.census_tra, 7, 3)) AS violation_cases,
(SELECT count(*)
FROM li_demolitions_norm
WHERE record_type_id = 2
AND completed_date BETWEEN '2013-01-01' AND '2015-12-31'
AND censustract = substring(cb.census_tra, 7, 3)) AS permit_cases
FROM child_blood_lead_levels_by_ct cb
ORDER BY tract
Query 3: Elevated blood levels, complete demolition of an existing structure, demolition of an imminently dangerous building, and removal of an underground storage tank by census tract
This query returns five columns of data, ordered by ascending census tract identifier:
- A three-digit census tract number
- The percentage of screened children under the age of six with blood lead levels greater than five for the tract in question
- The total count of demolitions involving the complete demolition of an existing structure using a private contractor
- The total count of demolitions through an accelerated curbside bid process for an imminently dangerous building
- The total count of demolitions involving the removal of an underground storage tank
In the blood lead level table, the census tract number is stored as the full eleven character identifier, including state and city information. Since we are only dealing with Philadelphia, all but the last three characters of the identifier were discarded using the substring function. The data on elevated lead levels was taken directly from the table unmodified.
The other three values are obtained from subqueries in the SELECT statement. In the first subquery, the complete demolition through private contractor data is selected by counting all rows in the li_demolitions_norm table and limiting it in a WHERE clause. First, it requires a type_of_work_id of 4, to limit the type of demolition. The query then requires permits in the years of 2013 - 2015 to match the years contained in the child_blood_lead_levels_by_ct data. Lastly, it limits the permits to only the census tract selected in the outer query. This is all returned to the outer query in the form of a column named demo_contractor.
In the second subquery, the curbside bid demolition data is selected by counting all rows in the li_demolitions_norm table and limiting it in a WHERE clause. First, it requires a type_of_work_id of 1 to limit the type of demolition. The query then requires permits in the years of 2013 - 2015 to match the years contained in the child_blood_lead_levels_by_ct data. Lastly, it limits the permits to only the census tract selected in the outer query. This is all returned to the outer query in the form of a column named demo_curbside.
In the third subquery, the tank removal demolition data is selected by counting all rows in the li_demolitions_norm table and limiting it in a WHERE clause. First, it requires a type_of_work_id of 3 to limit the type of demolition. The query then requires permits in the years of 2013 - 2015 to match the years contained in the child_blood_lead_levels_by_ct data. Lastly, it limits the permits to only the census tract selected in the outer query. This is all returned to the outer query in the form of a column named demo_tankremoval.
SELECT substring(census_tra, 7, 3) AS tract, cb.perc_5plus,
(SELECT count(*)
FROM li_demolitions_norm
WHERE type_of_work_id = 4
AND completed_date BETWEEN '2013-01-01' AND '2015-12-31'
AND censustract = substring(cb.census_tra, 7, 3)) AS demo_contractor,
(SELECT count(*)
FROM li_demolitions_norm
WHERE type_of_work_id = 1
AND completed_date BETWEEN '2013-01-01' AND '2015-12-31'
AND censustract = substring(cb.census_tra, 7, 3)) AS demo_curbside,
(SELECT count(*)
FROM li_demolitions_norm
WHERE type_of_work_id = 3
AND completed_date BETWEEN '2013-01-01' AND '2015-12-31'
AND censustract = substring(cb.census_tra, 7, 3)) AS demo_tankremoval
FROM child_blood_lead_levels_by_ct cb
ORDER BY tract
Figure 1: ERD representing normalized Building Demolitions data
Figure 2: ERD representing Philadelphia Child Blood Lead Levels data
Figure 3: Percent of Tested Children with Elevated Blood Lead Levels
Figure 4: Building Demolition Permits heatmap
Figure 5: Building Demolition Permits by Census Tract
Appendix 1: Obtaining and Loading Data
Obtaining Data
Both datasets were downloaded from the City of Philadelphia’s OpenDataPhilly portal, at the locations listed in Table 2 below. The Building Demolitions dataset was downloaded as a shape file, and the Child Blood Lead Levels data was downloaded as a csv file
Table 2: Data Sources
Dataset Name | Download location |
---|---|
Building Demolitions | https://www.opendataphilly.org/dataset/building-demolitions |
Philadelphia Child Blood Lead Levels | https://www.opendataphilly.org/dataset/philadelphia-child-blood-lead-levels |
Loading Data Into PostGIS
QGIS 2.14.14 was used to import the shape and CSV files into the PostGIS server
- Open QGIS
- In the Browser Panel, right click on PostGIS, and choose New Connection. Fill in the relevant information for your PostGIS server, and click OK to connect
- From the Database menu, open the DB Manager
- In the DB Manager, expand the triangle to the left of PostGIS, and click on your PostGIS server name
- Click on the down-arrow icon labeled “Import layer/file”
- Figure 6: Click on the “three dots” button to the right of the input field, and select the file to be added to your database
- If you need to use a schema other than “public”, select it in the “Schema” drop-down
- In the “Table” text box, enter the name of the table to import this file into
- Click on the checkbox next to Primary Key, and change that value from “id” to “gid”
- Click on the checkbox next to “Convert field names to lowercase”
- Click on the checkbox next to “Create spatial index”
- Click OK
Appendix 2: ETL Script
The final SQL script for this transformation can also be found on GitHub
--DROP table opa_account;
CREATE TABLE opa_account
(
opa_account_key serial PRIMARY KEY,
li_demo_gid integer,
opa_account_num varchar(9),
ownername varchar(57),
address varchar(35),
unit varchar(3),
zip varchar(10)
);
INSERT INTO opa_account (li_demo_gid, opa_account_num, ownername, address, unit, zip)
(
SELECT gid, opa_accoun, ownername, address, unit, zip
FROM li_demolitions
);
CREATE TABLE record_type
(
record_type_id serial PRIMARY KEY,
record_type varchar(14)
);
INSERT INTO record_type (record_type)
(
SELECT DISTINCT record_typ
FROM li_demolitions
);
CREATE TABLE type_of_work
(
type_of_work_id serial PRIMARY KEY,
type_of_work varchar(6)
);
INSERT INTO type_of_work (type_of_work)
(
SELECT DISTINCT typeofwork
FROM li_demolitions
);
CREATE TABLE status
(
status_id serial PRIMARY KEY,
status varchar(9)
);
INSERT INTO status (status)
(
SELECT DISTINCT status
FROM li_demolitions
);
CREATE TABLE applicant_capacity
(
applicant_capacity_id serial PRIMARY KEY,
applicant_capacity varchar(8)
);
INSERT INTO applicant_capacity (applicant_capacity)
(
SELECT DISTINCT applicantc
FROM li_demolitions
);
--DROP TABLE contractor
CREATE TABLE contractor (
contractorid serial PRIMARY KEY,
name varchar(30),
type varchar(25),
address1 varchar(36),
address2 varchar(27),
city varchar(17),
state varchar(2),
zip varchar(10)
);
INSERT INTO contractor (name, type, address1, address2, city, state, zip)
(
SELECT DISTINCT contractor, contract_1, contract_2, contract_3,
contract_4, contract_5, contract_6
FROM li_demolitions
);
--DROP TABLE li_demolitions_norm;
--contractorname, type, and address1, opa_account_num, opa_ownername,
--opa_address will all be dropped after contractorid and opa_account_key
--are populated
CREATE TABLE li_demolitions_norm (
gid integer PRIMARY KEY,
geom geometry(MultiPoint,4326),
objectid integer,
opa_account_key integer REFERENCES opa_account (opa_account_key),
censustract varchar(3),
organization_name varchar(30),
caseorpermitnumber varchar(6),
record_type_id integer REFERENCES record_type (record_type_id),
type_of_work_id integer REFERENCES type_of_work (type_of_work_id),
city_demo varchar(3),
completed_date date,
start_date date,
addresskey varchar(6),
permitstatus varchar(1),
status_id integer REFERENCES status (status_id),
applicant_capacity_id integer REFERENCES applicant_capacity (applicant_capacity_id),
primarycontact varchar(70),
contractorid integer REFERENCES contractor (contractorid),
mostrecentinsp date,
geocode_x numeric,
geocode_y numeric,
opa_account_num varchar(9),
opa_ownername varchar(57),
opa_address varchar(35),
contractorname varchar(30),
contractortype varchar(25),
contractoraddress1 varchar(36)
);
INSERT INTO li_demolitions_norm
(
gid, geom, objectid, censustract, organization_name, caseorpermitnumber,
city_demo, completed_date, start_date, addresskey, permitstatus, primarycontact,
mostrecentinsp, geocode_x, geocode_y, opa_account_num, opa_ownername, opa_address,
contractorname, contractortype, contractoraddress1, record_type_id, type_of_work_id,
status_id, applicant_capacity_id
)
(
SELECT gid, geom, objectid, censustrac, organizati, caseorperm, city_demo,
completed_, start_date, addresskey, permitstat, primarycon, mostrecent, geocode_x,
geocode_y, opa_accoun, ownername, address, contractor, contract_1, contract_2,
CASE
WHEN record_typ = 'VIOLATION CASE' THEN 1
WHEN record_typ = 'PERMIT' THEN 2
END AS record_type_id,
CASE
WHEN typeofwork = 'FULL' THEN 1
WHEN typeofwork = 'CASE' THEN 2
WHEN typeofwork = 'COMDEM' THEN 3
WHEN typeofwork = 'TANKRI' THEN 4
END AS type_of_work_id,
CASE
WHEN status = 'ACTIVE' THEN 1
WHEN status = 'COMPLETED' THEN 2
END AS status_id,
CASE
WHEN applicantc = 'ATTORNEY' THEN 2
WHEN applicantc = 'DSGNPROF' THEN 3
WHEN applicantc = 'TENANT' THEN 4
WHEN applicantc = 'PROF' THEN 5
WHEN applicantc = 'APPL' THEN 6
WHEN applicantc = 'CONTRCTR' THEN 7
WHEN applicantc = 'OWNER' THEN 8
WHEN applicantc = 'AGENT' THEN 9
WHEN applicantc = 'BILLING' THEN 10
END AS applicant_capacity_id
FROM li_demolitions
);
--populate foreign key from opa_account table
UPDATE li_demolitions_norm AS ldn
SET opa_account_key = oa.opa_account_key
FROM opa_account oa
WHERE ldn.opa_account_num = oa.opa_account_num
AND ldn.opa_ownername = oa.ownername
AND ldn.opa_address = oa.address
;
--populate foreign key from contractor table
UPDATE li_demolitions_norm AS ldn
SET contractorid = c.contractorid
FROM contractor c
WHERE ldn.contractorname = c.name
AND ldn.contractortype = c.type
AND ldn.contractoraddress1 = c.address1
;
-- remove extra contractor and opa columns. They were only necessary for the join
ALTER TABLE li_demolitions_norm
DROP COLUMN IF EXISTS contractorname,
DROP COLUMN IF EXISTS contractortype,
DROP COLUMN IF EXISTS contractoraddress1,
DROP COLUMN IF EXISTS opa_account_num,
DROP COLUMN IF EXISTS opa_ownername,
DROP COLUMN IF EXISTS opa_address
;
Appendix 3: Data Dictionary
applicant_capacity table
- applicant_capacity_id serial - The automatically incrementing primary key for this table
- applicantc varchar(8) - Type of applicant on record with private permit
contractor table
- contractorid serial - The automatically incrementing primary key for this table
- name varchar(30) - Name of licensed contractor
- type varchar(25) - Type of contractor
- address1 varchar(36) - Address 1
- address2 varchar(27) - Address 2
- city varchar(17) - City
- state varchar(2) - State
- zip varchar(10) - Zip
child_blood_lead_levels_by_ct table
- gid integer NOT NULL - The automatically incrementing primary key for this table
- geom public.geometry(MultiPolygon,4326) - The coordinate data for this table representing polygons, stored in WGS 84
- census_tra varchar(11) - Census tract of child’s address
- data_redac integer - If the number of children with blood lead levels in a geographic area is between 1-5, this data was redacted from the dataset for privacy concerns, and appears as ‘null’ in the dataset. True = it was redacted. False = data was not redacted
- num_bll_5p integer - Number of children under the age of 6 with newly identified blood lead levels greater than or equal to 5 micrograms per deciliter
- num_screen integer - Number of children under the age of 6 screened for lead poisoning
- perc_5plus numeric - Percent of children under the age of 6 screened with newly identified blood lead levels greater than or equal to 5 micrograms per deciliter
li_demolitions_norm table
- gid integer NOT NULL - The automatically incrementing primary key for this table
- geom public.geometry(MultiPoint,4326) - The coordinate data for this table representing point data, stored in WGS 84
- objectid integer - Unknown
- opa_account_key Foreign key - Foreign key for the opa_account table
- censustract varchar(3) - Census tract of demolition
- organization_name varchar(30) - Commercial organization
- caseorpermitnumber varchar(6) - Unique identifier for demolitions. Demolitions are identified by either a building permit issued to the property owner or contractor, or by the code violation ‘case’ that triggered L+I to demolish the building in the interest of public safety
- record_type_id Foreign key - Foreign key for the record_type table
- type_of_work_id Foreign key - Foreign key for the type_of_work table
- city_demo varchar(3) -
‘YES’ - ‘Curbside Demolition’ contracted through the city on dangerous buildings
‘NO’ - Private demolition executed by landowner - completed_date_ date - Date building permit or L+I contractor bid was closed
- start_date date - Date building permit was issued or date L+I issued a bid to a contractor to demolish on the city’s behalf
- addresskey varchar(6) - Unique address identifier used internally within the Dpt of L+I
- permitstatus varchar(1) - Internal system status of demolition permit or case
- primarycontact varchar(70) - Name of contact on permit record
- contractorid Foreign key - Foreign key for the contractor table
- mostrecentinsp date - Most recent inspection
- geocode_x numeric - Geographic Coordinate
- geocode_y numeric - Geographic Coordinate
opa_account table
- opa_account_key serial - The automatically incrementing primary key for this table
- li_demo_gid integer - GID value from the original li_demolitions staging table
- opa_account_num varchar(9) - ‘Office of Property Assessment Account Number’ Every land parcel within the City of Philadelphia is assigned a unique account number from OPA. This number stays with the property for as long as the deed is held by the same owner. Once a property transfers ownership, it is assigned a new OPA account number and the old one is retired. This field can be used as a join key for various city datasets. It is also sometimes known as the ‘BRT’ number, or within some datasets coming directly from OPA, is also called the ‘PARCEL_NUMBER’. (Other departments may use different internal ‘Parcel Numbers’, so please verify before using)
- ownername varchar(57) - Owner of the property as identified by the Office of Property Assessment
- address varchar(35) - Address of demolition
- unit varchar(3) - Address Unit
- zip varchar(10) - Zip code of demolition
record_type table
- record_type_id serial - The automatically incrementing primary key for this table
- record_typ varchar(14) -
‘PERMIT’- Demolition was an action of a private owner, who was issued a building permit from L+I. (CASEORPERMITNUMBER will be a ‘PERMIT’ number)
‘VIOLATION CASE’- Demolition was an action of the Dpt of L+I, due to dangerous building conditions, as a result of violation of the city’s building code. (CASEORPERMITNUMBER will be a ‘CASE’ number)
status table
- status_id serial - The automatically incrementing primary key for this table
- status varchar(9) -
ACTIVE-Permit or demolition bid is currently open, this indicates construction is ongoing or about to begin
COMPLETED-Indicates that construction activity has concluded
type_of_work table
- type_of_work_id serial - The automatically incrementing primary key for this table
- typeofwork varchar(6) -
FULL - Complete demolition of existing structure through private contractor
TANKRI - Removal of underground tank from property
CASE - Demolition through curbside demolition, this typically involves a complete demolition of a structure