DTAN-310 > big data

Property Asessment Data

The purpose of this project is to get our hand dirty with a data set that's too big for your average spreadsheet on a desktop: about 580,000 records. The data is accessible as a huge flat file via the WPRDC data store. The home page for this massive data set is contains a data dictionary and other resources.

Take a moment and review the structure of the data: notice that each record is a parcel in the county, and we have all sorts of juicy data about each parcel, including its most recent assessed land and building value, along with condition and others!!

Step 1: Project setup

1A: Create a directory (aka folder) called property_assessments and download the entire flat file of parcels in the county into this directory. Give the file a sensible name with the extension .csv.

1B: Let's create an abbreviated data set for testing that only contains a few hundred records, instead of 500,000! Create a new jupyter notebook and give it a sensible title. You'll want that notebook file to be in the same directory as your data file for easy access, or at least in the parent of the data directory.

1C: Write a small program in your notebook that reads the data in your massive data file into a smaller file called assessment_abbrev.csv. Write a function that does this transfer of data from one file to another and takes in a single integer that determines how many records from the original file to move over to the smaller file, NOT INCLUDING the header row, which must be migrated regardless of the number of individual rows.

Step 2: Initial processing

2A: Write a program made of one or more functions that reads in data from either the large or small/test file and computes some basic statistics about the parcels in the data file:

Compute the total number of parcels processed. Then use the value of the variable "FAIRMARKETBUILDING" to count how many parcels in your data set are vacant, meaning the parcels have a building value of zero dollars. With your count of vacant parcels, compute the overall percent of vacant parcels.
Use the CLASS or CLASSDESC variables to break down the percent of vacant parcels by owner category: residential, commercial, government, and other. Create a dictionary that whose keys are the distinct values of CLASS. The value of each of these keys should be itself another dictionary whose keys are "total_parcels", "vacant parcels", and TWO other values metrics of your choosing, such as the count of parcels in each class that are in POOR condition, or which was sold in the last 5 years. Or how many of each class are worth over $1m, or worth under $30,000.

2C: Try your program using your test data set, then attempt the above steps using the entire data set about all parcels in Allegheny County.

2-EXT: Research how you could use the time tools in python to measure the time required for your program to digest all 500,000+ rows of county data and decide how many are vacant, etc.

Database basics

Let's try reading in this data to a relational database called sqlite!

Visualization with MatPlotLib and Pandas

Once you have the basic tools down for digesting CSV data, let's try using more complex data handling tools for processing the parcel data. Our module is located on technologyrediscovery.net