Check this page for updates on upcoming classes, our learning goals, and the lesson modules we'll use to get there.

return home

DAT 102: Introduction to Data Analytics

Course Schedule and Lesson Guides

Each class session has it's own box:

Jump down to current week

Week 1: Sat 20 Jan 18 - Friday 26 Jan

Data schemas, data collection, and essential analysis

Study Topics:

Organizing structured data in tables
Approaches to data sampling
Collecting (Messy) data
Using spreadsheets for first-pass analysis

Learning Objectives

Design a coding schema for basic data table work
Conduct basic analysis on structured data
Respond intelligently to the question: what is data analytics?

Learning Exercises:

Learn about one another with a structured data creation exercise
Grading performance analysis
Data ratings algorithms and checking them

Learning Resources

Doing Data Science code set

Columbia Data Science Program

Western PA Regional Data Center

Spacing

Flow chart of core concept learning sequence

Vehicle study columns:

Visible use of mobile computer device/phone
Driver not looking at road ("The Look")
# passengers
Youth in car(boolean)
Sloppy driving (swerving, crossing lane lines, running lights)
Vehicle type: passenger car/ pickup, commercial vehicle (tractor trailer, delivery, etc.), mass transit (bus, large van), Motocycle

Please note: weather, location, speed limit, street quality, surrounding features, time of day, temp deg F, method of observation

Week 2: Sat 27 Jan - Friday 2 Feb

Data table design & manipulation | Python crash-course | File conversions

Learning Objectives

Create data set tables designed for analytic purposes: clear column headers, data dictionaries, etc.
Conduct field gathering of data and entering of that data based on data dictionary
Implement data cleaning procedures to address observational inconsistencies and general "messiness"
Create a data manipulation and work process log for our distracted driving data
Export data as a CSV, verify the export, and read that data into python
Diagram essential python data structures: dictionaries and data frames
Conduct basic data set manipulations in Python\

Mid-week to-dos

Prepare to work on these steps by Tuesday evening and provide Eric feedback on sticking points or areas for additional learning by midnight on Tuesday. Either call (leave a VM) 412.894.3020 or--if you really despise talking--send an email to him<./li>
Research and find/purchase/borrow/rent a statistics text that you can study happily and work exercises. I have used Statistics for Business and Economics by McClave and am working on studying the classic Introductory Probability and Statistical Applications by Meyer.
Install and test Python 3 on your computer.
Install and test the Anaconda 3 library of data-related python tools.
Try running Jupyter Notebook which is a python tool for interactive programming and saving python scripts. It is a package already installed with Anaconda 3--no extra installs needed, just figuring out how to run the server and access it in a web browser.
Download and install Libre office if you ever anticipate not having full access to MS office products. Once installed, create a spreadsheet, export it as a CSV, and import it into, say, google sheets, to test your work.
Devote 3-4 hours tinkering with CSV imports using the python tutorials linked here and found online. Place a priority on reading about any fundamentals you may not have in your tool set. Use this as a time to assess your programming acumen and interest in going more in depth into Python.
Work through basic pandas tutorial--focusing on reading data in and printing out the object information
Explore one other package in the python standard library by importing it and creating the object and calling a few methods on that object.

Learning Exercises:

Review our sampling of car data on the computer
Discuss sampling methods and their importance in the foundational value of our data sets
Conduct a group observational experiment of car data
Format and process this data in CSV format and export into Python

Learning Resources

Distracted Driving Data Shared Drive
Python File resources on this site
Python Documentation
Reading on Sampling from the University of Texas at Dallas
Use a non-archiving search engine like DuckDuckGo to find related resources to the weekly to-dos.

Week 3: Sat 3 Feb - Fri 9 Feb

Back to basics with Pivot tables and basic python

Learning Objectives

Extract inquiry questions from a dataset of interest
Using our spreadsheet skills, generate a pivot table and pivot chart to investigate these questions
Write python code to read in a CSV file and conduct basic processing operations on that set
Direct the output of a python processing operation to a text file that can be read back by a spreadsheet

Learning Exercises:

Locate a dataset of interest and brainstorm relevant questions that can be answered with a pivot table
Explore pivot table basics in MS Excel or in Libre Office
Answer the inquiry questions using that Pivot table and make a parallel pivot chart
Write up your mini-analysis in this shared google doc. Use the Table of Contents on page 1 to jump down to your dedicated page. NOTE: you must be logged in to ANY google account to paste in screen shot images of your pivot tables!
Using your own dataset, use python to do that CSV processing

Learning Resources

Mid-week Exercises: Spreadsheet Madness

Based on last week's class, we have growth needs in spreadsheet fundamentals. These exercises will ask you to process some data in a spreadsheet using a series of functions. You'll be given links to various core skills in spreadsheets. We'll review them in class on Saturday.

Note that these steps will provide you links for resources on using functions in Libre Office Calc since this is a free and open source program anybody can download. You can find similar documentation on MS Excel with an internet search. Many of the functions have the same name across programs.

Download a CSV version of the non-traffic citations issued in Allegheny county on the WPRDC.
Open this file in a spreadsheet (either Libre office or Microsoft Corporation's Excel) and save the file as a native spreadsheet file (i.e. either a .ods file or .xlsx). This allows you to use the full features of the spreadsheet and preserve any pivot tables you create.
Let's make a backup of our data so we can always go back and restart with little hassle. Do this by duplicating the tab that contains the raw data. Label the two tabs logical names such that one is our master original data copy and the other is the processed data. L
Take some time to browse the data, using the data dictionary as your guide. As yourself: what kind of questions can I ask about this data? Is it clean? Do I see fields that I can easily crunch or do they need to be formatted at all? Are there data fields that I might need to adjust to be easily processable?
This step sequence will help you develop spreadsheet skills necessary to answer these brainstormed questions, which we'll then discuss in person on Saturday:

Which neighborhood has the highest incident frequency?
Of the neighborhood with the highest incident frequency, which offense type was the most frequent? What conclusions can we draw about severity of offense and frequency of offense?
Are black folks more likely to be cited than white folks?
Are younger folks more likely to be cited for a certain crime than older people? Which ones?

We need to do some trimming of this data, particularly the data column and our ugly offenses field. Start by generating a new column next to the existing "CITEDTIME" column. Use the formula called 'Left' to extract only the year, month, and day, discarding the time.
Now let's wrestle with the "OFFENSES" column. First, the data are ugly--some are all caps, some are not. The section and subsections are not a consistent length, so we can't just extract those with a nice left() or right() function call. Let's start by getting them all to be in lowercase for consistency (this way it doesn't look like some offenses are more important than others just because they are in all caps). Scan the list of text functions in Libre office for an appropriate function to convert all that text to lowercase. Remember, you'll want to create new columns for each adjustment to the original data. We can always hide the original data.
Before we process the offense data, let's make sure we don't have any trailing spaces at the end of our offenses. Find the name of the function that will automatically remove non-printing characters from the beginning and end of data in a particular cell. You get to do the searching for this one!
This is a tricky step and it will take time and PATIENCE. It may take you up to an hour to get this working right--but is a rite of passage in spreadsheets since this formula is so powerful. We would have liked to be given an office ID that is a consistent length that we could use for sorting and displaying. But we weren't given that data--only the ugly legal code section, subsection, etc. What's worse, those are not in a consistent format. Some have () in them, or more than one pair of (). Some have decimal places, others dont. We want to generate our own set of offense IDs and attach these IDs to existing offense columns. We're going to do this with a special function called VLOOKUP(). The idea behind this function is that we can create a key table that associates unique offense names and our own offense IDs that are pretty, like 1000 and 10001. VLOOKUP will search our data column called OFFENSE and see if there is a matching value in our little lookup table. If a match is found in our lookup table, our own offense IDs are returned as the output of the VLOOKUP() function call. This is handy, but tricky to get to work correctly.
Start by reading the documentation on Vlookup from libre office help. This is not a very helpful guide on this function. But we always start with the official documentation and go from there. This page looks ugly, but is one of the most complete discussions of using VLOOKUP in Libre office.. You can also find some video tutorials out there, but I think videos are tedious and not very good as references, which I generally want.
Go ahead and try to generate a new column in our data table called "offenseID" and populate it with the offense IDs that you created in a reference table that VLOOKUP uses. (The analog is matching up the letter A with a value of 90 to 100 in a final grade percent table--we are looking up offense IDs instead of letter grades). When creating your lookup table, you'll want to list all of the unique values found in the original data in one column, and give them a pretty ID number of your choosing. So your first row in this lookup table will be something like "5511(c)(1) CRUELTY TO ANIMALS" in column 1 and a number like 1000 in column 2.
When building your lookup table, you need to determine all the unique values in the OFFENSES column. Here is a discussion of some ways to do this in Libre Office calc.
Whew! With your new offense IDs generated, you're ready to start analyzing the data with filters, pivot tables, and charts.
Filter all the columns such that you can sort by unique values, and exclude values that you don't want to work with. Here's the Ahuka tutorial.
Use your pivot table skills to mush the data around to look for answers to the above inquiry questions or ones you generate on your own. Prepare to discuss these in person on Saturday.

Week 4: Sat 10 Feb - Fri 16 Feb

Return of the spreadsheet | OpenRefine Magic

We must never underestimate the importance and value of spreadsheets as the foundation for data analysis. This lesson will review essential spreadsheet operations. We'll also introduce a tool designed for more powerful data cleaning and replacement (but weaker on analysis tools) called OpenRefine. The mid-week exercise will involve producing a small data analysis project through its lifecycle.

Learning Objectives

Confidently implement essential spreadsheet text and numeric functions to process data for display
Create pivot tables and pivot charts after initial processing of spreadsheet data
Conduct basic faceting and filtering in Open Refine
Critique a data journalism/report created by a data scientist and published online

Learning Resources

OpenRefine's GitHub Account with install instructions
OpenRefine's Expression Language Reference
The master data repo: Data is Plural Blog Archive and repo links (Make a copy into your own Google Drive account)
Eric's data processing example: Nuclear Blast record

Learning Exercises:

Review the following data journalism piece by the local outfit: PublicSource. Data journalism example: Let's Talk About race and statistics every Pittsburgher should know. Consider the following questions:

What makes the graphs presented effective? How do you feel about the "drawing" effect used in the graphics?
How did Public Source use honest journalism principles in their article?
This article didn't present any analysis or political opinions explicitly--but are there implicit messages given in this data? How do you feel about this approach (sharing policized data without analysis)?
What is the source of this data? Does knowing the source add or detract from the value of this article?

Let's stretch our spreadsheet muscles with school-related civil rights (discipline) data located on our server here. Download this spreadsheet and follow these analysis guidelines.

Review the data guides for this CRDC data published by the office of civil rights.
In your spreadsheet, copy the initial raw data table. Rename the initial tab something like "rawData". In your copied tab, delete unwanted columns to isolate the variables of interest.
Shorten column names, remove spaces, etc.
Cut out the rows to isolate the level of analysis you are targeting: county, district, or school.
Calculate a per enrolled student metric (field / total enrollment) for each of your columns of interest.
Generate inquiry questions related to the connection between school size or school type and your field of interest.
Use min/max/stdev functions to compute summary statistics for each of your fields of interest (5-6)
Create a pivot chart to help you answer your inquiry questions. Once you have isolated the data you want, generate a few charts from the pivot table that shed light on your conclusions.
Publish your results in our shared google doc here

Open Refine Practice

Acquire our CSV of nuclear explosion data from our server
Load this data into OpenRefine
Use the Facet functions to clean up the blast size field so it's all potentially numeric
use the toNumber() function to convert this field to numeric values
Categorize the blasts based on their size into three classes: small, medium, and large blasts
Export data to CSV and open in a spreadsheet
Develop inquiry questions based on this data
Follow standard analysis procedures in a spreadsheet to uncover the answers. Prepare to share.

Mid-week to-dos

Make sure to have week 3 pivot table practice done and uploaded to the google drive file
Conclude and make your Civil Rights data analysis presentable

NEW Products to Produce Mid-Week

Find a "data journalism" article and prepare to share your findings: what was concluded? Was it presented with obvious bias? Is the sources of the data accessible? Write a short email to the author sharing your findings and links to the analysis google doc here.

Week 5: Sat 17 Feb - Fri 23 Feb

Visualization tools | US Census and ACS data processing

Learning Objectives

Access US Census data through American Factfinder and navigate the download tool to extract two years of data
Clean US Census data to isolate variables of interest in a spreadsheet and OpenRefine
Create a database to conduct a join on the data
Export Joined data to a spreadsheet and create basic chart visualizations of that data

Learning Resources

US Census data is all accessible through the American FactFinder portal located here.
Directory listing of class files: including sample database and chart spreadsheet

Learning Exercises:

Let's begin with a discussion of a fascinating use of census data to visualize distributions of people by race in the USA. Investigate this "Racial Dot Map" tool created by researchers at the University of Virginia. Consider these discussion questions:

Where did you explore first? What were your first impressions of this data?
What makes this an effective data visualization tool? What are its limitations?
Explore the researcher's data portal by clicking the "what am I looking at?" Link. What principles of good data analysis are exhibited?
What additional layers of data would you like to add to this racial dot map? What conclusions or ideas would adding this data allow viewers to consider or conclude?

Browse the American Community Survey data to find two geographies of interest over two time periods in which you can investigate change in some set of variables. You'll need to make sure that you have data on the same geography level for the tables you choose in both years.
Download both tables and import them into a spreadsheet
Clean the columns by developing sensible column names, deleting the columns you don't want (probably margin of errors for this practice activity). Remember--No strange columns and no spaces in variable names!
Save this file and import into OpenRefine to clean up fields. Delete records in which there is very little data. Replace no-value markers with 0 so we can use numeric functions on the fields
Export the data from Open Refine back into a spreadsheet
Copy the cleaned data into LibreOffice base and create a master table with joined data on a key column for export
Extract data from the database back into the spreadsheet for visualization
Visualize the data you've gathered and do the write-up in the shared google doc located here.
Prepare to give a short presentation on this data at the start of the next class.

Mid-week to-dos

Complete the in-class census data exercise and make sure your write-up is solid and presentable next week

Products to Produce

Completed, thorough, and presentable US Census data mini-project in the google spreadsheet linked above

Week 6: Sat 24 Feb - Fri 2 March

Mapping Fundamentals | Spatial data analysis | QGIS Glory | CartoDB magic

Learning Objectives

Demonstrate the essential elements of mapping: coordinate planes, projections, datums, and spatial joins
Assemble a geospatial US Census data set using data from American Factfinder and Tiger shapefiles
Use basic geospatial analysis tools in QGIS to answer meaningful inquiry questions
Export map data from QGIS into CartoDB for online sharing

External Resources

Cartography on Wikipedia: Essential concepts and interesting history on the field of mapping and geospatial analysis
QGIS main page: The free and open source competitor to ESRI's $1000+ Arcmap software. It's documentation is extensive and amazing.
US Census data is all accessible through the American FactFinder portal located here.
The actual map polygons are all published through the TIGER shapefiles program
Carto (formerly CartoDB) is a free tool for creating and publishing a variety of map types online. Great for sharing and visualizing and embedding but lacking the spatial join tools available in QGIS.

Learning Exercises:

Explore the essential elements of map creation with the physical mapping projection activity (enter old-school overhead transparency films!)
Check out examples of cool online maps: Eric's CRDC black/white suspension ratios dot mapand CMU Students for Urban Data Systems Police jurisdiction maps and Mark Egge's bus bunching analysis and map. What principles of honest data journalism are employed? How does qualitative and quantitative data blend or not blend in these examples?
Navigating mapping with carto online: Visualizing and tinkering with Allegheny County Property Assessment Data accessed through the Western Penssylviaia Regional Data Center's Property Information Extractor. We'll cook up a data set to export and dump it into Carto to tinker with map basics: choropleth map analysis, layers, annotations, presentation.
Explore QGIS as a fully featured mapping tool with support for layer building, data joining, advanced geospatial analysis, and exporting/sharing.
Build a map of our US Census data we analyzed last week using QGIS and export up to Carto.

Map analysis in QGIS assembly sequence

Generate compelling inquiry questions that you hope to answer (at least partially) through your analysis.
Create a blank map in QGIS
Assemble the "flat file" of US Census data that you'd like to map. Make sure to preserve the geographic entity id (usually called GEOID) since that's what we'll use to join the flat file data to the geospatial features.
Download the appropriate TIGER shapefile (resource link above) and import into QGIS. Make sure the data seem reasonable--are these indeed the geospatial features you'd like? Ensure a correspondence between the year of your CENSUS data and the year the Shapefile data was published (the US Census adjusts the shapes of the geospatial features every 10 years). Do a sample lookup of a GEOID in both the flat file table and the map features to make sure the join is likely to find matches for some or all of the features you'd like to visualize data for.
Join the flat file data to the map features using the data join process described in this tutorial
Be sure to open your data table for the geospatial layer and make sure the join worked.
In the symbology tab of the polygon layer choose appropriate data fields to visualize and tinker with the bin size, etc. of the coloring to make sure you're drawing out the essential data that speak to your inquiry questions.
Export your map as an image and load into our weekly project assembly document.
Thoughtfully explain your data analysis process, your inquiry questions, and the conclusions your data can (and cannot) support in your section.

Mid-week to-dos

Complete the mapping sequence described above and your full--thoughtful--write up in this week's work product document.

Products to Produce

A completed section in the work product document linked above for sharing FIRST THING next week in class.

Week 7: Sat 3 March - Fri 9 March

Exposure to Python and Using python scripts | Taste of regression | Project Design

Final Project Share Directory

Please compile all of your project-related files into a single location and upload them to your specific sub-directory in our shared drive linked below.

Required Components

Original/Raw data sets labeled as such, in its OWN directory
Processed data files labeled appropriately
Project outline as a PDF: This should be a document with nice headers and carefully proofed text that includes: background on your project, inquiry questions, analysis steps, conclusions, and ideas for further exploration. Include snaps of any visualizations you use.
Any graphs or charts or images used--properly titled.
A work process log that is a step-by-step list of what you did to analyze your data: tools used, files moved, functions/libraries, etc.
The source code of any scripts (such as that generated by open refine) used in the project
A file called readme.txt that lists the files in this directory and why they are included.

Final Project Shared Directory

Week 8 [LAST SESSION]: Sat 10 March

Share final analysis projects with client | Celebrate the promise of data analytics

Learning Objectives

Present a coherent narrative of your project's birth, your guiding questions, analysis, and conclusions
Converse with a professional data analyst about the field at large, specific data analysis procedures and approaches, and recommendations for skill development.

Special Guest Conversant

Our class had the pleasure of hosting professional data analyst Mark Egge, a consultant at High Street consulting, based in Pittsburgh, PA. Mr. Egge lives in Bozeman, MT

Products to Produce

Completed final project directory as described and linked in last week's guide

DAT 102: Introduction to Data Analytics

Course Schedule and Lesson Guides

Week 1: Sat 20 Jan 18 - Friday 26 Jan

Data schemas, data collection, and essential analysis

Study Topics:

Learning Objectives

Learning Exercises:

Learning Resources

Vehicle study columns:

Week 2: Sat 27 Jan - Friday 2 Feb

Data table design & manipulation | Python crash-course | File conversions

Learning Objectives

Mid-week to-dos

Learning Exercises:

Learning Resources

Week 3: Sat 3 Feb - Fri 9 Feb

Back to basics with Pivot tables and basic python

Learning Objectives

Learning Exercises:

Learning Resources

Mid-week Exercises: Spreadsheet Madness

Week 4: Sat 10 Feb - Fri 16 Feb

Return of the spreadsheet | OpenRefine Magic

check_circleLearning Objectives

bookLearning Resources

listLearning Exercises:

Open Refine Practice

extensionMid-week to-dos

cakeNEW Products to Produce Mid-Week

Week 5: Sat 17 Feb - Fri 23 Feb

Visualization tools | US Census and ACS data processing

check_circleLearning Objectives

bookLearning Resources

listLearning Exercises:

extensionMid-week to-dos

cakeProducts to Produce

Week 6: Sat 24 Feb - Fri 2 March

Mapping Fundamentals | Spatial data analysis | QGIS Glory | CartoDB magic

check_circleLearning Objectives

bookExternal Resources

listLearning Exercises:

buildMap analysis in QGIS assembly sequence

extensionMid-week to-dos

cakeProducts to Produce

Week 7: Sat 3 March - Fri 9 March

Exposure to Python and Using python scripts | Taste of regression | Project Design

check_circleFinal Project Share Directory

Required Components

Week 8 [LAST SESSION]: Sat 10 March

Share final analysis projects with client | Celebrate the promise of data analytics

check_circleLearning Objectives

listSpecial Guest Conversant

cakeProducts to Produce

Learning Objectives

Learning Resources

Learning Exercises:

Mid-week to-dos

NEW Products to Produce Mid-Week

Learning Objectives

Learning Resources

Learning Exercises:

Mid-week to-dos

Products to Produce

Learning Objectives

External Resources

Learning Exercises:

Map analysis in QGIS assembly sequence

Mid-week to-dos

Products to Produce

Final Project Share Directory

Learning Objectives

Special Guest Conversant

Products to Produce