DAT 102: Introduction to Data Analytics
Course Schedule and Lesson Guides
Each class session has its own box:
Week 1: Sat 20 Jan - Fri 26 Jan
Data schemas, data collection, and essential analysis
Vehicle study columns:
Please note: weather, location, speed limit, street quality, surrounding features, time of day, temperature (°F), method of observation
Week 2: Sat 27 Jan - Fri 2 Feb
Data table design & manipulation | Python crash-course | File conversions
Week 3: Sat 3 Feb - Fri 9 Feb
Back to basics with pivot tables and basic Python
- Extract inquiry questions from a dataset of interest
- Using our spreadsheet skills, generate a pivot table and pivot chart to investigate these questions
- Write python code to read in a CSV file and conduct basic processing operations on that set
- Direct the output of a python processing operation to a text file that can be read back by a spreadsheet
- Locate a dataset of interest and brainstorm relevant questions that can be answered with a pivot table
- Explore pivot table basics in MS Excel or in Libre Office
- Answer the inquiry questions using that Pivot table and make a parallel pivot chart
- Write up your mini-analysis in this shared Google Doc. Use the Table of Contents on page 1 to jump down to your dedicated page. NOTE: you must be logged in to ANY Google account to paste in screenshot images of your pivot tables!
- Using your own dataset, use Python to do that CSV processing
- Libre Office Calc version of our Allegheny County Jail census data and its pivot table
- WPRDC data set
- MS Excel Pivot Tables tutorial
- Libre Office pivot tables tutorial
- Function/Formula overview for MS Excel
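The Python tasks in this week's list (read in a CSV, do some basic processing, and write output that a spreadsheet can read back) can be sketched roughly like this. The file names and the "count rows per value" operation are placeholders for your own dataset, not part of the assignment:

```python
import csv

def summarize_counts(in_path, column, out_path):
    """Read a CSV, count how many rows share each value of `column`,
    and write the tallies to a new CSV a spreadsheet can open."""
    counts = {}
    with open(in_path, newline="") as f:
        for row in csv.DictReader(f):
            key = row[column].strip()
            counts[key] = counts.get(key, 0) + 1
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow([column, "count"])  # header row for the spreadsheet
        for key, n in sorted(counts.items()):
            writer.writerow([key, n])
    return counts
```

Swap in any per-row processing you like where the counting happens; the important pattern is DictReader in, csv.writer out.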
Mid-week Exercises: Spreadsheet Madness
Based on last week's class, we have growth needs in spreadsheet fundamentals. These exercises will ask you to process some data in a spreadsheet using a series of functions. You'll be given links to various core skills in spreadsheets. We'll review them in class on Saturday.
Note that these steps will provide you links for resources on using functions in Libre Office Calc since this is a free and open source program anybody can download. You can find similar documentation on MS Excel with an internet search. Many of the functions have the same name across programs.
- Download a CSV version of the non-traffic citations issued in Allegheny County on the WPRDC.
- Open this file in a spreadsheet (either Libre Office or Microsoft Excel) and save the file as a native spreadsheet file (i.e. either a .ods file or .xlsx). This allows you to use the full features of the spreadsheet and preserve any pivot tables you create.
- Let's make a backup of our data so we can always go back and restart with little hassle. Do this by duplicating the tab that contains the raw data. Give the two tabs logical names so that one is our master original data copy and the other is the processed data.
- Take some time to browse the data, using the data dictionary as your guide. Ask yourself: what kinds of questions can I ask about this data? Is it clean? Do I see fields that I can easily crunch, or do they need to be formatted at all? Are there data fields that I might need to adjust to be easily processable?
- This step sequence will help you develop spreadsheet skills necessary to answer these brainstormed questions, which we'll then discuss in person on Saturday:
- Which neighborhood has the highest incident frequency?
- Of the neighborhood with the highest incident frequency, which offense type was the most frequent? What conclusions can we draw about severity of offense and frequency of offense?
- Are black folks more likely to be cited than white folks?
- Are younger folks more likely to be cited for a certain crime than older people? Which ones?
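Once we get to Python, the "highest incident frequency" questions above can also be answered with a pivot-table-style count in code. A rough sketch only -- the NEIGHBORHOOD and OFFENSE column names are guesses at the WPRDC file's layout, so check the data dictionary for the real ones:

```python
import csv
from collections import Counter, defaultdict

def pivot_count(path, row_col, col_col):
    """Emulate a pivot table: count rows for each
    (row value, column value) pair in the CSV."""
    table = defaultdict(Counter)
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            table[row[row_col].strip()][row[col_col].strip()] += 1
    return table
```

Each inner Counter is one row of the pivot table; `max(table, key=lambda k: sum(table[k].values()))` then picks out the neighborhood with the highest overall incident count.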
Week 4: Sat 10 Feb - Fri 16 Feb
Return of the spreadsheet | OpenRefine Magic
We must never underestimate the importance and value of spreadsheets as the foundation for data analysis. This lesson will review essential spreadsheet operations. We'll also introduce a tool designed for more powerful data cleaning and replacement (but weaker on analysis tools) called OpenRefine. The mid-week exercise will involve producing a small data analysis project through its lifecycle.
- Confidently implement essential spreadsheet text and numeric functions to process data for display
- Create pivot tables and pivot charts after initial processing of spreadsheet data
- Conduct basic faceting and filtering in OpenRefine
- Critique a data journalism report created by a data scientist and published online
- OpenRefine's GitHub Account with install instructions
- OpenRefine's Expression Language Reference
- The master data repo: Data is Plural Blog Archive and repo links (Make a copy into your own Google Drive account)
- Eric's data processing example: Nuclear Blast record
- Review the following data journalism piece by the local outlet PublicSource -- Data journalism example: Let's Talk About race and statistics every Pittsburgher should know. Consider the following questions:
- What makes the graphs presented effective? How do you feel about the "drawing" effect used in the graphics?
- How did PublicSource use honest journalism principles in their article?
- This article didn't present any analysis or political opinions explicitly--but are there implicit messages given in this data? How do you feel about this approach (sharing politicized data without analysis)?
- What is the source of this data? Does knowing the source add or detract from the value of this article?
OpenRefine Practice
- Acquire our CSV of nuclear explosion data from our server
- Load this data into OpenRefine
- Use the Facet functions to clean up the blast size field so it's all potentially numeric
- Use the toNumber() function to convert this field to numeric values
- Categorize the blasts based on their size into three classes: small, medium, and large blasts
- Export data to CSV and open in a spreadsheet
- Develop inquiry questions based on this data
- Follow standard analysis procedures in a spreadsheet to uncover the answers. Prepare to share.
- Make sure to have your week 3 pivot table practice done and uploaded to the Google Drive file
- Conclude and make your Civil Rights data analysis presentable
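For comparison, here is a rough Python parallel to the OpenRefine steps above (facet cleanup, toNumber(), then binning into three classes). The small/medium/large thresholds below are invented placeholders for illustration, not values derived from the real blast data:

```python
def classify_blast(raw, small_max=20.0, large_min=150.0):
    """Coerce a blast-size cell to a number (what toNumber() does in
    OpenRefine) and bin it into small/medium/large.
    Thresholds are illustrative defaults only."""
    try:
        # Strip whitespace and thousands separators before converting
        size = float(str(raw).strip().replace(",", ""))
    except ValueError:
        return None  # not cleanly numeric; these are the cells a facet would surface
    if size <= small_max:
        return "small"
    if size >= large_min:
        return "large"
    return "medium"
```

The cells that come back as None are exactly the ones you'd isolate with a text facet in OpenRefine and fix by hand.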
NEW Products to Produce Mid-Week
- Find a "data journalism" article and prepare to share your findings: what was concluded? Was it presented with obvious bias? Are the sources of the data accessible? Write a short email to the author sharing your findings and links to the analysis Google Doc here.
Week 5: Sat 17 Feb - Fri 23 Feb
Visualization tools | US Census and ACS data processing
- Access US Census data through American FactFinder and navigate the download tool to extract two years of data
- Clean US Census data to isolate variables of interest in a spreadsheet and OpenRefine
- Create a database to conduct a join on the data
- Export Joined data to a spreadsheet and create basic chart visualizations of that data
- US Census data is all accessible through the American FactFinder portal located here.
- Directory listing of class files: including sample database and chart spreadsheet
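The "create a database to conduct a join" objective above can be done entirely with Python's built-in sqlite3 module. A minimal sketch, assuming each year's census extract has already been reduced to (GEOID, value) pairs -- the table names and columns are illustrative only:

```python
import sqlite3

def join_two_years(rows_y1, rows_y2):
    """Load two years of (GEOID, value) census rows into an in-memory
    SQLite database and join on GEOID so both years share a row."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE y1 (geoid TEXT PRIMARY KEY, value REAL)")
    con.execute("CREATE TABLE y2 (geoid TEXT PRIMARY KEY, value REAL)")
    con.executemany("INSERT INTO y1 VALUES (?, ?)", rows_y1)
    con.executemany("INSERT INTO y2 VALUES (?, ?)", rows_y2)
    joined = con.execute(
        "SELECT y1.geoid, y1.value, y2.value "
        "FROM y1 JOIN y2 ON y1.geoid = y2.geoid "
        "ORDER BY y1.geoid"
    ).fetchall()
    con.close()
    return joined
```

An inner join like this drops GEOIDs that appear in only one year, which is usually what you want before charting year-over-year change.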
- Let's begin with a discussion of a fascinating use of census data to visualize distributions of people by race in the USA. Investigate this "Racial Dot Map" tool created by researchers at the University of Virginia. Consider these discussion questions:
- Where did you explore first? What were your first impressions of this data?
- What makes this an effective data visualization tool? What are its limitations?
- Explore the researcher's data portal by clicking the "what am I looking at?" link. What principles of good data analysis are exhibited?
- What additional layers of data would you like to add to this racial dot map? What conclusions or ideas would adding this data allow viewers to consider or conclude?
- Complete the in-class census data exercise and make sure your write-up is solid and presentable next week
Products to Produce
- Completed, thorough, and presentable US Census data mini-project in the google spreadsheet linked above
Week 6: Sat 24 Feb - Fri 2 March
Mapping Fundamentals | Spatial data analysis | QGIS Glory | CartoDB magic
- Demonstrate the essential elements of mapping: coordinate planes, projections, datums, and spatial joins
- Assemble a geospatial US Census data set using data from American FactFinder and TIGER shapefiles
- Use basic geospatial analysis tools in QGIS to answer meaningful inquiry questions
- Export map data from QGIS into CartoDB for online sharing
- Cartography on Wikipedia: Essential concepts and interesting history on the field of mapping and geospatial analysis
- QGIS main page: The free and open source competitor to ESRI's $1000+ ArcMap software. Its documentation is extensive and amazing.
- US Census data is all accessible through the American FactFinder portal located here.
- The actual map polygons are all published through the TIGER shapefiles program
- Carto (formerly CartoDB) is a free tool for creating and publishing a variety of map types online. Great for sharing and visualizing and embedding but lacking the spatial join tools available in QGIS.
- Explore the essential elements of map creation with the physical mapping projection activity (enter old-school overhead transparency films!)
- Check out examples of cool online maps: Eric's CRDC black/white suspension ratios dot map and CMU Students for Urban Data Systems Police jurisdiction maps and Mark Egge's bus bunching analysis and map. What principles of honest data journalism are employed? How does qualitative and quantitative data blend or not blend in these examples?
- Navigating mapping with Carto online: Visualizing and tinkering with Allegheny County Property Assessment Data accessed through the Western Pennsylvania Regional Data Center's Property Information Extractor. We'll cook up a data set to export and dump it into Carto to tinker with map basics: choropleth map analysis, layers, annotations, presentation.
- Explore QGIS as a fully featured mapping tool with support for layer building, data joining, advanced geospatial analysis, and exporting/sharing.
- Build a map of our US Census data we analyzed last week using QGIS and export up to Carto.
Map analysis in QGIS assembly sequence
- Generate compelling inquiry questions that you hope to answer (at least partially) through your analysis.
- Create a blank map in QGIS
- Assemble the "flat file" of US Census data that you'd like to map. Make sure to preserve the geographic entity id (usually called GEOID) since that's what we'll use to join the flat file data to the geospatial features.
- Download the appropriate TIGER shapefile (resource link above) and import into QGIS. Make sure the data seem reasonable--are these indeed the geospatial features you'd like? Ensure a correspondence between the year of your Census data and the year the shapefile data was published (the US Census adjusts the shapes of the geospatial features every 10 years). Do a sample lookup of a GEOID in both the flat file table and the map features to make sure the join is likely to find matches for some or all of the features you'd like to visualize data for.
- Join the flat file data to the map features using the data join process described in this tutorial
- Be sure to open your data table for the geospatial layer and make sure the join worked.
- In the symbology tab of the polygon layer choose appropriate data fields to visualize and tinker with the bin size, etc. of the coloring to make sure you're drawing out the essential data that speak to your inquiry questions.
- Export your map as an image and load into our weekly project assembly document.
- Thoughtfully explain your data analysis process, your inquiry questions, and the conclusions your data can (and cannot) support in your section.
- Complete the mapping sequence described above and your full, thoughtful write-up in this week's work product document.
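The GEOID sample lookup in the assembly sequence above can be automated with a quick Python check before you run the QGIS join. A sketch, assuming you've exported the shapefile's attribute table to CSV and that both files share a GEOID column (adjust the column name to match your files):

```python
import csv

def geoid_overlap(flat_csv, attrs_csv, key="GEOID"):
    """Report how many flat-file GEOIDs will find a match in the map
    features, and how many will be left unjoined."""
    def keys(path):
        with open(path, newline="") as f:
            return {row[key].strip() for row in csv.DictReader(f)}
    flat, shapes = keys(flat_csv), keys(attrs_csv)
    matched = flat & shapes
    return len(matched), len(flat - shapes)
```

If the unmatched count is large, suspect a year mismatch between the Census extract and the TIGER vintage, or a formatting difference (e.g. leading zeros dropped by a spreadsheet).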
Products to Produce
- A completed section in the work product document linked above for sharing FIRST THING next week in class.
Week 7: Sat 3 March - Fri 9 March
Exposure to Python and using Python scripts | Taste of regression | Project Design
Final Project Share Directory
Please compile all of your project-related files into a single location and upload them to your specific sub-directory in our shared drive linked below.
- Original/raw data sets labeled as such, in their OWN directory
- Processed data files labeled appropriately
- Project outline as a PDF: This should be a document with nice headers and carefully proofed text that includes: background on your project, inquiry questions, analysis steps, conclusions, and ideas for further exploration. Include snaps of any visualizations you use.
- Any graphs or charts or images used--properly titled.
- A work process log that is a step-by-step list of what you did to analyze your data: tools used, files moved, functions/libraries, etc.
- The source code of any scripts (such as those generated by OpenRefine) used in the project
- A file called readme.txt that lists the files in this directory and why they are included.
Week 8 [LAST SESSION]: Sat 10 March
Share final analysis projects with client | Celebrate the promise of data analytics
- Present a coherent narrative of your project's birth, your guiding questions, analysis, and conclusions
- Converse with a professional data analyst about the field at large, specific data analysis procedures and approaches, and recommendations for skill development.
Special Guest Conversant
Our class had the pleasure of hosting professional data analyst Mark Egge, a consultant at High Street Consulting, based in Pittsburgh, PA. Mr. Egge lives in Bozeman, MT.
Products to Produce
- Completed final project directory as described and linked in last week's guide