lab02 : Using STL to aggregate data and query

num	ready?	description	assigned	due
lab02	true	Using STL to aggregate data and query	Wed 01/20 12:00AM	Thu 01/28 11:59PM

Goals

Learning objectives. At the end of this lab, students should be able to:

Read and understand base C++ code (the code for reading data from a CSV file, all from lab01)
Use STL hashmaps to aggregate ‘county’ level data to ‘state’ data (including computing the average)
Design and implement a C++ class representing state demographic data
Utilize data aggregated into a hashmap to answer questions about the data (e.g. state with the youngest population)
Practice using a testing framework, including writing additional tests

Orientation

Last week, we developed code to read in county demographic data (age and education). This week, our goal is to examine that data and be able to answer questions about regions in the USA. For example, which region has the youngest population? Later in the quarter we will combine other datasets and be able to expand our queries.

The first step is deciding at what regional level we want to look at our data. With over three thousand counties, we want to consider larger regions (by grouping counties). There are lots of ways to understand regional data, but for this week we will look at state level data. Thus, one of the first tasks is to start aggregating county data together into states.

Once we have the data aggregated together, we will use various methods to identify extremes in the data (minimums and maximums) of various demographic fields. Although you have choices about the core tasks, you will need to wrap your solution in a class that we will use for testing.

This week’s lab is a bit different in that you will have choices about how to solve the primary aspects of the problem and then you will need to implement a very specific class to help with testing. Part of your lab grade will depend upon passing tests and another portion of the grade depends on your code being reviewed by one of the teaching assistants or instructor.

In general, C++ has many tools to solve various aspects of this problem; what matters is that you understand the tools you are using in your solution. You can implement a solution to each of these tasks with the material that has already been discussed in lecture. Make sure you understand your solution, as you may be asked to discuss it.

Tasks

Step 0: Getting Started - think about the problem

Starting from your code last week

There are various ways to tackle this problem. To start, make sure you understand the problem. Given county level data, we want to average all the county data for a given state together and then be able to query the maximums and minimums of any of the data fields. See the note below about averaging percentages and the necessary modifications to the demogData class.

There are some clear cut tasks, but the order and exact implementation is somewhat up to you, with the exception of the dataAQ class, for which you must implement the specified functions for use in testing.

In general, to solve this problem, we must aggregate county data into states. This means having an implementation to collect all the counties for a given state and combine the county’s demographic data to state level data. For this assignment, to combine the data, we will be averaging any county data in our data set for its associated state.

Download all the new files from https://github.com/ucsb-cs32-w21/Lab02-STARTER (note that some of them are blank, but you need to use these files so that the names match the tests). You will need all the code (including tddFuncs.h and .cpp) from Lab01. Put all the files (those from the lab02-starter) and all your code from lab01 into one folder. (You will not need testDemog1.cpp or testDemog2.cpp.)

Recall that when we think about a problem, one of the first tasks is to consider the ‘data’ associated with that problem (and then closely related, to consider the data structures we can use to build up necessary data relationships). There are multiple valid solutions here, but for this lab we do expect to see solutions to the following general tasks. You can tackle them in whatever order makes sense to you, but we will be looking for these aspects in your solution.

Regardless of exactly how you solve the following tasks, you must support the specified queries as a part of the dataAQ class. Make sure you understand exactly how your solution will be tested before you dive in too deep.

Create a new GitHub repo for this week’s lab. We will be looking at your code via GitHub.

Task 0: prepare for averaging percentages of populations

Note that we cannot simply average our population values, because they are percentages and each county could have a different number of samples (a different population). Thus, one of the first changes you will need to make is to add data to demogData in order to store and represent the county population. Parse.cpp is one of the updated files: it now reads the 2014 population number from the CSV file and calls a constructor for demogData that passes in this value. Modify your demogData to support storing this value.

Then when designing your state data, also store a total state population and think about how to aggregate the county data (for example, during aggregation convert the percentages into actual counts and then compute those counts as percentages of the total state population, after all county data has been aggregated together).
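The aggregation idea described above can be sketched as follows. This is only an illustration under stated assumptions: CountySample and weightedPercent are hypothetical names, not part of the starter code, and your demogData will hold more fields than this.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for one county's record; the lab's demogData differs.
struct CountySample {
    double pctOver65;  // percentage of the county population, e.g. 10.0 means 10%
    int population;    // county population (the 2014 figure in this lab)
};

// Population-weighted percentage for a group of counties:
// convert each percentage to a head count, sum the counts,
// then express the sum as a percentage of the summed population.
double weightedPercent(const std::vector<CountySample>& counties) {
    double count = 0.0;
    long totalPop = 0;
    for (const CountySample& c : counties) {
        count += c.pctOver65 / 100.0 * c.population;
        totalPop += c.population;
    }
    assert(totalPop > 0);  // at least one populated county is required
    return 100.0 * count / totalPop;
}
```

For example, a county of 1,000 people at 10% and a county of 9,000 people at 50% combine to 46%, not the 30% that a plain mean of the two percentages would give.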

Task 1 and 2: Representing ‘state’ data

Design and Implement a class to represent ‘state’ data

Ultimately, we will want to conduct simple data analysis on state level data (e.g. which state has the most people with Bachelor’s degrees). Design a class to represent state level demographic data. Again, there are multiple valid solutions here. Design a solution that makes sense to you.

At this point we ask that you do not use inheritance or polymorphism - that will come in future weeks. Designing a solution without them will help motivate their use in later labs.

The state data should have the same demographic information that the county data has, that is age and education. State data will need data in addition to what is stored in a county (especially associated with averaging the county data). But, yes, this does mean you will have two classes that are very similar but represent different regional zones and store different data (as mentioned above, we will revisit this design as we expand our data project).

This class should be implemented in the empty provided files stateDemog.h and stateDemog.cpp.

Create and propagate data into a hashmap to aggregate county data to state data

Use an STL hashmap to associate each county’s data with its state (recall that our demogData class has a string holding the name of the state where that county is located). Again you have choices here. Do what makes sense for your solution.

Depending on the order you tackle these tasks, don’t forget that one task is to average the demographic data for all counties into state level demographic data.
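One possible shape for the hashmap step is sketched below, assuming std::unordered_map keyed by the state name. County, StateTotals and aggregateByState are hypothetical names used only for illustration; they are not the required stateDemog or dataAQ interfaces, and only one demographic field is shown.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical, simplified county record (not the lab's demogData).
struct County {
    std::string state;
    double pctUnder5;  // percentage, e.g. 8.5 means 8.5%
    int population;
};

// Hypothetical running totals for one state.
struct StateTotals {
    int numCounties = 0;
    long population = 0;
    double under5Count = 0.0;  // head count accumulated from county percentages

    double pctUnder5() const {
        assert(population > 0);
        return 100.0 * under5Count / population;
    }
};

std::unordered_map<std::string, StateTotals>
aggregateByState(const std::vector<County>& counties) {
    std::unordered_map<std::string, StateTotals> states;
    for (const County& c : counties) {
        // operator[] default-constructs the entry the first time a state is seen.
        StateTotals& s = states[c.state];
        s.numCounties += 1;
        s.population += c.population;
        s.under5Count += c.pctUnder5 / 100.0 * c.population;
    }
    return states;
}
```

Storing counts while aggregating and computing the percentage only on demand keeps the result correct regardless of the order in which counties are visited.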

Task 3: Use data representation (STL hashmap) to be able to answer queries about data

Once you have a collection of state level data, we should be able to identify extremes (maximums and minimums) for any of the data fields (or combination of data fields).

Your solution should be general, i.e. it should be able to find the extremes of any of the data fields (and you should expect us to ask for this during your code review).
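One way to keep the query general is to pass the field of interest as a callable, so a single function serves every field. The sketch below assumes a hypothetical StateStats record and a hypothetical helper stateWithMax; neither is part of the required dataAQ interface.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical state record showing only two fields.
struct StateStats {
    double pctUnder5;
    double pctOver65;
};

typedef std::unordered_map<std::string, StateStats> StateMap;
typedef std::pair<const std::string, StateStats> StateEntry;

// Return the name of the state whose selected field is largest.
// `field` is any callable that takes a StateStats and returns a comparable value.
template <typename Field>
std::string stateWithMax(const StateMap& states, Field field) {
    assert(!states.empty());
    StateMap::const_iterator best = std::max_element(
        states.begin(), states.end(),
        [&field](const StateEntry& a, const StateEntry& b) {
            return field(a.second) < field(b.second);
        });
    return best->first;
}
```

A call such as stateWithMax(states, [](const StateStats& s) { return s.pctOver65; }) selects by the over-65 field; swapping std::max_element for std::min_element gives the matching minimum query.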

Task 4: Testing - implement the dataAQ class exactly as specified for testing

specific cases that must match output (e.g. state with the youngest population)

For the sake of testing, please implement a class named ‘dataAQ’ that can aggregate data and print out results from specific queries. This class should be filled in using the blank dataAQ.h and dataAQ.cpp files provided. See testStates.cpp for an example of how this class will be used. Again, the exact implementation is up to you, but your dataAQ class must support the following methods:

//data aggregator and query for testing
class dataAQ {
  public:
    dataAQ();
    //function to aggregate the data - this CAN and SHOULD vary per student - depends on how they map
    void createStateData(std::vector< shared_ptr<demogData> > theData); // (*) see note below
    //return the name of the state with the largest population under age 5
    string youngestPop();
    //return the name of the state with the largest population under age 18
    string teenPop();
    //return the name of the state with the largest population over age 65
    string wisePop();
    //return the name of the state with the largest population who did not finish high school
    string underServeHS();
    //return the name of the state with the largest population who completed college
    string collegeGrads();

    //additional methods AND data to support above methods.  You are allowed for data to be public
    ...
 };

(*) Note that this used to say void createStateDemogData(std::vector< shared_ptr<demogData> > theData); but the autograder is configured with 'createStateData', so use that name.

Again, see testStates.cpp for the use of the dataAQ class to test your implementation.

You are encouraged to write additional test cases for each of the required queries in dataAQ.

The output for dataProj should include a complete version of the following (your code would fill in for BLANK):

* the state that needs the most pre-schools
State Info: UT Number of Counties: 29
Population info:
(over 65): 10.03% and total: 295145
(under 18): 30.71% and total: 903829
(under 5): 8.58% and total: 252377
Education info:
(Bachelor or more): 30.54% and total: 898887
(high school or more): 91.01% and total: 2678411
Total population: 2942902
* the state that needs the most high schools
State Info: BLANK Number of Counties: BLANK
Population info:
(over 65): BLANK and total: BLANK
(under 18): BLANK and total: BLANK
(under 5): BLANK and total: BLANK
Education info:
(Bachelor or more): BLANK and total: BLANK
(high school or more): BLANK and total: BLANK
Total population: BLANK
* the state that needs the most vaccines
State Info: BLANK Number of Counties: BLANK
Population info:
(over 65): BLANK and total: BLANK
(under 18): BLANK and total: BLANK
(under 5): BLANK and total: BLANK
Education info:
(Bachelor or more): BLANK and total: BLANK
(high school or more): BLANK and total: BLANK
Total population: BLANK
* the state that needs the most help with education
State Info: BLANK Number of Counties: BLANK
Population info:
(over 65): BLANK and total: BLANK
(under 18): BLANK and total: BLANK
(under 5): BLANK and total: BLANK
Education info:
(Bachelor or more): BLANK and total: BLANK
(high school or more): BLANK and total: BLANK
Total population: BLANK
* the state with most college grads
State Info: BLANK Number of Counties: BLANK
Population info:
(over 65): BLANK and total: BLANK
(under 18): BLANK and total: BLANK
(under 5): BLANK and total: BLANK
Education info:
(Bachelor or more): BLANK and total: BLANK
(high school or more): BLANK and total: BLANK
Total population: BLANK
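The percentage lines in the sample output show two decimal places followed by a raw count. A minimal sketch of producing one such line with the standard stream manipulators follows; formatLine is a hypothetical helper, and it assumes the percentage is derived from the count and the total state population, which may not match how your class stores its data.

```cpp
#include <cassert>
#include <iomanip>
#include <sstream>
#include <string>

// Build one report line in the style "(label): NN.NN% and total: COUNT".
std::string formatLine(const std::string& label, long count, long totalPop) {
    assert(totalPop > 0);
    std::ostringstream out;
    out << "(" << label << "): "
        << std::fixed << std::setprecision(2)  // two digits after the decimal point
        << 100.0 * count / totalPop
        << "% and total: " << count;
    return out.str();
}
```

With the Utah figures from the sample (295145 people over 65 out of 2942902), formatLine("over 65", 295145, 2942902) yields the line shown in the sample output. Always compare your full output against the provided tests rather than relying on this sketch for exact spacing.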

For clarity, the questions in main refer to the queries in dataAQ (i.e. ‘most’ refers to a ranking in proportion to the state’s whole population, not total count). Thus:
-the state that needs the most pre-schools: should be the state with the largest percentage of its population under age 5
-the state that needs the most high schools: should be the state with the largest percentage of its population under age 18
-the state that needs the most vaccines: should be the state with the largest percentage of its population over age 65
-the state that needs the most help with education: should be the state with the largest percentage of its population who did not finish high school
-the state with most college grads: should be the state with the largest percentage of its population who completed college

Grading

(50) tests passed (GS)
(20) reasonable state data design
(20) reasonable aggregation computation
(10) reasonable reporting of mins and max in dataAQ

Acknowledgements

Winter ’21 version: Zoë Wood. Thank you to the original CORGIS crew for their work sharing this data (https://corgis-edu.github.io/corgis/) and to Aaron Keen for curriculum review. Editing, autograder and general support thanks to: AMR, SDM, QH, AR, BL