Real Data Exercise

Here are two files from the internet: SAT-2010.csv  and SAT-2010-small.csv, where the second file is a tiny version of the first -- with the small one used for practice and debugging.  These contain the average SAT scores for various schools in the city for 2010.  Download these files, and look at them with two different tools: 1) a spreadsheet, like Google-Docs or Excel or Libre-Office spreadsheet, and 2) an ordinary text editor, like gedit, or emacs, or notepad.

For instance, this is what it looks like in a spreadsheet:

Spreadsheet view of the file

And this is what it looks like inside a text editor:

Text editor view

This is really a .CSV file, and so, you should separate the file into lines (using .split('\n')) and then, for each line, separated it into fields (using .split(',')).  Then, you can access the individual school's scores.

Exercises:

1.  Create a function called ReadSAT(filename)  that will read this type of file, and return a list of school's data.  Each element of the list, will itself be a list of a school's data.

>>> lists=ReadSAT('SAT-2010-small.csv)
>>> print lists[:2]
[['Henry Street Settlement for International Studies',31,391,425,385],['University Neighborhood High School',60,394,419,387]]

2. Dealing with exceptional data:  Look at the data above.  In the text-editor view, look at lines 6 and 8, each of which has an issue.  In line 6, there is no numerical data, only the letter "s" in the place of those numbers.  Line 8 is even more problematic because the school name contains a comma, and therefore if you separate this line using a comma as a separator, there'll be trouble -- you'll be separating the school name itself.  When a school name has embedded commas, the entire name will be surrounded by double-quotes in the file, as it is in line 8.  Try your hand at figuring out how to deal with these problems.  Create a function called HighLowSAT(filename) that will print out the name of school with the highest math score and the school with the lowest math score.  Remember, the scores in the file are strings, and you'll need to convert to numbers in order to compare them to each other.

3. Let's deal with real stats.  Create the function BigStats(filename) which will return a list of 4 numbers.  1) will be the total SAT test takers in the city; 2) will be the average student reading score; 3) average student math score; 4) average writing score.  By average student score, I mean the score that the average student would get, which means you have to use the number of test takers in each school in the calculation.

4. Create a function School2Dict(line) that will take a line of data from this file and create a dictionary with key-names called "DBN", "Name", "Number", "Reading", "Math" and "Writing" and will put the appropriate values into the dictionary.  For instance:

>>> print School2Dict('01M650,CASCADES HIGH SCHOOL ,35,411,401,401')
{'DBN':'01M650','Name':'CASCADES HIGH SCHOOL','Number':35,'Reading':411,'Math':401,'Writing':401}

>>> print School2Dict('01M509,CMSP HIGH SCHOOL ,s,s,s,s')
{'DBN':'01M509','Name':CSMP HIGH SCHOOL','Number':'s','Reading':'s','Math':'s','Writing':'s'}
>>> print School2Dict('01M539,"New Explorations into Sci, Tech and Math HS ",47,568,583,568')
{'DBN':'01M539','Name':'New Explorations into Sci, Tech and Math HS','Number':47,'Reading':568,'Math':583,'Writing':568}

Notice that a number of issues must be dealt with: 

  1. When there are numbers for the schools, the strings are converted to numbers, so that arithmetic can be done; however, for schools with no numbers, the string "s" is retained;
  2. The normal trailing space character at the ends of school names has been eliminated
  3. Schools with commas in their names are correctly dealt with, and the double-quotes removed from the school name

5. Like the exercise we did in class on Thurs, create a master dictionary whose key is the DBN field of SAT-2010 data.  Create the function MakeMaster(filename) that returns the master dictionary, so that one can do the following:

>>> m=MakeMaster('SAT-2010.csv')
>>> print m['01M509']['Name']
CSMP HIGH SCHOOL
>>> print m['01M650']['Math']
401

6. Download the file namelist.csv.  If contains first and last names and ratings like in the exercise we did in class on Friday.  Create a function called NameQuery(filename), that will be given the filename (whatever name you downloaded the file as) and will digest that file in order answer queries about it.  Here's a session:

>>> NameQuery('namelist.csv')

Report on what name(s)? max
MAXWELL IRIKURA has a rating of 53
MAXWELL LIN has a rating of 83

Report on what name(s)? ish
Ish MAHDI has a rating of 192

Report on what name(s)? ma
MATTHEW AU has a rating of 192
MAXWELL IRIKURA has a rating of 53
MAXWELL LIN has a rating of 83
ERIK MAI has a rating of 21
MATTHEW WONG has a rating of 162
MAX LOBEL has a rating of 179
Ish MAHDI has a rating of 192 MAJOR MANWELL has a rating of 200
Allan Mahdi has a rating of 192

6a) Challenge:  Same as problem 6 above, except that the names must be printed out in order by ratings, highest ratings first.  If a bunch of people have the same rating, then print them out in alphabetical order by last name, and if  they also have the same last name, then alphabetically by first name.  (Be careful about upper and lower case names.)  This re-ordering problem is not easy, so if you want, just re-order by rating (highest to lowest) first, and verify that it works, and only after that try to do the sub-orderings (if the ratings are equal) with lastname and firstname.  So...

Report on what name(s)? ma
MAJOR MANWELL has a rating of 200
MATTHEW AU has a rating of 192
Allan Mahdi has a rating of 192 Ish MAHDI has a rating of 192 MAX LOBEL has a rating of 179
MATTHEW WONG has a rating of 162 MAXWELL LIN has a rating of 83 MAXWELL IRIKURA has a rating of 53
ERIK MAI has a rating of 21