Here are two files from the internet: SAT-2010.csv
and SAT-2010-small.csv, where the
second file is a tiny version of the
first -- with the small one used for practice and debugging. These
contain the average SAT scores for various schools in the city for
2010. Download these files, and look at them with two different
tools: 1) a spreadsheet, like Google-Docs or Excel or Libre-Office
spreadsheet, and 2) an ordinary text editor, like gedit, or emacs, or
notepad.
For instance, this is what it looks like in a spreadsheet:
This is really a .CSV file, and so, you should separate the file into
lines (using .split('\n')) and then, for each line, separated it into
fields (using .split(',')). Then, you can access the individual
school's scores.
Exercises:
1. Create a function called ReadSAT(filename) that will read
this type of file, and return a list of school's data. Each
element of the list, will itself be a list of a school's data.
>>> lists=ReadSAT('SAT-2010-small.csv) >>> print lists[:2] [['Henry Street Settlement for International Studies',31,391,425,385],['University Neighborhood High School',60,394,419,387]]
2. Dealing with exceptional data: Look at the data above. In the text-editor view, look at lines 6 and 8, each of which has an issue. In line 6, there is no numerical data, only the letter "s" in the place of those numbers. Line 8 is even more problematic because the school name contains a comma, and therefore if you separate this line using a comma as a separator, there'll be trouble -- you'll be separating the school name itself. When a school name has embedded commas, the entire name will be surrounded by double-quotes in the file, as it is in line 8. Try your hand at figuring out how to deal with these problems. Create a function called HighLowSAT(filename) that will print out the name of school with the highest math score and the school with the lowest math score. Remember, the scores in the file are strings, and you'll need to convert to numbers in order to compare them to each other.
3. Let's deal with real stats. Create the function BigStats(filename) which will return a list of 4 numbers. 1) will be the total SAT test takers in the city; 2) will be the average student reading score; 3) average student math score; 4) average writing score. By average student score, I mean the score that the average student would get, which means you have to use the number of test takers in each school in the calculation.
4. Create a function School2Dict(line) that will take a line of data from this file and create a dictionary with key-names called "DBN", "Name", "Number", "Reading", "Math" and "Writing" and will put the appropriate values into the dictionary. For instance:
>>> print School2Dict('01M650,CASCADES HIGH SCHOOL ,35,411,401,401') {'DBN':'01M650','Name':'CASCADES HIGH SCHOOL','Number':35,'Reading':411,'Math':401,'Writing':401} >>> print School2Dict('01M509,CMSP HIGH SCHOOL ,s,s,s,s') {'DBN':'01M509','Name':CSMP HIGH SCHOOL','Number':'s','Reading':'s','Math':'s','Writing':'s'}>>> print School2Dict('01M539,"New Explorations into Sci, Tech and Math HS ",47,568,583,568') {'DBN':'01M539','Name':'New Explorations into Sci, Tech and Math HS','Number':47,'Reading':568,'Math':583,'Writing':568}
Notice that a number of issues must be dealt with:
5. Like the exercise we did in class on Thurs, create a master
dictionary whose key is the DBN field of SAT-2010 data. Create the
function MakeMaster(filename) that returns the master dictionary, so
that one can do the following:
>>> m=MakeMaster('SAT-2010.csv')
>>> print m['01M509']['Name']
CSMP HIGH SCHOOL
>>> print m['01M650']['Math']
401
6. Download the file namelist.csv. If contains first and last names and ratings like in the exercise we did in class on Friday. Create a function called NameQuery(filename), that will be given the filename (whatever name you downloaded the file as) and will digest that file in order answer queries about it. Here's a session:
>>> NameQuery('namelist.csv')
Report on what name(s)? max
MAXWELL IRIKURA has a rating of 53
MAXWELL LIN has a rating of 83
Report on what name(s)? ish
Ish MAHDI has a rating of 192
Report on what name(s)? ma
MATTHEW AU has a rating of 192
MAXWELL IRIKURA has a rating of 53
MAXWELL LIN has a rating of 83
ERIK MAI has a rating of 21
MATTHEW WONG has a rating of 162
MAX LOBEL has a rating of 179
Ish MAHDI has a rating of 192 MAJOR MANWELL has a rating of 200
Allan Mahdi has a rating of 192
6a) Challenge: Same as problem 6 above, except that the names must be printed out in order by ratings, highest ratings first. If a bunch of people have the same rating, then print them out in alphabetical order by last name, and if they also have the same last name, then alphabetically by first name. (Be careful about upper and lower case names.) This re-ordering problem is not easy, so if you want, just re-order by rating (highest to lowest) first, and verify that it works, and only after that try to do the sub-orderings (if the ratings are equal) with lastname and firstname. So...
Report on what name(s)? ma
MAJOR MANWELL has a rating of 200
MATTHEW AU has a rating of 192
Allan Mahdi has a rating of 192 Ish MAHDI has a rating of 192 MAX LOBEL has a rating of 179
MATTHEW WONG has a rating of 162 MAXWELL LIN has a rating of 83 MAXWELL IRIKURA has a rating of 53
ERIK MAI has a rating of 21