Wednesday, February 4, 2015

Data visualizations based on Spotify listens

I've been tracking the music I listen to on Spotify since early November. An IFTTT recipe automatically adds a row to a Google Spreadsheet every time I listen to a song, recording the song name, artist name, album title, and timestamp.

While other people might be interested in the type of music and how much of it I listen to, I am likely the main audience for this visualization. Whether the audience is just me or includes family members, friends, classmates, or others, some might make assumptions about what types of music I listen to on a regular basis. People who know I have a daughter would understand the kids' songs in the treemap; those who don't know about my daughter might find the kids' songs strange. The audience also probably has some assumptions about how they listen to music throughout the day, and may compare my listening habits to theirs. Someone who needs silence while working will not understand how I listen to so much music. People may also judge the types of music I listen to, and think better or worse of me.

My questions are: Which artists do I listen to the most, and do I like a lot of those artists' songs or are they one-hit wonders to me? What day of the week do I listen to the most music? What time of day do I listen to the most music?

There are several other questions I have, but I don't currently have the proper data to answer them. Those questions are: Do I listen to certain genres of music on specific days? I'm guessing that music on Mondays is different than music on Fridays. How does the number of meetings I have during the day affect how many songs, and what types of songs, I listen to? Again, I'm guessing that more meetings might mean fewer songs, and that the genres would change (since many meetings tend to put me in a bad mood, especially when they are poorly run). To make these analyses work, I would need to manually input genres for each song, and would need to go through my calendar and manually record the number of meetings I have each day.

I'm using R for the data preparation and analysis, and Illustrator to fine-tune the images.
For the treemap I used the R package 'portfolio'. Flowingdata.com helped with the steps I needed to take to make this work. For the bar graph, I used barplot. For the line graph, I used ggplot.
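
As a rough illustration of the counting behind the bar and line graphs, here is a minimal sketch in base R. The `listens` data frame, its column names, the sample rows, and the timestamp format are all assumptions made for the example; the real spreadsheet export may look different, and this is not the code from the GitHub repository.

```r
# Hypothetical sample of the listening log (the real log has many more rows)
listens <- data.frame(
  song = c("Song A", "Song B", "Song C", "Song C", "Song C", "Song D"),
  artist = c("Artist 1", "Artist 1", "Artist 2", "Artist 2", "Artist 2", "Artist 1"),
  timestamp = c("2014-11-03 09:15", "2014-11-03 09:40", "2014-11-03 14:05",
                "2014-11-04 09:20", "2014-11-07 16:30", "2014-11-07 16:45"),
  stringsAsFactors = FALSE
)

# Parse the timestamps (assumed format) and pull out hour and weekday
when <- as.POSIXlt(listens$timestamp, format = "%Y-%m-%d %H:%M", tz = "UTC")
day_names <- c("Sunday", "Monday", "Tuesday", "Wednesday",
               "Thursday", "Friday", "Saturday")
listens$hour <- when$hour
listens$day <- factor(day_names[when$wday + 1], levels = day_names)

# Songs per hour of the day and per day of the week
songs_per_hour <- table(factor(listens$hour, levels = 0:23))
songs_per_day <- table(listens$day)

# Total plays per artist versus number of distinct songs per artist
plays_per_artist <- table(listens$artist)
songs_per_artist <- tapply(listens$song, listens$artist,
                           function(s) length(unique(s)))
```

Passing `songs_per_hour` or `songs_per_day` to `barplot()` would draw a basic version of the charts. Comparing `plays_per_artist` with `songs_per_artist` separates artists with many different songs played from artists where a single song is on repeat.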

I'm also including my initial sketches of how I wanted the visualization to look. This helped a lot for planning my attack, especially with regard to the data cleanup.

Visualizations are below; all code is on GitHub.

Sketch of what I wanted:

Treemap of artists I listen to most:
Songs listened to per hour:


Songs listened to by day:


Tuesday, January 20, 2015

State of the Union Address in Wordcloud form

The State of the Union address happened tonight, and the White House released the speech on Medium. I created a wordcloud of the most popular words in the prepared remarks. "New" and "America" are the two most popular words used. You can see that "Americans" and "American" are very popular, too. I chose to leave all the different variations of those words in the wordcloud so it was that much more apparent how popular certain words are.

Here is the wordcloud:
And here is the code (written in R):
library(tm)
library(wordcloud)

# Read the speech as plain text, one element per line.
# readLines avoids read.table's handling of quotes and comment characters.
SOTU <- readLines("SOTU.txt")

# Build the corpus and normalize the text
mycorpus <- Corpus(VectorSource(SOTU))
mycorpus <- tm_map(mycorpus, content_transformer(tolower))
mycorpus <- tm_map(mycorpus, removeNumbers)
mycorpus <- tm_map(mycorpus, removePunctuation)
mycorpus <- tm_map(mycorpus, stripWhitespace)

# Drop a few filler words, then the standard English stopwords
mycorpus <- tm_map(mycorpus, removeWords, c("will", "what", "whats", "was", "way", "too", "that"))
mycorpus <- tm_map(mycorpus, removeWords, stopwords("english"))

# Count how often each word appears across the speech
dtm <- DocumentTermMatrix(mycorpus)
inspect(dtm)
findFreqTerms(dtm, lowfreq = 3)
freq <- sort(colSums(as.matrix(dtm)), decreasing = TRUE)
plot(freq)
wf <- data.frame(word = names(freq), freq = freq)

# Draw the wordcloud; the seed keeps the layout reproducible
set.seed(123)
wordcloud(names(freq), freq, min.freq = 5, max.words = 500, scale = c(5, .1), colors = c("red", "blue"))

Saturday, January 3, 2015

World Map built in R

This is a quick map I built in R using the world.cities dataset that comes built into the "maps" package. I plotted the location of every city around the world and colored them orange. Then I broke out the dataset to include only cities with more than 1,000,000 in population, plotted those cities, and colored them black. There are a couple things I really like about this visualization. First, you can pretty clearly see the borders of continents. It is clear that cities were able to grow in part because of their proximity to water. It is also interesting to see where the gaps in cities are. The Sahara Desert has very few cities as does much of Western China. Europe's entire land mass is basically covered by cities, while Australia has very few.
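
The two-layer approach described above can be sketched roughly as follows. This is an illustrative version, not the code from the repository: the helper `split_by_population` is a hypothetical name, and the plotting details (point symbols, sizes, axis labels) are assumptions. It relies on the `pop`, `long`, and `lat` columns of `world.cities` and only draws the map when the "maps" package is installed.

```r
# Hypothetical helper: keep all cities, plus the subset above a population threshold
split_by_population <- function(cities, threshold = 1e6) {
  list(
    all = cities,
    big = cities[which(cities$pop > threshold), ]
  )
}

if (requireNamespace("maps", quietly = TRUE)) {
  # world.cities ships with the maps package
  data("world.cities", package = "maps")
  groups <- split_by_population(world.cities)

  # First layer: every city as a small orange dot
  plot(groups$all$long, groups$all$lat, pch = ".", col = "orange",
       asp = 1, xlab = "Longitude", ylab = "Latitude")

  # Second layer: cities over 1,000,000 people in black
  points(groups$big$long, groups$big$lat, pch = 20, cex = 0.5, col = "black")
}
```

Drawing the large cities last keeps them on top of the dense orange layer, which is why the order of the two plotting calls matters.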

None of this is revolutionary, but interesting to look at nonetheless. I've included the simple code on GitHub here: https://github.com/samedelstein/World-Maps-in-R

Monday, December 22, 2014

Wage and Employment in Onondaga County

I pulled this data set from the Quarterly Census of Employment and Wages Annual Data: Beginning 2000. This data set reports both employment numbers and average wages, broken up by county and also by job function.

Because there were so many job functions listed, I decided to group many of them together so the visualizations were easier to digest. You'll see those specific visualizations below. I may have miscategorized some of the job functions, but I tried to match them based on their names.

The first visualization shows that overall, annual wages are increasing across New York State, which makes sense because of inflation. New York County draws the highest wages by far, with Westchester County following with the second highest average wages in the state. I highlighted Onondaga County in orange since I live in Onondaga County and I will focus on it more in visualizations below.

New York County saw an increase in wages of about $30,000 from 2000 to 2014. Onondaga County saw wages rise by about $13,000. Many counties saw wages flat or decreasing from 2007-2009, thanks to the economic recession.

This visualization shows the same information as the visualization above, but looks at a map of all the counties. You can scroll through to see how the counties change each year. Red shaded counties mean the average wage is below $40,000. Blue means the average wage is above $40,000. You can see an increasing number of counties turning blue over time - again thanks to inflation and rising wages.
Now we'll look at average wages over time for different job functions, specifically in Onondaga County. Not all job functions have been measured every year, and again, I made the choice of how to group the job functions. Utility and finance jobs garner the highest average wages. Data and Information Services had a large jump in 2002, but has since dropped and flattened out. Finance suffered in the economic recession and is only now getting back to the same average wages as before the recession.
While finance and utilities pay the best in Onondaga County, health, government, and administrative support jobs are the most plentiful. Those jobs that pay the most? They are some of the rarest jobs in the county, with fewer than 2,000 people working in the field. Health, accommodation, and education jobs are all on the rise, in terms of employment.
What else do you see? Any other insights that you find interesting?

Sunday, December 21, 2014

Overweight and Obesity Rates for Students in New York State

I grabbed the Student Weight Status Category Reporting Results: Beginning 2010 dataset. After looking it over, I wanted to dive into how the obesity rates change by date for each county. In the data set, there is one measurement done in the 2010-2012 year range, and the other done in the 2012-2013 year range. 

First, I looked at the average obesity and overweight percentages for each county's school districts and compared them by year. For this, I used a slope graph to make it easy to see which counties are increasing in obesity as time goes on and which are decreasing. Orange is increasing, blue is decreasing. The biggest jump is in Hamilton, beginning at 18.87% and ending at 43.10%. Schoharie also has a large jump, and finishes as the county with the highest obesity and overweight percentages. In the 2010-2012 study, Schoharie began at 35.64% and by the 2012-2013 study, it was at 45.80%.

Since I live in Onondaga County, I was interested in looking a little deeper at the numbers. First, overall for the county, the numbers remained largely the same, with a slight increase from 33.5% to 33.72%.

After looking at the overall numbers for all counties in the state, I looked a little closer at each school district within Onondaga County. Some school districts did not report obesity and overweight figures in each study, and in these cases, I removed the district. While overall in the state Onondaga County was largely flat from the 2010-2012 study to the 2012-2013 study, within Onondaga County the numbers are more volatile. 

Syracuse City Schools have a large jump, beginning at 32.2% and ending at 41.5%. Meanwhile, Fabius-Pompey Central School District observes almost no growth in obesity and overweight rates, but still maintains the highest percentages, beginning at 44.4% and ending at 44.9%. Both Baldwinsville and Tully School Districts had relatively large declines in obesity and overweight measures, beginning at about 33% and ending at about 30%. Fayetteville-Manlius School District has the lowest obesity and overweight rates, and even manages to reduce them slightly from 23% to 22.5%.
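
To make the slope graph idea concrete, here is a minimal base R sketch using three of the district figures quoted above. The data frame layout, the column names, and the plotting choices are assumptions for illustration; the original graphs were not necessarily built this way.

```r
# District figures quoted in the text (percent overweight or obese)
districts <- data.frame(
  district = c("Syracuse City", "Fabius-Pompey", "Fayetteville-Manlius"),
  study_2010_2012 = c(32.2, 44.4, 23.0),
  study_2012_2013 = c(41.5, 44.9, 22.5),
  stringsAsFactors = FALSE
)

# Change between the two studies, colored like the graphs:
# orange for an increase, blue for a decrease
districts$change <- districts$study_2012_2013 - districts$study_2010_2012
districts$color <- ifelse(districts$change > 0, "orange", "blue")

# Slope graph: one line per district from the first study to the second
y_range <- range(c(districts$study_2010_2012, districts$study_2012_2013))
plot(c(1, 2), y_range, type = "n", xlim = c(0.5, 2.5), xaxt = "n",
     xlab = "", ylab = "Percent overweight or obese")
axis(1, at = 1:2, labels = c("2010-2012", "2012-2013"))
segments(x0 = 1, y0 = districts$study_2010_2012,
         x1 = 2, y1 = districts$study_2012_2013,
         col = districts$color, lwd = 2)
text(2, districts$study_2012_2013, districts$district, pos = 4, cex = 0.7)
```
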
Last, I was interested in how the change in grade level affects students' obesity and overweight rates. The yearly changes analyzed above are looking at school districts as a whole. Grade level is looking more at individuals on a broad scale. The results are interesting.

Despite Fabius-Pompey Central School District having the highest obesity and overweight rates by year, it actually sees a large decrease in rates from elementary to middle/high school years. The district has the largest percentage of overweight and obese students in elementary school at 54.80%. By middle/high school, though, the district has dropped into the middle of the pack relative to other Onondaga County schools at 33.95%.

Fayetteville-Manlius remains at the lowest rates versus other county school districts, and continues to see a drop from elementary to middle/high school from 24.55% to 20.1%. Syracuse City Schools increase from 36.7% to 38.45%. Onondaga Central Schools have the highest jump from 33.35% to 45.50%.

There are some clear differences in school districts. What do you think causes them? Any other insights you can glean from the data?

Wednesday, December 17, 2014

New York Council on the Arts funding since 2003

I found a data set from the New York Council on the Arts detailing all the grants they have issued since 2003.
Above, you can see all the visualizations I created. For the first two tabs, which are a heatmap and a treemap, it is easy to see just how much funding New York City and its adjacent areas receive. In many ways this makes sense, since there is so much different art and culture happening, and many hundreds of organizations exist. But New York and surrounding areas are not getting the largest average grants, as you can see on the following tab ("Average grant by county"). Instead, Steuben County has the highest average grant at $37,483, while New York receives an average of $16,623.
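
The gap between total funding and average grant size is easy to see in a small example. The sketch below uses made-up grant amounts (they are not from the council's data set) purely to show how a county with many grants can lead on total funding while trailing on the average.

```r
# Hypothetical grants; the amounts are invented for illustration only
grants <- data.frame(
  county = c("New York", "New York", "New York", "New York", "New York",
             "Steuben", "Steuben"),
  amount = c(10000, 20000, 15000, 25000, 12000, 40000, 30000),
  stringsAsFactors = FALSE
)

# Total and average grant per county
total <- tapply(grants$amount, grants$county, sum)
average <- tapply(grants$amount, grants$county, mean)

summary_table <- data.frame(
  county = names(total),
  total = as.vector(total),
  average = as.vector(average)
)
print(summary_table)
```

In this toy data, New York has the larger total because it receives more grants, while Steuben has the larger average because its few grants are each bigger.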

You can also see funding per year, where - no surprise - New York and Brooklyn are much higher than every other county. I created a separate graph without the two counties so it would be easier to see how other counties stack up against each other. Overall, funding is going down.

By program, funding is also going down in recent years. However, state and local partnership funding from the council took a steep rise in 2007 and has remained the top funding option from the council since then. It is also on the rise again after several years of decline.

What else do you see that is interesting?

Sunday, November 2, 2014

National Park Service Visitor Statistics

I love national parks, so I thought it would be fun to look at visitor use statistics as well as the number of parks per state. At first glance, the detail that stands out most is that California has a large number of national park sites compared to all other states and also gets a relatively large number of visitors. Compare this to Alaska, which has a large number of national park sites but a relatively low number of visitors. North Carolina seems to be the other state that has a higher number of visitors on average versus the number of national park sites.

One key note is that since some NPS sites exist in multiple states, I needed to make a decision about which state to assign each park to. This artificially increases or reduces the actual number of parks per state, though only by plus or minus one. All data is from NPS.gov.