Yelp is a website on which people review local businesses. Yelp describes itself as an “online urban city guide that helps people find cool places to eat, shop, drink, relax and play, based on the informed opinions of a vibrant and active community of locals in the know.”
Earlier this year, Yelp released a subset of their data for academic use. The dataset includes business data (e.g., address, longitude, latitude), review data (e.g., number of ‘cool’ votes) and user data (e.g., name of user, number of reviews). Yelp provided these data for the nearest 250 businesses to each of the following 30 colleges and universities:
California Institute of Technology
California Polytechnic State University
Carnegie Mellon University
Georgia Institute of Technology
Harvey Mudd College
Massachusetts Institute of Technology
Rensselaer Polytechnic Institute
University of California - Los Angeles
University of California - San Diego
University of California at Berkeley
University of Illinois - Urbana-Champaign
University of Maryland - College Park
University of Massachusetts - Amherst
University of Michigan - Ann Arbor
University of North Carolina - Chapel Hill
University of Pennsylvania
University of Southern California
University of Texas - Austin
University of Washington
University of Waterloo
University of Wisconsin - Madison
*Yelp claims that you can email them to add other campuses, but when I did that to request the University of Minnesota, they responded that they were not adding other campuses at the time. Perhaps if more requests come in they will update the dataset.
The data are in JSON format, which makes d3 a good candidate for visualization work. R can also read JSON-formatted data using the rjson package. [See the StackOverflow post here.] You need to install Yelp’s Python framework to obtain the data, but instructions are on the webpage. They also provide a GitHub page to inspire you.
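If you would rather stay in Python, the standard library is enough to get started. The following is a minimal sketch, not Yelp's own loader: it assumes the file holds one JSON object per line and that each object carries a "type" field distinguishing businesses, reviews and users. The sample records and their field names (`business_id`, `votes`, `review_count`, and so on) are made up for illustration, so check them against the file you actually download.

```python
import io
import json
from collections import defaultdict

# Hypothetical sample standing in for the downloaded file. The layout
# (one JSON object per line) and all field names are assumptions.
SAMPLE = """\
{"type": "business", "business_id": "b1", "name": "Cafe A", "latitude": 44.97, "longitude": -93.23}
{"type": "review", "business_id": "b1", "user_id": "u1", "stars": 4, "votes": {"cool": 2}, "text": "Great espresso"}
{"type": "user", "user_id": "u1", "name": "Pat", "review_count": 12}
"""


def load_records(lines):
    """Group newline-delimited JSON records by their "type" field."""
    records = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        obj = json.loads(line)
        records[obj.get("type", "unknown")].append(obj)
    return records


# With a real file, pass an open file handle instead of the StringIO.
records = load_records(io.StringIO(SAMPLE))
for kind, items in records.items():
    print(kind, len(items))
```

Splitting the records by type up front makes it easy to hand each group to a separate table or data frame later.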
What questions can you answer with this dataset? Here are some examples:
Which review or reviews have the largest number of unique words (i.e., words not seen in any other review)? [see here]
Can you predict which businesses will close based on reviews?
Are there businesses that users tend to rate highly when they rate other businesses highly (i.e., are there associations between businesses that owners should/could take advantage of)?
Are there differences in the reviews of ‘new’ and ‘old’ businesses? How do these differences play out over time?
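To give a flavor of how the first question (unique words) might be approached, here is a rough sketch in Python. It is only one possible reading of "unique": a word counts as unique if it appears in exactly one review. The tokenizer is a deliberately crude regular expression, and the three reviews and their IDs are invented for the example.

```python
import re
from collections import Counter


def unique_word_counts(reviews):
    """Return {review_id: number of words that appear in no other review}."""
    # Reduce each review to its set of lowercase words (crude tokenizer).
    tokens = {
        rid: set(re.findall(r"[a-z']+", text.lower()))
        for rid, text in reviews.items()
    }
    # Document frequency: in how many reviews does each word occur?
    doc_freq = Counter(word for words in tokens.values() for word in words)
    return {
        rid: sum(1 for word in words if doc_freq[word] == 1)
        for rid, words in tokens.items()
    }


# Invented toy reviews, keyed by a hypothetical review ID.
reviews = {
    "r1": "Great coffee and friendly staff",
    "r2": "Great pizza, slow service",
    "r3": "Friendly staff and great pizza",
}
counts = unique_word_counts(reviews)
best = max(counts, key=counts.get)
print(counts, best)
```

On real review text you would probably want a better tokenizer, and perhaps to normalize by review length, since longer reviews have more chances to contain a rare word.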
I am sure people can come up with many other questions, most probably more interesting than these. If you have worked with this dataset and produced something, please link to it in the comments.