Data Representation Final | Leaving New York City

// Before reading further, a warning: the first part of this blog post discusses the failure of my initial project idea, and the second part focuses on my eventual final project.

// Part One

For a time, I could not pinpoint what exactly I wished to focus on for my final project. Initially, I wanted to explore visualizing data in a 3-dimensional space. Before beginning my project, I had the constraint of not being able to utilize Processing (long story), so I had to resort to other means for my final project.

So, I decided to visualize a section of NYC in Houdini, a 3D software application. As for what meaning I would represent in this environment, I did not have a clue; I just knew that I wanted to have a go at it and see what would develop. So, I took a 2D cross-section of Lower Manhattan & Brooklyn from OpenStreetMap. From this cross-section, I was able to extract an XML file which contained location data for the buildings, roads, parks, etc. I successfully parsed the XML data (as far as a 2D representation) using Python scripting within Houdini. When it came time to script a 3D model of my map, my render exhibited unexplained behavior. Visually, the render was compelling, but it was not an accurate representation of that cross-section of NYC. Obviously, I could not expect an exact 3D representation of my NYC cross-section, since the map provided by OpenStreetMap does not contain every building in NYC, merely an approximation. I needed a ‘true’ 3D representation of NYC as a base before altering it with various data sets, such as the water levels in NYC before and after Hurricane Sandy hit. I will most likely hammer out the kinks in the scripting over the holidays.
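For the curious, here is a minimal sketch of the 2D parsing step, written to run outside of Houdini. It assumes the standard OpenStreetMap XML layout (node elements carrying lat/lon, and way elements that reference nodes and carry tags). The sample data and the function name building_footprints are made up for illustration; this is not my actual Houdini script.

```python
import xml.etree.ElementTree as ET

# Tiny hand-written sample in OpenStreetMap's XML layout (not real map data).
SAMPLE_OSM = """<osm version="0.6">
  <node id="1" lat="40.7000" lon="-74.0100"/>
  <node id="2" lat="40.7000" lon="-74.0090"/>
  <node id="3" lat="40.7010" lon="-74.0090"/>
  <way id="10">
    <nd ref="1"/><nd ref="2"/><nd ref="3"/><nd ref="1"/>
    <tag k="building" v="yes"/>
  </way>
  <way id="11">
    <nd ref="1"/><nd ref="3"/>
    <tag k="highway" v="residential"/>
  </way>
</osm>"""


def building_footprints(osm_xml):
    """Return {way id: [(lat, lon), ...]} for every way tagged as a building."""
    root = ET.fromstring(osm_xml)
    # Index node coordinates by id so that ways can look them up.
    nodes = {
        n.get("id"): (float(n.get("lat")), float(n.get("lon")))
        for n in root.findall("node")
    }
    footprints = {}
    for way in root.findall("way"):
        tags = {t.get("k"): t.get("v") for t in way.findall("tag")}
        if "building" in tags:
            footprints[way.get("id")] = [
                nodes[nd.get("ref")] for nd in way.findall("nd")
            ]
    return footprints


if __name__ == "__main__":
    for way_id, outline in building_footprints(SAMPLE_OSM).items():
        print(way_id, outline)
```

Each footprint is a closed outline (the first and last points match), which is enough for a flat 2D drawing. Going 3D would also need a height for each building, and that is where my own attempt went sideways.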

Here is a bit of video documentation of my progress (read: failure).

// Part Two

I hit a wall. I had a visually compelling 3D render of a section of NYC, but why should you care about it? What were the reasons for the irregularities? What story was I attempting to tell in this model? With a week to go until final presentations, I did the unthinkable. I shelved this project for later exploration, and I decided to start again from zero.

I decided to return to the question of what truly resonated with me. Throughout the semester, I was drawn to projects that utilized text, such as the Public Theater’s Shakespeare Machine. So, I wanted to explore text as data. I also wanted the text to carry a personal element: something that would resonate with me first and possibly others (hopefully). In a flash of intuition, I began to think about my entire experience in New York. I’ve been living in New York City for the past 7 years, and the one lingering question I have is whether to re-sign my lease on this NYC experience or try my luck in a more affordable city with possibly less stress once I graduate. As The Clash would say, should I stay or should I go? The more I thought about my own quandary, the more I wondered about the experiences of others. Why do people come to New York City? Why do they leave? In the past few years, have more people moved to NYC than left it? With all of these questions, I had a direction to explore.

In my research, I stumbled across this compelling article:

Manhattan Lures the Newest New Yorkers

Even though I found all of this data interesting, I wanted something that went beyond the numbers and the usual reasons for moving to or leaving New York: school, career, love, etc. So, how would I go about it? Why do people move to or leave New York City beyond the expected? I wanted to explore text not only as data but as a document of an individual/collective experience.

1) What is the data?

Initially, I decided to use Twitter data. A move to or a departure from NYC is a major event in one’s life, so I felt that a person would most likely document such a momentous event on Twitter (and hopefully the reasons for it in that same tweet). I made a list of hashtags and phrases one would use when moving to or leaving NYC, such as the following:

#movingtoNYC, ‘moving to NYC’, ‘Moving to New York City’, #leavingNYC, etc.

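I then used a Python script to run search queries against Twitter. Here is the general structure of the script:

    # Python 2: urllib.urlopen and urllib.urlencode live in
    # urllib.request and urllib.parse in Python 3.
    import json
    import urllib
    import time

    def search_twitter(query):
        # the search endpoint URL was left blank in this listing
        resp = urllib.urlopen('' + urllib.urlencode(query))
        data = json.loads(resp.read())
        tweets = list()
        # collect every result returned for this query
        for item in data['results']:
            tweets.append(item)
        return tweets

    if __name__ == '__main__':

        import sys
        # search phrase from the command line, up to 100 results per page
        query = {'q': sys.argv[1], 'rpp': 100}

        for tweet in search_twitter(query):
            print(tweet['text'].encode('ascii', 'replace'))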

I wanted to collect all of the tweets of people moving to NYC in one corpus and those leaving NYC in another corpus.
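As a rough sketch of that split, the snippet below sorts tweet texts into two lists by matching them against phrase lists. The function name sort_into_corpora, the phrase lists, and the sample tweets are all assumptions for illustration; the phrases only echo the kinds of queries listed earlier, and this is not the exact code I ran.

```python
# Hypothetical phrase lists, modeled on the queries listed earlier.
COME_PHRASES = ["#movingtonyc", "moving to nyc", "moving to new york city"]
GO_PHRASES = ["#leavingnyc", "leaving nyc", "leaving new york city"]


def sort_into_corpora(tweets):
    """Split tweet texts into (coming, leaving) lists by phrase match."""
    coming, leaving = [], []
    for text in tweets:
        lowered = text.lower()
        if any(phrase in lowered for phrase in COME_PHRASES):
            coming.append(text)
        elif any(phrase in lowered for phrase in GO_PHRASES):
            leaving.append(text)
        # Tweets that match neither list are dropped.
    return coming, leaving


if __name__ == "__main__":
    sample = [
        "Moving to NYC next month!",
        "Leaving NYC after five years.",
        "Lunch was great today.",
    ]
    coming, leaving = sort_into_corpora(sample)
    print(coming)
    print(leaving)
```

Matching on lowercase text keeps the hashtag and plain-phrase variants together. A tweet that mentions both coming and going lands in the first list here, which is a simplification.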

Though the idea seemed promising, I ran into several hurdles with this approach.

The data I received contained the phrases from my search queries but nothing more. Many times a Twitter user would only tweet ‘Moving to New York City’. I wanted something a little deeper.

Also, it’s very difficult to encapsulate an experience in 140 characters or fewer, so it was hard to find tweets that went beyond the superficial.

I also ran into the problem of distinguishing tweets from actual NYC residents (current, soon to be, soon to leave) from those of tourists. I would say that the bulk of the tweets I found were from tourists. By my best estimation, a tourist visits and leaves NYC enamored with the experience. Yet the tourist experience differs greatly from that of someone who lives here, who has to make a life here, who has to work here, etc. I wanted to get that experience from the text, not the experience via a rose-colored P.O.V. I made a judgement call not to include ‘tourist’ tweets in the data set, though many were interesting and entertaining in their own right. So, I manually went through all the text and made a quick determination as to whether each tweet originated from an actual resident or a visitor. I was more interested in finding reasons why someone would move to and/or leave NYC. Though I was able to find a decent number of tweets to place in my ‘comingto’ and ‘leavingfrom’ corpuses, I was not happy with the quantity and quality of the data.

So, I hit another wall.

In my many attempts to climb and circumvent this ‘wall’, I came up with the following idea: incorporate Flickr data in addition to the Twitter data for each corpus. Many people I know like to document both their ordinary and extraordinary experiences via photos. Prior to moving to NYC in 2005, I took a wealth of photos to document my move from Houston to NYC out of sentimentality. If I do plan on leaving NYC next year, I may find myself going through the same process, documenting the landmarks and experiences I’ve accumulated here using photos as the medium. Also, many times we take photos in order to share them with a larger audience. My thoughts immediately went to Flickr. So, I ran the same queries on Flickr that I had used on Twitter, and I received results such as the following:


If you read the title and subtext of each photo, you find the type of information that I was seeking. As I surveyed all the results of my queries, I came to the following conclusion. Moving to NYC is like the beginning of an exciting new relationship, one so exciting that the mere anticipation of it can be overwhelming. After having the experience of NYC, one develops stronger emotions for NYC (either positive or negative) or even slight resignation. For people leaving NYC, I found that the text mirrored that of a bad breakup. Something went wrong along the way, and someone had to make a break for it, for better or worse. That’s the type of vibe I gathered from the Flickr text. So, I combined the Flickr and the Twitter text in each respective corpus.

I also decided to pipe the text of each corpus into various Python scripts. One script would randomize the lines of each corpus. Another would pipe the corpus through a Markov chain, etc. Here is a sampling of the Python scripts I used:

1) Markov Script

    import sys
    import markov

    # order-3 n-grams, at most 500 elements per generated text
    generator = markov.MarkovGenerator(n=3, max=500)
    for line in sys.stdin:
        line = line.strip()
        generator.feed(line)

    for i in range(5):
        print(generator.generate())

    # Functions
    # feed() is a method of MarkovGenerator, shown here on its own

    def feed(self, text):

        tokens = self.tokenize(text)

        # discard this line if it's too short
        if len(tokens) < self.n:
            return

        # store the first ngram of this line
        # (the attribute name 'beginnings' is assumed; the storing line was cut off)
        beginning = tuple(tokens[:self.n])
        self.beginnings.append(beginning)

        for i in range(len(tokens) - self.n):

            gram = tuple(tokens[i:i+self.n])
            next = tokens[i+self.n] # get the element after the gram

            # if we've already seen this ngram, append; otherwise, set the
            # value for this key as a new list
            if gram in self.ngrams:
                self.ngrams[gram].append(next)
            else:
                self.ngrams[gram] = [next]
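
Since the feed function above is only one piece of the markov module, here is a small self-contained sketch of how feeding and generating fit together. The class name MarkovSketch, the whitespace tokenizer, and the generate logic are my own assumptions for illustration; the real module may well differ in its details.

```python
import random


class MarkovSketch(object):
    """Word-level Markov text generator (illustrative, not the markov module)."""

    def __init__(self, n, max_words):
        self.n = n                  # order: number of words per n-gram
        self.max_words = max_words  # cap on the length of generated output
        self.ngrams = {}            # n-gram tuple -> list of words that followed it
        self.beginnings = []        # first n-gram of every line fed in

    def feed(self, text):
        tokens = text.split()
        if len(tokens) < self.n:
            return  # too short to contain even one n-gram
        self.beginnings.append(tuple(tokens[:self.n]))
        for i in range(len(tokens) - self.n):
            gram = tuple(tokens[i:i + self.n])
            self.ngrams.setdefault(gram, []).append(tokens[i + self.n])

    def generate(self, rng=random):
        # Start from the opening n-gram of a randomly chosen source line.
        current = rng.choice(self.beginnings)
        output = list(current)
        # Keep picking a word that followed the current n-gram in the corpus.
        while len(output) < self.max_words and current in self.ngrams:
            output.append(rng.choice(self.ngrams[current]))
            current = tuple(output[-self.n:])
        return " ".join(output)


if __name__ == "__main__":
    gen = MarkovSketch(n=2, max_words=20)
    gen.feed("i am leaving new york for good")
    gen.feed("i am moving to new york at last")
    print(gen.generate())
```

Because both sample lines share "i am" and "new york", the generator can cross from one sentence into the other at those points. That crossing is where the poetic (or nonsensical) recombinations come from.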

2) Adjective Extractor Script

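    import sys

    # load the adjective word list, one word per line
    adj_set = set()
    for line in open('adjectives'):
        line = line.strip()
        adj_set.add(line)

    # print only the adjectives found in each line of input
    for line in sys.stdin:
        line = line.strip()
        adjs = [s for s in line.split(" ") if s.lower() in adj_set]
        if len(adjs) > 0:
            print(', '.join(adjs))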

3) Randomize Lines

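    import sys
    import random

    all_lines = list()

    # read every line of input into memory
    for line in sys.stdin:
        line = line.strip()
        all_lines.append(line)

    # shuffle the lines in place, then print them in their new order
    random.shuffle(all_lines)

    for line in all_lines:
        print(line)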

4) Randomize Words

    import sys
    import random

    # shuffle the words within each line, keeping the line order intact
    for line in sys.stdin:
        line = line.strip()
        words = line.split(" ")
        random.shuffle(words)
        output = " ".join(words)
        print(output)

So, I piped each corpus, ‘come.txt’ and ‘go.txt’, through various Python scripts in different combinations until I attained an output that was telling, poetic, and did not veer into nonsensical territory (hopefully). Here is an example of how I would pipe the corpus through various Python scripts in Terminal:
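
A rough sketch of that kind of chaining, written as plain Python functions instead of shell pipes. The function names, the tiny adjective set, and the sample lines are assumptions for illustration; my actual scripts read from standard input and loaded the adjectives from a file.

```python
import random


def randomize_lines(lines, rng):
    """Return the lines in shuffled order (like the Randomize Lines script)."""
    shuffled = list(lines)
    rng.shuffle(shuffled)
    return shuffled


def extract_adjectives(lines, adj_set):
    """Keep only the known adjectives from each line, skipping lines with none."""
    result = []
    for line in lines:
        adjs = [w for w in line.strip().split(" ") if w.lower() in adj_set]
        if adjs:
            result.append(", ".join(adjs))
    return result


def run_pipeline(lines, steps):
    """Feed the output of each step into the next, as a shell pipe would."""
    for step in steps:
        lines = step(lines)
    return lines


if __name__ == "__main__":
    rng = random.Random(7)  # fixed seed so the shuffle is repeatable
    adjectives = {"bright", "tired", "new"}  # hypothetical stand-in list
    corpus = ["bright lights in a new city", "so tired of the rent", "goodbye"]
    steps = [
        lambda ls: randomize_lines(ls, rng),
        lambda ls: extract_adjectives(ls, adjectives),
    ]
    for line in run_pipeline(corpus, steps):
        print(line)
```

Reordering or swapping the entries in steps is the in-script equivalent of trying the scripts in different combinations on the command line.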

And here is an example of some of the output I received:

Now, I had a collection of texts/poems representing the experience of moving to and leaving NYC. So, how would I represent this data?

2) What is the medium?

I decided to create a physical object, specifically an airline boarding pass. I could have chosen to present the poems simply on a screen, but there is something decisive and final about holding an actual ticket in your hand. It’s a commitment to the experience and what it may or may not bring. I created two types of airline boarding passes: one going to NYC and one leaving it. For a boarding pass leaving NYC, I would place a short poem originating from the ‘go.txt’ corpus in the middle of the ticket. Here are some examples:

Here are some examples of boarding passes coming to NYC, with text originating from the ‘come.txt’ corpus.

For the final presentation in class, I printed 16 boarding passes (8 leaving, 8 coming). At the bottom of each ticket, I included NYC migration data and worked it into the bar code you would usually find on a typical boarding pass. I derived the data from the US Census American Community Survey. From this data, you will notice that more people left NYC than moved to it in 2008-2010. The trend began to reverse in 2011. I would have loved to incorporate the 2012 data, but it was not available (or I did not know where to look).

3) What is the question?

I feel that the question and answer change for each viewer of each boarding pass. If the viewers are New Yorkers, the words may resonate with their own experience and remind them why they came, why they stay, or why they wish to leave New York City. For viewers who are not New Yorkers, the words of the boarding pass may either entice them into living in NYC or dissuade them. I’m not sure. The boarding pass presents the question, but only the viewer will be able to answer the question(s) it presents. There may be a clear answer in the reasons why one comes to or leaves NYC. NYC is the one city that many dream about and never forget, even once they leave the city indefinitely.

All in all, I feel that I’ve just scratched the surface. Given more time, I feel that something more compelling could arise from this beginning. If I could find a dataset outside of Twitter with a wealth of information, that would be great. Twitter is a great medium for data, but not everyone uses it. I may have a Twitter account, but I rarely tweet. I have many friends and family members who do not see the need to have a Twitter account. Also, it only allows so much information to be parsed from it. I do know that Twitter is allowing its users to receive all of the tweets they ever tweeted (is that a word?) in an archive. So, I’ll see what happens next with this project.