In a nondescript building in Virginia, analysts are tracking millions of tweets, blog posts, and Facebook updates from around the world.


How stable is China? What are people discussing and thinking in Pakistan? To answer these sorts of questions, the U.S. government has turned to a rich source: social media.


The Associated Press reports that the CIA maintains a social-media tracking center in a nondescript building in a Virginia industrial park. The intelligence analysts at the agency's Open Source Center, whom other agents refer to as "vengeful librarians," are tasked with sifting through millions of tweets, Facebook messages, online chat logs, and other public data on the World Wide Web to glean insights into the collective moods of regions or groups abroad. According to the Associated Press, these librarians are tracking up to five million tweets a day from places like China, Pakistan and Egypt:


From Arabic to Mandarin Chinese, from an angry tweet to a thoughtful blog, the analysts gather the information, often in native tongue. They cross-reference it with the local newspaper or a clandestinely intercepted phone conversation. From there, they build a picture sought by the highest levels at the White House, giving a real-time peek, for example, at the mood of a region after the Navy SEAL raid that killed Osama bin Laden or perhaps a prediction of which Mideast nation seems ripe for revolt.


Yes, they saw the uprising in Egypt coming; they just didn't know exactly when revolution might hit, said the center's director, Doug Naquin. The center already had "predicted that social media in places like Egypt could be a game-changer and a threat to the regime," he said in a recent interview with The Associated Press at the center. CIA officials said it was the first such visit by a reporter the agency has ever granted.


The CIA facility wasn't built specifically to track the ebb and flow of social media: The program was established in response to a recommendation by the 9/11 Commission with the initial mandate to focus on counterterrorism and counterproliferation. According to the Associated Press, the center shifted gears and started focusing on social media after watching thousands of Iranian protesters turn to Twitter during the Iranian election protests of 2009, challenging the results of the elections that put Iranian President Mahmoud Ahmadinejad back in power.


In the past few years, sentiment and mood analysis have become mainstays in the defense and intelligence communities. Last October, an Electronic Frontier Foundation lawsuit revealed how the Department of Homeland Security has carefully monitored a variety of public online sources, from social networks to highly popular blogs like Daily Kos, for years, alleging that "leading up to President Obama's January 2009 inauguration, DHS established a Social Networking Monitoring Center (SNMC) to monitor social-networking sites for 'items of interest.'" In August, the Defense Advanced Research Projects Agency (DARPA) invited analysts to submit proposals on the research applications of social media to strategic communication. DARPA planned on shelling out $42 million in funding for "memetrackers" to develop "innovative approaches that enable revolutionary advances in science, devices, or systems."


But how useful is all of this activity?
Memetracking is still in its infancy. I spoke with Johan Bollen, a professor at the School of Informatics and Computing at Indiana University. Bollen's research into how Twitter can be used to predict the rise and fall of the Dow Jones Industrial Average made him a niche celebrity last year. He notes that memetracking is facing serious challenges. For example, how do you get a random sample?


"You have little control over the composition of a sample," Bollen explained. "Regular surveys are conducted with only 1,000 people, but those samples are carefully balanced to provide an accurate cross section of a given society. This is much more difficult to do in these online environments. Sure, the samples are huge -- there are 750 million people on Facebook -- but no matter how you look at it, it's still possible that the sample could still be biased. It requires someone to own a computer, to be on Facebook, to even USE Facebook... There are all kinds of biases built into these samples that are difficult to control for."


The other major challenge, says Bollen, is that sentiment analysis provides only a scrap of potentially useful information. "Right now, analysis is very specialized. We're looking at how people feel about very particular topics," says Bollen. "There's a lot of room for growth in deeper semantic analysis: not just learning what people feel about something, but what people think about things. There are 250 million people on Twitter... If you could perform even a shallow analysis of people's opinions about something, their semantic opinions, you can learn a lot from the wisdom of the crowd that could be leveraged."


Diving deep into the semantics of online communication is the next big challenge for government agencies. While the Associated Press points out that the CIA uses native dialects to determine sample sizes and pinpoint trending topics among target groups, deciphering the intricacies of human language is a major obstacle, and one that will not be easily overcome.
