By Thea Singer, republished from news@Northeastern

In the age of Big Data, automated systems can track societal events on a global scale. These systems code and collect vast stores of real-time “event data”: happenings gleaned from news articles covering everything from political protests to ecological shifts around the world.

In new research published in the journal Science, Northeastern network scientist David Lazer and his colleagues analyzed the effectiveness of four global-scale databases and found that they fall short when tested for reliability and validity.

Misclassification and duplication

The fully automated systems studied were the International Crisis Early Warning System, or ICEWS, maintained by Lockheed Martin, and Global Data on Events Language and Tone, or GDELT, developed and run out of Georgetown University. The others were the hand-coded Gold Standard Report, or GSR, generated by the nonprofit MITRE Corp., and the Social, Political, and Economic Event Database, or SPEED, at the University of Illinois, which uses both human and automated coding.

First the researchers tested the systems’ reliability: Did they all detect the same protest events in Latin America? The answer was “not very well.” ICEWS and GDELT, they found, rarely reported the same protests, and ICEWS and SPEED agreed on just 10.3 percent of them.
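To make the idea of agreement between two systems concrete, here is a minimal sketch. It assumes each event can be reduced to a (country, date) key and measures overlap as the share of distinct events that both systems report. The function name, the matching key, and the toy data are all hypothetical; the article does not describe how the researchers matched events or computed their figures.

```python
def agreement_rate(events_a, events_b):
    """Share of distinct events reported by either system that both report."""
    a, b = set(events_a), set(events_b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


# Hypothetical toy data: each event keyed by (country, ISO date).
system_a = [("Brazil", "2013-06-17"), ("Chile", "2013-06-26"), ("Peru", "2013-07-04")]
system_b = [("Brazil", "2013-06-17"), ("Mexico", "2013-09-01")]

rate = agreement_rate(system_a, system_b)
print(f"{rate:.1%}")  # 1 shared event out of 4 distinct events -> 25.0%
```

A real comparison would need a looser matching rule, since two systems can describe the same protest with slightly different dates or locations.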

Next they assessed the systems’ validity: Did the protest events reported actually occur? Here they found that only 21 percent of GDELT’s reported events referred to real protests. ICEWS’ track record was better, but the system reported the same event more than once, jacking up the protest count.
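The two problems described here, invalid records and duplicate records, can be illustrated with a short sketch. It assumes each record carries a human-verified `is_real` flag and that duplicates share the same country, date, and city. Those field names and the toy records are hypothetical and are not drawn from any of the databases in the study.

```python
def validity_rate(records):
    """Fraction of reported records that a human checker confirmed as real protests."""
    if not records:
        return 0.0
    return sum(1 for r in records if r["is_real"]) / len(records)


def deduplicate(records):
    """Keep only the first record for each (country, date, city) key."""
    seen = set()
    unique = []
    for r in records:
        key = (r["country"], r["date"], r["city"])
        if key not in seen:
            seen.add(key)
            unique.append(r)
    return unique


# Hypothetical toy records; the second entry repeats the first.
records = [
    {"country": "Brazil", "date": "2013-06-17", "city": "Sao Paulo", "is_real": True},
    {"country": "Brazil", "date": "2013-06-17", "city": "Sao Paulo", "is_real": True},
    {"country": "Chile", "date": "2013-06-26", "city": "Santiago", "is_real": True},
    {"country": "Peru", "date": "2013-07-04", "city": "Lima", "is_real": False},
]

print(validity_rate(records))                  # 3 of 4 records are real
print(len(records), len(deduplicate(records)))  # raw count vs. count after removing repeats
```

Counting raw records would report four protests here, while only three distinct events were recorded and one of those was not a protest at all.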

The systems were also vulnerable to missing news. “If something doesn’t get reported in a newspaper or a similar outlet, it will not appear in any of these databases, no matter how important it really is,” says Lazer, Distinguished Professor of Political Science and Computer and Information Sciences, who also co-directs the NULab for Texts, Maps, and Networks.

“These global-monitoring systems can be incredibly valuable, transformative even,” adds Lazer. “Without good data, you can’t develop a good understanding of the world. But to gain the insights required to tackle global problems such as national security and climate change, researchers need more reliable event data.”

And what about the reported protests that actually weren’t protests at all? “Automated systems can misclassify words,” says Lazer. For example, the word “protest” in a news article can refer to an actual political demonstration, but it can also refer to, say, a political candidate “protesting” comments from a rival candidate.

“It’s so easy for us as humans to read something and know what it means,” says Lazer. “That’s not so for a set of computational rules.”
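The pitfall Lazer describes can be shown with a deliberately naive keyword rule. This is only an illustration of how a simple rule confuses the two senses of “protest”; it is not the coding logic of ICEWS, GDELT, or any other system in the study, and the function name and sentences are made up for the example.

```python
import re


def naive_is_protest(sentence):
    """Flag any sentence containing a word that starts with 'protest'."""
    return re.search(r"\bprotest", sentence, flags=re.IGNORECASE) is not None


sentences = [
    "Thousands joined a protest in the capital on Monday.",  # a real demonstration
    "The candidate protested comments made by her rival.",   # a verbal objection
]

flags = [naive_is_protest(s) for s in sentences]
print(flags)  # both sentences are flagged, although only the first is a protest event
```

A human reader separates the two sentences at a glance, while the rule treats them identically.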

Analysis begets policy

The researchers’ policy recommendations include building a community among scholars and forming multidisciplinary groups. Teams within those groups could then compete against one another to spur innovation.

“Transparency is key,” says Lazer. In the best-case scenario, the development methods, the software, and the source materials would be available to everyone involved. “But many of the source materials have copyright protection, and so they can’t be shared widely,” he says. “So one question is: How do we develop a large publicly shareable corpus?”

Participants should be able to test their varying coding methods on open, representative sets of event data to see how the methods compare, Lazer says. Contests could be used as a catalyst. Finally, the researchers recommend establishing a consortium to balance the business needs of the news providers with the source needs of the developers and event-data users.

The authors suggest that reliable data-tracking systems can be used to build models that anticipate the escalation of conflicts, forecast the progression of epidemics, or trace the effect of global warming on the ecosystem.