Online|Offline Participation
Social and Contextual Biases in Tweeting. An approach based on geographical inference
Twitter data is both a promised land and a nightmare for any sociologist. As is now well known, more than 1.3 billion accounts have been created on the microblogging platform since 2006, and the number of monthly active Twitter users worldwide reached 330 million in January 2018. Those users now send more than 500 million Tweets per day. These impressive figures often lay the ground for research that treats Twitter as a new form of public sphere, especially in media and opinion research. Nevertheless, strong biases arise whenever Twitter data are used to describe such things as "para-journalistic" coverage of an issue, opinion formation or even election outcomes.
Among those biases, the most important stem (a) from the difference between having a Twitter account and actively tweeting: most users have very little activity on the social network (it is estimated that 44% have never sent a Tweet and only 8% have sent more than 50), and among those with regular activity the probability of having many followers, and thus of being retweeted, varies widely; and (b) from very well-known socio-demographic biases (mostly gender, age and profession, as well as race in some countries).
Studying this issue is very hard because of the lack of systematic data documenting user attributes. A lot of research has therefore been carried out to infer user attributes from Twitter profile information, tweeting behavior, the linguistic content of tweets or social network information gathered from retweet patterns. Very stimulating results have been obtained in inferring gender (Rao et al., 2010; Liu & Ruths, 2013), age (Schler et al., 2006; Al Zamal et al., 2011), occupation and social class (Sloan et al., 2014; Preotiuc-Pietro et al., 2015; Mac Kim et al., 2016), location (Jones et al., 2007), political orientation (Thomas et al., 2006; Rao et al., 2010) and ethnicity (Pennacchiotti & Popescu, 2011; Rao et al., 2011). Other data have also been used, such as Twitter account lists to infer profession (Ke et al., 2016) or website visitor demographics (Goel et al., 2012; Culotta et al., 2015).
This project, carried out within the University of Grenoble Alpes Data Institute (https://data-institute.univ-grenoble-alpes.fr/), is based on a geographical inference approach to the analysis of bias in social networks. Geographical inference is particularly interesting in the case of Twitter because it makes it possible to compare the influence of several kinds of variables on the tweeting activity of a given area: socio-demographic variables (such as age, gender and profession), morphological variables (such as the population density of the area or its public transportation system), contextual variables (such as the average income or the share of unemployed people) and political ones (such as participation in local elections).
32.8 million Tweets sent from France between 2014 and 2017 were collected with their GPS geolocation using the Twitter API with a bounding-box filter. Each Tweet was then attributed to the IRIS zone it was sent from. IRIS zones are the smallest geographical unit used by INSEE, the French national office for statistical information; they usually count about 2,500 inhabitants. Census (and other) data were collected to describe every IRIS. This process resulted in a dataset of 47,484 IRIS zones counting at least one tweet between 2014 and 2017. Because some information is only available at the town level (e.g. political participation), another dataset was created with 33,881 towns (most of them small or very small towns that contain only one IRIS). Biases arising from the bounding-box approach have been controlled for by keeping only the tweets sent by users who tweeted from France for at least one month.
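The two steps described above (keeping only users active in France for at least one month, then attributing each tweet to a zone) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the project's actual pipeline: the zone codes, the square polygons, the sample tweets and all function names are hypothetical, "at least one month" is read here as a span of 30 days or more between a user's first and last collected tweet, and a plain ray-casting test stands in for whatever GIS tooling was really used.

```python
from collections import Counter
from datetime import datetime, timedelta

# Hypothetical zones: code -> polygon given as (lon, lat) vertices.
# Real IRIS contours would come from official boundary files.
ZONES = {
    "ZONE_A": [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)],
    "ZONE_B": [(1.0, 0.0), (2.0, 0.0), (2.0, 1.0), (1.0, 1.0)],
}


def point_in_polygon(lon, lat, polygon):
    """Ray casting: toggle 'inside' at each edge crossed by a ray going east."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > lat) != (y2 > lat):  # the edge straddles the ray's latitude
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


def zone_of(lon, lat, zones):
    """Return the code of the first zone containing the point, else None."""
    for code, polygon in zones.items():
        if point_in_polygon(lon, lat, polygon):
            return code
    return None


def long_term_users(tweets, min_span=timedelta(days=30)):
    """Users whose first and last tweets are at least `min_span` apart
    (an assumed reading of 'tweeting for at least one month')."""
    first, last = {}, {}
    for t in tweets:
        user, ts = t["user"], t["time"]
        first[user] = min(first.get(user, ts), ts)
        last[user] = max(last.get(user, ts), ts)
    return {u for u in first if last[u] - first[u] >= min_span}


def tweets_per_zone(tweets, zones, min_span=timedelta(days=30)):
    """Count tweets per zone, ignoring short-term users and unmatched points."""
    keep = long_term_users(tweets, min_span)
    counts = Counter()
    for t in tweets:
        if t["user"] not in keep:
            continue
        code = zone_of(t["lon"], t["lat"], zones)
        if code is not None:
            counts[code] += 1
    return counts


# Made-up sample: one long-term user and one short-stay visitor.
SAMPLE = [
    {"user": "resident", "time": datetime(2015, 1, 1), "lon": 0.5, "lat": 0.5},
    {"user": "resident", "time": datetime(2015, 3, 1), "lon": 1.5, "lat": 0.5},
    {"user": "visitor", "time": datetime(2015, 7, 1), "lon": 0.2, "lat": 0.2},
    {"user": "visitor", "time": datetime(2015, 7, 5), "lon": 0.3, "lat": 0.3},
]

counts = tweets_per_zone(SAMPLE, ZONES)
```

With the sample above, the visitor's two tweets are dropped because they span only four days, and each of the resident's tweets is counted in the zone containing it. At the scale of tens of millions of tweets, a spatial index or a dedicated GIS library would be needed instead of this linear scan over zones.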
Project updates
| 20 Mar 2018 | Who tweets? A first encounter with Twitter data |
| On March 20th 2018 I participated in a scientific meeting convened by the Data Institute of Univ. Grenoble Alpes. My talk, prepared with Etienne Dublé (LIG) and Sophie Kuegler... |