Inferring the age and sex of Facebook users based on status updates

 

Authors
Vázquez Patiño, Ángel Oswaldo
Format
MasterThesis
Status
publishedVersion
Description

The use of social network sites has grown exponentially in the last decade. People share their social and romantic interactions online (publicly or privately), express their likes and dislikes, and so on. Among the many social network sites, Facebook is the most popular. Since active and more connected users tend to adopt a private profile, some demographic information is not publicly available. Knowing hidden demographic attributes is useful in many fields, e.g., marketing campaigns. In this study we predict the age and gender of Facebook users relying only on their status updates. As a baseline, we replicated the language-based predictive model of the Open Vocabulary Approach, which uses a set of features based on the actual communication among users. In addition, based on the same set of linguistic features, we analyzed the performance of a new document representation called Second Order Representation in the Facebook domain. Second Order Representation was proposed to deal with two problems of the Bag Of Terms representation (used in the Open Vocabulary Approach): terms are treated as independent of the classes, and the representation is high-dimensional and sparse. Second Order Representation was introduced in the field of author profiling, in the domain of microblogging and social networks. To investigate the effect of reducing the feature dimension, we experimented with the χ² test as a term selection method for both the predictive model of the Open Vocabulary Approach and that of the Second Order Representation. Our results show that it is possible to infer gender with an accuracy of 0.908 by combining the Open Vocabulary Approach to extract linguistic features with the χ² test as a term selection method, which reduces processing time and the feature dimension to only 10,000 highly discriminative terms.
For age inference, our results (R = 0.792) show that we can beat the baseline (R = 0.791) by using only 10,000 highly discriminative terms to represent users with Bag Of Terms and by using lasso regression, which acts as a feature reduction technique by reducing the number of terms used in the regression. Finally, Second Order Representation did not beat the baseline model: using a subset of only 15,000 highly discriminative terms, it gave an accuracy of 0.816 in gender prediction and a square root of the coefficient of determination of 0.782 in age prediction.
This study infers the age and sex of Facebook users based only on their status updates. As a baseline, the language-based predictive model of the open vocabulary approach is reproduced. The performance of machine learning algorithms using Second Order Representation was analyzed. To investigate dimensionality reduction, the chi-squared test was used as the term selection method. The results show that gender can be inferred with an accuracy of 90.8%. For age inference, the results (R = 0.792) show that the baseline (R = 0.791) can be beaten using only 10,000 terms. Second Order Representation did not beat the baseline, giving an accuracy of 81.6% in gender prediction and 0.782 for the square root of the coefficient of determination in age prediction, using only 15,000 highly discriminative terms.
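
The χ² term selection mentioned in the description can be illustrated with a minimal sketch. It is not the thesis's implementation: the function names `chi2_scores` and `select_top_terms`, the binary class labels, and the presence-based 2×2 contingency table per term are all assumptions made for illustration.

```python
from collections import Counter


def chi2_scores(docs, labels):
    """Chi-squared score of each term against a binary label (1 or 0).

    Assumption: each document is a list of tokens and only term presence
    per document is counted, not term frequency.
    """
    n = len(docs)
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = n - n_pos

    # Document frequencies of each term within each class.
    df_pos, df_neg = Counter(), Counter()
    for tokens, y in zip(docs, labels):
        (df_pos if y == 1 else df_neg).update(set(tokens))

    scores = {}
    for term in set(df_pos) | set(df_neg):
        a = df_pos[term]   # class-1 documents containing the term
        b = df_neg[term]   # class-0 documents containing the term
        c = n_pos - a      # class-1 documents without the term
        d = n_neg - b      # class-0 documents without the term
        denom = (a + b) * (c + d) * (a + c) * (b + d)
        # A term present in every document carries no class information.
        scores[term] = n * (a * d - b * c) ** 2 / denom if denom else 0.0
    return scores


def select_top_terms(docs, labels, k):
    """Return the k highest-scoring terms; ties are broken alphabetically."""
    scores = chi2_scores(docs, labels)
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]
```

Under these assumptions, calling `select_top_terms(docs, labels, 10000)` on tokenized status updates would keep the 10,000 terms most associated with the class label, which corresponds to the vocabulary size reported for the best gender model.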

Publication Year
2015
Language
eng
Topic
MACHINE LEARNING
SOCIAL NETWORKS
INFERENCE
SEX-AGE
Repository
Repositorio SENESCYT
Get full text
http://repositorio.educacionsuperior.gob.ec/handle/28000/1964
Rights
openAccess