Corpora Analysis: Journalistic and Scientific

José Márcio Duarte, Kelsy Areco, Samuel Goihman, Edvane Birelo Lopes De Domenico, Felipe Mancini


Objective: This study compared two Corpora, one compiled from newspapers (Journalistic Corpus) and the other from scientific papers (Scientific Corpus), under the hypothesis that the Scientific Corpus is more appropriate for Part-of-Speech (PoS) information extraction from similar scientific texts. The aim was to analyze differences and similarities between the Corpora through descriptive analysis, accuracy measurement, and the independence of their components. Methods: The analysis consisted of three steps: Descriptive Analysis, Accuracy Assessment, and Pointwise Mutual Information (PMI). Results: There was an important difference between the Corpora in the words that do not occur in both. Accuracy was higher with the Scientific Corpus (92.95%) than with the Journalistic Corpus (88.32%). The PMI values calculated for the bigrams of the Journalistic and Scientific Corpora did not show a statistically significant difference. Conclusion: The experiments lead us to conclude that PoS information is extracted more accurately when a scientific text is paired with its specific Corpus rather than with a generic one such as the Journalistic Corpus.
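The abstract does not state how PMI was computed, so the following is only a minimal sketch of the standard definition for a bigram, PMI(x, y) = log2( p(x, y) / (p(x) p(y)) ), where a value of 0 means the two words co-occur exactly as often as independence would predict. The function name `bigram_pmi`, the use of plain relative frequencies as probability estimates, and the toy sentence are assumptions made for illustration and are not taken from the study.

```python
import math
from collections import Counter

def bigram_pmi(tokens):
    """Return {(w1, w2): PMI} for every adjacent word pair in tokens.

    Hypothetical helper: probabilities are estimated as plain relative
    frequencies (an assumption, not the study's stated procedure).
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    scores = {}
    for (w1, w2), count in bigrams.items():
        p_xy = count / n_bi          # joint probability of the bigram
        p_x = unigrams[w1] / n_uni   # marginal probability of the first word
        p_y = unigrams[w2] / n_uni   # marginal probability of the second word
        scores[(w1, w2)] = math.log2(p_xy / (p_x * p_y))
    return scores

# Toy input, invented for illustration only.
tokens = "the cell divides and the cell grows".split()
scores = bigram_pmi(tokens)
```

Applied to the tokens of each Corpus separately, a function like this would give one PMI value per bigram, and the two resulting distributions could then be compared with a statistical test of the reader's choice.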


Natural Language Processing; Medical Informatics; Information Science

