Credibility Corpus with several datasets (Twitter, Web database) in French and English
Description
Description of the corpora
These datasets were built to analyze information credibility in general
(rumor and disinformation in English and French documents),
as it occurs on the social web.
Targeted databases about rumors, hoaxes and disinformation were used to
collect obvious misinformation. Topics (defined by keywords) were used to build corpora from the microblogging
platform Twitter, a major source of rumors and disinformation.
1 corpus contains texts from the web database about rumors and disinformation.
4 corpora from Social Media Twitter about specific rumors (2 in English, 2 in French).
4 corpora from Social Media Twitter randomly built (2 in English, 2 in French).
4 corpora from Social Media Twitter about specific events (2 in English, 2 in French).
Size of different corpora :
Social Web Rumorous corpus: 1,612
French Hollande Rumorous corpus (Twitter): 371
French Lemon Rumorous corpus (Twitter): 270
English Pin Rumorous corpus (Twitter): 679
English Swine Rumorous corpus (Twitter): 1024
French 1st Random corpus (Twitter): 1000
French 2nd Random corpus (Twitter): 1000
English 3rd Random corpus (Twitter): 1000
English 4th Random corpus (Twitter): 1000
French Rihanna Event corpus (Twitter): 543
English Rihanna Event corpus (Twitter): 1000
French Euro2016 Event corpus (Twitter): 1000
English Euro2016 Event corpus (Twitter): 1000
A matrix links tweets with the 50 most frequent words
Text data :
_id : message id
body text : string text data
Matrix data :
52 columns (the first column is the message id, the second column is the rumor indicator, 1 or -1, and each of the other 50 columns is a word, with value 1 if the message contains it and 0 if it does not)
11,102 lines (each line is a message)
Hidalgo corpus: lines range 1:75
Lemon corpus : lines range 76:467
Pin rumor : lines range 468:656
swine : lines range 657:1311
random messages : lines range 1312:11103
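The matrix layout and line ranges described above can be sketched in Python as follows. This is a minimal illustration, not the author's tooling: the comma separator, the names `MatrixRow`, `parse_matrix_row`, `CORPUS_RANGES` and `corpus_of`, and the reading of the ranges as 1-based and inclusive are all assumptions, and the demo line is synthetic.

```python
from typing import List, NamedTuple, Optional

N_WORDS = 50  # word columns that follow the id and the rumor indicator


class MatrixRow(NamedTuple):
    message_id: str
    label: int             # rumor indicator, 1 or -1
    word_flags: List[int]  # 1 if the message contains the word, else 0


def parse_matrix_row(line: str, sep: str = ",") -> MatrixRow:
    """Parse one matrix line; `sep` is an assumed field separator."""
    fields = line.strip().split(sep)
    if len(fields) != 2 + N_WORDS:
        raise ValueError(f"expected {2 + N_WORDS} columns, got {len(fields)}")
    label = int(fields[1])
    if label not in (1, -1):
        raise ValueError(f"rumor indicator must be 1 or -1, got {label}")
    flags = [int(value) for value in fields[2:]]
    if any(flag not in (0, 1) for flag in flags):
        raise ValueError("word columns must be 0 or 1")
    return MatrixRow(fields[0], label, flags)


# Line ranges as listed above, read as 1-based and inclusive (an assumption).
CORPUS_RANGES = {
    "Hidalgo": (1, 75),
    "Lemon": (76, 467),
    "Pin": (468, 656),
    "swine": (657, 1311),
    "random messages": (1312, 11103),
}


def corpus_of(line_number: int) -> Optional[str]:
    """Return the corpus whose range contains the given matrix line number."""
    for name, (first, last) in CORPUS_RANGES.items():
        if first <= line_number <= last:
            return name
    return None


# Synthetic line for illustration only, not taken from the dataset.
demo_line = ",".join(["42", "1"] + ["1", "0"] * 25)
demo_row = parse_matrix_row(demo_line)
```

If the actual files use another delimiter (tab or semicolon, for instance), passing it as `sep` should be enough; the column count check then flags any mismatch early.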
Sample contains :
French Pin Rumorous corpus (Twitter): 679
Matrix data :
52 columns (the first column is the message id, the second column is the rumor indicator, 1 or -1, and each of the other 50 columns is a word, with value 1 if the message contains it and 0 if it does not)
189 lines (each line is a message)
Author
This dataset has been published on the initiative and under the responsibility of Nicolas Turenne.
Latest update
December 1, 2016
License
Information
ID
5840066288ee38426dc65bb3
Temporality
Creation
December 1, 2016
Frequency
Unknown
Temporal coverage
2006/01/01 to 2015/07/01