This is an assignment from the last time I taught the class. I’m not sure we’ll do it again, but there are good links to example guidelines and papers.
Annotation Efforts
Part of Speech Tagging
- Part-of-Speech Tagging Guidelines for the Penn Treebank Project (3rd Revision)
- Data available on Google Docs for both WSJ and SWBD. It is useful and interesting to try both.
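Before tagging by hand, it helps to see what Penn Treebank POS annotations look like on real sentences. A minimal sketch in Python, assuming you have NLTK installed (the small WSJ fragment bundled with NLTK is just a sample, not the class data):

    import nltk

    nltk.download("treebank")  # fetches the small WSJ sample bundled with NLTK
    from nltk.corpus import treebank

    # Each sentence is a list of (token, Penn Treebank tag) pairs.
    for token, tag in treebank.tagged_sents()[0]:
        print(f"{token}\t{tag}")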
Penn Treebank
- The Penn Treebank Project
- Data available on Google Docs for both WSJ and SWBD. It is useful and interesting to try both.
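The same NLTK sample also exposes the bracketed parse trees, which is a quick way to get a feel for the Treebank's syntactic annotation before reading the guidelines (again, a sketch over the bundled WSJ fragment only):

    import nltk

    nltk.download("treebank")
    from nltk.corpus import treebank

    tree = treebank.parsed_sents()[0]  # an nltk.Tree built from the bracketing
    print(tree)          # Penn Treebank-style bracketed parse
    tree.pretty_print()  # ASCII drawing of the same tree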
Discourse Treebank
- The Penn Discourse Treebank 2.0 Annotation Manual
- Switchboard Dialog Act Corpus
- Switchboard Annotations from NXT
TimeML
Unified Linguistic Annotation Text Collection: Committed Belief and REFLEX Entity Translation
Language Understanding Corpus, Committed Belief
- Data & Docs on Brandeis CS Servers: /home/j/corpuswork/ldc-data/unified_lang_ann/Language_Understanding_Corpus/data/committed_belief
REFLEX Entity Translation Dev Test
- Data & Docs on Brandeis CS Servers: /home/j/corpuswork/ldc-data/unified_lang_ann/Reflex_entity_relations
CoNLL-2010: Detecting Uncertain Information and Resolving In-Sentence Scopes of Hedge Cues
- CoNLL-2010 Shared Task
- Guidelines for both tasks are described in the first paper in the set: The CoNLL-2010 Shared Task: …
- Choose one of the following (data for both is available on the class GoogleDrive):
- Task 1: Detecting Uncertain Information
- Task 2: Resolving In-Sentence Scopes of Hedge Cues
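To make Task 1 concrete, here is a deliberately naive sketch: flag a sentence as uncertain if it contains a known hedge cue. The cue list below is a hand-picked illustration, not the shared task's lexicon; real systems learn cues and their scopes from the annotated training data.

    # Toy sentence-level uncertainty detector in the spirit of Task 1.
    # HEDGE_CUES is an illustrative sample, not the shared task's cue lexicon.
    HEDGE_CUES = {"may", "might", "suggest", "suggests", "possibly",
                  "appear", "appears", "likely", "perhaps", "indicate"}

    def is_uncertain(sentence: str) -> bool:
        """Flag a sentence as uncertain if any token matches a hedge cue."""
        tokens = {tok.strip(".,;:").lower() for tok in sentence.split()}
        return not tokens.isdisjoint(HEDGE_CUES)

    print(is_uncertain("These results suggest a role for the protein."))  # True
    print(is_uncertain("The protein binds DNA."))                         # False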
OntoNotes: coreference, named entity, parses, propositions, sense
- Project documentation and annotation guidelines
- Sample WSJ English data for coreference, named entity, parses, propositions, sense, and raw (unmarked) text on the class GoogleDrive
- Full Data & Docs on Brandeis CS Servers: /home/j/corpuswork/ldc-data/ontonotes_r1
More links to guidelines and tasks. Some of the data may overlap with the data linked above.
SemEval
Propbank: Semantic Role Labeling (CoNLL-2005 Shared Task)
- Propbank
- Some data (exact contents unverified) on Brandeis CS Servers: /home/j/corpuswork/corpora/PropBank/
- Semantic Role Labeling CoNLL-2004-2005 Shared Tasks
- Propbank Guidelines 2005
- Propbank Guidelines 2012
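If you want to see what a predicate-argument annotation looks like before diving into the guidelines, NLTK ships a small PropBank sample (a fragment, not the full copy on the Brandeis servers). A minimal sketch:

    import nltk

    nltk.download("propbank")
    nltk.download("treebank")  # needed only if you resolve pointers into the parses
    from nltk.corpus import propbank

    inst = propbank.instances()[0]
    print(inst.roleset)    # the predicate's frameset id, e.g. something like 'join.01'
    print(inst.arguments)  # (tree-pointer, label) pairs with roles such as ARG0, ARG1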
Senseval 3
*Sem 2012: Resolving the Scope and Focus of Negation
- *Sem Overview
- *Sem: Resolving the Scope and Focus of Negation (annotation description in the paper)
*Sem 2013: Semantic Textual Similarity
Disfluencies
Possibly interesting options, but I couldn’t find data. If you can find it, go for it!
Dependency Treebank
i2b2
- i2b2
- 2014 i2b2 Shared Task
- Options
ACE: Automatic Content Extraction
BOLT: Broad Operational Language Translation
Assignment (to be done in pairs) (from 2016)
NOTE: We will discuss the sets in more detail in class on Friday (1/23). If you know what you want to do before then, email me directly. I’ll set up the forum after class.
- By the end of the day on Friday, select one of the sets above and announce your selection on the forum I will send out. (Check the forum before making your choice so we don’t get duplicates; if it’s a race condition, I’ll ask the second group to pick again.)
- Using the information provided (which is a mix of guidelines and papers), determine:
the annotation goal
the annotation task
- Describe the properties of the corpus. How was it collected? Is it a good example of sampling and balance?
- Using the text provided (soon), follow the guidelines and annotate the text by hand individually.
- In your groups, compare your annotations with the “gold standard” and discuss the differences: what was hard, what was underspecified, what was clear.
- Find at least two research groups that used the annotated corpus (you can each do one) and determine their annotation goal, what algorithms they used, how they evaluated their work (e.g., f-measure, WER; a sketch of the f-measure computation follows this list), and what their results were.
- Present to the class a description of the annotation project, your assessment of the guidelines, and a summary of the research that used the data. You should have roughly one slide per bullet point (plus one for each research group you look at).
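For the comparison and evaluation bullets above, here is a minimal sketch of the f-measure computation, assuming each annotation is represented as a set of (start, end, label) tuples. This is illustrative scaffolding, not any task’s official scorer.

    def precision_recall_f1(gold: set, predicted: set):
        """Exact-match precision/recall/F1 over two annotation sets."""
        tp = len(gold & predicted)  # annotations identical in both sets
        precision = tp / len(predicted) if predicted else 0.0
        recall = tp / len(gold) if gold else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        return precision, recall, f1

    gold = {(0, 5, "PERSON"), (10, 14, "ORG"), (20, 27, "DATE")}
    mine = {(0, 5, "PERSON"), (10, 14, "LOC")}
    print(precision_recall_f1(gold, mine))  # (0.5, 0.333..., 0.4)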