LLL
Abstract:This paper presents the preliminary works to put online a French oral corpus and its transcription. This corpus is the Socio-Linguistic Survey in Orleans, realized in 1968. First, we numerized the corpus, then we handwritten transcribed it with the Transcriber software adding different tags about speakers, time, noise, etc. Each document (audio file and XML file of the transcription) was described by a set of metadata stored in an XML format to allow an easy consultation. Second, we added different levels of annotations, recognition of named entities and annotation of personal information about speakers. This two annotation tasks used the CasSys system of transducer cascades. We used and modified a first cascade to recognize named entities. Then we built a second cascade to annote the designating entities, i.e. information about the speaker. These second cascade parsed the named entity annotated corpus. The objective is to locate information about the speaker and, also, what kind of information can designate him/her. These two cascades was evaluated with precision and recall measures.
Abstract:Thanks to the Eslo1 ("Enqu\^ete sociolinguistique d'Orl\'eans", i.e. "Sociolinguistic Inquiery of Orl\'eans") campain, a large oral corpus has been gathered and transcribed in a textual format. The purpose of the work presented here is to associate a morpho-syntactic label to each unit of this corpus. To this aim, we have first studied the specificities of the necessary labels, and their various possible levels of description. This study has led to a new original hierarchical structuration of labels. Then, considering that our new set of labels was different from the one used in every available software, and that these softwares usually do not fit for oral data, we have built a new labeling tool by a Machine Learning approach, from data labeled by Cordial and corrected by hand. We have applied linear CRF (Conditional Random Fields) trying to take the best possible advantage of the linguistic knowledge that was used to define the set of labels. We obtain an accuracy between 85 and 90%, depending of the parameters used.