Adam Kilgarriff is Director of Lexical Computing Ltd.
He has led the development of the Sketch Engine (http://www.sketchengine.co.uk), a leading tool for corpus research used for dictionary-making at Oxford University Press, Cambridge University Press, HarperCollins, Le Robert and elsewhere.
His scientific interests lie at the intersection of computational linguistics, corpus linguistics, and dictionary-making.
Following a PhD on "Polysemy" from Sussex University, he worked at Longman Dictionaries, Oxford University Press, and the University of Brighton prior to starting the company in 2003.
He is a Visiting Research Fellow at the University of Leeds.
He has been an Expert Witness in a number of legal cases involving trademarks.
He is active in moves to make the web available as a linguists' corpus and was the founding chair of ACL-SIGWAC (Association for Computational Linguistics Special Interest Group on Web as Corpus).
He has been chair of the ACL-SIG on the lexicon and a Board member of EURALEX (European Association for Lexicography).
See also http://www.kilgarriff.co.uk
will present ...
Terminology-finding in the Sketch Engine: an Evaluation
The Sketch Engine is a leading corpus query tool, in use for lexicography at OUP, CUP, Collins, Le Robert and Cornelsen, at national language institutes of eight countries, and for teaching and research in many universities. Its distinctive feature is the ‘word sketch’, a one-page, automatic, corpus-derived summary of a word’s grammatical and collocational behaviour.
Very large corpora and word sketches are available for sixty languages.
A number of tools and resources have recently been added with translators and terminologists in mind.
These were introduced at the 2013 Translating and the Computer conference.
In 2014 we evaluated the tools across a number of languages and domains, and it is these results that we present here.
- - - - - - - - - -
The term-finding functionality
The term-finder starts from a domain corpus and a reference corpus. First it finds all the noun phrases, and their frequencies, in both corpora. It then takes, for each noun phrase, the ratio of its frequency in the domain corpus to its frequency in the reference corpus. The items with the highest ratios are the candidate terms, as in Figure 1 (data supplied by lead users, the World Intellectual Property Organisation).
Figure 1. French terms in the mobile communications domain.
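The ratio step described above can be sketched as follows. This is a minimal illustration, not the Sketch Engine's implementation: the per-million normalisation, the additive `smoothing` constant (which avoids division by zero for phrases absent from the reference corpus), the function names and the toy counts are all assumptions made for the example.

```python
from collections import Counter

def term_scores(domain_counts, reference_counts, domain_size, reference_size, smoothing=1.0):
    """Score each domain noun phrase by the ratio of its normalised frequencies.

    Frequencies are converted to per-million-token rates so that corpora of
    different sizes are comparable; `smoothing` is an assumed additive constant.
    """
    scores = {}
    for phrase, count in domain_counts.items():
        domain_fpm = count * 1_000_000 / domain_size
        reference_fpm = reference_counts.get(phrase, 0) * 1_000_000 / reference_size
        scores[phrase] = (domain_fpm + smoothing) / (reference_fpm + smoothing)
    return scores

def top_terms(scores, n=10):
    """Return the n phrases with the highest ratio, best first."""
    return sorted(scores, key=scores.get, reverse=True)[:n]

# Invented toy counts, for illustration only.
domain_counts = Counter({"base station": 40, "mobile terminal": 25, "first time": 5})
reference_counts = Counter({"base station": 2, "mobile terminal": 1, "first time": 900})

scores = term_scores(domain_counts, reference_counts,
                     domain_size=100_000, reference_size=10_000_000)
print(top_terms(scores, n=2))
```

In this toy data, a phrase that is common everywhere ("first time") scores below 1 and drops out, while phrases concentrated in the domain corpus rise to the top.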
In some cases, as with WIPO, the user will have domain corpora, but in others they will not. In that case they may use the BootCaT procedure (Baroni and Bernardini 2004). The user, typically a translator working in a domain where they are not an expert, inputs a few domain-specific ‘seed words’; these are sent to a search engine, and the hits identified by the search engine are gathered, cleaned, de-duplicated and processed to give a domain-specific corpus. This functionality has been found to support translators well (Bernardini et al. 2013).
For some time, the Sketch Engine has incorporated a BootCaT tool, allowing users to create an instant corpus for a domain, which they can then compare with a reference corpus to find the keywords of the domain. The functionality has recently been extended so the user can find the terms alongside the keywords. Thus, where the user has used BootCaT to build an English environment corpus, the Sketch Engine provides the "key words and terms" report shown in Figure 2.
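The gather, clean and de-duplicate loop described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the BootCaT or Sketch Engine code: `search` and `fetch` are hypothetical caller-supplied functions standing in for a search engine and a page downloader, combining seeds into small tuples is an assumed way of forming queries, and cleaning is reduced to whitespace normalisation.

```python
import itertools
import random

def seed_queries(seeds, tuple_size=2, n_queries=3, rng=None):
    """Combine seed words into short multi-word queries (assumed strategy)."""
    rng = rng or random.Random(0)
    combos = list(itertools.combinations(sorted(seeds), tuple_size))
    rng.shuffle(combos)
    return [" ".join(combo) for combo in combos[:n_queries]]

def build_corpus(seeds, search, fetch, tuple_size=2, n_queries=3):
    """Gather pages for each query, clean them, and drop duplicates.

    `search(query)` returns a list of URLs and `fetch(url)` returns page text;
    both are hypothetical and must be supplied by the caller.
    """
    seen_urls, seen_texts, corpus = set(), set(), []
    for query in seed_queries(seeds, tuple_size, n_queries):
        for url in search(query):
            if url in seen_urls:
                continue
            seen_urls.add(url)
            text = " ".join(fetch(url).split())  # crude cleaning: normalise whitespace
            key = text.lower()                   # crude de-duplication: case-insensitive exact match
            if text and key not in seen_texts:
                seen_texts.add(key)
                corpus.append(text)
    return corpus

# Stand-ins for a real search engine and downloader (invented data).
fake_pages = {
    "u1": "Wind  turbines generate power.",
    "u2": "wind turbines generate power.",
    "u3": "Solar panels convert sunlight.",
}
corpus = build_corpus(["wind", "solar", "turbine"],
                      search=lambda query: ["u1", "u2", "u3"],
                      fetch=fake_pages.__getitem__)
print(corpus)
```

A real pipeline would also need boilerplate removal and near-duplicate detection; exact matching is used here only to keep the sketch short.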
The requirements for the term-finding functionality are:
a processing chain, comprising tokeniser, lemmatiser and part-of-speech tagger, installed and ready to apply to the user's domain corpus
a reference corpus processed with the processing chain
a term grammar.
At time of writing, these are in place for Chinese, English, French, German, Japanese, Korean, Russian, Spanish and Portuguese.
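To make the role of the term grammar concrete, here is a minimal sketch of one way a grammar could be applied to the output of the processing chain. It assumes the chain yields (lemma, part-of-speech) pairs and uses a hypothetical one-rule grammar (any run of adjectives and nouns ending in a noun); the tag names and the rule are illustrative assumptions, not the grammars shipped with the Sketch Engine.

```python
import re
from collections import Counter

# Hypothetical tagset: each tag of interest is mapped to a one-letter code.
TAG_CODES = {"ADJ": "A", "NOUN": "N"}

# Hypothetical term grammar: adjectives and nouns, ending in a head noun.
TERM_PATTERN = re.compile(r"[AN]*N")

def candidate_terms(tagged_tokens):
    """Count the maximal spans of (lemma, tag) pairs that match the grammar."""
    # Encode the tag sequence as a string so a regular expression can scan it.
    codes = "".join(TAG_CODES.get(tag, "x") for _, tag in tagged_tokens)
    counts = Counter()
    for match in TERM_PATTERN.finditer(codes):
        lemmas = [lemma for lemma, _ in tagged_tokens[match.start():match.end()]]
        counts[" ".join(lemmas)] += 1
    return counts

# Invented example sentence, already tokenised, lemmatised and tagged.
sample = [("the", "DET"), ("renewable", "ADJ"), ("energy", "NOUN"),
          ("source", "NOUN"), ("reduce", "VERB"),
          ("carbon", "NOUN"), ("emission", "NOUN")]
print(candidate_terms(sample))
```

Running the same function over the domain corpus and the reference corpus would give the two frequency lists that the term-finder compares. Note that this sketch only extracts the longest non-overlapping matches, whereas a fuller grammar might also count nested phrases.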
Figure 2. English key words and terms in the environment domain. The tickboxes let the user iterate the procedure to extend and refine the corpus.
To evaluate a term-finder for a language and a domain, a list of all the 'true terms' is required. Then we can compute precision and recall.
One problem: how to define the domain? The straightforward answer: provide a corpus of it. Then we have the more constrained task of assessing recall and precision, from a given corpus, when the terms in that corpus are known.
Another problem: won't two different terminologists inevitably deliver two different lists?
We approached the task by hunting for research datasets comprising domain corpora and term-lists derived from them by human experts. In most cases, this had been done as part of a term-finding task, so there were also published papers reporting term-finding results over these datasets, which gave us a reference result to compare our own results with. This addressed the second problem: whatever the lists, we were confronted with the same challenge as the resource developers. We found datasets for seven languages and six domains. In each case we entered the corpus into the Sketch Engine, ran the term-finder, and computed precision and recall, which we could then compare with the performance figures of the group who developed the dataset. The paper will present these results.
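For reference, precision and recall against a gold-standard term list can be computed as in the sketch below. It assumes exact string matching between extracted and gold terms, which is a simplification; the function name and the toy lists are invented for illustration.

```python
def precision_recall(extracted, gold):
    """Return (precision, recall) of extracted terms against the gold list."""
    extracted, gold = set(extracted), set(gold)
    true_positives = len(extracted & gold)
    # Precision: share of extracted items that are true terms.
    precision = true_positives / len(extracted) if extracted else 0.0
    # Recall: share of true terms that were extracted.
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall

# Invented toy lists, for illustration only.
extracted = ["base station", "mobile terminal", "first time", "radio link"]
gold = ["base station", "mobile terminal", "handover procedure"]

precision, recall = precision_recall(extracted, gold)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Because a term-finder returns a ranked list, figures like these are usually reported for the top N candidates, so the choice of N affects both numbers.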
The Sketch Engine has for some years been a leading tool for lexicography and corpus linguistics. Its terminology function is now a year old. We present a thorough evaluation.
M. Baroni and S. Bernardini. 2004. BootCaT: Bootstrapping corpora and terms from the web. Proceedings of LREC 2004, Lisbon: ELDA. 1313-1316.
S. Bernardini, A. Ferraresi and E. Zanchetta. 2013. Old needs, new solutions: comparable corpora for language professionals. In S. Sharoff, R. Rapp, P. Zweigenbaum and P. Fung, editors, Building and Using Comparable Corpora. Springer.