NLTK-Lite: Efficient Scripting for Natural Language Processing

Steven Bird
Department of Computer Science and Software Engineering, University of Melbourne, Victoria 3010, Australia
Linguistic Data Consortium, University of Pennsylvania, Philadelphia PA 19104-2653, USA

Abstract

The Natural Language Toolkit is a suite of program modules, data sets, tutorials and exercises covering symbolic and statistical natural language processing. NLTK is popular in teaching and research, and has been adopted in dozens of NLP courses. NLTK is written in Python and distributed under the GPL open source license. Over the past year the toolkit has been completely rewritten, simplifying many linguistic data structures and taking advantage of recent enhancements in the Python language. This paper reports on the resulting, simplified toolkit, NLTK-Lite, and shows how it is used to support efficient scripting for natural language processing.

1 Introduction

NLTK, the Natural Language Toolkit, is a suite of Python libraries and programs for symbolic and statistical natural language processing (Loper and Bird, 2002; Loper, 2004). NLTK includes graphical demonstrations and sample data. It is accompanied by extensive documentation, including tutorials that explain the underlying concepts behind the language processing tasks supported by the toolkit.

NLTK is ideally suited to students who are learning NLP (natural language processing) or conducting research in NLP or closely related areas, including empirical linguistics, cognitive science, artificial intelligence, information retrieval, and machine learning. NLTK has been used successfully as a teaching tool, as an individual study tool, and as a platform for prototyping and building research systems (Liddy and McCracken, 2005; Sætre et al., 2005).

We chose Python because it has a shallow learning curve, its syntax and semantics are transparent, and it has good string-handling functionality. As an interpreted language, Python facilitates interactive exploration. As an object-oriented language, Python permits data and methods to be encapsulated and re-used easily. Python comes with an extensive standard library, including tools for graphical programming and numerical processing (Rossum, 2003a; Rossum, 2003b).

Over the past four years the toolkit grew rapidly and the data structures became significantly more complex. Each new processing task brought with it new requirements on input and output representations. It was not clear how to generalize tasks so they could be applied independently of each other. As a simple example, consider the independent tasks of tagging and stemming, which both operate on sequences of tokens. If stemming is done first, we lose information required for tagging. If tagging is done first, the stemming must be able to skip over the tags. If both are done independently, we need to be able to align the results. As task combinations multiply, managing the data becomes extremely difficult.

To address this problem, NLTK 1.4 introduced a new architecture for tokens based on Python's native dictionary data type. Tokens could have an arbitrary number of named properties, like TAG and STEM. Whole sentences, and even whole documents, were represented as single tokens having a SUBTOKENS property to hold sequences of smaller tokens. Parse trees were likewise tokens, with a special CHILDREN property.
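As a rough sketch (ours, for illustration; the actual NLTK 1.4 property names and API differed in detail), a tagged and stemmed sentence under this architecture amounted to a nested dictionary:

# Illustrative sketch only: approximates NLTK 1.4's dictionary-based
# token style; not the toolkit's actual classes or property names.
sentence = {
    'SUBTOKENS': [
        {'TEXT': 'dogs',   'TAG': 'nns', 'STEM': 'dog'},
        {'TEXT': 'barked', 'TAG': 'vbd', 'STEM': 'bark'},
    ]
}
# Every task had to know which property names to read and write:
stems = [tok['STEM'] for tok in sentence['SUBTOKENS']]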
The advantage of this token architecture was that it unified many different data types, and permitted distinct tasks to be run independently. Unfortunately this architecture also came with a significant overhead for programmers, who had to keep track of a growing set of property names, and who were often forced to use "rather awkward code structures" (Hearst, 2005). It was clear that the re-engineering done in NLTK 1.4 mainly got in the way of efficient authoring of NLP scripts. This paper presents a new, simplified toolkit called NLTK-Lite, gives a brief overview and tutorial, and identifies some areas where more contributions would be welcome.

2 Overview of NLTK-Lite

NLTK-Lite is a suite of Python packages providing a range of standard NLP data types, interface definitions and processing tasks, corpus samples and readers, together with animated algorithms, extensive tutorials, and problem sets. Data types include: tokens, tags, chunks, trees, and feature structures. Interface definitions and reference implementations are provided for tokenizers, stemmers, taggers (regexp, ngram, Brill), chunkers, and parsers (recursive-descent, shift-reduce, chart, probabilistic). Corpus samples and readers include: the Brown Corpus, the CoNLL-2000 Chunking Corpus, the CMU pronunciation dictionary, the NIST Information Extraction and Entity Recognition Corpus, Ratnaparkhi's Prepositional Phrase Attachment Corpus, the Penn Treebank, and the SIL Shoebox corpus format.

NLTK-Lite differs from NLTK in the following key respects: fundamental representations are kept as simple as possible (e.g. strings, tuples, trees); all streaming tasks are implemented as iterators instead of lists, in order to limit memory usage and to ensure that data-intensive tasks produce output as early as possible; the default pipeline processing paradigm leads to more transparent code; taggers incorporate backoff, leading to much smaller models and faster operation; method names are shorter (e.g. tokenizer.RegexpTokenizer becomes tokenize.regexp); and the barrier to entry for contributed software is removed, now that there is no requirement to support the special NLTK token architecture.

3 Simple Processing Tasks

In this section we review some simple NLP processing tasks, and show how they are performed in NLTK-Lite.

3.1 Tokenization and Stemming

The following three-line program imports the tokenize package, defines a text string, and then tokenizes the string on whitespace to create a list of tokens. (Note that >>> is Python's interactive prompt; ... is the second-level prompt.)

>>> from nltk_lite import tokenize
>>> text = 'Hello world. This is a test.'
>>> list(tokenize.whitespace(text))
['Hello', 'world.', 'This', 'is', 'a', 'test.']

Several other useful tokenizers are provided. We can stem the output of tokenization using the Porter Stemmer as follows:

>>> text = 'stemming can be fun and exciting'
>>> tokens = tokenize.whitespace(text)
>>> porter = tokenize.PorterStemmer()
>>> for token in tokens:
...     print porter.stem(token),
stem can be fun and excit
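Because streaming tasks like tokenize.whitespace() return iterators, such stages also compose into a lazy pipeline. The following sketch (our illustration, built only from the calls shown above) stems each token on demand, without constructing an intermediate list:

>>> text = 'stemming can be fun and exciting'
>>> porter = tokenize.PorterStemmer()
>>> stems = (porter.stem(t) for t in tokenize.whitespace(text))
>>> for stem in stems:    # tokens flow through one at a time
...     print stem,
stem can be fun and excit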
The corpora included with NLTK-Lite come supplied with corpus readers that understand the file structure of the corpus, and load the data into Python data structures. For example, the following code reads the first sentence of part a of the Brown Corpus. It prints a list of tuples, where each tuple consists of a word and its tag.

>>> from nltk_lite.corpora import brown, extract
>>> print extract(0, brown.tagged('a'))
[('The', 'at'), ('Fulton', 'np-tl'), ('County', 'nn-tl'),
('Grand', 'jj-tl'), ('Jury', 'nn-tl'), ('said', 'vbd'), ...]

NLTK-Lite provides support for conditional frequency distributions, making it easy to count up items of interest in specified contexts. The code sample and output in Figure 1 count the usage of modal verbs in the Brown Corpus, displaying them in a table by genre.

Figure 1: Counting modal verbs in the Brown Corpus by genre.

from nltk_lite.probability import ConditionalFreqDist  # assumed import location

cfdist = ConditionalFreqDist()
for genre in brown.items:                  # each genre
    for sent in brown.tagged(genre):       # each sentence
        for (word, tag) in sent:           # each tagged token
            if tag == 'md':                # found a modal
                cfdist[genre].inc(word.lower())

modals = ['can', 'could', 'may', 'might', 'must', 'will']
print "%-40s" % 'Genre', ' '.join([("%6s" % m) for m in modals])
for genre in cfdist.conditions():          # generate rows
    print "%-40s" % brown.item_name[genre],
    for modal in modals:
        print "%6d" % cfdist[genre].count(modal),
    print

Genre                                       can  could    may  might   must   will
press: reportage                             94     86     66     36     50    387
press: reviews                               44     40     45     26     18     56
press: editorial                            122     56     74     37     53    225
skill and hobbies                           273     59    130     22     83    259
religion                                     84     59     79     12     54     64
belles-lettres                              249    216    213    113    169    222
popular lore                                168    142    165     45     95    163
miscellaneous: government                   ...

[...] they are named on the NLTK contributors page, linked from nltk.sourceforge.net.

References

Marti Hearst. 2005. Teaching applied natural language processing: Triumphs and tribulations. In Proceedings of the Second ACL Workshop on Effective Tools and Methodologies for Teaching NLP and CL, pages 1–8, Ann Arbor, Michigan, June. Association for Computational Linguistics.

Elizabeth Liddy and Nancy McCracken. 2005. Hands-on NLP for an interdisciplinary audience. In Proceedings of the Second ACL Workshop on Effective Tools and Methodologies for Teaching NLP and CL, pages 62–68, Ann Arbor, Michigan, June. Association for Computational Linguistics.

Edward Loper and Steven Bird. 2002. NLTK: The natural language toolkit. In Proceedings of the ACL Workshop on Effective Tools and Methodologies for Teaching Natural Language Processing and Computational Linguistics, pages 62–69. Somerset, NJ: Association for Computational Linguistics. http://arxiv.org/abs/cs/0205028.

Edward Loper. 2004. NLTK: Building a pedagogical toolkit in Python. In PyCon DC 2004. Python Software Foundation. http://www.python.org/pycon/dc2004/papers/.

Guido van Rossum. 2003a. An Introduction to Python. Network Theory Ltd.

Guido van Rossum. 2003b. The Python Language Reference. Network Theory Ltd.

Rune Sætre, Amund Tveit, Tonje S. Steigedal, and Astrid Lægreid. 2005. Semantic annotation of biomedical literature using Google. In Data Mining and Bioinformatics Workshop, volume 3482 of Lecture Notes in Computer Science. Springer.