Package tokenizers. August 29, 2016

Package ‘tokenizers’
August 29, 2016

Type Package
Title A Consistent Interface to Tokenize Natural Language Text
Version 0.1.4
Description Convert natural language text into tokens. The tokenizers have a consistent interface and are compatible with Unicode, thanks to being built on the 'stringi' package. Includes tokenizers for shingled n-grams, skip n-grams, words, word stems, sentences, paragraphs, characters, lines, and regular expressions.
License MIT + file LICENSE
LazyData yes
URL https://github.com/ropensci/tokenizers
BugReports https://github.com/ropensci/tokenizers/issues
RoxygenNote 5.0.1
Depends R (>= 3.1.3)
Imports stringi (>= 1.0.1), Rcpp (>= 0.12.3), SnowballC (>= 0.5.1)
LinkingTo Rcpp
Suggests testthat, covr, knitr, rmarkdown
VignetteBuilder knitr
NeedsCompilation yes
Author Lincoln Mullen [aut, cre], Dmitriy Selivanov [ctb]
Maintainer Lincoln Mullen
Repository CRAN
Date/Publication 2016-08-29 22:59:29

R topics documented:

basic-tokenizers
ngram-tokenizers
stopwords
tokenizers
tokenize_character_shingles
tokenize_word_stems
Index

basic-tokenizers

Basic tokenizers

Description

These functions perform basic tokenization into words, sentences, paragraphs, lines, and characters. The functions can be piped into one another to create at most two levels of tokenization. For instance, one might split a text into paragraphs and then word tokens, or into sentences and then word tokens.

Usage

tokenize_characters(x, lowercase = TRUE, strip_non_alphanum = TRUE,
  simplify = FALSE)
tokenize_words(x, lowercase = TRUE, stopwords = NULL, simplify = FALSE)
tokenize_sentences(x, lowercase = FALSE, strip_punctuation = FALSE,
  simplify = FALSE)
tokenize_lines(x, simplify = FALSE)
tokenize_paragraphs(x, paragraph_break = "\n\n", simplify = FALSE)
tokenize_regex(x, pattern = "\\s+", simplify = FALSE)

Arguments

x

A character vector or a list of character vectors to be tokenized. If x is a character vector, it can be of any length, and each element will be tokenized separately. If x is a list of character vectors, each element of the list should have a length of 1.

lowercase

Should the tokens be made lower case? The default value varies by tokenizer; it is only TRUE by default for the tokenizers that you are likely to use last.

strip_non_alphanum

Should punctuation and white space be stripped?

simplify

FALSE by default so that a consistent value is returned regardless of the length of the input. If TRUE, then an input with a single element will return a character vector of tokens instead of a list.

stopwords

A character vector of stop words to be excluded.

strip_punctuation

Should punctuation be stripped?

paragraph_break

A string identifying the boundary between two paragraphs.

pattern

A regular expression that defines the split.
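
The two-level tokenization mentioned in the Description, and the effect of the simplify and stopwords arguments, can be illustrated with a minimal sketch. It assumes the tokenizers package is installed; the sample text and the variable names (text, paragraphs, words_by_paragraph, content_words) are hypothetical and are not taken from the package documentation.

```r
library(tokenizers)

# Hypothetical sample text: two paragraphs separated by a blank line,
# which matches the default paragraph_break of "\n\n".
text <- "The quick brown fox. It jumps over the lazy dog.\n\nA second paragraph here."

# First level: split the single document into paragraphs.
# With simplify = TRUE and a one-element input, the result is a
# character vector rather than a list.
paragraphs <- tokenize_paragraphs(text, simplify = TRUE)

# Second level: tokenize each paragraph into words. The input is a
# character vector, so the result is a list with one element per paragraph.
words_by_paragraph <- tokenize_words(paragraphs)

# stopwords drops the listed tokens after lower-casing.
content_words <- tokenize_words("The cat sat on the mat",
                                stopwords = c("the", "on"),
                                simplify = TRUE)
```

Calling tokenize_paragraphs with simplify = TRUE is a convenience here; leaving the default simplify = FALSE would return a one-element list, which tokenize_words should also accept given the description of x above.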

Value

A list of character vectors containing the tokens, with one element in the list for each element that was passed as input. If simplify = TRUE and only a single element was passed as input, then the output is a character vector of tokens.

Examples

song