Once you have opened Scrubber, you will see a box where you can choose a file to upload

Author: Ophelia Carroll
Scrubber Tutorial

The first step in preparing an electronic text for lexomic analysis is eliminating all the file’s formatting. This step is called “scrubbing.” To scrub, we use the Scrubber tool. Once you have opened Scrubber, you will see a box where you can choose a file to upload.

These files can be .txt, .xml, .html, or .docx. If you click “Choose File,” you can choose the file you want to scrub from directories on your computer. Then press the “Submit” button.

This will bring you to another screen. As you can see, the text from your file appears in the top box labeled “Unscrubbed.”

If you want to start over with a new text, simply hit the “Clear” button in the top right-hand corner, and you will be returned to the initial screen and asked to upload a new file.

There is a list of scrubbing options along the right-hand side of the screen. The first option is “Remove Punctuation.” If you do not remove punctuation, then periods, commas, etc., will be counted as individual words and will affect the word counts that are used to create dendrograms later. “Remove Punctuation” will not remove standard SGML characters such as &t;.
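To make the behavior of “Remove Punctuation” concrete, here is a minimal Python sketch. It is not Scrubber's own code: the function names and the pattern used to recognize SGML-style codes such as &t; are assumptions made for illustration.

```python
import re
import unicodedata

# Assumed pattern for SGML-style codes such as "&t;" or "&ae;".
ENTITY = re.compile(r"&[A-Za-z]+;")


def _drop_punctuation(chunk: str) -> str:
    # Unicode general categories beginning with "P" are punctuation.
    return "".join(
        ch for ch in chunk if not unicodedata.category(ch).startswith("P")
    )


def remove_punctuation(text: str) -> str:
    """Delete punctuation but leave &name; codes untouched (hypothetical helper)."""
    parts = []
    last = 0
    for match in ENTITY.finditer(text):
        parts.append(_drop_punctuation(text[last:match.start()]))
        parts.append(match.group())  # keep the code exactly as written
        last = match.end()
    parts.append(_drop_punctuation(text[last:]))
    return "".join(parts)
```

Under these assumptions, `remove_punctuation("Hw&ae;t! We Gardena, in geardagum.")` returns `"Hw&ae;t We Gardena in geardagum"`: the exclamation mark, comma, and period are gone, while the `&ae;` code survives for later conversion.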

The second option is “Remove Digits.” Note that this option will remove numbers within the text as well as numbers in the formatting, like page numbers. Numbers will be counted as words unless they are removed from the text.

The next option, “Strip Tags,” is especially important for .html files. HTML files include information about the text within tags, which are denoted by angle brackets (< >). This formatting must be removed. The “Strip Tags” option will only appear if there are tagged objects within your text. Once you click “Strip Tags” there will be two options: to “Keep” or “Discard” words inside the tags. By keeping words inside tags, you remove only the tag markup, so “wealdan” here would remain part of the text. By discarding words inside tags, you delete the tags together with whatever they contain, so “wealdan” would be removed. In this case we’re going to keep the words inside the tags.

The “Make Lowercase” option is next. It forces all capital letters to lowercase. This is important because, without forcing everything to lowercase, diviText and treeView will count “king” with a lowercase “k” as a different word than “King” with an uppercase “K.”

Next, you will want to choose whether or not to “Remove Stopwords.” Stopwords are whole words you want removed from the text. For example, if you have a text divided into chapters, you may not want to count every instance of the word “chapter” that appears in the file, because it is not really part of the body of the text. With stopwords, you must upload your own list of words that you want to remove by clicking the “Choose File” button.
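The digit, tag, and lowercase options can be sketched in a few lines of Python. This is an illustration only, not Scrubber's implementation: the function names are hypothetical, the `<corr>` tag in the usage note is an invented example, and the sketch assumes that “words inside tags” means the text enclosed between an opening tag and its matching closing tag.

```python
import re


def remove_digits(text: str) -> str:
    """Delete every run of digits, including page numbers."""
    return re.sub(r"\d+", "", text)


def strip_tags(text: str, keep_words: bool = True) -> str:
    """Remove tags; optionally remove the text they enclose as well."""
    if not keep_words:
        # "Discard": drop each <name ...>...</name> element with its contents.
        # Nested elements of the same name are not handled in this sketch.
        text = re.sub(r"<(\w+)[^>]*>.*?</\1>", "", text, flags=re.DOTALL)
    # "Keep" (and cleanup after "Discard"): drop any remaining <...> markup.
    return re.sub(r"<[^>]+>", "", text)


def make_lowercase(text: str) -> str:
    """Fold capitals so that "King" and "king" count as one word."""
    return text.lower()
```

For the input `"he sceal <corr>wealdan</corr> 12 Geatum"`, `strip_tags(..., keep_words=True)` leaves “wealdan” in place, while `keep_words=False` removes it along with the tags. Neither mode tidies the whitespace left behind.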

It must be a text file in which the words you want to remove are separated by commas. After choosing the file you want, make sure to hit “Upload Stopwords.” When you upload your stopwords file, a box will pop up that displays your stopwords list.
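As a rough illustration of how a comma-separated stopword file might be read and applied, consider the Python sketch below. The function names are hypothetical and the matching rules (case-sensitive, whole words only) are assumptions; Scrubber itself may behave differently in the details.

```python
import re


def parse_stopwords(file_contents: str) -> set[str]:
    """Split a comma-separated list into a set of words, ignoring blanks."""
    return {word.strip() for word in file_contents.split(",") if word.strip()}


def remove_stopwords(text: str, stopwords: set[str]) -> str:
    """Delete each stopword where it stands alone as a complete word."""
    if not stopwords:
        return text
    alternatives = "|".join(re.escape(word) for word in sorted(stopwords))
    # \b anchors the match at word boundaries, so "and" inside "land" is safe.
    return re.sub(r"\b(?:" + alternatives + r")\b", "", text)
```

A file containing `i, ii, iii, iv` would be parsed into four stopwords. The sketch leaves the surrounding spaces in place after a word is deleted, which does not affect later word counts.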

Here we have uploaded a list of Roman numerals to be deleted from the text. You can scroll down to see the entire list, which will appear in alphabetical order. The stopwords that you choose to delete will only be deleted as entire words. If you choose to delete “and,” “and” will only be deleted as a unique word, and not as it appears within words like “land” or “hand.”

The “Lemmatize” option allows you to count multiple variations of the same word as the same thing. In Old English, for example, there are three ways to write the word “and.” If we do not want to count every version of “and” as a separate and unrelated word, then we lemmatize them, or force all versions to become a single version. We do this by creating a text file with a comma-delimited list, in which we put the variation, comma, and the lemma, or what we want the word to become. You must upload your own lemmatization file by clicking “Upload Lemmas”; the list will then be displayed in a separate box, as with the stopwords.
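A lemma file and its effect can be sketched in Python as follows. This is only an illustration: the function names are hypothetical, and the sketch assumes one “variation,lemma” pair per line, which the tutorial does not state explicitly.

```python
import re


def parse_lemmas(file_contents: str) -> dict[str, str]:
    """Read "variation,lemma" pairs, one pair per line (assumed layout)."""
    lemmas = {}
    for line in file_contents.splitlines():
        if "," not in line:
            continue  # skip blank or malformed lines
        variation, lemma = (part.strip() for part in line.split(",", 1))
        if variation:
            lemmas[variation] = lemma
    return lemmas


def lemmatize(text: str, lemmas: dict[str, str]) -> str:
    """Replace each variation with its lemma, matching whole words only."""
    if not lemmas:
        return text
    alternatives = "|".join(re.escape(variation) for variation in lemmas)
    pattern = re.compile(r"\b(?:" + alternatives + r")\b")
    return pattern.sub(lambda match: lemmas[match.group()], text)
```

With a file containing the single line `and,ond`, the text `"and on land"` becomes `"ond on land"`: the standalone word changes, but the letters “and” inside “land” do not.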

Also like stopwords, lemmas will only be changed if they are entire words. So, if you choose to change “and” to “ond,” you will not be changing “land” to “lond.”

The next option, “Consolidate,” is similar to lemmatize, but it makes changes within words. In Old English there are two letters, thorn and eth, that create the same sound and are in all respects orthographically interchangeable. However, lexomic analysis counts the same word spelled with a thorn versus an eth as two different words, even if the spelling difference is completely arbitrary. We thus might want to force all the eths to become thorns. We do this by uploading a comma-delimited list (text document) containing the letter (or letters) we want changed, comma, and then the letter (or letters) we want to replace them with. When you click “Upload Consolidations,” a box will appear displaying them.
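Because consolidation works inside words, it amounts to plain substring replacement. The Python sketch below illustrates the idea; the function name is hypothetical, and the rules are passed in as a dictionary rather than read from an uploaded file.

```python
def consolidate(text: str, rules: dict[str, str]) -> str:
    """Replace every occurrence of each key with its value, even inside words."""
    for old, new in rules.items():
        # Rules are applied one after another, in dictionary order, so the
        # output of an earlier rule can be changed again by a later rule.
        text = text.replace(old, new)
    return text


# Example rule set: force every eth to become a thorn.
ETH_TO_THORN = {"ð": "þ", "Ð": "Þ"}
```

Applying `consolidate("oððe oþþe", ETH_TO_THORN)` gives `"oþþe oþþe"`, so the two spellings are counted as the same word afterwards.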

Remember that consolidation occurs within words, so be sure to distinguish your lemmas from your consolidations.

Finally, you have the option to format “Special Characters.” The “Use Common Characters” option will convert all the SGML characters, like those used to denote special characters in the Dictionary of Old English, to their Unicode equivalents. To format other types of special characters, choose the “Format Special Characters” option. This allows you to upload your own list of special characters. The list must be comma-delimited, with the character (or characters) you want to change, comma, and the character (or characters) you want to replace them with. When you click “Upload Special Characters,” a box will pop up to display the characters in the uploaded file. This option works like the consolidations option and will change special characters regardless of whether they are their own word or part of a word.
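The conversion of SGML-style codes to Unicode can be sketched in Python as a single lookup pass. The three-entry mapping below is an illustrative assumption, not the list Scrubber actually uses, and the function name is hypothetical; check any real mapping against the documentation for your corpus.

```python
import re

# Illustrative, assumed subset of code-to-character conversions.
COMMON_CHARACTERS = {
    "&ae;": "æ",
    "&d;": "ð",
    "&t;": "þ",
}


def format_special_characters(text: str, mapping: dict[str, str] = COMMON_CHARACTERS) -> str:
    """Replace each known &name; code; leave unknown codes as they are."""
    return re.sub(
        r"&[A-Za-z]+;",
        lambda match: mapping.get(match.group(), match.group()),
        text,
    )
```

Under this mapping, `"Hw&ae;t &t;u o&d;&d;e"` becomes `"Hwæt þu oððe"`. A code that is not in the mapping passes through unchanged, which makes missing entries easy to spot in the scrubbed text.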

Once you have picked all the options to your satisfaction, click the “Scrub” button in the top right-hand corner.

Another box will appear below the box displaying the original, unscrubbed text. This box displays the scrubbed text.

You can scroll through both the original and scrubbed texts to make sure that the text has been scrubbed properly. If you want to change any of the scrubbing options, simply change the option and click the “Scrub” button again. This will re-scrub the file with the new set of scrubbing options. When you are ready to download the scrubbed file, click the “Download” button in the top right-hand corner.

The name of the scrubbed file will be the original uploaded file name followed by “_scrubbed.” You can use diviText to cut this file into segments.

By: Rose Berger, Michael Drout and Leah Smith
lexomics.wheatoncollege.edu

“Any views, findings, conclusions, or recommendations expressed in this presentation do not necessarily reflect those of the National Endowment for the Humanities.”
