GALLOWS VARIANTS AS NULL CHARACTERS IN THE VOYNICH MANUSCRIPT
METHODOLOGY

This study set out to answer a simple question: how does eliminating the gallows variants from the transcription set change the results of statistical queries on the Voynich manuscript? The data collection and analysis for this research project were carried out in Manning Hall, using the resources provided to all students in the School of Information and Library Science at the University of North Carolina, Chapel Hill.

It was hypothesized that the gallows variants in the Voynich manuscript alphabet are null characters, and that removing them would not have a statistically significant impact on correlational power curves, such as those generated by applying Spearman's rank correlation coefficient. The study was designed to create samples that, despite having various characters completely removed, continued to correlate strongly with CURRIER, the base text derived from the voynich.now file. Such correspondence could indicate the presence of null characters, whose removal did not significantly alter the statistics of the modified text. Conversely, if the modified samples exhibited variation consistent with the rank and frequency of the removed characters within the manuscript, this would be strong evidence that the characters are not null.

Sample texts that deviated from the CURRIER model and more closely resembled QU'RAN (an excerpt from the Holy Qu'ran) or GENESIS (a transcription of the Book of Genesis written in Vulgate Latin) might indicate that the omissions in the sample revealed a semantic relationship with a known language. Such correspondence would be tenuous, since a variety of causes could account for it, but the possibility was not dismissed, and the samples were compared with QU'RAN and GENESIS.

The actual analysis was a straightforward application of Spearman's rank correlation coefficient to nine separate data samples, along with the source text and two natural-language control files.
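For readers unfamiliar with the statistic, the following is a minimal pure-Python sketch of Spearman's rank correlation coefficient. It is not the procedure used in the study, which relied on SPSS; the function names and the sample counts are hypothetical, and this section does not specify exactly which paired observations were fed to the coefficient, so the pairing shown here (per-character counts in two texts) is only an assumption for illustration.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values 1..n in ascending order; tied values share the mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of the 1-based positions in the run
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def spearman_rho(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho, computed as the Pearson correlation of the rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rho is undefined when one variable is constant")
    return cov / (var_x * var_y) ** 0.5


if __name__ == "__main__":
    # Hypothetical counts for five characters in two texts (not study data).
    base = [120, 95, 60, 33, 12]
    modified = [90, 110, 58, 35, 10]
    print(round(spearman_rho(base, modified), 3))
```

A value near 1 means the two texts rank the paired items in nearly the same order, which is the sense in which a modified sample would "continue to correlate" with CURRIER.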
Data Collection

All the raw data used in this study was obtained from public FTP archives accessed via the World Wide Web. The quantitative analysis was based on the voynich.now file, which is freely available on the World Wide Web (Gillogly, 2001). voynich.now is a machine-readable version of Prescott Currier's Voynich transcription, using his alphabetic coding scheme.

The voynich.now file was chosen over a subset of the transcription file produced by the European Voynich Manuscript Transcription (EVMT) project, which was created and is maintained by Gabriel Landini, and which is also available on the World Wide Web (Landini et al., 1998). The EVMT transcription files consist of an error-checked compilation of earlier transcription files (principally the Currier and FSG files) using a clear interpretation of the Voynich manuscript character set that treats gallows characters as amalgams of individual symbols rather than as pairs or ligatures, as had previously been assumed. The EVMT files are the most accurate available, and represent the best hope of a standardized alphabet for disparate researchers to adopt at this time. Although not perfect (like any Voynich manuscript transcription alphabet, it carries with it a set of assumptions about the underlying text), the EVMT alphabet (called EVA) was designed to be over-broad rather than over-narrow.

The Currier transcription, although older and arguably less accurate, was chosen because its gallows characters are discrete, rather than composed of groupings of characters. Using voynich.now made the sample texts less ambiguous and simpler to code and process. For example, the gallows character represented by "W" in Currier's alphabet is "cph" in EVA. Since the analysis was conducted on the complete manuscript (the sum total of Voynich character information in existence), the more granular EVA character set was deemed unnecessary.

Data Analysis

The principal tools for analysis were the SPSS and TACT statistical software packages.
TACT is freely distributed on the World Wide Web (Bradley et al., 1993), and the correlational data supplied by SPSS could be generated by hand, or by other hardware or software. As controls, two known-language samples of near-identical size were analyzed alongside the voynich.now base text and modified samples. Arabic, a language that contains a significant volume of null characters, was chosen as the first control. A romanized 140K sample of the Holy Qu'ran was used since, like the Vulgate Latin Bible that Landini et al. use as a benchmark (Landini, 1987), it is likely to represent, with some verisimilitude, the Arabic in use when the Voynich manuscript was written. The second control was a sample from a Vulgate Latin translation of the Book of Genesis.

The Source Texts

Two texts were used as controls in this study. The first is GENESIS (in ASCII format), a 163K text file. The second is QU'RAN, written in the ASMO 708 (ISO-8859-6) encoding scheme, a 124K text file. The Voynich file used for this study is the readily available voynich.now version of Currier's transcription (identified hereafter as CURRIER), written in ASCII and encoded in Currier's version of the Voynich alphabet. The version used for analysis was 120K in size.

All three were standardized by removing comments and extraneous material. In the case of the Qu'ran, ISO-8859-6 characters were translated to arbitrary plain ASCII characters on a one-to-one basis, working through the Latin alphabet in the order the characters appeared in the text. The resulting file uses the Latin characters A-Z and a-k.

The Modifications of CURRIER

The CURRIER text file was modified in nine different ways, to investigate the nature and relevance of various characters within the Voynich manuscript. Two groups of characters were of particular interest: the "gallows" characters represented by B, V, P, and F in Currier's alphabet, and the "gallows ligatures" represented by W, Q, Y, and X.
(FIGURE 1: Gallows Characters in Currier's Alphabet and Voynichese Equivalents)

W, Q, Y, and X all appear in combination with another Voynichese symbol, represented in Currier's alphabet as S. B, V, W, and Y are quite similar, each having only one "leg", while P, F, Q, and X all have two. Given these facts, several possibilities present themselves.

The NO B modification is intended to explore the possibility that each character in the gallows group is discrete, by removing only the B and not its one-legged analogs (V, W, and Y). Similarly, the W to S modification is intended to accomplish the same thing with an "overlaid" gallows character, converting only W to the underlying S. The NO A modification serves as a check, since the A character is statistically similar to the gallows variations in frequency, but shares none of their characteristics in the Voynich manuscript.

The NO BV modification removes a one-legged pair of gallows but leaves W and Y, their "over S" analogs, in place. This version was intended for comparison with NO PF, which removes the two-legged gallows pair. NO BV, WY to S explores the possibility that the gallows characters are linked: all one-legged versions are removed in this modification.

BVPF removes the B, V, P, and F characters, while retaining the "overlaid" gallows analogs. There were 752 B, 3319 P, 202 V, and 5469 F instances, reducing the MS by 9722 characters, from 110,977 to 101,255.

The WQYX to S variant is the opposite of BVPF. It removes the gallows overlays by converting them to the underlying character, S. Assuming the overlays are null, the semantic value of the underlying S would remain intact. The result was the replacement of 146 W, 709 Q, 51 Y, and 561 X characters, bringing the total number of S characters to 8,064, up from 6,597. The MS length was unchanged.

ALLGONE removes every gallows character from the text. W, Q, X, and Y are replaced with S, the underlying character, and B, V, P, and F are simply expunged.
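The nine modifications described above can be summarized as a table of deletions and replacements. The sketch below is an illustrative reconstruction rather than the tooling used in the study: the dictionary layout, the function name `apply_variant`, and the sample string are all hypothetical, and the sample is an arbitrary sequence of Currier letters, not a quotation from the manuscript.

```python
# Each variant maps to (characters converted to S, characters deleted).
# The variant names follow the text; the data layout is an assumption.
VARIANTS = {
    "NO B": ("", "B"),
    "W to S": ("W", ""),
    "NO A": ("", "A"),
    "NO BV": ("", "BV"),
    "NO PF": ("", "PF"),
    "NO BV, WY to S": ("WY", "BV"),
    "BVPF": ("", "BVPF"),
    "WQYX to S": ("WQYX", ""),
    "ALLGONE": ("WQYX", "BVPF"),
}


def apply_variant(text: str, name: str) -> str:
    """Return text with the named modification applied.

    Overlaid gallows become the underlying S; plain gallows are deleted
    outright, so the neighbouring characters close up into a shorter word.
    """
    to_s, deleted = VARIANTS[name]
    table = str.maketrans(to_s, "S" * len(to_s), deleted)
    return text.translate(table)


if __name__ == "__main__":
    sample = "SBOE WAR PFQ"  # hypothetical Currier-alphabet string
    for name in VARIANTS:
        print(f"{name:16} {apply_variant(sample, name)}")
```

Because replacement is one-for-one, any "to S" variant leaves the text length unchanged, which matches the observation above for WQYX to S; the reported S total is also consistent with the replacement counts (6,597 + 146 + 709 + 51 + 561 = 8,064).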
Once all twelve text files were prepared, each was processed using TACT (Text Analysis and Computing Tools), a text-retrieval and analysis software package developed at the University of Toronto. TACT, which was designed for use with small groups of literary texts using Western alphabets (Bradley et al., 1993), parsed the text and returned detailed information on frequency (rank, percentage, and number of words) as well as type and token information. TACT also generated thesauri, and word and character lists useful for further statistical analysis.

Assumptions and Limitations

This study assumes that a null is truly meaningless and not a blank space; that is, it carries no meaning even as document formatting. Thus, when a null is removed, the adjacent characters close up into a new, shorter word, rather than becoming two separate words. The Arabic sample used in this study is modern and unvocalized, rather than classical, Arabic. The Currier transcription is incomplete and imperfect, and makes assumptions about the alphabet that may be entirely incorrect.
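The first assumption above has a direct effect on the type and token counts. The sketch below contrasts the two readings of a null on a made-up string; it is a stand-in for the kind of counts TACT reports, not a reproduction of TACT itself. The function names are hypothetical, and the sketch assumes words are separated by whitespace, which may not match the layout of the actual transcription file.

```python
from collections import Counter
from typing import Dict


def word_stats(text: str) -> Dict[str, int]:
    """Count tokens (all words) and types (distinct words)."""
    tokens = text.split()
    return {"tokens": len(tokens), "types": len(Counter(tokens))}


def remove_as_null(text: str, chars: str) -> str:
    """Delete the characters; neighbours close up into a shorter word."""
    return text.translate(str.maketrans("", "", chars))


def remove_as_space(text: str, chars: str) -> str:
    """Alternative reading: each character acts as a word break."""
    return text.translate(str.maketrans(chars, " " * len(chars)))


if __name__ == "__main__":
    sample = "OBE OE ABE"  # hypothetical string, not manuscript text
    print(word_stats(sample))
    print(word_stats(remove_as_null(sample, "B")))   # study's assumption
    print(word_stats(remove_as_space(sample, "B")))  # rejected alternative
```

Under the study's assumption the number of words stays the same and only their lengths and identities change, whereas treating the removed character as a space would inflate the token count.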