TinyCC 2.0 User’s Manual
TinyCC 2.0 is a text corpus production engine that can be used to produce corpora in Leipzig Corpus Collection (LCC) format. The LCC-DVD 1.0, distributed from May 2006 on, was created using tinyCC 1.1 and other procedures. For further explanations on LCC corpus building, see [Quasthoff et al. 2006].
TinyCC 2.0 splits the text into sentences and creates tab-separated files, containing:
The log-likelihood ratio [Dunning 1993] is used as the significance test for co-occurrences.
The implementation consists of a shell script calling programs written in JAVA and PERL. It is platform-dependent and was tested only on LINUX.
Download the archive tinyCC2.tar.gz into a folder of your choice and unpack it with:
gzip -d tinyCC2.tar.gz
tar -xf tinyCC2.tar
A maintenance update that addresses some problems with processing UTF-8 text is available as tinyCC2.1.1.tar.gz.
!! Windows Users !!: A somewhat slower and less comfortable version of tinyCC is available at tinyCC1.5win.zip. Only use this version if you do not have the possibility to run tinyCC2.0 in a UNIX-like environment.
To run tinyCC 2.0, you need a Java Runtime Environment (JRE), version 1.5 or later. You can obtain it at http://java.sun.com/j2se/1.5.0/download.jsp. Please ensure that java is in the path: to check this, type “java -version” in your shell; it should respond with a version number of 1.5.0 or higher. Further, you need PERL version 5 or higher. The latest version can be downloaded at http://www.perl.com/download.csp. Please ensure that PERL is in the path: to check this, type “perl -v” in your shell; it should respond with a version number of 5 or higher.
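Both checks can be wrapped in a small shell function. This is only a sketch: the name require_cmd is made up for this example and is not part of the tinyCC distribution. It only tests whether a command can be found; the version numbers still have to be read from the output of “java -version” and “perl -v”.

```shell
# Hypothetical helper (not shipped with tinyCC): report whether a
# command can be found through the PATH.
require_cmd() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found: $1"
    else
        echo "missing: $1 (please add it to your PATH)" >&2
        return 1
    fi
}

# Usage:
#   require_cmd java && java -version
#   require_cmd perl && perl -v
```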
Raw text corpora can be fed into tinyCC 2.0 in three different formats: plain text, HTML and SATZ.S. To retain the source of each sentence (e.g. the name of the text), the SATZ.S format lets you provide this information directly. Otherwise, the source will carry the name of the file the sentence was found in.
The text is given in plain format. Sentences should not cross lines: if your corpus is formatted such that carriage-returns can be found within sentences, please remove them beforehand. The text must be given in files with “.txt” extension.
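If your corpus is hard-wrapped, a filter along the following lines may help to prepare it. This is a sketch under one loud assumption: paragraphs are separated by blank lines, so every line break inside a paragraph can be replaced by a space. The function name unwrap_paragraphs is hypothetical and not part of tinyCC; please inspect the result before processing a large corpus.

```shell
# Hypothetical helper (not part of tinyCC): join hard-wrapped lines so
# that no sentence crosses a line.
# ASSUMPTION: paragraphs are separated by blank lines.
unwrap_paragraphs() {
    # 1. read the given files (or standard input if none are given)
    # 2. drop DOS carriage returns
    # 3. awk paragraph mode (RS = ""): each blank-line separated block
    #    is one record; replace its inner line breaks by spaces
    cat -- "$@" | tr -d '\r' |
        awk 'BEGIN { RS = "" } { gsub(/\n/, " "); print }'
}

# Usage:
#   unwrap_paragraphs wrapped.txt > mycorpus/unwrapped.txt
```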
The text is given in HTML encoding in files with “.htm” or “.html” extension. In pre-processing, all HTML elements will be removed.
The text can only be given in a plain format. To feed sources to the process, the following line should be present BEFORE every different text source:
Please note that this line starts with a space-character.
To provide text data to the process, please put all files containing the text in these formats into one folder (subfolders are possible).
Change to the directory you unpacked the archive to. The program accepts three parameters:
The distribution comes with a small sample in all three formats in the folder sampledata. You can check the functionality of tinyCC by typing
./tinyCC mycorpusPLAIN sampledata/PLAIN none
in your shell. The program will produce seven files in a subfolder “result”:
For corpora this small, please do not expect meaningful co-occurrences.
This section describes the format of the seven output files and what the output means.
This file has two columns and contains the sentences of the corpus.
1st column: sentence-id, as used in inv_so and inv_w
2nd column: sentence text as in the original. The internal tokenization of tinyCC is not reflected here.
This file has three columns and contains the words of the corpus
1st column: word-id, as used in inv_w, co_s, co_n
2nd column: word
3rd column: word frequency count in the corpus
The first 100 word-ids are reserved for special characters such as punctuation, begin-of-sentence (%^%), end-of-sentence (%$%) and numeral (_NUMBER_).
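As an illustration of how this file can be used, the following sketch lists the most frequent regular words. The function name top_words is hypothetical, the path of the words file is passed as an argument, and the sketch assumes that the reserved entries are exactly those with a word-id of 100 or less; please verify this against your own output.

```shell
# Hypothetical helper: print the N most frequent regular words.
#   $1 = path of the words file, $2 = N (default 10)
# ASSUMPTION: reserved entries are exactly those with word-id <= 100.
top_words() {
    awk -F '\t' '$1 > 100' "$1" |             # skip reserved word-ids
        sort -t "$(printf '\t')" -k3,3nr |    # sort by frequency, descending
        head -n "${2:-10}"
}

# Usage:
#   top_words path/to/words-file 20
```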
This file has two columns and contains the sources
1st column: source-id as used in inv_so
2nd column: Source name: either file name or contents of <name>-tag in SATZ.S format
This file has two columns and indexes sentences by source.
1st column: sentence-id as used in sentences
2nd column: source-id as used in sources
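The three files described so far can be joined to show each sentence together with the name of its source. The sketch below does this with awk; the function name sentences_with_source is hypothetical, and the three file paths are passed as arguments because the exact file names in your result folder may differ.

```shell
# Hypothetical helper: print sentence-id, source name and sentence text.
#   $1 = sources file, $2 = inv_so file, $3 = sentences file
sentences_with_source() {
    awk -F '\t' -v OFS='\t' '
        FNR == 1    { fileno++ }                  # count the input files
        fileno == 1 { name[$1] = $2; next }       # source-id -> source name
        fileno == 2 { src[$1] = $2; next }        # sentence-id -> source-id
                    { print $1, name[src[$1]], $2 }
    ' "$1" "$2" "$3"
}

# Usage:
#   sentences_with_source path/to/sources path/to/inv_so path/to/sentences
```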
This file has four columns of which the fourth is optional. It indexes sentences by words
1st column: word-id as in words
2nd column: sentence-id as in sentences
3rd column: position in sentence. Here, the internal tokenization is reflected.
4th column (optional): Contains “-” if word is part of a multi word unit.
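Since the positions reflect the internal tokenization, the token sequence of a sentence can be recovered from this file together with the words file. The following is a minimal sketch; the function name sentence_tokens is hypothetical, and the optional fourth column (multi word units) is simply ignored here.

```shell
# Hypothetical helper: print the tokens of one sentence in order.
#   $1 = words file, $2 = inv_w file, $3 = sentence-id
sentence_tokens() {
    awk -F '\t' -v sid="$3" '
        NR == FNR { word[$1] = $2; next }         # first file: word-id -> word
        $2 == sid { print $3 "\t" word[$1] }      # position, word
    ' "$1" "$2" |
        sort -n |             # order by position in sentence
        cut -f 2 |            # keep only the word
        paste -s -d ' ' -     # join the tokens on one line
}

# Usage:
#   sentence_tokens path/to/words path/to/inv_w 42
```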
This file has four columns and contains significant neighbour-based co-occurrences
1st column: word-id of left word in a word bigram
2nd column: word-id of right word in a word bigram
3rd column: frequency of word bigram consisting of left and right word
4th column: log-likelihood ratio
This file has four columns and contains significant sentence-based co-occurrences
1st column: word-id of word 1
2nd column: word-id of word 2
3rd column: frequency of joint occurrence
4th column: log-likelihood ratio
The data is symmetric in columns 1 and 2.
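Both co-occurrence files store word-ids only, so a lookup usually involves the words file as well. The sketch below lists the most significant partners of a given word; the function name cooc_partners is hypothetical. It matches the query word against column 1 only: for the sentence-based file this relies on the symmetry described above, whereas for the neighbour-based file it returns right neighbours only.

```shell
# Hypothetical helper: list co-occurrence partners of a word, most
# significant first, as: partner word, frequency, log-likelihood ratio.
#   $1 = words file, $2 = co-occurrence file, $3 = query word,
#   $4 = number of partners to show (default 10)
cooc_partners() {
    awk -F '\t' -v OFS='\t' -v q="$3" '
        NR == FNR {                               # first file: words
            word[$1] = $2
            if (($2 "") == (q "")) { qid = $1; found = 1 }
            next
        }
        found && $1 == qid { print word[$2], $3, $4 }
    ' "$1" "$2" |
        sort -t "$(printf '\t')" -k3,3nr |        # sort by significance
        head -n "${4:-10}"
}

# Usage:
#   cooc_partners path/to/words path/to/co_s house 20
```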
The parameters of tinyCC 2.0 have been set to sensible defaults. For normal text corpora, there should be no need to change them. Should you need to change them nevertheless, this section describes how.
For changing internal parameters, open the file “tinyCC.sh” in the main folder of your installation and look for the following section, starting after the initial comments:
# input text is in format (latin|utf8)
# locales for latin must be installed on your system!
# See `localedef --list-archive` for a list of installed locales
# Edit /etc/locale.gen and sudo locale-gen to enable specific locales
# locale to be used for processing ISO 8859 text
# name of this locale as understood by `recode`
# locale to be used for processing UTF-8 text
# Memory max usage in MB (approximate)
# min frequency for scoocs
# min sig for scooc
# min freq for nbcooc
# min sig for NBcooc
# number of digits after .
# temp directory
# result directory
These parameters are explained now:
· TEXTFORM: Specifies which encoding to assume for the processed texts, either `latin` or `utf8`. `latin` should work for most encodings, such as ISO-8859-* and windows12++.
· LTYPE: Name of an available locale for ISO-8859-* encoding.
· LNAME: Name of the above encoding as understood by GNU recode (see recode -l for the list of supported encodings)
· UTYPE: Name of an available locale for UTF-8 encoding. See localedef --list-archive for a list of the locales installed on your system. If no locale supporting UTF-8 is available on your system, select one from /usr/share/i18n/SUPPORTED and run sudo $EDITOR /etc/locale.gen to add it to your locales list. Then run sudo locale-gen to rebuild the locales on your system.
· MAXMEM: The maximum RAM in megabytes the process is allowed to use. As this value is very approximate, please set it considerably lower than your main memory. Larger values speed up co-occurrence computation (especially for large corpora), but too large values will result in swapping.
· SMINFREQ/NMINFREQ: The minimum joint occurrence frequency to be taken into account for sentence/neighbour-based co-occurrences. A value of 1 should not be used, see [Moore 2004].
· SMINSIG/NMINSIG: The minimum log-likelihood ratio to be taken into account for co-occurrences. 3.84 corresponds to 5% error probability, 6.63 corresponds to 1% error probability, also cf. [Moore 2004].
· DIGITS: Output precision for log-likelihood ratios.
· TEMP: temporary working directory
· RES: where to store the output
Further, you might change
· tokenisation: Dive into “perl/tokenize.pl” (latin1) and “perl/tokenize_utf8.pl” (UTF-8). Please be careful to preserve the file's encoding.
· behaviour on carriage-returns inside sentences: remove “-n” in the text2satz call
· significance formula: dive into “perl/nbcooc.pl” and “perl/ssig.pl”
· platform dependence: the most crucial point is the usage of “bin/sort64” which is UNIX sort compiled for 64 bits. 32-bit sorts do not handle temporary files larger than 2GB.
TinyCC 2.0 merely converts plain text data into the LCC format, thereby computing co-occurrences. Duplicates and ‘dirt’ are not removed. TinyCC was tested on corpora of up to 50 million sentences (750 million words).
TinyCC 2.0 was developed by Chris Biemann at the University of Leipzig. The component handling sources and performing sentence splitting was developed by Fabian Schmidt. Some fixes for UTF-8 handling were implemented by Matthias Richter. Thanks go to all the testers from the NLP Department, University of Leipzig.
[Dunning 1993] Dunning, T. E. (1993): Accurate Methods for the Statistics of Surprise and Coincidence. Computational Linguistics 19(1) http://www.comp.lancs.ac.uk/ucrel/papers/tedstats.pdf
[Moore 2004] Moore, R. C. (2004): On Log-Likelihood-Ratios and the Significance of Rare Events. Proceedings of EMNLP 2004, Barcelona, Spain http://research.microsoft.com/users/bobmoore/rare-events-final-rev.pdf
[Quasthoff et al. 2006] Quasthoff, U., Richter, M. and Biemann, C. (2006): Corpus Portal for Search in Monolingual Corpora. Proceedings of LREC-06, Genoa, Italy QuasthoffBiemannRichter06portal.pdf