GERMAN NAMED ENTITY RECOGNIZER - JUNE, 2010
===========================================
Manaal Faruqui, manaal.iitkgp@gmail.com

LICENSE
=======

Both the Stanford NER and the German classifiers are available under
the GNU GPL. That is, they can be used for academic (or any other)
research purposes, but cannot be integrated into proprietary commercial
software. By downloading the software and data, you acknowledge the
terms and conditions of the GPL.

TUTORIAL
========

This document contains quickstart guidelines for end users who wish to
apply the pretrained NER models.  For further instructions on training
your own NER model, please see
http://www-nlp.stanford.edu/software/crf-faq.shtml 
and refer to the README.txt distributed with the Stanford-NER as well.

USAGE
=====

The Stanford NER system requires Java 1.5 or later.
We have only tested it on the Sun JVM.

(1) Download the Stanford-NER version 1.1.1 from
    http://www-nlp.stanford.edu/software/CRF-NER.shtml
    and unpack the archive.
(2) cd stanford-ner-2009-01-16/
(3) Download the chosen German classifier from the website.
(4) Move the classifier to stanford-ner-2009-01-16/classifiers/

(5) Tokenize the test file (named "test" here) with the following command:
  * java -cp stanford-ner.jar edu.stanford.nlp.process.PTBTokenizer test > test.tok
(6) Append four placeholder "O" columns to each token line to prepare
    the tokenized file for the tagger:
  * perl -ne 'chomp; print "$_ O O O O\n"' test.tok > test.tok.ready

(7) Tag your file with the following command, replacing classifierName
    with the file name of the classifier you downloaded:
  * java -cp stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier classifiers/classifierName -testFile test.tok.ready
(8) Output format: Token NE-tag O (tab-separated columns)
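
The column-padding step and the output format above can be illustrated
with a small shell sketch. It is only a sketch under stated assumptions:
the function names pad_columns and extract_entities are hypothetical
helpers, not part of the Stanford NER distribution, and the I-PER / I-LOC
tags in the sample data assume CoNLL 2003-style labels. No Java is needed
to run it, since it works on made-up sample lines.

```shell
#!/bin/sh
# Sketch only: hypothetical helpers around the tagging command.
# The sample tokens and tags below are made up for illustration.

# Append four placeholder "O" columns to every token line
# (same effect as the perl one-liner in the steps above).
pad_columns() {
    awk '{ print $0 " O O O O" }'
}

# Keep only lines whose second tab-separated column (the NE tag) is not "O".
# Assumes the "Token <TAB> NE-tag <TAB> O" output layout described above.
extract_entities() {
    awk -F '\t' 'NF >= 2 && $2 != "O" { print $1 "\t" $2 }'
}

# Demonstration on made-up data.
printf 'Angela\nMerkel\n' | pad_columns
printf 'Angela\tI-PER\tO\nbesucht\tO\tO\nBerlin\tI-LOC\tO\n' | extract_entities
```

In real use, pad_columns would read test.tok and write test.tok.ready,
and extract_entities would read the tagger output, for example after
redirecting the output of the tagging command to a file.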


GERMAN NER CLASSIFIERS
======================

(1) Huge German Corpus-generalized classifier : This classifier was
    trained on the CoNLL 2003 German data and generalized with a
    distributional similarity lexicon built from the 175 million
    tokens of the Huge German Corpus (HGC), clustered into 600
    clusters. HGC is a collection of newswire text, so we suggest
    using this classifier for tagging newswire text.

(2) deWac-generalized classifier : This classifier was trained on
    the CoNLL 2003 German data and generalized with a distributional
    similarity lexicon built from 175 million tokens of the deWac
    corpus, clustered into 400 clusters. The deWac corpus was created
    by scraping content from the web, so it is noisy and contains
    data from all genres; we therefore suggest using this classifier
    for tagging all other kinds of documents.

--------------------------------------------------------
               |      HGC       |           deWac      |
--------------------------------------------------------
Type of Corpus |  Newswire-Text |  Data from Web (Raw) |
--------------------------------------------------------
Amount of Data |   175M tokens  |       175M tokens    |
--------------------------------------------------------
#Clusters      |      600       |            400       |
--------------------------------------------------------


CONTACT
=======

For more information, bug reports, and fixes, contact:

    Manaal Faruqui
    Dept of Computer Science & Engg, IIT Kharagpur
    Kharagpur 721302, West Bengal
    INDIA
    manaal.iitkgp@gmail.com
    http://cse.iitkgp.ac.in/~manaalf
