public marks

PUBLIC MARKS from parmentierf with tag taln


NLTK Home (Natural Language Toolkit)

Open source Python modules, linguistic data, and documentation for research and development in natural language processing, supporting dozens of NLP tasks, with distributions for Windows, Mac OS X, and Linux.
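As a minimal taste of the kind of task NLTK streamlines, here is a plain-Python tokenize-and-count sketch; this is an illustration only, and NLTK's own tokenizers and its `FreqDist` class do the same job far more robustly.

```python
import re
from collections import Counter

# Plain-Python sketch of a basic NLP task (tokenization + frequency count);
# NLTK provides proper tokenizers and FreqDist for this.
def word_freq(text):
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)

freq = word_freq("The cat sat on the mat. The cat ran.")
print(freq.most_common(2))  # [('the', 3), ('cat', 2)]
```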


hutchens_phd_1999_the_upwrite_predictor.pdf (application/pdf)

The UpWrite Predictor: A General Grammatical Inference Engine for Symbolic Time Series with Applications in Natural Language Acquisition and Data Compression

Blog: Query interfaces for the Semantic Web

An interesting presentation at Google Tech Talks about different interfaces for querying semantic data. Casual users were presented with four increasingly formal systems: keyword search, natural language search, controlled-language search, and a graphical interface for building query patterns. Interestingly, the users liked natural language best, although keyword queries gave more accurate results.

uClassify - free text classifier web service

uClassify is a free web service where you can easily create your own text classifiers. Example uses:
* Spam filtering
* Web page categorization
* Automatic e-mail support
* Language detection
* Written-text gender recognition
* Sentiment analysis
So what do you want to classify? Only your imagination is the limit!
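uClassify trains classifiers from examples you supply; the sketch below is not its API, just a toy multinomial naive Bayes classifier (a common technique behind such services) showing how a spam filter can be learned from labelled text. All class and variable names are made up for illustration.

```python
import math
from collections import Counter, defaultdict

class NaiveBayesText:
    """Toy multinomial naive Bayes with Laplace smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word frequencies
        self.label_counts = Counter()            # label -> document count
        self.vocab = set()

    def train(self, text, label):
        words = text.lower().split()
        self.word_counts[label].update(words)
        self.label_counts[label] += 1
        self.vocab.update(words)

    def classify(self, text):
        words = text.lower().split()
        total_docs = sum(self.label_counts.values())
        best, best_lp = None, float("-inf")
        for label in self.label_counts:
            # log P(label) + sum of log P(word | label), Laplace-smoothed
            lp = math.log(self.label_counts[label] / total_docs)
            n = sum(self.word_counts[label].values())
            v = len(self.vocab)
            for w in words:
                lp += math.log((self.word_counts[label][w] + 1) / (n + v))
            if lp > best_lp:
                best, best_lp = label, lp
        return best

clf = NaiveBayesText()
clf.train("win money now free prize", "spam")
clf.train("meeting agenda attached for review", "ham")
print(clf.classify("free money prize"))  # spam
```

A real service would of course use far more training data and better tokenization, but the train/classify loop is the same shape.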

Home | OpenCalais

We want to make all the world's content more accessible, interoperable and valuable. Some call it Web 2.0, Web 3.0, the Semantic Web or the Giant Global Graph - we call our piece of it Calais. Calais is a rapidly growing toolkit of capabilities that allow you to readily incorporate state-of-the-art semantic functionality within your blog, content management system, website or application.


NLP tools for free (GPL):
* LIA_PHON: a French text-to-phoneme converter
* LIA_TAGG: a French and English tagger, lemmatizer, and bracketer, plus a re-accentuator (restores accents) for French
* LIA_SCT: an implementation of Semantic Classification Trees, with some extra features
* LIA_NE (beta): a simple named-entity tagger for French based on CRF++

Benoît Sagot - WOLF

WOLF (Wordnet Libre du Français, "Free French Wordnet") is a free semantic lexical resource (a wordnet) for French.

Proxem > Home

Proxem isn't just a technology. It's a company and a vision informed by creativity and passion for all things possible with Natural Language Processing.

Cypher - Beta Release

The Cypher™ beta release is an AI program that generates the .rdf (RDF graph) and .serql (SeRQL query) representation of plain-language input, allowing users to update and query databases in plain language. With robust definition languages, Cypher's grammar and lexicon can quickly and easily be extended to process highly complex sentences and phrases of any natural language, and can cover any vocabulary. Equipped with Cypher, programmers can begin building next-generation semantic web applications that harness what is already the most widely used tool known to man: natural language.


Double Metaphone - Wikipédia

Double Metaphone is a phonetic search algorithm written by Lawrence Philips, the second generation of his Metaphone algorithm. Its implementation was described in June 2000 in the C/C++ Users Journal. It is called "Double" because it can return both a primary and a secondary code for a string; this accounts for ambiguous cases and for multiple variants with common ancestry. For example, encoding the name "Smith" yields the primary code SM0 and the secondary code XMT, while the name "Schmidt" yields the primary code XMT and the secondary code SMT; the two share XMT.
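The point of returning two codes is that two names match if *any* of their codes coincide. The sketch below hard-codes the (primary, secondary) pairs quoted in the article rather than implementing the algorithm itself; the `phonetic_match` helper is a made-up name for illustration.

```python
# (primary, secondary) Double Metaphone codes, taken from the article text;
# a real implementation would compute these from the input string.
CODES = {
    "Smith":   ("SM0", "XMT"),
    "Schmidt": ("XMT", "SMT"),
}

def phonetic_match(name_a, name_b):
    """Two names match phonetically if any of their codes coincide."""
    return bool(set(CODES[name_a]) & set(CODES[name_b]))

print(phonetic_match("Smith", "Schmidt"))  # True: both share the code XMT
```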

TEL :: [tel-00145147, version 1] Définitions et caractérisations de modèles à base d'analogies pour l'apprentissage automatique des langues naturelles

Definitions and characterizations of analogy-based models for machine learning of natural languages


Presentation of Theuth and Blue Moon

A presentation of Theuth and Blue Moon: a new type of parsing algorithm, described as "asyntagmatic". Without going into detail, the claim is that asyntagmatic parsing unlocks everything: it becomes possible to take context into account, understand deictics, detect puns and spoonerisms, recognize the language of a text, or translate texts in which several languages are mixed, even within the same sentence.

Nick Montfort's Computer and Information Science Work


Official Google Research Blog: All Our N-gram are Belong to You

Here at Google Research we have been using word n-gram models for a variety of R&D projects, such as statistical machine translation, speech recognition, spelling correction, entity detection, information extraction, and others. While such models have usually been estimated from training corpora containing at most a few billion words, we have been harnessing the vast power of Google's datacenters and distributed processing infrastructure to process larger and larger training corpora. We found that there's no data like more data, and scaled up the size of our data by one order of magnitude, and then another, and then one more - resulting in a training corpus of one trillion words from public Web pages.
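The word n-gram models mentioned here rest on a very simple primitive: sliding a window of n tokens over a corpus and counting the windows. The toy sketch below shows that counting step on a few words; it is nothing like Google's distributed pipeline over a trillion-word corpus, just the core idea.

```python
from collections import Counter

def ngrams(tokens, n):
    """Yield every contiguous n-token window over the token list."""
    return zip(*(tokens[i:] for i in range(n)))

corpus = "the cat sat on the mat the cat ran".split()
bigrams = Counter(ngrams(corpus, 2))
print(bigrams[("the", "cat")])  # 2
```

Counts like these are then turned into conditional probabilities (how likely is "cat" after "the"?), which is what makes them useful for spelling correction, speech recognition, and machine translation.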

Python for Linguistics - py4lx

This is an ickle collection of tutorials on using Python to do interesting stuff with (human!) languages. They are posted initially on Hacklog, the Blogamundo developer blog, and then moved here, where they are endlessly tweaked to remove embarrassing errors and improve clarity. In theory they should be doable by folks with no programming background, or just a little.

Chatterbots, Tinymuds, and the Turing Test

This paper describes the development of one such Turing System, including the technical design of the program and its performance on the first three Loebner Prize competitions. We also discuss the program's four year development effort, which has depended heavily on constant interaction with people on the Internet via Tinymuds (multiuser network communication servers that are a cross between role-playing games and computer forums like CompuServe). Finally, we discuss the design of the Loebner competition itself, and address its usefulness in furthering the development of Artificial Intelligence.

Charming Python: Get started with the Natural Language Toolkit

In this installment, David introduces you to the Natural Language Toolkit, a Python library for applying academic linguistic techniques to collections of textual data. Programming that goes by the name "text processing" is a start; other capabilities for syntactic and even semantic analysis are further specialized to studying natural languages.

L.Pointal - Python

Laurent Pointal's Python page


SCIgen - An Automatic CS Paper Generator

SCIgen is a program that generates random Computer Science research papers, including graphs, figures, and citations. It uses a hand-written context-free grammar to form all elements of the papers. Our aim here is to maximize amusement, rather than coherence. One useful purpose for such a program is to auto-generate submissions to "fake" conferences; that is, conferences with no quality standards, which exist only to make money. A prime example, which you may recognize from spam in your inbox, is SCI/IIIS and its dozens of co-located conferences (for example, check out the gibberish on the WMSCI 2005 website). Using SCIgen to generate submissions for conferences like this gives us pleasure to no end. In fact, one of our papers was accepted to SCI 2005! See Examples for more details.
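Generating text from a hand-written context-free grammar, as SCIgen does, amounts to recursively expanding nonterminals by picking random productions. The grammar below is a tiny invented example in that spirit, not SCIgen's actual grammar.

```python
import random

# A toy hand-written context-free grammar (invented for illustration):
# nonterminals map to lists of productions; anything else is a terminal.
GRAMMAR = {
    "TITLE":  [["METHOD", "for", "TASK"]],
    "METHOD": [["A Scalable Framework"], ["Stochastic Unification"]],
    "TASK":   [["Byzantine Fault Tolerance"], ["Cache Coherence"]],
}

def expand(symbol, rng):
    """Recursively expand a symbol by choosing random productions."""
    if symbol not in GRAMMAR:
        return symbol  # terminal: emit as-is
    production = rng.choice(GRAMMAR[symbol])
    return " ".join(expand(s, rng) for s in production)

rng = random.Random(0)
print(expand("TITLE", rng))  # e.g. "Stochastic Unification for Cache Coherence"
```

SCIgen applies the same expansion to every element of a paper (abstract, sections, figures, citations), which is why the output is locally fluent but globally incoherent.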

EuroWordNet: Building a multilingual database with wordnets for several European languages

EuroWordNet is a multilingual database with wordnets for several European languages (Dutch, Italian, Spanish, German, French, Czech and Estonian). The wordnets are structured in the same way as the American wordnet for English (Princeton WordNet, Miller et al. 1990) in terms of synsets (sets of synonymous words) with basic semantic relations between them. Each wordnet represents a unique language-internal system of lexicalizations. In addition, the wordnets are linked to an Inter-Lingual-Index, based on the Princeton wordnet. Via this index, the languages are interconnected so that it is possible to go from the words in one language to similar words in any other language. The index also gives access to a shared top ontology of 63 semantic distinctions. This top ontology provides a common semantic framework for all the languages, while language-specific properties are maintained in the individual wordnets. The database can be used, among other things, for monolingual and cross-lingual information retrieval, as demonstrated by the users in the project.
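The Inter-Lingual-Index mechanism can be pictured as each language's synsets pointing at shared index entries, so cross-lingual lookup is just a join through the index. The sketch below uses invented ILI identifiers and a trivial two-language table purely to illustrate that join; real EuroWordNet data is far richer.

```python
# Toy illustration of EuroWordNet-style interlingual linking.
# ILI ids and entries are invented for this example.
ILI_LINKS = {
    "fr": {"chien": "ILI-0001", "chat": "ILI-0002"},
    "nl": {"hond": "ILI-0001", "kat": "ILI-0002"},
}

def translate(word, src, dst):
    """Go from a word in one language to words in another via the ILI."""
    ili = ILI_LINKS[src].get(word)
    return [w for w, i in ILI_LINKS[dst].items() if i == ili]

print(translate("chien", "fr", "nl"))  # ['hond']
```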