
PUBLIC MARKS from ogrisel

February 2008

Simple SSE optimized sin, cos, log and exp

I chose to write them in pure SSE1/MMX so that they run on the Pentium III of your grandmother, and also on my brave Athlon XP, since those beasts are not SSE2 aware. Intel AMath showed me that the performance gain from using SSE2 for that purpose was not large enough (10%) to justify providing an SSE2 version (but it could be done very quickly). The functions use only the _mm_ intrinsics; there is no inline assembly in the code. Advantages: easier to debug, works out of the box on 64-bit setups, and lets the compiler choose what should be stored in a register and what stays in memory. Drawback: some versions of gcc 3.x are badly broken with certain intrinsic functions (_mm_movehl_ps, _mm_cmpeq_ps, etc.), MinGW's gcc for example; beware that the brokenness depends on the optimization level. A workaround is provided (an inline asm replacement for the braindead intrinsics); it is not pretty but it is robust, and broken compilers are detected by the validation program below.

January 2008

Vincent Zoonekynd's Blog

Blog on programming, machine learning and financial analysis

Professor Karl Friston – Selected Publications

Theoretical neurobiology (dynamics and optimisation)

Financial time series forecasting with support vector machines - TeachWiki

Stock return predictability has been a subject of great controversy. The debate has ranged from market efficiency to the number of factors containing information about future stock returns. Support vector regression, on the other hand, has gained great momentum for its ability to predict time series in various applications, including finance (Smola and Schölkopf, 1998). The construction of a prediction model requires factors that are believed to have some intrinsic explanatory power. These explanatory factors fall largely into two categories: fundamental and technical. Fundamental factors include, for example, macroeconomic indicators, which, however, are usually published only infrequently. Technical factors are based solely on the properties of the underlying time series and can therefore be calculated at the same frequency as the time series. Since this study applies support vector regression to high-frequency data, only technical factors are considered.
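
The construction described above is straightforward to sketch in code: derive a few technical factors (lagged returns, a short moving average) from the price series itself and regress the next-step return on them with support vector regression. The sketch below is only an illustration of that idea, not the study's actual setup; it uses scikit-learn's SVR (which postdates the article) on synthetic data, and every window length and hyperparameter is an arbitrary choice.

    # Hedged sketch: next-step return prediction from technical factors via SVR.
    # scikit-learn stands in for the SVR tools available at the time; all
    # window sizes and hyperparameters are illustration values only.
    import numpy as np
    from sklearn.svm import SVR

    rng = np.random.default_rng(0)
    prices = 100 * np.exp(np.cumsum(0.001 * rng.standard_normal(500)))  # synthetic series
    returns = np.diff(np.log(prices))

    def technical_factors(r, lags=5, ma_window=10):
        """Feature matrix of lagged returns plus a moving-average factor."""
        X, y = [], []
        for t in range(max(lags, ma_window), len(r)):
            lagged = r[t - lags:t]                 # last `lags` returns
            ma = r[t - ma_window:t].mean()         # short moving average
            X.append(np.concatenate([lagged, [ma]]))
            y.append(r[t])                         # next-step return to predict
        return np.array(X), np.array(y)

    X, y = technical_factors(returns)
    split = int(0.8 * len(X))
    model = SVR(kernel="rbf", C=1.0, epsilon=1e-4).fit(X[:split], y[:split])
    print("out-of-sample correlation:",
          np.corrcoef(model.predict(X[split:]), y[split:])[0, 1])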

November 2007

Apache UIMA - Apache UIMA

Unstructured Information Management applications are software systems that analyze large volumes of unstructured information in order to discover knowledge that is relevant to an end user. UIMA is a framework and SDK for developing such applications. An example UIM application might ingest plain text and identify entities, such as persons, places, organizations; or relations, such as works-for or located-at. UIMA enables such an application to be decomposed into components, for example "language identification" -> "language specific segmentation" -> "sentence boundary detection" -> "entity detection (person/place names etc.)". Each component must implement interfaces defined by the framework and must provide self-describing metadata via XML descriptor files. The framework manages these components and the data flow between them. Components are written in Java or C++; the data that flows between components is designed for efficient mapping between these languages. UIMA additionally provides capabilities to wrap components as network services, and can scale to very large volumes by replicating processing pipelines over a cluster of networked nodes. Apache UIMA is an Apache-licensed open source implementation of the UIMA specification (that specification is, in turn, being developed concurrently by a technical committee within OASIS, a standards organization). We invite and encourage you to participate in both the implementation and specification efforts.

Index of /RDF

RDF/OWL Wikipedia exports by the Wikimedia Foundation.

Semantic MediaWiki - Ontoworld.org

Semantic MediaWiki (SMW) is an extension of MediaWiki, the wiki system powering Wikipedia, with semantic technology, thus turning it into a semantic wiki. While articles in MediaWiki are just plain texts, SMW allows users to add structured data, comparable to the data one would usually store in a database. SMW uses the fact that such data is already contained in many articles: users just need to "mark" the appropriate places so that the system can extract the relevant data without "understanding" the rest of the text. With this information, SMW can help to search, organise, browse, evaluate, and share the wiki's content. This wiki (the one you are currently using) usually runs on the most recent version of the Semantic MediaWiki extensions, and thus also serves as a demonstration of the system. Semantic MediaWiki is used on many other sites and has also been featured in the press.

Getting started with OpenNLP (Natural Language Processing)

I found a great set of tools for natural language processing. The Java package includes a sentence detector, a tokenizer, a parts-of-speech (POS) tagger, and a treebank parser. It took me a little while to figure out where to start, so I thought I'd post my findings here. I'm no linguist and I have no previous experience with NLP, but hopefully this will help someone get set up with OpenNLP.

wiki.dbpedia.org : Documentation

The DBpedia community uses a flexible and extensible framework to extract different kinds of structured information from Wikipedia. The DBpedia information extraction framework is written using PHP 5. The framework is available from the DBpedia SVN (GNU GPL License). This page describes the DBpedia information extraction framework. The framework consists of the interfaces Destination, Extractor, Page Collection and RDFnode, plus the essential classes Extraction Group, Extraction Job, Extraction Manager, Extraction Result and RDFtriple.

The OpenNLP Homepage

OpenNLP is an organizational center for open source projects related to natural language processing. Its primary role is to encourage and facilitate the collaboration of researchers and developers on such projects. Click here to see the current list of OpenNLP projects. We'll also try to keep a fairly up-to-date list of useful links related to NLP software in general. OpenNLP also hosts a variety of Java-based NLP tools which perform sentence detection, tokenization, POS tagging, chunking and parsing, named-entity detection, and coreference using the OpenNLP Maxent machine learning package. To start using these tools, download the latest release here, and check out the OpenNLP Tools API. For the latest news about these tools and to participate in discussions, check out OpenNLP's Sourceforge project page.

The Stanford NLP (Natural Language Processing) Group

The Stanford NLP Group makes a number of pieces of NLP software available to the public. All these software distributions are licensed under the GNU General Public License for non-commercial and research use. (Note that this is the full GPL, which allows its use for research purposes or other free software projects but does not allow its incorporation into any type of commercial software, even in part or in translation. Please contact us if you are interested in NLP software with commercial licenses.) All the software we distribute is written in Java. Recent distributions require Sun JDK 1.5 (some of the older ones run on JDK 1.4). Distribution packages include components for command-line invocation, jar files, a Java API, and source code.

Renderstate » Blog Archive » PS3 Programming: libspe vs. libspe2 with Multi-Threaded Hello World in C

This little guide covers a multi-threaded Hello World tutorial for the Cell BE found in the PlayStation 3. First we'll step through the code using the deprecated libspe 1.2 and the new libspe 2.1, and finally we'll look at the output we get from both examples.

Sony PS3 Cluster (IBM Cell BE)

Description (pics included) of a Sony PS3 cluster running Linux at NCSU with useful links to resources for programming on the PS3.

Category Theory for the Java Programmer « reperiendi

There are several good introductions to category theory, each written for a different audience. However, I have never seen one aimed at someone trained as a programmer rather than as a computer scientist or as a mathematician. There are programming languages that have been designed with category theory in mind, such as Haskell, OCaml, and others; however, they are not typically taught in undergraduate programming courses. Java, on the other hand, is often used as an introductory language; while it was not designed with category theory in mind, a lot of category theory nevertheless carries over to it directly.

October 2007

WikiPediaVision (beta)

Anonymous edits to the English Wikipedia, shown (almost) in real time on a Google Maps layout.

Nick Szabo -- Introduction to Algorithmic Information Theory

Recent discoveries have unified the fields of computer science and information theory into the field of algorithmic information theory. This field is also known by its main result, Kolmogorov complexity. Kolmogorov complexity gives us a new way to grasp the mathematics of information, which is used to describe the structures of the world. Information is used to describe the cultural structures of science, legal and market institutions, art, music, knowledge, and beliefs. Information is also used in describing the structures and processes of biological phenomena, and phenomena of the physical world. The most obvious application of information is to the engineering domains of computers and communications. This essay will provide an overview of the field; only passing knowledge of computer science and probability theory is required of the reader.
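
For orientation, the field's central definition can be stated in one line (notation mine, not the essay's): the Kolmogorov complexity of a string x is the length of the shortest program that outputs it,

    K(x) = \min \{ |p| : U(p) = x \}

where U is a fixed universal computer and |p| is the length of program p in bits. A string counts as algorithmically random when K(x) is close to |x|, i.e. when it admits no description shorter than itself.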

Writing An Hadoop MapReduce Program In Python - Michael G. Noll

Even though the Hadoop framework is written in Java, programs for Hadoop need not be coded in Java but can also be developed in other languages like Python or C++ (the latter since version 0.14.1). However, the documentation and the most prominent Python example on the Hadoop home page could make you think that you must translate your Python code into a Java jar file using Jython. Obviously, this is not very convenient and can even be problematic if you depend on Python features not provided by Jython. Another issue with the Jython approach is the overhead of writing your Python program in such a way that it can interact with Hadoop; just have a look at the example in ${HADOOP_INSTALL}/src/examples/python/WordCount.py and you will see what I mean. I still recommend at least having a look at the Jython approach and maybe even at the new C++ MapReduce API called Pipes; it's really interesting. That said, the ground is prepared for the purpose of this tutorial: writing a Hadoop MapReduce program in a more Pythonic way, i.e. in a way you should be familiar with.
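
The approach the article builds on is Hadoop Streaming: the mapper and reducer are plain scripts that read lines from stdin and emit tab-separated key/value pairs on stdout, while Hadoop handles the shuffling and sorting in between. A minimal word-count pair in that style might look like the following sketch (the single-file layout and the "map"/"reduce" switch are illustrative choices, not the article's exact code).

    #!/usr/bin/env python
    # Hedged sketch of a Hadoop Streaming word count: mapper and reducer
    # communicate only through tab-separated lines on stdin/stdout.
    import sys

    def mapper():
        # emit "<word>\t1" for every word seen on stdin
        for line in sys.stdin:
            for word in line.strip().split():
                print(f"{word}\t1")

    def reducer():
        # streaming sorts mapper output by key, so counts for a word are contiguous
        current_word, count = None, 0
        for line in sys.stdin:
            word, value = line.rstrip("\n").split("\t", 1)
            if word == current_word:
                count += int(value)
            else:
                if current_word is not None:
                    print(f"{current_word}\t{count}")
                current_word, count = word, int(value)
        if current_word is not None:
            print(f"{current_word}\t{count}")

    if __name__ == "__main__":
        # run as "wordcount.py map" or "wordcount.py reduce"
        mapper() if sys.argv[1:] == ["map"] else reducer()

Submitting it to a cluster then amounts to pointing the Hadoop streaming jar at the two roles with the -mapper and -reducer options plus -input and -output paths; the jar's exact location varies between Hadoop releases.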

Blended Technologies » Blog Archive » Machine Learning and Dragons - a Game

You're a knight and your job is to kill as many dragons as you can. The twist is that the dragons use genetic programming to learn from every encounter. (You can optionally have them use reinforcement learning instead.)

The OCaml Summer Project

From August 15th-17th we had our OSP end-of-summer meeting. Twelve participants from nine of the projects attended. We also had invited talks from Olin Shivers and Phil Wadler. Several people from local universities (NYU, Long Island University) and companies also attended.

LDAP Schema Design

It is possible to make one LDAP directory serve many applications in an organisation. This has the advantage of reducing the effort required to maintain the data, but it does mean that the design must be thought out very carefully before implementation starts. LDAP directories are structured as a tree of entries, where each entry consists of a set of attribute-value pairs describing one object. The objects are often people, organisations, and departments, but can be anything at all. Schema is the term used to describe the shape of the directory and the rules that govern its content. A hypothetical organisation is described, with requirements for “white pages” directory service as well as a wide range of authentication, authorisation, and application-specific directory needs. The issues arising from the LDAP standards are discussed, along with the problems of maintaining compatibility with a range of existing LDAP clients. A plan is proposed for the layout of the directory tree, with particular emphasis on avoiding the need to re-organise it later. This involves careful separation of the data describing people, departments, groups, and application-specific objects. A simple approach to entry design is shown, based on the use of locally-defined auxiliary object classes. The effects of schema design on lookup performance are discussed. Some design tricks and pitfalls are presented, based on recent consulting experience.
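
To make the "tree of entries, each a set of attribute-value pairs" concrete from a client's perspective, here is a small hedged sketch using the python-ldap bindings; the server URL, base DN, and filter are hypothetical placeholders rather than anything from the article.

    # Hedged sketch: browsing an LDAP subtree with python-ldap.
    # Server, base DN and filter below are placeholders.
    import ldap

    con = ldap.initialize("ldap://ldap.example.org")
    con.simple_bind_s()  # anonymous bind

    results = con.search_s(
        "ou=people,dc=example,dc=org",      # subtree holding person entries
        ldap.SCOPE_SUBTREE,
        "(objectClass=inetOrgPerson)",      # standard person object class
        ["cn", "mail", "telephoneNumber"],  # attributes to return
    )

    for dn, attrs in results:
        # each entry comes back as its DN plus a dict of attribute -> values
        print(dn, attrs.get("mail", []))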

September 2007

An Intuitive Explanation of the Information Entropy of a Random Variable

There is a popular game called Twenty Questions that works like this: one person is the "Knower" and picks a point out of the probability space of all objects (thinks of an object). The other is the "Guesser" and asks the Knower to evaluate various random variables (questions) at that point (i.e., to answer questions about the object). The Guesser wins if he can guess the object the Knower is thinking about by asking at most twenty questions.
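
The link to entropy that the title promises is easy to state: each yes/no question yields at most one bit of information, so twenty questions can separate at most 2^20 (about a million) equally likely objects, and the number of questions an optimal Guesser needs on average is governed by the entropy of the Knower's distribution,

    H(X) = -\sum_x p(x) \log_2 p(x)

which is largest when every object is equally likely; with an optimal questioning strategy the expected number of questions lies between H(X) and H(X) + 1.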

James Gardner » Amazon EC2 Basics For Python Programmers

Amazon EC2 (which stands for Amazon Elastic Compute Cloud) is a virtual machine hosting service from Amazon that forms a component of Amazon Web Services, along with other services like S3 and SQS, which provide data storage and message queuing respectively.
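
For the Python side, the era-appropriate client library was boto; the sketch below shows roughly what launching an instance looks like with it. The AMI ID, key pair name, and credentials are placeholders, and the exact call names have shifted across boto versions (boto has since been superseded by boto3), so treat this as an outline rather than the article's code.

    # Hedged sketch: launching an EC2 instance with the classic boto library.
    # Credentials, AMI ID and key pair name are placeholders.
    import time
    from boto.ec2.connection import EC2Connection

    conn = EC2Connection("MY_ACCESS_KEY_ID", "MY_SECRET_ACCESS_KEY")

    # Launch one small instance from a (placeholder) machine image.
    reservation = conn.run_instances(
        "ami-12345678",           # hypothetical AMI ID
        key_name="my-keypair",    # SSH key pair registered with EC2
        instance_type="m1.small",
    )
    instance = reservation.instances[0]

    # Poll until the instance is running, then print its public DNS name.
    while instance.update() != "running":
        time.sleep(10)
    print(instance.public_dns_name)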

Power PostgreSQL - PerfList

This is a set of rules of thumb for setting up your PostgreSQL 8.0 server. A lot of the below is based on anecdotal evidence or practical scaling tests; there's a lot about database performance that we, and OSDL, are still working out. However, this should get you started. All information below is useful as of January 12, 2005 and will likely be updated later. Discussions of settings below supersede the recommendations I've made on General Bits.

August 2007

ICML 2007 - PRELIMINARY VIDEOS FROM THE SPOT

The 24th Annual International Conference on Machine Learning is being held in conjunction with the 2007 International Conference on Inductive Logic Programming at Oregon State University in Corvallis, Oregon. As a broad subfield of artificial intelligence, machine learning is concerned with the design and development of algorithms and techniques that allow computers to "learn". At a general level, there are two types of learning: inductive and deductive.