A messy and incomplete list of open source (and some notable closed-source) Artificial General Intelligence projects, as well as lists of various components and tools that can be used within existing, or in new AGI projects. These components cover everything from NLP and language generation to data clustering and machine-learning algorithms, large data stores, knowledgebases, reasoning engines, program-learning systems, and the like.
A good overview is given by Pei Wang's Artificial General Intelligence: A Gentle Introduction. See also the Wikipedia articles on Artificial Consciousness and Strong AI.
See also a large list of free/open-source "narrow AI" software, at the GNU/Linux AI & Alife HOWTO.
Suggested Education for Future AGI Researchers.
The most advanced open-source general cognition/reasoning system. Includes an NLP subsystem, reasoning, learning, 3D virtual avatar interfaces, and robotics interfaces. Open source, GPL license.
Critique: It's an experimental research platform. That is, it consists of a collection of parts that can be assembled, with some considerable difficulty, into working systems, which can then be used in practical applications, or to perform experiments.
OpenCog has many warts and serious architectural failings. However, it does more things, more correctly, than any other system that I know of. In fact, it is the only system that I know of that correctly unifies logic and (Bayesian) probability, anchoring it on a solid theoretical foundation of model theory (term algebras and relational algebras), category theory (pushouts, functors, colimits) and type theory.
Nominally associated with Artificial General Intelligence Research Institute, SIAI and Novamente.
Demo: AI Virtual Pet Answering Simple Questions
NARS, the Non-Axiomatic Reasoning System, aims to explain a large variety of cognitive phenomena with a unified theory: in particular, reasoning, learning, and planning. The site holds a number of white papers. NARS was the inspiration for OpenCog (which claims to overcome certain limitations of NARS). OpenNARS is Pei Wang's implementation. Released under GPLv2.
An intelligent agent, communicating by email. Built for the US Navy. Based on Baars's Global Workspace Theory. Answers only one question: "What do I do next?". See the Tutorial.
General framework for running cognitive experiments(?). Java source code available under an unspecified license.
Aims to couple common-sense knowledge-base systems to natural language text processing. Open source project.
Seems primarily aimed at robots.
Cognitive architecture research platform, aimed at simulating and understanding human cognition.
Polyscheme is a cognitive framework intended to achieve human-level artificial intelligence and to explain the power of human intelligence. Variety of research papers published, no source code available.
Commercialized "Hierarchical Temporal Memory".
SNePS is a knowledge representation, reasoning, and acting (KRRA) system. See also the Wikipedia page, and a paper by Shapiro, part of the SNePS group.
Primarily an implementation of Markov Logic Networks (MLN). MLN are remarkable because they unify, in a single conceptual framework, both statistical and logical (reasoning, first-order logic) approaches to AI. This seems to endow the theory with a particularly strong set of powers, and in particular, the ability to learn, without supervision, some of the harder NLP tasks, such as dependency grammars, automatic thesaurus/synonym-set learning, entity extraction, reasoning, textual entailment, etc.
Primarily an implementation of Markov Logic Networks, for Statistical Relational Learning, including dependency parsing, semantic role labelling, etc. Perhaps more NLP-focused than Alchemy.
YAGO is a huge semantic knowledge base, consisting primarily of information about entities. Contains 2M entities, and 20M facts about them. The YAGO-NAGA project also includes SOFIE, a system for automatically extending an ontology via NLP and reasoning.
FreeHAL is a ... ?? chatbot and stuff ... ?? TODO -- figure this one out. Hard to tell if this is "real" or a hack.
Nutcracker performs textual entailment using a first-order-logic (FOL) theorem prover and an FOL model builder. Built on top of Boxer, which takes the output of a combinatory categorial grammar (CCG) parser and converts it to first-order logic, based on Hans Kamp's Discourse Representation Theory.
Written in Prolog. Non-free license; bars commercial use.
Question-answering system. Probably works well, but my biggest criticism is that it's hand-crafted, rather than trying to actually learn anything. Viz., no attempt to learn grammars, no attempt to learn how to normalize a question. GPL license.
The MultiNet paradigm (Knowledge Representation with Multilayered Extended Semantic Networks) by Hermann Helbig. Wires up NLP processing to a hard-wired upper ontology, and adds reasoning. No source code available.
Developed by Vulcan Inc. in association with SRI International, Cyc Corp. and the UTexas/Austin CS/AI labs; aims to provide reasoning and question-answering over large data sets. All knowledge entry is done manually, by experts. Some research results are available publicly.
Developed by Hakia Labs, proprietary, commercial software for taking NLP input and generating ontological frames/expressions from it. See also ontologicalsemantics.com.
Powers Hakia search.
Below is a list of reasoning and/or inference engines only, without accompanying ontologies/datasets.
What am I (personally) looking for? I am looking for a system that represents a logical expression as a (hyper-)graph. Why? First, because the natural setting for logic is model theory, and the natural setting for algebraic structure is a term algebra; the universal object here is the free term algebra. The natural way to express an equivalence relation, production rule, or re-write rule for a term algebra is as a hypergraph. Thus, if one wants to apply machine-learning technology to learning new equivalence relations or reduction rules, one must be able to represent one's systems as hypergraphs. Unfortunately, very few have made this leap or connection. The only system that I know of that represents both logical relations and re-write rules as hypergraphs is OpenCog.
Implements a probabilistic analog of first-order logic. Ideal for uncertain inference. Beta available now. In the process of being ported to OpenCog. First-order logic statements are expressed in terms of hypergraphs. The nodes and edges of the hypergraphs can hold various different "truth value" structures. A set of basic types define how truth values are to be combined, resulting in the primitives needed for uncertain reasoning. These are described in Ben Goertzel's book of the same name. A specific claim is that the rules are explicitly founded on probability theory.
Truth values are probability distributions, usually represented as compound objects: e.g. having not only a probability, but also upper and lower bounds on the uncertainty of the probability estimate.
The actual implementation works primarily by applying typed pattern matching to hypergraphs, to implement a backward-chainer. That is, PLN defines a typed first-order logic; it does not (yet?) define a typed functional programming language (although it comes close to doing so). Inference control is through various algorithms, including "economic attention allocation" and Hebbian activation nets.
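To make the flavor of PLN truth-value combination concrete, here is a minimal sketch (in Python, with function and variable names of my own choosing) of the independence-assumption form of the PLN deduction strength formula from Goertzel's book: it estimates the strength of A->C given the strengths of A->B and B->C plus the term probabilities of B and C.

```python
def pln_deduction_strength(s_ab, s_bc, s_b, s_c):
    """Simplified, independence-assumption form of the PLN deduction
    strength formula: estimate P(C|A) from P(B|A), P(C|B) and the
    term probabilities P(B), P(C)."""
    if s_b >= 1.0:
        return s_c  # degenerate case: B is universally true
    return s_ab * s_bc + (1.0 - s_ab) * (s_c - s_b * s_bc) / (1.0 - s_b)

# If A->B and B->C are both strong, A->C comes out strong as well:
print(pln_deduction_strength(0.9, 0.9, 0.5, 0.5))
```

In the full system, the strength would be carried inside a compound truth-value object along with confidence bounds; this sketch shows only the strength arithmetic.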
GNU Affero GPLv3 license.
Similar to PLN in various ways, but uses a different set of formulas for inference. Truth values are represented with a pair of real numbers: strength and confidence.
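For a concrete taste of the strength/confidence representation, here is a small sketch of the NAL deduction truth function, as I understand it from the published NARS papers; the function name and the sample values are mine.

```python
def nal_deduction(f1, c1, f2, c2):
    """NAL-style deduction truth function: combine <A --> B> with
    truth (f1, c1) and <B --> C> with truth (f2, c2) into a truth
    value for <A --> C>.  f = frequency (strength), c = confidence."""
    f = f1 * f2
    c = f1 * f2 * c1 * c2   # confidence decays with each inference step
    return f, c

# Two fairly confident premises yield a less confident conclusion:
print(nal_deduction(0.9, 0.9, 0.8, 0.9))
```

Note how confidence shrinks multiplicatively: long inference chains rapidly lose confidence, which is one of the mechanisms NARS uses to control inference under insufficient resources.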
Open source, written in Lisp.
An extension of Markov networks to first-order logic. Ungrounded first-order logic expressions are hooked together into a graph. Each expression may have a variety of different groundings. The "most likely grounding" is obtained by applying maximum-entropy principles, aka Boltzmann statistics, computed from a partition function that describes the network. One important stumbling block is that computing the partition function can be intractable. Thus, sometimes a data representation is used such that certain probabilities are solvable in closed form, and the hard (combinatorial) problems are pushed off to clustering algorithms. (See e.g. Hoifung Poon.)
MLNs stick to a very simple "truth value" -- a real number, ranging from 0.0 to 1.0 -- indicating the probability of an expression being true. Normally, no attempt is made to bound the uncertainty of this truth value, except possibly by analogy to physics (e.g. second-order derivatives expressing permeability, susceptibility, etc., or strong order when far from the Curie temperature). That is, maximum-entropy principles are used to maximize the number of "true" formulas that fit the maximum amount of the (contradictory) input data. However, it is unclear how confident one should be of a given deduction.
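The weighted-formula distribution behind all of this can be shown by brute force on a toy network: the probability of a world is proportional to exp(sum of weights of satisfied ground formulas), normalized by the partition function Z. The predicate names and the weight below are purely illustrative, and real MLN implementations never enumerate worlds like this -- that enumeration is exactly the intractability mentioned above.

```python
import itertools, math

# Toy MLN: two ground atoms, Smokes(A) and Cancer(A), and one weighted
# formula "Smokes(A) => Cancer(A)" with weight w.  (Illustrative only.)
w = 1.5
atoms = ['smokes', 'cancer']

def n_true_formulas(world):
    # count of satisfied ground formulas: here just the one implication
    return 1 if (not world['smokes']) or world['cancer'] else 0

worlds = [dict(zip(atoms, vals))
          for vals in itertools.product([False, True], repeat=2)]
Z = sum(math.exp(w * n_true_formulas(wd)) for wd in worlds)   # partition fn
for wd in worlds:
    p = math.exp(w * n_true_formulas(wd)) / Z
    print(wd, round(p, 3))
```

The one world that violates the implication (smokes true, cancer false) gets the smallest probability; raising w pushes its probability toward zero, recovering hard first-order logic in the limit.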
Several implementations, including "Alchemy" listed below.
Similar to MLN, but avoids making certain assumptions about Bayesian priors. Rarely applied to logic/reasoning directly. Uses a single real number to represent the probability.
Similar to MLN, but abandons maximum entropy for clustering/classification based on mutual information. That is, datasets are searched empirically for small patterns that have a high value of mutual information. These are then clustered together as appropriate, and then the search is repeated on patterns built from the clusters.
What am I looking for? I am looking for a system that represents a probabilistic program operation with a very simple syntax, so that a machine-learning system can learn new probabilistic programs. The only such system that I know of, that is capable of doing this, is the OpenCog system. Although, at the current time, OpenCog rather sucks for programming.
Prolog engine, open source. Supports tabling/memoing and well-founded negation. This is one of the fastest inference engines out there, per the 2009 Madrid Semantic Web OpenRuleBench results. Personally, I suspect that this is because of a strong grounding in inference and language design theory on the part of the developers.
Prolog engine. For performance, adds "demand-driven indexing". Also one of the fastest inference engines out there, per the 2009 Madrid Semantic Web OpenRuleBench results; again, I suspect this is due to the developers' strong grounding in inference and language design theory.
Inference engine, bottom-up. Implements the datalog query system. Has "Magic Set" optimization. Implemented in Java. Immature? LGPL license.
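As a reminder of what bottom-up datalog evaluation amounts to, here is a naive fixpoint computation of transitive closure in Python; real engines use semi-naive evaluation and Magic Set rewriting rather than this brute-force loop, but the principle -- derive new facts from rules until nothing changes -- is the same.

```python
# Bottom-up (naive fixpoint) evaluation of a tiny datalog program:
#   path(X,Y) :- edge(X,Y).
#   path(X,Z) :- path(X,Y), edge(Y,Z).
edge = {('a', 'b'), ('b', 'c'), ('c', 'd')}

path = set(edge)            # seed with the first rule
changed = True
while changed:              # apply the second rule to a fixpoint
    changed = False
    for (x, y) in list(path):
        for (y2, z) in edge:
            if y == y2 and (x, z) not in path:
                path.add((x, z))
                changed = True

print(sorted(path))
```

Semi-naive evaluation would only join the *newly derived* tuples against edge on each pass, avoiding rediscovering the same facts over and over.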
PowerLoom uses a fully expressive, logic-based representation language (a variant of KIF). It uses a natural deduction inference engine that combines forward and backward chaining to derive what logically follows from the facts and rules asserted in the knowledge base. Has interfaces to common-lisp, C++ and Java. GPL license.
Among the first expert-system/rule engines ever. Originally from NASA, now public domain. C language. Designed for embedding expert systems into devices, etc. See also the Wikipedia page. Extensive number of features.
Inference engine, specifically tailored to work well with Python. Features:
Primarily an inference engine coupled to an ontology. GPL license.
Drools is a business rule management system (BRMS) and an enhanced rules-engine implementation, ReteOO, based on Charles Forgy's Rete algorithm, tailored for the Java language. Despite using Rete, this is possibly the slowest inference engine out there, as well as the least stable (per the 2009 Madrid Semantic Web OpenRuleBench results).
Function symbols. Meant for event processing, not data processing ...
Use Boolean SAT solvers for traditional propositional logic; use SMT solvers when arithmetic expressions are involved.
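The propositional case can be illustrated in a few lines; the sketch below brute-forces a tiny CNF formula in pure Python, whereas a real SAT solver would use DPLL/CDCL search, and an SMT solver would layer arithmetic theories on top of the same core.

```python
import itertools

# Brute-force satisfiability check over a CNF formula.
# A clause is a list of literals; a positive int names a variable,
# a negative int is its negation.
cnf = [[1, 2], [-1, 3], [-2, -3]]   # (x1 v x2) & (~x1 v x3) & (~x2 v ~x3)
nvars = 3

def satisfiable(cnf, nvars):
    """Return a satisfying assignment {var: bool}, or None if UNSAT."""
    for bits in itertools.product([False, True], repeat=nvars):
        assign = {i + 1: b for i, b in enumerate(bits)}
        if all(any(assign[abs(l)] == (l > 0) for l in c) for c in cnf):
            return assign
    return None

print(satisfiable(cnf, nvars))
```

Enumeration is exponential in the number of variables, of course; the entire point of modern SAT solvers is to prune that search space aggressively.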
Java, on sourceforge. Recommended for small-to-medium systems. A frame-slot type system.
Theorem prover. Usually used for formal verification. BSD license.
With integrated theorem prover. CMU Lisp. GPL license.
Some of the logic and reasoning systems above make explicit use of a graph re-writing system. Most do not. The RelEx language system explicitly makes use of one to perform dependency parsing.
What am I looking for? I want a graph rewriting system that expresses the re-write rules themselves as graphs. The rules should also be expressible as strings, and should have a very simple syntax, so that a machine-learning system could learn new rules. Ideally, the graphs would actually be hypergraphs, as it is difficult and cumbersome to implement certain constructs with ordinary graphs. In particular, it is difficult to implement certain dependency relations in natural languages with ordinary graphs. It is also difficult to specify functors as ordinary graphs (since the arguments to a functor are typed, and the type itself is usually a graph. Thus, one needs to allow the nodes of a graph to be graphs themselves, i.e. to be hypergraphs.) Put another way: in model theory, the universal algebra is the free term algebra: given a fixed signature, terms may be freely composed in any way; there are no reductions or relations. A free term algebra is most easily represented as a directed tree graph. Any equivalence relation or re-write rule is then a hypergraph! (In fact, re-write rules that replace functors by other functors are functors themselves; this leads to the concept of a 2-category) Currently, there is only one such system that I know of: it is the pattern matcher in OpenCog.
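To illustrate the "rules as data" point in miniature (with term trees rather than true hypergraphs), here is a sketch in which a rewrite rule is itself just a pair of terms, and so could in principle be produced by a learning system. All names and the tuple encoding are my own choices, not any particular system's.

```python
# Minimal term-rewriting sketch: terms are nested tuples, variables are
# strings starting with '?'.  A rewrite rule is just a (lhs, rhs) pair
# of terms -- i.e. the rules are ordinary data.
def match(pattern, term, binds):
    """Return extended bindings if pattern matches term, else None."""
    if isinstance(pattern, str) and pattern.startswith('?'):
        if pattern in binds:
            return binds if binds[pattern] == term else None
        return {**binds, pattern: term}
    if isinstance(pattern, tuple) and isinstance(term, tuple) \
            and len(pattern) == len(term):
        for p, t in zip(pattern, term):
            binds = match(p, t, binds)
            if binds is None:
                return None
        return binds
    return binds if pattern == term else None

def subst(template, binds):
    """Instantiate a template term using the given variable bindings."""
    if isinstance(template, str) and template.startswith('?'):
        return binds[template]
    if isinstance(template, tuple):
        return tuple(subst(t, binds) for t in template)
    return template

def rewrite(rule, term):
    """Apply rule at the root of term; return term unchanged on no match."""
    lhs, rhs = rule
    binds = match(lhs, term, {})
    return subst(rhs, binds) if binds is not None else term

# commutativity of '+' expressed as a rule:
comm = (('+', '?x', '?y'), ('+', '?y', '?x'))
print(rewrite(comm, ('+', 'a', ('*', 'b', 'c'))))
```

Generalizing from trees to hypergraphs -- so that rule arguments can themselves be graphs, as the discussion above demands -- is exactly the step that most rewriting systems do not take.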
A list of graph rewriting systems can be found in the Wikipedia Graph rewriting page. These include:
Works on graphs with labelled edges and nodes (i.e. category-theoretic). Written in C++. From CNRS. License: free of charge, but proprietary.
Java. The graph transformation rules themselves must be written in Java. Category-theoretic approach, single pushout. Meant to be embeddable in other projects. Source available; license unclear.
Written in Java. Not obviously extensible, scalable, or usable as a component within another system. (??)
Fast! Small executables! ML/OCaml-like type system. Supports several programming styles: functional programming (both lazy and eager evaluation), imperative programming (including safety via theorem proving), and concurrent programming (multi-core GC). Weakness: very, very new; the current version is 0.1.6.
Purely functional programming, good concurrency support, good FFI, good compiler. Lazy evaluation. Weakness: difficult for the programmer to predict time/space performance.
Concurrent, functional, fault-tolerant programming. See also the Wikipedia article.
Fast! Unifies functional, imperative, and object-oriented programming styles. Provides a strong type and type-inference system derived from ML. Weaknesses: no multi-core/concurrency support; the type system can be subtle; poor FFI and module system. See also the Wikipedia page.
Object-oriented, functional programming. Focus on scalability. Targets the JVM. Good Java integration. Weakness: no tail recursion in the JVM!! which means mutually recursive procedures are icky/slow.
Modern Lisp dialect, targeted at the JVM. Good Java integration. Weakness: no tail recursion in the JVM!!
A wiki containing an extensive listing of software and other things is at ACLWeb, and in particular, at the Tools and Software page. A small list is at the NLP Resources wiki page at agiri.org. A general overview of the state of the art is at AAAI Natural Language page.
A particularly important theory is Dick Hudson's Word Grammar.
Other NLP resources include:
See also http://www.singinst.org/research/researchareas
Grammatical Framework is actually a programming language for writing grammars. It's built on the categorial grammar formalism. As a programming language, it's a functional language with type support. Code is GPL; libraries are LGPL and BSD.
CRF++ is an implementation of Conditional Random Fields. Has pre-existing modules for text chunking, named entity recognition, information extraction. Open source, written in C++.
Includes a shallow parser, a sentence splitter, entity detection, sense annotation (using wordnet senses), etc. Strong Spanish/Latin language support.
OpenNLP is more of a directory of other NLP projects. Includes some good maximum-entropy implementations.
Has a book and multiple articles. Integrates with WordNet. Written in Python. Not clear whether it has an actual parser. Seems to do some sort of entity extraction, esp. for biomedical terms.
The IMS Open Corpus Workbench (CWB) is a collection of tools for managing and querying large text corpora (100 M words and more) with linguistic annotations. Its central component is the flexible and efficient query processor CQP.
Java, GPL'ed. Big. Also in use for Dialogue processing and Natural Language Generation.
The only unsupervised grammar inducers that I know of are:
Unsupervised grammar induction refers to the task of learning a grammar with the input being only sentences in natural language. Upside: this is interesting because it learns without any supervision. Downside: low accuracy, and a weak dependency grammar (DMV). Accuracy has been measured to be about 50% on 6 different languages, which is currently state-of-the-art. The DMV grammar is a bit lacking: only valence-2 links are allowed (thus, no indirect objects or adjectival modifiers), and the number of word classes (parts of speech) is far too small. Alas. Promising start, though. GPLv3 license.
From Carnegie Mellon. A parser for the English, Russian, Arabic, Persian and German languages, based on "link grammar", a novel theory of natural language syntax. Written in C, with a BSD license. The English dictionary includes 90K words. Actively maintained. The most accurate parser out there that I know of, free or commercial (accuracy is in the 97-99% range). Fast, too.
Built on top of the Carnegie Mellon link parser. Extracts dependency relations from link data. Creates FrameNet-like semantic frames from the dependency graphs. Includes the ability to handle multi-sentence corpora, perform entity detection, and perform anaphora (pronoun) resolution via the Hobbs algorithm. Apache v2 license. Written in Java. Actively developed/maintained.
Now includes not one, but two! natural language generation facilities: NLGen/SegSim and NLGen2.
Rule-driven dependency parser. English, Spanish, Galician, French, and Portuguese. Parser-compiler in Ruby; parser is in Perl. GPL license.
Dependency parser, generating output similar to RelEx. Statistical parser. Trains on treebank data; has been applied to half a dozen different languages. Slow: RelEx+linkgrammar is 3x to 4x faster. Java, GPLv2 license.
Trainable, fast, accurate dependency parser. Has four different training methods. Uses a fast shift-reduce algorithm for single-pass parsing. Reads CoNLL. C++. unclear license? Unclear accuracy?
Maltparser is a system for data-driven dependency parsing, which will learn a parsing model from treebank data, and can then be used to parse new data using the induced model. Java, BSD license. old URL.
Trainable, fast dependency parser. Uses minimum spanning tree methods. Reads CoNLL. Doesn't seem to be very active. Java, CPL license, Apache V2.0 license. download
Incremental Sigmoid Belief Network Dependency Parser. Trainable. GPL license. Unmaintained, last release was in 2008.
Dependency output. Linguist-written rules. GPL license.
Idea from Luc Steels. There is a LISP implementation at http://www.emergent-languages.org/ and a Java implementation at TexAI.
There is a large list of NL generators located at the ACLWeb Natural Language Generation Portal.
NER is commonly done in one of several ways:
A powerful system for extracting entities and entity relations from free text. See the YAGO-NAGA listing above.
Java, GPL'ed. Big. GATE is supplied with an Information Extraction system called ANNIE, which seems to be focused on "entity extraction".
From the website: "Meta-optimizing semantic evolutionary search (MOSES) is a new approach to program evolution, based on representation-building and probabilistic modeling. MOSES has been successfully applied to solve hard problems in domains such as computational biology, sentiment evaluation, and agent control. Results tend to be more accurate, and require less objective function evaluations, in comparison to other program evolution systems. Best of all, the result of running MOSES is not a large nested structure or numerical vector, but a compact and comprehensible program written in a simple Lisp-like mini-language." For details, see Moshe Looks' PhD thesis.
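The sketch below is emphatically not MOSES -- it has no representation-building and no probabilistic modeling -- but it illustrates the bare idea of program evolution that MOSES refines: search over a space of small expressions, scored against data, mutated toward a better fit. All names and parameters here are my own illustrative choices.

```python
import random

# Bare-bones program evolution: hill-climb over tiny linear expressions
# (a, b) ~ a*x + b, scoring by negative squared error against data
# generated from a hidden target program.
random.seed(1)
target = lambda x: 3 * x + 1
data = [(x, target(x)) for x in range(-5, 6)]

def make_expr():
    return (random.randint(-5, 5), random.randint(-5, 5))

def score(expr):
    a, b = expr
    return -sum((a * x + b - y) ** 2 for x, y in data)

def mutate(expr):
    a, b = expr
    if random.random() < 0.5:
        a += random.choice([-1, 1])
    else:
        b += random.choice([-1, 1])
    return (a, b)

best = make_expr()
for _ in range(500):
    cand = mutate(best)
    if score(cand) >= score(best):   # accept non-worsening mutations
        best = cand
print(best)
```

MOSES replaces this blind mutation with "demes" of normalized program representations and probabilistic model-building over promising mutations, which is what lets it scale to real program spaces.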
Performs clustering using genetic programming techniques. (i.e. attempts to find small algorithmic expressions that will cluster the data). Omniclust is an n-ary agglomerative search algorithm. For details, see, Clustering gene expression data via mining ensembles of classification rules evolved using moses. Looks M, Goertzel B, de Souza Coelho L, Mudado M, Pennachin C. Genetic and Evolutionary Computation Conference. (GECCO 2007): 407-414. Java codebase.
Java. Has been used to build a POS tagger, end of sentence detector, tokenizer, name finder. LGPL/Apache license.
Portable toolkit for building and manipulating hidden Markov models. C source code; non-free license prohibits redistribution.
A particularly interesting subset concerns Compositional data, which is data located on a simplex and/or a projective space.
Caution: All of the systems listed below fail horribly when applied to real-world data sets of any reasonable size -- e.g. datasets with 100K entries. This is typically because they try to compute similarity measures between all 100K x 100K = 10 billion pairs of elements, which is intractable on contemporary single-CPU systems. You can win big by avoiding these systems, and exploiting any sort of pre-existing organization in your data set. Only after breaking your problem down to itty-bitty-sized chunks should you consider any of the below.
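The "exploit pre-existing organization" advice amounts to blocking: bucket items by some cheap key first, then compute expensive pairwise similarities only within each bucket. The key function below is an illustrative stand-in for whatever coarse structure your data already has.

```python
import itertools

# Blocking sketch: compare pairs only within buckets, not all O(n^2) pairs.
items = ['apple', 'apply', 'angle', 'bravo', 'brave', 'crane']

def cheap_key(s):
    return s[0]          # stand-in for any cheap pre-clustering key

buckets = {}
for it in items:
    buckets.setdefault(cheap_key(it), []).append(it)

all_pairs = len(list(itertools.combinations(items, 2)))
bucket_pairs = sum(len(list(itertools.combinations(b, 2)))
                   for b in buckets.values())
print(all_pairs, bucket_pairs)   # expensive-comparison count shrinks
```

On six items the saving is modest (15 pairs down to 4), but the effect is quadratic: with 100K items split into 1K even buckets, roughly 10 billion pairs become about 5 million.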
From their website: "The VLFeat open source library implements popular computer vision algorithms including SIFT, MSER, k-means, hierarchical k-means, agglomerative information bottleneck, and quick shift. It is written in C for efficiency and compatibility, with interfaces in MATLAB for ease of use, and detailed documentation throughout. It supports Windows, Mac OS X, and Linux."
Appears to be aimed at image processing. GPL license.
Assumes data is located on a simplex, and uses that fact in its algorithms. Includes an algorithm for PCA analysis, another using a partition clustering algorithm, and an agglomerative hierarchical clustering using the Aitchison distance. Command-line interface. Written in C. (No library interfaces currently defined.) Focused on genetic/bio data. GPL license.
Mfuzz clustering. Aimed at genetic expression time-series data, claimed to be robust against noise. Uses R language. GPLv2 license.
R-based data mining. GPL.
Data mining, clustering. Java. GPL. From personal experience -- fails totally on any but the very smallest data sets. Dying/dead mailing list.
Fast, decision-tree-based implementation of k-nearest-neighbor classification. Implements a half-dozen algorithms. GPL'ed. (Might not scale well for large problems?) Used in the MaltParser NLP parser, and thus has been applied to NLP tasks.
Library that implements Support Vector Machines, one of many ways of building a linear classifier.
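For contrast, the simplest possible linear classifier is the perceptron; the sketch below (with made-up toy data) finds *some* separating hyperplane by updating on mistakes, whereas an SVM finds the maximum-margin one.

```python
# Perceptron: learn weights w and bias b such that
# sign(w . x + b) matches the labels.  Toy 2-D data, labels +/-1.
data = [((2.0, 1.0), 1), ((1.5, 2.0), 1),
        ((-1.0, -1.5), -1), ((-2.0, -0.5), -1)]
w = [0.0, 0.0]
b = 0.0
for _ in range(20):                     # a few passes over the data
    for (x1, x2), label in data:
        if label * (w[0] * x1 + w[1] * x2 + b) <= 0:   # misclassified
            w[0] += label * x1          # nudge the hyperplane toward
            w[1] += label * x2          # the misclassified point
            b += label
print(w, b)
```

The perceptron is guaranteed to converge on linearly separable data, but the separator it picks depends on presentation order; the SVM's max-margin criterion removes that arbitrariness, which is why it generalizes better.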
Per website: "STXXL implements containers and algorithms that can process huge volumes of data that only fit on disks."
Clustering, runs in memory, thus much faster than Hadoop. Scala interfaces.
Implementation of MapReduce ideas in C++.
Implementation of MapReduce ideas in Java. Part of the Apache project. Notable things built on Hadoop: Hive, an analysis and query system; HBase, a BigTable-like non-relational database.
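The MapReduce idea itself fits in a few lines: the classic word count, with explicit map, shuffle and reduce phases. Hadoop's contribution is distributing these phases across machines and surviving failures, not the logic below.

```python
from collections import defaultdict

# MapReduce in miniature: word count over a tiny "corpus".
docs = ['the quick brown fox', 'the lazy dog', 'the quick dog']

def mapper(doc):
    for word in doc.split():
        yield (word, 1)                  # emit (key, value) pairs

# shuffle phase: group all emitted values by key
groups = defaultdict(list)
for doc in docs:
    for key, val in mapper(doc):
        groups[key].append(val)

def reducer(key, vals):
    return (key, sum(vals))              # combine all values for one key

counts = dict(reducer(k, v) for k, v in groups.items())
print(counts)
```

Because the mapper is stateless per-record and the reducer only sees one key's values, both phases parallelize trivially -- that is the whole trick.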
Database for storing hypergraphs. Pretty cool. Java-based. Strange BSD-like license, but it requires source code! Compatibility of the license with the GPL is unclear.
The Shard overview describes an alternative to centralized, normalized databases.
I've moved the list of ontologies to near the bottom of this page, because I have come to believe that they are useless unless they have been learned natively, by some specific learning system. Thus, for example, an AGI system would use an ontology not by loading one of the below, but by learning one: by reading books, or reading Wikipedia.
A giant list can be found at Peter Clark's Some Ongoing KBS/Ontology Projects and Groups. Problems with ontologies are reviewed in Ontology Development Pitfalls.
Big ones include
Common-sense knowledgebase. Large. GPL license. Users can edit data online, at http://torg.media.mit.edu:3000/
Collection of English-language sentences, rather than a strict upper ontology. This is actually quite convenient if you have a good NLP input system, as it helps avoid the strictures of pre-designed ontologies, and rather gets you to deal with the structure of your NLP-to-KR layer. From MIT. Large: 700K sentences.
YAGO, the huge semantic knowledge base of entities and facts: see the YAGO-NAGA listing above.
See also: Wordnet::Similarity, a Perl module implementing various word-similarity measures from WordNet data; i.e., thesaurus-like.
Licensing is unclear.
SUMO WP article. Includes an open source Sigma knowledge engineering environment, includes a theorem prover. Sigma uses KIF.
"The largest formal public ontology in existence", available under GPL (although OpenCyc is arguably bigger, and is free). Has mappings to WordNet.
Large KB under the Artistic license. Source for the engine is not available. The KB seems messy and capricious, and the upper ontology is not clear. See, however, the remarks above.
Common sense KB, available in CycL. GPL'ed
A knowledge representation system. Conceptual Graph Interchange Format is an ISO standard. See also "Common Logic Interchange Format (CLIF)", which is more lisp-like.
Seems well-engineered. Actual KB is slim. Source not available. Might be a dead project??
Provides a firm theoretical foundation for representing ontologies; no actual data. OWL version of GFO under a modified BSD license. Examples include the periodic table of elements, amino acids. See also WP article.