Back to the trees: Identifying plants with Human Intelligence

We investigate a way to build a convivial plant identification tool halfway between the complex determination keys of botanists and the more recent but poorly explainable approaches based on AI image recognition. Our approach consists of a formal language to organize morphological traits and a Bayesian technique to describe plants with possible polymorphisms at all taxonomic levels, and to handle errors and uncertainties. From these structured data, automatic approaches can be designed to generate versatile determination keys, i.e. decision trees, which are otherwise tedious to design by hand.


INTRODUCTION
Knowledge about our vegetal neighbors has always been a key skill for human survival, as plants provide food and medicine, among other things. The activity of describing and classifying plants is thus ancient, and a science has slowly developed from it over the centuries. Taxonomic classifications were formalized by Linnaeus, with the most widely used levels being species, genera, and families. These classifications have been constantly modernized, recently following phylogenetic criteria and using genetic markers instead of morphological resemblance. Yet the ways to identify plants have remained relatively constant, and continue to be used by scientists, farmers, gardeners, pharmacists and amateurs alike.
Identification can be achieved by determination keys [de Lamarck 1779], which are decision trees, not necessarily following the taxonomy, whose leaves are species (or varieties, or families, or genera) and whose nodes correspond to morphological observations. Determination keys remain very popular today among botanists (Flora Gallica [Société botanique de France 2014] and Flora Europaea [Tutin et al. 2001], for instance) as precise tools to identify an unknown plant, despite some limits: (1) they require expertise in plant morphology; (2) they presuppose the ability to answer questions concerning all the organs (in particular the flower), even if these organs are not observable on the plant at hand (because of seasonality, for example); and (3) they leave little room for errors or uncertainties, both in the descriptions and in the observations. The popularization of computers has yielded attempts to digitize determination keys, design new ones, invent ways to identify a plant other than decision trees, such as a flat "robot portrait", or design algorithms to automatically construct a key from data [Kerner and Lebbe 2019].
Surprisingly, these attempts have yielded little formalisation of either the existing descriptions for use in computers or the properties of the determination keys. Descriptions were mainly encoded as matrices with species as rows and attributes as columns, in which an entry is the value of one attribute for one species. Uncertainties, dependencies between attributes, and polymorphism were hardly addressed by this solution.
Research in this direction has fallen out of fashion because of several concomitant scientific and cultural revolutions. On the one hand, the rise of genomics provided a more immediately formalizable data source for scientific studies than morphological descriptions. On the other hand, the activity of identifying plants has been extended to a wide public thanks to the combination of progress on image recognition through deep learning and the cultural shift that made billions of people own an internet-connected smartphone. Applications identifying plants based on a single picture (iNaturalist [Horn et al. 2018], Pl@ntNet [Affouard et al. 2019], Google Lens) became very popular among amateurs, as they give an almost instant answer.
We would nonetheless like to argue that such systems, although we recognize their efficiency and usefulness, do not definitively close the identification question. Firstly, precise identification often requires a more thorough observation than what can be seen on an amateur picture. Secondly, the answer of such applications lacks explainability: it does not fully replace determination keys that connect observable features with the identification. AI systems are oracles that have to be trusted almost blindly, and they cannot provide human-understandable feedback on the identification. Furthermore, they require substantial infrastructure: servers to train models on, a smartphone with a camera, and a dense mobile coverage of the earth's surface allowing internet connections.
Identification tools can be investigated and compared according to their conviviality, with a reference to Illich [Illich 1973], as well as to the connected notion of self-obviating systems [Tomlinson et al. 2015]. Conviviality means, in our sense, inspired by Illich, a tendency for a tool to increase the autonomy of users, and to avoid creating either dependencies on the tool or unwanted dependencies on other social groups or structures. Self-obviating tools are good examples of convivial tools: they are designed so that their usage makes them less and less necessary. As remarked by Tomlinson et al. [2015], an ideal plant identification tool would be self-obviating because each usage teaches the user an association of an observation with a name. Identification keys reinforce this quality: the user associates the distinctive morphological traits of the plant with the name, and even if very few people would be able to identify all species in a country without a key, its use becomes easier through time as one knows more species, and the distinctive traits at higher taxonomic levels. The explainability of a tool thus plays an important role in this property: an association is better learned if it is given a meaning that users can understand and remember.
We here aim to use computers in plant identification in a way that keeps this self-obviating aspect of identification keys and at the same time renders them more convivial. In this paper we revisit botanical determination keys in the light of both modern computer science and conviviality. We take advantage of formal language descriptions, Bayesian statistics, and the generation of decision trees [Quinlan 1986] to show that computer science has the potential to improve determination keys without losing their interesting characteristics.
Our proposition is based on (1) a formalisation of the morphological description of species, including variabilities and uncertainties (Section 3), and (2) an algorithm generating determination keys from formalized data (Section 4). In Section 5, we present our ongoing attempts at gathering data.

FORMALISING PLANT MORPHOLOGY
Descriptions of families, genera and species can be found in floras, books compiling a list of existing species in a particular region (e.g. province, country, mountain range, continent). These books are written in natural language which is more or less codified, and thus the information they contain can be difficult to use in an automatic way. We first summarise some difficulties in describing plants in natural language, and then explore some difficulties in translating them into a formal language.

Describing plants
A description is a list of properties that should be observable without too much technological involvement (e.g. the leaf is directly attached to the stem). Properties have to be inclusive enough to capture the variation within one species: they are based on observations of several individuals of the species, in different locations, at different times of the year, and at different stages of growth. They also have to be exclusive enough to differentiate the species from others: they focus on discriminating aspects of the plant. For instance, few floras mention that leaves are green, as this is implied by default and is not a very helpful property to identify a plant.
Within a flora, descriptions tend to show some uniformity: reusing the same terminology and the same properties across different species exhibiting similar morphology, whenever possible.
For example, the description of Potamogeton gramineus, of which a photo is shown in Figure 1, in the Flora Europaea [Tutin et al. 2001] is:
• "Floating leaves (if present) up to 7 × 3 cm, elliptical or ovate-elliptical, cuneate or rounded at the base, opaque; petiole often longer than the lamina. Submerged leaves (at least the lower) up to 8 × 3 cm, sessile, narrowly elliptic-oblong to narrowly elliptical or oblanceolate-oblong, cuneate at the base, acute or acuminate, minutely denticulate at least when young, with regularly ascending secondary veins (occasional leaves more or less reduced to phyllodes). Stipules conspicuous, herbaceous. Peduncles thickened upwards."

Challenges in formalising descriptions
An immediate idea to transform floras into a formal language is to transform a natural language property, like the one extracted from the description above, "Submerged leaves (at least the lower) up to 8 × 3 cm, sessile", into an attribution of values to variables (the attributes): attribute "attachment of the leaf" has value "sessile" (attached to the stem), attribute "length" has value "up to 8", "width" has value "up to 3", and "position" has value "submerged".
A first challenge is familiarity with the botanical vocabulary (terms such as sessile or stipule are not necessarily common knowledge). For this we use a set of drawings showing, at each occurrence, the associated organ or character. In this article we illustrate this way of catering for neophytes by showing two botanical panels and referring to them when using specialised vocabulary. The first one, a strawberry, illustrates standard terms: flowers, petals, sepals, stamen, pistil, fruits.
[Figure: botanical panel of a strawberry, from Wikimedia Commons. The species has compound leaves (each leaf consisting of three leaflets). 1, 2. Flower bud at two developmental stages. 3. Flower, where petals (white), sepals (green), stamens (yellow) and pistils are visible. 4. Stamen. 5. Pistil. 6. The strawberry "fruit" consists of the expanded flower receptacle, the actual fruits (in a botanical sense) being the seeds attached to the surface of this receptacle.]
The second one illustrates a specialised inflorescence composed of flowers that lack perianth (petals or sepals).
Another challenge, obvious from the example description above, is that this transformation is not immediate. For example, there are several kinds of leaves, some submerged, some floating. This polymorphism (several forms for a single subject) is also frequent for non-aquatic herbs, which can have different basal leaves (attached to the part of the plant at ground level) and cauline leaves (attached to the aerial part of the stem). The range of quantitative traits is imprecise, and a phrase like "more or less reduced to phyllodes" is hard to formalize.
Polymorphism of leaves on a single plant, as in the description of Potamogeton gramineus, can be generalized to polymorphism at any level: a petal can have several colors, several petals of a single flower can have different colors, and petals from different flowers of the same species can have different colors. Attributing several values is thus ambiguous. Indeed, variation is present at every level: within a plant, within populations or species, and among species. Although we often present it in a discrete way (classification), it is mostly the manifestation of some quantitative variation. There are for instance many ways in which flowers can be organized into inflorescences, and while some are readily identified (e.g. the "flower head" of Asteraceae, the umbel of Apiaceae), there are many intermediate cases. It is like classifying colors: we have a few words (red, orange, yellow, etc.) to describe something that appears in nature as a continuum.
[Figure 3: botanical panel of a Euphorbia. 3. What appears to be a flower is actually a compact inflorescence, called "cyathium", consisting of flowers that are highly reduced to one pistil or one stamen (5 is a cross-section). 4. Dissected involucre, consisting of fused bracts surrounding the flowers in the cyathium. 6. Male flowers, each consisting of one stamen. 7, 8. Female flower consisting of one pistil (7 is a horizontal cross-section of the ovary). 9. Seed.]
This variation also sometimes makes it hard to correctly identify the organ. That the white structures that surround the daisy "flower" are not petals but flowers themselves is relatively widely known, even outside botanical circles. Most Euphorbia species (a very large and widespread genus, see an exemplar in Figure 3) have highly reduced flowers, each consisting of one pistil or one stamen, organized in such a way that the inflorescence (called "cyathium") mimics a flower quite perfectly: it usually takes a trained botanist to know this. In some cases, determining the exact status of a structure or organ requires detailed comparative anatomical and genetic studies that have not yet been performed. For example, petals, sepals, stamens and pistils are floral structures all thought to be more or less modified leaves, yet some species have petaloid sterile anthers and lack "true" petals, so what one calls a petal is a matter of scientific consensus that can change when the understanding of flower development changes. Indeed, botany is a science in itself, with its own vocabulary, theories and open questions.
Last but not least, botanical knowledge is spread out over different works. They have been published at different times, in different countries and different languages, and thus use different sets of attributes and values. There have been some efforts towards standardization, but even a flora of a medium-sized country like France contains about 6000 species, and it is not easily updated to reflect the latest accepted terminology. Besides, for older works, the terminology is not always explicitly defined, making it hard to precisely compare the information contained in different floras.
In summary, floras are a way to shape botanical knowledge, which is a first formalization step. Transforming floras into chosen categories and values can only express knowledge at the cost of sometimes arbitrary and unpredictable choices at the limits of the categories, and at the risk of losing the part of the knowledge that is not expressed with those categories. The process of transforming knowledge into information always comes at this cost. We will try to partly tame this arbitrariness by using probabilities on top of the categories.

Existing formalised databases
2.3.1 Existing botanical databases. Attempts to formalise plant morphology are too numerous to be cited exhaustively; we only consider a few recent attempts that produced publicly available data. They are often focused on particular aspects of plant morphology to allow comparisons between species, populations or individuals, and thus include traits that can be defined for (almost) all species [e.g. Sauquet et al. 2017]. More specialized databases in ecology focus on functional traits that are designed to be applicable to a large number of species, such as the "Specific Leaf Area", which measures the ratio of leaf surface to dry mass. While plant species differ in this trait, it is neither distinctive enough to identify species nor easily observable without specific tools. The popular database TRY contains many such functional traits.

Existing morphology descriptions.
To go beyond relational data (attribute/value for each species), knowledge bases tend to be described as ontologies. Ontologies are graphs where nodes are concepts (in our case: plant, leaf, flower, ...) and edges are relations between concepts (in our case: plants may have leaves, meaning a relation plant → leaf). We can then describe species as individuals of the concept plant using a logical language such as OWL [OWL 2012]. This uniform framework allows the design of tools that work on all knowledge bases using this formalism (query, statistics, ...). Ontologies for plants [Jaiswal et al. 2005] focus on the anatomy of the plants and growth stages, rather than identification criteria. We also found that existing ontology frameworks were not exactly what we wanted: in our opinion, concepts and relations tend to make describing the hierarchy of attributes and values a bit heavy. Moreover, there is no standard support for the probabilistic features we need, and thus we would probably have to modify an existing system anyway.
For these reasons, we have instead drawn inspiration from ontology systems but have designed something we believe to be simpler to tailor to our specific needs.Our system is designed to be compatible with OWL so that the data can be exported into a format most OWL-related software can work with.

A HIERARCHICAL, EVOLVABLE, PROBABILISTIC DESCRIPTION OF PLANTS
We present our proposed structure for a morphological database of plants, suitable for constructing a determination key and, incidentally, usable in a large spectrum of scientific activities around plants: their morphologies, functions, ecology, interactions and evolution.

Desired properties
1. Explicit schema. The list of attributes and their values is sometimes left implicit in existing databases. We would like our schema to have an explicit description in a formal language that is easy to understand, discuss and revise.
2. Hierarchy of attributes. Not all species have a value for every attribute. Indeed, there is a natural dependency relation between attributes that indicates when an attribute is relevant for a plant. For instance, if the species is not flowering, all the attributes about flowers are irrelevant. We want a schema that makes these dependencies explicit. This helps with entering data (as we see directly which fields are relevant or not) but also with understanding the structure of the schema. Hierarchy can also encourage modularity of the schema by splitting it into subschemas.
3. Structured representation. Most databases use a flat representation, which necessarily loses part of the structure of the flora description. We believe a more structured approach is necessary, in particular to capture correlations between traits. In flat databases, each trait is assumed to be independent, which is not always the case. For instance, matrices cannot provide a detailed representation of species with different kinds of leaves (basal or on the stem, for instance) or flowers.
Guided by intuitions from programming language theory, we view schemas as types of a programming language. Our method is thus:
• To define a type theory, i.e. a language to define and manipulate schemas. We want this language to satisfy the criteria above, but also to allow us to describe the schema (1) modularly (easily split into subschemas) and (2) in a way that mimics how morphology is presented in textbooks.
• To define a denotational semantics for this theory: for every type $T$ (representing a schema), we describe a set $\mathcal{O}(T)$ of probabilistic observations valid according to schema $T$. This amounts to attributing a value to an attribute with a certain probability.

Schemas as types
Our first task was to think about how to formalise a list of attributes and values that are relevant for identification and used in most floras. This list is meant to be evolvable, as the work of actually compiling an exhaustive list adapted to all families, if at all possible, is an endeavor spanning years.
In terms of computer science, we want to describe our schema as a type P in a suitable type theory, representing the structure of observable traits on plants. Traditional databases have a simple schema. Each attribute can be seen as a sum type (also known as an enum type), and the overall schema is a product (or record, or struct) of all the attributes. Thus, such schemas are products of sums.
To improve on this flat structure, we follow a more sophisticated approach based on the algebraic types found in ML languages [Milner et al. 1997]. Their tree structure has two main advantages:
• Hierarchy. They can nicely represent the hierarchical structure of observable traits: flowers, then perianth, then petals, ... They also make explicit the dependency between traits: petals only make sense for flowering plants.
• Incrementality. They make it easy to leave some parts undescribed, and to replace them later with a more precise description. For instance, we can start by describing the hair on the stem as a boolean before moving on to a more complex description of the hair. This is essential, as we cannot wait to have a complete type before starting to describe species, and it allows us to easily insert new branches for describing specific families (e.g. the conifers with their needles, the grasses, or the composite flowers).
3.2.1 Algebraic types for describing plant structure. Algebraic types are used in functional programming languages. They are built from product types (often known as records in non-functional languages) and sum types (related to union or enum in non-functional languages). They can be seen as describing the shape of hierarchies, and are thus well suited to represent morphological descriptions (plant, then flowers, then perianth, then corolla, ...). Standard traits can be represented as simple cases of sum types, such as a type color listing the possible colors. To describe an organ or a component composed of several subcomponents, records can be used: corolla := { color: color; number: ... }. This defines a corolla as having two sub-components: its color, described by the previous color type, and the number of petals.
Sum types are convenient for representing parts of the description that are specific to certain species.For example, the case of flower heads can be represented as follows: The first line defines the type of inflorescence.This definition can be read as defining an attribute inflorescence-type with at least two values capitulum and umbel.The (cap-descr) indicates that when the value capitulum is selected, new attributes are unlocked, described by cap-descr.
Finally, as leaves of the algebraic types, we also allow quantitative traits (for representing lengths for example).
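As an illustration only, the three constructions above (sum types, records, and cases that unlock new attributes) can be mimicked in Python with enums, dataclasses and unions. This is a minimal sketch and not the concrete syntax of the schema language: the colour values, the field `ray_florets` inside `CapDescr`, and the helper `unlocked_attributes` are hypothetical placeholders introduced for the example.

```python
from dataclasses import dataclass, fields
from enum import Enum
from typing import List, Union


class Color(Enum):
    # Hypothetical values; the real list would come from the schema.
    WHITE = "white"
    YELLOW = "yellow"
    RED = "red"


@dataclass(frozen=True)
class Corolla:
    # Record with two sub-components, mirroring "corolla" in the text.
    color: Color
    number: int


@dataclass(frozen=True)
class CapDescr:
    # Hypothetical content of cap-descr: attributes that only make sense
    # once the value "capitulum" has been chosen.
    ray_florets: bool


@dataclass(frozen=True)
class Capitulum:
    descr: CapDescr


@dataclass(frozen=True)
class Umbel:
    pass


# Sum type: an inflorescence is either a capitulum (with extra data) or an umbel.
Inflorescence = Union[Capitulum, Umbel]


def unlocked_attributes(value: Inflorescence) -> List[str]:
    """List the attributes that are relevant for a given inflorescence."""
    names = ["inflorescence-type"]
    if isinstance(value, Capitulum):
        names += [f.name for f in fields(value.descr)]
    return names


daisy_like = Capitulum(CapDescr(ray_florets=True))
print(unlocked_attributes(daisy_like))
print(unlocked_attributes(Umbel()))
```

The point of the sketch is that attributes specific to flower heads simply do not exist on the `Umbel` case, which is how a tree-shaped type makes the dependency between traits explicit.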

3.2.2 The formal grammar of types. To study mathematically the types obtained by such constructions, we define a grammar with four productions: sums $T_1 \oplus \dots \oplus T_n$, products $T_1 \times \dots \times T_n$, multiple types, and quantitative traits $[a, b]\,u$. In the first two lines, $n$ can be zero, leading to the empty type 0 and the unit type 1. This abstract syntax does not have names for constructors and fields, but our concrete syntax does. When we write [yes | no], we mean the type 1 ⊕ 1, and a record {flower: [yes|no]; leaf: [yes|no]} becomes (1 ⊕ 1) × (1 ⊕ 1). The type inflorescence above can be written formally in the same way, using the abbreviation 2 := 1 ⊕ 1. Each operand of the ⊕ corresponds to a case of inflorescence. Empty cases (e.g. umbel) become 1.
The last line is a quantitative trait: $u$ is the unit, and $[a, b]$ is an interval describing the default probability distribution. For instance, leaf length could be represented as [1 − 100]cm.
The most surprising construct is the third line, which we call multiple types. It can be used when the sub-type can have several forms that need independent descriptions. It can be thought of as a type of multisets and is inspired by the exponentials in Linear Logic [Girard 1987] and cardinality constraints in OWL.
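To make the abstract syntax concrete, the grammar can be encoded as a small Python data structure with a printer that erases constructor and field names. This is a sketch under assumptions: the class names and the textual rendering `Multiple(...)` are hypothetical, since the notation for multiple types is not fixed here; only the renderings of sums, products and intervals follow the text.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Sum:
    cases: Tuple["Type", ...] = ()


@dataclass(frozen=True)
class Product:
    fields: Tuple["Type", ...] = ()


@dataclass(frozen=True)
class Multiple:
    inner: "Type"


@dataclass(frozen=True)
class Quantitative:
    low: float
    high: float
    unit: str


Type = Union[Sum, Product, Multiple, Quantitative]

EMPTY = Sum(())           # n = 0 in a sum: the empty type 0
UNIT = Product(())        # n = 0 in a product: the unit type 1
BOOL = Sum((UNIT, UNIT))  # [yes | no], i.e. 1 ⊕ 1


def show(t: Type, nested: bool = False) -> str:
    """Render a type in the abstract syntax, without constructor or field names."""
    if isinstance(t, Sum):
        if not t.cases:
            return "0"
        text = " ⊕ ".join(show(c, True) for c in t.cases)
    elif isinstance(t, Product):
        if not t.fields:
            return "1"
        text = " × ".join(show(f, True) for f in t.fields)
    elif isinstance(t, Multiple):
        # Hypothetical rendering of a multiple type.
        return f"Multiple({show(t.inner)})"
    else:
        return f"[{t.low}, {t.high}]{t.unit}"
    return f"({text})" if nested else text


# The record {flower: [yes|no]; leaf: [yes|no]} from the text.
record = Product((BOOL, BOOL))
print(show(record))
print(show(Quantitative(1, 100, "cm")))
```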

3.2.3 Defining the type: methodology. We have started writing a plant.scheme following this methodology, gathering information from floras and determination keys as well as morphological books, guided by the botanical expertise of members of the group.
As mentioned in Section 2, devising this formal scheme proves to be an arduous endeavour. First, different authors have different ways of structuring certain aspects, and finding out which is the best way to formalise them can be tricky. Moreover, identification traits are about delimiting certain behaviours (certain leaves are simple, others are compound), but for each such delimitation there are always corner cases for which there is some uncertainty in choosing the value. As a guideline in formalising, we try to limit the number of values for a trait, which means trying to split traits with a large number of values into sub-traits. This may help push the uncertainty to sub-traits which are less critical to the identification.
Figure 4 represents part of our current plant.scheme. Nodes in orange are attributes; nodes in purple are values. The unlocking of attributes corresponds to arrows from values to attributes.

Probabilistic and structured description of plants
To model uncertainty in botanical knowledge and possible polymorphisms at all levels (within an organ, a plant or a population), we follow a Bayesian approach. For this, we need a probabilistic description of species. In its simplest form, this means that a species description does not assign a value to every attribute, but rather a probability distribution representing uncertainties. Obviously, translating the often qualitative uncertainty present in floras into quantitative probabilities involves a certain amount of guesswork, but we believe that, even though the probabilities are not extremely reliable, they can be used to protect ourselves from errors in the data and differences in appreciation, as well as to represent the species polymorphism evoked in Section 2.2. Moreover, as we will see, we also want to adopt a Bayesian approach for identifying a species (see Section 4), so the two fit together naturally.
As we will see in the next section, it is not quite enough to have a probability distribution of values for each attribute, as this misses correlations between traits. Those correlations tend to be quite rare, but are important for a large class of species. Typical correlations include, for instance, the one between the structure of a leaf and its position on the plant (stem or base).

3.3.1 The space of observations. From the type P mentioned in the previous section, we derive a structure for descriptions. Descriptions live in a space $\mathcal{O}(P)$. Since we want descriptions to be probabilistic, we could have taken $\mathcal{O}(P) := \mathcal{D}(P)$, the space of probability distributions over P.
Since P is complex, so is directly describing a probability distribution on it. It is much easier to define one probability distribution per attribute in most cases, while using multiple types to represent correlations.
We take advantage of the inductive definition of P to describe $\mathcal{O}(P)$ by induction using simplification formulas.
• Records. It is well known that there is a correspondence between $\mathcal{D}(A \times B)$ (distributions over $A \times B$) and $\mathcal{D}(A) \times \mathcal{D}(B)$ (pairs of distributions). This correspondence is not one-to-one: from left to right, it forgets correlations between $A$ and $B$. We still choose to use this correspondence here to simplify $\mathcal{O}(A \times B)$, leaving correlations to be dealt with by multiple types. Thus we let: $\mathcal{O}(A \times B) := \mathcal{O}(A) \times \mathcal{O}(B)$.
• Sum. A distribution over a disjoint union $A \oplus B$ can be described by a distribution over $A$ and one over $B$, plus a weight in $[0, 1]$ that says which side we are closer to. Thus, we let: $\mathcal{O}(A \oplus B) := [0, 1] \times \mathcal{O}(A) \times \mathcal{O}(B)$.
• Quantitative data. For quantitative data, we just interpret them as continuous distributions over $\mathbb{R}^+$. In the implementation, for now, we only support normal distributions for simplicity.
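The loss of correlations in the record case can be checked on a toy example. The sketch below is illustrative only: the joint distribution over leaf position and leaf size is made up, and the helper names `marginals` and `product` are hypothetical. It projects a joint distribution onto a pair of per-attribute distributions and shows that the pair no longer excludes combinations that the joint distribution ruled out.

```python
from collections import defaultdict

# Made-up joint distribution: basal leaves are always small,
# stem leaves are always large.
joint = {("basal", "small"): 0.5, ("stem", "large"): 0.5}


def marginals(joint):
    """Project a distribution over pairs onto a pair of distributions."""
    left, right = defaultdict(float), defaultdict(float)
    for (a, b), p in joint.items():
        left[a] += p
        right[b] += p
    return dict(left), dict(right)


def product(left, right):
    """Rebuild a joint distribution assuming the two attributes are independent."""
    return {(a, b): pa * pb for a, pa in left.items() for b, pb in right.items()}


position, size = marginals(joint)
rebuilt = product(position, size)

# The pair of marginals gives weight to "basal and large",
# a combination that never occurs in the original distribution.
print(joint.get(("basal", "large"), 0.0))
print(rebuilt[("basal", "large")])
```

This is the situation that a per-attribute description cannot express and that has to be handled by a dedicated construct.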
We can represent a plant with small basal leaves and large stem leaves as follows:

GENERATING A DETERMINATION KEY
Determination keys are particular cases of classification trees. We review standard approaches to generate such trees from data.

Learning classification trees from examples
The problem of automating the reasoning of experts in a field (here: identifying a plant) dates back to the beginning of Artificial Intelligence, in particular with expert systems, where rules are created in collaboration between an expert of the field and a knowledge engineer. These rules tend to be deduction rules that are then fed to an inference engine, which uses them to try to replicate the predictions of the expert. This process proved to be time-consuming, and people quickly tried to have the computer learn the rules on its own. This gave rise to the field of machine learning, where the machine learns by itself, without the need to understand how an expert reasons.
In the case of classification problems (such as determining a plant's species), the approach using classification trees as learnt knowledge became very popular. The computer learns from a set of examples a tree whose leaves contain a prediction for the category. Concretely, an example is a tuple $(a_1, \dots, a_n, c)$ where each $a_i \in A_i$ is an attribute value picked in a set of values $A_i$. $A_i$ may be finite (for instance the petal colors) or infinite (length of the leaf). The final component, $c \in C$, is the class of the example, i.e. the expected result.
From such a set of examples, the goal is to learn a classification tree: leaves are predictions, i.e. elements of $C$, and nodes are attributes (e.g. $A_2$), with one child per value of the attribute.
The main metrics for such trees are (1) correctness and (2) conciseness. One of the first systems to learn such trees is CLS [Hunt et al. 1966], which uses a min-max approach to generate an efficient tree. This min-max approach involves exploring different branches of the tree to evaluate how optimal a condition is. It tends to require a lot of computation, so other approaches were developed.

The ID3 algorithm [Quinlan 1986]
ID3 [Quinlan 1986] is a greedy algorithm that at every step picks the best condition according to some metric, and then recurses into each sub-branch. Compared to CLS, there is no backtracking involved, and so this algorithm is much faster. To get a well-performing tree, everything hinges on the choice of the metric used to evaluate the condition. In ID3, entropy is used to evaluate how well a condition partitions the set of possible predictions.

Data of the algorithm.
The algorithm assumes a list of attributes, given as finite sets $A_1, \dots, A_n$, and a set of classes $C$. We assume a set of examples $E \subseteq A_1 \times \dots \times A_n \times C$. The output of the algorithm is a tree whose leaves are labelled with an element of $C$ and whose nodes are labelled with one of the $A_i$. A node labelled $A_i$ has $|A_i|$ children.

Entropy.
The entropy is a central notion in the ID3 algorithm: it measures how uncertain a set of examples $E$ is. The deeper we go in the tree, the smaller $E$ becomes and thus the more certain the prediction is.
Given a set $E$ of examples and $c \in C$, we define the probability of $c$ relative to $E$:
$$p_E(c) = \frac{|E_c|}{|E|}$$
where $E_c$ is the subset of $E$ containing only examples with class $c$. The entropy of a set of examples is based on Shannon's formula [Shannon 1948] for entropy as follows:
$$H(E) = -\sum_{c \in C} p_E(c) \log_2 p_E(c)$$
with the convention $0 \log_2 0 = 0$. The minimal value of zero is reached when $E$ only contains examples of the same class. In that case, the prediction is easy: it must be that class.
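The greedy construction and the entropy metric can be sketched in a few lines of Python. This is a textbook-style illustration, not the algorithm used later in this paper: examples are tuples whose last component is the class, attributes are referred to by their index, the four toy examples are invented, and (unlike the description above) a node only gets children for attribute values that actually occur in the examples.

```python
import math
from collections import Counter


def entropy(examples):
    """Shannon entropy (in bits) of the class labels of a non-empty list of examples."""
    counts = Counter(example[-1] for example in examples)
    total = len(examples)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def split(examples, attribute):
    """Partition the examples according to their value for one attribute index."""
    parts = {}
    for example in examples:
        parts.setdefault(example[attribute], []).append(example)
    return parts


def id3(examples, attributes):
    """Return a class (leaf) or a pair (attribute index, {value: subtree})."""
    classes = Counter(example[-1] for example in examples)
    if len(classes) == 1 or not attributes:
        return classes.most_common(1)[0][0]

    def remaining_entropy(attribute):
        # Average entropy of the subsets obtained by asking about this attribute.
        parts = split(examples, attribute).values()
        return sum(len(p) / len(examples) * entropy(p) for p in parts)

    # Greedy choice, no backtracking: keep the attribute leaving the least uncertainty.
    best = min(attributes, key=remaining_entropy)
    rest = [a for a in attributes if a != best]
    return (best, {value: id3(part, rest) for value, part in split(examples, best).items()})


# Invented toy data: (petal colour, leaf type, species).
examples = [
    ("white", "simple", "A"),
    ("white", "compound", "B"),
    ("yellow", "simple", "C"),
    ("yellow", "compound", "C"),
]
tree = id3(examples, [0, 1])
print(entropy(examples))
print(tree)
```

On this toy data the colour question is asked first because it leaves an average entropy of 0.5 bits, against 1 bit for the leaf question.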

Adapting ID3 for our needs
There is a large body of work building on this algorithm and trying to improve different aspects of it. One key aspect is avoiding overfitting, which is often done by pruning the tree after generation. This section is mostly informal; details can be found in Appendix A.

Bayesian approach.
To take into account polymorphism, uncertainties and subjectivities, our descriptions are probabilistic. Thus, we need to make the algorithm probabilistic, which means maintaining a distribution over the set of species $S$ rather than a subset $S' \subseteq S$ of possible species. Since the notion of entropy was first defined on probability distributions, this changes the algorithm very little. To compute the effect of an answer, we use Bayes' law (since our probabilities are discrete).
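A single Bayesian step can be sketched as follows. This is a generic illustration of Bayes' law on a discrete distribution, with invented species names and likelihoods; the function name `bayes_update` is hypothetical.

```python
def bayes_update(prior, likelihood):
    """Posterior over species after one answer.

    prior:      {species: P(species)}
    likelihood: {species: P(answer | species)}, missing species count as 0.
    """
    unnormalised = {s: prior[s] * likelihood.get(s, 0.0) for s in prior}
    total = sum(unnormalised.values())
    if total == 0:
        raise ValueError("the answer is impossible under every species description")
    return {s: w / total for s, w in unnormalised.items()}


# Invented example: three equally likely species, and the probability that
# each of them would lead the user to answer "the petals are white".
prior = {"A": 1 / 3, "B": 1 / 3, "C": 1 / 3}
white_petals = {"A": 0.9, "B": 0.5, "C": 0.1}

posterior = bayes_update(prior, white_petals)
print(posterior)
```

No species is eliminated outright: species C remains in the distribution with a small weight, which is what allows a later answer to bring it back.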

Structured description.
Our input is a structured description rather than a set of examples. To take this structure into account, we use a score function defined by induction (see Appendix A). This function computes the conditional probability needed for Bayes' law.

Discussion and shortcomings.
Prior distribution. One advantage of this approach compared to the one described in Section 4.2 is that the algorithm is parameterized by an initial probability distribution $p_0$ (the prior, in Bayesian terms). This distribution can be taken to be the uniform distribution over the set of species $S$. However, the weight of a species $s$ can also be related to its frequency: the frequency in the region of observation is an a priori probability of observing a plant. In our proof of concept, we have used as a starting point $n^{1/5}$, where $n$ is the number of occurrences in the French territory as indicated by GBIF. The low exponent is necessary to flatten the disparities between rarely and frequently observed and reported plants. Otherwise, rarely reported plants cannot be found with the key, as their weight is too low.
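The flattening effect of the exponent is easy to see numerically. In the sketch below the occurrence counts are invented (they are not GBIF figures) and `flattened_prior` is a hypothetical helper name; only the exponent 1/5 comes from the text.

```python
def flattened_prior(occurrences, exponent=1 / 5):
    """Prior proportional to n ** exponent, where n is the number of occurrences."""
    weights = {species: n ** exponent for species, n in occurrences.items()}
    total = sum(weights.values())
    return {species: w / total for species, w in weights.items()}


# Invented counts: one very commonly reported species and one reported once.
occurrences = {"common": 100_000, "rare": 1}

prior = flattened_prior(occurrences)
raw = flattened_prior(occurrences, exponent=1)

# With raw frequencies the rare species is 100000 times less likely;
# with the exponent 1/5 it is only about 10 times less likely.
print(raw["common"] / raw["rare"])
print(prior["common"] / prior["rare"])
```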
This algorithm picks the most informative question, relative to the current probability distribution. In other words, it picks the question that is the best at discriminating the most likely species. This is an improvement over standard key algorithms that may pick questions to rule out very rare species.
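One way to read "most informative" is as the question that minimizes the expected entropy of the distribution after the answer. The sketch below follows that reading; the selection criterion, the representation of a question as a mapping from answers to per-species likelihoods, and all names and numbers are assumptions made for the illustration.

```python
import math

def entropy(dist):
    """Shannon entropy, in bits, of a dict species -> probability."""
    return sum(-p * math.log2(p) for p in dist.values() if p > 0)

def expected_entropy(dist, question):
    """Expected posterior entropy after asking `question`.

    question: dict answer -> dict species -> P(answer | species)
    """
    total = 0.0
    for likelihood in question.values():
        joint = {s: dist[s] * likelihood.get(s, 0.0) for s in dist}
        p_answer = sum(joint.values())
        if p_answer > 0:
            posterior = {s: w / p_answer for s, w in joint.items()}
            total += p_answer * entropy(posterior)
    return total

def best_question(dist, questions):
    """Name of the question with the lowest expected posterior entropy."""
    return min(questions, key=lambda name: expected_entropy(dist, questions[name]))

# Hypothetical setting: two likely species and one rare species.
dist = {"a": 0.45, "b": 0.45, "rare": 0.10}
questions = {
    "splits_likely": {"yes": {"a": 1.0}, "no": {"b": 1.0, "rare": 1.0}},
    "isolates_rare": {"yes": {"rare": 1.0}, "no": {"a": 1.0, "b": 1.0}},
}
print(best_question(dist, questions))
```

In this toy case the question separating the two likely species is preferred over the one that only rules out the rare species.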
Generating questions. In this algorithm, questions need to be specified by the creator of the key. From the description of the type P, it is possible to generate a set of basic questions, each about one trait. For instance, if P has two attributes, the presence of leaves and the color of flowers, then we can easily generate two questions, one about the presence of leaves and a second one about the color of flowers. In general, we can generate a set of questions inductively on the structure of P.
These questions are too sharp: for instance, not everyone uses the same words to describe flower colors, and the boundaries between two colors, e.g. pink and red, are fuzzy. To alleviate this, we equip every sum type appearing in P with a confusion matrix $M$ that makes explicit how likely it is to confuse one trait value for another. For instance, here we could have $M_{\text{red},\text{orange}} = 10\%$, while $M_{\text{red},\text{blue}} = 5\%$.
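The effect of a confusion matrix on the distribution can be sketched as follows. This is only an illustration: the indexing convention (true value first, answered value second), the diagonal value of 1.0, the symmetric entries, the function name and the two toy species are assumptions; only the values 10% and 5% are taken from the example above.

```python
# CONFUSION[true_colour][answered_colour]: assumed compatibility weight.
CONFUSION = {
    "red":    {"red": 1.0, "orange": 0.10, "blue": 0.05},
    "orange": {"red": 0.10, "orange": 1.0, "blue": 0.05},
    "blue":   {"red": 0.05, "orange": 0.05, "blue": 1.0},
}

def update(dist, colour_of, answer, confusion=None):
    """Reweight species after the user reports the flower colour `answer`.

    Without a confusion matrix, species of another colour get weight zero.
    """
    def compatibility(true_colour):
        if confusion is None:
            return 1.0 if true_colour == answer else 0.0
        return confusion[true_colour][answer]

    weights = {s: p * compatibility(colour_of[s]) for s, p in dist.items()}
    total = sum(weights.values())
    if total == 0:
        raise ValueError("no species is compatible with this answer")
    return {s: w / total for s, w in weights.items()}

# Hypothetical species, one red-flowered and one orange-flowered.
dist = {"species_red": 0.5, "species_orange": 0.5}
colour_of = {"species_red": "red", "species_orange": "orange"}

hard = update(dist, colour_of, "orange")             # sharp question
soft = update(dist, colour_of, "orange", CONFUSION)  # with confusion matrix
print(hard, soft)
```

With the sharp question, the red-flowered species is eliminated as soon as the user answers orange; with the matrix, it is penalized but remains in the distribution.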
Doing so protects us against users making mistakes while answering a question. Without the matrix, if the user answers orange, all the red flowers disappear from the distribution (their weight becomes zero). With the matrix, their relative weight is divided by five: they take a hit, but are still around. This mechanism allows the user to make one or two mistakes in the exploration of the tree and still get a correct identification, whereas standard determination keys tend to assume that the user is infallible.
Ambiguous questions. If $q = (q_1, \ldots, q_n)$ is a question and $\mu \in \mathcal{D}(S)$ a probability distribution, we can look at the sum, over the answers $q_i$, of the probability under $\mu$ that a plant matches $q_i$. We would expect this sum to be equal to one, which means that every plant description is matched by exactly one of the $q_i$. This is not the case in general, especially for questions generated with a confusion matrix. We are interested in the case where the sum is greater than one, which means there is an overlap between different answers. This overlap may also come from the descriptions of the species present in $\mu$, which may have several values for a trait.
This overlap quantifies how ambiguous the question is, and may reflect how likely a user is to make a mistake while answering this question, given the current distribution $\mu$. While we have not tested it yet, we believe this quantity could be taken into account to build a determination key that, instead of only taking the most informative question, also tries to avoid overly ambiguous questions.
Questions unlocking questions. When the sum is below one, the question is partial. This means there are individuals for which no answer really makes sense, typically because an answer is missing. This happens for questions generated for nested sub-types. Consider a type for plants with a single attribute, flower, which can be yes or no. In the case of yes, we are also interested in the color of the petals. This type generates two questions: presence of flowers, and color of petals. The question about color is clearly partial: it only makes sense once we have established that the observed plant has flowers. Since we do not want to present the user with a question where they could get stuck, i.e. have no relevant answer, we need to ask questions in order. However, the question "is it a flowering plant" will often have a low score since most plants in identification keys are flowering. Thus the question will not get asked, even though it unlocks juicy questions, such as the shape of the inflorescence, that are highly discriminating. To circumvent this problem, we need to explore a few questions ahead to compute an amortized score taking into account the questions that are unlocked by potential questions.
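The sum discussed above, and the distinction between ambiguous and partial questions, can be sketched as follows. The function names, the numerical tolerance, the representation of a question as a mapping from answers to per-species match weights, and the toy distributions are assumptions made for the illustration.

```python
def answer_mass(dist, question):
    """Sum, over all answers, of the probability that a plant matches the answer.

    question: dict answer -> dict species -> match weight in [0, 1]
    """
    return sum(
        dist[s] * match.get(s, 0.0)
        for match in question.values()
        for s in dist
    )

def classify(total, tol=1e-9):
    """Label a question from its answer mass."""
    if total > 1 + tol:
        return "ambiguous"  # answers overlap
    if total < 1 - tol:
        return "partial"    # some plants match no answer
    return "exact"

# Hypothetical distribution including a plant without flowers.
dist = {"red_flower": 0.6, "orange_flower": 0.3, "no_flower": 0.1}
presence = {
    "yes": {"red_flower": 1.0, "orange_flower": 1.0},
    "no": {"no_flower": 1.0},
}
colour = {"red": {"red_flower": 1.0}, "orange": {"orange_flower": 1.0}}

# Hypothetical colour question softened by a confusion weight of 10%.
flowering = {"red_flower": 0.5, "orange_flower": 0.5}
fuzzy_colour = {
    "red": {"red_flower": 1.0, "orange_flower": 0.1},
    "orange": {"red_flower": 0.1, "orange_flower": 1.0},
}

for d, q in [(dist, presence), (dist, colour), (flowering, fuzzy_colour)]:
    print(classify(answer_mass(d, q)))
```

In this toy setting the colour question is partial as long as a plant without flowers remains in the distribution, which is the situation where the presence question has to be asked first.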

GATHERING AND USING DATA
To test the determination key generator, we need a reasonable number of descriptions following our scheme. In this section, we describe our past and future attempts at gathering and formalizing morphological data. We have mostly considered three approaches: (1) mining existing sources online, (2) citizen science and (3) machine learning.

Database mining
So many databases of plant descriptions exist, either in books or digitised, that it is tempting to gather data from this previous work. However, we were surprised by recurrent obstacles. Firstly, very few databases have a permissive license, which is understandable because gathering data is a huge amount of work and those who accomplished it fear that it will be used without any benefit for them. We could not engage in individual collaborations with all of them, so we abandoned the idea of using a large part of the existing data. Secondly, datasets are not easily parsed. Let us take the example of the French flora by Bonnier [Gaston and de Layens 1909], which has been digitised by Tela-Botanica⁸. We attempted to translate the digitised questions and answers into our formalism. It is easy to obtain a graph whose nodes are questions (or taxa) and whose edges are answers to questions. We had hoped at the beginning that we could just follow a path to a species and get partial data for that species as the conjunction of its edges. This proved difficult: we had expected the graph to be a tree (as in decision tree), but it is actually a DAG, that is, there are several paths leading to the same species. This phenomenon is due to a particular feature of the key: it first identifies the family, so the several paths converging to a single family do not necessarily concern the species eventually identified.
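The difference between a tree and a DAG can be made concrete by counting the paths that reach each node of the key. The sketch below is an illustration only: the edge format, the function name and the small graph (including the wording of the answers) are hypothetical and are not taken from the digitised flora.

```python
from collections import defaultdict

def count_paths(edges, root):
    """Number of distinct paths from `root` to every node of a DAG.

    edges: list of (parent, answer, child) triples. Every node is assumed to
    be reachable from `root`, and `root` is assumed to have no incoming edge.
    """
    children = defaultdict(list)
    indegree = defaultdict(int)
    for parent, _answer, child in edges:
        children[parent].append(child)
        indegree[child] += 1

    paths = defaultdict(int)
    paths[root] = 1
    ready = [root]  # nodes whose incoming edges have all been processed
    while ready:
        node = ready.pop()
        for child in children[node]:
            paths[child] += paths[node]
            indegree[child] -= 1
            if indegree[child] == 0:
                ready.append(child)
    return dict(paths)

# Hypothetical key: two different routes lead to the same family.
edges = [
    ("q1", "herbaceous", "q2"),
    ("q1", "woody", "q3"),
    ("q2", "white petals", "Rosaceae"),
    ("q3", "white petals", "Rosaceae"),
    ("Rosaceae", "three leaflets", "Fragaria vesca"),
]
paths = count_paths(edges, "q1")
shared = sorted(node for node, n in paths.items() if n > 1)
print(shared)
```

In a tree every node would be reached by exactly one path. Here the species inherits two paths through its family, so the conjunction of the answers along one path (for example "woody") need not describe the species.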
We have also tried to parse the data from TRY, which is a trait database for plants. Although it contains a lot of data, it is actually a "database of databases", thus including many sources which are sometimes contradictory, and it has a lot of missing data. As explained above, it was mainly conceived for plant ecological studies with the goal of comparing plants, not distinguishing them, and thus does not contain many traits useful for plant identification.

Manual description filling
So we explored ways to reconstruct the datasets from scratch, informed by existing floras, digitised or not, but manually filling the fields instead of automatically mining. We have designed an interface, parameterized by the plant.scheme, that allows one to input the description of a species. Concretely, this means, for each sum type in the scheme, giving the value(s) for that particular species. Since our model is probabilistic, we can also specify a probability distribution on the values of a trait, if there are several. The editor takes advantage of the hierarchy of the plant.scheme: selecting a trait value may unlock new traits that depend on it. For instance, in the case of inflorescence, if we select capitulum, this unlocks the traits relative to capitula. This allows us to only see the traits that are relevant to the species currently being edited, and also to take it slowly, as the traits unfold the more information we add.
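The unlocking behaviour of the editor can be sketched with a nested dictionary standing in for the scheme. This is a hypothetical encoding: the actual plant.scheme format is not reproduced here, and the trait names, values and function name are assumptions made for the illustration.

```python
# Hypothetical fragment of a scheme: each trait maps its possible values to
# the nested traits that the value unlocks.
SCHEME = {
    "inflorescence": {
        "capitulum": {"ray_florets": {"present": {}, "absent": {}}},
        "spike": {},
    },
    "leaves": {
        "simple": {},
        "compound": {"leaflets": {"three": {}, "five_or_more": {}}},
    },
}

def visible_traits(scheme, description):
    """Traits the editor should display, given the values entered so far.

    description: dict trait -> dict value -> probability
    """
    shown = []
    for trait, values in scheme.items():
        shown.append(trait)
        for value in description.get(trait, {}):
            # A selected value reveals the traits nested under it.
            shown.extend(visible_traits(values[value], description))
    return shown

empty = visible_traits(SCHEME, {})
with_capitulum = visible_traits(SCHEME, {"inflorescence": {"capitulum": 1.0}})
polymorphic = visible_traits(SCHEME, {"leaves": {"simple": 0.2, "compound": 0.8}})
print(empty, with_capitulum, polymorphic, sep="\n")
```

Only the top-level traits are shown for an empty description; selecting capitulum reveals the trait nested under it, and a distribution over several leaf types reveals the traits of every value that has been entered.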
Data for a species is usually filled in from some bibliographical source. Due to the polymorphism and the uncertainty mentioned above, there is in general no complete consensus in the literature about existing plant observations. For the database to be of scientific value, it is important to be able to source each value in the entry for a species (especially the least obvious ones). To that end, the editor allows adding bibliographic sources for each entry.

5.2.1 Experiments in citizen science. We would also like to involve associations interested in conservation and environmental education to help us fill the database. We have obtained a small grant to work with local and regional associations. The goal is to use the knowledge of the employees and volunteers of the association in exchange for producing determination keys for areas of interest. This is also a good way for people to strengthen their botanical knowledge, which is often a side goal of these associations. We have devised the following methodology so far, where the goal is to study a small ecosystem. Our partnership works as follows:
1. First fieldwork: botanical inventory of the ecosystem, paying attention to how frequent the species are (this is needed to get a good prior distribution).
2. Lab session: adding missing species to the database.
3. Second fieldwork: after generating the key, we try to validate it by going back to where the inventory was made.
This methodology also furthers a goal of the project, which is to get interested citizens more involved with their local ecosystems. For a given environment, the methodology can be repeated several times across the year as the species distribution tends to vary a lot. This opens the door to making seasonal determination keys that could be even more effective. Indeed, the current season tends to be the first criterion used by botanists to guide what species they expect to find (and in what form).

5.2.2 Partnering with national networks of botanists. We are also in the process of partnering with BotaScopia⁹, a French national initiative led by Tela-Botanica and Université Paris-Saclay aiming at providing uniform descriptions of species based on a formal representation. In this project, the data is entered by students for whom this process is part of their curriculum.

Machine Learning
Another way to obtain data is to use machine learning techniques. We see two possibilities.

Using NLP.
NLP techniques can be used to retrieve information from natural language sources such as floras. It is important that the model be able to source its values for attributes. We have done some experiments with ChatGPT but they still have to be thoroughly investigated. Before that, however, we should ask ourselves whether using such a tool, given its energy consumption and its use of "human resources" in training, fits with our general objectives. It seems preferable to work directly with botanists, as entering botanical data is meaningful for them.

Using image recognition.
Deep learning could be used to associate photos with traits, based on a training set. In this way a subset of traits might be immediately recovered from a simple image. However, this might not be compatible with our idea of staying within certain limits in terms of computation and resource usage.

CONCLUSION AND FUTURE WORK
Determination keys are algorithms. They consist of deterministic sequences of operations that lead to identification. This makes it appealing to use computer science theory to analyse, generate and improve them. Surprisingly, even though computers have been used a lot to help botanical work, closely related concepts from theoretical computer science (Bayesian inference, decision trees, information theory, formal languages) have been little explored in this scope, apart from deep learning and notable exceptions like xper [Kerner and Lebbe 2019].
We have presented a methodology for formalising botanical knowledge, generating data according to this formalisation, and using the data to generate determination keys. We believe that this can be an improvement over existing techniques, both by handling polymorphism, errors and uncertainties in a probabilistic framework, and by proposing a hierarchical, evolvable language that may solve the conundrum between precision and accessibility of plant descriptions. The system presented here is very much a work in progress. Some aspects have been implemented, and we are running experiments.
Ultimately, we would like the determination keys to be printable on paper, at least for small floras, as well as usable offline on a modest smartphone or computer. Instead of a giant national or international flora, we would also like to favour smaller floras better adapted to specific ecosystems. In this way the computing systems are used to construct objects, and these objects, not the computers, are the ones used daily.
In this sense, this work is an occasion to explore, in a concrete setting, innovative computer science concepts such as self-obviating systems and transition computing, related to conviviality.
It is necessary to discuss how to evaluate, a priori and a posteriori, the claim that the tools we propose are convivial and self-obviating. According to Illich [Illich 1973], "The simple, poor, transparent tool is a humble servant; the elaborate, complex, secret tool is an arrogant master." We will not achieve conviviality as long as we rely on computer tools connected to the Internet that require a complex arrangement of technologies. However, compared to AI-based identification tools, we can argue that we progress along the three cited dimensions. First, our planned tools are simpler, not in their usage but in their conception. Indeed, our proposed techniques require some amount of computer science development and botanical knowledge, but they can be described without reference to theories, such as neural networks, that require deep involvement to understand. Furthermore, our tools can be considered progress on the transparency vs secrecy dimension. Indeed, current AI tools, even if explainability is an active field of research, never reveal their tricks, even though they are necessarily based (implicitly or explicitly) on already available human knowledge. Finally, we will have progressed on the poor vs complex dimension if, for small floras, we can generate and print keys on paper. Then we will have achieved the poor, transparent and humble qualities.
The explainability property is also a desired path towards self-obviating systems. Every use of the application should make it less useful for a user, because every answer of the algorithm goes with a careful observation of a plant and with an association between the observation and the result. This remains to be evaluated: a posteriori, it will be possible to measure whether the users of explainable tools feel more autonomous in their dialog with their vegetal environment than users of AI tools. Note that self-obviating for a user does not mean self-obviating for a society: there will always be users who need training, so the tool never disappears entirely. However, given that we are living through an important transformation of our societies due to social and environmental crises, it is reasonable to consider all tools as transitory, and to consider their effects in this transition more than in a stable society.
In the end, we advocate the development of a humble computer science. Humble in several senses: (1) we use far fewer resources than traditional AI approaches; (2) we build on, and with, the people who already have the botanical knowledge; and (3) we consider computer science knowledge as a "back-office" supporting external research questions, more than an "avant-garde" that would transform society by itself, to borrow the words of Bruno Latour during his very last conference in Paris [Latour 2022]. Our goal is not to replace botanists but to spread their knowledge and make it accessible.
We believe our framework will be helpful for training botanists, both formally (biology students) and informally (gardeners, amateurs). Novices in botany could use a key based on the common plants in their region, restricted to the species of interest (they often don't start by identifying grasses), which will yield a rather easy-to-use key. As users progress, they might want to include rarer species, or those occurring in a larger region, or those that are notoriously difficult to identify. The keys will become increasingly precise, requiring more botanical knowledge, but this procedure offers the possibility to learn that knowledge step by step. Such an incremental learning procedure is not possible with classical floras and thus could have many applications in teaching and participatory science.
Finally, this work can have benefits beyond the generation of determination keys. Formalized botanical knowledge is useful in evolutionary biology, ecology, and conservation biology. The question of why there are so many plant species, and why some groups are more species-rich than others (e.g. flowering plants vs. gymnosperms), is still largely unanswered. Morphology is at the intersection of evolution and ecology, as some traits are clearly adaptations that enhance survival in certain environments (e.g. thorns prevent plants from being eaten by large herbivores, leaves densely covered with hairs protect against sunlight and evaporation), while others are the result of evolutionary and developmental constraints (e.g. parallel leaf nerves probably don't influence the species' ecological interactions). Among other uses, the morphology of species can be used to quantify structural and functional diversity, allowing a better understanding of the functioning of ecosystems than species numbers alone. Morphological descriptions should be available for all species that have been described so far, thus allowing biodiversity studies to benefit from centuries of botanical descriptions without the need to spend time, human work and physical energy, all of which are limited, on fieldwork to collect the data.

Fig. 2. A botanical panel describing Fragaria vesca, the wood strawberry, from the book by Otto Wilhelm Thomé, Flora von Deutschland, Österreich und der Schweiz, 1885, Gera, Germany, used under a Creative Commons licence via Wikimedia Commons. The species has compound leaves (each leaf consisting of three leaflets). 1, 2. Flower bud at two developmental stages. 3. Flower, where petals (white), sepals (green), stamens (yellow) and pistils are visible. 4. Stamen. 5. Pistil. 6. The strawberry "fruit" consists of the expanded flower receptacle, the actual fruits (in a botanical sense) being the seeds attached to the surface of this receptacle.

Fig. 3. Botanical panel of Euphorbia helioscopia (A) and Euphorbia esula (B), from the same source as Figure 2, under a Creative Commons licence. 1. Compound inflorescence, showing three cyathia (see 3). 2. Outer inflorescence bract. 3. What appears to be a flower is actually a compact inflorescence, called a "cyathium", consisting of flowers that are highly reduced to one pistil or one stamen (5 is a cross-section). 4. Dissected involucre, consisting of fused bracts surrounding the flowers in the cyathium. 6. Male flowers, each consisting of one stamen. 7, 8. Female flower consisting of one pistil (7 is a horizontal cross-section of the ovary). 9. Seed.

Fig. 4. A part of our plant.scheme displayed as a tree. Note that for illustration purposes, only a subset of the attributes (orange) and their values (purple) are shown.