Identifying an isolated leaf, especially if preserved as a fossil, can be a painstaking process for botanists. A new computer program that learns to categorize leaves into large evolutionary categories could help.
Researchers “trained” a machine-learning algorithm to identify leaves based on a set of nearly 7,600 digital images of leaves that had been chemically treated to emphasize their shape and venation.
The software discerned relevant patterns so well from that set of examples that it went on to identify the family of novel leaf images with greater than 70 percent accuracy (a rate 13 times better than chance) and the order with about 60 percent accuracy.
Study lead author Peter Wilf of Penn State University says that for the algorithm to identify family or order is “an incredible achievement.”
To make such classifications, the software had to come to “understand” that, despite wide variation among a great many species, unifying characteristics place some leaves in one broader group (a family or an order) and other leaves in another.
“Paleobotanists have collected many millions of fossil leaves and placed them in the world’s museums,” says Wilf, professor of geosciences. “They represent one of the most underused resources for understanding plant evolution. Variation in leaf shape and venation, whether living or fossil, is far too complex for conventional botanical terminology to capture. Computers, on the other hand, have no such limitation.”
When botanists identify modern plants, they look at the leaves, but rely mostly on the associated fruits, seeds, and flowers to categorize the specimens. In fossil collections, fruits, seeds, and flowers are usually much less common than leaves.
Even with modern leaves, figuring out which features are botanically informative is a slow process. If a computer vision approach works on modern leaves, it could help in the classification of fossil leaves as well.
“Leaf characterization builds on an 1800s system of description that we call leaf architecture,” says Wilf. “It looks at leaf teeth, margins, lobes, and venation patterns and uses specialized terminology to describe them. For the most part, this procedure tells us how to describe a leaf, not how to identify one and place it on the tree of life.
“Cracking the leaf code and accessing the evolutionary information in leaf architecture is the central problem I feel I must try to solve in my career as a paleobotanist.”
How the software works
The software visually highlighted the subtle venation features that it used to make its classifications, providing botanists with new ideas of relevant traits to consider.
“Along with the demonstration that computers can recognize major clades of angiosperms from leaf images and the promising outlook for computer-assisted leaf classification, our results have opened a tap of novel, valuable botanical characters,” the authors write in the Proceedings of the National Academy of Sciences.
Thomas Serre, coauthor of the paper and an assistant professor of cognitive, linguistic, and psychological sciences at Brown University, studies how the brain accomplishes visual perception with the goal of modeling it in computers.
The new study began when Wilf invited Serre to apply computer vision to botany after reading a publication derived from Serre’s doctoral work on computerized image classification in 2007. Wilf’s hope was that computers could help botanists sort through massive collections of leaf fossils to determine how they may be related to modern species.
Thousands of times faster
The researchers currently have a 72 percent accuracy rate over 19 leaf families compared to about 5 percent for random chance. This project is not the first to computerize leaf identification. A popular app, Leafsnap: An Electronic Field Guide, matches the shape of an unknown leaf from a particular region and identifies it down to the species level.
However, this current work is the first to analyze cleared leaves or leaf venation for thousands of species from around the world, to learn the traits of evolutionary groups above the species level such as plant families, or to directly visualize informative new characteristics.
The variation among the hundreds to thousands of species in a family is many times greater than the variation within a single species, yet the computer algorithms could still learn a set of features and apply it successfully. Because nearly all leaf fossils are of extinct species, family-level identification is usually the first target for paleobotanists.
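The accuracy figures quoted above can be sanity-checked with a few lines of arithmetic. This is a minimal sketch that assumes "chance" means guessing uniformly among equally likely families; the study may define its baseline differently, and the function names are illustrative only.

```python
def chance_accuracy(n_classes):
    """Expected accuracy of uniform random guessing (assumes balanced classes)."""
    return 1.0 / n_classes


def fold_over_chance(accuracy, n_classes):
    """How many times better than uniform guessing a given accuracy is."""
    return accuracy / chance_accuracy(n_classes)


# Figures taken from the article: 72 percent accuracy over 19 families.
print(round(chance_accuracy(19) * 100, 1))   # 5.3 (percent)
print(round(fold_over_chance(0.72, 19), 1))  # 13.7
```

Under that assumption the baseline is roughly 5 percent and the reported accuracy is between 13 and 14 times better, which matches the rounded numbers in the article.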
“This approach is a key distinction between what we call image processing, where literally a computer expert programs a computer to see, as opposed to machine learning and computer vision, where the machine is not programmed to exhibit a particular behavior but rather it learns from examples,” says Serre. “Here, our examples were leaf images together with category labels corresponding to family and order.”
The researchers provided the computer program with half of the photos, already identified, so that it could automatically learn a dictionary of features, such as vein intersections, tiny bumps, and asymmetries, that turn out to matter quite a bit in identifying leaves. The system also learned to disregard the typical problems of low image quality, insect bites, and mounting defects. The algorithm then received unlabeled test photos and used its dictionary to identify them.
The researchers repeated this procedure 10 times, randomly choosing the training and test images each time. The results differed by only about 1 percent between runs.
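The evaluation protocol described above (train on a random half, test on the other half, repeat 10 times) can be sketched as follows. This is an illustration under stated assumptions, not the study's code: the nearest-centroid classifier and the synthetic two-number "features" are hypothetical stand-ins for the learned dictionary and the leaf images.

```python
import random
from statistics import mean


def nearest_centroid_fit(features, labels):
    """Toy stand-in classifier: store the mean feature vector of each class."""
    sums, counts = {}, {}
    for x, y in zip(features, labels):
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, value in enumerate(x):
            acc[i] += value
        counts[y] = counts.get(y, 0) + 1
    return {y: [s / counts[y] for s in acc] for y, acc in sums.items()}


def nearest_centroid_predict(centroids, x):
    """Return the label whose centroid is closest to x."""
    def dist2(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda y: dist2(centroids[y]))


def repeated_holdout(features, labels, n_runs=10, train_fraction=0.5, seed=0):
    """Repeat a random train/test split and return the test accuracy of each run."""
    rng = random.Random(seed)
    indices = list(range(len(features)))
    accuracies = []
    for _ in range(n_runs):
        rng.shuffle(indices)
        cut = int(len(indices) * train_fraction)
        train, test = indices[:cut], indices[cut:]
        model = nearest_centroid_fit([features[i] for i in train],
                                     [labels[i] for i in train])
        correct = sum(nearest_centroid_predict(model, features[i]) == labels[i]
                      for i in test)
        accuracies.append(correct / len(test))
    return accuracies


def make_toy_data(n_per_class=50, seed=1):
    """Synthetic data: three hypothetical 'families' as noisy 2-D clusters."""
    rng = random.Random(seed)
    centers = {"family_a": (0.0, 0.0), "family_b": (3.0, 0.0), "family_c": (0.0, 3.0)}
    features, labels = [], []
    for name, (cx, cy) in centers.items():
        for _ in range(n_per_class):
            features.append((rng.gauss(cx, 1.0), rng.gauss(cy, 1.0)))
            labels.append(name)
    return features, labels


if __name__ == "__main__":
    X, y = make_toy_data()
    accs = repeated_holdout(X, y)
    print(f"mean accuracy {mean(accs):.3f}, spread {max(accs) - min(accs):.3f}")
```

Reporting the spread across runs alongside the mean is what allows a statement like "the results differed by only about 1 percent"; a large spread would suggest the score depends heavily on which images happened to land in the training half.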
“It normally takes a trained person a few hours to describe one leaf according to the standard protocol, which uses about fifty terms,” says Wilf. “The computer program is thousands of times faster, automatically generates a dictionary of more than 1,000 elements, and then actually shows us what parts of the leaf are diagnostic.”
Instead of producing only a black box of results, the computer generates a “heat” map directly on the leaf image, identifying and rating areas of importance for correct identification. This approach generates a flood of previously hidden botanical information.
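One generic way to build an importance map over an image is occlusion sensitivity: blank out one patch at a time and record how much the classifier's score drops. The sketch below illustrates that idea only. It is a swapped-in technique, not necessarily how the study's heat maps were computed, and `toy_score` is a hypothetical scoring function standing in for a trained classifier.

```python
def occlusion_heat_map(image, score_fn, patch=2, fill=0.0):
    """Return a grid where each cell holds the score drop caused by
    blanking out the patch that contains it (bigger drop = more important)."""
    rows, cols = len(image), len(image[0])
    baseline = score_fn(image)
    heat = [[0.0] * cols for _ in range(rows)]
    for r0 in range(0, rows, patch):
        for c0 in range(0, cols, patch):
            occluded = [row[:] for row in image]  # copy so the input is untouched
            cells = [(r, c)
                     for r in range(r0, min(r0 + patch, rows))
                     for c in range(c0, min(c0 + patch, cols))]
            for r, c in cells:
                occluded[r][c] = fill
            drop = baseline - score_fn(occluded)
            for r, c in cells:
                heat[r][c] = drop
    return heat


def toy_score(image):
    """Hypothetical classifier score that depends only on the top-left 2x2 corner."""
    return sum(image[r][c] for r in range(2) for c in range(2))


image = [[1.0] * 4 for _ in range(4)]
heat = occlusion_heat_map(image, toy_score)
for row in heat:
    print(row)
```

In this toy case only the top-left patch receives a nonzero value, because that is the only region the hypothetical score depends on. On a real leaf image, the high-value regions would be the candidate diagnostic areas for a botanist to inspect.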
Roses and coffee
Wilf notes that leaf teeth in the rose family have always been considered distinctive, but the heat maps highlight previously unknown features of their tips. Leaves of the coffee family, which has 13,000 living species, are very hard to identify when not attached to twigs, but the computer program found that family to be one of the least problematic, identifying it with 90 percent accuracy.
The ability of computer vision to classify leaves quickly and to generate vast quantities of new botanical knowledge will allow scientists to develop more accurate evolutionary pedigrees for plants and plant fossils.
Serre says he was excited to contribute to a novel example of computer vision technology aiding scientific research. Computer vision has been applied to leaf classification before, but earlier efforts attempted only species-level classification and typically relied on leaf shape. He says he has begun to strike up collaborations with Brown plant scientists such as Andrew Leslie, assistant professor of ecology and evolutionary biology, to see how else machine vision could help the field.
“I think it can change the way we do science,” Serre says. “We can do things with computer vision that would be simply impossible if we were to rely on human annotations.”
At Brown, Serre worked with former postdoc Shengping Zhang on the study. Other authors are Sharat Chikkerur of Microsoft, Stefan Little of Penn State, and Scott Wing of the Smithsonian Institution.