Additional Materials#

Plant traits - TRY database#

The TRY database contains trait measurements from individual plants and typically holds multiple measurements per trait and species. We want to extract one mean value per trait and species.

We have prepared the data for this course. For future reference: to download data from the TRY database yourself, create an account at https://www.try-db.org/de.

We choose the option of open access data only, but the curators of this database still require you to add a short project description to your download request. You will then be sent a download link via e-mail.

For this study we will use the continuous (con) traits used in the sPlot analysis by Bruelheide et al. (2018):

Trait	ID	Unit
Leaf area (in case of compound leaves: leaflet, undefined if petiole is in- or excluded)	3113	mm^2
Leaf area per leaf dry mass (specific leaf area, SLA or 1/LMA): undefined if petiole is in- or excluded	3117	m^2/kg
Stem specific density (SSD) or wood density (stem dry mass per stem fresh volume)	4	g/cm^3
Leaf carbon (C) content per leaf dry mass	13	mg/g
Leaf nitrogen (N) content per leaf dry mass	14	mg/g
Leaf phosphorus (P) content per leaf dry mass	15	mg/g
Plant height vegetative	3106	m
Seed dry mass	26	mg
Seed length	27	mm
Leaf dry mass per leaf fresh mass (leaf dry matter content, LDMC)	47	g/g
Leaf nitrogen (N) content per leaf area	50	g/m^2
Leaf nitrogen/phosphorus (N/P) ratio	56	g/g
Leaf nitrogen (N) isotope signature (delta 15N)	78	ppm
Seed number per reproducton unit	138
Leaf fresh mass	163	g
Stem conduit density (vessels and tracheids)	169	mm^-2
Dispersal unit length	237	mm
Wood vessel element length; stem conduit (vessel and tracheids) element length	282	μm

When asked which traits you would like to download, type in the following list. This filters TRY data for our traits of interest, listed in the table above.

3113, 3117, 4, 13, 14, 15, 3106, 26, 27, 47, 50, 56, 78, 138, 163, 169, 237, 282
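If the IDs are needed again later in code, the same list can be kept as a Python list, for example to check that a download contains only the requested traits. This is a minimal sketch; the names `TRAIT_IDS` and `unexpected_trait_ids` are illustrative and not part of the original notebook.

```python
# Trait IDs requested from TRY (copied from the list above)
TRAIT_IDS = [3113, 3117, 4, 13, 14, 15, 3106, 26, 27, 47, 50, 56, 78,
             138, 163, 169, 237, 282]


def unexpected_trait_ids(found_ids):
    """Return the sorted trait IDs in found_ids that were not requested."""
    requested = set(TRAIT_IDS)
    return sorted({int(i) for i in found_ids} - requested)


print(len(TRAIT_IDS))
# 999 is a made-up ID that is not in the request
print(unexpected_trait_ids([26.0, 14.0, 999.0]))
```

With the data frame loaded in the next section, a call such as `unexpected_trait_ids(TRYdata["TraitID"].dropna().unique())` should return an empty list.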

Load Data#

First, load the TRY data as a data frame, selecting only the following columns:

AccSpeciesName - Consolidated species name
SpeciesName - Species name
TraitID - Unique identifier for traits (only if the record is a trait)
TraitName - Name of trait (only if the record is a trait)
StdValue - Standardized value: available for standardized traits

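import numpy as np   # numerical functions (NaN handling, log transform)
import pandas as pd  # data frames

TRYdata = pd.read_csv("Data/iNaturalist/Data/TRY/19287.txt", sep="\t", encoding="iso-8859-1",
                      usecols=["AccSpeciesName", "SpeciesName", "TraitID", "TraitName", "StdValue"],
                      dtype={'TraitID': float})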

/tmp/ipykernel_2918340/474720487.py:1: DtypeWarning: Columns (10) have mixed types. Specify dtype option on import or set low_memory=False.
  TRYdata = pd.read_csv("TRY/19287.txt", sep = "\t", encoding="iso-8859-1",

TRYdata.head()

	SpeciesName	AccSpeciesName	TraitID	TraitName	StdValue
0	Acer campestre	Acer campestre	NaN	NaN	NaN
1	Acer campestre	Acer campestre	26.0	Seed dry mass	14.38
2	Acer platanoides	Acer platanoides	NaN	NaN	NaN
3	Acer platanoides	Acer platanoides	26.0	Seed dry mass	59.90
4	Acer pseudoplatanus	Acer pseudoplatanus	NaN	NaN	NaN
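To see what the usecols and dtype options do, the following sketch reads a tiny in-memory stand-in for a TRY export. The file content and the extra `Comment` column are made up for illustration; only the five selected column names come from the list above. Note that the resulting columns keep the order of the file, not the order given in usecols, which is why SpeciesName comes first in the output above.

```python
import io

import pandas as pd

# Tiny stand-in for a TRY export: tab-separated, with one extra column
# ("Comment", made up) that we do not want to load
raw = (
    "SpeciesName\tAccSpeciesName\tTraitID\tTraitName\tStdValue\tComment\n"
    "Acer campestre\tAcer campestre\t\t\t\tnon-trait record\n"
    "Acer campestre\tAcer campestre\t26\tSeed dry mass\t14.38\t\n"
)

df = pd.read_csv(io.StringIO(raw), sep="\t",
                 usecols=["AccSpeciesName", "SpeciesName", "TraitID", "TraitName", "StdValue"],
                 dtype={"TraitID": float})

# columns follow the file order; empty fields become NaN
print(list(df.columns))
print(df["TraitID"].tolist())
```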

# drop rows without a trait ID (records that are not trait measurements)
TRYdata = TRYdata.dropna(subset=["TraitID"])

# check number of unique trait IDs
TRYdata["TraitID"].nunique()

# number of unique species
TRYdata["AccSpeciesName"].nunique()
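Restricting dropna to the TraitID column matters: it removes only the non-trait records, whereas a plain dropna() would also discard trait records that lack a standardized value. A small sketch with a toy data frame (the third row is made up to show the difference):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "AccSpeciesName": ["Acer campestre", "Acer campestre", "Acer platanoides"],
    "TraitID": [np.nan, 26.0, 26.0],
    "TraitName": [np.nan, "Seed dry mass", "Seed dry mass"],
    # last row: a trait record without a standardized value (made up)
    "StdValue": [np.nan, 14.38, np.nan],
})

trait_rows = toy.dropna(subset=["TraitID"])  # keeps both trait records
complete_rows = toy.dropna()                 # keeps only fully populated rows

print(len(trait_rows), len(complete_rows))
```

Rows without a StdValue do not disturb the later averaging, because the pandas mean skips NaN by default.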

We remove author annotation and subspecies information from species names.

# convert to string first, so that the string operations below cannot fail on missing values
TRYdata['AccSpeciesName'] = TRYdata['AccSpeciesName'].astype(str)
# make all letters lower case
TRYdata['AccSpeciesName'] = TRYdata['AccSpeciesName'].str.lower()
# capitalize first letter in string (the genus)
TRYdata['AccSpeciesName'] = TRYdata['AccSpeciesName'].str.capitalize()
# keep only the first two words (genus and species epithet), split at whitespace
TRYdata['AccSpeciesName'] = TRYdata['AccSpeciesName'].apply(lambda x: ' '.join(x.split()[0:2]))

# same for species name
TRYdata['SpeciesName'] = TRYdata['SpeciesName'].str.lower()
TRYdata['SpeciesName'] = TRYdata['SpeciesName'].str.capitalize()
TRYdata['SpeciesName'] = TRYdata['SpeciesName'].astype(str)
TRYdata['SpeciesName']  = TRYdata['SpeciesName'].apply(lambda x: ' '.join(x.split()[0:2]))

TRYdata['AccSpeciesName'].nunique()

TRYdata['SpeciesName'].nunique()

Check for duplicate names#

TRY_sp = TRYdata["AccSpeciesName"].apply(str)
TRY_sp = TRY_sp.unique()
len(TRY_sp)

from rapidfuzz import process, fuzz

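def fuzzy_match(choices, queries, cutoff):
    # for every query, collect the candidate matches that score at least `cutoff`;
    # each resulting row is (query, match, score, index of the match in choices)
    score_sort = [(x,) + i
             for x in queries
             for i in process.extract(x, choices, score_cutoff=cutoff, scorer=fuzz.token_sort_ratio) ]
    
    similarity_sort = pd.DataFrame(score_sort)
    # drop perfect scores (column 2), which removes each name's match with itself
    similarity_sort = similarity_sort[similarity_sort[2] != 100.0]
    return similarity_sort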

TRY_matches = fuzzy_match(TRY_sp, TRY_sp, 95)

TRY_matches.head()

TRY_matches[0].nunique()

(len(TRY_matches)/2)/len(TRY_sp)

Only about 0.5% of the unique species names in TRY have potential duplicates (very similar names). Since we are working at very large scales, we can disregard this slight uncertainty and accept that these species might not be matched to the iNaturalist observations.

We divide the number of matches by 2, since every pair is listed twice (with positions switched).
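The double counting can be reproduced on a toy list of names. The sketch below uses difflib.SequenceMatcher from the Python standard library instead of rapidfuzz, so the scores are not identical to those of fuzz.token_sort_ratio; the function name `similar_pairs`, the cutoff of 85 and the misspelled name are assumptions made for illustration.

```python
from difflib import SequenceMatcher
from itertools import product


def similar_pairs(names, cutoff):
    """Return (query, match, score) for all non-identical pairs with score >= cutoff.

    Scores range from 0 to 100. Every name is compared against every other
    name, as in the self-match above.
    """
    pairs = []
    for a, b in product(names, repeat=2):
        if a == b:
            continue  # skip the match of a name with itself
        score = 100 * SequenceMatcher(None, a, b).ratio()
        if score >= cutoff:
            pairs.append((a, b, score))
    return pairs


# "Acer campestris" is a made-up misspelling
names = ["Acer campestre", "Acer campestris", "Quercus robur"]
pairs = similar_pairs(names, cutoff=85)

# each pair shows up as (a, b) and as (b, a); sorting the two names
# makes both versions identical, so the set keeps only one of them
unique_pairs = {tuple(sorted((a, b))) for a, b, _ in pairs}

print(len(pairs), len(unique_pairs))
```

Here the list of matches has two rows but describes a single pair of similar names, which is why the number of matches is halved.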

Create summary stats with consolidated species name#

Use the groupby function to group the data by consolidated species name and trait. The grouping variables are AccSpeciesName, TraitID and TraitName.

More information: https://www.tutorialspoint.com/python_pandas/python_pandas_groupby.htm

# group data by species name and trait
grouped = TRYdata.groupby(['AccSpeciesName', 'TraitID', 'TraitName'])
# mean of the standardized values per species and trait
TRY = grouped['StdValue'].agg(['mean']).reset_index()

#check output
TRY.head()

	AccSpeciesName	TraitID	TraitName	mean
0	Aa	14.0	Leaf nitrogen (N) content per leaf dry mass	26.400000
1	Aa	50.0	Leaf nitrogen (N) content per leaf area	2.798400
2	Aa	3117.0	Leaf area per leaf dry mass (specific leaf are...	9.433962
3	Aaronsohnia pubescens	3106.0	Plant height vegetative	0.200000
4	Abacaba (palm)	3106.0	Plant height vegetative	15.000000
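How several individual measurements collapse into one mean per species and trait can be seen on a toy table. The values below are made up; the column names follow the TRY columns used above.

```python
import pandas as pd

measurements = pd.DataFrame({
    "AccSpeciesName": ["Acer campestre", "Acer campestre", "Acer platanoides"],
    "TraitID": [26.0, 26.0, 26.0],
    "TraitName": ["Seed dry mass"] * 3,
    "StdValue": [10.0, 20.0, 60.0],  # made-up measurements
})

# one row per (species, trait) combination with the mean of all measurements
means = (measurements
         .groupby(["AccSpeciesName", "TraitID", "TraitName"])["StdValue"]
         .agg(["mean"])
         .reset_index())

print(means)
```

The two measurements of Acer campestre are averaged into a single row, while the species with one measurement keeps its value.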

def shorten_names(df):
    # map the long TRY trait names to short column names; traits not listed here
    # (Seed length, Leaf fresh mass, Dispersal unit length) keep their names
    short_names = {
        'Stem specific density (SSD) or wood density (stem dry mass per stem fresh volume)': 'SSD',
        'Leaf carbon (C) content per leaf dry mass': 'Leaf C',
        'Leaf nitrogen (N) content per leaf dry mass': 'Leaf N per mass',
        'Leaf phosphorus (P) content per leaf dry mass': 'Leaf P',
        'Leaf dry mass per leaf fresh mass (leaf dry matter content, LDMC)': 'LDMC',
        'Seed dry mass': 'Seed mass',
        'Leaf nitrogen (N) content per leaf area': 'Leaf N per area',
        'Leaf nitrogen/phosphorus (N/P) ratio': 'Leaf N P ratio',
        'Leaf nitrogen (N) isotope signature (delta 15N)': 'Leaf delta15N',
        # 'reproducton' (sic): spelled as in the trait name used in the data
        'Seed number per reproducton unit': 'Seeds per rep. unit',
        'Stem conduit density (vessels and tracheids)': 'Stem conduit density',
        'Wood vessel element length; stem conduit (vessel and tracheids) element length': 'Conduit element length',
        'Plant height vegetative': 'Plant Height',
        'Leaf area (in case of compound leaves: leaflet, undefined if petiole is in- or excluded)': 'Leaf Area',
        'Leaf area per leaf dry mass (specific leaf area, SLA or 1/LMA): undefined if petiole is in- or excluded': 'SLA',
    }
    # rename in place, so the data frame passed in is modified directly
    df.rename(columns=short_names, inplace=True)

Change the data frame from long to wide format using pandas.DataFrame.pivot and shorten the trait names.

TRY = TRY.pivot(index=["AccSpeciesName"], columns="TraitName", values="mean")

# move the index (species name) back into a regular column
TRY.reset_index(inplace=True)

# rename trait variables to shorter names
shorten_names(TRY)

TRY.head(3)

TraitName	AccSpeciesName	Dispersal unit length	Leaf Area	SLA	Leaf C	LDMC	Leaf fresh mass	Leaf N per area	Leaf N per mass	Leaf delta15N	Leaf N P ratio	Leaf P	Plant Height	Seed mass	Seed length	Seeds per rep. unit	Stem conduit density	SSD	Conduit element length
0	Aa	NaN	NaN	9.433962	NaN	NaN	NaN	2.7984	26.4	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1	Aaronsohnia pubescens	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	0.2	NaN	NaN	NaN	NaN	NaN	NaN
2	Abacaba (palm)	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	15.0	NaN	NaN	NaN	NaN	NaN	NaN
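The long-to-wide step is easiest to follow on a few rows. The sketch below pivots three of the values shown earlier, using the shortened trait names directly for brevity; the variable names `long_df` and `wide_df` are illustrative.

```python
import pandas as pd

# long format: one row per (species, trait) combination
long_df = pd.DataFrame({
    "AccSpeciesName": ["Aa", "Aa", "Aaronsohnia pubescens"],
    "TraitName": ["Leaf N per mass", "SLA", "Plant Height"],
    "mean": [26.4, 9.433962, 0.2],
})

# wide format: one row per species, one column per trait
wide_df = long_df.pivot(index="AccSpeciesName", columns="TraitName", values="mean")
wide_df = wide_df.reset_index()

print(wide_df)
```

Every species/trait combination that has no measurement becomes NaN in the wide table, which explains the many NaN entries in the output above.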

# Optional: Save file
#TRY.to_csv("TRY/TRY_summary_stats.csv", index=False)

Create summary stats with original name#

# group data by original species name and trait, same analysis as above
grouped_syn = TRYdata.groupby(['SpeciesName', 'TraitID', 'TraitName'])

TRY_syn = grouped_syn['StdValue'].agg(['mean']).reset_index()

# change df shape from long to wide
TRY_syn = TRY_syn.pivot(index=["SpeciesName"], columns="TraitName", values="mean")

# move the index (species name) back into a regular column
TRY_syn.reset_index(inplace=True)

# shorten column names
shorten_names(TRY_syn)

#optional
#TRY_syn.to_csv("TRY/TRY_summary_stats_syn.csv", index=False)

TRY_syn.head(3)

TraitName	SpeciesName	Dispersal unit length	Leaf Area	SLA	Leaf C	LDMC	Leaf fresh mass	Leaf N per area	Leaf N per mass	Leaf delta15N	Leaf N P ratio	Leaf P	Plant Height	Seed mass	Seed length	Seeds per rep. unit	Stem conduit density	SSD	Conduit element length
0	(fabaceae)	NaN	NaN	21.3385	NaN	NaN	NaN	1.578157	33.150000	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1	(fabaceae) 20-25oblong	NaN	NaN	NaN	NaN	NaN	NaN	1.761453	32.513864	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
2	(fabaceae) brillafuzzy	NaN	NaN	NaN	NaN	NaN	NaN	1.397197	33.837593	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

Link iNaturalist to TRY#

Exact (non-fuzzy) merge with the TRY summary stats on the consolidated TRY species name:

import pandas as pd # for handling dataframes in python

iNat = pd.read_csv('iNat_observations.csv')

iNat_TRY = pd.merge(iNat, TRY, 
                    left_on= ['scientificName'],
                    right_on= ['AccSpeciesName'], 
                    how='inner')
iNat_TRY.head(3)

	gbifID	scientificName	decimalLatitude	decimalLongitude	eventDate	dateIdentified	AccSpeciesName	Dispersal unit length	Leaf Area	SLA	...	Leaf delta15N	Leaf N P ratio	Leaf P	Plant Height	Seed mass	Seed length	Seeds per rep. unit	Stem conduit density	SSD	Conduit element length
0	1229615436	Commelina communis	35.987483	-79.057546	2013-07-07T00:00:00	2013-07-07T20:33:11	Commelina communis	NaN	NaN	NaN	...	NaN	12.631579	1.71	NaN	8.48	NaN	NaN	NaN	NaN	NaN
1	3384000233	Commelina communis	42.093762	-75.923660	2021-08-23T13:06:06	2021-09-17T21:15:37	Commelina communis	NaN	NaN	NaN	...	NaN	12.631579	1.71	NaN	8.48	NaN	NaN	NaN	NaN	NaN
2	1807276585	Commelina communis	40.787636	-73.933728	2017-09-04T12:47:58	2017-09-04T21:58:57	Commelina communis	NaN	NaN	NaN	...	NaN	12.631579	1.71	NaN	8.48	NaN	NaN	NaN	NaN	NaN

3 rows × 25 columns

Extract from iNaturalist those observations that have not been matched:

# filter for observations not in merged dataframe:
iNat_rest = iNat[~iNat.gbifID.isin(iNat_TRY['gbifID'])]
iNat_rest.shape

(2541013, 6)

We repeat the same with the ‘original’ species name in TRY:

# non-fuzzy merge with TRY summary stats on original TRY species name:

iNat_TRY_syn = pd.merge(iNat_rest, TRY_syn, 
                    left_on= ['scientificName'],
                    right_on= ['SpeciesName'], 
                    how='inner')
iNat_TRY_syn.head(3)

	gbifID	scientificName	decimalLatitude	decimalLongitude	eventDate	dateIdentified	SpeciesName	Dispersal unit length	Leaf Area	SLA	...	Leaf delta15N	Leaf N P ratio	Leaf P	Plant Height	Seed mass	Seed length	Seeds per rep. unit	Stem conduit density	SSD	Conduit element length
0	1802610589	Blitum capitatum	40.320259	-105.604856	2013-08-24T13:30:00	2019-09-02T01:11:54	Blitum capitatum	NaN	NaN	NaN	...	NaN	NaN	NaN	0.45	NaN	NaN	NaN	NaN	NaN	NaN
1	2283078677	Blitum capitatum	50.744232	-120.511303	2019-06-29T17:50:28	2019-09-02T01:16:41	Blitum capitatum	NaN	NaN	NaN	...	NaN	NaN	NaN	0.45	NaN	NaN	NaN	NaN	NaN	NaN
2	2864818488	Blitum capitatum	53.938056	-106.068553	2020-08-22T12:22:09	2020-08-22T19:13:24	Blitum capitatum	NaN	NaN	NaN	...	NaN	NaN	NaN	0.45	NaN	NaN	NaN	NaN	NaN	NaN

3 rows × 25 columns

subsets = [iNat_TRY, iNat_TRY_syn]

iNat_TRY_all = pd.concat(subsets)
iNat_TRY_all = iNat_TRY_all.drop(['AccSpeciesName', 'SpeciesName'], axis = 1)
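The two-stage matching can be traced on a toy example with three observations. The two trait values are taken from the outputs above; the gbifID numbers, the unmatched name and all variable names are made up.

```python
import pandas as pd

nan = float("nan")

obs = pd.DataFrame({
    "gbifID": [1, 2, 3],
    "scientificName": ["Commelina communis", "Blitum capitatum", "Unknown species"],
})
# trait means keyed by consolidated name and by original name
traits_acc = pd.DataFrame({"AccSpeciesName": ["Commelina communis"],
                           "Plant Height": [nan], "Seed mass": [8.48]})
traits_syn = pd.DataFrame({"SpeciesName": ["Blitum capitatum"],
                           "Plant Height": [0.45], "Seed mass": [nan]})

# stage 1: exact match on the consolidated name
first = pd.merge(obs, traits_acc, left_on="scientificName",
                 right_on="AccSpeciesName", how="inner")

# stage 2: only the still unmatched observations, matched on the original name
rest = obs[~obs["gbifID"].isin(first["gbifID"])]
second = pd.merge(rest, traits_syn, left_on="scientificName",
                  right_on="SpeciesName", how="inner")

# stack both results and drop the two TRY name columns
linked = pd.concat([first, second]).drop(columns=["AccSpeciesName", "SpeciesName"])
print(linked)
```

Because the second merge only sees the unmatched rest, no observation can be linked twice. Observations that match neither name, like the third one here, are dropped.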

# replace infinite values with NaN
iNat_TRY_all = iNat_TRY_all.replace([np.inf, -np.inf], np.nan)

iNat_TRY_all.head()

	gbifID	scientificName	decimalLatitude	decimalLongitude	eventDate	dateIdentified	Dispersal unit length	Leaf Area	SLA	Leaf C	...	Leaf delta15N	Leaf N P ratio	Leaf P	Plant Height	Seed mass	Seed length	Seeds per rep. unit	Stem conduit density	SSD	Conduit element length
0	1229615436	Commelina communis	35.987483	-79.057546	2013-07-07T00:00:00	2013-07-07T20:33:11	NaN	NaN	NaN	NaN	...	NaN	12.631579	1.71	NaN	8.48	NaN	NaN	NaN	NaN	NaN
1	3384000233	Commelina communis	42.093762	-75.923660	2021-08-23T13:06:06	2021-09-17T21:15:37	NaN	NaN	NaN	NaN	...	NaN	12.631579	1.71	NaN	8.48	NaN	NaN	NaN	NaN	NaN
2	1807276585	Commelina communis	40.787636	-73.933728	2017-09-04T12:47:58	2017-09-04T21:58:57	NaN	NaN	NaN	NaN	...	NaN	12.631579	1.71	NaN	8.48	NaN	NaN	NaN	NaN	NaN
3	3355124418	Commelina communis	39.643158	-76.764245	2020-08-26T10:19:56	2020-08-27T13:21:22	NaN	NaN	NaN	NaN	...	NaN	12.631579	1.71	NaN	8.48	NaN	NaN	NaN	NaN	NaN
4	1802638502	Commelina communis	43.109505	1.622543	2017-10-21T10:01:00	2017-10-21T09:02:42	NaN	NaN	NaN	NaN	...	NaN	12.631579	1.71	NaN	8.48	NaN	NaN	NaN	NaN	NaN

5 rows × 24 columns

trait = iNat_TRY_all.columns[6:24]

iNat_TRY_all.loc[:, trait] = np.log(iNat_TRY_all[trait])

/net/home/swolf/.conda/envs/traitmaps/lib/python3.8/site-packages/pandas/core/internals/blocks.py:402: RuntimeWarning: divide by zero encountered in log
  result = func(self.values, **kwargs)
/net/home/swolf/.conda/envs/traitmaps/lib/python3.8/site-packages/pandas/core/internals/blocks.py:402: RuntimeWarning: invalid value encountered in log
  result = func(self.values, **kwargs)
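The two warnings come from trait means that are zero (np.log returns -inf) or negative (np.log returns NaN). This means that infinite values can appear again after the log transform. One possible way to avoid both is to mask all non-positive values before taking the logarithm. This is a sketch with made-up values, not the method used in this notebook; whether discarding non-positive values is appropriate depends on the trait, since an isotope signature such as delta 15N can legitimately be negative.

```python
import numpy as np
import pandas as pd

values = pd.DataFrame({
    "Seed mass": [8.48, 0.0, np.nan],
    "Leaf delta15N": [2.0, -1.5, 4.0],  # made-up values
})

# keep strictly positive entries; zeros, negatives and NaN become NaN
positive_only = values.where(values > 0)

# the log now only sees positive numbers or NaN, so no -inf is produced
logged = np.log(positive_only)

print(logged)
```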

iNat_TRY_all.to_csv("iNat_TRY_log.csv", index=False)

iNat_TRY_all.head()

	gbifID	scientificName	decimalLatitude	decimalLongitude	eventDate	dateIdentified	Dispersal unit length	Leaf Area	SLA	Leaf C	...	Leaf delta15N	Leaf N P ratio	Leaf P	Plant Height	Seed mass	Seed length	Seeds per rep. unit	Stem conduit density	SSD	Conduit element length
0	1229615436	Commelina communis	35.987483	-79.057546	2013-07-07T00:00:00	2013-07-07T20:33:11	NaN	NaN	NaN	NaN	...	NaN	2.5362	0.536493	NaN	2.13771	NaN	NaN	NaN	NaN	NaN
1	3384000233	Commelina communis	42.093762	-75.923660	2021-08-23T13:06:06	2021-09-17T21:15:37	NaN	NaN	NaN	NaN	...	NaN	2.5362	0.536493	NaN	2.13771	NaN	NaN	NaN	NaN	NaN
2	1807276585	Commelina communis	40.787636	-73.933728	2017-09-04T12:47:58	2017-09-04T21:58:57	NaN	NaN	NaN	NaN	...	NaN	2.5362	0.536493	NaN	2.13771	NaN	NaN	NaN	NaN	NaN
3	3355124418	Commelina communis	39.643158	-76.764245	2020-08-26T10:19:56	2020-08-27T13:21:22	NaN	NaN	NaN	NaN	...	NaN	2.5362	0.536493	NaN	2.13771	NaN	NaN	NaN	NaN	NaN
4	1802638502	Commelina communis	43.109505	1.622543	2017-10-21T10:01:00	2017-10-21T09:02:42	NaN	NaN	NaN	NaN	...	NaN	2.5362	0.536493	NaN	2.13771	NaN	NaN	NaN	NaN	NaN

5 rows × 24 columns

After matching on both the consolidated and the original species name, we were able to link about 84% of the iNaturalist observations with trait information, but only about 31% of the iNaturalist species. Many rare species seem to be missing from one of the two databases.

print('fraction of iNat observations linked with at least one TRY trait:')
print(len(iNat_TRY_all)/len(iNat))

print('fraction of species in iNaturalist matched with TRY:')
print(iNat_TRY_all["scientificName"].nunique()/iNat["scientificName"].nunique())

fraction of iNat observations linked with at least one TRY trait:
0.8421341704587321
fraction of species in iNaturalist matched with TRY:
0.3059127945386479
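The two ratios can differ this much when a few common species account for most observations. A toy calculation in plain Python, with made-up species labels, shows the effect:

```python
# ten observations: one common species seen eight times, two rare species seen once
obs_species = ["common a"] * 8 + ["rare b", "rare c"]
# only the common species has trait data
matched_species = {"common a"}

# share of observations whose species has trait data
obs_share = sum(s in matched_species for s in obs_species) / len(obs_species)

# share of distinct species that have trait data
all_species = set(obs_species)
species_share = len(matched_species & all_species) / len(all_species)

print(obs_share, species_share)
```

Here 80% of the observations are matched although only one of three species has trait data, the same pattern as in the figures above.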
