Additional Materials
Plant traits - TRY database
The TRY database contains trait measurements from individual plants, typically with multiple measurements per trait and species. We want to compute one mean value per trait and species.
We have prepared the data for this course. For future reference, to download data from the TRY database yourself, create an account at https://www.try-db.org/de.
We choose the option of open access data only, but the curators of this database still require you to add a short project description to your download request. You will then be sent a download link via e-mail.
For this study we will use the continuous (con) traits used in the sPlot analysis by Bruelheide et al. (2018):
Trait | ID | Unit |
---|---|---|
Leaf area (in case of compound leaves: leaflet, undefined if petiole is in- or excluded) | 3113 | mm^2 |
Leaf area per leaf dry mass (specific leaf area, SLA or 1/LMA): undefined if petiole is in- or excluded | 3117 | m^2/kg |
Stem specific density (SSD) or wood density (stem dry mass per stem fresh volume) | 4 | g/cm^3 |
Leaf carbon (C) content per leaf dry mass | 13 | mg/g |
Leaf nitrogen (N) content per leaf dry mass | 14 | mg/g |
Leaf phosphorus (P) content per leaf dry mass | 15 | mg/g |
Plant height vegetative | 3106 | m |
Seed dry mass | 26 | mg |
Seed length | 27 | mm |
Leaf dry mass per leaf fresh mass (leaf dry matter content, LDMC) | 47 | g/g |
Leaf nitrogen (N) content per leaf area | 50 | g/m^2 |
Leaf nitrogen/phosphorus (N/P) ratio | 56 | g/g |
Leaf nitrogen (N) isotope signature (delta 15N) | 78 | ppm |
Seed number per reproducton unit | 138 | |
Leaf fresh mass | 163 | g |
Stem conduit density (vessels and tracheids) | 169 | mm^-2 |
Dispersal unit length | 237 | mm |
Wood vessel element length; stem conduit (vessel and tracheids) element length | 282 | μm |
When asked which traits you would like to download, type in the following list. This filters TRY data for our traits of interest, listed in the table above.
3113, 3117, 4, 13, 14, 15, 3106, 26, 27, 47, 50, 56, 78, 138, 163, 169, 237, 282
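If a download ever contains more traits than requested, the same ID list can be used to filter the records locally. The following is a minimal sketch on made-up records; the toy data frame and the trait ID 9999 are assumptions for illustration and do not come from TRY.

```python
import pandas as pd

# Trait IDs of interest, copied from the list above
trait_ids = [3113, 3117, 4, 13, 14, 15, 3106, 26, 27,
             47, 50, 56, 78, 138, 163, 169, 237, 282]

# Hypothetical toy records; a real TRY export has many more columns
records = pd.DataFrame({
    "AccSpeciesName": ["Acer campestre", "Acer campestre", "Quercus robur"],
    "TraitID": [26, 9999, 4],  # 9999 is a made-up ID that is not in our list
    "StdValue": [14.38, 1.0, 0.56],
})

# Keep only the rows whose TraitID is one of the traits of interest
selected = records[records["TraitID"].isin(trait_ids)]
print(selected)
```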
Load Data
First, load the TRY data as a data frame, selecting only the following columns:
AccSpeciesName - Consolidated species name
SpeciesName - Species name
TraitID - Unique identifier for traits (only if the record is a trait)
TraitName - Name of trait (only if the record is a trait)
StdValue - Standardized value: available for standardized traits
import numpy as np
import pandas as pd

TRYdata = pd.read_csv("Data/iNaturalist/Data/TRY/19287.txt", sep = "\t", encoding="iso-8859-1",
                      usecols = ["AccSpeciesName", "SpeciesName", "TraitID", "TraitName", "StdValue"],
                      dtype={'TraitID': float})
/tmp/ipykernel_2918340/474720487.py:1: DtypeWarning: Columns (10) have mixed types. Specify dtype option on import or set low_memory=False.
TRYdata = pd.read_csv("TRY/19287.txt", sep = "\t", encoding="iso-8859-1",
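pandas raises a DtypeWarning like the one above when it infers different types for different chunks of a column. A possible way to avoid it is to declare the types of the columns you read, or to pass low_memory=False. The sketch below uses a tiny in-memory table instead of the real file; the table contents and the choice to also declare StdValue as float are assumptions for illustration.

```python
import io
import pandas as pd

# Hypothetical miniature of a tab-separated TRY export
raw = (
    "SpeciesName\tAccSpeciesName\tTraitID\tTraitName\tStdValue\n"
    "Acer campestre\tAcer campestre\t\t\t\n"
    "Acer campestre\tAcer campestre\t26\tSeed dry mass\t14.38\n"
)

cols = ["AccSpeciesName", "SpeciesName", "TraitID", "TraitName", "StdValue"]
toy = pd.read_csv(
    io.StringIO(raw),
    sep="\t",
    usecols=cols,
    dtype={"TraitID": float, "StdValue": float},  # explicit types, nothing to guess
    low_memory=False,  # parse the file in one pass instead of in chunks
)
print(toy.dtypes)
```

Empty fields are read as NaN, which is why the non-trait rows show NaN in TraitID, TraitName and StdValue.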
TRYdata.head()
| | SpeciesName | AccSpeciesName | TraitID | TraitName | StdValue |
---|---|---|---|---|---|
0 | Acer campestre | Acer campestre | NaN | NaN | NaN |
1 | Acer campestre | Acer campestre | 26.0 | Seed dry mass | 14.38 |
2 | Acer platanoides | Acer platanoides | NaN | NaN | NaN |
3 | Acer platanoides | Acer platanoides | 26.0 | Seed dry mass | 59.90 |
4 | Acer pseudoplatanus | Acer pseudoplatanus | NaN | NaN | NaN |
# drop rows without a trait ID (records that are not trait measurements)
TRYdata = TRYdata.dropna(subset=["TraitID"])
# check number of unique trait IDs
TRYdata["TraitID"].nunique()
18
# number of unique species
TRYdata["AccSpeciesName"].nunique()
54739
We remove author annotation and subspecies information from species names.
# make all letters lower case
TRYdata['AccSpeciesName'] = TRYdata['AccSpeciesName'].str.lower()
# capitalize first letter in string
TRYdata['AccSpeciesName'] = TRYdata['AccSpeciesName'].str.capitalize()
# change type to string first, so that missing names cannot break the split below
TRYdata['AccSpeciesName'] = TRYdata['AccSpeciesName'].astype(str)
# keep only the first two words (genus and species epithet; split at whitespace)
TRYdata['AccSpeciesName'] = TRYdata['AccSpeciesName'].apply(lambda x: ' '.join(x.split()[0:2]))
# same for the original species name
TRYdata['SpeciesName'] = TRYdata['SpeciesName'].str.lower()
TRYdata['SpeciesName'] = TRYdata['SpeciesName'].str.capitalize()
TRYdata['SpeciesName'] = TRYdata['SpeciesName'].astype(str)
TRYdata['SpeciesName'] = TRYdata['SpeciesName'].apply(lambda x: ' '.join(x.split()[0:2]))
TRYdata['AccSpeciesName'].nunique()
51908
TRYdata['SpeciesName'].nunique()
61181
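The cleaning steps above can also be written as one vectorized helper. This is a minimal sketch, not the code used to produce the numbers above; the function name to_binomial and the example names are assumptions for illustration.

```python
import pandas as pd

def to_binomial(names: pd.Series) -> pd.Series:
    """Reduce names to 'Genus species' (the first two whitespace-separated words)."""
    # capitalize() upper-cases the first character and lower-cases the rest
    cleaned = names.astype(str).str.capitalize()
    # split at whitespace, keep the first two words, join them again
    return cleaned.str.split().str[:2].str.join(" ")

toy = pd.Series(["Acer campestre L.", "ACER PLATANOIDES", "Salix alba subsp. vitellina"])
print(to_binomial(toy).tolist())
```

Note that truncating to two words merges subspecies and varieties into their species, which is why the number of unique consolidated names drops after this step.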
Check for duplicate names
TRY_sp = TRYdata["AccSpeciesName"].apply(str)
TRY_sp = TRY_sp.unique()
len(TRY_sp)
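from rapidfuzz import process, fuzz

def fuzzy_match(choices, queries, cutoff):
    # for every query, collect the matches in choices that score at least cutoff;
    # each result row is (query, match, score, index of match in choices)
    score_sort = [(x,) + i
                  for x in queries
                  for i in process.extract(x, choices, score_cutoff=cutoff, scorer=fuzz.token_sort_ratio)]
    similarity_sort = pd.DataFrame(score_sort)
    # drop exact matches (score 100), i.e. every name matched with itself
    similarity_sort = similarity_sort[similarity_sort[2] != 100.0]
    return similarity_sort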
TRY_matches = fuzzy_match(TRY_sp, TRY_sp, 95)
TRY_matches.head()
TRY_matches[0].nunique()
(len(TRY_matches)/2)/len(TRY_sp)
Only 0.5% of the unique species names in TRY have potential duplicates (very similar names). Since we are working at very large scales, we can disregard this slight uncertainty and accept that these species might not be matched to the iNaturalist observations.
We divide the number of matches by 2, since every pair is listed twice (with positions switched).
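To see why each pair appears twice, consider matching a short list against itself. The sketch below uses difflib from the Python standard library as a stand-in similarity measure; it is not the rapidfuzz scorer used above, and the names and the cutoff of 85 are assumptions for illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical names; the second is a misspelling of the first
names = ["Acer campestre", "Acer campestris", "Quercus robur"]

def similarity(a: str, b: str) -> float:
    """Similarity on a 0-100 scale (difflib ratio, not rapidfuzz's token_sort_ratio)."""
    return 100 * SequenceMatcher(None, a, b).ratio()

cutoff = 85

# All ordered pairs, as produced when a list is matched against itself
ordered = [(a, b) for a in names for b in names
           if a != b and similarity(a, b) >= cutoff]

# Each unordered pair counted only once
unordered = [(a, b) for a, b in combinations(names, 2)
             if similarity(a, b) >= cutoff]

print(len(ordered), len(unordered))
```

The ordered list contains both (A, B) and (B, A), so its length is twice the number of distinct similar pairs.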
Create summary stats with consolidated species name
Use the groupby function to group the data by consolidated species name and trait; grouping variables: AccSpeciesName, TraitName, TraitID.
More information: https://www.tutorialspoint.com/python_pandas/python_pandas_groupby.htm
# group data by species name and trait
grouped = TRYdata.groupby(['AccSpeciesName', 'TraitID', 'TraitName'])
# mean of the standardized values per species and trait
TRY = grouped['StdValue'].agg(['mean']).reset_index()
# check output
TRY.head()
| | AccSpeciesName | TraitID | TraitName | mean |
---|---|---|---|---|
0 | Aa | 14.0 | Leaf nitrogen (N) content per leaf dry mass | 26.400000 |
1 | Aa | 50.0 | Leaf nitrogen (N) content per leaf area | 2.798400 |
2 | Aa | 3117.0 | Leaf area per leaf dry mass (specific leaf are... | 9.433962 |
3 | Aaronsohnia pubescens | 3106.0 | Plant height vegetative | 0.200000 |
4 | Abacaba (palm) | 3106.0 | Plant height vegetative | 15.000000 |
def shorten_names(df):
    # map the long TRY trait names to short column names (renames df in place);
    # 'Seed length', 'Leaf fresh mass' and 'Dispersal unit length' keep their names
    short_names = {
        'Stem specific density (SSD) or wood density (stem dry mass per stem fresh volume)': 'SSD',
        'Leaf carbon (C) content per leaf dry mass': 'Leaf C',
        'Leaf nitrogen (N) content per leaf dry mass': 'Leaf N per mass',
        'Leaf phosphorus (P) content per leaf dry mass': 'Leaf P',
        'Leaf dry mass per leaf fresh mass (leaf dry matter content, LDMC)': 'LDMC',
        'Seed dry mass': 'Seed mass',
        'Leaf nitrogen (N) content per leaf area': 'Leaf N per area',
        'Leaf nitrogen/phosphorus (N/P) ratio': 'Leaf N P ratio',
        'Leaf nitrogen (N) isotope signature (delta 15N)': 'Leaf delta15N',
        'Seed number per reproducton unit': 'Seeds per rep. unit',
        'Stem conduit density (vessels and tracheids)': 'Stem conduit density',
        'Wood vessel element length; stem conduit (vessel and tracheids) element length': 'Conduit element length',
        'Plant height vegetative': 'Plant Height',
        'Leaf area (in case of compound leaves: leaflet, undefined if petiole is in- or excluded)': 'Leaf Area',
        'Leaf area per leaf dry mass (specific leaf area, SLA or 1/LMA): undefined if petiole is in- or excluded': 'SLA',
    }
    df.rename(columns=short_names, inplace=True)
Change the data frame from long to wide using pandas.DataFrame.pivot, and shorten the trait names.
TRY = TRY.pivot(index=["AccSpeciesName"], columns="TraitName", values="mean")
# reset index (species name) to a column in the data frame
TRY.reset_index(inplace=True)
# rename trait variables to shorter names
shorten_names(TRY)
TRY.head(3)
TraitName | AccSpeciesName | Dispersal unit length | Leaf Area | SLA | Leaf C | LDMC | Leaf fresh mass | Leaf N per area | Leaf N per mass | Leaf delta15N | Leaf N P ratio | Leaf P | Plant Height | Seed mass | Seed length | Seeds per rep. unit | Stem conduit density | SSD | Conduit element length |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Aa | NaN | NaN | 9.433962 | NaN | NaN | NaN | 2.7984 | 26.4 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1 | Aaronsohnia pubescens | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.2 | NaN | NaN | NaN | NaN | NaN | NaN |
2 | Abacaba (palm) | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 15.0 | NaN | NaN | NaN | NaN | NaN | NaN |
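The grouping and reshaping steps can be followed more easily on a small example. The sketch below uses made-up measurements; the species, traits and values are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical individual measurements in long format
long = pd.DataFrame({
    "AccSpeciesName": ["Acer campestre", "Acer campestre", "Acer campestre", "Quercus robur"],
    "TraitName": ["Seed mass", "Seed mass", "Plant Height", "Plant Height"],
    "StdValue": [10.0, 20.0, 12.0, 30.0],
})

# One mean per species and trait
means = (long.groupby(["AccSpeciesName", "TraitName"])["StdValue"]
             .mean()
             .reset_index(name="mean"))

# Long to wide: one row per species, one column per trait
wide = means.pivot(index="AccSpeciesName", columns="TraitName", values="mean").reset_index()
print(wide)
```

Species without any measurement for a trait get NaN in that column, which explains the many NaN entries in the wide table above.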
# Optional: Save file
#TRY.to_csv("TRY/TRY_summary_stats.csv", index=False)
Create summary stats with original name
# group data by species name and trait, same analysis as above
grouped_syn = TRYdata.groupby(['SpeciesName', 'TraitID', 'TraitName'])
TRY_syn = grouped_syn['StdValue'].agg(['mean']).reset_index()
# change df shape from long to wide
TRY_syn = TRY_syn.pivot(index=["SpeciesName"], columns="TraitName", values="mean")
# reset index (species name) to a column in the data frame
TRY_syn.reset_index(inplace=True)
# shorten column names
shorten_names(TRY_syn)
#optional
#TRY_syn.to_csv("TRY/TRY_summary_stats_syn.csv", index=False)
TRY_syn.head(3)
TraitName | SpeciesName | Dispersal unit length | Leaf Area | SLA | Leaf C | LDMC | Leaf fresh mass | Leaf N per area | Leaf N per mass | Leaf delta15N | Leaf N P ratio | Leaf P | Plant Height | Seed mass | Seed length | Seeds per rep. unit | Stem conduit density | SSD | Conduit element length |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | (fabaceae) | NaN | NaN | 21.3385 | NaN | NaN | NaN | 1.578157 | 33.150000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1 | (fabaceae) 20-25oblong | NaN | NaN | NaN | NaN | NaN | NaN | 1.761453 | 32.513864 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2 | (fabaceae) brillafuzzy | NaN | NaN | NaN | NaN | NaN | NaN | 1.397197 | 33.837593 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Link iNaturalist to TRY
Exact (non-fuzzy) merge with the TRY summary statistics on the consolidated TRY species name:
import pandas as pd # for handling dataframes in python
iNat = pd.read_csv('iNat_observations.csv')
iNat_TRY = pd.merge(iNat, TRY,
left_on= ['scientificName'],
right_on= ['AccSpeciesName'],
how='inner')
iNat_TRY.head(3)
| | gbifID | scientificName | decimalLatitude | decimalLongitude | eventDate | dateIdentified | AccSpeciesName | Dispersal unit length | Leaf Area | SLA | ... | Leaf delta15N | Leaf N P ratio | Leaf P | Plant Height | Seed mass | Seed length | Seeds per rep. unit | Stem conduit density | SSD | Conduit element length |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1229615436 | Commelina communis | 35.987483 | -79.057546 | 2013-07-07T00:00:00 | 2013-07-07T20:33:11 | Commelina communis | NaN | NaN | NaN | ... | NaN | 12.631579 | 1.71 | NaN | 8.48 | NaN | NaN | NaN | NaN | NaN |
1 | 3384000233 | Commelina communis | 42.093762 | -75.923660 | 2021-08-23T13:06:06 | 2021-09-17T21:15:37 | Commelina communis | NaN | NaN | NaN | ... | NaN | 12.631579 | 1.71 | NaN | 8.48 | NaN | NaN | NaN | NaN | NaN |
2 | 1807276585 | Commelina communis | 40.787636 | -73.933728 | 2017-09-04T12:47:58 | 2017-09-04T21:58:57 | Commelina communis | NaN | NaN | NaN | ... | NaN | 12.631579 | 1.71 | NaN | 8.48 | NaN | NaN | NaN | NaN | NaN |
3 rows × 25 columns
Extract from iNaturalist those observations that have not been matched:
# filter for observations not in merged dataframe:
iNat_rest = iNat[~iNat.gbifID.isin(iNat_TRY['gbifID'])]
iNat_rest.shape
(2541013, 6)
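An alternative way to separate matched from unmatched observations in a single step is a left merge with the indicator option of pandas. This is a sketch on made-up data and not the approach used in this notebook; it assumes that the trait table has one row per species name, otherwise the left merge would duplicate observations.

```python
import pandas as pd

# Hypothetical observations and trait table
obs = pd.DataFrame({
    "gbifID": [1, 2, 3],
    "scientificName": ["Commelina communis", "Blitum capitatum", "Commelina communis"],
})
traits = pd.DataFrame({
    "AccSpeciesName": ["Commelina communis"],
    "Seed mass": [8.48],
})

# indicator=True adds a "_merge" column: "both" or "left_only" for a left merge
merged = obs.merge(traits, left_on="scientificName", right_on="AccSpeciesName",
                   how="left", indicator=True)

matched = merged[merged["_merge"] == "both"].drop(columns="_merge")
unmatched = merged.loc[merged["_merge"] == "left_only", list(obs.columns)]
print(len(matched), len(unmatched))
```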
We repeat the same with the ‘original’ species name in TRY:
# non-fuzzy merge with TRY summary stats on original TRY species name:
iNat_TRY_syn = pd.merge(iNat_rest, TRY_syn,
left_on= ['scientificName'],
right_on= ['SpeciesName'],
how='inner')
iNat_TRY_syn.head(3)
| | gbifID | scientificName | decimalLatitude | decimalLongitude | eventDate | dateIdentified | SpeciesName | Dispersal unit length | Leaf Area | SLA | ... | Leaf delta15N | Leaf N P ratio | Leaf P | Plant Height | Seed mass | Seed length | Seeds per rep. unit | Stem conduit density | SSD | Conduit element length |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1802610589 | Blitum capitatum | 40.320259 | -105.604856 | 2013-08-24T13:30:00 | 2019-09-02T01:11:54 | Blitum capitatum | NaN | NaN | NaN | ... | NaN | NaN | NaN | 0.45 | NaN | NaN | NaN | NaN | NaN | NaN |
1 | 2283078677 | Blitum capitatum | 50.744232 | -120.511303 | 2019-06-29T17:50:28 | 2019-09-02T01:16:41 | Blitum capitatum | NaN | NaN | NaN | ... | NaN | NaN | NaN | 0.45 | NaN | NaN | NaN | NaN | NaN | NaN |
2 | 2864818488 | Blitum capitatum | 53.938056 | -106.068553 | 2020-08-22T12:22:09 | 2020-08-22T19:13:24 | Blitum capitatum | NaN | NaN | NaN | ... | NaN | NaN | NaN | 0.45 | NaN | NaN | NaN | NaN | NaN | NaN |
3 rows × 25 columns
subsets = [iNat_TRY, iNat_TRY_syn]
iNat_TRY_all = pd.concat(subsets)
iNat_TRY_all = iNat_TRY_all.drop(['AccSpeciesName', 'SpeciesName'], axis = 1)
# replace infinite values with NaN
iNat_TRY_all = iNat_TRY_all.replace([np.inf, -np.inf], np.nan)
iNat_TRY_all.head()
| | gbifID | scientificName | decimalLatitude | decimalLongitude | eventDate | dateIdentified | Dispersal unit length | Leaf Area | SLA | Leaf C | ... | Leaf delta15N | Leaf N P ratio | Leaf P | Plant Height | Seed mass | Seed length | Seeds per rep. unit | Stem conduit density | SSD | Conduit element length |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1229615436 | Commelina communis | 35.987483 | -79.057546 | 2013-07-07T00:00:00 | 2013-07-07T20:33:11 | NaN | NaN | NaN | NaN | ... | NaN | 12.631579 | 1.71 | NaN | 8.48 | NaN | NaN | NaN | NaN | NaN |
1 | 3384000233 | Commelina communis | 42.093762 | -75.923660 | 2021-08-23T13:06:06 | 2021-09-17T21:15:37 | NaN | NaN | NaN | NaN | ... | NaN | 12.631579 | 1.71 | NaN | 8.48 | NaN | NaN | NaN | NaN | NaN |
2 | 1807276585 | Commelina communis | 40.787636 | -73.933728 | 2017-09-04T12:47:58 | 2017-09-04T21:58:57 | NaN | NaN | NaN | NaN | ... | NaN | 12.631579 | 1.71 | NaN | 8.48 | NaN | NaN | NaN | NaN | NaN |
3 | 3355124418 | Commelina communis | 39.643158 | -76.764245 | 2020-08-26T10:19:56 | 2020-08-27T13:21:22 | NaN | NaN | NaN | NaN | ... | NaN | 12.631579 | 1.71 | NaN | 8.48 | NaN | NaN | NaN | NaN | NaN |
4 | 1802638502 | Commelina communis | 43.109505 | 1.622543 | 2017-10-21T10:01:00 | 2017-10-21T09:02:42 | NaN | NaN | NaN | NaN | ... | NaN | 12.631579 | 1.71 | NaN | 8.48 | NaN | NaN | NaN | NaN | NaN |
5 rows × 24 columns
trait = iNat_TRY_all.columns[6:24]
iNat_TRY_all.loc[:, trait] = np.log(iNat_TRY_all[trait])
/net/home/swolf/.conda/envs/traitmaps/lib/python3.8/site-packages/pandas/core/internals/blocks.py:402: RuntimeWarning: divide by zero encountered in log
result = func(self.values, **kwargs)
/net/home/swolf/.conda/envs/traitmaps/lib/python3.8/site-packages/pandas/core/internals/blocks.py:402: RuntimeWarning: invalid value encountered in log
result = func(self.values, **kwargs)
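The two warnings indicate that some trait means are zero (log gives -inf) or negative (log gives NaN); delta 15N values, for example, can be negative. Because infinite values were replaced before the log transform, -inf values produced by the transform itself may remain in the result. One possible way to handle this, shown here as a sketch on made-up values and not as the approach used in this notebook, is to mask all non-positive values before taking the logarithm.

```python
import numpy as np
import pandas as pd

# Hypothetical trait means, including a zero and a negative value
toy = pd.DataFrame({
    "Plant Height": [0.2, 15.0, 0.0],
    "Leaf delta15N": [2.5, -1.3, 4.0],
})

# Keep positive values, set zeros and negatives to NaN
positive = toy.where(toy > 0)

# The natural log of the remaining values; NaN stays NaN
logged = np.log(positive)
print(logged)
```

Whether masking is appropriate depends on the trait: for traits that are legitimately negative, a log transform may not be meaningful at all.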
iNat_TRY_all.to_csv("iNat_TRY_log.csv", index=False)
iNat_TRY_all.head()
| | gbifID | scientificName | decimalLatitude | decimalLongitude | eventDate | dateIdentified | Dispersal unit length | Leaf Area | SLA | Leaf C | ... | Leaf delta15N | Leaf N P ratio | Leaf P | Plant Height | Seed mass | Seed length | Seeds per rep. unit | Stem conduit density | SSD | Conduit element length |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1229615436 | Commelina communis | 35.987483 | -79.057546 | 2013-07-07T00:00:00 | 2013-07-07T20:33:11 | NaN | NaN | NaN | NaN | ... | NaN | 2.5362 | 0.536493 | NaN | 2.13771 | NaN | NaN | NaN | NaN | NaN |
1 | 3384000233 | Commelina communis | 42.093762 | -75.923660 | 2021-08-23T13:06:06 | 2021-09-17T21:15:37 | NaN | NaN | NaN | NaN | ... | NaN | 2.5362 | 0.536493 | NaN | 2.13771 | NaN | NaN | NaN | NaN | NaN |
2 | 1807276585 | Commelina communis | 40.787636 | -73.933728 | 2017-09-04T12:47:58 | 2017-09-04T21:58:57 | NaN | NaN | NaN | NaN | ... | NaN | 2.5362 | 0.536493 | NaN | 2.13771 | NaN | NaN | NaN | NaN | NaN |
3 | 3355124418 | Commelina communis | 39.643158 | -76.764245 | 2020-08-26T10:19:56 | 2020-08-27T13:21:22 | NaN | NaN | NaN | NaN | ... | NaN | 2.5362 | 0.536493 | NaN | 2.13771 | NaN | NaN | NaN | NaN | NaN |
4 | 1802638502 | Commelina communis | 43.109505 | 1.622543 | 2017-10-21T10:01:00 | 2017-10-21T09:02:42 | NaN | NaN | NaN | NaN | ... | NaN | 2.5362 | 0.536493 | NaN | 2.13771 | NaN | NaN | NaN | NaN | NaN |
5 rows × 24 columns
After matching on both the consolidated and the original species name, about 84% of the iNaturalist observations are linked to trait information, but these cover only about 31% of the iNaturalist species. Many rare species seem to be absent from one of the two databases.
print('fraction of iNat observations linked with at least one TRY trait:')
print(len(iNat_TRY_all)/len(iNat))
print('fraction of species in iNaturalist matched with TRY:')
print(iNat_TRY_all["scientificName"].nunique()/iNat["scientificName"].nunique())
fraction of iNat observations linked with at least one TRY trait:
0.8421341704587321
fraction of species in iNaturalist matched with TRY:
0.3059127945386479
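The gap between the two numbers arises because a few common species account for most observations. The following sketch illustrates the arithmetic with made-up data; the species names and counts are assumptions and unrelated to the real datasets.

```python
import pandas as pd

# Hypothetical observations: one common species and three rare ones
obs = pd.Series(["Common sp"] * 7 + ["Rare a", "Rare b", "Rare c"], name="scientificName")
with_traits = {"Common sp"}  # species assumed to have trait data

is_matched = obs.isin(with_traits)

# Share of observations that could be linked to traits
obs_fraction = is_matched.mean()
# Share of species that could be linked to traits
species_fraction = obs[is_matched].nunique() / obs.nunique()

print(obs_fraction, species_fraction)
```

Here 7 of 10 observations are matched, although only 1 of 4 species has trait data.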