Cheatsheet for the RDKit package in Python: (1) drawing molecules in a Jupyter environment; (2) using RDKit with Pandas DataFrames; (3) descriptors/fingerprints; and (4) similarity search.
Installation
The RDKit package is typically installed with conda (for example, from the conda-forge channel). Recent releases are also published on PyPI.
Setup
Chem vs. AllChem
As mentioned in the RDKit Getting Started guide:
The majority of 'basic' chemical functionality (e.g. reading/writing molecules, substructure searching, molecular cleanup, etc.) is in the rdkit.Chem module. More advanced, or less frequently used, functionality is in rdkit.Chem.AllChem.
If you find the Chem/AllChem thing annoying or confusing, you can use python's 'import … as …' syntax to remove the irritation:
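The aliasing trick quoted above might look like the following. This is a minimal sketch; the ethanol SMILES is only an illustrative input and is not taken from the original post.

```python
# Option 1: import both modules and call each one explicitly.
from rdkit import Chem
from rdkit.Chem import AllChem

# Option 2: AllChem re-exports the contents of Chem, so aliasing it
# as Chem gives a single name for basic and advanced functionality.
from rdkit.Chem import AllChem as Chem

mol = Chem.MolFromSmiles("CCO")  # ethanol, illustrative input only
print(mol.GetNumAtoms())
```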
Basic
Get an RDKit molecule from SMILES. An RDKit molecule enables several features for handling molecules: drawing, computing fingerprints/properties, molecular curation, etc.
RDKit molecules can be displayed directly in a Jupyter environment.
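As a minimal sketch of this step, assuming an aspirin SMILES chosen purely for illustration:

```python
from rdkit import Chem

smiles = "CC(=O)Oc1ccccc1C(=O)O"  # aspirin, chosen only for illustration
mol = Chem.MolFromSmiles(smiles)

# MolFromSmiles returns None (rather than raising) when parsing fails,
# so it is worth checking before using the molecule.
if mol is None:
    raise ValueError(f"Could not parse SMILES: {smiles}")

print(type(mol).__name__, mol.GetNumAtoms())
```

In a Jupyter notebook, leaving `mol` as the last expression of a cell renders the 2D structure.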
Convert an RDKit molecule to SMILES.
Convert an RDKit molecule to an InChIKey.
Convert an RDKit molecule to a coordinate representation (a MOL block, which can be stored in an .sdf file).
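The three conversions listed above could be sketched as follows. The input molecule is an illustrative assumption, and the InChIKey call assumes an RDKit build that includes InChI support.

```python
from rdkit import Chem

mol = Chem.MolFromSmiles("OCC")  # ethanol, written non-canonically on purpose

smiles = Chem.MolToSmiles(mol)      # canonical SMILES string
inchikey = Chem.MolToInchiKey(mol)  # assumes RDKit was built with InChI support
molblock = Chem.MolToMolBlock(mol)  # MOL block text: atoms, coordinates, bonds

print(smiles)
print(inchikey)
print(molblock)
```

Because this molecule has no conformer, the coordinates in the MOL block are all zero. To store several molecules in one .sdf file, `Chem.SDWriter` can be used instead of handling MOL blocks by hand.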
Reading sets of molecules
Major types of molecular file formats:
.csv file that includes a column of SMILES. See the PandasTools section.
.smi/.txt file that includes SMILES. Collect the SMILES as a list. For example, a .smi file that contains one SMILES per line can be read line by line.
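Reading a .smi file with one SMILES per line might look like this. The demo file and its contents are hypothetical and are created on the fly only so the sketch is self-contained.

```python
import os
import tempfile

from rdkit import Chem

# Create a small demo file (hypothetical data) so the example runs as-is.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.smi")
with open(demo_path, "w") as handle:
    handle.write("CCO\nc1ccccc1\n\nCC(=O)O\n")

# Read one SMILES per line, skipping blank lines.
with open(demo_path) as handle:
    smiles_list = [line.strip() for line in handle if line.strip()]

# Convert to RDKit molecules and drop anything that failed to parse.
mols = [Chem.MolFromSmiles(smi) for smi in smiles_list]
mols = [mol for mol in mols if mol is not None]
print(len(smiles_list), len(mols))
```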
.sdf file that includes atom coordinates. Molecules can be read directly from the .sdf file.
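A minimal sketch of reading an .sdf file with `Chem.SDMolSupplier`. The demo file is hypothetical and is written first with `Chem.SDWriter` so the example is self-contained; with real data only the reading part is needed.

```python
import os
import tempfile

from rdkit import Chem

# Write a tiny demo .sdf file (hypothetical data) so there is something to read.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.sdf")
writer = Chem.SDWriter(demo_path)
for smi in ("CCO", "c1ccccc1"):
    writer.write(Chem.MolFromSmiles(smi))
writer.close()

# The supplier yields None for records it cannot parse, so filter those out.
supplier = Chem.SDMolSupplier(demo_path)
mols = [mol for mol in supplier if mol is not None]
print(len(mols))
```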
Draw molecules in Jupyter environment
Print molecules in grid.
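Drawing several molecules in a grid could be sketched with `Draw.MolsToGridImage`. The SMILES strings and the grid settings are illustrative assumptions.

```python
from rdkit import Chem
from rdkit.Chem import Draw

smiles_list = ["CCO", "c1ccccc1", "CC(=O)O", "CCN"]  # illustrative inputs
mols = [Chem.MolFromSmiles(smi) for smi in smiles_list]

# Build one image with two molecules per row and the SMILES as legends.
img = Draw.MolsToGridImage(
    mols,
    molsPerRow=2,
    subImgSize=(200, 200),
    legends=smiles_list,
)
```

In a notebook, putting `img` on the last line of the cell displays the grid.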
PandasTools
PandasTools enables using RDKit molecules as columns of a Pandas DataFrame.
| | smiles | logSolubility |
|---|---|---|
| 0 | N#CC(OC1OC(COC2OC(CO)C(O)C(O)C2O)C(O)C(O)C1O)c.. | -0.77 |
Add a ROMol column to the Pandas DataFrame.
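A minimal sketch using `PandasTools.AddMoleculeColumnToFrame`. The DataFrame is built in memory here from the single row shown in this section; that is an assumption, since the original data source is not shown.

```python
import pandas as pd
from rdkit.Chem import PandasTools

# One-row frame mirroring the example row in this section.
df = pd.DataFrame(
    {
        "smiles": ["N#CC(OC1OC(COC2OC(CO)C(O)C(O)C2O)C(O)C(O)C1O)c1ccccc1"],
        "logSolubility": [-0.77],
    }
)

# Parse the SMILES column and store the molecules in a new ROMol column.
PandasTools.AddMoleculeColumnToFrame(df, smilesCol="smiles", molCol="ROMol")
print(df.columns.tolist())
```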
| | smiles | logSolubility | ROMol |
|---|---|---|---|
| 0 | N#CC(OC1OC(COC2OC(CO)C(O)C(O)C2O)C(O)C(O)C1O)c.. | -0.77 | (molecule image) |
The ROMol column stores rdchem.Mol objects.
Draw the structures in grid.
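Drawing the molecules of a DataFrame in a grid might be done with `PandasTools.FrameToGridImage`. The SMILES values below are illustrative assumptions.

```python
import pandas as pd
from rdkit.Chem import PandasTools

df = pd.DataFrame({"smiles": ["CCO", "c1ccccc1", "CC(=O)O"]})  # illustrative
PandasTools.AddMoleculeColumnToFrame(df, smilesCol="smiles", molCol="ROMol")

# Draw the ROMol column as a grid, labelling each cell with its SMILES.
img = PandasTools.FrameToGridImage(
    df,
    column="ROMol",
    legendsCol="smiles",
    molsPerRow=3,
)
```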
Add new property columns with the Pandas map method.
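For instance, a heavy-atom count column can be derived from the ROMol column with `Series.map`. This is a sketch in which the one-row frame is rebuilt in memory from the row shown in this section.

```python
import pandas as pd
from rdkit.Chem import PandasTools

df = pd.DataFrame(
    {
        "smiles": ["N#CC(OC1OC(COC2OC(CO)C(O)C(O)C2O)C(O)C(O)C1O)c1ccccc1"],
        "logSolubility": [-0.77],
    }
)
PandasTools.AddMoleculeColumnToFrame(df, smilesCol="smiles", molCol="ROMol")

# Apply a function to every molecule; GetNumAtoms counts heavy atoms here
# because molecules parsed from SMILES carry implicit hydrogens.
df["n_Atoms"] = df["ROMol"].map(lambda mol: mol.GetNumAtoms())
print(df["n_Atoms"].tolist())
```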
| | smiles | logSolubility | ROMol | n_Atoms |
|---|---|---|---|---|
| 0 | N#CC(OC1OC(COC2OC(CO)C(O)C(O)C2O)C(O)C(O)C1O)c.. | -0.77 | (molecule image) | 32 |
Before saving the DataFrame as a .csv file, it is recommended to drop the ROMol column.
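Dropping the molecule column before writing the file could look like this. The output path is a hypothetical temporary file and the frame is rebuilt from the row shown in this section.

```python
import os
import tempfile

import pandas as pd
from rdkit.Chem import PandasTools

df = pd.DataFrame(
    {
        "smiles": ["N#CC(OC1OC(COC2OC(CO)C(O)C(O)C2O)C(O)C(O)C1O)c1ccccc1"],
        "logSolubility": [-0.77],
    }
)
PandasTools.AddMoleculeColumnToFrame(df, smilesCol="smiles", molCol="ROMol")
df["n_Atoms"] = df["ROMol"].map(lambda mol: mol.GetNumAtoms())

# Molecule objects do not serialize usefully to text, so drop them first.
df_out = df.drop(columns=["ROMol"])

csv_path = os.path.join(tempfile.mkdtemp(), "molecules.csv")  # hypothetical path
df_out.to_csv(csv_path, index=False)

reloaded = pd.read_csv(csv_path)
print(reloaded.columns.tolist())
```

The molecules can be regenerated later from the saved SMILES column.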
| | smiles | logSolubility | n_Atoms |
|---|---|---|---|
| 0 | N#CC(OC1OC(COC2OC(CO)C(O)C(O)C2O)C(O)C(O)C1O)c.. | -0.77 | 32 |
Descriptors/Fingerprints
The RDKit has a variety of built-in functionality for generating molecular fingerprints/descriptors. A detailed description can be found in the RDKit documentation.
| | smiles | logSolubility | ROMol |
|---|---|---|---|
| 0 | N#CC(OC1OC(COC2OC(CO)C(O)C(O)C2O)C(O)C(O)C1O)c.. | -0.77 | (molecule image) |
Morgan Fingerprint (ECFPx)
AllChem.GetMorganFingerprintAsBitVect
Parameters:
- radius: no default value; usually set to 2 for similarity search and 3 for machine learning.
- nBits: number of bits, default is 2048. 1024 is also widely used.
- other parameters are usually left at their defaults.
More examples can be found in this notebook from my previous work.
The ECFP6 fingerprint for each molecule has 1024 bits.
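A minimal sketch of computing such a fingerprint with `AllChem.GetMorganFingerprintAsBitVect`. ECFP names use the diameter, so ECFP6 corresponds to radius 3 and ECFP4 to radius 2. Newer RDKit releases may emit a deprecation warning for this function and point to `rdFingerprintGenerator` instead; the call below follows the function named in this section.

```python
from rdkit import Chem
from rdkit.Chem import AllChem

mol = Chem.MolFromSmiles("N#CC(OC1OC(COC2OC(CO)C(O)C(O)C2O)C(O)C(O)C1O)c1ccccc1")

# radius=3 gives an ECFP6-like fingerprint, folded into 1024 bits.
fp = AllChem.GetMorganFingerprintAsBitVect(mol, 3, nBits=1024)

# Convert the bit vector to a plain list of 0/1 integers.
bits = [int(bit) for bit in fp.ToBitString()]
print(fp.GetNumBits(), fp.GetNumOnBits())
```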
Save as a .csv file for further use (e.g., machine learning). I usually save (1) SMILES as the index and (2) each bit as a column of the .csv file.
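One way to build that layout is sketched below, with SMILES as the index and one column per bit. The output path is a hypothetical temporary file, and the `Bit_i` column names follow the table in this section.

```python
import os
import tempfile

import pandas as pd
from rdkit import Chem
from rdkit.Chem import AllChem

smiles_list = ["N#CC(OC1OC(COC2OC(CO)C(O)C(O)C2O)C(O)C(O)C1O)c1ccccc1"]
n_bits = 1024

# One row of 0/1 integers per molecule.
rows = []
for smi in smiles_list:
    mol = Chem.MolFromSmiles(smi)
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, 3, nBits=n_bits)
    rows.append([int(bit) for bit in fp.ToBitString()])

fp_df = pd.DataFrame(
    rows,
    columns=[f"Bit_{i}" for i in range(n_bits)],
    index=pd.Index(smiles_list, name="smiles"),
)

csv_path = os.path.join(tempfile.mkdtemp(), "ecfp6_1024.csv")  # hypothetical path
fp_df.to_csv(csv_path)  # the SMILES index is written as the first column
print(fp_df.shape)
```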
smiles | Bit_0 | Bit_1 | Bit_2 | Bit_3 | Bit_4 | Bit_5 | Bit_6 | Bit_7 | Bit_8 | Bit_9 | .. | Bit_1014 | Bit_1015 | Bit_1016 | Bit_1017 | Bit_1018 | Bit_1019 | Bit_1020 | Bit_1021 | Bit_1022 | Bit_1023 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
N#CC(OC1OC(COC2OC(CO)C(O)C(O)C2O)C(O)C(O)C1O)c1ccccc1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | .. | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
1 rows × 1024 columns
Similarity Search
Compute the similarity between a reference molecule and a list of molecules. Here is an example of using the ECFP4 fingerprint to compute the Tanimoto similarity (the default metric of DataStructs.FingerprintSimilarity).
- compute fingerprints
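The fingerprint and similarity steps might be sketched as follows. The reference and candidate SMILES are illustrative assumptions, and `ecfp4` is a hypothetical helper defined only for this example.

```python
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem


def ecfp4(smiles, n_bits=2048):
    """Hypothetical helper: ECFP4-like bit vector (Morgan, radius 2)."""
    mol = Chem.MolFromSmiles(smiles)
    return AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=n_bits)


ref_fp = ecfp4("CCO")                   # reference molecule (illustrative)
candidates = ["CCO", "CCCO", "c1ccccc1"]
fps = [ecfp4(smi) for smi in candidates]

# Pairwise call; Tanimoto is the default metric of FingerprintSimilarity.
sims = [DataStructs.FingerprintSimilarity(ref_fp, fp) for fp in fps]

# Equivalent bulk call, convenient for long lists of fingerprints.
bulk_sims = DataStructs.BulkTanimotoSimilarity(ref_fp, fps)
print(sims)
```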
We can also add the similarity_efcp4 column to the DataFrame and visualize the structures together with their similarity.
Sort the result from highest to lowest.
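Adding the similarity column and sorting it could look like this. The molecules are illustrative assumptions, and the column name `similarity_efcp4` is kept exactly as written in this section.

```python
import pandas as pd
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

df = pd.DataFrame({"smiles": ["c1ccccc1", "CCCO", "CCO"]})  # illustrative
mols = [Chem.MolFromSmiles(smi) for smi in df["smiles"]]
fps = [AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048) for mol in mols]

# Reference molecule (illustrative): ethanol.
ref_fp = AllChem.GetMorganFingerprintAsBitVect(Chem.MolFromSmiles("CCO"), 2, nBits=2048)

# One Tanimoto similarity value per row, then sort from highest to lowest.
df["similarity_efcp4"] = DataStructs.BulkTanimotoSimilarity(ref_fp, fps)
df_sorted = df.sort_values("similarity_efcp4", ascending=False)
print(df_sorted)
```

To show structures next to the scores, a ROMol column can be added with PandasTools and drawn with `PandasTools.FrameToGridImage`, using the similarity column as the legend.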
More Reading
- Official documentation.
- RDKit Cookbook
This document provides example recipes of how to carry out particular tasks using the RDKit functionality from Python. The contents have been contributed by the RDKit community, tested with the latest RDKit release, and then compiled into this document.