Ensembl gene ID version used in Geneformer (ENSG)
Hello, Geneformer team,
I have a question regarding the file ensembl_mapping_dict_gc30M.pkl.
Could you please clarify which Ensembl human gene annotation version was used to generate the ENSG identifiers in this mapping dictionary?
This information would be very helpful for ensuring consistent cross-species gene mapping and compatibility with the Geneformer tokenizer.
Thank you very much for your help.
Thank you for your question. Most public data is not annotated by Ensembl version, so when we integrate data we go by the Ensembl ID number or, if a gene name is provided instead, we convert it to the Ensembl ID based on the current version of public tools like MyGene. Generally, though, the IDs should be stable for our purposes when we are working with already-aligned counts. The 30M token dictionary used the Ensembl IDs that were current as of early 2021.
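For anyone who wants to check their own gene list against the dictionary, here is a minimal sketch. It assumes the pickle file holds a plain Python dict whose keys are unversioned Ensembl gene IDs; that structure is an assumption, not something confirmed in this thread. The helper names (`strip_version`, `load_mapping`, `check_coverage`) and the toy dictionary are hypothetical and used only for illustration.

```python
import pickle


def strip_version(ensembl_id: str) -> str:
    """Drop a trailing version suffix: 'ENSG00000141510.17' -> 'ENSG00000141510'."""
    return ensembl_id.split(".", 1)[0]


def load_mapping(path):
    """Load a pickled dictionary such as ensembl_mapping_dict_gc30M.pkl.

    Assumption: the file contains a dict keyed by Ensembl gene ID.
    Only unpickle files obtained from a source you trust.
    """
    with open(path, "rb") as handle:
        return pickle.load(handle)


def check_coverage(gene_ids, mapping):
    """Split IDs into those present among the mapping's keys and those missing."""
    found, missing = [], []
    for gene_id in gene_ids:
        core = strip_version(gene_id)
        (found if core in mapping else missing).append(core)
    return found, missing


# Toy stand-in for the real dictionary (hypothetical contents).
toy_mapping = {
    "ENSG00000141510": "ENSG00000141510",
    "ENSG00000012048": "ENSG00000012048",
}
my_ids = ["ENSG00000141510.17", "ENSG00000012048", "ENSG99999999999"]

found, missing = check_coverage(my_ids, toy_mapping)
print(f"found={found} missing={missing}")
```

To run this against the real file, replace `toy_mapping` with `load_mapping("ensembl_mapping_dict_gc30M.pkl")`, adjusting the path to wherever the file is stored locally. IDs reported as missing may have been retired or introduced after early 2021 and would need to be checked by hand.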