Markdown is a popular lightweight markup language with a plain-text formatting syntax designed to be easy to read, write, and understand. Its consistent and predictable syntax makes it easy for AI algorithms to parse and understand the structure of text. It is also widely supported by popular tools, including GitHub, Jupyter notebooks, and more.
Microsoft recently released an open-source tool called MarkItDown on GitHub. MarkItDown is a Python library for converting files and office documents to Markdown. The converted files can then be used for indexing, text analysis, and more. Microsoft's MarkItDown library currently supports the following file formats:
- PDF (.pdf)
- PowerPoint (.pptx)
- Word (.docx)
- Excel (.xlsx)
- Images (EXIF metadata, and OCR)
- Audio (EXIF metadata, and speech transcription)
- HTML (special handling of Wikipedia, etc.)
- Various other text-based formats (csv, json, xml, etc.)
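As a rough illustration of the list above, the following sketch checks whether a file's extension is among the listed formats before handing the file to MarkItDown. The extension set and the helper names `is_listed_format` and `to_markdown` are assumptions made for this example and are not part of the library; the `MarkItDown().convert(...)` call and the `text_content` attribute are the same ones used in the library's image example in this article.

```python
from pathlib import Path

# Extensions named in the list above. Image and audio extensions are left
# out because the article does not spell them out; ".html" is an assumption.
LISTED_EXTENSIONS = {".pdf", ".pptx", ".docx", ".xlsx", ".html", ".csv", ".json", ".xml"}


def is_listed_format(path: str) -> bool:
    """Return True if the file extension appears in LISTED_EXTENSIONS."""
    return Path(path).suffix.lower() in LISTED_EXTENSIONS


def to_markdown(path: str) -> str:
    """Convert one file to Markdown text (hypothetical helper)."""
    if not is_listed_format(path):
        raise ValueError(f"Extension not in the listed formats: {path}")
    # Imported lazily so the extension check works without the package installed.
    from markitdown import MarkItDown

    return MarkItDown().convert(path).text_content


if __name__ == "__main__":
    for name in ["report.PDF", "slides.pptx", "notes.txt"]:
        print(name, is_listed_format(name))
```

Checking the extension first is only a convenience for batch jobs; the library itself decides how each file is actually converted.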
Developers can also configure the MarkItDown library to use Large Language Models to describe images. To do this, they need to pass the mlm_client and mlm_model parameters when creating the MarkItDown object, as shown below:
from markitdown import MarkItDown
from openai import OpenAI
client = OpenAI()
md = MarkItDown(mlm_client=client, mlm_model="gpt-4o")
result = md.convert("example.jpg")
print(result.text_content)
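A minimal sketch of one way to make the image description optional: use the LLM client only when an API key is available, and fall back to a plain converter otherwise. The helpers `build_converter_kwargs` and `describe_image` and the check for the `OPENAI_API_KEY` environment variable are assumptions for illustration; the `mlm_client` and `mlm_model` names are taken from the snippet above and may differ in other releases of the library.

```python
import os


def build_converter_kwargs(env=None, model="gpt-4o"):
    """Return keyword arguments for MarkItDown (hypothetical helper).

    With no API key in the environment, returns an empty dict so that
    MarkItDown is created without an LLM client.
    """
    env = os.environ if env is None else env
    api_key = env.get("OPENAI_API_KEY")
    if not api_key:
        return {}
    # Imported lazily so the fallback path works without the openai package.
    from openai import OpenAI

    return {"mlm_client": OpenAI(api_key=api_key), "mlm_model": model}


def describe_image(path):
    """Convert an image to Markdown, using an LLM description when configured."""
    from markitdown import MarkItDown

    md = MarkItDown(**build_converter_kwargs())
    return md.convert(path).text_content
```

Keeping the keyword arguments in a separate function makes the fallback easy to check without network access or an installed client.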
Because the MarkItDown library is available under the MIT open-source license, developers can freely use, modify, and distribute it. The only requirement is that they include the original license and copyright notice in their distribution.
Developers can download the MarkItDown Python library here. They can also install it using the "pip install markitdown" command, or from the source using the "pip install -e ." command.
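After installing, a quick check like the following can confirm that the package is importable from the current environment. The helper name `is_installed` is made up for this example; it relies only on Python's standard `importlib` module.

```python
import importlib.util


def is_installed(package: str) -> bool:
    """Return True if a top-level package can be found by the import system."""
    return importlib.util.find_spec(package) is not None


if __name__ == "__main__":
    if is_installed("markitdown"):
        print("markitdown is available")
    else:
        print("markitdown not found; try: pip install markitdown")
```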
matt palmer (@mattppal), December 13, 2024:
NEW: Microsoft just dropped a library for converting Office files to markdown.
It's super fast and easy to use.
I built an app for you to try it out. Here it's converting a boilerplate pptx. pic.twitter.com/NrG6C5DCaq
If you are not a developer, you can try out the MarkItDown library as a web app here.