I have created a python script that automates a workflow converting PDF to txt files. I want to be able to store and query these files in MongoDB. Do I need to turn the .txt file into JSON/BSON? Should I be using a program like PyMongo?
I am just not sure what the steps of such a project would be let alone the tools that would help with this.
I've looked at this post: How can one add text files in Mongodb?, which makes me think I need to convert the file to a JSON file, and possibly integrate GridFS?
解决方案
You don't need to JSON/BSON encode it if you're using a driver. If you're using the MongoDB shell, you'd need to worry about it when you pasted the contents.
You'd likely want to use the Python MongoDB driver:
from pymongo import MongoClient
client = MongoClient()
db = client.test_database # use a database called "test_database"
collection = db.files # and inside that DB, a collection called "files"
f = open('test_file_name.txt') # open a file
text = f.read() # read the entire contents, should be UTF-8 text
# build a document to be inserted
text_file_doc = {"file_name": "test_file_name.txt", "contents" : text }
# insert the contents into the "file" collection
collection.insert(text_file_doc)
(Untested code)
If you made sure that the file names are unique, you could set the _id property of the document and retrieve it like:
text_file_doc = collection.find_one({"_id": "test_file_name.txt"})
Or, you could ensure the file_name property as shown above is indexed and do:
text_file_doc = collection.find_one({"file_name": "test_file_name.txt"})
Your other option is to use GridFS, although it's often not recommended for small files.
There's a starter here for Python and GridFS.