Steven Lott

Works at (Retired)

Green Cove Springs, US

Joined May 2008

http://slott-softwarearchitect.blogspot.com

About

Steven has been programming since the 70s, when computers were large, expensive, and rare. As a former contract software developer and architect, he worked on hundreds of projects from very small to very large. He's been using Python to solve business problems for over 20 years. His titles with Packt Publishing include Python Essentials, Mastering Object-Oriented Python, Functional Python Programming, Python3 Object-Oriented Programming, and Python for Secret Agents. Steven is currently a technomad who lives in various places on the east coast of the US. @s_lott

Stats

Reputation: 1569
Pageviews: 1.6M
Articles: 8
Comments: 0

Articles

Python and Low-Code Development: Smooth Sailing With Jupyter Notebooks
Editor's Note: The following is an article written for and published in DZone's 2021 Low-Code Development Trend Report.

If you ride on a sailboat in a steady breeze, it glides through the water effortlessly. Few things can compare to crossing a bay without the noise and commotion of a thundering internal combustion engine. The dream of no-code and low-code development is to effortlessly glide from problem to solution. Back in the '80s, no-code/low-code development was called "end-user computing." Since the invention of the spreadsheet, we've had a kind of low-code computing. The technologies continue to evolve.

Let's take our boat out of the slip and sail around a little. We'll look at two ways Python is commonly used to create no-code and low-code solutions to problems. The first thing we'll look at is using JupyterLab to create solutions to problems with minimal code. I liken this to using winches to help lift the sails. It's not an internal combustion engine, but it is a machine that helps us manipulate heavy, bulky items like sails and anchors. The second thing we'll sail past is using Python as an integration framework to knit together tools and solutions that aren't specifically written in Python themselves. In this case, we're going to be writing integration code, not solution code. This is how sailboats work; given a hull and some masts, you'll have to select and set the sails that will make the boat move.

JupyterLab

As a developer, and as a writer about Python, I rely on JupyterLab a lot. It's often the tool I start with because I get handy, pleasant spreadsheet-like interaction. The "spreadsheet" feature that I'm talking about is the ability to change the value of one cell, and then recompute the remaining cells after that change. This lets me create elegant, interactive solutions where the bulk of the interaction is handled by the notebook's internal model: a lattice of inter-dependent cells. In this picture, we can see the computation of cell 2 depends on prior results in cell 1.

I consider this "low code" because there's a vast amount of code we don't need to write. The Jupyter Notebook provides us an interactive framework that's robust and reliable. More important than that, the framework fits the way a lot of people have grown to use computers by putting values into some cells and checking results in other cells. The big difference between Jupyter and a spreadsheet is that Jupyter lets us write extensions and expressions in Python.

Recently, I had a boat-related problem crop up that was perfect for this kind of low-code processing. The tank under the pointy part of the boat (the "V-berth") has a fairly complex shape. The question really is: "How big is it?" The problem is access; I have to make a series of approximations and models. I really need a spreadsheet-like calculator, but I need a lot more mathematical processing than is sensible in a spreadsheet.

Consequently, let me apologize to any math-phobic readers. This example involves a lot of math. For folks who don't like math, think of using a spreadsheet where you never looked at the formula in a given column. A notebook can (and often does) hide the details. This specific example doesn't try to hide the details. Here's an example: https://github.com/slott56/replacing-a-spreadsheet.

I want to lift up a few key features of this low-code development approach. I created three notebooks, each of which had a common structure. They all have a collection of measurements as the input. As the output, each reports a computed volume.
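Before looking at the real input cell, here is a rough sketch of the pattern using a much simpler box-shaped tank. The names and numbers below are invented for illustration and are not from the actual notebooks, which live in the repository linked above:

from sympy import Rational

# Input cell: measurements, in inches.
measured = {"length": 46, "width": 24, "depth": 10 + Rational(1, 2)}

# Output cell: volume of a simple box, converted to US gallons (231 cubic inches per gallon).
cubic_inches = measured["length"] * measured["width"] * measured["depth"]
gallons = cubic_inches / 231

# Format as a whole number plus an exact fraction, e.g. '50 2/11 gallons' for these numbers.
whole, frac = int(gallons), gallons - int(gallons)
print(f"{whole} {frac} gallons" if frac else f"{whole} gallons")

Changing any value in the first cell and re-running the notebook recomputes the result, which is the spreadsheet-like behavior described above.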
The rest of the cells provide some background to help me make sure the math is right. Here's the input cell's content in the "prism.ipynb" notebook. This is cell 8.

measured = {
    # Forward triangle, in inches
    "h_f": 8,
    "w_f": 10 + Rational(1, 2),

    # Aft triangle, in inches
    "h_a": 27,
    "w_a": 48,

    # Overall length from forward to aft, in inches.
    "l_fa": 46,
}

This is a Python dictionary that has a number of input values. This is a bit more complex-looking than spreadsheet cells, making it solidly "low-code," not "no-code." The output is computed in cell 10, providing a single numeric result with a format like the following:

'50 85/88 gallons'

This shows me that the volume of space, given the measurements, is just shy of 51 gallons.

The best part about this data is there are two kinds of changes I can make. The most important change is to the measurements, which leads to recomputing the notebook. Anyone who can use a spreadsheet can alter the numbers in the "measured" dictionary to recompute the volume. The other change I can make is to adjust some of the underlying assumptions. This is a more nuanced change to the model that is also implemented in the notebook.

I find that for a wide variety of problems, a notebook is the place to start. It lets me gather data, formulate alternative approaches to algorithms and data structures, and work out candidate solutions that are richly interactive.

This is an excerpt from DZone's 2021 Low-Code Development Trend Report. For more: Read the Report

The JupyterLab project is like a boat with three masts and dozens upon dozens of sails for all conditions and situations. There are a lot of features that can be used to create interactive solutions to problems. The idea is to write only the code that's directly related to the problem domain and leverage the notebook's capabilities to present the solution so that people can make decisions and take actions. In addition to the notebook for a low-code interactive user experience, we can look at Python as an engine for integrating disparate applications.

Python as Integration Engine

Python's standard library includes modules to help us work with OS resources, including other applications. Rather than building or modifying existing code, we can treat applications as opaque boxes and combine them to create an integrated solution. To a limited extent, integrating applications is what shell scripts do. There's a world of difference, however, between integration with a shell script and integration with Python. Shell scripting involves the almost impossible-to-understand shell programming language. (See this article on replacing shell scripts with Python for more thoughts on this.) When we integrate applications with Python, we can easily introduce additional computations and transformations. This can help to smooth over gaps, remove manual operations, and prevent potential errors. I'm a fan of code like the following:

command = ["markdown_py", "-v", "-e", "utf-8"]
temp = options.input.with_suffix(".temp")
output = options.input.with_suffix(".html")
params = dict(CSS=str(options.style), TITLE=options.title)

with temp.open('w') as temporary, options.input.open() as source:
    subprocess.run(command, stdout=temporary, stdin=source)

This is part of a larger and more complex script to publish a complex document written in markdown. It has a lot of code examples, which are a lot easier to read in HTML format. I must do some pre-processing of the markdown and some post-processing of the HTML.
It seems easiest to execute the markdown_py command from inside a Python script, avoiding a complex python-bash-python kind of process. Since I'm not modifying the underlying applications, I find this fits with a low-code approach. I'm using the source application (markdown_py) for the thing it does best, while adjusting its inputs and outputs with Python.

Conclusion

We can use Python in a variety of ways. It's a programming language, so we can build code. More importantly, we can use the vast number of pre-built Python libraries to create low-code solutions. We can use a Jupyter Notebook as a low-code way to create a sophisticated interactive experience for users, and we can use Python to integrate other applications.

Sailing isn't effortless. The boat glides only when the sails are set properly and we keep the rudder in the right position. Just as skill and expertise are required to make a boat move, so too is careful attention needed to write the minimal Python code to solve an information processing problem.

Steven Lott, Writer, Python Guru & Retiree
@slott on DZone | @s_lott on Twitter | slott-softwarearchitect.blogspot.com
August 27, 2021
· 9,857 Views · 2 Likes
NMEA Data Acquisition: An IoT Exercise With Python
This comprehensive post covers the basic data arc that many IoT projects have—exploration, modeling, filtering, and persistence—using Python.
June 21, 2017
· 13,093 Views · 4 Likes
Literate Programming and GitHub
I remain captivated by the ideals of Literate Programming. My fork of PyLit (https://github.com/slott56/PyLit-3) coupled with Sphinx seems to handle LP programming in a very elegant way.

It works like this:

  • Write RST files describing the problem and the solution. This includes the actual implementation code. And everything else that's relevant.
  • Run PyLit3 to build final Python code from the RST documentation. This should include the setup.py so that it can be installed properly.
  • Run Sphinx to build pretty HTML pages (and LaTeX) from the RST documentation. I often include the unit tests along with the Sphinx build so that I'm sure that things are working.

The challenge is the final presentation of the whole package. The HTML can be easy to publish, but it can't (trivially) be used to recover the code. We have to upload two separate and distinct things. (We could use BeautifulSoup to recover RST from HTML and then PyLit to rebuild the code. But that sounds crazy.) The RST is easy to publish, but hard to read, and it requires a pass with PyLit to emit the code and then another pass with Sphinx to produce the HTML. A single upload doesn't work well. If we publish only the Python code, we've defeated the point of literate programming. Even if we focus on the Python, we need to do a separate upload of HTML to provide the supporting documentation.

After working with this for a while, I've found that it's simplest to have one source and several targets. I use RST ⇒ (.py, .html, .tex). This encourages me to write documentation first. I often fail, and have blocks of code with tiny summaries and non-existent explanations. PyLit allows one to use .py ⇒ .rst ⇒ (.html, .tex). I've messed with this a bit and don't like it as much. Code first leaves the documentation as a kind of afterthought.

How can we publish simply and cleanly, without separate uploads? Enter GitHub and gh-pages. See the "sphinxdoc-test" project for an example. Also this: https://github.com/daler/sphinxdoc-test. The bulk of this is useful advice on a way to create the gh-pages branch from your RST source via Sphinx and some GitHub commands.

Following this line of thinking, we almost have the case for three branches in an LP project:

  • The "master" branch with the RST source. And nothing more.
  • The "code" branch with the generated Python code created by PyLit.
  • The "gh-pages" branch with the generated HTML created by Sphinx.

I think I like this. We need three top-level directories. One has RST source. A build script would run PyLit to populate the (separate) directory for the code branch. The build script would also run Sphinx to populate a third top-level directory for the gh-pages branch.

The downside of this shows up when you need to create a branch for a separate effort. You have a "some-major-change" branch to master. Where's the code? Where's the doco? You don't want to commit either of those derived work products until you merge the "some-major-change" branch back into master.

GitHub Literate Programming

There are many LP projects on GitHub. There are perhaps a dozen which focus on publishing with GitHub-flavored Markdown as the source language. Because Markdown is about as easy to parse as RST, the tooling is simple. Because Markdown lacks semantic richness, I'm not switching. I've found that semantically rich markup is essential. This is a key feature of RST. It's carried forward by Sphinx to create very sophisticated markup. Think :code:`sample` vs. :py:func:`sample` vs. :py:mod:`sample` vs. :py:exc:`sample`.
The final typesetting may be similar, but they are clearly semantically distinct and create separate index entries. A focus on Markdown seems to be a limitation. It's encouraging to see folks experiment with literate programming using Markdown and GitHub. Perhaps other folks will look at more sophisticated markup languages like RST.

Previous Exercises

See https://sourceforge.net/projects/stingrayreader/ for a seriously large literate programming effort. The HTML is also hosted at SourceForge: http://stingrayreader.sourceforge.net/index.html. This project is awkward because -- well -- I have to do a separate FTP upload of the finished pages after a change. It's done with a script, not a simple "git push." SourceForge has a Git repository, https://sourceforge.net/p/stingrayreader/code/ci/master/tree/, but SourceForge doesn't use GitHub.com's UI, so it's not clear if it supports the gh-pages feature. I assume it doesn't, but maybe it does. (I can't even log in to SourceForge with Safari... I should really stop using SourceForge and switch to GitHub.)

See https://github.com/slott56/HamCalc-2.1 for another complex LP effort. This predates my dim understanding of the gh-pages branch, so it's got HTML (in doc/build/html), but it doesn't show it elegantly.

I'm still not sure this three-branch Literate Programming approach is sensible. My first step should probably be to rearrange the PyLit3 project into this three-branch structure.
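As a rough sketch of the build script mentioned above, something along these lines could regenerate both derived trees from the RST source. This is not code from the post: the directory names are made up, and the PyLit command line is an assumption; check the PyLit-3 documentation for the exact invocation.

from pathlib import Path
import subprocess

SRC = Path("rst_src")      # master branch: RST sources
CODE = Path("code")        # code branch: generated Python
HTML = Path("gh-pages")    # gh-pages branch: generated HTML

# Extract the Python implementation from each RST file with PyLit.
for rst in SRC.glob("**/*.rst"):
    py = CODE / rst.relative_to(SRC).with_suffix(".py")
    py.parent.mkdir(parents=True, exist_ok=True)
    subprocess.run(["python", "pylit.py", str(rst), str(py)], check=True)

# Build the HTML documentation from the same RST sources with Sphinx.
subprocess.run(["sphinx-build", "-b", "html", str(SRC), str(HTML)], check=True)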
June 24, 2015
· 1,494 Views · 0 Likes
Configuration Files, Environment Variables, and Command-Line Options
We have three major tiers of configuration for applications. Within each tier, we have sub-tiers, larding on yet more complexity. The organization of the layers is a bit fungible, too. Making good choices can be rather complex because there are so many variations on the theme of "configuration". The desktop GUI app with a preferences file has very different requirements from larger, more complex applications.

The most dynamic configuration options are the command-line arguments. Within this tier of configuration, we have two sub-tiers of default values and user-provided overrides to those defaults. Where do the defaults come from? They might be wired in, but more often they come from environment variables or parameter files or both.

There's some difference of opinion on which tier is next in the tiers of dynamism. The two choices are configuration files and environment variables. We can consider environment variables as easier to edit than configuration files. In some cases, though, configuration files are easier to change than environment variables. Environment variables are typically bound to the process just once (like command-line arguments), where configuration files can be read and re-read as needed.

The environment variables have three sub-tiers. System-level environment variables tend to be fixed. The variables set by a .profile or .bashrc tend to be specific to a logged-in user, and are somewhat more flexible than system variables. The current set of environment variables associated with the logged-in session can be modified on the command line, and are as flexible as command-line arguments. Note that in Linux we can set an environment variable as part of running a single command (see http://slott-softwarearchitect.blogspot.com/2015/03/configuration-files-environment.html).

The configuration files may also have tiers. We might have a global configuration file in /etc/our-app. We might look for a ~/.our-app-rc as a user's generic configuration. We can also look for our-app.config in the current working directory as the final set of overrides to be used for the current invocation. Some applications can be restarted, leading to re-reading the configuration files. We can change the configuration more easily than we can bind in new command-line arguments or environment variables.

Representation Issues

When we think about configuration files, we also have to consider the syntax we want to use to represent configurable parameters. We have five common choices.

Some folks are hopelessly in love with Windows-style .ini files. The configparser module will parse these. I call it hopelessly in love because the syntax is quite limited. Look at the logging.config module to see how complex the .ini file format is for non-trivial cases.

Some folks like Java-style properties files. These have the benefit of being really easy to parse in Python. Indeed, scanning a properties file is a great exercise in functional-style Python programming. I'm not completely sold on these, either, because they don't really handle the non-trivial cases well.

Using JSON or YAML for properties has some real advantages. There's a lot of sophistication available in these two notations. While JSON has first-class support, YAML requires an add-on module.

We can also use Python as the language for configuration. For good examples of this, look at the Django project settings file. Using Python has numerous advantages.
The only possible disadvantage is the time wasted arguing with folks who call it a "security vulnerability." Using Python as the configuration language is only considered a vulnerability by people who fail to realize that the Python source itself can be hacked. Why waste time injecting a bug into a configuration file? Why not just hack the source?

My Current Fave

My current favorite way to handle configuration is by defining some kind of configuration class and using the class object throughout the application. Because of Python's import processing, a single instance of the class definition is easy to guarantee. We might have a module that defines a hierarchy of configuration classes, each of which layers in additional details.

class Defaults:
    mongo_uri = "mongodb://localhost:27017"
    some_param = "xyz"

class Dev(Defaults):
    mongo_uri = "mongodb://sandbox:27017"

class QA(Defaults):
    mongo_uri = "mongodb://username:password@qa02:27017/?authMechanism=PLAIN&authSource=$external"

Yes. The password is visible. If we want to mess around with higher levels of secrecy in the configuration files, we can use PyCrypto and a key generator to use an encrypted password that's injected into the URI. That's a subject for another post. The folks who can edit the configuration files often know the passwords. Who are we trying to hide things from?

How do we choose the active configuration to use from among the available choices in this file? We have several ways:

  • Add a line to the configuration module. For example, Config=QA will name the selected environment. We have to change the configuration file as our code marches through environments from development to production. We can use from configuration import Config to get the proper configuration in all other modules of the application.
  • Rely on an environment variable to specify which configuration to use. In enterprise contexts, an environment variable is often available. We can import os, and use Config=globals()[os.environ['OURAPP_ENVIRONMENT']] to pick a configuration based on an environment variable.
  • In some places, we can rely on the host name itself to pick a configuration. We can use os.uname()[1] to get the name of the server. We can add a mapping from server name to configuration, and use this: Config=host_map(os.uname()[1], Defaults).
  • Use a command-line option like "--env=QA". This can be a little more complex than the above techniques, but it seems to work out nicely in the long run.

Command-line args to select a specific configuration

To select a configuration using command-line arguments, we must decompose configuration into two parts. The configuration alternatives shown above are placed in a config_params.py module. The config.py module that's used directly by the application will import the config_params.py module, parse the command-line options, and finally pick a configuration. This module can create the required module global, Config. Since it will only execute once, we can import it freely. The config module will use argparse to create an object named options with the command-line options. We can then do this little dance:

import argparse
import sys

import config_params

parser = argparse.ArgumentParser()
parser.add_argument("--env", default="DEV")
options = parser.parse_args()

Config = getattr(config_params, options.env)
Config.options = options

This seems to work out reasonably well. We can tweak config_params.py flexibly. We can pick the configuration with a simple command-line option.
If we want to elegantly dump the configuration, we have a bit of a struggle. Each class in the hierarchy introduces names: it's a bit of work to walk down the __class__.__mro__ lattice to discover all of the available names and values that are inherited and overridden from the parents. We could do something like this to flatten out the resulting values:

Base = getattr(config_params, options.env)

class Config(Base):
    def __repr__(self):
        names = {}
        for cls in reversed(self.__class__.__mro__):
            cls_names = dict(
                (nm, (cls.__name__, val))
                for nm, val in cls.__dict__.items()
                if nm[0] != "_"
            )
            names.update(cls_names)
        return ", ".join(
            "{0}.{1}={2}".format(class_val[0], nm, class_val[1])
            for nm, class_val in names.items()
        )

It's not clear this is required. But it's kind of cool for debugging.
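For completeness, a quick usage sketch of how these pieces fit together. This is not from the original post; it assumes the hypothetical config.py and config_params.py modules sketched above, including the QA class and the __repr__-flattening variant:

# main.py -- run as:  python main.py --env=QA
from config import Config      # importing config parses --env and picks the class once

print(Config.mongo_uri)        # the QA connection string from config_params
print(Config())                # the flattened __repr__, listing inherited and overridden values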
April 4, 2015
· 10,514 Views · 1 Like
Password Encryption -- Short Answer: Don't.
First, read this: Why passwords have never been weaker—and crackers have never been stronger. There are numerous important lessons in this article.

One of the small lessons is that changing your password every sixty or ninety days is farcical. The rainbow table algorithms can crack a badly-done password in minutes. Every 60 days, the cracker has to spend a few minutes breaking your new password. Why bother changing it? It only annoys the haxorz; they'll be using your account within a few minutes. However. That practice is now so ingrained that it's difficult to dislodge from the heads of security consultants. The big lesson, however, is profound.

Work Experience

Recently, I got a request from a developer on how to encrypt a password. We have a Python back-end, and the developer was asking which crypto package to download and how to install it.

"Crypto?" I asked. "Why do we need crypto?"

"To encrypt passwords," they replied.

I spat coffee on my monitor. I felt like hitting Caps Lock in the chat window so I could respond like this: "NEVER ENCRYPT A PASSWORD, YOU DOLT." I didn't, but I felt like it.

Much Confusion

The conversation took hours. Chat can be slow that way. Also, I can be slow because I need to understand what's going on before I reply. I'm a slow thinker. But the developer also needed to try stuff and provide concrete code examples, which takes time.

At the time, I knew that passwords must be hashed with salt. I hadn't read the Ars Technica article cited above, so I didn't know why computationally intensive hash algorithms are best for this. We had to discuss hash algorithms. We had to discuss algorithms for generating unique salt. We had to discuss random number generators and how to use an entropy source for a seed. We had to discuss http://www.ietf.org/rfc/rfc2617.txt in some depth, since the algorithms in section 3.2.2 show some best practices in creating hash summaries of usernames, passwords, and realms. All of these were, of course, side topics before we got to the heart of the matter.

What's Been Going On

After several hours, my "why" questions started revealing things. The specific user story, for example, was slow to surface. Why? Partly because I didn't demand it early enough. But also, many technology folks will conceive of a "solution" and pursue that technical concept no matter how difficult or bizarre. In some cases, the concept doesn't really solve the problem. I call this the "Rat Holes of Lost Time" phenomenon: we chase some concept through numerous little rat-holes before we realize there's a lot of activity but no tangible progress. There's a perceptual narrowing that occurs when we focus on the technology. Often, we're not actually solving the problem. IT people leap past the problem into the solution as naturally as they breathe. It's a hard habit to break.

It turned out that they were creating some additional RESTful web services. They knew that the RESTful requests needed proper authentication. But they were vague on the details of how to secure the new RESTful services. So they were chasing down their concept: encrypt a password and provide this encrypted password with each request. They were half right, here. A secure "token" is required. But an encrypted password is a terrible token.

Use The Framework, Luke

What's most disturbing about this is the developer's blind spot. For some reason, the existence of other web services didn't enter into this developer's head. Why didn't they read the code for the services created on earlier sprints?
We're using Django. We already have a RESTful web services framework with a complete (and high-quality) security implementation. Nothing more is required. Use the RESTful authentication already part of Django.

In most cases, HTTPS is used to encrypt at the socket layer. This means that Basic Authentication is all that's required. This is a huge simplification, since all the RESTful frameworks already offer this. The Django REST Framework has a nice authentication module. When using Piston, it's easy to work with their Authentication handler. It's possible to make RESTful requests with Digest Authentication if SSL is not being used. For example, Akoha handles this. It's easy to extend a framework to add Digest in addition to Basic authentication. For other customers, I created an authentication handler between Piston and ForgeRock OpenAM so that OpenAM tokens were used with each RESTful request. (This requires some care to create a solution that is testable.)

Bottom Lines

  • Don't encrypt passwords. Ever.
  • Don't write your own hash and salt algorithm. Use a framework that offers this to you.
  • Read the Ars Technica article before doing anything password-related.
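Purely to illustrate what "hash with salt, using a deliberately expensive algorithm" means, here is a minimal sketch using the standard library's PBKDF2. It is not from the original post, and the point above stands: in a real application, use the framework's password handling rather than rolling your own.

import hashlib
import hmac
import os

def hash_password(password, salt=None):
    """Return (salt, digest); PBKDF2 is deliberately slow to resist brute force."""
    salt = salt or os.urandom(16)          # a unique, random salt per password
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 600_000)
    return salt, digest

def verify_password(password, salt, digest):
    """Recompute the digest and compare in constant time."""
    return hmac.compare_digest(hash_password(password, salt)[1], digest)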
August 28, 2012
· 21,002 Views · 0 Likes
The Passive-Aggressive Programmer (again)
I'm not even interested in psychology. But. This kind of thing seems to come up once in a great while. You're asked (or "forced") to work with someone who—essentially—fails to cooperate. They don't actively disagree or actively suggest something better. They passively fail to agree. In fact, they probably disagree. They may actually have an idea of their own. But they prefer to passively "fail to agree." I have little patience to begin with. And I had noted my personal inability to cope in The Passive-Aggressive Programmer or Why Nothing Gets Done.

Recently, I received this:

"I thought I was going crazy and started doubting myself when dealing with a PAP co-worker. I went out on the internet searching for help and ran into your blog and its helped me realize that I'm not crazy. The example conversations you posted are almost every day occurrences here at my job when dealing with my co-worker. From outside of the department it was oh those two just butt-heads because I never knew how to communicate or point out that really I'm a targeted victim of a PAP rather than a butting heads issue. No matter what approach I took with the PAP I was doomed and still not quite sure where to go from here. Would you happen to offer any advice on how to actually deal with PAP? It's driven me to a point where I'm looking for new employment because my employer won't deal with it."

I really have no useful advice. There's no way to "force" them to agree with anything specific. In some cases, there's no easy way to even determine what they might agree with. Your employer will only "deal with" problems that cause them real pain. If you're butting heads, but still getting things done, then there's no real pain. You're successful, even if you're unhappy.

If you want to be both happy and successful, you need to stop doing things that make you unhappy. If you can't agree with a co-worker, you can butt heads (which makes you unhappy) or you can ignore them (which may make you happy). Ignoring them completely may mean that things will stop getting done. You may appear less successful. If you stop being successful, then your employer will start to feel some pain. When your employer feels pain, they will take action to relieve the pain.

You might want to try to provide clear, complete documentation of your colleague's ideas, whatever they are. If you write down the Passive-Aggressive Programmer's "suggestions," then you might be able to demonstrate what's causing the pain. Since a properly passive programmer never actually agrees with anything, it's tricky to pin them down to anything specific. You might be able to make it clear that they're the roadblock that needs to be removed.
June 2, 2012
· 11,754 Views · 0 Likes
Mining Data from PDF Files with Python
PDF files aren't pleasant. The good news is that they're documented (http://www.adobe.com/devnet/pdf/pdf_reference.html). The bad news is that they're rather complex. I found four Python packages for reading PDF files:

  • http://pybrary.net/pyPdf/ - weak
  • http://www.swftools.org/gfx_tutorial.html - depends on binary XPDF
  • http://blog.didierstevens.com/programs/pdf-tools/ - limited
  • http://www.unixuser.org/~euske/python/pdfminer/ - acceptable

I elected to work with PDFMiner for two reasons: (1) pure Python, (2) reasonably complete. This is not, however, much of an endorsement. The implementation (while seemingly correct for my purposes) needs a fair amount of cleanup. Here's one example of remarkably poor programming:

# Connect the parser and document objects.
parser.set_document(doc)
doc.set_parser(parser)

Only one of these two is needed; the other is trivially handled as part of the setter method. Also, the package seems to rely on a huge volume of isinstance type checking. It's not clear if proper polymorphism is even possible. But some kind of filter that picked elements by type might be nicer than a lot of isinstance checks.

Annotation Extraction

While shabby, the good news is that PDFMiner seems to reliably extract the annotations on a PDF form. In a couple of hours, I had this example of how to read a PDF document and collect the data filled into the form.

from pdfminer.pdfparser import PDFParser, PDFDocument
from pdfminer.psparser import PSLiteral
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter, PDFTextExtractionNotAllowed
from pdfminer.pdfdevice import PDFDevice
from pdfminer.pdftypes import PDFObjRef
from pdfminer.layout import LAParams, LTTextBoxHorizontal
from pdfminer.converter import PDFPageAggregator
from collections import defaultdict, namedtuple

TextBlock = namedtuple("TextBlock", ["x", "y", "text"])

class Parser(object):
    """Parse the PDF.

    1. Get the annotations into the self.fields dictionary.
    2. Get the text into a dictionary of text blocks.
       The key to the dictionary is page number (1-based).
       The value in the dictionary is a sequence of items in (-y, x) order.
       That is approximately top-to-bottom, left-to-right.
    """
    def __init__(self):
        self.fields = {}
        self.text = {}

    def load(self, open_file):
        self.fields = {}
        self.text = {}
        # Create a PDF parser object associated with the file object.
        parser = PDFParser(open_file)
        # Create a PDF document object that stores the document structure.
        doc = PDFDocument()
        # Connect the parser and document objects.
        parser.set_document(doc)
        doc.set_parser(parser)
        # Supply the password for initialization.
        # (If no password is set, give an empty string.)
        doc.initialize('')
        # Check if the document allows text extraction. If not, abort.
        if not doc.is_extractable:
            raise PDFTextExtractionNotAllowed
        # Create a PDF resource manager object that stores shared resources.
        rsrcmgr = PDFResourceManager()
        # Set parameters for analysis.
        laparams = LAParams()
        # Create a PDF page aggregator object.
        device = PDFPageAggregator(rsrcmgr, laparams=laparams)
        # Create a PDF interpreter object.
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        # Process each page contained in the document.
        for pgnum, page in enumerate(doc.get_pages()):
            interpreter.process_page(page)
            if page.annots:
                self._build_annotations(page)
            txt = self._get_text(device)
            self.text[pgnum + 1] = txt

    def _build_annotations(self, page):
        for annot in page.annots.resolve():
            if isinstance(annot, PDFObjRef):
                annot = annot.resolve()
            assert annot['Type'].name == "Annot", repr(annot)
            if annot['Subtype'].name == "Widget":
                if annot['FT'].name == "Btn":
                    assert annot['T'] not in self.fields
                    self.fields[annot['T']] = annot['V'].name
                elif annot['FT'].name == "Tx":
                    assert annot['T'] not in self.fields
                    self.fields[annot['T']] = annot['V']
                elif annot['FT'].name == "Ch":
                    assert annot['T'] not in self.fields
                    self.fields[annot['T']] = annot['V']
                    # Alternative choices in annot['Opt']
                else:
                    raise Exception("Unknown Widget")
            else:
                raise Exception("Unknown Annotation")

    def _get_text(self, device):
        text = []
        layout = device.get_result()
        for obj in layout:
            if isinstance(obj, LTTextBoxHorizontal):
                if obj.get_text().strip():
                    text.append(TextBlock(obj.x0, obj.y1, obj.get_text().strip()))
        text.sort(key=lambda row: (-row.y, row.x))
        return text

    def is_recognized(self):
        """Check for Copyright as well as Revision information on each page."""
        bottom_page_1 = self.text[1][-3:]
        bottom_page_2 = self.text[2][-3:]
        pg1_rev = "Rev 2011.01.17" == bottom_page_1[2].text
        pg2_rev = "Rev 2011.01.17" == bottom_page_2[0].text
        return pg1_rev and pg2_rev

This gives us a dictionary of field names and values, essentially transforming the PDF form into the same kind of data that comes from an HTML POST request. An important part is that we don't want much of the background text -- just enough to confirm the version of the form file itself.

The cryptic text.sort(key=lambda row: (-row.y, row.x)) will sort the text blocks into order from top-to-bottom and left-to-right. For the most part, a page footer will show up last. This is not guaranteed, however. In a multi-column layout, the footer can be so close to the bottom of a column that PDFMiner may put the two text blocks together. The other unfortunate part is the extremely long (and opaque) setup required to get the data from the page.

Source: http://slott-softwarearchitect.blogspot.com/2012/02/pdf-reading.html
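A short usage sketch of the Parser class, not from the original post (the file name is made up, and the field names depend entirely on the PDF form being read):

# Collect the filled-in form data from a PDF, then treat it like POST data.
with open("filled_form.pdf", "rb") as source:
    pdf = Parser()
    pdf.load(source)

if pdf.is_recognized():
    for name, value in pdf.fields.items():
        print(name, "=", value)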
February 14, 2012
· 95,878 Views · 1 Like
Python and the Star Schema
The star schema represents data as a table of facts (measurable values) that are associated with the various dimensions of the fact. Common dimensions include time, geography, organization, product, and the like. I'm working with some folks whose facts are a bunch of medical test results, and the dimensions are patient, date, and the facility in which the tests were performed.

I got an email with the following situation: "a client who is processing gigs of incoming fact data each day and they use a host of C/C++, Perl, mainframe and other tools for their incoming fact processing and I've seriously considered pushing Python in their organization." Here are my thoughts on using Python for data warehousing when you've got Gb of data daily.

Small Dimensions

The pure Python approach only works when your dimension will comfortably fit into memory -- not a terribly big problem with most dimensions. Specifically, it doesn't work well for those dimensions which are so huge that the dimensional model becomes a snowflake instead of a simple star. When dealing with a large number of individuals (public utilities, banks, medical management, etc.) the "customer" (or "patient") dimension gets too big to fit into memory. Special bridge-table techniques must be used. I don't think Python would be perfect for this, since this involves slogging through a lot of data one record at a time. However, Python is considerably faster than PL/SQL. I don't know how it compares with Perl. Any programming language will be faster than any SQL procedure, because there's no RDBMS overhead.

For all small dimensions:

  • Load the dimension values from the RDBMS into a dict with a single query.
  • Read all source data records (ideally from a flat file); conform the dimension, tracking changes; write a result record with the dimension FK information to a flat file.
  • Iterate through the dimension dictionary and persist the dimension changes.

The details vary with the Slowly Changing Dimension (SCD) rules you're using. The conformance algorithm is essentially the following:

row = Dimension(...)
ident = (row.field, row.field, row.field, ...)
dimension.setdefault(ident, row)

In some cases (like the Django ORM) this is called the get-or-create query.

The Dimension Bus

For BIG dimensions, I think you still have to implement the "dimension bus" outlined in The Data Warehouse Toolkit. To do this in Python, you should probably design things to look something like the following.

For any big dimension, use an external sort-merge utility. Seriously. They're way fast for data sets too large to fit into memory. Use CSV format files and the resulting program is very tidy. The outline is as follows. First, sort the source data file into order by the identifying fields of the big dimension (customer number, patient number, whatever). Second, query the big dimension into a data file and sort it into the same order as the source file. (Using the SQL ORDER BY may be slower than an external sort; only measurements can tell which is faster.) Third, do a "match merge" to locate the differences between the dimension and the source. Don't use a utility like diff, it's too slow. This is a simple key matching between two files. The match-merge loop looks something like this:

src = sourceFile.next()
dim = dimensionFile.next()
try:
    while True:
        src_key = (src['field'], src['field'], ...)
        dim_key = (dim['field'], dim['field'], ...)
        if src_key < dim_key:
            # missing some dimension values
            update_dimension(src)
            src = sourceFile.next()
        elif dim_key < src_key:
            # extra dimension values
            dim = dimensionFile.next()
        else:
            # src and dim keys match
            # check non-key attributes for dimension change.
            src = sourceFile.next()
except StopIteration:
    # If source is at end-of-file, that's good, we're done.
    # If dim is at end-of-file, all remaining src rows are dimension updates.
    for src in sourceFile:
        update_dimension(src)

At the end of this pass, you'll accumulate a file of customer dimension adds and changes, which is then persisted into the actual customer dimension in the database. This pass will also write new source records with the customer FK. You can handle demographic or bridge tables at this time, too.

Fact Loading

The first step in DW loading is dimensional conformance. With a little cleverness, the above processing can all be done in parallel, hogging a lot of CPU time. To do this in parallel, each conformance algorithm forms part of a large OS-level pipeline. The source file must be reformatted to leave empty columns for each dimension's FK reference. Each conformance process reads in the source file and writes out the same format file with one dimension FK filled in. If all of these conformance algorithms form a simple OS pipe, they all run in parallel. It looks something like this:

src2cvs source | conform1 | conform2 | conform3 | load

At the end, you use the RDBMS's bulk loader (or write your own in Python, it's easy) to pick the actual fact values and the dimension FKs out of the source records that are fully populated with all dimension FKs and load these into the fact table.

I've written conformance processing in Java (which is faster than Python) and had to give up on SQL-based conformance for large dimensions. Instead, we did the above flat-file algorithm to merge large dimensions. The killer isn't the language speed, it's the RDBMS overheads. Once you're out of the database, things blaze. Indeed, products like the syncsort data sort can do portions of the dimension conformance at amazing speeds for large datasets.

Hand Wringing

"But," the hand-wringers say, "aren't you defeating the value of the RDBMS by working outside it?" The answer is NO. We're not doing incremental, transactional processing here. There aren't multiple update transactions in a warehouse. There are queries and there are bulk loads. Doing the prep-work for a bulk load outside the database is simply more efficient. We don't need locks, rollback segments, memory management, threading, concurrency, ACID rules, or anything. We just need to match-merge the large dimension and the incoming facts.
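To make the small-dimension outline above concrete, here is a minimal sketch. It is not from the original post: the CSV layout, the date dimension, and the column names are invented, and real code would assign proper surrogate keys and apply the SCD rules.

import csv
from collections import namedtuple

DateDim = namedtuple("DateDim", ["date", "year", "month", "day"])

# In practice this dict would be loaded from the RDBMS with a single query.
dimension = {}

with open("facts_raw.csv") as src, open("facts_conformed.csv", "w", newline="") as out:
    reader = csv.DictReader(src)
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames + ["date_fk"])
    writer.writeheader()
    for row in reader:
        ident = row["test_date"]                     # natural key, e.g. "2008-05-20"
        dimension.setdefault(ident, DateDim(ident, *ident.split("-")))
        row["date_fk"] = ident                       # surrogate-key assignment elided
        writer.writerow(row)

# Finally, persist any new or changed dimension rows back to the database.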
May 20, 2008
· 10,258 Views · 1 Like

Trend Reports


Low-Code Development

Development speed, engineering capacity, and technical skills are among the most prevalent bottlenecks for teams tasked with modernizing legacy codebases and innovating new solutions. In response, an explosion of “low-code” solutions has promised to mitigate such challenges by abstracting software development to a high-level visual or scripting language used to build integrations, automate processes, construct UI, and more. While many tools aim to democratize development by reducing the required skills, others seek to enhance developer productivity by eliminating needs such as custom code for boilerplate app components. Over the last decade, the concept of low code has matured into a category of viable solutions that are expected to be incorporated within mainstream application development. In this Trend Report, DZone examines advances in the low-code space, including developers' perceptions of low-code solutions, various use cases and adoption trends, and strategies for successful integration of these tools into existing development processes.
